An increasing number of software developers are embracing generative AI tools in their everyday work. And it’s natural to be concerned and ask questions about AI-generated content, such as:
What happens to the data sent to the AI model? Is code from our closed-source application going to appear in the development tools of a programmer on the other side of the planet? Who owns the content generated by an AI tool? And who is liable if an AI tool suggests someone else’s code verbatim and breaches their intellectual property?
We usually can’t see behind the closed doors of these tools and, with some exceptions (like Tabnine’s Local Machine Mode in Tabnine Pro), we only get limited control over what happens with the data we share with them. At that point, there is only one place to turn - the user agreements and privacy policies of the individual tools. (In this article, we’re going to focus on ChatGPT, Google Bard, Tabnine and GitHub Copilot.)
I have read through the agreements and other documents that you accept when you start using these tools. I was pleasantly surprised by how well most of them are written. (Well, except for Tabnine. The Tabnine documents were clearly written by Real Lawyers.) I then cross-checked my notes with a custom GPT I created for this purpose. I’d also like to point out that this article describes the state of things at the time of writing, January 2024. The information provided here will evolve along with AI and the legal rulings surrounding it.
I have chosen these tools due to their popularity and prevalence in today’s market. Two of them were created specifically to make writing code more efficient, and the other two are so widespread that they’re almost certainly used to that end as well.
GitHub Copilot and Tabnine are tools (copilots) integrated directly into the developer’s working environment and, when used correctly, they can dramatically improve productivity. They automatically read relevant context (such as the currently edited file), send it to the model, and return generated content such as a completed code block.
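To make the data-flow concern concrete, here is a minimal sketch of what a copilot-style plugin could send with each completion request. The endpoint, field names and payload shape are hypothetical illustrations, not the actual protocol of Copilot or Tabnine - the point is how much of your code leaves the machine:

```python
import json
import urllib.request

# Hypothetical payload a copilot-style editor plugin might assemble.
# Note what leaves the machine: the file path, the code around the
# cursor, and anything sensitive that happens to sit nearby.
payload = {
    "file_path": "src/billing/invoice.py",          # reveals project structure
    "prefix": "def total_with_vat(amount):\n    ",  # code before the cursor
    "suffix": "\n\nVAT_RATE = 0.21",                # code after the cursor
    "language": "python",
}

request = urllib.request.Request(
    "https://copilot-api.example/v1/completions",   # placeholder endpoint
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# response = urllib.request.urlopen(request)  # the actual network call
```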
ChatGPT (by OpenAI) and Google Bard are tools with a broader focus, but they can still be used efficiently when developing software, whether for research or for generating scripts and code snippets. The data these tools can access is shared explicitly, by the user typing it into the chat window.
Both copilot tools clearly specify how they handle data from the developer’s environment: the data is sent to the model, analyzed there, and deleted once the requested operation finishes [1, 2]. Additionally, GitHub Copilot claims that the model never trains on user data [1].
The terms for Tabnine are a bit more involved - the user can choose whether they want the model to train on their data (at the time of writing, this option was only available in Tabnine Pro). When a user opts in, content generated from that data is only ever available to that specific user; it never finds its way to other Tabnine users [2, 3].
In the case of individual licenses, OpenAI retains user data, which can be processed further as de-identified information [4]. And while ChatGPT does train on user data, this behavior can be disabled by opting out [5].
Owners of Enterprise licenses also get to choose how long user data is retained, and OpenAI claims that the stored data is encrypted and only used to train the customized enterprise model [6].
"Plaintiff can point to no case in which a court has recognized copyright in a work originating with a nonhuman. We are approaching new frontiers in copyright."
Judge Beryl A. Howell | US District Court for the District of Columbia
Before delving deeper into this particular can of worms, I’d like to emphasize that there is no international standard for AI-generated content. There are some court precedents, but we can’t base any assumptions on those yet. Right now a whole array of court cases is underway, and these will dictate the future of the content that AI trains on. This is a quickly evolving topic, so it’s highly likely that the terms of AI tool usage will also change as time goes on.
The only thing we can currently analyze is the intent of the creators of these tools. Luckily, that seems quite permissive at this point. GitHub Copilot [7], Tabnine [3] and ChatGPT [5] all grant users full ownership of the generated content.
Which brings us to the more complex question:
Although the models are supposed to behave in a way that prevents them from using source data verbatim in generated content, it turns out that such situations can arise.
Tabnine solved this issue by training only on code with permissive licenses [8]. In other words, even if it generated content that is a complete copy of the source data, there would be no breach of intellectual property law, because the licenses of that data allow its distribution in this manner. In spite of that, Tabnine still makes it very clear that they are not to be held responsible if the tool generates copyrighted material [9].
GitHub claims that Copilot was trained on code available in public repositories [10]. On one hand, there’s no mention of the licenses of those repositories. On the other hand, it does allow users to “Block suggestions matching public code”, a setting which compares generated content against publicly available code and suppresses matches. I strongly suggest enabling this setting - GitHub even waives its Defense of Third Party Claims obligations if such content is not blocked [11].
Analyzing exactly what ChatGPT trained on is wildly out of scope for this article, but at this point it’s more or less certain that it also used copyrighted material. Although there is very little chance that such material would ever appear in generated content, OpenAI accommodates this possibility in its “Copyright Shield” program, which offers Enterprise license owners and API users reimbursement of legal expenses in copyright lawsuits. And just like Tabnine, OpenAI holds the user responsible for damages caused by generated content [12]. (Incidentally, I have not found a single mention of Copyright Shield in ChatGPT’s customer agreements.)
Now we’re going to find out who’s been paying attention, because when I listed the tools covered in this article, there were four of them, yet I have only talked about three so far. There’s a reason for that: I couldn’t find any official answers to these questions from Google. So I’d lean towards the safe approach and assume the worst, although we don’t know the answers to any of the questions posed here for sure.
When it comes to the Google Terms of Service, my reading is that Google is allowed to use user data to improve its services. I could not find anything about ownership of the content generated by Google Bard.
The first important aspect that (almost) all of these AI tools share is that all original content they generate belongs exclusively to the user. They are also rather careful to leave all responsibility for that content on the shoulders of the user as well, including things the user can’t possibly be aware of, such as generating copyrighted content.
The tools aimed directly at programmers (GitHub Copilot, Tabnine) clearly take data protection very seriously. That includes a guarantee that user data is not stored server-side and that the copilots don’t train on it. For chat-based tools, data protection is much more complex, although ChatGPT specifically does offer some options and use cases that prevent the model from training on user input.
There’s still no way around the fact that the implementation of all these tools is closed and we can’t inspect it. Big tech companies have proven many times in the past that there can be a big difference between what they claim to do with our data and what actually happens. That is why their promises regarding the behavior of AI tools need to be taken with a grain of salt. The best we can do is meticulously follow security standards: never keep passwords or other secrets in our code bases, never send them to the chat tools, and generally make sure that the AI tool doesn’t get access to sensitive information.
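As a minimal illustration of that last point, here is one common pattern for keeping secrets out of the code an AI assistant can read: load them from the runtime environment instead of hardcoding them. The variable name below is hypothetical; the shape of the approach is what matters:

```python
import os

# The secret lives in the runtime environment (or in a local file
# excluded from version control), so it never appears in the source
# code that an AI tool reads as context.
api_key = os.environ.get("PAYMENT_API_KEY")  # hypothetical variable name
if api_key is None:
    raise RuntimeError("PAYMENT_API_KEY is not set; refusing to start.")

# From here on, the code references api_key, never the literal secret.
```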