AI Training and Copyright: Legal Considerations in the Age of Artificial Intelligence
19 Mar 2024 | Ben Smalberger
Preliminary note: the following article is not intended as legal advice to any reader. Any similarity to real life events or persons is purely coincidental. It is intended for informational and academic purposes only and represents the author’s personal views and opinions.
The use of human-made content to train artificial intelligence (“AI”) systems is challenging existing copyright laws. While new products like ChatGPT have the potential to revolutionize information access, legal questions are being raised regarding copyright infringement.
This blog post, part of a series on technology trends and the law, examines recent cases shaping the development of AI. We specifically discuss the pending case between OpenAI and The New York Times (“The Times”), consider the interaction between Large Language Models (“LLMs”) and copyright law, and assess the potential implications of generative AI for intellectual property rights.
The New York Times and OpenAI
In late 2023, The Times filed a lawsuit against OpenAI, the creator of ChatGPT, alleging copyright infringement arising from OpenAI’s method of generating content using large language models, a type of AI system designed to produce human-like text.
The complaint specifically alleged copyright infringement based on OpenAI’s unauthorized use of The Times’ articles to train its LLM, as well as ChatGPT’s propensity to generate content resembling that of The Times.
The Times’ Complaint
The Times filed its complaint against OpenAI in the Federal District Court in Manhattan; a trial date has yet to be set.
First, The Times expressed concerns about OpenAI’s use of “scraped” Times articles to train its LLM. This involves OpenAI collecting data from websites using web scraping techniques, without any authorization from The Times. The Times argues OpenAI’s LLM was “built using and copying millions of Times’s news articles”. This conduct represents a potential breach of The Times’ copyright because, under U.S. copyright law, creators are granted exclusive rights over their original works, extending to their reproduction, distribution, and display. Pursuant to the complaint, OpenAI’s disregard for these exclusive rights suggests a possible infringement of The Times’ copyright through the unauthorized use of its articles.
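The “scraping” described above is, in broad strokes, the automated fetching and parsing of web pages to extract their text. As a rough, hypothetical sketch (in no way OpenAI’s actual pipeline), the extraction step might look like the following, using Python’s standard-library HTML parser on a static snippet; a real crawler would first fetch the page over HTTP.

```python
from html.parser import HTMLParser

class ArticleTextExtractor(HTMLParser):
    """Collects the text inside <p> tags -- a crude stand-in for the
    extraction step of a web-scraping pipeline."""
    def __init__(self):
        super().__init__()
        self._in_p = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_p = False

    def handle_data(self, data):
        # Only keep text that appears inside a paragraph tag.
        if self._in_p:
            self.paragraphs[-1] += data

# In a real crawler the HTML would come from an HTTP request
# (e.g. urllib.request.urlopen(url).read()); a static snippet is used here.
html = ("<html><body><h1>Headline</h1>"
        "<p>First paragraph.</p><p>Second paragraph.</p></body></html>")
extractor = ArticleTextExtractor()
extractor.feed(html)
print(extractor.paragraphs)
```

Run at scale across millions of pages, this kind of extraction is what assembles a training corpus, which is precisely the activity The Times says it never authorized.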
Secondly, The Times argues that ChatGPT’s outputs (the content generated after users insert a prompt) are infringing derivative works: new works that incorporate elements from pre-existing copyrighted works. The Times claims that ChatGPT “outputs near-verbatim copies of significant portions of The Times’ articles when prompted to do so”. Recently, the federal district court in California in Kadrey v. Meta Platforms discussed the interpretation of a derivative work. The case involved allegations by authors, including the plaintiff Richard Kadrey, that Meta’s LLaMA language models, trained on their books, infringed their copyrights. The court’s judgment emphasized that a work is not considered a derivative work unless it has been “substantially copied from a prior work.”
Within this framework, The Times suggests that ChatGPT frequently copies whole swathes of articles on request, rather than generating new content. For instance, similarities were noted between a Times article and a ChatGPT output regarding the New York taxi industry. In its complaint, The Times highlighted the overlapping words in red, indicating that readers could access The Times’ articles in essentially the same form through ChatGPT, without payment.

OpenAI’s arguments
In February 2024, OpenAI filed a motion to dismiss The Times’ lawsuit (a formal request asking the court to throw out the case before it proceeds to trial).
OpenAI’s motion argues that the “fair use” defense shields it from liability on both the input and output claims noted above.
The fair use defense, codified at 17 U.S.C. § 107, takes into account several factors, including: i) the purpose and character of the use, including whether it is commercial or for other purposes, such as research; ii) the nature of the copyrighted work, assessing the level of creativity involved in the original content that was copied; iii) the amount and substantiality of the copyrighted material used; and iv) the potential impact on the market for the original work.
Case law also highlights the importance of transformative uses of copyrighted content, which are more likely to be considered fair use. The more transformative the use (for example, using the works to develop a new technology), the more likely a court will find fair use.
Preliminary thoughts
OpenAI’s use of The Times’ articles in training its LLM may qualify for fair use, because that use could be deemed to serve a new, “transformative” purpose. Instead of merely copying the articles, it could be argued, OpenAI uses them to help create a new technology and product in ChatGPT.
Secondly, OpenAI can rely on the “fair use” defense to counter The Times’ suggestion that its outputs resemble copied versions of its articles. This defense hinges on the argument that ChatGPT’s output serves a new transformative purpose by presenting information to users in a new and distinct way.
The precedent set in Authors Guild v. Google supports OpenAI’s arguments. In that case, decided by the U.S. Court of Appeals for the Second Circuit in October 2015, Google relied on a fair use argument when scanning books and making them searchable online. Despite the verbatim copying, fair use was upheld because Google transformed the books into a searchable database, thereby benefiting the public with improved access to “information about those books”. Similarly, it could be reasoned that OpenAI’s use of The Times’ articles serves a transformative purpose, because it offers the public a new way to access information.
However, predicting the outcome of The Times’ case is tricky, since there is no clear ruling from the U.S. Supreme Court on AI-related copyright issues. This means there is no precedent for this kind of case, making it new territory for the courts. Moreover, there are already conflicting rulings in related cases. Take the Supreme Court’s May 2023 decision in Warhol v. Goldsmith, which held that the Andy Warhol Foundation’s commercial licensing of a Warhol image based on Lynn Goldsmith’s photograph of Prince was not fair use. The decision complicates the interpretation of the Google judgment on fair use, particularly in situations where the copying is perceived as competing directly with the original work. If The Times can demonstrate that ChatGPT competes with Times content by drawing readers away from it, OpenAI might struggle to rely on fair use as a defense.
Large Language Models and Other Thoughts
Some advocates of AI suggest that due to the mechanics of how LLMs work, they cannot generate derivative works, regardless of their outputs. However, there is often a perception that when users insert prompts into chatbots like ChatGPT, the resulting outputs stem from the LLM delving into a ‘database’ containing scraped materials, such as The Times’ articles. It is then presumed that the LLM retrieves and reproduces this data, presenting it as original content.
This viewpoint misunderstands how LLMs work. ChatGPT does not access a ‘database’ of scraped copyrighted materials which it uses to create outputs. While a ChatGPT output can look similar to a copyrighted work, it is nearly impossible for ChatGPT to reproduce any one article that was used in training the model. This is because the LLM is not delving into a database of articles to create outputs. Instead, the LLM creates an output by applying a learned set of probabilities associated with a given prompt to then predict and generate words sequentially.
Crucially, the LLM translates the prompt words into numbers, so it can predict the words that best align with the prompt. It is statistics that creates the output, not content; the content is simply used to improve the LLM’s ability to predict words that correspond with the given prompt.
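To make the word-prediction point concrete, here is a toy, illustrative sketch (in no way OpenAI’s actual architecture): a bigram model that tallies which word follows which in a tiny training text, converts those tallies into probabilities, and then generates output by repeatedly choosing the most likely next word. Real LLMs use vastly larger contexts and neural networks, but the principle of statistical next-word prediction is the same.

```python
from collections import defaultdict

# Tiny "training corpus" standing in for billions of scraped words.
training_text = "the cat sat on the mat and the cat slept"
words = training_text.split()

# Count how often each word follows each other word (bigram counts).
counts = defaultdict(lambda: defaultdict(int))
for prev, nxt in zip(words, words[1:]):
    counts[prev][nxt] += 1

def next_word(word):
    """Turn the counts into probabilities and return the likeliest next word."""
    followers = counts[word]
    total = sum(followers.values())
    probs = {w: c / total for w, c in followers.items()}
    return max(probs, key=probs.get)

# Generate text from the prompt "the" by repeated prediction.
out = ["the"]
for _ in range(3):
    out.append(next_word(out[-1]))
print(" ".join(out))
```

The model never stores or retrieves any passage of the training text; it stores only transition statistics, which is why near-verbatim reproduction of a specific article is an anomalous outcome rather than a lookup.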
Accordingly, ChatGPT does not literally copy anyone’s work — instead, the output reflects how the LLM has been programmed to write. It’s arguably just a form of learning, “not unlike a student devouring books” of a famous author. Suggesting that LLMs create derivative works may overlook how these works are created and highlights a misalignment between the nature of generative AI and current copyright laws.
Arguably, the chatbot user is the enabler of copyright infringement. With a carefully engineered prompt, and given the LLM’s power of prediction, the model might be coaxed into writing a near-identical copy of a copyrighted article. As noted in OpenAI’s recent motion to dismiss, “the Times paid someone to make tens of thousands of attempts to generate the highly anomalous results”. So the key may lie in the prompt itself, and The Times may have achieved near-perfect replicas of its articles only through numerous attempts to produce look-alike content.
Going forward
If The Times defeats OpenAI in court, its victory could be Pyrrhic. Long term, authors and news organizations are likely to face an uphill battle in safeguarding their intellectual property rights.
AI is having its “Wild West” moment with regulators struggling to keep up with rapid technological advancements. Tech companies have every incentive to scrape vast amounts of data to fuel their LLMs. The next AI company to achieve the status of Facebook in social media or Apple in smartphones stands to gain billions or even trillions of dollars in market value.
Why does this matter? The point of copyright law is to grant creators exclusive rights over their original works, empowering individuals and media organizations to produce informative content that attracts consumers and, ultimately, advertisers. However, the rise of generative AI threatens this structure. Organizations like The Times are undermined by losing control of their intellectual property, with consumer habits potentially shifting to chatbots for news consumption.
Conversely, it’s arguable that generative AI chatbots complement traditional media by offering new avenues for content distribution. Rather than undermining control of intellectual property, AI-driven platforms could extend the reach of original works and increase audience engagement.
Moving forward, it’s clear that the emergence of generative AI will continue to be a focal point of legal debate. As the technology evolves, it will become increasingly important for stakeholders to develop frameworks that strike a balance between fostering innovation and protecting the rights of content creators. These frameworks may include updating copyright laws to account for advancements in AI, as illustrated by the recently passed European Union AI Act, which seeks to “foster trustworthy AI in Europe and beyond, by ensuring that AI systems respect fundamental rights, safety, and ethical principles and by addressing risks of very powerful and impactful AI models”. By working together to address these challenges, stakeholders can ensure continued innovation while respecting the rights and interests of all parties involved.
The unfolding Times v. OpenAI case, which some speculate may end in a settlement, serves as a pivotal moment. Its outcome could significantly influence the shaping of these frameworks, marking a critical point in the intersection of AI development and copyright law.
References
The New York Times Co. v. Microsoft Corp., OpenAI, Inc., No. 1:23-cv-11195 (S.D.N.Y. Dec. 27, 2023).
Kadrey v. Meta Platforms, Inc., No. 23-cv-03417-VC (N.D. Cal. Nov. 20, 2023), at 2.
Grimmelmann, J, Patterns of Information Law, 2022, Chapter 4, p. 78.
Andy Warhol Found. for the Visual Arts, Inc. v. Goldsmith, 143 S. Ct. 1258, 1292 (2023).
Authors Guild v. Google, Inc., 804 F.3d 202, 217 (2d Cir. 2015).
Oremus, W., ‘AI’s Future Could Hinge on One Thorny Legal Issue’, Washington Post, Jan. 4, 2024, https://www.washingtonpost.com/technology/2024/01/04/nyt-ai-copyright-lawsuit-fair-use/
“AI Act”, Website of the European Union, https://digital-strategy.ec.europa.eu/en/policies/regulatory-framework-ai#:~:text=The%20AI%20Act%20is%20the%20first%2Dever%20comprehensive%20legal%20framework,powerful%20and%20impactful%20AI%20models.