JULIA STEINER—After failed licensing negotiations beginning in April 2023, the New York Times sued OpenAI and Microsoft, the company’s largest investor, last month for copyright infringement. While this is only the latest in a bout of litigation facing OpenAI, the Times is the first news publisher to sue. The complaint alleges that Microsoft’s Copilot and OpenAI’s ChatGPT, generative AI tools that rely on large language models (“LLMs”) to produce “human-sounding” text, improperly used Times news articles to train their programs, resulting in verbatim reproductions of Times content in the models’ outputs. The Times alleges a loss of customer subscriptions, licensing revenues, and advertising deals.
While the defendants have not yet filed an answer, OpenAI issued a statement categorically rejecting the allegations. It claimed that using news content to train its AI models constitutes fair use under “long-standing and widely accepted precedents,” pointing to international laws permitting the use of copyrighted works for AI training, including in the European Union and Israel.
The Times anticipated a fair use defense, arguing in its complaint that the use of its media was not “transformative” because ChatGPT produced virtually wholesale copies of Times articles in its outputs, which directly compete with the news giant. OpenAI responded in its statement that copies produced in ChatGPT outputs, known in the industry as “regurgitation,” are the result of a “rare bug” during the AI training process by which the LLM memorizes verbatim content from a particular source and reproduces it. It further stated that it attempts to “limit inadvertent memorization” during its training process, but that intentional user manipulation to produce regurgitated content violates its terms of use. OpenAI cast the exhibits in the Times’ complaint as just that: manipulated prompts intended to compel ChatGPT to regurgitate Times articles. It reassured the public that such intentional manipulation is atypical and a misuse of ChatGPT. OpenAI also claimed that its models are not a substitute for the Times or other news publishers, but rather the result of transformative collaboration between the two industries.
While this is an early stage of litigation, the parties’ statements illuminate the growing tension between copyright owners and the demand for robust, cutting-edge generative AI. On one end of the spectrum, copyright owners cling to the time-honored principle penned into our Nation’s founding document that copyright law exists “[t]o promote the Progress of Science and useful Arts.” Fair use, they believe, is not a blanket defense to unauthorized encroachments on an author’s bundle of exclusive rights. On the other end, proponents of sweeping AI advancements view the use of copyrighted material in training AI models as easily falling within the parameters of § 107. The right answer falls somewhere in between.
A fair use defense may be harder to mount in this case than in other lawsuits facing AI moguls. Although the law is uncertain in this emerging area, legal academics discussing the delicate balance between AI and copyright point out a key distinction: copyright law is unlikely to be implicated where AI models use authors’ works solely in the training process, while it is likely to be implicated where the models “embody the training data” in their outputs.
At the training, or “input,” step, the Second Circuit’s decision in Authors Guild v. Google likely applies. There, the court held that Google’s blanket copying of books to enable online searching constituted fair use because its purpose was highly transformative and was unlikely to serve as a market substitute for books. On its face, machine training seems to easily constitute a highly transformative purpose. The problem arises if, at the “output” step, instead of producing an original response to the user’s query, the model has “ingested” or memorized large portions of its inputs from copyrighted sources and regurgitates them without producing its own unique response. In that case, a copyright owner may state a valid infringement claim.
Unlawful reproduction at the output stage is particularly relevant in the realm of news reporting, where the model’s regurgitated response provides the user with the essence of the report, effectively steering traffic away from news sites and serving as a near-perfect substitute. Unlike the book authors in Authors Guild and the plaintiffs in a recent decision dismissing author Sarah Silverman’s case against Meta, the Times alleges that the defendants’ outputs reproduced large portions of its articles. While an AI-generated summary of an author’s book may not serve as a market substitute, large regurgitations of news articles very well may.
Ultimately, the outcome of the case will hinge on whether OpenAI and Microsoft used copyrighted works solely at the input stage or whether their outputs were substantially similar to, or wholesale reproductions of, the Times’ articles. The face of the complaint suggests that the latter is true.
Even if OpenAI is correct in its assertion that the Times improperly prompted its model to regurgitate, it could still potentially face secondary liability for assisting its users in copyright infringement. In a leading case on contributory infringement, Perfect 10 v. Amazon, the Ninth Circuit held that computer system operators can be contributorily liable if they have actual knowledge of infringing material and can take measures to prevent it, yet do not. According to the Times’ allegations, OpenAI’s model seems to fit within this test. The complaint alleges that the defendants were “fully aware” that their models were “capable of distributing unlicensed copies or derivatives of copyrighted” works. OpenAI did not dispute that allegation in its statement. Rather, it relied on the fact that prompts intended to regurgitate copyrighted works violate its terms of use. Without more, that fact is likely insufficient to excuse OpenAI from secondary liability.
Some experts weighing in on the case believe the litigants will likely settle and reach a licensing agreement, preventing a precedential outcome. Yet the parties have already spent months in negotiations to no avail. Regardless of whether the case settles or an opinion is published, the case illustrates the urgent need for guidance, either in the form of legislation or judicial interpretation, in striking a fair balance between copyright law and artificial intelligence, both of which aim to produce meaningful innovation in our world.