OpenAI sued for copyright infringement
On Wednesday 28 June, US authors Paul Tremblay and Mona Awad (the plaintiffs) filed a class action complaint in the San Francisco federal court against OpenAI, alleging that it infringed copyright when training its auto-generative artificial intelligence system known as ChatGPT. The proposed class action, brought on behalf of the plaintiffs and all others similarly situated, alleges copyright infringement, violations of the Digital Millennium Copyright Act, unjust enrichment and negligence, among other claims.
As we already know, ChatGPT is an auto-generative chat that extracts data from different sources and then processes it using Natural Language Processing (NLP). Since the launch of ChatGPT, there has been a lot of discussion about its relationship with intellectual property, specifically with copyright: when is the output merely inspired by existing works, and when does it actually infringe them? We discussed it in this post.
Generative AI companies are facing a barrage of legal actions. Earlier this year, Getty Images sued the company Stability AI for training on millions of its pictures without consent. The proposed class action filed in the San Francisco federal court last Wednesday is based on the claim that OpenAI infringed copyright at two points: first, when it illegally downloaded copies of novels to train its artificial intelligence system, and second, because ChatGPT's responses (output) themselves infringe the rights in those works.
As to the first issue.
The plaintiffs alleged that much of the material in OpenAI's training datasets comes from copyrighted works, including books, which were copied by OpenAI without consent, without credit, and without compensation. Books have always been a key ingredient in training datasets for large language models, as they provide the best examples of high-quality, long-form writing. In the July 2020 GPT-3 paper, OpenAI revealed that 15% of GPT-3's huge training dataset came from "two internet-based book corpora", which can be estimated at around 300,000 titles. The plaintiffs claimed that the only internet-based book corpora that have ever offered so much material are the notorious "shadow library" websites, which are blatantly illegal.
As evidence of infringement, the plaintiffs argued that when ChatGPT was asked to summarise the books written by each of them, it generated very accurate summaries, and that it could do so only because the books had been copied by OpenAI and ingested by the language model as part of its training data. The two authors alleged that OpenAI made copies of their books, without their permission, while training its language models. They therefore sought damages and restitution of profits.
As to the second issue.
The plaintiffs argued that because the output of the OpenAI Language Models is based on expressive information extracted from the plaintiffs' works, each output of the OpenAI Language Models is an infringing derivative work, made without permission from the authors and in violation of their exclusive rights. They alleged that OpenAI has benefited economically from the infringing output of the OpenAI Language Models, as each result of the auto-generative chat constitutes an act of contributory copyright infringement. On this issue, too, they sought damages and restitution of profits.
This class action concerns around 300,000 books that could have been victims of plagiarism and seeks to represent the hundreds of thousands of US authors whose copyrights may have been infringed, in many of these cases through websites that offer this content illegally.
Also on Wednesday, another class-action suit was filed against OpenAI in the California federal court by Clarkson, a public-interest law firm, on behalf of anonymous clients. They accuse OpenAI of stealing and misappropriating vast swathes of personal data from the Internet.