Adobe Hit: AI Training Lawsuit Over Copyrighted Authors’ Work
Adobe Sued for Allegedly Training AI Model on Pirated Books
A proposed class-action lawsuit claims Adobe used copyrighted books, including works by author Elizabeth Lyon, to train its SlimLM AI model, raising questions about data sourcing in AI development.
The Allegation: Copyright Infringement in AI Training
Adobe, a leading software company, is facing a proposed class-action lawsuit centered on the training data used for its SlimLM AI model. The lawsuit, filed on behalf of author Elizabeth Lyon, alleges that Adobe used pirated copies of numerous books, including Lyon’s own works, to train the model. This raises significant concerns about copyright infringement and the ethical sourcing of data in the rapidly evolving field of artificial intelligence.
Lyon, an author of non-fiction writing guidebooks, claims her works were included in the pretraining dataset Adobe used for SlimLM. The lawsuit asserts that Adobe knowingly or negligently incorporated copyrighted material obtained through illicit means into its AI training process.
Understanding SlimLM and its Training Data
Adobe describes SlimLM as a series of small language models designed for “document assistance tasks on mobile devices.” The company states that SlimLM was pre-trained on SlimPajama-627B, an open-source dataset released by Cerebras in June 2023. SlimPajama-627B is presented as a “deduplicated, multi-corpora” dataset, meaning it was compiled from various sources and efforts were made to remove redundant information.
However, the lawsuit challenges the claim of proper deduplication and lawful sourcing. The core argument is that despite being labeled “open-source,” the dataset contained copyrighted material obtained through unauthorized channels. The lawsuit doesn’t directly accuse Cerebras of wrongdoing, but focuses on Adobe’s use of the dataset knowing, or having reason to know, it contained infringing material.
The size of SlimPajama-627B – 627 billion tokens – is substantial, making a manual review for copyright violations impractical. This highlights the difficulty in ensuring the legality of large datasets used for AI training.
The Legal Landscape: AI, Copyright, and Fair Use
This lawsuit is part of a growing trend of legal challenges concerning the use of copyrighted material in AI training. Several key questions are at the forefront of these debates:
- Is AI training considered “fair use” under copyright law? This is a central point of contention. Arguments for fair use often center on the transformative nature of AI training – that the AI is not simply reproducing the original work, but using it to learn patterns and generate new content.
- Does the source of the training data matter? Even if AI training is deemed fair use, using illegally obtained copyrighted material could still be a violation of copyright law.
- What is the duty of AI developers to ensure the legality of their training data? The lawsuit suggests Adobe had a duty to verify the source of the data used to train SlimLM.
