Meta sued for using copyrighted books for AI training
Did Meta break the law? Authors claim the company used their books without permission. According to a recent filing in a copyright infringement lawsuit, Meta Platforms, the parent company of Facebook and Instagram, allegedly ignored legal warnings about the risks of using thousands of pirated books to train its AI models.
Comedian Sarah Silverman, Pulitzer Prize winner Michael Chabon, and other authors filed two lawsuits against Meta, accusing the company of using their works without permission to train its AI language model, Llama. Although a California judge recently dismissed part of Silverman’s lawsuit, the authors were granted permission to amend their claims, Reuters reported.
In the latest filing on Monday, the authors presented chat logs of a Meta-affiliated researcher discussing the procurement of the dataset in a Discord server. This evidence suggests that Meta was aware that its use of the books might not be protected by U.S. copyright law. In the chat logs, researcher Tim Dettmers mentioned discussions with Meta’s legal department about the legality of using book files as training data.
Dettmers stated in 2021 that there was interest at Facebook in working with the dataset, known as The Pile, but that it could not be used in its current form for legal reasons. The complaint also cited Dettmers' statement that Meta's lawyers had told him the data could not be used, and that models trained on it could not be published.
“At Facebook, there are a lot of people interested in working with (T)he (P)ile, including myself, but in its current form, we are unable to use it for legal reasons,” Dettmers wrote.
While the chat logs do not provide details about the lawyers’ concerns, there is mention of worries related to “books with active copyrights.” The researchers in the chat suggested that training on the data should fall under fair use, a U.S. legal doctrine that protects some unlicensed use of copyrighted works.
Tech giants, including Meta, have faced numerous lawsuits this year from content creators who accuse them of using copyrighted works to build generative AI models. If the creators prevail in these cases, the cost of building such models could rise, as companies may be required to compensate content creators for using their works.
Additionally, new rules in Europe regulating artificial intelligence may force companies to disclose the data used to train their models, exposing them to further legal risks.
As we reported earlier this year, Meta released the first version of its Llama language model in February. Llama was designed to generate text and conversations, summarize written material, and perform more complicated tasks such as proving mathematical theorems or predicting protein structures.
During the launch, Meta also disclosed that it had used datasets such as "the Books3 section of ThePile." However, the company did not disclose the training data for the latest version, Llama 2, which became available for commercial use over the summer. Llama 2 is free for companies with fewer than 700 million monthly active users and is considered a potential game-changer in the market for generative AI software.