The status of copyright-protected material in AI training (fair use versus public domain) is still not settled:
Meta Beats Copyright Suit From Authors Over AI Training on Books
https://tech.slashdot.org/story/25/06/2 ... inlinkanon
An anonymous reader shares a report: Meta escaped a first-of-its-kind copyright lawsuit from a group of authors who alleged the tech giant hoovered up millions of copyrighted books without permission to train its generative AI model called Llama. San Francisco federal Judge Vince Chhabria ruled Wednesday that Meta's decision to use the books for training is protected under copyright law's fair use defense, but he cautioned that his opinion is more a reflection on the authors' failure to litigate the case effectively. "This ruling does not stand for the proposition that Meta's use of copyrighted materials to train its language models is lawful," Chhabria said.
Microsoft Sued By Authors Over Use of Books in AI Training
https://news.slashdot.org/story/25/06/2 ... inlinkanon
Microsoft has been hit with a lawsuit by a group of authors who claim the company used their books without permission to train its Megatron artificial intelligence model. From a report: Kai Bird, Jia Tolentino, Daniel Okrent and several others alleged that Microsoft used pirated digital versions of their books to teach its AI to respond to human prompts. Their lawsuit, filed in New York federal court on Tuesday, is one of several high-stakes cases brought by authors, news outlets and other copyright holders against tech companies including Meta Platforms, Anthropic and Microsoft-backed OpenAI over alleged misuse of their material in AI training. [...] The writers alleged in the complaint that Microsoft used a collection of nearly 200,000 pirated books to train Megatron, an algorithm that gives text responses to user prompts.
Anthropic Bags Key 'Fair Use' Win For AI Platforms, But Faces Trial Over Damages For Millions of Pirated Works
https://yro.slashdot.org/story/25/06/24 ... ated-works
A federal judge has ruled that Anthropic's use of copyrighted books to train its Claude AI models constitutes fair use, but rejected the startup's defense for downloading millions of pirated books to build a permanent digital library.
U.S. District Judge William Alsup granted partial summary judgment to Anthropic in the copyright lawsuit filed by authors Andrea Bartz, Charles Graeber, and Kirk Wallace Johnson. The court found that training large language models on copyrighted works was "exceedingly transformative" under Section 107 of the Copyright Act. Anthropic downloaded over seven million books from pirate sites, according to court documents. The startup also purchased millions of print books, destroyed the bindings, scanned every page, and stored them digitally.
Both sets of books were used to train various versions of Claude, which generates over $1 billion in annual revenue. While the judge approved using books for AI training purposes, he ruled that downloading pirated copies to create what Anthropic called a "central library of all the books in the world" was not protected fair use. The case will proceed to trial on damages related to the pirated library copies.
Meanwhile, there are datasets (and models) built on free training data:
Harvard's Book Bonanza: 1 Million Public-Domain Books Unleashed for AI Training
https://opentools.ai/news/harvards-book ... i-training
Eleuther AI releases 8TB collection of licensed and open training data
https://www.computerworld.com/article/4 ... -data.html
--
Srdja