Prominent authors including master of horror and hysterical leftist Stephen King are finding themselves at the center of a debate on the ethical use of copyrighted materials in training advanced AI models. Authors have filed a lawsuit claiming Facebook (now known as Meta) and a laundry list of other AI giants have ripped off their copyrighted books to train AI systems.
The Atlantic reports that the latest AI controversy revolves around the use of copyrighted literary works to train sophisticated language models. These models, which power chatbots designed to produce human-like responses, rely on vast amounts of written content. But the sources of their training material have largely remained a mystery, raising concerns in the literary community.
Facebook CEO Mark Zuckerberg arrives for the 8th annual Breakthrough Prize awards ceremony at NASA Ames Research Center in Mountain View, California on November 3, 2019. (Photo by JOSH EDELSON/AFP via Getty Images)
OpenAI founder Sam Altman, creator of ChatGPT (TechCrunch/Flickr)
Stephen King, Zadie Smith, and Michael Pollan are among a growing list of authors whose works have reportedly been used to train AI models without permission. The debate is not just about copyright infringement but also about transparency and ethics in the development of AI.
The Atlantic writes:
Some training text comes from Wikipedia and other online writing, but high-quality generative AI requires higher-quality input than is usually found on the internet—that is, it requires the kind found in books. In a lawsuit filed in California last month, the writers Sarah Silverman, Richard Kadrey, and Christopher Golden allege that Meta violated copyright laws by using their books to train LLaMA, a large language model similar to OpenAI’s GPT-4—an algorithm that can generate text by mimicking the word patterns it finds in sample texts. But neither the lawsuit itself nor the commentary surrounding it has offered a look under the hood: We have not previously known for certain whether LLaMA was trained on Silverman’s, Kadrey’s, or Golden’s books, or any others, for that matter.
The Atlantic highlighted the sheer volume of data used to train AI large language models (LLMs), writing:
Upwards of 170,000 books, the majority published in the past 20 years, are in LLaMA’s training data. In addition to work by Silverman, Kadrey, and Golden, nonfiction by Michael Pollan, Rebecca Solnit, and Jon Krakauer is being used, as are thrillers by James Patterson and Stephen King and other fiction by George Saunders, Zadie Smith, and Junot Díaz. These books are part of a dataset called “Books3,” and its use has not been limited to LLaMA. Books3 was also used to train Bloomberg’s BloombergGPT, EleutherAI’s GPT-J—a popular open-source model—and likely other generative-AI programs now embedded in websites across the internet. A Meta spokesperson declined to comment on the company’s use of Books3; a spokesperson for Bloomberg confirmed via email that Books3 was used to train the initial model of BloombergGPT and added, “We will not include the Books3 dataset among the data sources used to train future versions of BloombergGPT”; and Stella Biderman, EleutherAI’s executive director, did not dispute that the company used Books3 in GPT-J’s training data.
Although Facebook’s LLaMA AI model was apparently trained on copyrighted material, the company takes a hard line against others using its own copyrighted material. The Atlantic explains:
Meta’s proprietary stance with LLaMA suggests that the company thinks similarly about its own work. After the model leaked earlier this year and became available for download from independent developers who’d acquired it, Meta used a DMCA takedown order against at least one of those developers, claiming that “no one is authorized to exhibit, reproduce, transmit, or otherwise distribute Meta Properties without the express written permission of Meta.” Even after it had “open-sourced” LLaMA, Meta still wanted developers to agree to a license before using it; the same is true of a new version of the model released last month. (Neither the Pile nor Books3 is mentioned in a research paper about that new model.)
Read more at The Atlantic.
Lucas Nolan is a reporter for Breitbart News covering issues of free speech and online censorship. Follow him on Twitter @LucasNolan