Billion Dollar Companies Like Apple And Nvidia Are Swiping YouTube Content To Train Their AI

Apple, Nvidia, and Salesforce are using content from YouTube to train their AI.

According to Proof News and Wired, these companies used subtitles from 173,536 YouTube videos spread across 48,000 channels as training data, despite YouTube's rules against harvesting information from the platform.

The dataset - called YouTube Subtitles - includes transcripts from educational channels like Khan Academy, MIT, and Harvard, as well as media outlets such as The Wall Street Journal, NPR, and the BBC.

Late-night shows like The Late Show, Last Week Tonight, and Jimmy Kimmel Live were also used, the report says.

Additionally, Proof News found that popular YouTubers like MrBeast, Marques Brownlee, Jacksepticeye, and PewDiePie had their videos included. 

David Pakman, host of The David Pakman Show, which sports more than 2 million subscribers and more than 2 billion views, commented: “No one came to me and said, ‘We would like to use this.’”

“This is my livelihood, and I put time, resources, money, and staff time into creating this content. There’s really no shortage of work,” he added, arguing that if AI companies are paid, he should be compensated for his data.


Dave Wiskus, the CEO of Nebula, didn't mince words: “It’s theft. Will this be used to exploit and harm artists? Yes, absolutely.” 

The data was part of 'The Pile', a publicly released compilation that includes content from YouTube, the European Parliament, English Wikipedia, and corporate emails.

Apple used the Pile to train OpenELM before adding new AI features to its products. Bloomberg and Databricks also used the Pile, according to their publications. Anthropic, an AI company backed by a $4 billion Amazon investment, confirmed its use of the Pile for its AI assistant, Claude, while emphasizing compliance with YouTube's terms, Wired wrote.

Salesforce used the Pile for an AI model intended for academic and research purposes, releasing it publicly in 2022. This model has been downloaded over 86,000 times. 

Litigation against companies using unauthorized data for AI training is ongoing. Authors have sued over the use of their works in datasets like Books3, another Pile component. Tech companies argue their actions fall under fair use, but the legal battles continue.

Read Wired's full story here.

Authored by Tyler Durden via ZeroHedge July 17th 2024