Apple, NVIDIA, and Anthropic allegedly used YouTube transcripts to train their AIs without permission

Alcides, 19 July 2024

The dataset consists of transcribed subtitles from YouTube videos by the platform's most-subscribed content creators.

Questionable practices in AI training among the big names in tech
A new investigation by Proof News reports that tech giants such as Apple and NVIDIA trained their AI models on data they were not permitted to use. They relied on a dataset built from the subtitles of more than 173,000 YouTube videos, copied without permission.

A dataset compiled without the creators' permission
The transcripts come from more than 48,000 YouTube channels, ranging from creators such as Marques Brownlee and MrBeast to media outlets such as The New York Times, the BBC, and ABC News. Subtitles from videos on the Engadget channel are also included.

Marques Brownlee wrote on X that Apple has sourced data for its AI from several companies, and that one of them scraped a large amount of data and transcripts from YouTube, including his own videos. In his view, the problem will keep recurring.

A complete lack of transparency
Most companies developing AI models have not been specific about the sources of their training data. Guarino pointed out that, at the end of January this year, artists and photographers voiced their displeasure with Apple after it failed to explain where the data used to train Apple Intelligence, its new AI for content creation available on millions of Apple devices, came from.

These violations point to a need to regulate how publicly available information is used
As the largest video-sharing platform in the world, YouTube is a rich source of data (transcripts, audio, video, images) and is highly attractive to businesses keen on training their AI models. According to a Google official and OpenAI, using YouTube data for this purpose is not permitted, since it breaches YouTube's policies and terms.

These revelations raise pressing questions about the legality, and even the morality, of data collection. Those questions need answers, and it now falls to the tech companies to work out how to use data responsibly.
