Amazon, CNN, New York Times and more websites block ChatGPT robot

Alcides, 1 September 2023

ChatGPT can answer a wide range of questions with reasonable accuracy because it was trained on large amounts of text: books, articles, and websites. The websites, however, seem quite worried. An analysis shows that at least 15 of the 100 most accessed websites have blocked GPTBot, the OpenAI robot responsible for collecting content. The list includes Amazon, the New York Times, CNN and others.

The data comes from an analysis by Originality.ai, a company that specializes in checking whether content was generated by artificial intelligence or plagiarized.

Among the 100 most accessed websites on the internet, at least 15 have already blocked the robot. Among the 1,000 most accessed, more than 70 have done the same.

The sites among the 1,000 most accessed that are blocking GPTBot include famous names such as:

Amazon
The New York Times
CNN
wikiHow
Shutterstock
Quora
Bloomberg
Scribd
Reuters
Ikea
Airbnb
Coursera

ChatGPT and other AIs are accused of copyright infringement
Blocking OpenAI’s robot is one way to prevent the use of copyrighted content.

“Intellectual property is the lifeblood of our business, and we need to protect the copyright of our content,” a Reuters spokeswoman told the Guardian.

The New York Times updated its terms of service to add a clause that prohibits scraping its content to train or develop artificial intelligence.

This has been a topic of debate since ChatGPT and other generative artificial intelligence tools were launched.

Image bank Getty Images, for example, sued the creators of Stable Diffusion for training AI with copyrighted photographs. Some of the tool’s creations even show Getty’s watermark.

Writers took a similar path and sued OpenAI, while a class action lawsuit was filed against Microsoft, GitHub, and OpenAI for violating the attribution requirements of the open source licenses covering code used to train the tools.

Sites also block public archive crawler
GPTBot is OpenAI’s crawler, the name given to robots that “crawl” the web indexing and collecting information. Google and Bing, for example, have their own crawlers, which catalog web pages to show in search results.

OpenAI’s idea is to gather information to train the large language model that makes ChatGPT work.

GPTBot was announced in early August 2023. OpenAI also published instructions on how websites can prevent it from collecting content: disallow the crawler in the site’s robots.txt file or block its IP addresses.
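
As a rough illustration of the robots.txt route, the sketch below holds a two-line rule that disallows the GPTBot user agent for the whole site, and checks it with Python's standard `urllib.robotparser` module. The helper name `is_allowed` and the `example.com` URL are hypothetical, chosen only for this example. Note that robots.txt is a request that well-behaved crawlers honor, not a technical barrier.

```python
from urllib.robotparser import RobotFileParser

# Rule a site could place in its robots.txt to turn GPTBot away everywhere.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url.

    Hypothetical helper for this example: parses the text in memory
    instead of downloading a live robots.txt file.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# example.com is a placeholder domain, not one of the sites in the article.
gptbot_allowed = is_allowed(ROBOTS_TXT, "GPTBot", "https://example.com/article")
other_allowed = is_allowed(ROBOTS_TXT, "SomeOtherBot", "https://example.com/article")

print("GPTBot allowed:", gptbot_allowed)
print("Other crawler allowed:", other_allowed)
```

Because the rule names only GPTBot, other crawlers such as search engine robots are unaffected by it.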

Some (but not all) sites on the list have also blocked CCBot, the crawler of the nonprofit Common Crawl, whose goal is to create public archives for anyone to access.
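
To make the difference between the two groups of sites concrete, here is a minimal sketch that reports which of the two crawlers a given robots.txt turns away. It is not the method Originality.ai used, which the article does not describe. The function name `blocked_crawlers` and both sample files are invented for illustration.

```python
from urllib.robotparser import RobotFileParser

# User-agent tokens of the two crawlers discussed in the article.
AI_CRAWLERS = ("GPTBot", "CCBot")


def blocked_crawlers(robots_txt, crawlers=AI_CRAWLERS, path="/"):
    """List the crawlers that robots_txt does not allow to fetch path.

    Hypothetical helper: works on robots.txt text already in memory.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [name for name in crawlers if not parser.can_fetch(name, path)]


# Invented sample: a site that blocks both crawlers with one rule group.
BLOCKS_BOTH = """\
User-agent: GPTBot
User-agent: CCBot
Disallow: /
"""

# Invented sample: a site that blocks GPTBot but leaves everyone else alone.
BLOCKS_GPTBOT_ONLY = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

both = blocked_crawlers(BLOCKS_BOTH)
gptbot_only = blocked_crawlers(BLOCKS_GPTBOT_ONLY)

print("First site blocks:", both)
print("Second site blocks:", gptbot_only)
```

A survey across many sites would also need to download each robots.txt file and handle missing files and network errors, which this sketch leaves out.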

Some of the data used to train ChatGPT, as well as models from Google and other companies, comes from Common Crawl.
