
The New York Times Updates Terms of Service to Prevent AI Scraping Its Content



Among the early use cases of AI within newsrooms appears to be fighting AI itself.

The New York Times updated its terms of service Aug. 3 to forbid the scraping of its content to train a machine learning or AI system.

The content includes but is not limited to text, photographs, images, illustrations, designs, audio clips, video clips, “look and feel” and metadata, including the party credited as the provider of such content.

The updated TOS also prohibits website crawlers, which index pages for search results, from using content to train large language models (LLMs) or AI systems.

Defying these rules could result in penalties, per the terms of service, although it’s unclear what the penalties would look like. When contacted for this piece, The New York Times said that it had no comment beyond the TOS.

“Most boilerplate terms of service include restrictions on data scraping, but the explicit reference to training AI is new,” said Katie Gardner, partner at Gunderson Dettmer.

AI models rely on content and data, including journalism and copyrighted art, as a main source of information for generating their output. In some cases, this content is replicated verbatim. Publishers, especially those with paywalls and healthy subscription businesses, are concerned that AI models will undermine their revenue streams by publishing repurposed content without credit, and that they will contribute to misinformation, degrading people’s trust in news.

The confusing case of creepy crawlers

LLMs like ChatGPT ingest content in much the same way as website crawlers, which scan publishers’ sites and feed what they find into search results.

While publishers can see crawlers visiting their sites, they cannot tell what a given crawler is for, whether search engine optimization or training AI models. Some paywall tech companies are looking at ways to block crawlers, according to Digiday’s reporting.

Crawlers like CommonCrawl, with a data set of 3.15 billion web pages, have brokered deals with OpenAI, Meta, and Google for AI training, per The Decoder.

Earlier this week, OpenAI launched GPTBot, a web crawler that gathers data to improve AI models, and gave publishers a way to control GPTBot’s access to their website content. Still, significant players in the field, namely Microsoft’s Bing and Google’s Bard, have not added this functionality to their bots, leaving publishers struggling to control what the crawlers scrape.
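For readers wondering what that control looks like in practice, crawler access is typically governed by a site's robots.txt file, which OpenAI says GPTBot honors. The sketch below is illustrative only: the robots.txt rules, the example.com URL, and the `is_allowed` helper are assumptions made for this example, not anything The New York Times or any other publisher is known to use. It relies on Python's standard `urllib.robotparser` module to check which crawlers a given set of rules would admit.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block OpenAI's GPTBot site-wide,
# while leaving every other crawler unrestricted.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's contents as a list of lines.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Placeholder URL, used only for illustration.
    url = "https://example.com/2023/08/some-article.html"
    for agent in ("GPTBot", "Googlebot"):
        verdict = "allowed" if is_allowed(ROBOTS_TXT, agent, url) else "blocked"
        print(f"{agent}: {verdict}")
```

Note that robots.txt is a voluntary convention: it works only for crawlers that choose to respect it, which is why bots without a comparable opt-out leave publishers with little control.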
