One of the early use cases for AI in newsrooms appears to be fighting AI itself.
The New York Times updated its terms of service Aug. 3 to forbid the scraping of its content to train a machine learning or AI system.
The content includes but is not limited to text, photographs, images, illustrations, designs, audio clips, video clips, “look and feel” and metadata, including the party credited as the provider of such content.
The updated TOS also prohibits website crawlers, which index pages for search results, from using content to train LLMs or AI systems.
Defying these rules could result in penalties, per the terms of service, although it’s unclear what the penalties would look like. When contacted for this piece, The New York Times said that it didn’t have any additional comment beyond the TOS.
“Most boilerplate terms of service include restrictions on data scraping, but the explicit reference to training AI is new,” said Katie Gardner, partner at Gunderson Dettmer.
AI models rely on content and data, including journalism and copyrighted art, as a main source of information for generating their output. In some cases, this content is replicated verbatim. Publishers, especially those with paywalls and healthy subscription businesses, are concerned that AI models will undermine their revenue streams by publishing repurposed content without credit, and that they will contribute to misinformation, degrading people’s trust in news.
The confusing case of creepy crawlers
LLMs like ChatGPT work similarly to website crawlers, which scan content on publishers’ sites and use that information to inform search results.
While publishers can see crawlers visiting their sites, they cannot know their exact purposes, whether for search engine optimization or training AI models. Some paywall tech companies are looking at ways to block crawlers, according to Digiday’s reporting.
Earlier this week, OpenAI launched GPTBot, a web crawler meant to improve AI models, along with a way for publishers to control GPTBot’s access to their website content. Still, significant players in the field, namely Microsoft’s Bing and Google’s Bard, have not added this functionality to their bots, leaving publishers struggling to control what the crawlers scrape.
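
For readers wondering what that control looks like in practice, crawler access is typically governed by a site's robots.txt file, and GPTBot identifies itself with the user-agent token "GPTBot". The sketch below is illustrative only: the robots.txt contents, the example.com URL and the `is_allowed` helper are assumptions made for this example, not any publisher's actual configuration. It uses Python's standard-library robots.txt parser to check which crawlers a given set of rules would admit.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a publisher might serve: it turns away GPTBot
# while leaving the site open to every other crawler.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if these robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    article = "https://example.com/2023/08/some-article"  # placeholder URL
    for bot in ("GPTBot", "Googlebot"):
        print(bot, is_allowed(ROBOTS_TXT, bot, article))
```

Rules like these are a request rather than an enforcement mechanism: they only work when the crawler chooses to honor them, which is part of why publishers are also looking at blocking tools.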