The Data That Powers A.I. Is Disappearing Fast
New research from the Data Provenance Initiative has found a dramatic drop in content made available to the collections used to build artificial intelligence.
For years, the people building powerful artificial intelligence systems have used enormous troves of text, images and videos pulled from the internet to train their models.
Now, that data is drying up.
Over the past year, many of the most important web sources used for training A.I. models have restricted the use of their data, according to a study published this week by the Data Provenance Initiative, an M.I.T.-led research group.
The study, which looked at 14,000 web domains that are included in three commonly used A.I. training data sets, discovered an “emerging crisis in consent,” as publishers and online platforms have taken steps to prevent their data from being harvested.
The researchers estimate that in the three data sets — called C4, RefinedWeb and Dolma — 5 percent of all data, and 25 percent of data from the highest-quality sources, has been restricted. Those restrictions are set up through the Robots Exclusion Protocol, a decades-old method for website owners to prevent automated bots from crawling their pages using a file called robots.txt.
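To make the mechanism concrete, here is a minimal sketch of how a well-behaved crawler might read such a file, using the robots.txt parser in Python's standard library. The file contents, the example.com addresses and the crawler name SomeSearchBot are invented for illustration. GPTBot is the name OpenAI has published for its crawler, but the rules shown are not taken from any real site or from the study.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: it turns away one A.I. crawler entirely and
# keeps every other bot out of a single directory.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# A compliant crawler asks this question before requesting each page.
gptbot_allowed = parser.can_fetch("GPTBot", "https://example.com/articles/1")
other_allowed = parser.can_fetch("SomeSearchBot", "https://example.com/articles/1")
other_private = parser.can_fetch("SomeSearchBot", "https://example.com/private/x")

print(gptbot_allowed, other_allowed, other_private)
```

In this sketch the named A.I. crawler is refused everywhere, while other bots are kept out of only the /private/ directory. The file itself enforces nothing: a crawler has to run a check like this and act on the answer.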
The study also found that as much as 45 percent of the data in one set, C4, had been restricted by websites’ terms of service.
“We’re seeing a rapid decline in consent to use data across the web that will have ramifications not just for A.I. companies, but for researchers, academics and noncommercial entities,” said Shayne Longpre, the study’s lead author, in an interview.
Data is the main ingredient in today’s generative A.I. systems, which are fed billions of examples of text, images and videos. Much of that data is scraped from public websites by researchers and compiled in large data sets, which can be downloaded and freely used, or supplemented with data from other sources.
Learning from that data is what allows generative A.I. tools like OpenAI’s ChatGPT, Google’s Gemini and Anthropic’s Claude to write, code and generate images and videos. The more high-quality data is fed into these models, the better their outputs generally are.
For years, A.I. developers were able to gather data fairly easily. But the generative A.I. boom of the past few years has led to tensions with the owners of that data — many of whom have misgivings about being used as A.I. training fodder, or at least want to be paid for it.
As the backlash has grown, some publishers have set up paywalls or changed their terms of service to limit the use of their data for A.I. training. Others have blocked the automated web crawlers used by companies like OpenAI, Anthropic and Google.
Sites like Reddit and Stack Overflow have begun charging A.I. companies for access to data, and a few publishers have taken legal action, including The New York Times, which sued OpenAI and Microsoft for copyright infringement last year, alleging that the companies used news articles to train their models without permission.
Companies like OpenAI, Google and Meta have gone to extreme lengths in recent years to gather more data to improve their systems, including transcribing YouTube videos and bending their own data policies.
More recently, some A.I. companies have struck deals with publishers including The Associated Press and News Corp, the owner of The Wall Street Journal, giving them ongoing access to their content.
But widespread data restrictions may pose a threat to A.I. companies, which need a steady supply of high-quality data to keep their models fresh and up-to-date.
They could also spell trouble for smaller A.I. outfits and academic researchers who rely on public data sets, and can’t afford to license data directly from publishers. Common Crawl, one such data set that comprises billions of pages of web content and is maintained by a nonprofit, has been cited in more than 10,000 academic studies, Mr. Longpre said.
It’s not clear which popular A.I. products have been trained on these sources, since few developers disclose the full list of data they use. But data sets derived from Common Crawl, including C4 (which stands for Colossal Clean Crawled Corpus), have been used by companies including Google and OpenAI to train previous versions of their models. Spokespeople for Google and OpenAI declined to comment.
Yacine Jernite, a machine learning researcher at Hugging Face, a company that provides tools and data to A.I. developers, characterized the consent crisis as a natural response to the A.I. industry’s aggressive data-gathering practices.
“Unsurprisingly, we’re seeing blowback from data creators after the text, images and videos they’ve shared online are used to develop commercial systems that sometimes directly threaten their livelihoods,” he said.
But he cautioned that if all A.I. training data needed to be obtained through licensing deals, it would exclude “researchers and civil society from participating in the governance of the technology.”
Stella Biderman, the executive director of EleutherAI, a nonprofit A.I. research organization, echoed those fears.
“Major tech companies already have all of the data,” she said. “Changing the license on the data doesn’t retroactively revoke that permission, and the primary impact is on later-arriving actors, who are typically either smaller start-ups or researchers.”
A.I. companies have claimed that their use of public web data is legally protected under fair use. But gathering new data has gotten trickier. Some A.I. executives I’ve spoken to worry about hitting the “data wall” — their term for the point at which all of the training data on the public internet has been exhausted, and the rest has been hidden behind paywalls, blocked by robots.txt or locked up in exclusive deals.
Some companies believe they can scale the data wall by using synthetic data — that is, data that is itself generated by A.I. systems — to train their models. But many researchers doubt that today’s A.I. systems are capable of generating enough high-quality synthetic data to replace the human-created data they’re losing.
Another challenge is that while publishers can try to stop A.I. companies from scraping their data by placing restrictions in their robots.txt files, those requests aren’t legally binding, and compliance is voluntary. (Think of it like a “no trespassing” sign for data, but one without the force of law.)
Major search engines honor these opt-out requests, and several leading A.I. companies, including OpenAI and Anthropic, have said publicly that they do, too. But other companies, including the A.I.-powered search engine Perplexity, have been accused of ignoring them. Perplexity’s chief executive, Aravind Srinivas, told me that the company respects publishers’ data restrictions. He added that while the company once worked with third-party web crawlers that did not always follow the Robots Exclusion Protocol, it had “made adjustments with our providers to ensure that they follow robots.txt when crawling on Perplexity’s behalf.”
Mr. Longpre said that one of the big takeaways from the study is that we need new tools to give website owners more precise ways to control the use of their data. Some sites might object to A.I. giants using their data to train chatbots for a profit, but might be willing to let a nonprofit or educational institution use the same data, he said. Right now, there’s no good way for them to distinguish between those uses, or block one while allowing the other.
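The limitation comes from how the protocol matches rules: by the name a crawler announces, not by what the collected data will be used for. The sketch below, again using Python's standard-library parser, assumes a hypothetical site that wants to block a commercial crawler and welcome a nonprofit one. All three bot names and the example.com address are made up.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: the site can only list crawlers by name.
ROBOTS_TXT = """\
User-agent: ExampleCommercialBot
Disallow: /

User-agent: ExampleNonprofitBot
Allow: /
"""

URL = "https://example.com/post"

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# The third crawler is commercial too, but the site has never heard of it,
# so no rule mentions it.
agents = ("ExampleCommercialBot", "ExampleNonprofitBot", "UnlistedCommercialBot")
decisions = {agent: parser.can_fetch(agent, URL) for agent in agents}

for agent, allowed in decisions.items():
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

The unlisted crawler is allowed by default, because the file has no vocabulary for "commercial" or "nonprofit," only bot names. A site that wants to draw that line has to know and list every crawler in advance.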
But there’s also a lesson here for big A.I. companies, who have treated the internet as an all-you-can-eat data buffet for years, without giving the owners of that data much of value in return. Eventually, if you take advantage of the web, the web will start shutting its doors.
A version of this article appears in print on July 22, 2024, Section B, Page 1 of the New York edition with the headline: The Data That Powers A.I. Is Disappearing Fast.