Data Is on the Menu — and AI’s Market Price Is High
Based on research by David Sénécal
The world is data hungry
Years ago, it was said that “data is the new oil,” but the AI generation has made data more like the new grain. The world is hungry for data, and hunger creates opportunity. While we don’t physically need data to survive, our technological dependence has elevated our mental need for it. Humans like to learn, and there is no better place than the internet to learn — aka find data.
As part of having the coolest job in the world, I had the opportunity to read the brilliant David Sénécal’s latest research piece on the convergence of AI, data collection, and bots. And I couldn’t help but write about it — I mean, he goes into header interpretation of several major large language model (LLM) crawlers and that’s only one of the fresh technical facets he discusses. (His subhead “The world is data hungry,” which I’ve co-opted here, also gives me the opportunity to make a bunch of food and cooking puns. Chef’s kiss!)
This blog post will give you just a taste of the main course: David’s white paper Unpacking the Complex Dynamic Behind AI Models’ Data Needs. For those of you who are looking for some light reading on AI and data collection, I’ve got you covered here — but if flow charts of JavaScript fingerprinting are more your jam, you’ll want to read the whole paper.
Botnets set the table
A botnet is the sous chef in the data collection strategy that is feeding AI systems. Botnets can be programmed as scrapers that gather the data used to train models on just about anything, including competitive intelligence, academic research, and even software development.
Whether you’re a Reddit lurker casually looking for information on human behavior or a Ph.D. candidate writing a dissertation on climate change, manually searching for the specific content you need can be very time-consuming. Even worse, it can be terribly tedious, and people typically don’t care for tedium nowadays.
AI models, however, can process massive amounts of data to derive the sought-after content, with analysis and insight, in a matter of seconds. Enter the scraping botnet/AI model duo (or as I like to call it: the AI combo, with fries and a scrape).
Scrapers are dishing out concerns
This combo is worrisome to the security community. As is often the case with rapidly evolving technology, the botnet/AI duo is understood by only a handful of people, and legislation hasn’t caught up yet. Legal guidelines on scraping are lax, and some groups are lobbying to normalize the practice. This dishes out some understandable hesitation among website owners about having their data included in these large AI models.
In addition to data use concerns, site owners have an à la carte menu of other worries with regard to crawlers and scrapers. Increased operational costs, degraded reliability and performance, and inaccurate metric reporting are also on the table. David’s full white paper dives into each of these consequences and how it can affect site owners.
Data, glorious data: AI collection makes us flustered
David’s paper provides a detailed analysis of the web data collection industry (or scraping as a service) that has emerged in recent years. These companies specialize in collecting data on the internet at scale, customized to the customer requesting it.
Because outsourcing the collection and analysis is far more cost-effective and efficient, a “business intelligence” market has emerged. There is Michelin-star money in this industry, y’all.
Although this data can have serious business value, the privacy and operational concerns should not be ignored. Some data is so sensitive that it shouldn’t be accessible at all without explicit user permission, even if it is relevant to the requested information. Ethical scraping will be in the test kitchens of the tech and security crowd as the lines between digital and physical become blurrier every day.
Pay as you scrape
If there’s a way to monetize a service, capitalism will find it. I was surprised to learn about a new development: Companies are offering site owners a way to profit off the scrapers that are using their data. Pricing models are currently unclear, but the emergence of this development is a shiny new centerpiece for the table.
This solution is certainly more attractive to the website owners: If they can’t stop it, they might as well get paid for it, right? Well, maybe. This service will open up its own menu of questions to be answered:
- What price for the data is fair to both sides?
- Will the data collection vendors play fair?
- Will there be transparency on data use?
David addresses these questions in the paper. This model would require a pretty big mindset shift and a lot of collaboration with parties that may have conflicting interests. Monetizing the service will not address every industry’s concern with scraping.
Money talks, but it does not speak the same language to everyone. Take media companies, for example: getting paid for scraped content is not as important as getting attribution for it. Attribution connects readers with the content publisher and its author, which leads to recognition, subscriptions, and, ultimately, ad revenue.
There’s always room for dessert
There is so much more in David’s paper I haven’t mentioned here — there’s even a whole course of bot management, including technical detection workflows. A problem as enormous and complex as AI security and privacy requires an equally enormous amount of information to understand it.
Ethical scraping and ethical AI training are likely going to be workshopped for a while. They require collaboration, knowledge, and an understanding that it will take time to get them right. Now is the time to find ingredients (resources), test recipes (try new things), and soak up all the sauce (knowledge) we can about AI and how it affects the security community and beyond.
Learn more
Here at Akamai research, we are committed to providing the community with powerful research to help fight the constant onslaught of malicious activity. Whether you are a defender, lawmaker, or even a researcher yourself, we try to provide you with valuable insights into all aspects of the security lifecycle.
You can see David Sénécal present on this topic at the upcoming RSAC event on Wednesday, April 30, at 6 PM in the Akamai booth: N-6245.
For our most recent research findings, follow the Akamai Security Intelligence Group on social media.