Over the past year, a revolt has been brewing across the web. Small website owners, independent bloggers, and well-known media publishers have grown increasingly vocal about the disruptive behavior of a new wave of AI web crawlers that scrape web content with unprecedented aggression and recklessness.
Unlike the relatively disciplined traditional search engine bots, many AI crawlers flood servers with requests and show a startling lack of restraint.
Website operators report crippling bandwidth spikes and inflated infrastructure costs due to the sheer volume of automated traffic.
Even Wikimedia, one of the largest open data projects in the world, is straining under the weight. Since January 2024, the bandwidth used to download images from Wikimedia Commons has surged by 50%, largely due to automated systems collecting openly licensed content, which is valuable for training AI models. A recent internal review found that 65% of their most resource-intensive traffic comes from agents that don’t behave like typical web browsers.
To further illustrate the scope of the problem, here is a compilation of notable examples. A single webmaster received 81,000 requests in just two and a half hours, and only 3% of those requests passed a basic proof-of-work challenge, implying that the other 97% were automated. Another operator measured that 70% of their total traffic now comes from AI-linked user agents, with OpenAI accounting for roughly a quarter, Amazon 15%, and Anthropic over 4%. In contrast, Google and Bing crawlers barely register, together making up less than one percent of total requests.
These numbers paint a troubling picture: the majority of crawling traffic is no longer coming from conventional search engines or archival bots, but from a loosely coordinated wave of AI collectors acting without clear guardrails.
Crawling and scraping public web data are foundational to academic research and business decision-making. They are also the invisible fodder behind many platforms. But many report that AI crawlers today behave like bad neighbors. They show up unannounced, take food from your fridge, and let themselves out—only to be back an hour later expecting a restock.
The tension raises deeper questions about the ethics and governance of web crawling. Should there be stricter standards for AI crawlers to prevent harm, and what would those look like? Are technical countermeasures alone sufficient, or do we need stronger legal protections for website operators? As AI continues to rely on web data, these questions demand attention for the sake of a sustainable online ecosystem.
Getting to know the different crawlers
Before we move on, it’s worth getting clear on some definitions.
A crawler is a program that systematically browses the web. It follows links, downloads pages, and collects content.
What do I mean by AI crawlers? Here I am referring to a specific kind: systems run by AI companies to gather training data for machine learning models. These include the crawlers feeding large language models (LLMs), hardware-based voice assistants, and other AI systems that rely on huge volumes of web-based text, code, audio, and images.
This is a different class from everyday tools like curl or wget, which fetch individual pages. It also differs from AI search or assistant tools that retrieve data on a user’s behalf. What sets these crawlers apart is the scale and intent: they’re built to indiscriminately sweep vast portions of the web, often with little regard for the sites they hit.
There is also another layer to this. Some traffic tagged as coming from “AI crawlers” may not be what it seems. Operators looking to fly under the radar might pretend to be well-known AI crawlers—spoofing user agents while routing requests through residential IP addresses.
Why AI crawlers are disruptive
AI-powered web crawlers don’t behave like traditional, well-engineered scrapers.
Traditional scrapers—used for price monitoring, lead generation, or news aggregation—typically target known, structured data sources and respect crawl etiquette. They aim for precision, efficiency, and sustainability.
AI crawlers, by contrast, pursue exhaustiveness. Many operate with blunt, repetitive, and poorly optimized strategies. Instead of respecting rate limits or site-specific restrictions, they overwhelm servers with high-frequency requests, draining resources at unsustainable levels.
The difference isn’t just scale. It’s intent and behavior.
People are noticing—and not in a good way. Reports describe buggy crawlers that get stuck in loops, downloading the same pages over and over. One observer remarked that these bots are “DDoS-ing the entire internet.”
Some called out the lack of consistency between AI companies' stated policies and their observed actions. This has fueled mistrust that not only strains web infrastructure but also forces site owners to implement blanket anti-bot measures that harm even well-behaved scrapers.
What happened to basic scraping courtesy?
In the Web Scraping 2025 report, I offered a hypothesis that these firms already have the technical talent and compute to build their own data collection systems. But their current failure modes point to a deeper, less obvious problem: they assumed crawling the web is easy.
To be fair, brute-force crawling is easy—especially if you can afford to burn compute and ignore the cost to others. But efficient, respectful, and sustainable crawling is not easy. It takes engineering effort, operational nuance, and domain expertise.
During my years leading the solutions architecture team at a web scraping consultancy, I made sure every data collection project was designed with sustainability and ethics in mind. I set up principles and tools to account for how much traffic each target site could reasonably absorb while still meeting our contractual commitments to clients. And to the credit of my employer, this mindset was a big part of our value: we were careful to collect public data without abusing or overburdening the infrastructure that hosts it.
So what’s behind the sloppiness?
Is it carelessness? Inexperience and incompetence? Disregard and disrespect? Corner-cutting in a high-pressure fast-paced foundational model space? All of them?
Never attribute to malice what could be adequately explained by stupidity. Is Hanlon's razor appropriate here?
Still, none of that excuses it. These companies could crawl smarter. But they don’t. Some scrape low-value data: logs, diffs, duplicates, again and again.
At the heart of it all is this: an insatiable hunger for data.
At the moment, AI companies run on the scaling hypothesis: more data leads to better models. That belief fuels indiscriminate, mindless scraping. And unlike traditional data extraction, which mostly focuses on structured datasets, AI crawlers target both structured and unstructured content. Their goal isn’t to extract clean data points. It’s to absorb as much raw input as possible. If they can get to it, it’s scraped. No prioritization. No filtering. Just scale.
How website owners are fighting back
When neighbors turn into neighbots, people put up fences.
As AI crawlers consume bandwidth, overwhelm servers, and ignore protocols, website owners are striking back—with code, friction, and deception.
Right now, most website owners face a blunt choice: block entirely or absorb the cost. Some outright ban entire cloud IP ranges. Some set honeypots, tarpits, and proof-of-work challenges to waste compute time and slow the crawlers down. Some employ aggressive tactics like content poisoning, where junk data is served to corrupt large-scale training sets. Then there are micropayments, which flip the default: pay to access, or get blocked. Login walls and CAPTCHAs are spreading as well, adding layers of friction that turn public content into gated content.
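To make the flavor of these defenses concrete, here is a minimal sketch of the bluntest one: refusing requests from self-identified AI crawlers by user agent. The Flask app and the bot names in the deny list are illustrative assumptions, not a recommendation for any particular stack, and spoofed user agents sail right past a filter like this.

```python
# Minimal sketch of a user-agent deny list. The framework choice (Flask) and
# the bot tokens are illustrative assumptions; real deployments usually layer
# this with IP reputation, rate limiting, or proof-of-work challenges.
from flask import Flask, abort, request

app = Flask(__name__)

# Hypothetical deny list of self-identified AI crawler tokens.
BLOCKED_UA_SUBSTRINGS = ("GPTBot", "CCBot", "ClaudeBot", "Bytespider")

@app.before_request
def block_ai_crawlers():
    user_agent = request.headers.get("User-Agent", "").lower()
    if any(token.lower() in user_agent for token in BLOCKED_UA_SUBSTRINGS):
        abort(403)  # refuse the request before it reaches any route

@app.route("/")
def index():
    return "Hello, humans."
```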
But one of the biggest shifts came from a search giant. Did you notice? Google has started shielding its search result pages behind JavaScript, a move that likely reflects this new AI-driven pressure.
Even infrastructure providers are adapting. Cloudflare, for example, has started deploying AI traps that force the crawlers to ingest AI-generated data. It’s the Turing Tango.
As these measures proliferate, the open web is becoming less accessible.
A sustainable path forward for AI crawling
How can crawler authors make their bots play nicer?
If the current trajectory continues, the internet risks devolving into a battleground between the scrapers and the scraped. To prevent this, and to maintain their own access to public data, AI companies must adopt sustainable, respectful crawling practices that balance their need for data with the health of the open web.
This code of conduct for AI crawlers seems like a good place to start:
1. Respect site boundaries
Yes, robots.txt is an honor-based system and not legally binding, but many websites still rely on it as a signal of consent. Ignoring it fuels distrust and invites defensive measures.
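As a minimal sketch of what honoring that signal involves, Python’s standard library already ships a robots.txt parser. The bot name and URLs below are placeholders.

```python
# Minimal sketch: consult robots.txt before fetching, using only the standard
# library. The bot name and target URLs are placeholders.
from urllib import robotparser

BOT_NAME = "ExampleAIBot/1.0"  # hypothetical crawler identity
robots = robotparser.RobotFileParser("https://example.com/robots.txt")
robots.read()

url = "https://example.com/some/article"
if robots.can_fetch(BOT_NAME, url):
    delay = robots.crawl_delay(BOT_NAME) or 1.0  # honor Crawl-delay if present
    print(f"Allowed to fetch {url}; waiting {delay}s between requests")
else:
    print(f"robots.txt disallows {url} for {BOT_NAME}; skipping")
```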
2. Identify yourself
AI crawlers should operate transparently by using honest, descriptive, and verifiable user-agent strings. If you're proud of your crawler, say who you are.
3. Crawl efficiently
Practice restraint. AI crawlers should limit request rates, avoid redundant access, and skip low-value or duplicate pages. Mindless collection burns bandwidth and wastes compute.
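Here is a rough sketch of what per-domain restraint can look like in practice. The two-second delay and the fetch step are assumptions for illustration, not tuned values.

```python
# Sketch of per-domain politeness: a fixed delay between requests to the same
# host and a seen-URL set to avoid refetching. Values are illustrative.
import time
from urllib.parse import urlsplit

MIN_DELAY_SECONDS = 2.0  # assumed per-domain gap between requests
last_request_at = {}     # host -> timestamp of the most recent request
seen_urls = set()        # URLs already downloaded in this run

def polite_fetch(url):
    if url in seen_urls:
        return  # skip duplicates instead of re-downloading them
    host = urlsplit(url).netloc
    elapsed = time.monotonic() - last_request_at.get(host, 0.0)
    if elapsed < MIN_DELAY_SECONDS:
        time.sleep(MIN_DELAY_SECONDS - elapsed)
    last_request_at[host] = time.monotonic()
    seen_urls.add(url)
    # The actual download would go here; a real crawler would also honor
    # robots.txt, Retry-After headers, and back off on 429/503 responses.
```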
4. Ask permission, don’t beg forgiveness
Negotiate access whenever possible, whether through licensing agreements, official APIs, or revenue-sharing models. Partner with websites or data collection experts instead of relying on rudimentary, do-it-yourself scraping.
5. Give back to the commons
The open web feeds AI models. If AI companies benefit commercially from public data, they must find ways to give back, whether through attribution, open data contributions, or infrastructure funding.
6. Follow and promote responsible norms
Adopt well-established mechanisms for polite crawling, like the RFC2616Policy cache policy built into Scrapy, which honors servers' caching directives. Building on community standards helps align incentives and reduce adversarial behavior.
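For teams already using Scrapy, much of the above comes down to a handful of settings. The values in this sketch are illustrative assumptions, not tuned recommendations.

```python
# Illustrative Scrapy settings.py fragment: the values show where each knob
# lives rather than prescribing numbers for any particular site.
BOT_NAME = "example_ai_bot"
USER_AGENT = "ExampleAIBot/1.0 (+https://example.com/bot-info)"  # identify yourself

ROBOTSTXT_OBEY = True                      # respect site boundaries

AUTOTHROTTLE_ENABLED = True                # adapt request rate to server load
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
DOWNLOAD_DELAY = 2                         # baseline per-domain delay, in seconds
CONCURRENT_REQUESTS_PER_DOMAIN = 1

HTTPCACHE_ENABLED = True                   # avoid redundant downloads
HTTPCACHE_POLICY = "scrapy.extensions.httpcache.RFC2616Policy"  # honor cache headers
```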
Is the closed web inevitable?
A more resilient, open web is in everyone’s interest, especially those building on top of it.
If AI companies continue treating the public web as an infinite free-for-all, they’ll accelerate its collapse. Ironically, this self-destructive behavior threatens the very thing these companies depend on: diverse, relevant, accurate, and valuable data. As access shrinks, model quality will decline.
What can we, as an industry, do to stop this from becoming the default future?
AI firms are now at a crossroads: cooperate with the ecosystem, or exploit it until it collapses. The choice is theirs, but the cost of inaction is clear, and it may well be paid by all.
The web deserves better citizens. And better neighbors are possible.