The Machines That Read the Web: Inside the Class-Action Lawsuit Accusing Google, Meta, and Perplexity of Mass Content Theft

A federal class-action lawsuit filed in the Northern District of California is taking direct aim at three of the most powerful names in artificial intelligence — Google, Meta, and Perplexity AI — accusing them of systematically scraping copyrighted content from thousands of websites to train their large language models without permission, compensation, or even acknowledgment.

The complaint, brought by a coalition of website operators and content creators, doesn’t mince words. It alleges that these companies built their AI empires on the backs of publishers who never consented to having their work ingested by machine learning systems. And it arrives at a moment when the tension between AI companies and the media industry has reached a boiling point.

According to Android Authority, the lawsuit was filed by representatives of multiple websites whose content was allegedly scraped at industrial scale. The plaintiffs argue that the defendants violated copyright law, the Computer Fraud and Abuse Act, and various state laws by accessing and copying web content far beyond what any reasonable interpretation of terms of service would allow. The scale of the alleged scraping is staggering — billions of pages of text, images, and data vacuumed up to feed models like Google’s Gemini, Meta’s LLaMA, and Perplexity’s AI-powered search engine.

This isn’t a nuisance suit from a lone blogger. It’s a coordinated legal action that could reshape how AI companies acquire training data.

The Core Allegation: Your Website Was the Training Set

At the heart of the complaint is a simple but explosive claim: these companies treated the open web as a free buffet. Every article, every product description, every piece of original reporting — all of it scraped, tokenized, and fed into neural networks that now compete directly with the very sources they consumed.

The plaintiffs contend that web crawlers operated by or on behalf of Google, Meta, and Perplexity systematically ignored robots.txt files and other technical signals that publishers use to restrict automated access. In some cases, the lawsuit alleges, the companies deployed crawlers that deliberately disguised their identity to avoid detection and blocking. That’s not passive data collection. That’s active circumvention.
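The robots.txt mechanism at issue is a voluntary signaling standard (the Robots Exclusion Protocol, RFC 9309), not a technical barrier — which is why ignoring it figures in the complaint as evidence of intent rather than hacking. As a minimal sketch of how the protocol works, Python's standard library can evaluate whether a given crawler is permitted to fetch a page; the bot name and policy below are hypothetical, purely for illustration:

```python
from urllib import robotparser

# A hypothetical robots.txt policy: block an AI crawler called
# "ExampleAIBot" site-wide while allowing all other user agents.
policy = """
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(policy.splitlines())

# The protocol is purely advisory: a compliant crawler checks
# can_fetch() before requesting a URL and honors the answer.
print(rp.can_fetch("ExampleAIBot", "https://example.com/article"))  # False
print(rp.can_fetch("SomeOtherBot", "https://example.com/article"))  # True
```

A crawler that spoofs its user-agent string, as the complaint alleges some defendants' crawlers did, would simply present itself as something other than "ExampleAIBot" and receive permission — which is exactly why plaintiffs frame such behavior as active circumvention rather than oversight.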

Perplexity AI has faced particularly pointed criticism on this front. The startup, which positions itself as an AI-powered answer engine, has been accused by multiple publishers of scraping their content and then presenting AI-generated summaries that effectively replace the need for users to visit the original source. Android Authority reported that the lawsuit specifically highlights how Perplexity’s model generates responses that closely mirror copyrighted articles, sometimes reproducing key facts, figures, and even distinctive phrasing without attribution.

For publishers, this creates a devastating feedback loop. Their content trains the AI. The AI then answers user queries with that content. The user never clicks through to the publisher’s site. Ad revenue evaporates. Subscription value erodes.

Google and Meta face similar allegations, though the mechanics differ. Google’s Gemini models were trained on massive datasets that the plaintiffs say included copyrighted web content obtained without licensing agreements. Meta’s LLaMA models — some of which were released as open-source — allegedly incorporated scraped data from news sites, forums, and specialty publications. The complaint argues that Meta’s decision to open-source certain models compounded the harm, because it effectively distributed the fruits of the alleged theft to the entire world.

And here’s where the legal theory gets interesting. The plaintiffs aren’t just arguing copyright infringement. They’re invoking the Computer Fraud and Abuse Act, a federal statute originally designed to combat hacking. The argument: by using automated systems to access websites in ways that violated those sites’ terms of service and technical restrictions, the defendants engaged in unauthorized access to protected computer systems. It’s an aggressive legal theory, but not without precedent. Courts have increasingly grappled with whether violating a website’s terms of service constitutes unauthorized access under the CFAA.

The state law claims add another layer. Unjust enrichment. Unfair business practices. The kitchen sink, as litigation attorneys like to say — but each claim is calibrated to a specific harm.

The timing of this lawsuit matters enormously. It lands amid a broader wave of litigation over AI training data that has already ensnared OpenAI, Stability AI, and Anthropic. The New York Times sued OpenAI and Microsoft in December 2023, alleging that ChatGPT was trained on millions of Times articles. That case is still working its way through the courts. A coalition of authors including Sarah Silverman and Michael Chabon has filed separate suits against Meta and OpenAI. Getty Images sued Stability AI over the use of its photographs to train image generators.

But this new class action is different in scope. Rather than representing a single major publisher or a handful of prominent authors, it seeks to represent a class of thousands of website operators — the long tail of the internet. Small publishers. Niche sites. Independent creators who lack the resources to sue individually but whose collective output forms the bedrock of AI training datasets.

The Defendants’ Likely Playbook — and Why Fair Use Is No Sure Thing

Google, Meta, and Perplexity have not yet filed formal responses to the complaint, but their likely defense is already well-telegraphed. Fair use.

Under U.S. copyright law, fair use permits limited use of copyrighted material without permission for purposes such as commentary, criticism, education, and research. AI companies have consistently argued that training a model on copyrighted data is “transformative” — that the model doesn’t reproduce the original works but rather learns patterns and generates new content. This argument has intuitive appeal. A model that reads a million articles about climate change doesn’t regurgitate any single article; it synthesizes information into novel outputs.

But the fair use defense faces serious headwinds. Courts evaluate fair use on four factors: the purpose and character of the use, the nature of the copyrighted work, the amount used, and the effect on the market for the original. On that fourth factor, plaintiffs have a compelling story. If an AI system can answer a user’s question by summarizing a copyrighted article — effectively substituting for the original — that’s a direct market harm. Publishers lose traffic. They lose revenue. They lose the economic incentive to create the content in the first place.

Perplexity is especially vulnerable here. Its entire product is built around providing direct answers to queries, often drawing on journalistic content. When a user asks Perplexity a question and receives a comprehensive answer synthesized from news articles, the user has little reason to visit any of the underlying sources. That’s not transformation. That’s substitution.

Google faces a more nuanced situation. Its search engine has always displayed snippets of content from third-party sites, but those snippets were designed to drive traffic to publishers, not replace them. The introduction of AI Overviews — Google’s feature that generates AI-powered summaries at the top of search results — has shifted that calculus dramatically. Publishers have reported significant traffic declines since AI Overviews launched. The lawsuit appears to draw a direct line between the scraping of content, the training of models, and the deployment of AI features that cannibalize publisher traffic.

Meta’s position is complicated by its open-source strategy. By releasing LLaMA models publicly, Meta arguably amplified whatever copyright harm occurred during training. If the training data included copyrighted content obtained without authorization, then every downstream user of LLaMA — every startup, every researcher, every hobbyist — is potentially benefiting from that unauthorized use. The plaintiffs will likely argue that Meta can’t claim its use was limited or contained when it literally gave the models away.

Recent developments add context. In May 2025, multiple reports indicated that AI companies are increasingly seeking licensing deals with publishers, a tacit acknowledgment that the scrape-first-ask-later approach carries legal risk. OpenAI has signed deals with the Associated Press, Axel Springer, and several other publishers. Google has launched an AI licensing program. But Perplexity has been slower to formalize such arrangements, and Meta’s open-source approach makes licensing structurally more complicated.

The class-action mechanism itself is significant. If certified, the class could include thousands of website operators, dramatically increasing the potential damages. Copyright law provides statutory damages of up to $150,000 per work for willful infringement. Multiply that across thousands of websites and millions of pages, and the potential exposure runs into the billions of dollars.
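The arithmetic behind that exposure is straightforward. The figures below are hypothetical — the actual class size and number of registered works are not public — but they show how quickly the statutory ceiling compounds:

```python
# Illustrative back-of-the-envelope only; class size and per-site work
# counts are assumptions, not figures from the complaint.
MAX_STATUTORY_PER_WORK = 150_000   # USD ceiling per willfully infringed work (17 U.S.C. § 504(c))

assumed_class_sites = 10_000       # hypothetical number of class members
assumed_works_per_site = 100       # hypothetical registered works per site

theoretical_ceiling = MAX_STATUTORY_PER_WORK * assumed_class_sites * assumed_works_per_site
print(f"Theoretical statutory ceiling: ${theoretical_ceiling:,}")
# Theoretical statutory ceiling: $150,000,000,000
```

No court would award the ceiling across an entire class, and statutory damages require timely copyright registration for each work — but even a small fraction of that figure represents company-threatening liability, which is what gives the class mechanism its settlement leverage.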

But certification is far from guaranteed. Defendants will argue that each website operator’s situation is too different to be adjudicated as a class. Some sites had permissive robots.txt files. Some had restrictive ones. Some had terms of service that arguably permitted scraping; others explicitly prohibited it. The variations could defeat class certification, forcing plaintiffs to litigate individually — a far more expensive and less threatening proposition for the defendants.

What This Means for the Future of AI and Publishing

The broader implications extend well beyond this single lawsuit. The question of who owns the raw material of artificial intelligence is one of the defining legal and economic questions of the decade. If courts rule that mass scraping of copyrighted content for AI training constitutes infringement, it would force a fundamental restructuring of how models are built. Companies would need to license training data, build models on public-domain or permissively licensed content, or develop synthetic training data.

Some in the AI industry argue that such a ruling would cripple innovation. That without access to the broad corpus of human knowledge available on the web, AI models would be less capable, less accurate, and less useful. There’s some truth to that. But it’s also true that an entire industry built on unpaid appropriation of others’ work is not a sustainable model — economically, legally, or ethically.

The publishing industry, for its part, is watching these cases with existential urgency. Traffic from search engines has long been the lifeblood of ad-supported media. If AI systems intercept that traffic by providing answers directly, the economic model that supports journalism and online content creation collapses. Not slowly. Rapidly.

So the stakes in this class action are not merely financial. They’re structural. They go to the question of whether the open web will continue to be a place where creators are incentivized to publish original work, or whether it will become a passive reservoir — mined for data by AI systems that return nothing to the sources they depend on.

For now, the case is in its earliest stages. Motions to dismiss will come. Discovery battles will follow. The road to trial — if it gets there — is long. But the filing itself sends a clear signal to Silicon Valley: the era of treating the internet’s content as a free resource for AI training may be coming to an end.

And for those of us who grew up believing the internet was built on a social contract — you create, you share, you get credit and traffic in return — this lawsuit feels like a reckoning that’s been a long time coming.
