Reddit vs. AI Giants: The Data War That Will Define the Internet
On October 23, 2025, Reddit filed suit against Perplexity, alleging the AI search company scraped Reddit posts without authorization. This follows Reddit's June 2025 lawsuit against Anthropic and ongoing disputes with the data-scraping companies Oxylabs and SerpAPI.
Notably absent from the defendant list: OpenAI and Google. Both companies have licensing agreements with Reddit; Google's deal is worth $60 million annually, and OpenAI announced its partnership with Reddit in May 2024. The result is a two-tier system: companies with resources can license data legally, while smaller AI firms face litigation for accessing the same public information.
The legal questions raised extend beyond AI training. What rights do platforms have over public user content? Can tech-savvy individuals scrape websites for personal use or research? The inconsistent legal precedent makes answers unclear.
Who Reddit Is Suing (And Who They're Not)
Reddit's legal strategy reveals an interesting pattern. The company has filed suits against:
- Anthropic (June 2025) - Claude AI developer
- Perplexity (October 2025) - AI search engine
- Oxylabs - Data scraping infrastructure provider
- SerpAPI - Search API service used for data extraction
Notably, OpenAI and Google are not being sued. Both secured licensing agreements with Reddit: Google's deal is worth $60 million annually, while OpenAI's partnership gives OpenAI products access to Reddit's API.
The Two-Tier System
Companies with licensing deals: OpenAI and Google can legally train on Reddit data. They have the resources to negotiate multi-million dollar agreements.
Everyone else: Smaller AI companies, researchers, and scraping services face potential litigation for accessing the same publicly available content. The barrier is financial, not technical.
This raises questions about market consolidation. If only well-funded companies can legally access training data, it creates structural advantages for incumbents like OpenAI and Google, potentially limiting competition in AI development.
The Confusing Legal Landscape
Data scraping cases have produced contradictory outcomes, making it difficult to predict how courts will rule on Reddit's lawsuits.
When the Platform Prevailed: LinkedIn vs. hiQ Labs
LinkedIn's fight with hiQ Labs spanned years. The Ninth Circuit initially ruled in hiQ's favor, holding that scraping publicly accessible data did not violate the Computer Fraud and Abuse Act (CFAA). But after the case was remanded, LinkedIn ultimately prevailed on breach-of-contract grounds in late 2022, and hiQ, which had ceased operations, settled.
When the Scraper Prevailed: Meta vs. Bright Data
In January 2024, Meta lost a legal battle against Bright Data (formerly Luminati Networks), a company that provides web-scraping tools. The court held that Bright Data's collection of publicly available data, done while logged out, did not breach Meta's terms of service.
The Inconsistency Problem
LinkedIn eventually won its case, while Meta lost to a scraper. The outcomes depended on the specific legal claims, circuit court jurisdictions, and how the CFAA was interpreted. This creates uncertainty: is scraping public data legal or not?
The answer appears to depend on factors like whether users must authenticate to view content, whether scraping violates Terms of Service in ways courts recognize, and which legal jurisdiction hears the case.
What About Individual Researchers and Developers?
These cases focused on commercial scraping operations. But what about tech-savvy individuals who scrape Reddit for personal projects, academic research, or building datasets?
The legal status remains murky. A developer scraping Reddit to build a sentiment analysis tool for their thesis faces the same Terms of Service as Anthropic. The difference is scale and commercial intent, but the legal framework does not clearly distinguish between them.
This ambiguity creates a chilling effect. Researchers may avoid scraping out of legal uncertainty, even when their use could be considered fair use or protected academic inquiry. Meanwhile, well-funded companies can pay for licensing and operate without legal risk.
Does Reddit's Contributor Program Go Far Enough?
Reddit does have a monetization program for users. The Contributor Program, launched in 2023, allows eligible users to earn money from their contributions. But is it enough, especially given Reddit's new AI licensing revenue streams?
How the Contributor Program Works
Reddit's Contributor Program operates on a Gold-based system:
- Eligibility: Account must be 30+ days old with at least 100 karma. User must be 18+ and in an eligible country.
- How you earn: Other users award your posts/comments with Reddit Gold. You convert Gold to cash.
- Payout rates: Contributors (100-4,999 karma) earn $0.90 per Gold. Top Contributors (5,000+ karma) earn $1.00 per Gold.
- Reality: User reports show modest earnings. Top contributors report payouts around $11/month. The program relies on other users giving Gold, which happens infrequently.
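To make the payout math above concrete, here is a minimal sketch in Python. The rates and karma thresholds come from the list; the function name and the 11-Gold scenario are my own illustration.

```python
def gold_payout(karma: int, gold_received: int) -> float:
    """Estimate a Contributor Program payout in USD.

    Tiers from the list above: Contributors (100-4,999 karma) earn
    $0.90 per Gold; Top Contributors (5,000+ karma) earn $1.00 per Gold.
    """
    if karma < 100:
        return 0.0  # below the eligibility threshold, no payout
    rate = 1.00 if karma >= 5000 else 0.90
    return round(gold_received * rate, 2)

# A top contributor reporting ~$11/month implies roughly 11 Gold awards
print(gold_payout(karma=12_000, gold_received=11))  # 11.0
```

At these rates, a meaningful monthly income would require hundreds of Gold awards, which other users rarely give.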
The Critical Gap
The Contributor Program is based on Gold awards from other users, not Reddit's revenue. When Reddit signs a $60 million annual licensing deal with Google, that money does not flow through the Contributor Program. Users whose content is included in AI training datasets receive no compensation from these licensing deals, regardless of their karma or contributor status.
The Platform Position
Reddit argues it provides infrastructure, moderation, hosting, and community management, and that users benefit from network effects and the platform's tools. The Terms of Service that users accept grant Reddit broad rights to use, distribute, and monetize content.
This is standard across platforms. Twitter, Facebook, Stack Overflow, and others maintain similar terms. Users trade content rights for platform access.
Is the Current System Sufficient?
The argument for the status quo: Reddit provides infrastructure worth hundreds of millions annually. Users voluntarily post content knowing the terms. The Contributor Program offers a path to monetization for active users.
The counterargument: AI licensing represents a new, massive revenue stream that didn't exist when most content was created. A program that pays $11/month to top contributors while Reddit earns $60M+ annually from AI licensing of that same content seems disproportionate.
What Could Change
Several modifications to the current system could better align user compensation with AI licensing revenue:
1. AI licensing revenue pool: Allocate a percentage (e.g., 20%) of AI licensing revenue to a creator fund. Distribute based on content engagement metrics (upvotes, saves, citations in training data if technically feasible).
2. Contributor Program enhancement: Top contributors whose content is in training datasets receive bonus payouts or premium features funded by licensing revenue.
3. Opt-out with compensation tradeoff: Users can opt out of AI training but lose access to Contributor Program earnings. Those who opt in get higher payouts funded by licensing deals.
4. Collective bargaining model: Similar to music streaming royalties, users collectively negotiate through representative groups for a share of licensing revenue.
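As a toy sketch of the revenue-pool idea, here is one way a pro-rata split by a single engagement metric could work. All figures, names, and the function itself are invented for illustration, not a proposal Reddit has made.

```python
def distribute_pool(licensing_revenue: float, pool_share: float,
                    upvotes_by_user: dict) -> dict:
    """Split a creator fund pro-rata by one engagement metric (upvotes)."""
    pool = licensing_revenue * pool_share
    total_upvotes = sum(upvotes_by_user.values())
    if total_upvotes == 0:
        return {user: 0.0 for user in upvotes_by_user}
    return {user: round(pool * votes / total_upvotes, 2)
            for user, votes in upvotes_by_user.items()}

# Hypothetical: a $60M annual deal with a 20% creator pool, three users
payouts = distribute_pool(60_000_000, 0.20,
                          {"alice": 600, "bob": 300, "carol": 100})
print(payouts)  # {'alice': 7200000.0, 'bob': 3600000.0, 'carol': 1200000.0}
```

The hard part in practice is the metric: upvotes are gameable, and per-item attribution inside training data may not be technically feasible, as noted above.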
Reddit has little financial incentive to implement these voluntarily. The current Contributor Program satisfies the minimum requirement of "users can earn money" without meaningfully sharing AI licensing revenue. Regulatory pressure or competitive dynamics (if other platforms offer better terms) would likely be needed to shift this balance.
Data Scraping: The Broader Context
Reddit's lawsuits are part of a larger trend where platforms are asserting control over their data in response to AI training. But data scraping itself is not new, nor is it limited to AI companies.
Who Scrapes Data and Why
- Search engines: Google, Bing, and DuckDuckGo scrape and index web content to power search results. This has been accepted practice since the 1990s.
- Price comparison services: Sites like Google Shopping scrape e-commerce platforms to aggregate pricing data.
- Academic researchers: Scientists scrape social media data to study social dynamics, public health trends, and information spread.
- Journalists: Investigative reporters scrape public records and social media to uncover stories.
- Developers: Engineers build tools and apps that aggregate data from multiple sources.
- AI companies: Training large language models requires vast amounts of text data scraped from the web.
The difference with AI training is scale and commercial value. Scraping a few thousand pages for research differs from systematically extracting billions of posts to train models worth billions of dollars.
The Platform Perspective Shift
For years, platforms tolerated scraping or selectively enforced anti-scraping policies. Search engine crawling was explicitly allowed via robots.txt. Researchers operated in a gray area. Developers built third-party apps on unofficial APIs.
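The robots.txt convention mentioned above is machine-readable, and well-behaved crawlers check it before fetching a page. A minimal illustration using Python's standard library (the rules and bot name here are invented, not any real site's policy):

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt rules: everyone may crawl except /private/,
# and a hypothetical "BadBot" is banned entirely
rules = """\
User-agent: *
Disallow: /private/

User-agent: BadBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

print(parser.can_fetch("*", "https://example.com/public/page"))    # True
print(parser.can_fetch("*", "https://example.com/private/page"))   # False
print(parser.can_fetch("BadBot", "https://example.com/anything"))  # False
```

Crucially, robots.txt is a voluntary convention with no enforcement mechanism, which is partly why the disputes described here ended up in court.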
AI training changed the calculus. Platforms realized scraped data has significant commercial value for AI training. What was previously tolerated (or ignored) became a potential revenue stream worth tens of millions annually.
Now platforms want compensation for what they previously provided freely (or at least, did not actively block). This creates tension between the open web culture that enabled their growth and the commercial interests that drive their current strategy.
What's Actually At Stake
The outcomes of these lawsuits will shape several critical issues:
1. Market Consolidation in AI
If platforms successfully enforce licensing requirements, AI development becomes more expensive. OpenAI and Google can afford $60 million annual licensing deals. Startups, academic researchers, and smaller companies cannot.
This creates structural barriers to entry. The companies with existing resources and relationships can secure training data. New entrants face legal risk or prohibitive costs. The result: less competition, more concentration in AI capabilities.
2. The Death of the Open Web (Or Its Preservation?)
Two competing narratives exist about what the "open web" means:
Narrative 1: The open web means anyone can access public information. If Reddit wins, platforms will lock down data behind paywalls and authentication, destroying the accessibility that made the internet valuable.
Narrative 2: The open web means anyone can publish without others freely extracting and monetizing their work. If AI companies win, platforms will have little incentive to host user content, since it can be taken and monetized elsewhere without compensation.
Both narratives have merit. The tension is real: enabling access vs. protecting creator value.
3. Platform Power vs. User Rights
Whether Reddit or AI companies win, users likely lose. If Reddit wins, platforms gain more control over user content and can monetize it without meaningfully sharing revenue. If AI companies win, user content can be extracted and used commercially without consent or fair compensation.
Neither outcome addresses the core issue: users who create valuable content receive minimal economic benefit compared to the scale at which that content is monetized. The existing Contributor Program pays top contributors around $11/month while Reddit earns $60M+ annually from AI licensing of that same content.
4. Chilling Effect on Research and Innovation
The legal ambiguity creates risk for legitimate use cases. Academic researchers studying social media behavior, developers building aggregation tools, journalists investigating public discourse—all face potential legal exposure under current Terms of Service enforcement.
Some Questions
These lawsuits force us to confront questions that Terms of Service have obscured:
- Who owns public content? The person who wrote it, the platform hosting it, or no one?
- What is "public" data? If content requires no authentication to view, can it be freely used for any purpose?
- Should scale matter? Is there a meaningful difference between an individual scraping for research and a corporation scraping for billions in revenue?
- Can Terms of Service unilaterally grant platforms unlimited monetization rights? Or should there be limits on how companies can use content created by users?
- What about consent? Users agreed to Terms of Service, but did they truly consent to having their content fuel AI models that didn't exist when they posted?
The inconsistent legal precedent (Meta losing to scrapers, LinkedIn eventually winning) shows courts are struggling with these questions. The law has not caught up to the technology or business models.
What Happens Next
Realistically, these lawsuits will settle or result in narrow rulings that don't resolve the broader questions. Reddit may extract licensing fees from Anthropic and Perplexity. A licensing market will emerge where platforms charge for data access.
This benefits well-capitalized platforms and AI companies. It does little for users who created the content or smaller companies trying to compete. Without regulatory intervention or new legislation specifically addressing creator rights, the status quo remains: platforms control user content through Terms of Service, and economic power determines who can legally access training data.
The more interesting question is whether we're comfortable with that outcome. Should multi-billion dollar licensing markets exist for content that users created freely, expecting no compensation? Or is this simply the deal we implicitly agreed to when we chose free platforms over paid alternatives?
A Personal Take
The fact that OpenAI and Google can legally train on Reddit data (because they paid for licensing) while Anthropic and Perplexity face lawsuits for accessing the same public content suggests this is less about protecting user rights and more about platforms extracting maximum revenue from their data monopolies.
If Reddit truly cared about user consent and content ownership, licensing deals with OpenAI and Google would be contingent on user compensation models. They are not. This is about platform revenue, not user rights.
What I would like to see: a clear opt-in/opt-out policy for data usage in AI training. Users should be able to choose whether their content can be used for this purpose, regardless of when they posted it. Platforms could default to opt-in for new users while giving existing users the choice. This would at least acknowledge that users have some say in how their contributions are monetized at scale.
Written by Benjamin Loh, curator of Tech Upkeep