THE COPYRIGHT HEIST · AUTHORS · AI TRAINING DATA · FAIR USE

◈ THE SCALE · WHAT WAS TAKEN

THE INVENTORY

4M+

BOOKS · LIBGEN + BOOKCORPUS

META LLAMA TRAINING CORPUS

~196K

BOOKS · BOOKS3 DATASET

USED BY META, BLOOMBERG, OTHERS

PAID TO AUTHORS

FOR ANY OF THE ABOVE

ACTIVE FEDERAL LAWSUITS

AS OF APRIL 2026

Books3, a dataset of roughly 196,000 books drawn from the pirate tracker Bibliotik, was used to train multiple major language models, including Meta's first LLaMA release. According to filings in Kadrey v. Meta, Meta also drew on LibGen, a known piracy repository, for later versions. Both sources contain copyrighted books obtained without authorization. The researchers who assembled Books3 knew this. The companies that used it knew this. The internal emails, now unsealed in Kadrey v. Meta, confirm it was discussed. They proceeded anyway.

◈ THE DOCKET · CASES ON FILE

THE FILINGS · PUBLIC RECORD

KADREY V. META PLATFORMS · N.D. CAL. · 2023

Richard Kadrey, Sarah Silverman, Christopher Golden v. Meta Platforms

The foundational case. Alleges Meta trained LLaMA on LibGen — a piracy repository — without authorization. Internal Meta emails, unsealed 2024, show employees discussed the legal risk of using LibGen and proceeded. The emails reference the library as "kind of a gray area" and "legally murky." They used it. The model shipped. The unsealed documents are the receipt.

ONGOING

AUTHORS GUILD V. OPENAI · S.D.N.Y. · 2023

Authors Guild, John Grisham, George R.R. Martin, Jodi Picoult, et al. v. OpenAI

The marquee plaintiff list. Grisham. Martin. Picoult. The Authors Guild representing thousands of members. Alleges OpenAI trained GPT on copyrighted books without license or compensation. The complaint documents how GPT can reproduce passages from plaintiffs' works verbatim — evidence the training data included the full text. Seeking damages and injunctive relief.

ONGOING

NEW YORK TIMES V. OPENAI + MICROSOFT · S.D.N.Y. · 2023

The New York Times Company v. OpenAI LLC, Microsoft Corporation

The highest-profile filing. The Times showed that ChatGPT could reproduce NYT articles verbatim, word for word, when prompted. This is direct evidence of memorization: the model stored the text, not just patterns from it. The complaint seeks billions in damages and the destruction of models trained on NYT content. Microsoft is named as a defendant because of its investment in OpenAI and its integration of OpenAI's models into its own products. Verbatim output undercuts the claim that training only extracts patterns, which is why this case carries so much weight in the fair use fight.

ONGOING

ANDERSEN V. STABILITY AI · N.D. CAL. · 2023

Sarah Andersen, Kelly McKernan, Karla Ortiz v. Stability AI, Midjourney, DeviantArt

Visual artists. Stability AI trained on billions of images scraped from the internet, including artists' work posted on DeviantArt and other platforms. The model can now generate images "in the style of" named artists — artists who never consented to their work being used as training data. A junior designer can now prompt "in the style of [artist name]" and get something indistinguishable from the artist's portfolio. The artist gets nothing. The platform charges a subscription.

ONGOING

GITHUB COPILOT CLASS ACTION · N.D. CAL. · 2022

Doe v. GitHub, Microsoft, OpenAI

The plaintiffs are developers. GitHub Copilot was trained on public GitHub repositories: code written by developers under open-source licenses that require attribution. The complaint alleges that Copilot reproduces licensed code without attribution, violating the terms of the licenses under which that code was shared. Microsoft owns GitHub. Microsoft invested $13 billion in OpenAI. Copilot is sold as a Microsoft product. The developers whose code trained it now pay a subscription to get their own work back.

ONGOING

UMG V. SUNO + UDIO · D. MASS. + S.D.N.Y. · 2024

Universal Music Group et al. v. Suno Inc., Udio (Uncharted Labs)

Music. AI music generation platforms trained on copyrighted recordings without license. The models can generate music indistinguishable from specific artists' styles. The major labels — Universal, Sony, Warner — filed simultaneously. This case extends the copyright heist to audio. Every creative domain is now in scope.

PARTIAL SETTLEMENTS · 2025

◈ THE LEGAL ARGUMENT · HOW THEY DEFEND IT

FAIR USE · THE ARGUMENT

The primary defense across all AI copyright cases is fair use, a doctrine in U.S. copyright law (17 U.S.C. § 107) that permits use of copyrighted material without permission in certain circumstances. The four-factor test weighs: the purpose and character of the use, the nature of the copyrighted work, the amount and substantiality of the portion used, and the effect of the use on the market for the original.

◈ DEFENSE ARGUMENT · THE COMPANIES

Training an AI model is "transformative" — it doesn't reproduce the work, it learns patterns from it. The output is new expression, not a copy. Search engines index the web without paying rights holders. This is the same. Training is research. Research is fair use. The model doesn't contain the books — it contains statistical relationships derived from the books.

◈ PLAINTIFF ARGUMENT · THE AUTHORS

The New York Times demonstrated verbatim reproduction — the model memorized the text, not just patterns. The "output is new" argument fails when the output competes directly with the original market. A model trained on John Grisham's novels that can now write legal thrillers in Grisham's style directly harms Grisham's market. Fair use has never permitted use that substitutes for the original. This does. The scale — millions of works — is also unprecedented. Fair use cases involve individual uses, not systematic mass ingestion for commercial profit.

◈ THE MARKET SUBSTITUTION PROBLEM

The Fourth Factor — The One That Matters Most

Courts have often treated the fourth fair use factor, effect on the market for the original, as the weightiest. In Harper & Row v. Nation Enterprises, the Supreme Court called it "undoubtedly the single most important element of fair use." If the AI product substitutes for the original, fair use fails. A ChatGPT that can write a legal thriller "in the style of Grisham" is a market substitute for Grisham's next book. A Copilot that can write Python functions in the style of a developer's existing codebase is a market substitute for that developer. The substitution is not theoretical. It is happening. Publishers are paying fewer authors. Agencies are signing fewer clients. The market effect is real, ongoing, and measurable.

◈ THE MEMORIZATION PROBLEM

When the Model Reproduces the Text Verbatim

The transformative use argument depends on the model not retaining the original work — just "learning from" it. The NYT demonstrated this argument is false for sufficiently trained models. When prompted with the beginning of a NYT article, GPT-4 reproduced hundreds of words verbatim. This is not pattern learning. This is memorization. The work is in the model. The fair use defense based on transformation collapses when the model can reproduce the original. The companies know this. That is why the strongest cases — the ones with verbatim reproduction evidence — are the ones they are most eager to settle quietly.
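What "verbatim reproduction" means can be made concrete. The sketch below is a minimal illustration, not the method used in any filing: it finds the longest run of consecutive words that an original text and a model output share. The function name `longest_verbatim_run` and both sample strings are hypothetical stand-ins, and lowercasing plus whitespace splitting is an assumed, deliberately crude normalization.

```python
from difflib import SequenceMatcher
from typing import List


def longest_verbatim_run(original: str, output: str) -> List[str]:
    """Return the longest run of consecutive words found in both texts."""
    # Assumption: lowercase + whitespace split is enough for a rough comparison.
    a = original.lower().split()
    b = output.lower().split()
    # autojunk=False disables difflib's popularity heuristic, which would
    # otherwise ignore very common words in long inputs.
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[match.a : match.a + match.size]


# Hypothetical stand-ins, not text from any real article or model.
original = "the council voted on tuesday to approve the new budget after a long debate"
output = "reports say the council voted on tuesday to approve the plan"

run = longest_verbatim_run(original, output)
print(len(run), " ".join(run))
# prints: 8 the council voted on tuesday to approve the
```

A shared run of eight words proves little. A shared run of hundreds of words, as the Times alleges, is hard to explain as anything but stored text.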

◈ THE MARKET EFFECT · WHAT IS ALREADY HAPPENING

THE COST TO WRITERS

The litigation is about the past — the training data already used. The market effect is about the present and future — what happens to writers, artists, musicians, and developers now that the models are trained and deployed.

◈ PUBLISHING

Advances Are Down. AI Submissions Are Up.

Literary agents report a significant increase in AI-generated manuscript submissions. Publishers are receiving more manuscripts than ever and advancing fewer authors. The mid-list author, the working writer who publishes a book every two years and earns a living from advances and royalties, is the most vulnerable category. The blockbuster survives. The debut author may still break through. The working mid-list writer whose books trained the model that now competes with their next book is the one who loses their livelihood. No lawsuit addresses this prospective harm. Only regulatory action would.

◈ JOURNALISM

The Newspapers That Trained the Model Now Compete With It

AI summaries, built on journalism, now appear above news articles in search results. Users read the summary and never click through to the article. The journalism that trained the model generates the summary that replaces the journalism. The newspaper loses the click. The AI company keeps the subscription revenue. The New York Times lawsuit is partly about this: the model was trained on decades of NYT reporting and now competes with the NYT for the attention of people searching for news. The thing trained on the work now substitutes for the work.

◈ ART + MUSIC

Style Is Not Copyrightable. Livelihood Is.

Copyright protects expression, not style. You cannot copyright "legal thriller" as a genre. You cannot copyright a guitar tone or a color palette. But livelihood depends on style. An illustrator whose distinctive style was used to train Midjourney now competes with Midjourney, which can replicate their style on demand for anyone with a $10/month subscription. The copyright argument is difficult. The economic devastation is real. The law has not caught up to the harm.

◈ THE EMAILS · WHAT THEY SAID INTERNALLY

THE RECEIPTS · UNSEALED

The Kadrey v. Meta case produced the most significant internal documents to date. Unsealed in 2024, the emails show Meta employees discussing the legal risk of using LibGen — a site the U.S. government has identified as a piracy operation — to train LLaMA.

◈ META INTERNAL EMAIL · UNSEALED · 2024

The "Gray Area" Discussion

Meta researchers, discussing whether to use LibGen for LLaMA training: "It's kind of a gray area legally. But everyone's using it."

"Everyone's using it" is not a legal defense. It is a description of an industry norm that is, in aggregate, an industry-wide copyright violation. The fact that other companies were also using LibGen does not make it legal — it makes the harm larger.

The emails also show employees suggesting that using LibGen via a proxy or mirror might provide "plausible deniability." The decision was made to use the data. LLaMA was trained. The model was released.

This is not alleged behavior. It is documented in emails written by the employees of the company, now part of the court record, publicly accessible. The "gray area" framing is the tell: they knew it was not clearly legal. They made a business decision that the value of the training data exceeded the legal risk. The authors are the uncompensated externality of that calculation.

◈ THE KENSHO READ

WHAT THIS ACTUALLY IS

Copyright law exists for one reason: to give creators an economic incentive to create, so that society benefits from the output of their creativity. The bargain is: you create something, you get a limited monopoly on its use, which lets you earn a living, which lets you create more.

The AI training data extraction breaks this bargain at scale. The works were created under the expectation of copyright protection. They were scraped, ingested, and used to train commercial products without the creators' knowledge or consent. The products now compete directly with the creators in their own markets. The creators receive nothing. The companies receive valuations in the hundreds of billions.

If this is fair use, fair use has no meaning. If training on millions of copyrighted works for profit, at a company valued at $300 billion, to build products that compete directly with the source material, with zero compensation to the creators — if that is fair use — then fair use is a doctrine that protects capital, not creativity.

That is what the courts are deciding. Right now. In real time. The outcomes of these cases will define the economic relationship between human creators and AI systems for the next century. The authors who filed understood this. That is why they filed.