◈ THE SCALE · WHAT THEY TOOK · BY THE NUMBERS
22 LIBRARIES OF CONGRESS. PIECE BY PIECE.
Before the mechanism, the magnitude. OpenAI's GPT-4 was trained on approximately 13 trillion tokens. Meta's LLaMA 3 on 15 trillion tokens. One token is roughly 4 characters of text. That's 60 terabytes of text for LLaMA alone. The Library of Congress holds about 20 terabytes of digitized text. They trained on the equivalent of three Library of Congress collections — minimum — in text alone. If you count scanned books, academic PDFs, legal filings, code, audio transcripts, and everything else: 22 is not an exaggeration.
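The arithmetic above is easy to check yourself. What follows is a minimal sketch, assuming one byte per character and decimal terabytes. The 4-characters-per-token ratio and the 20 TB Library of Congress figure are the rough estimates quoted above, not measured values.

```python
# Back-of-envelope check of the corpus-size arithmetic.
# Assumptions (not measured values): ~4 characters per token,
# 1 byte per character, decimal terabytes (10**12 bytes).
CHARS_PER_TOKEN = 4
BYTES_PER_CHAR = 1
TB = 10**12
LOC_TEXT_TB = 20  # rough Library of Congress figure quoted in the text


def corpus_terabytes(tokens: int) -> float:
    """Approximate raw-text size of a corpus, in terabytes."""
    return tokens * CHARS_PER_TOKEN * BYTES_PER_CHAR / TB


for name, tokens in [("GPT-4", 13 * 10**12), ("LLaMA 3", 15 * 10**12)]:
    tb = corpus_terabytes(tokens)
    print(f"{name}: ~{tb:.0f} TB of text, ~{tb / LOC_TEXT_TB:.1f}x the 20 TB figure")
```

Under those assumptions the text-only figure lands at roughly three collections for LLaMA 3. The larger multiple depends on what else you count.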
15T
TOKENS · LLAMA 3 · META TRAINING CORPUS
13T
TOKENS · GPT-4 · OPENAI TRAINING CORPUS
4M+
BOOKS · LIBGEN · PIRATED LIBRARY
80M+
RESEARCH PAPERS · LIBGEN SCI-HUB OVERLAP
250B
WEB PAGES · COMMON CRAWL CUMULATIVE ARCHIVE
$0
PAID TO AUTHORS · PUBLISHERS · UNIVERSITIES · GOVERNMENTS
they didn't buy it. they didn't ask. they took it while the lawyers were still figuring out what a token was. meanwhile congress asked the ceo of google if he was aware that when you sleep your iphone doesn't charge itself. this is the governance we had. filed. 925.
◈ MECHANISM I · COMMON CRAWL · THE FRONT DOOR THEY WALKED THROUGH
THE WHOLE WEB. ON A HARD DRIVE. FREE.
Common Crawl is a nonprofit founded in 2008. Their mission: crawl the entire public web and make it available to researchers for free. Noble idea. Beautiful idea, even. A petabyte of compressed text. A gift to the research community. And then the AI labs pulled up like it's free real estate. Every publicly accessible URL — .com, .gov, .edu, .org, .io, everything. Available on Amazon S3 as an open dataset. You don't even need a login. You just point your Python script at the S3 bucket and start downloading.
Google used it to build C4 (Colossal Clean Crawled Corpus). Meta used it for their CCNet pipeline. OpenAI used a filtered version of it for GPT-3; GPT-2's WebText corpus was scraped separately from pages linked on Reddit, and GPT-4's data mix has not been disclosed. Nearly every major AI lab used Common Crawl as a base layer. It includes university research papers on .edu domains, government reports on .gov, open-access journals, court documents, medical databases, everything a country publishes to the public web. The .edu and .gov content got in the same way everything else did: if the URL was accessible, it was crawled.
Did the universities consent? No. Did the government agencies? No. Did any individual researcher whose paper was scraped? No. Were they technically "public"? Yes. Was there any legal mechanism to stop this in 2020? Essentially no. Ong.
◈ THE PYTHON MECHANISM · HOW YOU ACCESS THE WHOLE WEB · 12 LINES
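import boto3, trafilatura
from warcio.archiveiterator import ArchiveIterator

s3 = boto3.client('s3')  # picks up AWS credentials from the environment
obj = s3.get_object(Bucket='commoncrawl', Key='crawl-data/CC-MAIN-2021-04/...')  # path elided
for record in ArchiveIterator(obj['Body']):
    if record.rec_type != 'response':  # skip request and metadata records
        continue
    raw_html = record.content_stream().read()
    text = trafilatura.extract(raw_html)
    if text and len(text) > 100:  # drop near-empty pages
        save_to_corpus(text)  # placeholder: write to your own storage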
The full production pipeline adds language detection (fastText), deduplication (MinHash LSH similarity), quality filtering (perplexity scoring), and distributed processing (Apache Beam/Spark on GCP or AWS). But the core mechanic is what's above. A nonprofit made the entire web available as a zip file. The AI labs unzipped it and called it training data.
◈ COMMON CRAWL ROBOTS.TXT PROBLEM
THE NO TRESPASSING SIGN WAS THERE. THEY RECLASSIFIED IT AS DECORATION.
Robots.txt is the web standard "no trespassing" sign: a file at the root of a website that tells crawlers which paths not to fetch. Compliance is voluntary. Common Crawl says it respects it. But: (1) Most sites in 2019-2022 had no specific exclusions for AI training. (2) The "User-agent: GPTBot" directive that explicitly blocks OpenAI's crawler didn't exist until OpenAI introduced it in August 2023, after training was done. (3) Common Crawl's bot user-agent is "CCBot", and many sites that later blocked AI scrapers had already been captured in the archive. The data was already in the corpus before the sites knew to put up the fence.
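To make the fence concrete, here is a minimal sketch of how a well-behaved crawler reads those rules, using Python's standard-library `urllib.robotparser`. The robots.txt content and the `example.com` URL are hypothetical. `SomeOtherBot` is an invented user-agent standing in for any crawler the site never named.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a site might publish after 2023.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


for agent in ("GPTBot", "CCBot", "SomeOtherBot"):
    print(agent, allowed(ROBOTS_TXT, agent, "https://example.com/articles/1"))
```

The rules only bind crawlers that choose to run a check like this, and only from the day the file is published. Nothing here reaches back into an archive that was captured earlier.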
◈ MECHANISM II · LIBGEN · THE SHADOW LIBRARY · THE SMOKING GUN
THEY KNEW. THEY ASKED. THEY PROCEEDED. ZUCKERBERG WAS IN THE LOOP.
Library Genesis — LibGen — is a shadow library. 4 million+ books. 80 million+ research papers. All pirated. It operates out of domains in Russia and other jurisdictions hostile to US copyright enforcement. Academic researchers use it to access papers behind paywalls. It is, by any legal definition, a piracy operation. Meta trained LLaMA on it.
This is not speculation. This is documented in the federal court record. The case is Kadrey v. Meta Platforms, Northern District of California. The court unsealed internal communications in 2024. The record is public. You can go read it. Right now. For free. On a .gov website. Ironic that the most damning evidence about data theft is itself freely accessible on the web that they scraped.
◈ EXHIBIT A · UNSEALED · KADREY v. META · N.D. CAL. · 2024
Internal Meta communications show researchers specifically identified LibGen as a training data source. Legal counsel reviewed the copyright implications. The decision was made to proceed. To reduce legal exposure from BitTorrent metadata — which creates traceable upload/download logs — the team downloaded LibGen content directly via HTTP from LibGen's servers rather than through torrent clients. Less traceable. Same content. Mark Zuckerberg was informed and did not object.
◈ SOURCE: KADREY v. META PLATFORMS · CASE 3:23-CV-03417 · UNSEALED COMMUNICATIONS · N.D. CAL. 2024 · PUBLIC RECORD
Read that again slowly. They chose the HTTP download method over BitTorrent specifically because it was harder to trace. This is not a mistake. This is not an oversight. This is a deliberate decision made with legal review, at the executive level, to minimize evidence of the piracy. The decision was to commit the act more quietly, not to not commit it.
If your neighbor took your car and you asked them about it, they'd say "I used the key, not a crowbar. Much more respectful." That is the HTTP argument. That is the whole thing. Same car. Different deniability strategy. Ong.
◈ THE BOOKS3 PIPELINE · HOW PIRATED BOOKS BECAME TRAINING DATA
196,640 BOOKS. FROM BIBLIOTIK. IN THE PILE. IN LLAMA. IN EVERY MODEL TRAINED ON THE PILE.
Books3 is a dataset assembled from Bibliotik, a private BitTorrent tracker for ebooks. 196,640 copyrighted books. It was included in "The Pile," an open-source dataset assembled by EleutherAI in 2020 and used by virtually every open-source model afterward. GPT-NeoX, LLaMA (its own paper lists Books3 as a source), Mistral, Falcon: all trained on The Pile or derivatives. Books3 was downloaded from Bibliotik, converted to plaintext, and released as a research dataset. The research community treated it as legitimate because EleutherAI published it. The original source was a piracy tracker. The authors of those 196,640 books received no notice, no payment, no consent request.
◈ MECHANISM III · YOUTUBE TRANSCRIPTS · IN VIOLATION OF THEIR OWN RULES
OPENAI USED YOUTUBE. GOOGLE OWNS YOUTUBE. BOTH VIOLATED THE ToS.
YouTube's Terms of Service explicitly prohibit accessing content through automated means for AI training. OpenAI used YouTube transcripts to train GPT-4. This is documented in reporting by The New York Times and confirmed in subsequent legal proceedings. Internal OpenAI communications describe YouTube as "probably the single highest quality source we have" for training data.
The mechanism, per the Times reporting: YouTube video audio → OpenAI's Whisper speech-recognition model → text corpus. More than a million hours of human speech, converted to text, without creator consent, in violation of platform terms.
And then the punchline: Google, which owns YouTube, also trained Gemini on YouTube content — technically in violation of the same Terms of Service it enforces against everyone else. They gave themselves an exception by updating their ToS retroactively to allow Google to use all YouTube content for AI training. When you own the fence, you can move it after the fact.
The CEO of the company that owns the No Trespassing sign updated the sign to say "No Trespassing (Except Us)." Filed it with their own legal team. Notified no one. 2.7 billion YouTube users, and every creator among them, found out through a Terms of Service update that most people never read. Ong.
◈ THE REDDIT SETTLEMENT · WHAT IT TELLS US ABOUT EVERYTHING ELSE
OPENAI SIGNED WITH REDDIT IN 2024. RETROACTIVELY. FOR DATA THEY HAD ALREADY USED.
In May 2024, OpenAI signed a data licensing deal with Reddit. The terms were not disclosed, but Google's parallel Reddit deal, announced that February, was reported at about $60 million per year. This was not for future access. This was to legitimize data they had already trained on. Reddit's ToS prohibits scraping for commercial AI training. OpenAI trained on Reddit anyway. Years later, with lawsuits mounting, they paid. The payment is the admission. If it was legal, there would be nothing to pay for. The same logic applies to every other deal struck: Google, Axel Springer, AP, The Atlantic. You don't license things you have the right to take.
◈ MECHANISM IV · GITHUB · MICROSOFT TRAINED ON CODE ITS OWN ToS FORBADE
YOU PUBLISHED YOUR CODE. THEY TRAINED ON IT. THEN SOLD IT BACK TO YOU.
Microsoft acquired GitHub in 2018 for $7.5 billion. In 2021, they launched GitHub Copilot, an AI coding assistant trained on public GitHub repositories. GitHub's own Terms of Service at the time prohibited using content "for machine learning training." Microsoft is GitHub. They used their platform's data in violation of the terms they set for everyone else, then packaged the results into a subscription ($10/month for individuals, $19/month per seat for businesses) and sold it back to the developers whose code trained the model.
Ocean's Eleven would not have been as clean a movie if the vault was owned by the same people doing the heist. Microsoft ran the heist on their own vault, then charged the vault's original owners a monthly fee to access the intelligence extracted from it. And called it a productivity tool. The audacity is a feature. The audacity does not have a trial period.
◈ DOE v. GITHUB · MICROSOFT · OPENAI · CLASS ACTION · N.D. CAL. 2022
DEVELOPERS SUED. THE CODE WAS THEIRS. THE ATTRIBUTION WAS STRIPPED.
Copilot outputs code that reproduces licensed open-source snippets verbatim — without attribution, without license compliance. GPL code reproduced without GPL compliance. MIT code reproduced without credit. The lawsuit documents specific instances where Copilot regenerated identifiable code from specific repositories. The developers who wrote that code under open-source licenses received nothing. Microsoft's position: "It's transformative. It learned from your code rather than copying it." The courts are still deciding. The product ships either way.
◈ WHO LET THEM IN · THE GATES · THE GATEKEEPERS · THE FAILURES
NOBODY LET THEM IN. NOBODY STOPPED THEM. THAT'S THE SAME THING.
The question isn't who handed them the key. The question is why every door was open and nobody was watching. The answer is a system built to move information freely had no immune response for this specific kind of extraction. It's like asking why the museum didn't have security for the paintings — they did. But the thieves came during the hours the museum was handing out free pamphlets about the paintings, and nobody thought to check if the pamphlet-readers were loading the paintings into a van.
2008
Common Crawl founded. Mission: open the web to researchers. Legitimate goal. No one imagines commercial AI training at scale. The S3 bucket is public.
2019–20
GPT-2, GPT-3 training. Common Crawl used as base. Books1 and Books2 datasets assembled. Suspected LibGen content included in Books2. No lawsuits yet. No one is watching.
2020
EleutherAI publishes The Pile — includes Books3 (from Bibliotik piracy tracker), GitHub, arXiv, PubMed. Open-sourced. Every lab downloads it. The academic wrapper makes it feel legitimate. It is not.
2021
Meta begins LLaMA development. GitHub Copilot launches. FTC is asleep. Copyright Office is writing memos. Congress is asking CEOs to explain what an algorithm is. Senator asks Zuckerberg if he knows what a cookie is. He does. He definitely does. He invented the tracking cookie. The hearing continues.
2022
LLaMA 1 trained. Internal Meta emails discuss LibGen. Legal reviews. Decision: proceed, use HTTP not BitTorrent. ChatGPT launches in November. 100 million users in 2 months. Now everyone is watching. The training is already done.
2023
Lawsuits begin. Kadrey v. Meta. NYT v. OpenAI. Authors Guild v. OpenAI. Getty v. Stability AI. OpenAI introduces GPTBot user-agent for robots.txt — after the models are already trained. The fence goes up after the cattle are out.
2024
Kadrey emails unsealed. Reddit deal: $60M/year. OpenAI signs licensing deals with AP, Axel Springer, The Atlantic. All retroactive legitimization of data already used. The checks arrive after the meal.
2025–26
Courts still deciding. Models are deployed at planetary scale. The training data question: moot in practice, live in law. The extraction happened. The products ship. The lawsuits settle for amounts that are rounding errors on the revenue.
◈ THE REGULATORY VACUUM · WHY NOBODY STOPPED IT
THE LAW THAT EXISTED WAS NOT BUILT FOR THIS. THE LAW THAT SHOULD EXIST DOESN'T YET.
US copyright law has a fair use doctrine with four factors: purpose and character of use (commercial vs. research), nature of the copyrighted work, amount taken, and market effect. AI companies argued transformative use: you're not copying the book, you're learning patterns from it. Courts haven't settled whether this holds. The EU moved faster, with the text-and-data-mining opt-out in Article 4 of the 2019 Copyright Directive and the AI Act's training data provisions, but most of the training on European data was done before those rules were enforced. The gap was 2019–2023. Everything happened in the gap. By the time the law caught up, the models were deployed.
◈ THE .EDU AND .GOV QUESTION · HOW ACADEMIC AND GOVERNMENT DATA WAS ABSORBED
FEDERAL RESEARCH. UNIVERSITY PAPERS. PUBLIC HEALTH DATA. ALL IN THE CORPUS.
US federal government works are public domain by law. Reports, studies, regulations, court opinions, NIH research, NOAA data, NASA publications: legally, anyone can use them. The AI labs did. This part is legal. But the scale and the lack of attribution still matter: the public funded this research, and the AI labs are now selling intelligence derived from it without contributing back to the public institutions that generated it.
University .edu content is different. Research papers published in journals are NOT public domain — publisher rights apply. Course notes, dissertations, student work — not public domain. What's publicly visible on .edu web servers was crawled by Common Crawl regardless. The academic open-access movement (arXiv, PubMed Central) made this easier: papers that researchers made freely available were scraped at scale without the authors being consulted about AI training specifically.
◈ THE SPECIFIC DATASETS · WHAT WAS IN THE PILE FROM ACADEMIA
FREELAW. PUBMED CENTRAL. ARXIV. ALL IN THE PILE. ALL USED BY LABS.
The Pile (EleutherAI, 2020) explicitly lists its sources. FreeLaw: 3.5 million US court opinions — public domain, legally fine, but courts didn't consent to AI training. PubMed Central: 3.7 million biomedical papers — open access but author rights exist. arXiv: 1.7 million scientific preprints — open access, but again no explicit AI training consent from authors. DM Mathematics: DeepMind's synthetic math dataset. GitHub: 95 GB of code. All assembled, deduplicated, and published as a research dataset. Every lab that trained on The Pile trained on all of this.
◈ THE FULL PIPELINE · TECHNICAL BREAKDOWN · HOW YOU BUILD A TRAINING CORPUS
THE COMPLETE MECHANIC. STEP BY STEP. THIS IS HOW THEY BUILT IT.
For those who want the full picture: this is how a training corpus is assembled. What follows is a simplified sketch of the kind of Python pipeline that processes the web into training data, not any lab's actual code. Meta published their CCNet methodology. Google published their C4 methodology. OpenAI's is partially described in the GPT-3 paper. The tools are open source. Anyone can reproduce this.
◈ STAGE 1 · DOWNLOAD · THE DATA IS JUST SITTING THERE ON AMAZON S3
import boto3
from warcio.archiveiterator import ArchiveIterator

s3 = boto3.client('s3', region_name='us-east-1')  # the commoncrawl bucket lives in us-east-1

def stream_warc(crawl_id, segment, file_path):
    """Stream one WARC file from the Common Crawl bucket, record by record."""
    key = f'crawl-data/{crawl_id}/{segment}/{file_path}'
    response = s3.get_object(Bucket='commoncrawl', Key=key)
    return ArchiveIterator(response['Body'])
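What the iterator hands back is a stream of WARC records: a block of plain-text headers, a blank line, then exactly `Content-Length` bytes of payload. Below is a stdlib-only sketch of that framing, run against a tiny in-memory sample. The helper names `make_record` and `read_warc_records` are invented for illustration. Real Common Crawl files are gzip-compressed record by record, which warcio handles and this sketch does not.

```python
import io


def make_record(rec_type: str, uri: str, payload: bytes) -> bytes:
    """Build one uncompressed WARC record (illustrative, minimal headers)."""
    head = (
        f"WARC/1.0\r\n"
        f"WARC-Type: {rec_type}\r\n"
        f"WARC-Target-URI: {uri}\r\n"
        f"Content-Length: {len(payload)}\r\n"
        f"\r\n"
    ).encode("utf-8")
    return head + payload + b"\r\n\r\n"


def read_warc_records(stream):
    """Yield (headers, payload) for each record in an uncompressed WARC stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"unexpected line: {line!r}")
        headers = {}
        # Header block ends at the first blank line.
        while (line := stream.readline()).strip():
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        payload = stream.read(int(headers["Content-Length"]))
        yield headers, payload


sample = (
    make_record("request", "https://example.com/", b"GET / HTTP/1.1\r\n\r\n")
    + make_record("response", "https://example.com/",
                  b"HTTP/1.1 200 OK\r\n\r\n<html><p>hi</p></html>")
)
records = list(read_warc_records(io.BytesIO(sample)))
responses = [(h["WARC-Target-URI"], body)
             for h, body in records if h["WARC-Type"] == "response"]
print(len(records), "records,", len(responses), "response")
```

Only the `response` records carry page content, which is why the pipeline filters on record type before extracting text.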
◈ STAGE 2 · EXTRACT TEXT · STRIP HTML → GET CONTENT
import trafilatura
import fasttext

lang_model = fasttext.load_model('lid.176.bin')  # fastText language-ID model, downloaded separately

def extract(record):
    """Return the main English text of a WARC response record, or None."""
    html = record.content_stream().read().decode('utf-8', errors='ignore')
    text = trafilatura.extract(html, include_comments=False)
    if not text or len(text) < 100:
        return None
    # fastText predicts one line at a time, so newlines must be removed first
    labels, scores = lang_model.predict(text[:500].replace('\n', ' '))
    if labels[0] != '__label__en' or scores[0] < 0.65:
        return None
    return text
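If you want to see the "strip HTML, keep content" step without installing anything, here is a rough stdlib stand-in. It is not what trafilatura does internally, which also removes navigation and boilerplate. It only drops tags and script/style bodies, then applies the same 100-character floor. `TextExtractor` and `html_to_text` are names invented for this sketch.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of script/style tags."""

    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_text(html: str, min_chars: int = 100):
    """Return the page's visible text, or None if it is shorter than min_chars."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.parts)
    return text if len(text) >= min_chars else None


page = "<html><script>var x = 1;</script><p>" + "word " * 30 + "</p></html>"
print(html_to_text(page) is not None)           # long enough to keep
print(html_to_text("<p>too short</p>") is None)  # dropped by the length floor
```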
◈ STAGE 3 · DEDUPLICATION · MINHASH LSH · REMOVE NEAR-DUPLICATES AT SCALE
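from datasketch import MinHash, MinHashLSH

# Index of documents seen so far; 0.8 is the estimated Jaccard similarity cutoff
lsh = MinHashLSH(threshold=0.8, num_perm=128)

def get_minhash(text):
    """Build a 128-permutation MinHash signature from the document's words."""
    m = MinHash(num_perm=128)
    for word in set(text.lower().split()):
        m.update(word.encode('utf8'))
    return m

def is_duplicate(doc_id, text):
    """Return True if a near-duplicate is already indexed; otherwise index this doc."""
    m = get_minhash(text)
    if lsh.query(m):
        return True
    lsh.insert(doc_id, m)
    return False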
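Why this works: the fraction of positions where two MinHash signatures agree estimates the Jaccard similarity of the underlying sets. Here is a stdlib-only sketch of that idea. It uses salted SHA-1 in place of datasketch's permutations and three-word shingles in place of single words; both are choices made for this illustration, not what the code above does.

```python
import hashlib

NUM_PERM = 128


def shingles(text: str, k: int = 3) -> set:
    """Set of overlapping k-word shingles (illustrative choice of k)."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def minhash_signature(items: set, num_perm: int = NUM_PERM) -> list:
    """For each salted hash function, keep the minimum hash over the set."""
    signature = []
    for seed in range(num_perm):
        signature.append(min(
            int.from_bytes(hashlib.sha1(f"{seed}:{item}".encode("utf-8")).digest()[:8], "big")
            for item in items
        ))
    return signature


def estimated_jaccard(sig_a: list, sig_b: list) -> float:
    """Fraction of signature positions that agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def true_jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b)


doc_a = "the quick brown fox jumps over the lazy dog near the quiet river bank today"
doc_b = "the quick brown fox jumps over the lazy dog near the quiet river bank yesterday"
doc_c = "completely unrelated sentence about training corpora and copyright lawsuits"

sa, sb, sc = (minhash_signature(shingles(d)) for d in (doc_a, doc_b, doc_c))
print("a vs b:", estimated_jaccard(sa, sb), "true:", true_jaccard(shingles(doc_a), shingles(doc_b)))
print("a vs c:", estimated_jaccard(sa, sc))
```

The LSH index then buckets signatures so that likely matches are found without comparing every pair of documents.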
◈ STAGE 4 · QUALITY FILTER · PERPLEXITY SCORING · CUT THE GARBAGE
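import kenlm

lm = kenlm.Model('wiki_en.arpa')  # reference n-gram language model

def perplexity(text, max_words=200):
    """Perplexity of the first max_words words under the reference model."""
    words = text.split()[:max_words]
    if not words:
        return float('inf')
    log_prob = lm.score(' '.join(words))  # total log10 probability, incl. end-of-sentence
    return 10 ** (-log_prob / (len(words) + 1))  # +1 for the end-of-sentence token

def quality_filter(text, lo=10, hi=1000):
    """Keep text that is neither suspiciously repetitive nor gibberish."""
    p = perplexity(text)
    return lo < p < hi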
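The formula underneath is just an average: perplexity is 10 raised to the negative mean log10 probability per token. Here is a toy sketch with a hand-built unigram model and add-one smoothing standing in for KenLM. The reference sentence and both test strings are made up for illustration.

```python
import math
from collections import Counter


def perplexity_from_log10(log10_probs):
    """Perplexity = 10 ** (-(mean log10 probability per token))."""
    if not log10_probs:
        raise ValueError("need at least one token")
    return 10 ** (-sum(log10_probs) / len(log10_probs))


# Toy reference "language model": unigram counts with add-one smoothing.
REFERENCE = "the cat sat on the mat the dog sat on the rug"
counts = Counter(REFERENCE.split())
total = sum(counts.values())
vocab = len(counts)


def unigram_log10_probs(text):
    """log10 P(word) per word; unseen words get the smoothed floor probability."""
    return [math.log10((counts[w] + 1) / (total + vocab + 1))
            for w in text.lower().split()]


clean = perplexity_from_log10(unigram_log10_probs("the cat sat on the rug"))
noise = perplexity_from_log10(unigram_log10_probs("zxq vbn qwe rty"))
print(f"in-domain: {clean:.1f}  gibberish: {noise:.1f}")
```

Text that looks like the reference scores low and gibberish scores high. The `lo` and `hi` cutoffs in the filter above draw the line between them.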
◈ STAGE 5 · DISTRIBUTE · APACHE BEAM ON GCP · PROCESS PETABYTES IN PARALLEL
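import apache_beam as beam
import pyarrow as pa
from apache_beam.options.pipeline_options import PipelineOptions

def extract_file(key):
    # One WARC object key per input line; reuses s3, ArchiveIterator and extract from above
    body = s3.get_object(Bucket='commoncrawl', Key=key)['Body']
    for record in ArchiveIterator(body):
        if record.rec_type == 'response' and (text := extract(record)):
            yield text

with beam.Pipeline(options=PipelineOptions(
        runner='DataflowRunner',
        project='ai-lab-training',
        region='us-central1',
        machine_type='n1-highmem-64',
        num_workers=500,
)) as p:
    (p
     # Input is assumed to be gzipped lists of WARC keys, like Common Crawl's warc.paths.gz
     | 'Read' >> beam.io.ReadFromText('gs://commoncrawl-warc/*.gz')
     | 'Extract' >> beam.FlatMap(extract_file)
     | 'Filter' >> beam.Filter(quality_filter)
     | 'Dedup' >> beam.Distinct()  # exact duplicates only; near-dup needs the MinHash stage
     | 'ToRow' >> beam.Map(lambda text: {'text': text})
     | 'Write' >> beam.io.WriteToParquet('gs://training-corpus/', pa.schema([('text', pa.string())])))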
This runs for weeks on hundreds of machines. At the end, you have a training corpus measured in terabytes. The cost: cloud compute, maybe $500K–$2M for a full run. The value produced: models worth billions. The raw material cost: $0. The raw material source: everyone's work, without permission.
◈ THE LIBGEN PLAY · HOW PIRATED BOOKS SPECIFICALLY ENTERED THE MODEL
HTTP. NOT BITTORRENT. QUIETER. SAME BOOKS. SAME CRIME.
LibGen serves files over plain HTTP. No account required. Search, find the book, click download. The same Python that downloads any file downloads a pirated book from LibGen. At scale, with a list of 4 million ISBNs, you build a scraper that hits the LibGen search API, resolves the download URL, and pulls the file. Plain requests.get(). No cleverness required.
◈ THE LIBGEN MECHANISM · PLAIN HTTP · THIS IS IT
import requests
from bs4 import BeautifulSoup

def fetch_from_libgen(isbn):
    # LIBGEN_MIRROR is not defined in the original; left as-is
    search = requests.get(
        f'{LIBGEN_MIRROR}/search.php?req={isbn}&column=identifier')
    soup = BeautifulSoup(search.text, 'html.parser')
    download_link = soup.find('a', string='[1]')['href']
    book_page = requests.get(download_link)
    direct_url = BeautifulSoup(book_page.text, 'html.parser')\
        .find('a', string='GET')['href']
    return requests.get(direct_url).content
The choice to use HTTP over BitTorrent is significant. BitTorrent clients log participating peers. There's a distributed record of who downloaded what. HTTP requests to a server leave logs only on that server — which is in Russia, outside US jurisdiction. The legal team at Meta understood this distinction. The decision to proceed via HTTP was a decision about evidence management as much as a technical choice.
◈ WHO SIGNED OFF · THE CHAIN OF COMMAND · THE RECORD IS SEALED. THEN UNSEALED.
LEGAL REVIEWED IT. EXECUTIVES APPROVED IT. ZUCK WAS IN THE LOOP.
From the Kadrey v. Meta record: the decision to use LibGen data was not made by a rogue engineer. It went through legal review. It was approved at the executive level. Mark Zuckerberg was informed. He did not object. This is documented.
OpenAI's situation: internal discussions about YouTube, Reddit, and book data show the legal team was consistently told the copyright risk was real and consistently told that "transformative use" was the shield. The shield has not been tested in court at this scale. The bet was: move fast, deploy, let the lawyers fight it later, and settle for less than the cost of not doing it.
Translation: "we knew it was a problem, we decided our future revenue was bigger than the problem, and we're counting on courts being too slow and settlements being too cheap." This is not a startup moving fast and breaking things. This is a very deliberate risk management decision. The risk they were managing was getting caught. Ong on God.
◈ THE CALCULUS · WHY THEY DID IT ANYWAY
THE EXPECTED VALUE OF STEALING THE DATA EXCEEDED THE EXPECTED COST OF BEING CAUGHT.
The math they ran: training on pirated books gives us a better model. A better model is worth tens of billions. The maximum liability in copyright litigation, even if we lose every case, is statutory damages per work — which courts historically reduce in aggregate cases. The settlement cost will be a fraction of the revenue generated by the capability we built from the stolen material. This is a financial decision, not an ethical one. The extraction happened. The products ship. The lawsuits settle. The capability stays in the model forever.
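To put a scale on "statutory damages per work", here is a rough sketch that multiplies the US statutory range (17 U.S.C. § 504(c): $750 to $30,000 per work, up to $150,000 per work for willful infringement) by the 196,640-book Books3 count cited earlier. It assumes every book counts as one eligible work and that a court would award the same figure for each. Neither is how aggregate cases usually go, which is the bet described above.

```python
# Illustrative bounds only: assumes one eligible work per book and a
# uniform per-work award, neither of which is guaranteed in practice.
BOOKS3_WORKS = 196_640
STATUTORY_MIN = 750        # USD per work
STATUTORY_MAX = 30_000     # USD per work
WILLFUL_MAX = 150_000      # USD per work, willful infringement


def exposure(works: int, per_work: int) -> int:
    """Total statutory damages if every work drew the same award."""
    return works * per_work


for label, per_work in [("minimum", STATUTORY_MIN),
                        ("ordinary maximum", STATUTORY_MAX),
                        ("willful maximum", WILLFUL_MAX)]:
    print(f"{label:>16}: ${exposure(BOOKS3_WORKS, per_work):,}")
```

The theoretical ceiling is large. The calculus in the paragraph above only holds if courts and settlements land near the floor.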
◈ THE VERDICT · KENSHOTEK LLC · KENSHO INVESTIGATES · APRIL 2026
FILED.
They are not geniuses who invented a new way to read. They are people who noticed that the world's knowledge was sitting on publicly accessible servers and built Python scripts to take it before anyone thought to lock the door. The genius — if you want to call it that — was moving fast enough that by the time the law caught up, the capability was already deployed, the companies were worth hundreds of billions, and the cost of undoing it exceeded the cost of paying settlements.
Katt Williams called this years ago in a different context: "You ain't the first person to think of it. You just the first person dumb enough to do it out loud." Except they weren't dumb. They were quiet about it, deliberate about it, and legally reviewed about it. They just counted on everyone else being slower. They were right.
That is not innovation. That is the mercenary calculus applied to information. Take what the territory has. Move before the fence goes up. Settle for less than you made. The authors whose books are in LLaMA received nothing. The universities whose research is in GPT received nothing. The developers whose code is in Copilot received nothing. The companies that took it are worth trillions. Ong.
KenshoTek saw it. Filed it. Let time confirm it. Kensho 20/20: we don't predict. We observe what was already true and document it while everyone else is still asking what a token is. The dispatch is the receipt. The record is permanent. Say it clean. Let it sit. 925.
◈ THE VERDICT · AQUATEKXVI PRESIDES · KENSHOTEK LLC · 2026
THEY DIDN'T KNOCK.
THERE WAS NO DOOR.
THEY BUILT THE HOLE AND CALLED IT A PIPELINE.
LIBGEN: DOCUMENTED. ZUCK: IN THE LOOP.
YOUTUBE: CONFIRMED. REDDIT: PAID AFTER THE FACT.
GITHUB: MICROSOFT SOLD YOUR CODE BACK TO YOU.
.EDU AND .GOV: IN COMMON CRAWL. ALWAYS WERE.
LEGAL REVIEWED IT. THEY PROCEEDED ANYWAY.
HTTP NOT BITTORRENT: QUIETER. SAME CRIME.
THE SETTLEMENTS WILL BE ROUNDING ERRORS.
THE CAPABILITY STAYS IN THE MODEL FOREVER.
RECEIPTS ON FILE.
KENSHO INVESTIGATES.
FILED. 925.
◈ SOURCES: Kadrey v. Meta Platforms, Case 3:23-CV-03417 (N.D. Cal.); The New York Times v. OpenAI, Case 1:23-CV-11195 (S.D.N.Y.); Doe v. GitHub, Microsoft, OpenAI, Case 4:22-CV-06823 (N.D. Cal.); The New York Times reporting on OpenAI YouTube use (Apr. 2024); EleutherAI Pile dataset paper (Gao et al., 2020); Meta CCNet paper (Wenzek et al., 2020); Google C4 paper (Raffel et al., 2019); OpenAI GPT-3 paper (Brown et al., 2020). All court records public. All papers public. Field verified.