CS336 Notes: Lecture 13 - Data 1
Data matters more than architecture or optimizers. Teams that build serious models treat their data mix and filters as the real secret.
Key Takeaways
Training is usually staged:
- Pre-training: huge, noisy web-scale text.
- Mid-training: smaller, cleaner data to push specific skills.
- Post-training: instruction and chat fine-tuning, often with RL for alignment and safety.
Real pipelines start from messy live sources (websites, code hosts, paper repositories) and run extraction, language ID, quality filters, deduplication, and domain rules.
To understand models (BERT, GPT-2/3, LLaMA, Gopher, OLMo), look at data choices and filters as much as the model design.
Common Crawl is a rough, biased snapshot of part of the web. Different teams turn it into very different corpora.
Special sources (books, Stack Exchange, GitHub, papers, Reddit) add capability but raise licensing, bias, and safety issues.
Learned filtering (quality classifiers, LLM-based scoring) often beats simple rules on benchmarks.
Copyright and licensing sit at the center of modern training. Most web text is copyrighted by default. Use depends on licenses, fair use, and platform terms.
Mid- and post-training often rely on synthetic data to target skills like long context, math, coding, and instruction following.
Data work is messy, heuristic, and still evolving. Better pipelines can still move the needle.
Why Data Matters More Than Architecture
Many model papers describe architecture and training details but stay vague about data. LLaMA 3, for example, names broad sources and rough filtering but not the full recipe.
Data details stay secret for two reasons:
- Competitive advantage: the mix and filters drive quality.
- Legal risk: the fewer specifics disclosed, the fewer targets for lawsuits.
Supervised learning always depended on labeled data. Foundation models use less manual labeling, but they still depend on careful selection, cleaning, filtering, and mixing.
Training Stages
Pre-training
Mostly raw or lightly filtered text. Web crawls and broad sources.
Mid-training
Smaller, higher-quality corpora. Targets math, code, long context, reasoning, or domain skills.
Post-training
Instruction and chat fine-tuning. Often includes reinforcement learning for helpfulness and safety. Produces "chat" behavior: follows instructions, converses cleanly.
Boundaries blur. Some "pre-training" data looks like instructions. Some pipelines use more than three stages.
Terminology:
- Base model: after pre-training and mid-training.
- Chat or instruct model: after post-training.
Example Training Mixes
AI2 released open models and described their data.
Typical pre-training mix:
- Filtered Common Crawl subsets.
- Code (for example, The Stack).
- Papers and math.
- Wikipedia.
- Token counts in the trillions.
Mid-training:
- Smaller filtered slices plus instruction-style sets (like FLAN).
- Synthetic data for math, reasoning, and long context.
- A small fraction of total tokens.
Post-training:
- Public or licensed chat transcripts.
- Large synthetic instruction-following sets.
- Data shaped for safety and alignment.
These recipes come from trial, error, and experience more than theory.
Early Pre-Training: BERT, Books, Wikipedia
BERT (2018) mainly used BooksCorpus and Wikipedia.
BooksCorpus came from scraped "free" books. It was later removed after terms-of-service issues. Early work lived in a looser legal moment.
Wikipedia:
Large, multilingual, and heavily edited. Aimed at sourced facts, not personal opinion. Notability rules exclude many niche or fringe topics. Released as downloadable dumps.
Limits:
Little first-person advice, recipes, or informal instruction. Skews toward what is well documented and notable.
Data Poisoning via Wikipedia Dumps
Dump schedules can be exploited:
- Insert a malicious edit before a dump.
- Let it enter the dump.
- Revert it on the live site afterward.
The live page looks clean, but the dump carries the poisoned text. A model trained on that dump can learn the attacker's association.
This illustrates the general problem: open internet data can be manipulated, and full oversight is hard at scale.
GPT-2 and WebText
GPT-2 introduced WebText.
Idea:
The web is too big and too noisy. Reddit links with positive karma can act as a rough proxy for "worth reading."
Construction (a minimal sketch follows this list):
- Take Reddit posts with at least 3 karma.
- Follow outbound links.
- Download the linked pages.
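A minimal sketch of the link-selection step, assuming a JSONL dump of Reddit submissions with "score" and "url" fields (the field names and file format are illustrative, not the original pipeline):

```python
import json

def webtext_style_urls(dump_path, min_karma=3):
    """Collect outbound URLs from sufficiently upvoted Reddit submissions."""
    urls = set()
    with open(dump_path) as f:
        for line in f:
            post = json.loads(line)
            if post.get("score", 0) >= min_karma:
                url = post.get("url", "")
                # Keep external pages only; skip links back to reddit itself.
                if url.startswith("http") and "reddit.com" not in url:
                    urls.add(url)
    return urls
```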
Result:
About 8 million pages. About 40 GB of text.
WebText was not released. OpenWebText is an open reproduction that became widely used.
Common Crawl and Web Crawls
Common Crawl is a non-profit that has crawled the web since 2007.
Basic facts:
Monthly crawls. Billions of pages per crawl. Coverage is incomplete and biased. It is not "the internet."
How crawling works (a minimal loop is sketched after this list):
- Start from a huge seed set of URLs.
- Maintain a crawl frontier (a URL queue).
- Fetch pages in parallel, follow links, and add new URLs.
- Decide revisit frequency with heuristics.
- Respect robots.txt and crawl politely to avoid overloading sites.
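A minimal sketch of that loop, assuming the requests and beautifulsoup4 libraries plus the standard-library robots.txt parser; politeness delays, per-host queues, and revisit scheduling are left out:

```python
from collections import deque
from urllib.parse import urljoin, urlparse
from urllib.robotparser import RobotFileParser

import requests
from bs4 import BeautifulSoup

def crawl(seed_urls, max_pages=100):
    frontier = deque(seed_urls)   # the crawl frontier (URL queue)
    seen = set(seed_urls)
    robots = {}                   # cached robots.txt parsers per host
    pages = {}

    def allowed(url):
        host = urlparse(url).netloc
        if host not in robots:
            rp = RobotFileParser(f"https://{host}/robots.txt")
            try:
                rp.read()
            except Exception:
                rp = None
            robots[host] = rp
        rp = robots[host]
        return rp is None or rp.can_fetch("*", url)

    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        if not allowed(url):
            continue
        try:
            resp = requests.get(url, timeout=10)
        except requests.RequestException:
            continue
        pages[url] = resp.text
        # Follow links and add unseen URLs to the frontier.
        for a in BeautifulSoup(resp.text, "html.parser").find_all("a", href=True):
            link = urljoin(url, a["href"])
            if link.startswith("http") and link not in seen:
                seen.add(link)
                frontier.append(link)
    return pages
```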
Formats:
- WARC: raw HTTP responses (HTML and more).
- WET: extracted text. This is lossy.
HTML-to-text extraction is a major quality lever. Different extractors keep different content and boilerplate. Studies show benchmark scores can shift by multiple points based on extractor choice.
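For instance, a sketch that re-extracts text from the raw WARC records with warcio and Trafilatura instead of trusting the pre-extracted WET text (the file path is a placeholder):

```python
from warcio.archiveiterator import ArchiveIterator
import trafilatura

def extract_warc_text(warc_path):
    """Yield extracted text for each HTML response in a WARC file."""
    docs = []
    with open(warc_path, "rb") as f:
        for record in ArchiveIterator(f):
            if record.rec_type != "response":
                continue
            html = record.content_stream().read().decode("utf-8", errors="ignore")
            text = trafilatura.extract(html)  # returns None if nothing useful
            if text:
                docs.append({
                    "url": record.rec_headers.get_header("WARC-Target-URI"),
                    "text": text,
                })
    return docs
```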
Common Crawl also includes:
Harmful content. Lots of duplicates and near-duplicates. Missing or partial coverage of important sources.
Filtering Common Crawl: CCNet and C4
Raw Common Crawl needs heavy filtering.
CCNet (Meta)
Goal: multilingual, higher-quality subsets.
Steps:
- Deduplicate.
- Language ID.
- Score documents by perplexity under an n-gram language model trained on Wikipedia.
- Keep documents whose text resembles Wikipedia (a sketch follows this list).
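A sketch of the perplexity-filtering idea using the KenLM Python bindings; the model path and threshold are placeholders, and the real CCNet pipeline also tokenizes with SentencePiece and buckets documents into head/middle/tail rather than applying one hard cutoff:

```python
import kenlm  # Python bindings for KenLM n-gram models

model = kenlm.Model("wiki.arpa")  # n-gram LM trained on Wikipedia (placeholder path)

def keep_document(text, max_perplexity=1000.0):
    # Lower perplexity means the text looks more like Wikipedia.
    return model.perplexity(text) < max_perplexity
```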
Strength: pushes toward cleaner text.
Weakness: throws away useful writing that does not look like Wikipedia.
C4 (Google, used for T5)
Colossal Clean Crawled Corpus.
Starts from one Common Crawl snapshot and applies heuristic rules (sketched in code after the list):
- Keep lines that end with sentence punctuation.
- Drop pages with fewer than three sentences.
- Remove pages with "bad words."
- Drop pages that contain curly braces to remove many code fragments.
- Keep only English.
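A rough sketch of these rules applied to a single page; the thresholds and the empty blocklist are placeholders, not the exact C4 implementation:

```python
BAD_WORDS = set()  # fill in with the public "bad words" blocklist

def clean_page(text, min_sentences=3, min_words_per_line=5):
    lowered = text.lower()
    if any(w in lowered for w in BAD_WORDS):
        return None            # drop whole pages containing bad words
    if "{" in text or "}" in text:
        return None            # drop pages with code-like curly braces
    lines = []
    for line in text.splitlines():
        line = line.strip()
        if len(line.split()) < min_words_per_line:
            continue           # drop very short lines
        if not line.endswith((".", "!", "?", '"')):
            continue           # keep only sentence-like lines
        lines.append(line)
    if len(lines) < min_sentences:
        return None            # drop pages with too few sentences
    return "\n".join(lines)
```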
Strength: keeps well-formed sentences even when style differs from Wikipedia.
Weakness: can keep spam that is grammatical.
They complement each other. Each has blind spots.
GPT-3: Data Mix and Quality Classifiers
GPT-3 used:
- Processed Common Crawl.
- WebText2.
- Two books corpora (often described as Books1 and Books2).
- Wikipedia.
Total: about 400B tokens.
For Common Crawl, GPT-3 trained a quality classifier:
- Positives: pages from WebText, Wikipedia, books.
- Negatives: random Common Crawl pages.
- Keep Common Crawl pages that look like the positives.
Pattern (sketched in code after this list):
- Define what "good" looks like.
- Train a classifier.
- Use it to mine a better subset from a huge pool.
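A sketch of the pattern with scikit-learn; the hashed features, logistic regression, and hard threshold are illustrative stand-ins for GPT-3's setup (the paper additionally resampled documents stochastically by classifier score):

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import LogisticRegression

def train_quality_classifier(positive_docs, negative_docs):
    # Positives: trusted sources (WebText, Wikipedia, books).
    # Negatives: random Common Crawl pages.
    vec = HashingVectorizer(n_features=2**18, alternate_sign=False)
    X = vec.transform(positive_docs + negative_docs)
    y = [1] * len(positive_docs) + [0] * len(negative_docs)
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    return vec, clf

def mine_subset(docs, vec, clf, threshold=0.5):
    # Keep Common Crawl pages that look like the positives.
    scores = clf.predict_proba(vec.transform(docs))[:, 1]
    return [d for d, s in zip(docs, scores) if s > threshold]
```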
The Pile: Community Mixed Data
EleutherAI built The Pile as an open alternative to closed mixes.
It combines many "high-quality" domains:
- OpenWebText.
- Stack Exchange.
- Wikipedia.
- ArXiv.
- PubMed Central.
- GitHub code.
- Enron emails.
- Project Gutenberg.
The Pile exceeds GPT-3's training data size.
Point: web crawl can cover everything in theory, but curated sources are easier to control and reuse. Some sources inject strong quirks, like Enron's influence on "email tone."
Better extraction tools (like jusText) are often used instead of default WET text.
Books, Shadow Libraries, Copyright
Project Gutenberg
Tens of thousands of mostly public-domain books. Used for long-context work (for example, PG-19).
Books3 and Shadow Libraries
Books3 was scraped from a shadow library and included many copyrighted works. It was taken down after complaints.
Shadow Libraries like LibGen and Sci-Hub
Host large collections without regard for copyright and operate illegally in many jurisdictions. Lawsuits have surfaced claims that some models trained on such sources.
This helps explain both secrecy and legal exposure.
Special Domains: Stack Exchange, GitHub, Papers
Stack Exchange
Q&A format with votes and accepted answers. Naturally resembles instruction-following: question in, answer out. Votes provide a built-in quality signal. Dumps exist, but commercial use may require a license.
GitHub
Main source for code. Many repos are tiny, abandoned, duplicated, or forks.
Building code data means deciding (see the sketch after this list):
- What files count as code.
- Which licenses to allow.
- How to deduplicate forks and near-copies.
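A toy sketch of those three decisions; the extension list, license allow-list, and exact content-hash dedup are simplifications (real pipelines also catch near-copies with fuzzy matching):

```python
import hashlib

ALLOWED_LICENSES = {"mit", "apache-2.0", "bsd-3-clause"}          # illustrative allow-list
CODE_EXTENSIONS = {".py", ".js", ".java", ".c", ".cpp", ".go", ".rs"}

def keep_file(path, license_id, content, seen_hashes):
    # What counts as code: filter by file extension.
    if not any(path.endswith(ext) for ext in CODE_EXTENSIONS):
        return False
    # Which licenses to allow: keep permissive licenses only.
    if license_id.lower() not in ALLOWED_LICENSES:
        return False
    # Deduplicate forks and identical copies by content hash.
    digest = hashlib.sha256(content.encode("utf-8", "ignore")).hexdigest()
    if digest in seen_hashes:
        return False
    seen_hashes.add(digest)
    return True
```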
The Stack
Used GitHub Archive to find repo names. Cloned about 137M repos. Kept permissively licensed repos. Deduplicated. Produced about 3.1 TB of code.
Pattern: live service to snapshot to processed training set.
Academic Papers
PubMed Central provides open-access biomedical papers. Semantic Scholar aggregates papers from many sources. Adds dense scientific writing.
Gopher and MassiveText
DeepMind's Gopher used MassiveText:
- MassiveWeb (filtered web).
- C4.
- Books, news, GitHub, Wikipedia.
MassiveWeb filtering:
English focus. Hand-built rules for quality checks. Google SafeSearch to reduce toxic content. Avoided model-based filters to reduce bias from weak models.
Total text: about 10 TB. Training used about 300B tokens.
LLaMA and RedPajama
LLaMA (Meta, 2023) used:
- Common Crawl filtered with CCNet-style methods.
- A classifier trained to match sources cited by Wikipedia.
- C4.
- GitHub (permissive licenses plus manual filters).
- Wikipedia, Project Gutenberg.
- Books3, ArXiv, Stack Exchange.
Total: about 1.2T tokens.
RedPajama v1 (Together) aimed to reproduce LLaMA's mix openly and released data and code.
Cerebras produced a deduplicated subset (about 627B tokens).
RedPajama v2 later processed Common Crawl at larger scale (about 30T tokens) and added multiple quality signals to support filtering experiments.
RefinedWeb and FineWeb
RefinedWeb
Claim: filtered web alone can be enough.
Common Crawl plus strong extraction (Trafilatura). Gopher-style rules and heuristics. Fuzzy deduplication. About 5T tokens total, about 600B released publicly.
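A sketch of fuzzy deduplication with MinHash and locality-sensitive hashing via the datasketch library; the word-level shingling, permutation count, and similarity threshold are illustrative rather than RefinedWeb's exact settings:

```python
from datasketch import MinHash, MinHashLSH

def minhash(text, num_perm=128):
    m = MinHash(num_perm=num_perm)
    for token in set(text.lower().split()):
        m.update(token.encode("utf-8"))
    return m

def dedup(docs, threshold=0.8, num_perm=128):
    lsh = MinHashLSH(threshold=threshold, num_perm=num_perm)
    kept = []
    for i, doc in enumerate(docs):
        m = minhash(doc, num_perm)
        if lsh.query(m):          # a near-duplicate is already indexed
            continue
        lsh.insert(str(i), m)
        kept.append(doc)
    return kept
```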
FineWeb
Expanded and improved the RefinedWeb approach. Uses all Common Crawl dumps. Rule-based filtering and deduplication. Basic anonymization. About 15T tokens of lightly filtered web text.
FineWeb is a broad base that others can filter more aggressively.
AI2 Dolma / OLMo
AI2's OLMo used Dolma:
- Processed Common Crawl.
- The Stack.
- C4.
- Reddit submissions and comments (via the Pushshift project, which later lost access to Reddit's API).
- Semantic Scholar papers and other academic sources.
- Project Gutenberg and Wikipedia.
Processing:
English language ID. Quality filtering, initially without ML classifiers. Toxicity filtering with a classifier. Deduplication.
Total: about 3T tokens.
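A sketch of the English language-ID step using fastText's public lid.176.bin model; the probability threshold is a placeholder, not Dolma's exact setting:

```python
import fasttext

lid = fasttext.load_model("lid.176.bin")  # public fastText language-ID model

def is_english(text, min_prob=0.5):
    # fastText predict expects single-line input, so strip newlines first.
    labels, probs = lid.predict(text.replace("\n", " "))
    return labels[0] == "__label__en" and probs[0] >= min_prob
```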
DataComp, DCLM Pool, DCLM Baseline
DataComp defines standard pools and benchmarks for filtering.
DCLM Pool
All Common Crawl dumps processed to about 240T tokens.
DCLM Baseline
Uses rule-based and model-based filtering to shrink the pool to about 3.8T tokens.
Model-based filtering pattern (sketched in code after this list):
- Positives: OpenHermes and ELI5.
- Negatives: random FineWeb samples.
- Train a FastText classifier.
- Score the pool and keep high-scoring documents.
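A sketch of that pattern with the fastText library; the training-file format, label names, and score threshold are illustrative:

```python
import fasttext

# "train.txt" holds lines like "__label__hq <doc>" (OpenHermes/ELI5 positives)
# and "__label__lq <doc>" (random web-text negatives); format is illustrative.
clf = fasttext.train_supervised(input="train.txt", epoch=5, wordNgrams=2)

def quality_score(text):
    labels, probs = clf.predict(text.replace("\n", " "))
    return probs[0] if labels[0] == "__label__hq" else 1.0 - probs[0]

def filter_pool(docs, threshold=0.9):
    # Keep only documents the classifier scores as high quality.
    return [d for d in docs if quality_score(d) > threshold]
```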
Reported result: DCLM baseline beats RefinedWeb and others on many benchmarks. AI2's later OLMo model adopted it as a core component. This marks a shift toward model-in-the-loop filtering. The tradeoff: it can bake in the model's biases.
NVIDIA Nemotron-CC Data
Nemotron-CC builds on DCLM-style ideas. The scale problem: DCLM Baseline drops most tokens, and 3.8T tokens can be too small for very large training runs.
Choices and techniques:
Use jusText instead of Trafilatura to keep more usable text.
Score documents with multiple signals:
- A large Nemotron model rates "educational value," then distills that into a faster classifier.
- Use the DCLM classifier as another signal.
Bucket documents by score and sample across buckets to keep diversity.
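A sketch of the bucketing-and-sampling idea; the bucket edges and per-bucket sampling rates are made-up placeholders:

```python
import random

def bucket_and_sample(docs_with_scores, edges=(0.25, 0.5, 0.75), rates=(0.05, 0.2, 0.5, 1.0)):
    """Group documents by quality score, then sample from every bucket
    so the corpus keeps diversity instead of only the top-scoring pages."""
    buckets = [[] for _ in range(len(edges) + 1)]
    for doc, score in docs_with_scores:
        idx = sum(score >= e for e in edges)   # which bucket the score falls in
        buckets[idx].append(doc)
    sampled = []
    for bucket, rate in zip(buckets, rates):
        k = int(len(bucket) * rate)
        sampled.extend(random.sample(bucket, k))
    return sampled
```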
Some pipelines also transform data:
Rewrite low-quality documents into cleaner text with an LLM. Generate synthetic tasks from high-quality documents (Q&A, summaries, "extract key information" pairs).
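A hedged sketch of both transformations; generate stands in for any LLM call, and the prompts are illustrative, not the actual Nemotron-CC prompts:

```python
def generate(prompt: str) -> str:
    # Placeholder: plug in your own LLM client here.
    raise NotImplementedError

def rewrite_low_quality(doc: str) -> str:
    # Turn a low-quality page into cleaner text while keeping its content.
    return generate(
        "Rewrite the following web page as clean, well-structured prose, "
        "keeping all factual content:\n\n" + doc
    )

def synthesize_qa(doc: str) -> str:
    # Generate synthetic Q&A grounded in a high-quality document.
    return generate(
        "Write three question-answer pairs that test the key information "
        "in this document:\n\n" + doc
    )
```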
Outcome:
About 6.3T tokens from Common Crawl. A 1T-token "high quality" subset performs best. Nemotron-CC outperforms DCLM and FineWeb on average benchmarks.
Copyright and Training
Copyright protects original expression fixed in a tangible medium. It does not protect ideas.
Key points:
Most internet text is copyrighted the moment it is posted, even without a notice. Registration is not required for copyright to exist, though in the US it is required before filing an infringement suit. Terms are long: most works stay copyrighted for decades.
To use copyrighted data legally, you generally need a license or a valid fair use argument.
Platform terms also matter. Even if a work is openly licensed, a platform may still restrict bulk downloading or automated scraping.
Licenses and Creative Commons
Ways to get rights:
Direct deals between model builders and content owners. Creative Commons licenses.
Creative Commons:
Keeps the work copyrighted but grants reuse rights under conditions. Wikipedia uses a CC license, and many other online works do too.
Modern developers often mix licensed data, open data, and fair use claims.
Fair Use
Fair use is judged by four factors:
- Purpose and character (commercial vs research, transformative vs copying).
- Nature of the work (factual vs highly creative).
- Amount used (snippets vs entire works, training often copies entire works).
- Effect on the market (whether the new use replaces the original or harms its market).
Google Books scanning and snippet display was found fair use because it was transformative and did not replace book sales.
For language models:
Training makes copies, so copyright applies. Some argue training is transformative because the model learns patterns, not text. Models can memorize and reproduce text and may affect markets. Many cases are still being litigated.
Mid- and Post-Training for Capability
Mid- and post-training are about targeted skills more than raw volume.
Long Context
Very long context is expensive because attention cost grows quadratically with sequence length. Common pattern: train mostly on shorter sequences, then add long-document training late. Books, long articles, and synthetic long sequences are common.
Turning Tasks into Instructions
Older NLP used many task-specific formats. Instruction tuning unified them.
Examples:
- SuperNaturalInstructions: over 1,600 tasks converted into instruction prompts and outputs.
- FLAN: many NLP tasks reformatted into "follow this instruction" style.
These help on benchmarks, but the prompts are often templated and unlike real user language.
Synthetic Instruction and Chat Data
Synthetic data became central for instruction following.
Self-Instruct and Alpaca
Start from a seed set. Use a strong model to generate instructions and answers. Produce large fine-tuning sets.
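A hedged sketch of that loop; generate stands in for a strong model's API, and the real Self-Instruct pipeline also filters generated instructions for similarity and quality:

```python
import random

def generate(prompt: str) -> str:
    # Placeholder: plug in a strong model's API here.
    raise NotImplementedError

def self_instruct(seed_instructions, rounds=1000):
    pool = list(seed_instructions)
    dataset = []
    for _ in range(rounds):
        # Show a few existing instructions and ask for a new one.
        examples = "\n".join(random.sample(pool, k=min(3, len(pool))))
        new_instruction = generate(
            "Here are some example instructions:\n" + examples +
            "\nWrite one new, different instruction."
        )
        answer = generate(new_instruction)
        pool.append(new_instruction)
        dataset.append({"instruction": new_instruction, "output": answer})
    return dataset
```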
Vicuna
Trained on ShareGPT conversation logs. Shows that real chat logs can sharply improve chat quality.
Other Patterns
- Self-chat and Evol-Instruct: models talk to themselves and gradually increase task difficulty.
- Web extraction: use an LLM to pull Q&A pairs out of web documents.
- OpenHermes: a curated mix of instruction and chat sets, used as a positive source in some quality classifiers.
Human Data and RLHF
LLaMA 2 Chat highlights human-annotated instruction and chat data and suggests it can beat massive synthetic sets. Human data is expensive, so teams mix it with RLHF to stretch it. They also have to prevent annotators from quietly using tools like GPT-4.
Post-Training: LLaMA and Nemotron
Newer post-training sets:
Llama-Nemotron post-training data: NVIDIA's mix, built on Meta's Llama models, of public chat sets (like WildChat) and synthetic conversations from open models. Includes reasoning traces from models like DeepSeek-R1 to enable chain-of-thought style outputs.
Ways to generate post-training data:
- Distill from proprietary API models like GPT-4 (strong, but licensing can block commercial use).
- Distill from open-weight models (safer licensing, often easier to ship).
- Use human-labeled data (safest legally, slow and expensive).
Conclusion
A typical pipeline looks like this:
- Start from live sources: web pages, code hosts, Q&A sites, paper repositories.
- Create snapshots or dumps.
- Extract text and attach metadata.
- Run language ID, cleaning, and deduplication.
- Apply quality filters, often model-based.
- Optionally rewrite or augment data into more useful formats.
- Mix sources and choose weights to form the final training mixture.
Main takeaways:
Data is the main differentiator when compute and architectures converge. Data work scales across large teams, while architecture work often lives with a small core.
The field is still driven by heuristics and varied choices. Legal and ethical constraints, especially copyright and platform terms, shape what teams can do.
The messiness is an opportunity. Better selection and filtering can still move the needle.