CS336 Notes: Lecture 14 - Data 2
Web-scale training data is messy. You crawl it, parse HTML, filter for language, quality, and toxicity, then deduplicate. The constraint that shapes all of this: the filter must be orders of magnitude cheaper than the model training.
Key Takeaways
Most filtering follows the same pattern: take a small high-quality target set T and a huge raw set R, train a cheap model to learn "what looks like T," score R, then keep the best parts.
Three common filters:
- N-gram language models (KenLM).
- Fast linear classifiers (fastText-style).
- Distribution matching via importance sampling.
The same machinery does language ID, quality filtering, domain selection (math, good Python), and toxicity filtering. A strong LLM is often used once to define the target signal, then distilled into a cheap filter.
Deduplication is separate from quality filtering. It matters to avoid wasted compute and reduce memorization.
Exact dedup uses hashing and Bloom filters: fast, memory-light approximate membership with no false negatives.
Near-dedup uses Jaccard similarity, MinHash (collision probability equals Jaccard), and locality-sensitive hashing (LSH) to turn pairwise comparisons into near-linear hashing and bucketing.
By tuning multiple hashes, bands, and "and/or" rules, you control similarity thresholds and clean the web at scale.
Data Pipelines and the Basic Filtering Pattern
Training data rarely arrives as clean files. Typical steps:
- Convert raw HTML to text.
- Filter by language, quality, toxicity.
- Remove duplicates and near-duplicates.
Common pipeline pattern:
You have a small target set T you like. You have a huge raw set R (Common Crawl, for example). You want a subset of R that resembles T.
The filter must generalize beyond T and run over the whole web, so it must be extremely fast. If the filter costs as much as training, you lose.
Recipe:
- Train a cheap model using T (and sometimes R).
- Score each document in R.
- Keep documents above a threshold, or resample with probability tied to the score.
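A minimal sketch of this recipe, with hypothetical names (score_fn, threshold); each concrete scorer below (KenLM, fastText, importance weights) slots into the score_fn role.

```python
import random

def select_documents(raw_docs, score_fn, threshold=None):
    """Generic selection sketch. score_fn is any cheap model of
    'how much this looks like T' returning a value in [0, 1].
    Keep documents above a hard threshold, or, if no threshold is
    given, keep each one with probability equal to its score."""
    kept = []
    for doc in raw_docs:
        s = score_fn(doc)  # must be far cheaper per token than model training
        if threshold is not None:
            if s >= threshold:
                kept.append(doc)
        elif random.random() < s:
            kept.append(doc)
    return kept
```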
N-Gram Language Models for Filtering
An n-gram LM estimates the probability of a word given the previous n-1 words.
Core idea:
- Count n-grams in a corpus.
- Estimate probability of next word given context from counts.
Problem: sparsity. Many reasonable n-grams never appear, especially for large n, so naive counts give too many zeros.
Fix: smoothing, commonly Kneser-Ney (as in KenLM). Back off to shorter contexts when counts are weak. Interpolate across context lengths.
How it is used for filtering:
- Train KenLM on high-quality text (often Wikipedia).
- Score raw text by probability or normalized perplexity.
- Lower perplexity means the text looks like the training corpus. Higher perplexity suggests junk.
It catches obvious garbage and non-language well, but it only sees short-range patterns. Repetition can fool it.
Example usage (CCNet, early LLaMA pipelines):
- Train KenLM on Wikipedia.
- Score Common Crawl paragraphs.
- Sort by perplexity and keep the best portion.
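A rough sketch of that pipeline using the kenlm Python bindings; the model path, the 5-gram order, and the keep fraction are placeholders, and real pipelines normalize text the same way the LM's training data was normalized.

```python
import kenlm

# Placeholder path: a Kneser-Ney smoothed n-gram model trained on Wikipedia text.
model = kenlm.Model("wikipedia_5gram.arpa")

def keep_best(paragraphs, keep_fraction=0.5):
    # Lower perplexity = closer to the Wikipedia-like training corpus.
    ranked = sorted(paragraphs, key=lambda p: model.perplexity(p))
    return ranked[: int(len(ranked) * keep_fraction)]
```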
FastText-Style Linear Classifiers for Filtering
FastText is a fast text classifier built around simple models that train and run cheaply.
Plain bag-of-words is a linear classifier on word counts, but the weight matrix gets huge.
FastText compresses this:
- Map words into low-dimensional embeddings.
- Average (or sum) embeddings to get a document vector.
- Apply a linear classifier on that vector.
It also uses hashed n-grams:
- Create word n-grams (like bigrams).
- Hash them into a fixed number of bins.
- Treat those bins like extra tokens.
Collisions happen, but the model learns usable average weights anyway.
Filtering setup (often binary):
- Positive examples: target set T.
- Negative examples: random samples from R.
- Train the classifier.
- Run it over R and keep documents with high "target" probability.
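A sketch of this binary setup with the fasttext package; the file name, label strings, hyperparameters, and the 0.9 cutoff are illustrative choices, not values from the lecture.

```python
import fasttext

# quality_train.txt (hypothetical) holds one example per line:
#   __label__target  <document from T>
#   __label__random  <document sampled from R>
model = fasttext.train_supervised(
    input="quality_train.txt",
    wordNgrams=2,  # hashed word bigrams, as described above
    dim=64,        # small embedding dimension keeps the model cheap
    epoch=5,
)

def p_target(doc: str) -> float:
    # fastText predicts on a single line, so strip newlines first.
    labels, probs = model.predict(doc.replace("\n", " "), k=2)
    return float(dict(zip(labels, probs)).get("__label__target", 0.0))

def filter_raw(raw_docs, cutoff=0.9):
    return [doc for doc in raw_docs if p_target(doc) > cutoff]
```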
You can use heavier models, but the economics are brutal. If you keep only 1% of the web, the filter runs over 100x more tokens than the model ever trains on, so its forward pass must be orders of magnitude cheaper than main-model training.
Importance Sampling for Data Selection
Instead of a hard classifier, you can match distributions.
Importance sampling idea:
You want samples from target distribution p, but you can sample from q.
- Sample x from q, weight it by w(x) = p(x) / q(x).
- Resample by w to approximate sampling from p.
Data selection framing:
T are samples from p (the target distribution). R are samples from q (the raw distribution). Estimate p(x) and q(x), compute w(x), then sample from R proportional to w(x).
Challenge: T is small, so fitting a rich p is hard.
Cheap practical trick:
- Hash tokens or n-grams into buckets.
- Count bucket frequencies in T and in R.
- Estimate bucket probabilities for p and q (with smoothing).
- Score a document by multiplying its hashed-token probabilities, then form w = p/q.
This tends to preserve diversity better than a sharp in/out classifier because it tries to match the full shape of T, not just a boundary.
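A rough sketch of the hashed-bucket version (in the spirit of DSIR-style selection); the bucket count, add-alpha smoothing, and the unigram-plus-bigram choice are arbitrary illustrations.

```python
import hashlib
import math
import random
from collections import Counter

NUM_BUCKETS = 10_000  # arbitrary size for the sketch

def hashed_ngrams(doc):
    # Hash unigrams and bigrams into a fixed number of buckets.
    words = doc.lower().split()
    grams = words + [" ".join(pair) for pair in zip(words, words[1:])]
    return [int(hashlib.md5(g.encode()).hexdigest(), 16) % NUM_BUCKETS for g in grams]

def bucket_logprobs(docs, alpha=1.0):
    # Smoothed bucket frequencies -> log probabilities.
    counts = Counter()
    for d in docs:
        counts.update(hashed_ngrams(d))
    total = sum(counts.values()) + alpha * NUM_BUCKETS
    return [math.log((counts[b] + alpha) / total) for b in range(NUM_BUCKETS)]

def resample(target_docs, raw_docs, k):
    log_p = bucket_logprobs(target_docs)  # estimate of p from the small set T
    log_q = bucket_logprobs(raw_docs)     # estimate of q from the raw set R
    # Log importance weight of a document: sum of log p(b) - log q(b) over its buckets.
    log_w = [sum(log_p[b] - log_q[b] for b in hashed_ngrams(d)) for d in raw_docs]
    shift = max(log_w)  # subtract the max before exponentiating for numerical stability
    weights = [math.exp(lw - shift) for lw in log_w]
    return random.choices(raw_docs, weights=weights, k=k)
```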
Summary of the three methods:
- KenLM: generative model trained on T. Keep text with low perplexity.
- fastText: discriminative model. Keep text with high p(target | document).
- Importance sampling: estimate p and q cheaply. Resample R by w = p/q.
All still follow the same pattern: learn "how much this looks like T," then select based on that score.
What "Good" Means Here
These cheap filters mostly see local word patterns. They do not reliably judge long-range coherence, factual correctness, or deep structure. Shuffling sentences can leave n-gram statistics mostly intact. Adversarial text can fool them.
They are useful as early, cheap stages to drop the worst of the web. Deeper quality signals usually come later.
Language Identification Using the Same Machinery
Goal: keep one language (often English) from a mixed-language crawl.
Why this matters:
Cleaning many languages is hard. If compute is fixed and you care about one language, spending tokens on others can reduce performance in the main language.
Example: BLOOM's training mix was only about 30% English, and it shows tradeoffs on English tasks versus an English-only model of the same size.
Practical tool:
FastText provides an off-the-shelf language ID model trained on multilingual sources. It outputs language probabilities for a piece of text. Pipelines often keep pages where P(English) is above a threshold (0.5, for example).
It works well on normal sentences. It struggles on short snippets, code, formulas, dialect, or code-switching. It is still a strong first pass before heavier processing.
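A minimal sketch with fastText's published language ID model (lid.176.bin); the local path and the 0.5 cutoff are placeholders.

```python
import fasttext

lid = fasttext.load_model("lid.176.bin")  # path to the downloaded model is a placeholder

def is_english(text: str, min_prob: float = 0.5) -> bool:
    # predict() works on a single line, so strip newlines first.
    labels, probs = lid.predict(text.replace("\n", " "))
    return labels[0] == "__label__en" and probs[0] >= min_prob
```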
Quality Filtering with Simple Models and LLM Help
"Quality" can mean grammar, coherence, informativeness, low spam, educational value. Simple models approximate it.
GPT-3-Style Pipeline
Positive: curated sources (books, Wikipedia, WebText), excluding Common Crawl. Negative: random Common Crawl. Train a linear classifier. Filter Common Crawl by the classifier score, sometimes with stochastic sampling tied to the score rather than a hard cutoff.
LLaMA (v1-style)
Positive: pages linked from Wikipedia (not Wikipedia itself). Negative: random Common Crawl pages. Train a classifier and keep pages it labels positive.
Phi-1 Example (Python Data)
- Start with a Python subset of "The Stack."
- Ask GPT-4: "How educational is this file for a student learning basic coding concepts?"
- Label about 100k files to define a target: "educational."
- Compute embeddings with a pre-trained model.
- Train a random forest on embeddings to imitate GPT-4's labels.
- Run the cheap classifier over the full Python subset and keep what it selects.
Reported outcome:
Unfiltered subset: about 12% on HumanEval after 96k steps. GPT-4-filtered subset: about 17% after 36k steps.
Core idea:
- Use an expensive model on a small sample to define the signal.
- Distill that signal into a cheap filter.
- Use the cheap filter at web scale.
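A sketch of the distillation step, assuming you already have embeddings from some pre-trained encoder and 0/1 "educational" labels from the expensive LLM pass; the scikit-learn model and its settings are illustrative, not the exact phi-1 configuration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def train_cheap_filter(embeddings: np.ndarray, llm_labels: np.ndarray):
    # Fit a random forest that imitates the LLM's judgments on ~100k labeled files.
    clf = RandomForestClassifier(n_estimators=200, n_jobs=-1)
    clf.fit(embeddings, llm_labels)
    return clf

def apply_cheap_filter(clf, corpus_embeddings: np.ndarray, corpus_files):
    # The cheap classifier runs over the full corpus; the LLM never does.
    keep = clf.predict(corpus_embeddings) == 1
    return [f for f, k in zip(corpus_files, keep) if k]
```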
Toxicity Filtering
We also filter hateful, toxic, or NSFW content.
A common dataset: Jigsaw Toxic Comments (Wikipedia talk page comments labeled toxic, severe toxic, obscene, threat, insult, identity hate).
Typical approach (like Dolma-style pipelines):
- Train a fastText classifier for hate vs safe.
- Train another for NSFW vs safe.
- Run both over raw text.
- Drop or down-weight text with high hate or NSFW scores.
These models make mistakes, especially on context, but they work well as bulk filters.
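A sketch of how such filters are applied, assuming two fastText classifiers trained the same way as the quality classifier above; the model paths, label strings, and 0.4 cutoff are placeholders.

```python
import fasttext

hate_model = fasttext.load_model("hate_vs_safe.bin")  # hypothetical model files
nsfw_model = fasttext.load_model("nsfw_vs_safe.bin")

def is_clean(doc: str, max_score: float = 0.4) -> bool:
    text = doc.replace("\n", " ")
    bad_scores = []
    for model, bad_label in ((hate_model, "__label__hate"), (nsfw_model, "__label__nsfw")):
        labels, probs = model.predict(text, k=2)
        bad_scores.append(dict(zip(labels, probs)).get(bad_label, 0.0))
    return max(bad_scores) < max_score
```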
Why Deduplication Matters
The web has:
- Exact duplicates: identical text across mirrors and repeats.
- Near-duplicates: tiny edits to boilerplate, templates, localized pages, licenses, and mass-copied artifacts.
Deduplication matters because:
It saves compute by avoiding repeated training on the same text. It reduces memorization risk. Repetition increases the chance of verbatim recall, which raises copyright and privacy concerns.
Deduplication is not quality filtering:
Quality filtering: "this is bad, remove it." Deduplication: "this is fine, keep fewer copies."
Design choices:
- Unit: sentence, paragraph, fixed-length spans, or full documents.
- Match: exact, high overlap, or semantic similarity.
- Action: remove all but one copy, or reduce frequency.
The scaling problem is the point: naive pairwise similarity over billions of items is impossible, so we need near-linear methods.
Exact Deduplication with Hashing and Bloom Filters
Exact dedup is simple:
- Hash each unit (often a paragraph or span).
- Group by hash.
- Keep one per hash group.
C4 does this on three-sentence spans: if the span reappears, all but one copy are removed. This can create odd edits, but it scales.
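A minimal sketch of hash-based exact dedup over whatever unit you choose; the normalization (lowercase, strip) is an assumption, not part of the C4 recipe.

```python
import hashlib

def exact_dedup(units):
    # units: paragraphs, three-sentence spans, or documents.
    seen = set()
    kept = []
    for u in units:
        h = hashlib.md5(u.strip().lower().encode("utf-8")).hexdigest()
        if h not in seen:  # keep only the first copy of each hash
            seen.add(h)
            kept.append(u)
    return kept
```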
Bloom filters reduce memory and make membership queries fast.
Goal:
- No false negatives: if it says "not present," the item is definitely new.
- A tunable false positive rate: if it says "present," it might be wrong.
- Small memory footprint.
Bloom filter:
Bit array of length m, initially all zeros. k hash functions.
Insert(x): set the k hashed bit positions to 1.
Query(x): if any of the k bits is 0, x is definitely not in the set. If all are 1, x is probably in the set.
False positives come from collisions. With n items inserted, the false positive rate is roughly (1 - e^(-kn/m))^k, and for fixed m and n it is minimized at k ≈ (m/n) ln 2.
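A toy Bloom filter to make the mechanics concrete; production pipelines use optimized implementations, and the sizing formulas below are the standard ones, not specific to any one pipeline.

```python
import hashlib
import math

class BloomFilter:
    def __init__(self, n_items: int, fp_rate: float):
        # Standard sizing: m = -n ln(p) / (ln 2)^2 bits, k = (m / n) ln 2 hashes.
        self.m = math.ceil(-n_items * math.log(fp_rate) / (math.log(2) ** 2))
        self.k = max(1, round((self.m / n_items) * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, item: str):
        # Derive k hash positions by salting one hash function with an index.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item: str):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        # False: definitely new. True: probably seen (could be a false positive).
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))
```

Dedup loop: if a paragraph is not in the filter, keep it and add it; if it is, drop it, accepting the tiny false positive rate.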
Dolma example:
Use a Bloom filter for paragraph-level exact dedup. Tune false positive rate extremely low (around 10^-15) to avoid dropping new text.
Standard Bloom filters are insert-only, which fits one-pass pipelines.
Near-Duplicate Detection: Jaccard, MinHash, and LSH
Near-duplicates need a similarity measure and a way to find high-similarity pairs without comparing everything.
Common measure: Jaccard similarity over sets.
- Build a set per document (word shingles or n-grams).
- Jaccard(A, B) = |A ∩ B| / |A ∪ B|.
- Call two documents near-duplicates if Jaccard is above a high threshold (often around 0.99).
Directly computing Jaccard for all pairs is too expensive. MinHash turns it into hashing.
MinHash
- Pick a random permutation or hash over all possible set elements.
- Define h(S) as the element in S with the smallest hash value.
- Then P[h(A) = h(B)] = Jaccard(A, B).
Repeat with multiple independent hashes to estimate similarity.
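A small sketch of a MinHash signature built by salting one hash function; 128 hashes and MD5 are arbitrary choices.

```python
import hashlib

def minhash_signature(shingles: set, num_hashes: int = 128):
    # One MinHash value per seeded hash: the minimum hash over the set's elements.
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingles
        ))
    return sig

def estimated_jaccard(sig_a, sig_b) -> float:
    # The fraction of agreeing positions estimates the true Jaccard similarity.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```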
Locality-Sensitive Hashing with MinHash
LSH avoids pairwise comparisons:
- Compute n MinHash values per document (a signature).
- Split the signature into b bands of r rows, with n = b × r.
- Two documents become candidates if any band matches exactly (all r hashes in that band match).
If s is the Jaccard similarity:
- Probability a band matches is s^r.
- Probability no band matches is (1 - s^r)^b.
- Probability of at least one matching band is 1 - (1 - s^r)^b.
Tuning r and b sets the threshold behavior:
Higher r sharpens the cutoff and raises the effective similarity threshold. Higher b increases collisions at lower similarity.
This converts near-dedup into hashing and bucketing, then optional comparisons inside buckets.
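A sketch of the banding step plus the match-probability curve; the in-memory dictionaries are for illustration, since real pipelines shard this over many machines.

```python
from collections import defaultdict

def lsh_candidates(signatures: dict, b: int, r: int):
    # signatures: doc_id -> MinHash signature of length b * r.
    # Two documents become candidates if all r values in some band agree.
    candidates = set()
    for band in range(b):
        buckets = defaultdict(list)
        for doc_id, sig in signatures.items():
            key = tuple(sig[band * r:(band + 1) * r])
            buckets[key].append(doc_id)
        for ids in buckets.values():
            for i in range(len(ids)):
                for j in range(i + 1, len(ids)):
                    candidates.add((ids[i], ids[j]))
    return candidates

def match_probability(s: float, b: int, r: int) -> float:
    # Probability that at least one band matches when Jaccard similarity is s.
    return 1 - (1 - s ** r) ** b
```

For example, with b = 20 and r = 5, a pair at s = 0.8 becomes a candidate with probability about 0.9996, while a pair at s = 0.4 does so only about 19% of the time.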
Semantic Deduplication and Handling High-Quality Duplicates
Paraphrase duplicates can be semantically the same but lexically different, so n-gram Jaccard may miss them.
One approach: embed documents and use approximate nearest neighbors or LSH in embedding space. It costs more and can erase diversity if thresholds are loose.
High-quality text often appears many times across mirrors. Pure dedup keeps one copy, but many training recipes intentionally do multiple epochs over curated corpora (books, Wikipedia, proprietary data). A common policy is to deduplicate noisy web dumps aggressively while treating curated sets separately.
Frequency can carry signal. Instead of keeping n copies or just one, you can cap or smooth counts (keep something like log(n) copies, for example).
Deduplication is powerful, but beyond exact or near-exact matches it can easily remove useful diversity. Careful thresholds and policy matter.