
Blog

Stanford-cs336

Tutorials · 12 min read

CS336 Notes: Lecture 17 - Alignment, RL 2

RL foundations for LLMs: policy gradients, baselines for variance reduction, GRPO implementation details, and practical training considerations for reasoning models.

Tutorials · 16 min read

CS336 Notes: Lecture 16 - Alignment, RL 1

Advanced RL for alignment: PPO implementation details, GRPO as a simpler alternative, overoptimization risks, and case studies from DeepSeek R1, Kimi K1.5, and Qwen 3.

Tutorials · 10 min read

CS336 Notes: Lecture 15 - Alignment, SFT and RLHF

Post-training for helpful assistants: supervised fine-tuning on instructions, safety tuning, RLHF with preference data, PPO vs DPO, and the challenges of learning from human feedback.

Tutorials · 11 min read

CS336 Notes: Lecture 14 - Data 2

Data filtering and deduplication at scale: n-gram language models, fastText classifiers, importance sampling, MinHash, LSH, and Bloom filters for efficient web-scale processing.

Tutorials · 13 min read

CS336 Notes: Lecture 13 - Data 1

Training data for LLMs: Common Crawl processing, quality filtering, the evolution of data pipelines from BERT to modern models, and the critical role of copyright and licensing.
