CS336 Notes: Lecture 17 - Alignment, RL 2
RL foundations for LLMs: policy gradients, baselines for variance reduction, GRPO implementation details, and practical training considerations for reasoning models.
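As a concrete reference for the policy-gradient-with-baseline idea named above, here is a minimal PyTorch sketch. The function name, tensor shapes, and the choice of a batch-mean baseline are illustrative assumptions, not the lecture's exact code.

```python
import torch

def pg_loss_with_baseline(logprobs: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    """REINFORCE-style policy-gradient loss with a baseline (illustrative sketch).

    logprobs: (batch,) summed log-probability of each sampled completion
    rewards:  (batch,) scalar reward for each completion
    """
    # Subtracting a baseline that does not depend on the sampled action keeps
    # the gradient estimator unbiased while reducing its variance; the batch
    # mean reward is the simplest such baseline.
    advantages = rewards - rewards.mean()
    # Gradients flow only through logprobs; advantages act as fixed weights.
    return -(advantages.detach() * logprobs).mean()
```

Richer baselines (a learned value function in PPO, or GRPO's per-prompt group mean) follow the same pattern; only the advantage computation changes.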
Advanced RL for alignment: PPO implementation details, GRPO as a simpler alternative, overoptimization risks, and case studies from DeepSeek R1, Kimi K1.5, and Qwen 3.
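To make GRPO's "simpler alternative" concrete, here is a sketch of its two core pieces as described in the DeepSeekMath/R1 line of work: rewards are normalized within a group of completions sampled from the same prompt (replacing PPO's learned value baseline), then plugged into the PPO-style clipped surrogate. Shapes, the epsilon value, and function names are assumptions for illustration.

```python
import torch

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Group-relative advantages: normalize rewards within each group of
    completions drawn from the same prompt.

    rewards: (num_prompts, group_size)
    """
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    # Small epsilon guards against zero std when all rewards in a group tie.
    return (rewards - mean) / (std + 1e-6)

def clipped_surrogate(logprobs: torch.Tensor,
                      old_logprobs: torch.Tensor,
                      advantages: torch.Tensor,
                      eps: float = 0.2) -> torch.Tensor:
    """PPO-style clipped objective, which GRPO reuses with group advantages."""
    ratio = torch.exp(logprobs - old_logprobs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # Clipping bounds the incentive to move far from the sampling policy,
    # one guard against the overoptimization risks discussed above.
    return -torch.minimum(unclipped, clipped).mean()
```

Dropping the value network is the main simplification over PPO: the group mean serves as the baseline, at the cost of sampling several completions per prompt.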