CS336 Notes: Lecture 11 - Scaling Laws 2
Practical scaling: muP for hyperparameter transfer, WSD learning rate schedules, case studies from Cerebras-GPT, MiniCPM, and DeepSeek on compute-optimal training.
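The summary above mentions WSD (warmup–stable–decay) learning-rate schedules, one of the practical-scaling tools this lecture covers. As a quick orientation, here is a minimal sketch of such a schedule in Python; the step counts, peak/minimum learning rates, and the cosine shape of the decay phase are illustrative assumptions, not values from the lecture.

```python
import math

def wsd_lr(step, max_lr=3e-4, warmup_steps=1_000,
           stable_steps=50_000, decay_steps=5_000, min_lr=3e-5):
    """Warmup-Stable-Decay schedule sketch (all defaults are hypothetical)."""
    if step < warmup_steps:
        # Warmup: linear ramp from 0 up to the peak learning rate.
        return max_lr * step / warmup_steps
    if step < warmup_steps + stable_steps:
        # Stable: hold the peak learning rate for the bulk of training.
        return max_lr
    # Decay: anneal from the peak down to min_lr over a short final phase
    # (cosine here; a linear or exponential decay would also fit the WSD shape).
    progress = min(1.0, (step - warmup_steps - stable_steps) / decay_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

The key property compared to a single cosine schedule is the long constant plateau: training can be stopped (or branched for a decay run) at many points along the plateau, which is what makes WSD convenient for the scaling-law experiments discussed in the lecture.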