CS336 Notes: Lecture 11 - Scaling Laws 2
Practical scaling: muP for hyperparameter transfer, WSD learning rate schedules, case studies from Cerebras-GPT, MiniCPM, and DeepSeek on compute-optimal training.
Read
Blog
Filter
Practical scaling: muP for hyperparameter transfer, WSD learning rate schedules, case studies from Cerebras-GPT, MiniCPM, and DeepSeek on compute-optimal training.
Understanding scaling laws: how loss depends on data, parameters, and compute, the Chinchilla tradeoff for compute-optimal training, and why power laws emerge in deep learning.