CS336 Notes: Lecture 10 - Inference
LLM inference optimization: understanding the prefill vs decode split, KV cache management, speculative decoding, and why decode-time inference is fundamentally memory-bound.
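To make the KV cache idea concrete, here is a minimal sketch of single-head attention with a cache: prefill computes and stores K/V for the whole prompt once, and each decode step appends just one new K/V row instead of recomputing attention inputs for every past position. All names here (`attend`, `K_cache`, the toy head size `d`) are illustrative, not from the lecture.

```python
import numpy as np

d = 4  # toy head dimension
rng = np.random.default_rng(0)

def attend(q, K, V):
    """Scaled dot-product attention for a single query vector."""
    scores = K @ q / np.sqrt(d)      # one score per cached position, shape (t,)
    w = np.exp(scores - scores.max())
    w /= w.sum()                     # softmax over past positions
    return w @ V                     # weighted sum of cached values, shape (d,)

# Prefill: process the whole prompt once, caching K and V per position.
prompt_len = 3
K_cache = rng.normal(size=(prompt_len, d))
V_cache = rng.normal(size=(prompt_len, d))

# Decode: each generated token contributes exactly one new K/V row;
# the cost per step is one cache read + append, not a full recompute.
q_new, k_new, v_new = rng.normal(size=(3, d))
K_cache = np.vstack([K_cache, k_new])
V_cache = np.vstack([V_cache, v_new])
out = attend(q_new, K_cache, V_cache)
```

Because every decode step must stream the entire cache (which grows linearly with sequence length) through memory to produce a small amount of arithmetic, this loop is memory-bandwidth-bound rather than FLOP-bound.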