CS336 Notes: Lecture 17 - Alignment, RL 2
RL for language models has a simple goal: make the model pick answers that score well. The constraints: sparse rewards, high variance, and expensive inference.
Key Takeaways
In RL for language models, the state is the prompt plus all tokens so far. The action is the next token. Reward often arrives only at the end, and it is often verifiable as correct or incorrect.
Policy gradient methods raise expected reward by increasing the probability of actions that scored well. Naive policy gradient is high variance. Sparse rewards make learning stall.
Subtracting a baseline that depends only on the state keeps the gradient unbiased and cuts variance. If the baseline is the expected reward given the state, the update becomes an advantage update.
GRPO is a policy gradient method built for language models. For each prompt it samples a group of responses, uses the group's average reward as a baseline, then normalizes and clips updates using importance ratios, with an optional KL penalty.
A toy task (sorting numbers) shows how reward shape, baselines, and GRPO-style deltas change what the model learns. Partial credit can speed learning, but it can also create loopholes and local optima.
At scale, RL for language models is harder than supervised pretraining. It needs heavy inference, multiple frozen and reference models, tight variance control, and reward functions that do not get gamed.
Reinforcement Learning Setup for Language Models
We do RL on a language model.
State
The state S is the prompt plus the response so far. As tokens are generated, the state grows by appending each token.
Action
Each action is choosing the next token. For outcome rewards, it helps to treat the full response as one action A.
Reward
Reward measures the quality of the whole response. We focus on outcome rewards: reward depends on the full response, not each step. It is verifiable and deterministic, like parsing the final answer and comparing to a ground-truth key.
Transitions
In language models, transitions are simple and known: append the chosen token. In robotics, transitions come from the world and may be unknown.
State Space
In robotics, many states are physically impossible. In language models, any token sequence is reachable. The hard part is finding sequences that earn reward. This freedom also lets models invent internal scratchpads and chains of thought.
Policy and Objective
The policy π is a distribution over next tokens given the current state. We usually start from a pretrained LM and fine-tune with RL.
A Rollout
Start from a prompt. Sample a full response from π(·|S). Compute one reward for that response.
Objective
Maximize expected reward:
J(θ) = E_{S,A}[R(S,A)]
where S comes from prompts and A is sampled from πθ(·|S).
Naive Policy Gradient and the Sparse-Reward Problem
With outcome rewards and full-response action A, the policy gradient is:
∇θ J(θ) = E_{S,A}[R(S,A) ∇θ log πθ(A|S)]
A Simple SGD Estimate
- Sample S.
- Sample A from πθ(·|S).
- Compute R(S,A).
- Update with R(S,A) ∇θ log πθ(A|S).
This looks like supervised learning, except the "label" A comes from the model, and the update is scaled by reward. High-reward responses are pushed up more.
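A minimal sketch of this update in PyTorch, assuming hypothetical helpers sample_prompt, sample_response (which returns a response and its log probability under the policy), and reward_fn; a real loop batches this, but the scale-by-reward structure is the same:

```python
import torch

def reinforce_step(policy, optimizer, sample_prompt, sample_response, reward_fn):
    """One naive policy-gradient (REINFORCE) update from a single sample."""
    prompt = sample_prompt()                              # sample S
    response, log_prob = sample_response(policy, prompt)  # sample A ~ pi_theta(.|S), get log pi_theta(A|S)
    reward = reward_fn(prompt, response)                  # compute R(S, A)

    loss = -reward * log_prob                             # minimizing this ascends R * grad log pi
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return reward
```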
Why Sparse Rewards Stall
If reward is 1 for correct and 0 otherwise, then incorrect samples contribute nothing. If the policy almost never produces a correct answer, most gradients are zero and learning barely moves.
On-Policy Data Keeps Changing
Each update changes πθ, so the distribution of sampled (S, A, R) changes continuously. This can help the model climb toward better regions, but it makes training harder to analyze and monitor.
Verifiable Reward vs Human Feedback
Verifiable Outcome Reward
Compute reward by checking the final answer against a reference. Reward is often 0 or 1.
Human Feedback Reward
In RLHF, a reward model is trained from preference data and outputs a real-valued score. The reward is smoother, so the same RL algorithm can behave very differently and needs different tuning.
Baselines and Variance Reduction
Naive policy gradient has high variance. We want the same expected gradient, with lower variance.
Baseline Trick
Replace R(S,A) with R(S,A) - B(S), where B depends only on S:
E[(R(S,A) - B(S)) ∇θ log πθ(A|S)]
This stays unbiased because, for fixed S, the baseline term has zero expectation: E_A[∇θ log πθ(A|S)] = Σ_A ∇θ πθ(A|S) = ∇θ Σ_A πθ(A|S) = ∇θ 1 = 0.
So the baseline changes variance, not the expectation.
Toy Two-State Example
States S1 and S2 with actions A1 and A2:
S1,A1 → 11, S1,A2 → 9
S2,A1 → 0, S2,A2 → 2
A single sample like (S1,A2) with reward 9 can push toward A2 even though A1 is better. That is noise.
Choose baselines B(S1)=10, B(S2)=1:
S1,A1: 11-10 = 1, S1,A2: 9-10 = -1
S2,A1: 0-1 = -1, S2,A2: 2-1 = 1
The effective rewards are smaller and centered, so updates are less noisy.
Practical Baseline Choice
A common baseline is the expected reward given the state:
B(S) ≈ E[R(S,A) | S]
Value, Q, Advantage
V(S) = E[R | S]
Q(S,A) = E[R | S,A]
Advantage A(S,A) = Q(S,A) - V(S)
With a deterministic outcome reward on the full response, Q(S,A) equals R(S,A) for a sampled response, so R - V(S) serves as an advantage estimate.
General Policy Gradient Form
Most methods in this family fit:
∇θ J(θ) ≈ E[Δ(S,A) ∇θ log πθ(A|S)]
What changes is Δ(S,A):
- Naive: Δ = R
- Baseline: Δ = R - B(S)
- Advantage-based: Δ approximates A(S,A)
- GRPO-style: Δ is centered and normalized within a group, often with clipped ratios and optional KL regularization
GRPO and Group-Relative Baselines
GRPO is a PPO-style method adapted to language models.
Core Idea
For each prompt, sample K responses. Use their within-prompt statistics as a baseline. This works well for language models because you can reset to the same prompt and sample many answers easily.
Group Baseline
For one prompt, compute rewards R1…RK.
Baseline is the group mean:
mean_R = (1/K) Σ_i Ri
Centered deltas:
Δi = Ri - mean_R
Now responses better than the group average get positive Δ. Worse ones get negative Δ.
Normalization
Divide by the group standard deviation:
Δi = (Ri - mean_R) / (std_R + ε)
This makes updates less sensitive to reward scale and helps stability.
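A minimal sketch of the group baseline and normalization, assuming a rewards tensor of shape [batch, K] with one row of K sampled responses per prompt:

```python
import torch

def group_normalized_deltas(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: [batch, K] rewards for K responses per prompt.
    Returns deltas centered and scaled within each prompt's group."""
    mean_r = rewards.mean(dim=-1, keepdim=True)   # group mean baseline
    std_r = rewards.std(dim=-1, keepdim=True)     # group spread
    return (rewards - mean_r) / (std_r + eps)     # centered, scale-insensitive deltas
```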
Sorting Numbers Toy Task
Task
Prompt is a fixed-length list of n numbers. The model must output the same numbers sorted. Prompt and response length are fixed.
Reward Options
1. Sparse 0/1
Reward = 1 if the output exactly matches the sorted sequence, else 0. This matches the goal but is extremely sparse. A random policy almost always gets 0.
2. Partial Credit by Exact Positions
Reward = number of positions matching the correct sorted sequence.
Example, truth: 0 1 2 3
- Output 0 1 2 3 → 4
- Output 2 1 3 0 → 1
- Output 1 0 2 3 → 2
This gives signal but can treat very different wrong answers the same.
3. Richer Partial Credit: Inclusion + Local Order
Reward = inclusion_score + adjacency_score.
Inclusion_score counts how many prompt tokens appear somewhere in the response. Adjacency_score counts how many adjacent pairs in the response are in sorted order.
This gives denser signal, but can be hacked by patterns that score well without truly solving sorting.
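Sketches of the three reward options, assuming prompt and response are plain Python lists of integers; the names and exact scoring details are illustrative, not the lecture's code:

```python
def sparse_reward(prompt, response):
    """Option 1: 1 if the response is exactly the sorted prompt, else 0."""
    return 1.0 if response == sorted(prompt) else 0.0

def position_match_reward(prompt, response):
    """Option 2: number of positions that match the sorted prompt."""
    target = sorted(prompt)
    return float(sum(r == t for r, t in zip(response, target)))

def inclusion_plus_adjacency_reward(prompt, response):
    """Option 3: inclusion (prompt tokens that appear in the response)
    plus local order (adjacent response pairs already in sorted order)."""
    inclusion = sum(1 for x in prompt if x in response)
    adjacency = sum(1 for a, b in zip(response, response[1:]) if a <= b)
    return float(inclusion + adjacency)
```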
What the Toy Shows
Denser rewards help learning move. But they can also pull the model into easy, wrong strategies that score well.
A Simple Model for the Sorting Task
Instead of a transformer, use a small custom model.
Fixed prompt length and fixed output length. Vocabulary is the set of possible numbers. An embedding maps tokens to vectors. Position-specific parameters encode and decode positions. The model predicts each output position independently (non-autoregressive).
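A minimal sketch of such a model in PyTorch; the lecture's exact architecture may differ, and vocab_size, prompt_len, response_len, and dim are assumed hyperparameters. Mean-pooling the prompt embeddings keeps the multiset of numbers, which is all that sorting needs:

```python
import torch
import torch.nn as nn

class SortingPolicy(nn.Module):
    """Non-autoregressive toy policy: every output position is predicted
    independently from a pooled summary of the prompt."""
    def __init__(self, vocab_size: int, prompt_len: int, response_len: int, dim: int = 64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)                            # token embeddings
        self.encode_pos = nn.Parameter(0.02 * torch.randn(prompt_len, dim))   # prompt position params
        self.decode_pos = nn.Parameter(0.02 * torch.randn(response_len, dim)) # output position params
        self.out = nn.Linear(dim, vocab_size)                                 # per-position decoder

    def forward(self, prompts: torch.Tensor) -> torch.Tensor:
        """prompts: [batch, prompt_len] token ids -> logits [batch, response_len, vocab_size]."""
        h = self.embed(prompts) + self.encode_pos          # [batch, prompt_len, dim]
        pooled = h.mean(dim=1)                             # [batch, dim] prompt summary
        per_pos = pooled.unsqueeze(1) + self.decode_pos    # [batch, response_len, dim]
        return self.out(per_pos)                           # independent logits per output position
```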
Sampling Responses
Given a prompt:
- Compute logits for each output position.
- Convert to probabilities.
- Sample one token per position independently.
- Repeat to get multiple samples per prompt.
Responses have shape [batch, num_samples, positions].
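A sketch of the sampling step, assuming a model like the SortingPolicy sketch above that returns logits of shape [batch, positions, vocab]:

```python
import torch

@torch.no_grad()
def sample_responses(model, prompts: torch.Tensor, num_samples: int) -> torch.Tensor:
    """prompts: [batch, prompt_len] -> sampled token ids [batch, num_samples, positions]."""
    logits = model(prompts)                           # [batch, positions, vocab]
    probs = torch.softmax(logits, dim=-1)             # per-position distributions
    batch, positions, vocab = probs.shape
    flat = probs.reshape(batch * positions, vocab)
    samples = torch.multinomial(flat, num_samples, replacement=True)  # [batch*positions, num_samples]
    samples = samples.reshape(batch, positions, num_samples)
    return samples.permute(0, 2, 1)                   # [batch, num_samples, positions]
```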
Computing Rewards and Deltas
Compute a reward matrix R with shape [batch, num_samples] by applying the reward function to each sampled response.
Delta variants:
- Raw: Δ = R
- Centered: Δi = Ri - mean(R within prompt)
- Standardized: Δi = (Ri - mean) / (std + small_constant)
A Max-Only Heuristic
Within a prompt's group, keep Δ only for responses with the maximum reward. Set others to 0. This can reduce drift toward mediocre partial-credit modes, but it changes the learning signal.
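A sketch of the max-only masking, assuming deltas and rewards tensors of shape [batch, num_samples]:

```python
import torch

def keep_only_max(deltas: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    """Zero the deltas of every response that did not reach its group's maximum reward."""
    group_max = rewards.max(dim=-1, keepdim=True).values   # [batch, 1]
    is_best = (rewards == group_max).float()               # 1.0 only for max-reward responses
    return deltas * is_best
```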
Log Probabilities of Sampled Responses
For policy gradient you need log πθ(A|S) for the sampled responses.
- Run the model to get logits.
- Compute log_probs via log_softmax.
- Gather the log_probs for the sampled token indices at each position.
This yields log_probs with shape [batch, num_samples, positions].
Because reward is one number per response, the same Δ for that response is applied to all its positions in the loss.
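A sketch of the gather step, assuming the model's logits have shape [batch, positions, vocab] and responses have shape [batch, num_samples, positions]:

```python
import torch
import torch.nn.functional as F

def response_log_probs(model, prompts: torch.Tensor, responses: torch.Tensor) -> torch.Tensor:
    """Returns per-token log probs of the sampled responses: [batch, num_samples, positions]."""
    logits = model(prompts)                                # [batch, positions, vocab]
    log_probs = F.log_softmax(logits, dim=-1)              # normalize over the vocabulary
    num_samples = responses.shape[1]
    log_probs = log_probs.unsqueeze(1).expand(-1, num_samples, -1, -1)  # align with samples
    return torch.gather(log_probs, -1, responses.unsqueeze(-1)).squeeze(-1)
```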
Naive Loss and GRPO Modifications
Naive Policy Gradient Loss
L = -E[Δ * log πθ(A|S)]
where log πθ(A|S) is the sum of per-token log_probs.
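With per-token log probs and one Δ per response in hand, the naive loss is a couple of lines (shapes as in the sketches above):

```python
import torch

def naive_pg_loss(token_log_probs: torch.Tensor, deltas: torch.Tensor) -> torch.Tensor:
    """token_log_probs: [batch, num_samples, positions]; deltas: [batch, num_samples]."""
    logp_response = token_log_probs.sum(dim=-1)   # log pi(A|S) = sum of per-token log probs
    return -(deltas * logp_response).mean()       # negative so that minimizing ascends reward
```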
Importance Ratios Need an Old Policy
GRPO and PPO use a ratio between the current policy and an old snapshot:
r = πθ(A|S) / π_old(A|S) = exp(logp_current - logp_old)
If you compute both with the same live parameters and backprop through both, r is identically 1 and its gradient is zero, so the update vanishes. logp_old must therefore be treated as a constant, using detach or no_grad.
Clipped Objective
- Compute r per response.
- Compute a clipped ratio clipped_r = clip(r, 1-ε, 1+ε).
- Use the minimum of rΔ and clipped_rΔ (for the maximize-reward objective), then negate it to form the minimization loss.
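A sketch of the ratio and clipping step, assuming per-response log probs (per-token values already summed) under the current policy and a frozen snapshot, with the snapshot's values detached:

```python
import torch

def grpo_clipped_loss(logp_current: torch.Tensor, logp_old: torch.Tensor,
                      deltas: torch.Tensor, clip_eps: float = 0.2) -> torch.Tensor:
    """All inputs: [batch, num_samples]. logp_old comes from the old policy snapshot."""
    ratio = torch.exp(logp_current - logp_old.detach())           # importance ratio r
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)  # clip(r, 1-ε, 1+ε)
    surrogate = torch.minimum(ratio * deltas, clipped * deltas)   # min of rΔ and clipped_rΔ
    return -surrogate.mean()                                      # negate for minimization
```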
KL Penalty to a Reference Policy
Add a KL term that pulls the policy toward a reference model π_ref:
L_total = L_GRPO + λ * KL(πθ || π_ref)
KL Estimation Trick
KL(P || Q) is E_P[log(P/Q)]. A lower-variance unbiased estimator, computed on samples from P (here, the current policy):
E_P[Q/P - log(Q/P) - 1]
In code, compute per token and average over batch, samples, and positions.
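A sketch of that estimator per token, assuming per-token log probs under the current policy and the reference, with the samples drawn from the current policy:

```python
import torch

def kl_to_reference(logp_policy: torch.Tensor, logp_ref: torch.Tensor) -> torch.Tensor:
    """Both inputs: [batch, num_samples, positions] per-token log probs.
    Implements ref/policy - log(ref/policy) - 1, averaged over all tokens."""
    log_ratio = logp_ref - logp_policy             # log(Q/P) per token
    kl = torch.exp(log_ratio) - log_ratio - 1.0    # nonnegative, unbiased estimate of KL(P || Q)
    return kl.mean()
```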
Training Loop Structure
A Typical GRPO-Style Loop
- Sample prompts.
- For each prompt, sample multiple responses from the current policy (or a frozen snapshot).
- Compute rewards for all responses.
- Compute Δ using raw, centered, normalized, or another scheme.
- Compute and store logp_old for those responses using π_old, and detach.
- If using KL, compute log_probs under π_ref.
Then do several gradient steps on the same sampled responses:
- Recompute logp_current under πθ.
- Compute ratios, apply clipping, apply Δ.
- Add KL penalty if used.
- Backprop and update θ.
Then refresh snapshots:
Update π_old to the current policy for the next batch's ratios. Update π_ref less often if it is meant to move slowly.
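Putting the earlier sketches together, one compressed GRPO-style outer iteration might look like the following; sample_responses, group_normalized_deltas, response_log_probs, grpo_clipped_loss, and kl_to_reference are the hypothetical helpers above, and compute_rewards is assumed to apply the reward function to every sampled response and return a [batch, num_samples] tensor:

```python
import copy
import torch

def grpo_iteration(policy, ref_policy, optimizer, prompts, reward_fn,
                   num_samples=8, inner_steps=4, clip_eps=0.2, kl_coef=0.01):
    """One outer iteration: sample once from a snapshot, then take several gradient steps."""
    old_policy = copy.deepcopy(policy)             # frozen snapshot for the importance ratios

    with torch.no_grad():
        responses = sample_responses(old_policy, prompts, num_samples)      # [batch, S, positions]
        rewards = compute_rewards(prompts, responses, reward_fn)            # [batch, S]
        deltas = group_normalized_deltas(rewards)                           # centered, normalized
        logp_old = response_log_probs(old_policy, prompts, responses).sum(dim=-1)

    for _ in range(inner_steps):
        token_logp = response_log_probs(policy, prompts, responses)         # live policy, with grad
        loss = grpo_clipped_loss(token_logp.sum(dim=-1), logp_old, deltas, clip_eps)
        with torch.no_grad():
            token_logp_ref = response_log_probs(ref_policy, prompts, responses)
        loss = loss + kl_coef * kl_to_reference(token_logp, token_logp_ref)  # pull toward reference
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    return rewards.mean().item()                   # mean reward of this batch's samples
```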
System and Memory Costs
Inference dominates cost because you must sample many responses. You often manage multiple models or views: current policy, old policy (or stored old log_probs), and a reference policy for KL.
Keeping full copies costs memory. Storing old log_probs can save memory.
Behavior of Different Delta Choices
Raw Rewards
Mean reward can improve, but the model may settle for patterns that earn moderate reward without fully solving the task.
Centered Rewards
Within each prompt, it pushes up the best samples and pushes down the worst. Mean reward often improves more. If all samples have the same reward, deltas are zero and that prompt contributes no gradient.
Normalized Rewards
Often similar in small runs, but it helps stability by making update size less tied to reward scale.
Loss is Hard to Read
Because training is on-policy and the data distribution keeps changing, loss trends are not as reliable as in supervised learning. Reward, especially on held-out prompts, is the signal that matters.
Reward Design and Credit Assignment
Reward Design is Fragile
Too sparse and learning stalls. Too generous with partial credit and the model can get stuck in local optima or exploit loopholes.
Credit Assignment is Blunt with Outcome Rewards
The same reward is assigned to every token in the response. There is no built-in way to tell which token helped. Process rewards could help, but they are hard to design for language reasoning.
Engineering Challenges at Scale
RL for language models is more complex than supervised pretraining.
You may need: a policy model being updated, a reward system (verifiable checker or reward model), an old policy snapshot or stored old log_probs for ratios, a reference policy for KL, sometimes a critic/value model in other methods.
You also need distributed sampling and reward computation, model synchronization across workers, and careful variance control.
Inference cost and multiple model copies drive the difficulty.
Why RL Matters
RL lets you optimize for what you can measure, not just what you can imitate.
If the reward captures the behavior you want, RL can push the model toward things that are rare or missing in pretraining data.
Success rests on three things:
- A reward that matches the goal and resists hacking.
- An optimizer that stays stable (baselines, clipping, KL, and similar tools).
- A system that can run sampling and training at scale.