CS336 Notes: Lecture 17 - Alignment, RL 2
RL for language models has a simple goal: make the model pick answers that score well. The constraints: sparse rewards, high variance, and expensive inference.
Key Takeaways
In RL for language models, the state is the prompt plus all tokens so far. The action is the next token. Reward often arrives only at the end, and it is often verifiable as correct or incorrect.
Policy gradient methods raise expected reward by increasing the probability of actions that scored well. Naive policy gradient is high variance. Sparse rewards make learning stall.
Subtracting a baseline that depends only on the state keeps the gradient unbiased and cuts variance. If the baseline is the expected reward given the state, the update becomes an advantage update.
GRPO is a policy gradient method built for language models. For each prompt it samples a group of responses, uses the group's average reward as a baseline, then normalizes and clips updates using importance ratios, with an optional KL penalty.
A toy task (sorting numbers) shows how reward shape, baselines, and GRPO-style deltas change what the model learns. Partial credit can speed learning, but it can also create loopholes and local optima.
At scale, RL for language models is harder than supervised pretraining. It needs heavy inference, multiple frozen and reference models, tight variance control, and reward functions that do not get gamed.
Reinforcement Learning Setup for Language Models
We do RL on a language model.
State
The state S is the prompt plus the response so far. As tokens are generated, the state grows by appending each token.
Action
Each action is choosing the next token. For outcome rewards, it helps to treat the full response as one action A.
Reward
Reward measures the quality of the whole response. We focus on outcome rewards: reward depends on the full response, not each step. It is verifiable and deterministic, like parsing the final answer and comparing to a ground-truth key.
Transitions
In language models, transitions are simple and known: append the chosen token. In robotics, transitions come from the world and may be unknown.
State Space
In robotics, many states are physically impossible. In language models, any token sequence is reachable. The hard part is finding sequences that earn reward. This freedom also lets models invent internal scratchpads and chains of thought.
Policy and Objective
The policy π is a distribution over next tokens given the current state. We usually start from a pretrained LM and fine-tune with RL.
A Rollout
Start from a prompt. Sample a full response from π(·|S). Compute one reward for that response.
Objective
Maximize expected reward:
J(θ) = E_{S,A}[R(S,A)]
where S comes from prompts and A is sampled from πθ(·|S).
Naive Policy Gradient and the Sparse-Reward Problem
With outcome rewards and full-response action A, the policy gradient is:
∇θ J(θ) = E_{S,A}[R(S,A) ∇θ log πθ(A|S)]
A Simple SGD Estimate
- Sample S.
- Sample A from πθ(·|S).
- Compute R(S,A).
- Update with R(S,A) ∇θ log πθ(A|S).
This looks like supervised learning, except the "label" A comes from the model, and the update is scaled by reward. High-reward responses are pushed up more.
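A minimal sketch of this update in PyTorch, assuming hypothetical helpers sample_prompt, sample_response (which returns a response and its log probability under the policy), and reward_fn; a real loop batches this, but the scale-by-reward structure is the same:

```python
import torch

def reinforce_step(policy, optimizer, sample_prompt, sample_response, reward_fn):
    """One naive policy-gradient (REINFORCE) update from a single sample."""
    prompt = sample_prompt()                              # sample S
    response, log_prob = sample_response(policy, prompt)  # sample A ~ pi_theta(.|S), get log pi_theta(A|S)
    reward = reward_fn(prompt, response)                  # compute R(S, A)

    loss = -reward * log_prob                             # minimizing this ascends R * grad log pi
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return reward
```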
Why Sparse Rewards Stall
If reward is 1 for correct and 0 otherwise, then incorrect samples contribute nothing. If the policy almost never produces a correct answer, most gradients are zero and learning barely moves.
On-Policy Data Keeps Changing
Each update changes πθ, so the distribution of sampled (S, A, R) changes continuously. This can help the model climb toward better regions, but it makes training harder to analyze and monitor.
Verifiable Reward vs Human Feedback
Verifiable Outcome Reward
Compute reward by checking the final answer against a reference. Reward is often 0 or 1.
Human Feedback Reward
In RLHF, a reward model is trained from preference data and outputs a real-valued score. The reward is smoother, so the same RL algorithm can behave very differently and needs different tuning.
Baselines and Variance Reduction
Naive policy gradient has high variance. We want the same expected gradient, with lower variance.
Baseline Trick
Replace R(S,A) with R(S,A) - B(S), where B depends only on S:
E[(R(S,A) - B(S)) ∇θ log πθ(A|S)]
This stays unbiased because, for fixed S, the baseline term has zero expectation: E_A[∇θ log πθ(A|S)] = Σ_A ∇θ πθ(A|S) = ∇θ Σ_A πθ(A|S) = ∇θ 1 = 0.
So the baseline changes variance, not the expectation.
Toy Two-State Example
States S1 and S2 with actions A1 and A2:
S1,A1 → 11, S1,A2 → 9
S2,A1 → 0, S2,A2 → 2
A single sample like (S1,A2) with reward 9 can push toward A2 even though A1 is better. That is noise.
Choose baselines B(S1)=10, B(S2)=1:
S1,A1: 11-10 = 1, S1,A2: 9-10 = -1
S2,A1: 0-1 = -1, S2,A2: 2-1 = 1
The effective rewards are smaller and centered, so updates are less noisy.
Practical Baseline Choice
A common baseline is the expected reward given the state:
B(S) ≈ E[R(S,A) | S]
Value, Q, Advantage
V(S) = E[R | S]
Q(S,A) = E[R | S,A]
Advantage A(S,A) = Q(S,A) - V(S)
With a deterministic outcome reward on the full response, Q(S,A) equals R(S,A) for a sampled response, so R - V(S) serves as an advantage estimate.
General Policy Gradient Form
Most methods in this family fit:
∇θ J(θ) ≈ E[Δ(S,A) ∇θ log πθ(A|S)]
What changes is Δ(S,A):
- Naive: Δ = R
- Baseline: Δ = R - B(S)
- Advantage-based: Δ approximates A(S,A)
- GRPO-style: Δ is centered and normalized within a group, often with clipped ratios and optional KL regularization
GRPO and Group-Relative Baselines
GRPO is a PPO-style method adapted to language models.
Core Idea
For each prompt, sample K responses. Use their within-prompt statistics as a baseline. This works well for language models because you can reset to the same prompt and sample many answers easily.
Group Baseline
For one prompt, compute rewards R1…RK.
Baseline is the group mean:
mean_R = (1/K) Σ_i Ri
Centered deltas:
Δi = Ri - mean_R
Now responses better than the group average get positive Δ. Worse ones get negative Δ.
Normalization
Divide by the group standard deviation:
Δi = (Ri - mean_R) / (std_R + ε)
This makes updates less sensitive to reward scale and helps stability.
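A minimal sketch of the group baseline and normalization, assuming a rewards tensor of shape [batch, K] with one row of K sampled responses per prompt:

```python
import torch

def group_normalized_deltas(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: [batch, K] rewards for K responses per prompt.
    Returns deltas centered and scaled within each prompt's group."""
    mean_r = rewards.mean(dim=-1, keepdim=True)   # group mean baseline
    std_r = rewards.std(dim=-1, keepdim=True)     # group spread
    return (rewards - mean_r) / (std_r + eps)     # centered, scale-insensitive deltas
```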
Sorting Numbers Toy Task
Task
Prompt is a fixed-length list of n numbers. The model must output the same numbers sorted. Prompt and response length are fixed.
Reward Options
1. Sparse 0/1
Reward = 1 if the output exactly matches the sorted sequence, else 0. This matches the goal but is extremely sparse. A random policy almost always gets 0.
2. Partial Credit by Exact Positions
Reward = number of positions matching the correct sorted sequence.
Example, truth: 0 1 2 3
- Output 0 1 2 3 → 4
- Output 2 1 3 0 → 1
- Output 1 0 2 3 → 2
This gives signal but can treat very different wrong answers the same.
3. Richer Partial Credit: Inclusion + Local Order
Reward = inclusion_score + adjacency_score.
Inclusion_score counts how many prompt tokens appear somewhere in the response. Adjacency_score counts how many adjacent pairs in the response are in sorted order.
This gives denser signal, but can be hacked by patterns that score well without truly solving sorting.
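Sketches of the three reward options, assuming prompt and response are plain Python lists of integers; the names and exact scoring details are illustrative, not the lecture's code:

```python
def sparse_reward(prompt, response):
    """Option 1: 1 if the response is exactly the sorted prompt, else 0."""
    return 1.0 if response == sorted(prompt) else 0.0

def position_match_reward(prompt, response):
    """Option 2: number of positions that match the sorted prompt."""
    target = sorted(prompt)
    return float(sum(r == t for r, t in zip(response, target)))

def inclusion_plus_adjacency_reward(prompt, response):
    """Option 3: inclusion (prompt tokens that appear in the response)
    plus local order (adjacent response pairs already in sorted order)."""
    inclusion = sum(1 for x in prompt if x in response)
    adjacency = sum(1 for a, b in zip(response, response[1:]) if a <= b)
    return float(inclusion + adjacency)
```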
What the Toy Shows
Denser rewards help learning move. But they can also pull the model into easy, wrong strategies that score well.
A Simple Model for the Sorting Task
Instead of a transformer, use a small custom model.
Fixed prompt length and fixed output length. Vocabulary is the set of possible numbers. An embedding maps tokens to vectors. Position-specific parameters encode and decode positions. The model predicts each output position independently (non-autoregressive).
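A minimal sketch of such a model in PyTorch; the lecture's exact architecture may differ, and vocab_size, prompt_len, response_len, and dim are assumed hyperparameters. Mean-pooling the prompt embeddings keeps the multiset of numbers, which is all that sorting needs:

```python
import torch
import torch.nn as nn

class SortingPolicy(nn.Module):
    """Non-autoregressive toy policy: every output position is predicted
    independently from a pooled summary of the prompt."""
    def __init__(self, vocab_size: int, prompt_len: int, response_len: int, dim: int = 64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)                            # token embeddings
        self.encode_pos = nn.Parameter(0.02 * torch.randn(prompt_len, dim))   # prompt position params
        self.decode_pos = nn.Parameter(0.02 * torch.randn(response_len, dim)) # output position params
        self.out = nn.Linear(dim, vocab_size)                                 # per-position decoder

    def forward(self, prompts: torch.Tensor) -> torch.Tensor:
        """prompts: [batch, prompt_len] token ids -> logits [batch, response_len, vocab_size]."""
        h = self.embed(prompts) + self.encode_pos          # [batch, prompt_len, dim]
        pooled = h.mean(dim=1)                             # [batch, dim] prompt summary
        per_pos = pooled.unsqueeze(1) + self.decode_pos    # [batch, response_len, dim]
        return self.out(per_pos)                           # independent logits per output position
```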
Sampling Responses
Given a prompt:
- Compute logits for each output position.
- Convert to probabilities.
- Sample one token per position independently.
- Repeat to get multiple samples per prompt.
Responses have shape [batch, num_samples, positions].
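A sketch of the sampling step, assuming a model like the SortingPolicy sketch above that returns logits of shape [batch, positions, vocab]:

```python
import torch

@torch.no_grad()
def sample_responses(model, prompts: torch.Tensor, num_samples: int) -> torch.Tensor:
    """prompts: [batch, prompt_len] -> sampled token ids [batch, num_samples, positions]."""
    logits = model(prompts)                           # [batch, positions, vocab]
    probs = torch.softmax(logits, dim=-1)             # per-position distributions
    batch, positions, vocab = probs.shape
    flat = probs.reshape(batch * positions, vocab)
    samples = torch.multinomial(flat, num_samples, replacement=True)  # [batch*positions, num_samples]
    samples = samples.reshape(batch, positions, num_samples)
    return samples.permute(0, 2, 1)                   # [batch, num_samples, positions]
```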
Computing Rewards and Deltas
Compute a reward matrix R with shape [batch, num_samples] by applying the reward function to each sampled response.
Delta variants:
- Raw: Δ = R
- Centered: Δi = Ri - mean(R within prompt)
- Standardized: Δi = (Ri - mean) / (std + small_constant)
A Max-Only Heuristic
Within a prompt's group, keep Δ only for responses with the maximum reward. Set others to 0. This can reduce drift toward mediocre partial-credit modes, but it changes the learning signal.
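A sketch of the max-only masking, assuming deltas and rewards tensors of shape [batch, num_samples]:

```python
import torch

def keep_only_max(deltas: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    """Zero the deltas of every response that did not reach its group's maximum reward."""
    group_max = rewards.max(dim=-1, keepdim=True).values   # [batch, 1]
    is_best = (rewards == group_max).float()               # 1.0 only for max-reward responses
    return deltas * is_best
```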
Log Probabilities of Sampled Responses
For policy gradient you need log πθ(A|S) for the sampled responses.
- Run the model to get logits.
- Compute log_probs via log_softmax.
- Gather the log_probs for the sampled token indices at each position.
This yields log_probs with shape [batch, num_samples, positions].
Because reward is one number per response, the same Δ for that response is applied to all its positions in the loss.
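A sketch of the gather step, assuming the model's logits have shape [batch, positions, vocab] and responses have shape [batch, num_samples, positions]:

```python
import torch
import torch.nn.functional as F

def response_log_probs(model, prompts: torch.Tensor, responses: torch.Tensor) -> torch.Tensor:
    """Returns per-token log probs of the sampled responses: [batch, num_samples, positions]."""
    logits = model(prompts)                                # [batch, positions, vocab]
    log_probs = F.log_softmax(logits, dim=-1)              # normalize over the vocabulary
    num_samples = responses.shape[1]
    log_probs = log_probs.unsqueeze(1).expand(-1, num_samples, -1, -1)  # align with samples
    return torch.gather(log_probs, -1, responses.unsqueeze(-1)).squeeze(-1)
```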
Naive Loss and GRPO Modifications
Naive Policy Gradient Loss
L = -E[Δ * log πθ(A|S)]
where log πθ(A|S) is the sum of per-token log_probs.
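With per-token log probs and one Δ per response in hand, the naive loss is a couple of lines (shapes as in the sketches above):

```python
import torch

def naive_pg_loss(token_log_probs: torch.Tensor, deltas: torch.Tensor) -> torch.Tensor:
    """token_log_probs: [batch, num_samples, positions]; deltas: [batch, num_samples]."""
    logp_response = token_log_probs.sum(dim=-1)   # log pi(A|S) = sum of per-token log probs
    return -(deltas * logp_response).mean()       # negative so that minimizing ascends reward
```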
Importance Ratios Need an Old Policy
GRPO and PPO use a ratio between the current policy and an old snapshot:
r = πθ(A|S) / π_old(A|S) = exp(logp_current - logp_old)
If you compute both with the same live parameters and backprop through both, r is identically 1 and its gradient is zero, so the update vanishes. logp_old must therefore be treated as a constant, using detach or no_grad.
Clipped Objective
- Compute r per response.
- Compute a clipped ratio clipped_r = clip(r, 1-ε, 1+ε).
- Use the minimum of rΔ and clipped_rΔ (for the maximize-reward objective), then negate it to form the minimization loss.
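A sketch of the ratio and clipping step, assuming per-response log probs (per-token values already summed) under the current policy and a frozen snapshot, with the snapshot's values detached:

```python
import torch

def grpo_clipped_loss(logp_current: torch.Tensor, logp_old: torch.Tensor,
                      deltas: torch.Tensor, clip_eps: float = 0.2) -> torch.Tensor:
    """All inputs: [batch, num_samples]. logp_old comes from the old policy snapshot."""
    ratio = torch.exp(logp_current - logp_old.detach())           # importance ratio r
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)  # clip(r, 1-ε, 1+ε)
    surrogate = torch.minimum(ratio * deltas, clipped * deltas)   # min of rΔ and clipped_rΔ
    return -surrogate.mean()                                      # negate for minimization
```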
KL Penalty to a Reference Policy
Add a KL term that pulls the policy toward a reference model π_ref:
L_total = L_GRPO + λ * KL(πθ || π_ref)
KL Estimation Trick
KL(P || Q) is E_P[log(P/Q)]. A lower-variance unbiased estimator, computed on samples from P (here, the current policy):
E_P[Q/P - log(Q/P) - 1]
In code, compute per token and average over batch, samples, and positions.
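A sketch of that estimator per token, assuming per-token log probs under the current policy and the reference, with the samples drawn from the current policy:

```python
import torch

def kl_to_reference(logp_policy: torch.Tensor, logp_ref: torch.Tensor) -> torch.Tensor:
    """Both inputs: [batch, num_samples, positions] per-token log probs.
    Implements ref/policy - log(ref/policy) - 1, averaged over all tokens."""
    log_ratio = logp_ref - logp_policy             # log(Q/P) per token
    kl = torch.exp(log_ratio) - log_ratio - 1.0    # nonnegative, unbiased estimate of KL(P || Q)
    return kl.mean()
```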
Training Loop Structure
A Typical GRPO-Style Loop
- Sample prompts.
- For each prompt, sample multiple responses from the current policy (or a frozen snapshot).
- Compute rewards for all responses.
- Compute Δ using raw, centered, normalized, or another scheme.
- Compute and store logp_old for those responses using π_old, and detach.
- If using KL, compute log_probs under π_ref.
Then do several gradient steps on the same sampled responses:
- Recompute logp_current under πθ.
- Compute ratios, apply clipping, apply Δ.
- Add KL penalty if used.
- Backprop and update θ.
Then refresh snapshots:
Update π_old to the current policy for the next batch's ratios. Update π_ref less often if it is meant to move slowly.
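Putting the earlier sketches together, one compressed GRPO-style outer iteration might look like the following; sample_responses, group_normalized_deltas, response_log_probs, grpo_clipped_loss, and kl_to_reference are the hypothetical helpers above, and compute_rewards is assumed to apply the reward function to every sampled response and return a [batch, num_samples] tensor:

```python
import copy
import torch

def grpo_iteration(policy, ref_policy, optimizer, prompts, reward_fn,
                   num_samples=8, inner_steps=4, clip_eps=0.2, kl_coef=0.01):
    """One outer iteration: sample once from a snapshot, then take several gradient steps."""
    old_policy = copy.deepcopy(policy)             # frozen snapshot for the importance ratios

    with torch.no_grad():
        responses = sample_responses(old_policy, prompts, num_samples)      # [batch, S, positions]
        rewards = compute_rewards(prompts, responses, reward_fn)            # [batch, S]
        deltas = group_normalized_deltas(rewards)                           # centered, normalized
        logp_old = response_log_probs(old_policy, prompts, responses).sum(dim=-1)

    for _ in range(inner_steps):
        token_logp = response_log_probs(policy, prompts, responses)         # live policy, with grad
        loss = grpo_clipped_loss(token_logp.sum(dim=-1), logp_old, deltas, clip_eps)
        with torch.no_grad():
            token_logp_ref = response_log_probs(ref_policy, prompts, responses)
        loss = loss + kl_coef * kl_to_reference(token_logp, token_logp_ref)  # pull toward reference
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    return rewards.mean().item()                   # mean reward of this batch's samples
```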
System and Memory Costs
Inference dominates cost because you must sample many responses. You often manage multiple models or views: current policy, old policy (or stored old log_probs), and a reference policy for KL.
Keeping full copies costs memory. Storing old log_probs can save memory.
Behavior of Different Delta Choices
Raw Rewards
Mean reward can improve, but the model may settle for patterns that earn moderate reward without fully solving the task.
Centered Rewards
Within each prompt, it pushes up the best samples and pushes down the worst. Mean reward often improves more. If all samples have the same reward, deltas are zero and that prompt contributes no gradient.
Normalized Rewards
Often similar in small runs, but it helps stability by making update size less tied to reward scale.
Loss is Hard to Read
Because training is on-policy and the data distribution keeps changing, loss trends are not as reliable as in supervised learning. Reward, especially on held-out prompts, is the signal that matters.
Reward Design and Credit Assignment
Reward Design is Fragile
Too sparse and learning stalls. Too generous with partial credit and the model can get stuck in local optima or exploit loopholes.
Credit Assignment is Blunt with Outcome Rewards
The same reward is assigned to every token in the response. There is no built-in way to tell which token helped. Process rewards could help, but they are hard to design for language reasoning.
Engineering Challenges at Scale
RL for language models is more complex than supervised pretraining.
You may need: a policy model being updated, a reward system (verifiable checker or reward model), an old policy snapshot or stored old log_probs for ratios, a reference policy for KL, sometimes a critic/value model in other methods.
You also need distributed sampling and reward computation, model synchronization across workers, and careful variance control.
Inference cost and multiple model copies drive the difficulty.
Why RL Matters
RL lets you optimize for what you can measure, not just what you can imitate.
If the reward captures the behavior you want, RL can push the model toward things that are rare or missing in pretraining data.
Success rests on three things:
- A reward that matches the goal and resists hacking.
- An optimizer that stays stable (baselines, clipping, KL, and similar tools).
- A system that can run sampling and training at scale.