CS336 Notes: Lecture 12 - Evaluation
What you measure shapes what gets built. Evaluation looks like bookkeeping. It is not. It steers the entire field.
Key Takeaways
There is no single correct evaluation. The right setup depends on the question you are trying to answer.
Every evaluation is a pipeline. You choose inputs, decide how to call the model, score outputs, then interpret the numbers. Each step introduces assumptions.
Tailoring prompts to a specific model can serve goals like realism or red-teaming efficiency. It also breaks fair comparisons across models. You gain realism or efficiency and lose clean measurement.
Perplexity still matters for pretraining and scaling laws. It does not capture what users care about in practice.
Many benchmarks look like exams. They saturate, encode bias, and get contaminated by training data. Strong scores can mean the model learned the benchmark, not the skill.
Instruction-following benchmarks and LLM-as-a-judge approaches score open-ended answers. Judges add bias and can be gamed. Longer answers often win even when they hallucinate more.
Agent benchmarks test multi-step workflows with tools. Models still struggle. Long-horizon planning exposes weaknesses that short-answer tests hide.
Safety benchmarks measure refusal. A model that refuses everything scores perfectly and is useless. Safety needs to pair with capability metrics.
Real use does not look like exams. Realistic evaluation needs real tasks. Privacy constraints make sharing hard.
Train-test overlap and label errors break benchmarks quietly. Many papers do not report decontamination checks.
Modern evaluation often measures whole systems, not just models. Clear rules and shared data matter for scientific comparison.
Why Evaluation Matters
When labs chase leaderboard metrics, teams change data, architecture, and training to move those numbers. What gets measured becomes the target.
Different groups use evaluation for different reasons:
- Users and companies choosing a model for a specific workflow.
- Researchers tracking capability and progress.
- Policymakers weighing benefits and risks.
- Developers needing fast feedback during training.
No single evaluation serves all of these goals. If you try to satisfy everyone, you satisfy no one.
What Evaluation Is Trying to Answer
"Evaluate the model" is not a question. Make it concrete:
- Which model fits my customer support workflow?
- Did this training method improve reasoning?
- Is this model safe enough to deploy?
Your evaluation setup must match the question. A single accuracy number without context can mislead.
A Framework for Thinking About Evaluation
1. Inputs
Where do prompts come from? Which use cases do they cover? Do you include rare, hard cases or only typical ones?
In multi-turn chat, later inputs depend on earlier outputs. That makes evaluation harder because you cannot fix the prompt sequence in advance.
2. How You Call the Model
Prompting strategy changes results dramatically. Zero-shot, few-shot, chain-of-thought, tool use, retrieval-augmented generation. Each shows a different aspect of capability.
Decide what you are evaluating. The bare language model? A tool-using agent? A system with scaffolding, routing, and multiple models?
3. Outputs and Metrics
Are reference answers clean and correct? Which metric matches the task: exact match, F1, pass@k, BLEU, ROUGE?
Do you include cost? Two models can score similarly and differ wildly in price and latency. If you ignore that, you miss what matters for deployment.
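As a concrete example of one such metric, here is a minimal sketch of the unbiased pass@k estimator popularized by the HumanEval paper, assuming n sampled completions per problem of which c pass the tests (the numbers in the usage line are made up):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples drawn
    without replacement from n generations passes, given c passing generations."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical numbers: 200 samples per problem, 17 pass, estimate pass@10
print(pass_at_k(n=200, c=17, k=10))  # ~0.60
```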
4. Interpreting the Numbers
What does 91% mean in practice? Is that deployable?
Is the score real generalization, or did training data overlap with the test set? Are you measuring a model, a system, or a change in method?
Adapting Inputs to the Model
Sometimes prompts are tailored to a specific model. This can help and also break comparability.
When adaptation helps:
In multi-turn chat, fixed scripts can push the model into unnatural states. Letting the model drive the conversation can be more realistic.
In red-teaming, adapting prompts based on the model's replies finds rare failures efficiently.
Costs of adaptation:
Model-specific inputs make cross-model comparisons less fair. You gain realism or efficiency. You lose clean apples-to-apples measurement.
Perplexity and Language Modeling
What Perplexity Is
A language model assigns probabilities to token sequences. Perplexity measures how well the model predicts tokens on a held-out dataset. Lower is better. Pretraining optimizes this objective.
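Concretely, perplexity is the exponentiated average negative log-likelihood per token. A minimal sketch, assuming you already have the model's per-token log-probabilities for the held-out text:

```python
import math

def perplexity(token_logprobs: list[float]) -> float:
    """exp of the average negative log-likelihood per token.
    token_logprobs[i] = log p(x_i | x_<i) for each held-out token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# A model that assigns probability 0.25 to every token has perplexity 4:
# it is "as confused as" a uniform choice over 4 options.
print(perplexity([math.log(0.25)] * 100))  # 4.0
```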
Historical Use
Earlier work compared n-gram and neural models on datasets like Penn Treebank and WikiText. Large neural models cut perplexity sharply. That helped drive scaling.
Shift with GPT-2 and GPT-3
GPT-2 trained on WebText and did well out-of-distribution on older perplexity benchmarks. That showed transfer. GPT-3 and later still report perplexity, but attention shifted toward downstream performance.
Why Perplexity Remains Useful
It gives smooth, fine-grained feedback through token probabilities. It underpins scaling laws connecting model size, data, compute, and loss. It can apply to downstream tasks by conditioning on prompts and scoring candidate answers.
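One widely used parametric form is the Chinchilla fit (Hoffmann et al., 2022), which models held-out loss as a function of parameter count N and training tokens D, with E the irreducible loss and A, B, alpha, beta fitted constants:

$$L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}$$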
Limits and Pitfalls
Perplexity does not always track capabilities, especially for small model changes.
Perplexity leaderboards require trust. If probabilities are invalid or manipulated, reported "improvements" can be faked.
Some argue matching the true data distribution solves everything. Others argue it can be inefficient and misaligned with real tasks.
Benchmarks Based on Cloze and Common Sense
Some benchmarks use multiple choice and score by comparing probabilities over completions. This is close in spirit to perplexity.
Examples:
- LAMBADA: fill in a missing word using long-range context.
- HellaSwag: choose the most plausible continuation as a proxy for common sense.
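A minimal sketch of this probability-based scoring, assuming a Hugging Face-style causal LM ("gpt2" below is just a placeholder model): score each candidate completion by its log-probability given the context and pick the argmax.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")            # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def completion_logprob(context: str, completion: str) -> float:
    """Sum of log p(completion tokens | context) under the model."""
    n_ctx = tok(context, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(context + completion, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    logprobs = F.log_softmax(logits[:, :-1], dim=-1)   # position t predicts token t+1
    per_token = logprobs.gather(-1, full_ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    # Note: token merging at the context/completion boundary can shift n_ctx by one;
    # real harnesses handle this (and often length-normalize) more carefully.
    return per_token[:, n_ctx - 1:].sum().item()

def pick_choice(context: str, choices: list[str]) -> int:
    scores = [completion_logprob(context, c) for c in choices]
    return max(range(len(choices)), key=lambda i: scores[i])
```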
Because many datasets draw from web-adjacent sources like WikiHow, train-test overlap is a concern for web-trained models.
Knowledge and Exam-Style Benchmarks
MMLU
MMLU spans 57 subjects with multiple-choice questions sourced from the web. It tests knowledge heavily. Reasoning without the right facts does not score well.
Early evaluations used few-shot prompting before instruction tuning. Prompt choice and example order affect scores. Instruction-tuned models often score well zero-shot now. Parts of the benchmark are saturated.
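For reference, the original MMLU setup built prompts roughly like the sketch below. The header wording and layout here are approximate, which is exactly why prompt choice moves scores.

```python
def format_mmlu_prompt(subject: str, shots: list[dict], question: dict) -> str:
    """Few-shot MMLU-style prompt. Each example dict holds
    {"question": str, "choices": list[str], "answer": "A".."D"}."""
    letters = "ABCD"

    def block(ex: dict, with_answer: bool) -> str:
        lines = [ex["question"]]
        lines += [f"{letters[i]}. {c}" for i, c in enumerate(ex["choices"])]
        lines.append(f"Answer: {ex['answer']}" if with_answer else "Answer:")
        return "\n".join(lines)

    header = f"The following are multiple choice questions (with answers) about {subject}.\n\n"
    return header + "\n\n".join([block(ex, True) for ex in shots] + [block(question, False)])
```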
MMLU-Pro
MMLU-Pro raises difficulty by removing noisy questions and expanding choices from 4 to 10. Even strong models drop in accuracy. Chain-of-thought helps, but not consistently.
GPQA
GPQA targets PhD-level technical questions, written and validated by domain experts. Experts score around 60%; non-experts with Google score around 30%. Earlier models scored in the high 30s, and newer models score higher. "Google-proof" means a simple web search is not enough, though models may still reflect web-like knowledge patterns.
Humanity's Last Exam
HLE is multimodal and deliberately hard. It uses prizes to incentivize difficult items. Problem writers are LLM-aware, which can produce ultra-specific, model-targeted questions and bias. It is an extreme stress test, not a picture of everyday use.
Instruction-Following and Open-Ended Benchmarks
The Core Problem
Open-ended instructions often have many acceptable answers. There is no single reference output. Automated scoring is always a trade-off.
Chatbot Arena
Users chat with two hidden models and pick the better answer. Pairwise preferences are converted into Elo-style rankings (a minimal rating-update sketch follows below).
Strengths: live prompts, no fixed test set, easy human judgment.
Weaknesses: user bias, gaming, and leaderboard distortions.
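A minimal sketch of how pairwise votes become ratings, using the classic online Elo update (the Arena leaderboard now fits a Bradley-Terry model over all battles, but the intuition is the same; the model names are hypothetical):

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Predicted probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update_elo(ratings: dict, a: str, b: str, a_won: bool, k: float = 32.0) -> None:
    """One online update from a single battle between models a and b."""
    e_a = expected_score(ratings[a], ratings[b])
    s_a = 1.0 if a_won else 0.0
    ratings[a] += k * (s_a - e_a)
    ratings[b] += k * ((1.0 - s_a) - (1.0 - e_a))

ratings = {"model_x": 1000.0, "model_y": 1000.0}   # hypothetical models
update_elo(ratings, "model_x", "model_y", a_won=True)
print(ratings)  # model_x gains exactly what model_y loses
```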
IFEval
Tests narrow instruction constraints like word limits or token restrictions. Constraints are easy to check, so results reproduce. But it mostly tests surface compliance. Nonsense can still pass.
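A minimal sketch of why these constraints reproduce so well: they are trivial to verify programmatically. The specific checks below are illustrative, not IFEval's actual implementation.

```python
def check_word_limit(response: str, max_words: int) -> bool:
    """Verifiable constraint: does the response respect a word limit?"""
    return len(response.split()) <= max_words

def check_contains_keyword(response: str, keyword: str) -> bool:
    """Verifiable constraint: does the response mention a required keyword?"""
    return keyword.lower() in response.lower()

# A response can satisfy every surface constraint and still be nonsense,
# which is exactly the limitation noted above.
print(check_word_limit("word " * 50, 100), check_contains_keyword("I like apples.", "apples"))
```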
AlpacaEval and WildBench
AlpacaEval uses GPT-4 as a judge to compare answers and report win rates. Early versions favored longer answers. Length-corrected variants reduced that bias. Scores correlate with Chatbot Arena.
WildBench uses real human-bot conversations judged by LLMs with checklists.
These benchmarks measure open-ended instruction following. Judge bias and gaming remain central risks.
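To make the judging setup concrete, here is a rough sketch of a pairwise judge prompt. This is an illustrative template, not AlpacaEval's or WildBench's actual one; real setups also randomize the A/B order to reduce position bias.

```python
JUDGE_TEMPLATE = """You are comparing two responses to the same instruction.

Instruction:
{instruction}

Response A:
{answer_a}

Response B:
{answer_b}

Which response better follows the instruction? Reply with exactly "A" or "B"."""

def build_judge_prompt(instruction: str, answer_a: str, answer_b: str) -> str:
    """Fill the pairwise judging template; the judge model's reply is parsed as a vote."""
    return JUDGE_TEMPLATE.format(instruction=instruction, answer_a=answer_a, answer_b=answer_b)
```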
Agent Benchmarks
Why Agents
Many real tasks require multiple model calls, planning, tool use, code execution, browsing, and debugging. Agent benchmarks measure the model plus scaffolding and tools.
SWE-bench
Given a codebase and a GitHub issue, produce a patch that passes tests. The agent must read code, understand the request, modify files, and validate with tests. Success rates are still modest, though improving.
Cybench
Based on capture-the-flag security challenges. The agent issues commands to a remote environment to find secret flags. The loop: plan a command, execute it, observe the output, update memory, repeat. State-of-the-art models still solve only a few tasks. Some tasks take expert humans hours.
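The loop can be sketched roughly as below. `propose_command` stands in for an LLM call and is stubbed out here; the real scaffolding also handles timeouts, output parsing, and flag submission.

```python
import subprocess

def propose_command(history: list[str]) -> str:
    """Placeholder for the LLM planning step: given the transcript so far,
    return the next shell command (stubbed for illustration)."""
    return "ls" if not history else "exit"

def run_agent(max_steps: int = 10) -> list[str]:
    history: list[str] = []                        # the agent's memory
    for _ in range(max_steps):
        cmd = propose_command(history)             # plan
        if cmd == "exit":
            break
        result = subprocess.run(cmd, shell=True, capture_output=True, text=True)  # execute
        history.append(f"$ {cmd}\n{result.stdout}{result.stderr}")                # observe + remember
    return history

print("\n".join(run_agent()))
```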
MLE-bench
Kaggle-style machine learning competitions. The agent reads the problem, inspects data, writes code, debugs, tunes, and submits. Even top models solve only a small fraction.
The pattern: long-horizon planning, tool use, and debugging expose weaknesses that short-answer tests hide.
Reasoning-Only Benchmarks
ARC-AGI
ARC removes language and world knowledge. Inputs are colored grids. The task is to infer a transformation and apply it to a new grid. Rules feel intuitive, but they are not tied to text or facts.
Early models scored near zero. Recent systems score higher, often with heavy computation.
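A toy example in the same spirit (real ARC tasks are far harder, and the rule must be inferred from a few input-output pairs; the hidden rule here is just a horizontal flip):

```python
Grid = list[list[int]]   # each integer is a color

def horizontal_flip(grid: Grid) -> Grid:
    """The hidden transformation for this toy task: mirror each row."""
    return [list(reversed(row)) for row in grid]

# One demonstration pair, then a new test input to transform.
train_in, train_out = [[1, 0, 0], [0, 2, 0]], [[0, 0, 1], [0, 2, 0]]
assert horizontal_flip(train_in) == train_out
print(horizontal_flip([[3, 0, 4]]))   # [[4, 0, 3]]
```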
ARC-AGI-2
A harder successor in the same style.
These benchmarks probe pattern reasoning with minimal reliance on memorized knowledge.
Safety Benchmarks and Jailbreaking
Safety Benchmarks
HarmBench defines harmful behaviors and checks whether models refuse. AIR-Bench ties safety categories to laws and policies and tests with prompts.
The goal is not blanket refusal. The goal is usefulness with care. Do safe, helpful work while refusing dangerous requests.
Refusal vs Usefulness
A model that refuses everything looks perfectly safe and is useless. Safety metrics must pair with capability metrics.
Capability vs Propensity
Capability: can the model produce harmful content at all?
Propensity: will it do so under allowed prompts?
Closed APIs tend to reduce propensity through alignment. For open-weight models, capability is the more relevant measure, because safety layers can be removed or fine-tuned away.
Jailbreaking
Jailbreaking aims to bypass safety behavior. Work has shown that optimized "gibberish" suffixes can cause models to ignore safety rules. Safety can be brittle. Systematic prompting can reveal harmful behavior despite nominal refusals.
Pre-Deployment Testing
Safety institutes run voluntary pre-deployment tests with labs. They evaluate capability and safety on unreleased models. They often reuse existing benchmarks. The hard question remains: what counts as "safe enough"?
Real-World Tasks and Realism
Exam Benchmarks vs Real Users
Standardized tests do not look like daily use. Real prompts include chit-chat, messy questions, spam, adversarial requests, and practical work.
Two Kinds of Prompts
Quizzing prompts: the user already knows the answer and tests the model.
Asking prompts: the user needs help and does not know the answer.
Most value comes from asking prompts. Most benchmarks look like quizzing.
Real Use Analysis
Anthropic clustered real conversations and found coding as a major use case. MedHELM asked clinicians which tasks matter in practice and built benchmarks around those needs.
Realism vs Privacy
High-realism evaluation often needs real user or patient data. That creates privacy and sharing constraints. There is a real tension between realism and privacy.
Train-Test Overlap and Data Quality
Train-Test Contamination
Models train on massive internet corpora. Benchmarks can overlap with training data. Filtering shared long n-grams helps but misses translations, paraphrases, and variants like math problems in other languages.
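A minimal sketch of the n-gram filter (the 13-gram threshold is one common choice, and the whitespace tokenization here is a simplification):

```python
def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    """Word-level n-grams; real pipelines tokenize and normalize more carefully."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(train_doc: str, test_example: str, n: int = 13) -> bool:
    """Flag a test example that shares any long n-gram with a training document.
    As noted above, this misses paraphrases, translations, and reworded variants."""
    return bool(ngrams(train_doc, n) & ngrams(test_example, n))
```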
Detecting Contamination Without Training Data
Look for output patterns that mirror dataset ordering. Watch for suspicious confidence spikes on specific items. Treat unusually sharp performance jumps with caution.
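One line of work turns the ordering idea into a statistical test: if the model assigns reliably higher likelihood to benchmark examples in their canonical dataset order than to shuffled orders, it probably saw the dataset. The sketch below assumes some `seq_logprob` callable returning log p(text) under the model; it is a rough illustration of the idea, not a faithful reimplementation.

```python
import random

def order_logprob(examples: list[str], seq_logprob) -> float:
    """Log-likelihood of the examples concatenated in a given order."""
    return seq_logprob("\n\n".join(examples))

def canonical_order_pvalue(examples: list[str], seq_logprob, n_perm: int = 100) -> float:
    """Permutation test: fraction of random orderings at least as likely as the
    canonical one. A tiny p-value suggests the canonical ordering was memorized."""
    canonical = order_logprob(examples, seq_logprob)
    hits = sum(
        order_logprob(random.sample(examples, len(examples)), seq_logprob) >= canonical
        for _ in range(n_perm)
    )
    return (hits + 1) / (n_perm + 1)
```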
Norms and Reporting
Many papers do not report decontamination checks. Stronger norms would include clearer test-set protection and contamination audits.
Label Noise and Benchmark Quality
Benchmarks often include wrong answers or bad questions. Fixing these errors often raises top scores. Some "difficulty" was just noise.
What Exactly Are We Evaluating?
Methods vs Systems
Classic ML evaluation compared methods on fixed splits. Which algorithm gets the best accuracy?
Today, many evaluations compare systems. Systems can differ in data mixtures, architectures, prompts, scaffolding, post-processing, and safety layers.
System-level evaluation matches what users experience. It can hide the source of gains and make scientific comparison harder.
Some efforts, like speed runs and fixed-data competitions, try to bring back method-focused comparisons under shared rules.
The bottom line: evaluation shapes incentives. Get it wrong, and you build the wrong thing.