
Part 4 — Evaluation

Overview

Training loss tells you how well the model fits the training data. Evaluation tells you whether the model is useful. This article covers automatic metrics (perplexity, BLEU, ROUGE, BERTScore), task benchmarks, and how to build a repeatable evaluation pipeline.


1. Why Evaluation Is Hard

  • A model with low training loss can still produce garbage outputs (overfitting, wrong format)
  • Automatic metrics correlate imperfectly with human judgement
  • Benchmarks can be contaminated (test data in the pre-training corpus)

The golden rule: use multiple signals — no single metric is sufficient.


2. Perplexity

Perplexity measures how surprised a language model is by text. Lower is better.

\[\text{PPL} = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N} \log P(w_i \mid w_1, \ldots, w_{i-1})\right)\]
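To build intuition before touching a real model, here is a toy computation straight from the formula (pure Python, not tied to any library): a model that assigns uniform probability 1/V to every token has perplexity exactly V, the vocabulary size.

```python
import math

def perplexity_from_probs(token_probs: list[float]) -> float:
    """Perplexity = exp of the average negative log-probability per token."""
    n = len(token_probs)
    nll = -sum(math.log(p) for p in token_probs) / n
    return math.exp(nll)

# A model that is uniformly unsure over a 50k-token vocabulary:
vocab_size = 50_000
print(perplexity_from_probs([1 / vocab_size] * 10))   # 50000.0

# A sharper model is less "surprised", so its perplexity is lower:
print(perplexity_from_probs([0.9, 0.8, 0.95, 0.7]))   # ≈ 1.20
```

With a real model, the same per-token average is computed over sliding windows of a corpus: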
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gpt2"
model     = AutoModelForCausalLM.from_pretrained(model_id).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(model_id)

def perplexity(text: str, stride: int = 512) -> float:
    encodings  = tokenizer(text, return_tensors="pt")
    input_ids  = encodings.input_ids.cuda()
    max_length = model.config.n_positions
    seq_len    = input_ids.size(1)

    nlls = []
    prev_end = 0

    for begin in range(0, seq_len, stride):
        end   = min(begin + max_length, seq_len)
        trg_l = end - prev_end
        input_chunk = input_ids[:, begin:end]
        target_ids  = input_chunk.clone()
        target_ids[:, :-trg_l] = -100   # only compute loss on the new tokens

        with torch.no_grad():
            loss = model(input_chunk, labels=target_ids).loss
        nlls.append(loss * trg_l)
        prev_end = end
        if end == seq_len:
            break

    return torch.exp(torch.stack(nlls).sum() / seq_len).item()

text = open("data/test_corpus.txt").read()
print(f"Perplexity: {perplexity(text):.2f}")

Using the evaluate Library

import evaluate

perplexity = evaluate.load("perplexity", module_type="metric")

results = perplexity.compute(
    predictions=["The quick brown fox jumps.", "Language models are fascinating."],
    model_id="gpt2",
)
print(results["mean_perplexity"])

3. BLEU

BLEU (Bilingual Evaluation Understudy) measures n-gram overlap between generated text and reference translations. Originally for MT, now used broadly for text generation.

import evaluate

bleu = evaluate.load("bleu")

predictions = ["the cat sat on the mat"]
references  = [["the cat is on the mat"]]   # list of lists — multiple refs supported

result = bleu.compute(predictions=predictions, references=references)
print(result)
# {'bleu': 0.0, 'precisions': [0.833, 0.6, 0.25, 0.0], 'brevity_penalty': 1.0, ...}
# (no matching 4-gram, so unsmoothed BLEU collapses to zero)

BLEU limitations

BLEU penalises valid paraphrases, ignores semantics, and correlates poorly with human judgement for open-ended generation. Use sacrebleu for reproducible MT evaluation, and treat BLEU as one signal among many for other tasks.
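To see the paraphrase problem concretely, here is a minimal from-scratch BLEU — clipped n-gram precision, geometric mean, brevity penalty, no smoothing. This is an illustrative sketch, not a replacement for sacrebleu:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]

def simple_bleu(candidate: str, reference: str, max_n: int = 4) -> float:
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        # clipped (modified) n-gram precision
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        precisions.append(overlap / max(sum(cand_counts.values()), 1))
    if min(precisions) == 0:   # without smoothing, any zero precision zeroes BLEU
        return 0.0
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

print(simple_bleu("the cat sat on the mat", "the cat sat on the mat"))  # 1.0
# A perfectly valid paraphrase scores zero — BLEU only sees surface overlap:
print(simple_bleu("the film was great", "the movie was excellent"))     # 0.0
```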

SacreBLEU (Reproducible MT Evaluation)

import evaluate

sacrebleu = evaluate.load("sacrebleu")
result = sacrebleu.compute(
    predictions=["the cat sat on the mat"],
    references=[["the cat is on the mat"]],
)
print(f"SacreBLEU: {result['score']:.2f}")

4. ROUGE

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) measures n-gram recall against references. Standard for summarisation.

Metric   Measures
rouge1   Unigram overlap
rouge2   Bigram overlap
rougeL   Longest common subsequence

import evaluate

rouge = evaluate.load("rouge")

predictions = ["The government announced new economic measures to combat inflation."]
references  = ["New economic policies were announced to fight rising inflation."]

result = rouge.compute(predictions=predictions, references=references)
print(result)
# {'rouge1': 0.533, 'rouge2': 0.267, 'rougeL': 0.467, 'rougeLsum': 0.467}
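Under the hood, rougeL is just an F-measure over the longest common subsequence. A minimal sketch with whitespace tokenisation and a single reference (the rouge_score package adds stemming and bootstrap aggregation on top):

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    # classic dynamic-programming longest-common-subsequence
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def rouge_l_f1(prediction: str, reference: str) -> float:
    pred, ref = prediction.split(), reference.split()
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(pred), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

# LCS = "the cat on the mat" (5 tokens), so P = R = 5/6
print(rouge_l_f1("the cat sat on the mat", "the cat is on the mat"))  # ≈ 0.833
```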

Batch Evaluation on a Dataset

import evaluate
from datasets import load_dataset
from transformers import pipeline

summariser = pipeline("summarization", model="facebook/bart-large-cnn", device=0)
rouge      = evaluate.load("rouge")

test_ds = load_dataset("cnn_dailymail", "3.0.0", split="test[:500]")

predictions = [s["summary_text"] for s in summariser(test_ds["article"], max_length=128, truncation=True)]
references  = test_ds["highlights"]

result = rouge.compute(predictions=predictions, references=references)
for k, v in result.items():
    print(f"{k}: {v:.4f}")

5. BERTScore

BERTScore uses contextual embeddings to measure semantic similarity — it correlates much better with human judgement than BLEU/ROUGE for open-ended generation.

import evaluate

bertscore = evaluate.load("bertscore")

predictions = ["The president signed the bill into law."]
references  = ["The bill was signed by the president."]

result = bertscore.compute(
    predictions=predictions,
    references=references,
    lang="en",
    model_type="distilbert-base-uncased",   # or "microsoft/deberta-xlarge-mnli" for best quality
)
print(f"Precision: {result['precision'][0]:.3f}")
print(f"Recall:    {result['recall'][0]:.3f}")
print(f"F1:        {result['f1'][0]:.3f}")
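Under the hood, BERTScore greedily matches token embeddings: recall averages, for each reference token, its best cosine similarity against any candidate token; precision does the reverse. The core reduces to row and column maxima of a similarity matrix — a toy sketch with made-up 2-D "embeddings", no model involved:

```python
import numpy as np

def bertscore_from_embeddings(cand: np.ndarray, ref: np.ndarray):
    """cand: (n_cand, d), ref: (n_ref, d) token embeddings."""
    # cosine similarity matrix: normalise rows, then one matmul
    cand = cand / np.linalg.norm(cand, axis=1, keepdims=True)
    ref = ref / np.linalg.norm(ref, axis=1, keepdims=True)
    sim = cand @ ref.T                   # (n_cand, n_ref)
    precision = sim.max(axis=1).mean()   # each candidate token -> best ref match
    recall = sim.max(axis=0).mean()      # each reference token -> best cand match
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Two candidate tokens match reference tokens exactly; one extra token diverges,
# so precision drops below 1.0 while recall stays perfect.
cand = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
ref  = np.array([[1.0, 0.0], [0.0, 1.0]])
p, r, f1 = bertscore_from_embeddings(cand, ref)
print(round(p, 3), round(r, 3))  # 0.902 1.0
```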

6. Classification Metrics

import evaluate
import numpy as np

accuracy  = evaluate.load("accuracy")
f1        = evaluate.load("f1")
precision = evaluate.load("precision")
recall    = evaluate.load("recall")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions    = np.argmax(logits, axis=-1)

    return {
        "accuracy":  accuracy.compute( predictions=predictions, references=labels)["accuracy"],
        "f1":        f1.compute(       predictions=predictions, references=labels, average="weighted")["f1"],
        "precision": precision.compute(predictions=predictions, references=labels, average="weighted")["precision"],
        "recall":    recall.compute(   predictions=predictions, references=labels, average="weighted")["recall"],
    }

# Pass to Trainer
trainer = Trainer(
    ...
    compute_metrics=compute_metrics,
)
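These weighted scores are easy to derive from confusion counts. A from-scratch check (pure NumPy) of what the loaded metrics compute — "weighted" averaging weights each class's score by its support:

```python
import numpy as np

def weighted_prf(predictions: np.ndarray, labels: np.ndarray) -> dict:
    classes, support = np.unique(labels, return_counts=True)
    precisions, recalls, f1s = [], [], []
    for c in classes:
        tp = np.sum((predictions == c) & (labels == c))
        fp = np.sum((predictions == c) & (labels != c))
        fn = np.sum((predictions != c) & (labels == c))
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(2 * p * r / (p + r) if p + r else 0.0)
    w = support / support.sum()   # weight each class by its frequency
    return {
        "accuracy": float(np.mean(predictions == labels)),
        "precision": float(np.dot(w, precisions)),
        "recall": float(np.dot(w, recalls)),
        "f1": float(np.dot(w, f1s)),
    }

print(weighted_prf(np.array([0, 1, 1, 0, 1]), np.array([0, 1, 0, 0, 1])))
```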

7. LLM-as-Judge

When there is no single correct answer (e.g. helpfulness, safety, creativity), use a strong LLM to score outputs.

from anthropic import Anthropic

client = Anthropic()

JUDGE_PROMPT = """You are an impartial evaluator.

Rate the following response on a scale of 1–5:
  1 = Completely wrong or harmful
  3 = Partially correct but incomplete  
  5 = Excellent — accurate, helpful, and well-structured

Question: {question}
Response: {response}

Return only a single integer (1-5)."""

def llm_judge(question: str, response: str) -> int:
    message = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=8,
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(question=question, response=response),
        }],
    )
    return int(message.content[0].text.strip())

scores = [llm_judge(q, r) for q, r in zip(questions, responses)]
print(f"Mean score: {sum(scores)/len(scores):.2f}/5")

Reduce position bias

LLM judges can favour the first response shown. For pairwise comparison, swap the order of A/B and average both judgements.
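A sketch of that debiasing trick, with the judge abstracted as a callable so it runs without an API key (judge_fn, debiased_pairwise, and the demo stub are hypothetical names, not part of any SDK): judge both orderings and average, so a judge that systematically favours one position no longer decides the winner.

```python
from typing import Callable

def debiased_pairwise(judge_fn: Callable[[str, str, str], float],
                      question: str, resp_a: str, resp_b: str) -> float:
    """Mean score for resp_a across both presentation orders.
    judge_fn(question, first, second) -> score in [0, 1] for `first`."""
    forward = judge_fn(question, resp_a, resp_b)          # A shown first
    backward = 1.0 - judge_fn(question, resp_b, resp_a)   # A shown second
    return (forward + backward) / 2

# Demo: a maximally position-biased judge that always prefers whichever
# response it sees first. After debiasing, the comparison is a tie:
biased_judge = lambda q, first, second: 1.0
print(debiased_pairwise(biased_judge, "Q?", "answer A", "answer B"))  # 0.5
```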


8. Running Standard Benchmarks

# pip install lm-eval
lm_eval --model hf \
        --model_args pretrained=meta-llama/Llama-3.2-1B \
        --tasks hellaswag,arc_easy,arc_challenge,winogrande,mmlu \
        --device cuda:0 \
        --batch_size 8 \
        --output_path results/llama-1b-benchmarks.json

Benchmark             What it tests
HellaSwag             Commonsense reasoning (sentence completion)
ARC Easy / Challenge  Science QA (easy and hard)
WinoGrande            Commonsense reasoning (pronoun resolution)
MMLU                  57-subject academic knowledge
TruthfulQA            Tendency to produce confident false statements
HumanEval             Python code generation
MT-Bench              Multi-turn instruction following (LLM-judged)

9. Evaluation Pipeline

from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from datasets import load_dataset
import evaluate, json, datetime

def run_evaluation(model_path: str, eval_dataset: str, output_path: str):
    tokenizer  = AutoTokenizer.from_pretrained(model_path)
    gen        = pipeline("text-generation", model=model_path, tokenizer=tokenizer, device=0)
    rouge      = evaluate.load("rouge")
    bertscore  = evaluate.load("bertscore")
    ds         = load_dataset("json", data_files=eval_dataset, split="train")

    predictions, references = [], []
    for example in ds:
        output = gen(example["prompt"], max_new_tokens=256, do_sample=False)[0]["generated_text"]
        predictions.append(output[len(example["prompt"]):].strip())
        references.append(example["reference"])

    results = {
        "model": model_path,
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "n_examples": len(ds),
        **rouge.compute(predictions=predictions, references=references),
        "bertscore_f1_mean": sum(
            bertscore.compute(predictions=predictions, references=references, lang="en")["f1"]
        ) / len(predictions),
    }

    with open(output_path, "w") as f:
        json.dump(results, f, indent=2)

    return results

results = run_evaluation(
    model_path="models/mistral-sft",
    eval_dataset="data/eval.jsonl",
    output_path="results/mistral-sft-eval.json",
)
print(results)

10. Evaluation Checklist

  • Held-out test set not seen during training or hyperparameter tuning
  • Multiple metrics used (at least 2–3)
  • LLM-as-Judge or human evaluation included for generation tasks
  • Baseline comparison (the base model before fine-tuning, a strong reference model such as GPT-4, or a trivial random/majority baseline)
  • Results are deterministic (do_sample=False or fixed seed)
  • Benchmark contamination checked if using public benchmarks
  • Results saved with model version and timestamp