Part 4 — Evaluation
Overview
Training loss tells you how well the model fits the training data. Evaluation tells you whether the model is useful. This article covers automatic metrics (perplexity, BLEU, ROUGE, BERTScore), task benchmarks, and how to build a repeatable evaluation pipeline.
1. Why Evaluation Is Hard
- A model with low training loss can still produce garbage outputs (overfitting, wrong format)
- Automatic metrics correlate imperfectly with human judgement
- Benchmarks can be contaminated (test data in the pre-training corpus)
The golden rule: use multiple signals — no single metric is sufficient.
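Contamination, at least, can be screened for mechanically. A minimal sketch (the 8-token window and the helper names are my own illustration, not a standard tool) that flags test examples sharing any long n-gram with the training corpus:

```python
def ngrams(text: str, n: int = 8) -> set:
    """All length-n token windows in the text, as a set of tuples."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def flag_contaminated(test_examples, train_corpus: str, n: int = 8):
    """Return test examples that share at least one n-gram with the training text."""
    train_grams = ngrams(train_corpus, n)
    return [ex for ex in test_examples if ngrams(ex, n) & train_grams]
```

Long verbatim overlaps are strong evidence of contamination; short ones are common by chance, which is why the window is set well above typical phrase length.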
2. Perplexity
Perplexity measures how surprised a language model is by text. Lower is better.
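Formally, perplexity is the exponential of the average negative log-likelihood per token. A self-contained sketch using made-up token probabilities:

```python
import math

def perplexity_from_probs(token_probs):
    """Perplexity = exp(mean negative log-likelihood per token)."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# A model that assigns every token probability 0.25 is effectively choosing
# uniformly among four options at each step, so its perplexity is 4.
print(round(perplexity_from_probs([0.25, 0.25, 0.25, 0.25]), 6))  # 4.0
```

The code below does the same computation, except the probabilities come from GPT-2 and a sliding window keeps each chunk within the model's context limit.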
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gpt2"
model = AutoModelForCausalLM.from_pretrained(model_id).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(model_id)

def perplexity(text: str, stride: int = 512) -> float:
    encodings = tokenizer(text, return_tensors="pt")
    input_ids = encodings.input_ids.cuda()
    max_length = model.config.n_positions  # 1024 for GPT-2
    seq_len = input_ids.size(1)

    nlls = []
    prev_end = 0
    for begin in range(0, seq_len, stride):
        end = min(begin + max_length, seq_len)
        trg_len = end - prev_end
        input_chunk = input_ids[:, begin:end]
        target_ids = input_chunk.clone()
        target_ids[:, :-trg_len] = -100  # only compute loss on the new tokens
        with torch.no_grad():
            loss = model(input_chunk, labels=target_ids).loss
        nlls.append(loss * trg_len)  # un-average the per-token loss
        prev_end = end
        if end == seq_len:
            break
    return torch.exp(torch.stack(nlls).sum() / seq_len).item()

text = open("data/test_corpus.txt").read()
print(f"Perplexity: {perplexity(text):.2f}")
```
Using the `evaluate` Library

```python
import evaluate

perplexity = evaluate.load("perplexity", module_type="metric")
results = perplexity.compute(
    predictions=["The quick brown fox jumps.", "Language models are fascinating."],
    model_id="gpt2",
)
print(results["mean_perplexity"])
```
3. BLEU
BLEU (Bilingual Evaluation Understudy) measures n-gram overlap between generated text and reference translations. Originally designed for machine translation, it is now used broadly for text generation.
```python
import evaluate

bleu = evaluate.load("bleu")
predictions = ["the cat sat on the mat"]
references = [["the cat is on the mat"]]  # list of lists: multiple references supported
result = bleu.compute(predictions=predictions, references=references)
print(result)
# {'bleu': 0.511, 'precisions': [...], 'brevity_penalty': 1.0, ...}
```
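To see what n-gram overlap means mechanically, here is a hand-rolled modified unigram precision, one ingredient of BLEU (the full metric combines 1- to 4-gram precisions with a brevity penalty; this helper is illustrative, not the `evaluate` implementation):

```python
from collections import Counter

def modified_unigram_precision(prediction: str, reference: str) -> float:
    pred_counts = Counter(prediction.split())
    ref_counts = Counter(reference.split())
    # Clip each predicted word's count by its count in the reference,
    # so repeating a matching word cannot inflate the score.
    clipped = sum(min(count, ref_counts[word]) for word, count in pred_counts.items())
    return clipped / sum(pred_counts.values())

score = modified_unigram_precision("the cat sat on the mat", "the cat is on the mat")
print(f"{score:.3f}")  # 0.833: 5 of the 6 predicted unigrams appear in the reference
```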
**BLEU limitations:** BLEU penalises valid paraphrases, ignores semantics, and correlates poorly with human judgement for open-ended generation. Use `sacrebleu` for reproducible MT evaluation, and treat BLEU as one signal among many for other tasks.
SacreBLEU (Reproducible MT Evaluation)

```python
import evaluate

sacrebleu = evaluate.load("sacrebleu")
result = sacrebleu.compute(
    predictions=["the cat sat on the mat"],
    references=[["the cat is on the mat"]],
)
print(f"SacreBLEU: {result['score']:.2f}")
```
4. ROUGE
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) measures n-gram recall against references. It is the standard metric family for summarisation.
| Metric | Measures |
|---|---|
| `rouge1` | Unigram overlap |
| `rouge2` | Bigram overlap |
| `rougeL` | Longest common subsequence |
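The recall orientation is easiest to see in a from-scratch ROUGE-1 sketch; note that the denominator counts reference tokens, not predicted tokens (illustrative only; the `evaluate` implementation commonly reports F-measure):

```python
from collections import Counter

def rouge1_recall(prediction: str, reference: str) -> float:
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Clipped unigram overlap, divided by the number of *reference* unigrams:
    # a prediction that misses reference content is punished.
    overlap = sum(min(count, pred_counts[word]) for word, count in ref_counts.items())
    return overlap / sum(ref_counts.values())

print(rouge1_recall("the cat sat", "the cat sat on the mat"))  # 0.5: recovers 3 of 6 reference tokens
```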
```python
import evaluate

rouge = evaluate.load("rouge")
predictions = ["The government announced new economic measures to combat inflation."]
references = ["New economic policies were announced to fight rising inflation."]
result = rouge.compute(predictions=predictions, references=references)
print(result)
# {'rouge1': 0.533, 'rouge2': 0.267, 'rougeL': 0.467, 'rougeLsum': 0.467}
```
Batch Evaluation on a Dataset

```python
import evaluate
from datasets import load_dataset
from transformers import pipeline

summariser = pipeline("summarization", model="facebook/bart-large-cnn", device=0)
rouge = evaluate.load("rouge")

test_ds = load_dataset("cnn_dailymail", "3.0.0", split="test[:500]")
# truncation=True clips articles that exceed the model's max input length
predictions = [s["summary_text"] for s in summariser(test_ds["article"], max_length=128, truncation=True)]
references = test_ds["highlights"]

result = rouge.compute(predictions=predictions, references=references)
for k, v in result.items():
    print(f"{k}: {v:.4f}")
```
5. BERTScore
BERTScore uses contextual embeddings to measure semantic similarity — it correlates much better with human judgement than BLEU/ROUGE for open-ended generation.
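Mechanically, BERTScore greedily matches each token embedding to its most similar counterpart by cosine similarity, then averages the matches into precision and recall. A toy sketch with hand-made 2-D vectors standing in for contextual embeddings (the real metric uses transformer embeddings and supports optional IDF weighting; the function names here are my own):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def toy_bertscore_f1(pred_embs, ref_embs):
    # Greedy matching: every token pairs with its best counterpart on the other side.
    recall = sum(max(cosine(r, p) for p in pred_embs) for r in ref_embs) / len(ref_embs)
    precision = sum(max(cosine(p, r) for r in ref_embs) for p in pred_embs) / len(pred_embs)
    return 2 * precision * recall / (precision + recall)

# Identical "embeddings" give a perfect score.
print(toy_bertscore_f1([(1.0, 0.0), (0.0, 1.0)], [(1.0, 0.0), (0.0, 1.0)]))  # 1.0
```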
```python
import evaluate

bertscore = evaluate.load("bertscore")
predictions = ["The president signed the bill into law."]
references = ["The bill was signed by the president."]
result = bertscore.compute(
    predictions=predictions,
    references=references,
    lang="en",
    model_type="distilbert-base-uncased",  # or "microsoft/deberta-xlarge-mnli" for best quality
)
print(f"Precision: {result['precision'][0]:.3f}")
print(f"Recall:    {result['recall'][0]:.3f}")
print(f"F1:        {result['f1'][0]:.3f}")
```
6. Classification Metrics

```python
import evaluate
import numpy as np

accuracy = evaluate.load("accuracy")
f1 = evaluate.load("f1")
precision = evaluate.load("precision")
recall = evaluate.load("recall")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy.compute(predictions=predictions, references=labels)["accuracy"],
        "f1": f1.compute(predictions=predictions, references=labels, average="weighted")["f1"],
        "precision": precision.compute(predictions=predictions, references=labels, average="weighted")["precision"],
        "recall": recall.compute(predictions=predictions, references=labels, average="weighted")["recall"],
    }

# Pass to Trainer (fragment: fill in model, args, and datasets)
trainer = Trainer(
    ...,
    compute_metrics=compute_metrics,
)
```
7. LLM-as-Judge
When there is no single correct answer (e.g. helpfulness, safety, creativity), use a strong LLM to score outputs.
```python
from anthropic import Anthropic

client = Anthropic()

JUDGE_PROMPT = """You are an impartial evaluator.
Rate the following response on a scale of 1–5:
1 = Completely wrong or harmful
3 = Partially correct but incomplete
5 = Excellent — accurate, helpful, and well-structured

Question: {question}
Response: {response}

Return only a single integer (1-5)."""

def llm_judge(question: str, response: str) -> int:
    message = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=8,
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(question=question, response=response),
        }],
    )
    return int(message.content[0].text.strip())

scores = [llm_judge(q, r) for q, r in zip(questions, responses)]
print(f"Mean score: {sum(scores)/len(scores):.2f}/5")
```
**Reduce position bias:** LLM judges can favour the first response shown. For pairwise comparison, swap the order of A/B and average both judgements.
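The swap-and-average trick is easy to wrap around any pairwise judge. A sketch where `judge(question, first, second)` is any callable (e.g. a prompt-based LLM call) returning 1 if the first-shown response wins, 0 if the second wins, and 0.5 for a tie; the wrapper name is my own:

```python
def debiased_pairwise_score(judge, question: str, resp_a: str, resp_b: str) -> float:
    """Score A against B, averaging over both presentation orders."""
    a_shown_first = judge(question, resp_a, resp_b)
    a_shown_second = 1 - judge(question, resp_b, resp_a)  # A was second: invert
    return (a_shown_first + a_shown_second) / 2

# A judge with pure position bias (always picks whichever is shown first)
# washes out to an uninformative 0.5 once both orders are averaged.
always_first = lambda question, first, second: 1
print(debiased_pairwise_score(always_first, "q", "A", "B"))  # 0.5
```

A judge with a genuine preference keeps its signal: if it picks response A in both orders, the averaged score is 1.0.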
8. Running Standard Benchmarks

```bash
# pip install lm-eval
lm_eval --model hf \
    --model_args pretrained=meta-llama/Llama-3.2-1B \
    --tasks hellaswag,arc_easy,arc_challenge,winogrande,mmlu \
    --device cuda:0 \
    --batch_size 8 \
    --output_path results/llama-1b-benchmarks.json
```
| Benchmark | What it tests |
|---|---|
| HellaSwag | Commonsense reasoning (sentence completion) |
| ARC Easy / Challenge | Science QA (easy and hard) |
| WinoGrande | Commonsense reasoning (pronoun resolution) |
| MMLU | 57-subject academic knowledge |
| TruthfulQA | Tendency to produce confident false statements |
| HumanEval | Python code generation |
| MT-Bench | Multi-turn instruction following (LLM-judged) |
9. Evaluation Pipeline

```python
import datetime
import json

import evaluate
from datasets import load_dataset
from transformers import AutoTokenizer, pipeline

def run_evaluation(model_path: str, eval_dataset: str, output_path: str):
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    gen = pipeline("text-generation", model=model_path, tokenizer=tokenizer, device=0)
    rouge = evaluate.load("rouge")
    bertscore = evaluate.load("bertscore")

    ds = load_dataset("json", data_files=eval_dataset, split="train")
    predictions, references = [], []
    for example in ds:
        output = gen(example["prompt"], max_new_tokens=256, do_sample=False)[0]["generated_text"]
        predictions.append(output[len(example["prompt"]):].strip())  # drop the echoed prompt
        references.append(example["reference"])

    results = {
        "model": model_path,
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "n_examples": len(ds),
        **rouge.compute(predictions=predictions, references=references),
        "bertscore_f1_mean": sum(
            bertscore.compute(predictions=predictions, references=references, lang="en")["f1"]
        ) / len(predictions),
    }
    with open(output_path, "w") as f:
        json.dump(results, f, indent=2)
    return results

results = run_evaluation(
    model_path="models/mistral-sft",
    eval_dataset="data/eval.jsonl",
    output_path="results/mistral-sft-eval.json",
)
print(results)
```
10. Evaluation Checklist
- Held-out test set not seen during training or hyperparameter tuning
- Multiple metrics used (at least 2–3)
- LLM-as-Judge or human evaluation included for generation tasks
- Baseline comparison (pre-fine-tune model, GPT-4, random)
- Results are deterministic (`do_sample=False` or fixed seed)
- Benchmark contamination checked if using public benchmarks
- Results saved with model version and timestamp