
Part 4 — Evaluation

Overview

Training loss tells you how well the model fits the training data. Evaluation tells you whether the model is useful. This article covers automatic metrics (perplexity, BLEU, ROUGE, BERTScore), task benchmarks, and how to build a repeatable evaluation pipeline.


1. Why Evaluation Is Hard

  • A model with low training loss can still produce garbage outputs (overfitting, wrong format)
  • Automatic metrics correlate imperfectly with human judgement
  • Benchmarks can be contaminated (test data in the pre-training corpus)

The golden rule: use multiple signals — no single metric is sufficient.


2. Perplexity

Perplexity measures how surprised a language model is by text. Lower is better.

\[\text{PPL} = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N} \log P(w_i \mid w_1, \ldots, w_{i-1})\right)\]
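To build intuition before touching a real model, here is a toy computation straight from the formula (pure Python, not tied to any library): a model that assigns uniform probability 1/V to every token has perplexity exactly V, the vocabulary size.

```python
import math

def perplexity_from_probs(token_probs: list[float]) -> float:
    """Perplexity = exp of the average negative log-probability per token."""
    n = len(token_probs)
    nll = -sum(math.log(p) for p in token_probs) / n
    return math.exp(nll)

# A model that is uniformly unsure over a 50k-token vocabulary:
vocab_size = 50_000
print(perplexity_from_probs([1 / vocab_size] * 10))   # 50000.0

# A sharper model is less "surprised", so its perplexity is lower:
print(perplexity_from_probs([0.9, 0.8, 0.95, 0.7]))   # ≈ 1.20
```

With a real model, the same per-token average is computed over sliding windows of a corpus: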
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gpt2"
model     = AutoModelForCausalLM.from_pretrained(model_id).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(model_id)

def perplexity(text: str, stride: int = 512) -> float:
    encodings  = tokenizer(text, return_tensors="pt")
    input_ids  = encodings.input_ids.cuda()
    max_length = model.config.n_positions
    seq_len    = input_ids.size(1)

    nlls = []
    prev_end = 0

    for begin in range(0, seq_len, stride):
        end   = min(begin + max_length, seq_len)
        trg_l = end - prev_end
        input_chunk = input_ids[:, begin:end]
        target_ids  = input_chunk.clone()
        target_ids[:, :-trg_l] = -100   # only compute loss on the new tokens

        with torch.no_grad():
            loss = model(input_chunk, labels=target_ids).loss
        nlls.append(loss * trg_l)
        prev_end = end
        if end == seq_len:
            break

    return torch.exp(torch.stack(nlls).sum() / seq_len).item()

text = open("data/test_corpus.txt").read()
print(f"Perplexity: {perplexity(text):.2f}")

Using the evaluate Library

import evaluate

perplexity = evaluate.load("perplexity", module_type="metric")

results = perplexity.compute(
    predictions=["The quick brown fox jumps.", "Language models are fascinating."],
    model_id="gpt2",
)
print(results["mean_perplexity"])

3. BLEU

BLEU (Bilingual Evaluation Understudy) measures n-gram overlap between generated text and reference translations. Originally for MT, now used broadly for text generation.

import evaluate

bleu = evaluate.load("bleu")

predictions = ["the cat sat on the mat"]
references  = [["the cat is on the mat"]]   # list of lists — multiple refs supported

result = bleu.compute(predictions=predictions, references=references)
print(result)
# {'bleu': 0.0, 'precisions': [0.833, 0.6, 0.25, 0.0], 'brevity_penalty': 1.0, ...}
# (no matching 4-gram, so unsmoothed BLEU collapses to zero)

BLEU limitations

BLEU penalises valid paraphrases, ignores semantics, and correlates poorly with human judgement for open-ended generation. Use sacrebleu for reproducible MT evaluation, and treat BLEU as one signal among many for other tasks.
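To see the paraphrase problem concretely, here is a minimal from-scratch BLEU — clipped n-gram precision, geometric mean, brevity penalty, no smoothing. This is an illustrative sketch, not a replacement for sacrebleu:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]

def simple_bleu(candidate: str, reference: str, max_n: int = 4) -> float:
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        # clipped (modified) n-gram precision
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        precisions.append(overlap / max(sum(cand_counts.values()), 1))
    if min(precisions) == 0:   # without smoothing, any zero precision zeroes BLEU
        return 0.0
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

print(simple_bleu("the cat sat on the mat", "the cat sat on the mat"))  # 1.0
# A perfectly valid paraphrase scores zero — BLEU only sees surface overlap:
print(simple_bleu("the film was great", "the movie was excellent"))     # 0.0
```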

SacreBLEU (Reproducible MT Evaluation)

import evaluate

sacrebleu = evaluate.load("sacrebleu")
result = sacrebleu.compute(
    predictions=["the cat sat on the mat"],
    references=[["the cat is on the mat"]],
)
print(f"SacreBLEU: {result['score']:.2f}")

4. ROUGE

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) measures n-gram recall against references. Standard for summarisation.

Metric   Measures
rouge1   Unigram overlap
rouge2   Bigram overlap
rougeL   Longest common subsequence

import evaluate

rouge = evaluate.load("rouge")

predictions = ["The government announced new economic measures to combat inflation."]
references  = ["New economic policies were announced to fight rising inflation."]

result = rouge.compute(predictions=predictions, references=references)
print(result)
# {'rouge1': 0.533, 'rouge2': 0.267, 'rougeL': 0.467, 'rougeLsum': 0.467}
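Under the hood, rougeL is just an F-measure over the longest common subsequence. A minimal sketch with whitespace tokenisation and a single reference (the rouge_score package adds stemming and bootstrap aggregation on top):

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    # classic dynamic-programming longest-common-subsequence
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def rouge_l_f1(prediction: str, reference: str) -> float:
    pred, ref = prediction.split(), reference.split()
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(pred), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

# LCS = "the cat on the mat" (5 tokens), so P = R = 5/6
print(rouge_l_f1("the cat sat on the mat", "the cat is on the mat"))  # ≈ 0.833
```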

Batch Evaluation on a Dataset

import evaluate
from datasets import load_dataset
from transformers import pipeline

summariser = pipeline("summarization", model="facebook/bart-large-cnn", device=0)
rouge      = evaluate.load("rouge")

test_ds = load_dataset("cnn_dailymail", "3.0.0", split="test[:500]")

predictions = [s["summary_text"] for s in summariser(test_ds["article"], max_length=128, truncation=True)]
references  = test_ds["highlights"]

result = rouge.compute(predictions=predictions, references=references)
for k, v in result.items():
    print(f"{k}: {v:.4f}")

5. BERTScore

BERTScore uses contextual embeddings to measure semantic similarity — it correlates much better with human judgement than BLEU/ROUGE for open-ended generation.

import evaluate

bertscore = evaluate.load("bertscore")

predictions = ["The president signed the bill into law."]
references  = ["The bill was signed by the president."]

result = bertscore.compute(
    predictions=predictions,
    references=references,
    lang="en",
    model_type="distilbert-base-uncased",   # or "microsoft/deberta-xlarge-mnli" for best quality
)
print(f"Precision: {result['precision'][0]:.3f}")
print(f"Recall:    {result['recall'][0]:.3f}")
print(f"F1:        {result['f1'][0]:.3f}")
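Under the hood, BERTScore greedily matches token embeddings: recall averages, for each reference token, its best cosine similarity against any candidate token; precision does the reverse. The core reduces to row and column maxima of a similarity matrix — a toy sketch with made-up 2-D "embeddings", no model involved:

```python
import numpy as np

def bertscore_from_embeddings(cand: np.ndarray, ref: np.ndarray):
    """cand: (n_cand, d), ref: (n_ref, d) token embeddings."""
    # cosine similarity matrix: normalise rows, then one matmul
    cand = cand / np.linalg.norm(cand, axis=1, keepdims=True)
    ref = ref / np.linalg.norm(ref, axis=1, keepdims=True)
    sim = cand @ ref.T                   # (n_cand, n_ref)
    precision = sim.max(axis=1).mean()   # each candidate token -> best ref match
    recall = sim.max(axis=0).mean()      # each reference token -> best cand match
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Two candidate tokens match reference tokens exactly; one extra token diverges,
# so precision drops below 1.0 while recall stays perfect.
cand = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
ref  = np.array([[1.0, 0.0], [0.0, 1.0]])
p, r, f1 = bertscore_from_embeddings(cand, ref)
print(round(p, 3), round(r, 3))  # 0.902 1.0
```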

6. Classification Metrics

import evaluate
import numpy as np

accuracy  = evaluate.load("accuracy")
f1        = evaluate.load("f1")
precision = evaluate.load("precision")
recall    = evaluate.load("recall")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions    = np.argmax(logits, axis=-1)

    return {
        "accuracy":  accuracy.compute( predictions=predictions, references=labels)["accuracy"],
        "f1":        f1.compute(       predictions=predictions, references=labels, average="weighted")["f1"],
        "precision": precision.compute(predictions=predictions, references=labels, average="weighted")["precision"],
        "recall":    recall.compute(   predictions=predictions, references=labels, average="weighted")["recall"],
    }

# Pass to Trainer
trainer = Trainer(
    ...
    compute_metrics=compute_metrics,
)
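These weighted scores are easy to derive from confusion counts. A from-scratch check (pure NumPy) of what the loaded metrics compute — "weighted" averaging weights each class's score by its support:

```python
import numpy as np

def weighted_prf(predictions: np.ndarray, labels: np.ndarray) -> dict:
    classes, support = np.unique(labels, return_counts=True)
    precisions, recalls, f1s = [], [], []
    for c in classes:
        tp = np.sum((predictions == c) & (labels == c))
        fp = np.sum((predictions == c) & (labels != c))
        fn = np.sum((predictions != c) & (labels == c))
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(2 * p * r / (p + r) if p + r else 0.0)
    w = support / support.sum()   # weight each class by its frequency
    return {
        "accuracy": float(np.mean(predictions == labels)),
        "precision": float(np.dot(w, precisions)),
        "recall": float(np.dot(w, recalls)),
        "f1": float(np.dot(w, f1s)),
    }

print(weighted_prf(np.array([0, 1, 1, 0, 1]), np.array([0, 1, 0, 0, 1])))
```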

7. LLM-as-Judge

When there is no single correct answer (e.g. helpfulness, safety, creativity), use a strong LLM to score outputs.

from anthropic import Anthropic

client = Anthropic()

JUDGE_PROMPT = """You are an impartial evaluator.

Rate the following response on a scale of 1–5:
  1 = Completely wrong or harmful
  3 = Partially correct but incomplete  
  5 = Excellent — accurate, helpful, and well-structured

Question: {question}
Response: {response}

Return only a single integer (1-5)."""

def llm_judge(question: str, response: str) -> int:
    message = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=8,
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(question=question, response=response),
        }],
    )
    return int(message.content[0].text.strip())

scores = [llm_judge(q, r) for q, r in zip(questions, responses)]
print(f"Mean score: {sum(scores)/len(scores):.2f}/5")

Reduce position bias

LLM judges can favour the first response shown. For pairwise comparison, swap the order of A/B and average both judgements.
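A sketch of that debiasing trick, with the judge abstracted as a callable so it runs without an API key (judge_fn, debiased_pairwise, and the demo stub are hypothetical names, not part of any SDK): judge both orderings and average, so a judge that systematically favours one position no longer decides the winner.

```python
from typing import Callable

def debiased_pairwise(judge_fn: Callable[[str, str, str], float],
                      question: str, resp_a: str, resp_b: str) -> float:
    """Mean score for resp_a across both presentation orders.
    judge_fn(question, first, second) -> score in [0, 1] for `first`."""
    forward = judge_fn(question, resp_a, resp_b)          # A shown first
    backward = 1.0 - judge_fn(question, resp_b, resp_a)   # A shown second
    return (forward + backward) / 2

# Demo: a maximally position-biased judge that always prefers whichever
# response it sees first. After debiasing, the comparison is a tie:
biased_judge = lambda q, first, second: 1.0
print(debiased_pairwise(biased_judge, "Q?", "answer A", "answer B"))  # 0.5
```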


8. Running Standard Benchmarks

# pip install lm-eval
lm_eval --model hf \
        --model_args pretrained=meta-llama/Llama-3.2-1B \
        --tasks hellaswag,arc_easy,arc_challenge,winogrande,mmlu \
        --device cuda:0 \
        --batch_size 8 \
        --output_path results/llama-1b-benchmarks.json

Benchmark             What it tests
HellaSwag             Commonsense reasoning (sentence completion)
ARC Easy / Challenge  Science QA (easy and hard)
WinoGrande            Commonsense reasoning (pronoun resolution)
MMLU                  57-subject academic knowledge
TruthfulQA            Tendency to produce confident false statements
HumanEval             Python code generation
MT-Bench              Multi-turn instruction following (LLM-judged)

9. Evaluation Pipeline

from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from datasets import load_dataset
import evaluate, json, datetime

def run_evaluation(model_path: str, eval_dataset: str, output_path: str):
    tokenizer  = AutoTokenizer.from_pretrained(model_path)
    gen        = pipeline("text-generation", model=model_path, tokenizer=tokenizer, device=0)
    rouge      = evaluate.load("rouge")
    bertscore  = evaluate.load("bertscore")
    ds         = load_dataset("json", data_files=eval_dataset, split="train")

    predictions, references = [], []
    for example in ds:
        output = gen(example["prompt"], max_new_tokens=256, do_sample=False)[0]["generated_text"]
        predictions.append(output[len(example["prompt"]):].strip())
        references.append(example["reference"])

    results = {
        "model": model_path,
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "n_examples": len(ds),
        **rouge.compute(predictions=predictions, references=references),
        "bertscore_f1_mean": sum(
            bertscore.compute(predictions=predictions, references=references, lang="en")["f1"]
        ) / len(predictions),
    }

    with open(output_path, "w") as f:
        json.dump(results, f, indent=2)

    return results

results = run_evaluation(
    model_path="models/mistral-sft",
    eval_dataset="data/eval.jsonl",
    output_path="results/mistral-sft-eval.json",
)
print(results)

10. Evaluation Checklist

  • Held-out test set not seen during training or hyperparameter tuning
  • Multiple metrics used (at least 2–3)
  • LLM-as-Judge or human evaluation included for generation tasks
  • Baseline comparison (the base model before fine-tuning, a strong reference model such as GPT-4, or a trivial random/majority baseline)
  • Results are deterministic (do_sample=False or fixed seed)
  • Benchmark contamination checked if using public benchmarks
  • Results saved with model version and timestamp