
Part 2 — Training Methods & Process

Overview

This article covers the mechanics of training a neural network: the forward/backward pass, loss functions, optimizers, learning rate schedulers, mixed-precision training, gradient accumulation, and how to build a production-grade training loop with HuggingFace Trainer or raw PyTorch.


1. The Training Loop at a Glance

for each epoch:
    for each batch:
        1. Forward pass  → compute predictions
        2. Loss          → measure how wrong predictions are
        3. Backward pass → compute gradients (∂loss / ∂weights)
        4. Optimizer step→ update weights
        5. Zero grads    → clear gradients for next batch
In PyTorch, the same loop looks like this:

import torch
from torch import nn
from torch.optim import AdamW

model = MyModel()               # any nn.Module; placeholder name
optimizer = AdamW(model.parameters(), lr=3e-4)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(num_epochs):
    model.train()
    for batch in train_loader:
        inputs, labels = batch["input_ids"], batch["labels"]

        # 1 & 2 — forward + loss
        logits = model(inputs)
        loss = loss_fn(logits.view(-1, logits.size(-1)), labels.view(-1))

        # 3 — backward
        loss.backward()

        # 4 — update
        optimizer.step()

        # 5 — zero gradients
        optimizer.zero_grad()

2. Loss Functions

For causal language models, the loss is cross-entropy over the vocabulary at every position.

# Option A: let the model compute the loss.
# HuggingFace causal-LM models shift the labels internally, so passing
# labels=input_ids yields the next-token cross-entropy directly.
outputs = model(input_ids=input_ids, labels=input_ids)
loss = outputs.loss

# Option B: compute it yourself.
from torch.nn import CrossEntropyLoss

loss_fn = CrossEntropyLoss()
loss = loss_fn(logits, labels)   # logits: [B, num_classes], labels: [B]

# ignore_index=-100 excludes padded positions from the loss average
loss_fn = CrossEntropyLoss(ignore_index=-100)
loss = loss_fn(logits.view(-1, num_labels), labels.view(-1))
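The effect of ignore_index can be checked on a toy example: positions labelled -100 contribute nothing to the loss average. A minimal sketch (values chosen arbitrarily):

```python
import torch
from torch.nn import CrossEntropyLoss

logits = torch.tensor([[2.0, 0.5], [0.1, 1.5], [1.0, 1.0]])  # 3 positions, 2 classes
labels = torch.tensor([0, 1, -100])                          # last position is padding

masked = CrossEntropyLoss(ignore_index=-100)(logits, labels)
unmasked = CrossEntropyLoss()(logits[:2], labels[:2])        # drop the padded position by hand

# Both average over the two real positions only, so the losses match
print(torch.allclose(masked, unmasked))  # True
```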

3. Optimizers

AdamW — The Default for Transformers

AdamW decouples weight decay from the adaptive learning-rate update, applying decay directly to the weights instead of folding it into the gradient. This makes it noticeably more effective than plain Adam with L2 regularisation for fine-tuning transformers.

from torch.optim import AdamW

optimizer = AdamW(
    model.parameters(),
    lr=2e-5,
    betas=(0.9, 0.999),
    eps=1e-8,
    weight_decay=0.01,
)
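The decoupling is easy to verify numerically: with a zero gradient, AdamW's update reduces to pure decay, shrinking each weight by exactly lr × weight_decay × w per step (a toy check, hyperparameters chosen for easy arithmetic):

```python
import torch
from torch.optim import AdamW

w = torch.nn.Parameter(torch.tensor([1.0]))
opt = AdamW([w], lr=0.1, weight_decay=0.5)

w.grad = torch.zeros_like(w)  # zero gradient: only the decay term acts
opt.step()

# Decoupled decay: w <- w * (1 - lr * weight_decay) = 1.0 * (1 - 0.05)
print(round(w.item(), 4))  # 0.95
```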

Separate learning rates per layer group

Apply a lower LR to the pretrained backbone and a higher LR to the task head:

optimizer = AdamW([
    {"params": model.backbone.parameters(), "lr": 1e-5},
    {"params": model.classifier.parameters(), "lr": 1e-3},
])

8-bit Adam (Memory Efficient)

# pip install bitsandbytes
import bitsandbytes as bnb

optimizer = bnb.optim.AdamW8bit(
    model.parameters(),
    lr=2e-5,
    weight_decay=0.01,
)

4. Learning Rate Schedulers

The LR scheduler controls how the learning rate changes over time. A well-chosen schedule is crucial for convergence.

from transformers import get_scheduler

num_training_steps = num_epochs * len(train_loader)
num_warmup_steps   = int(0.06 * num_training_steps)   # 6% warmup is common

scheduler = get_scheduler(
    name="cosine",                         # see table below
    optimizer=optimizer,
    num_warmup_steps=num_warmup_steps,
    num_training_steps=num_training_steps,
)

# Call after optimizer.step()
scheduler.step()
| Schedule             | Behaviour                     | When to use      |
|----------------------|-------------------------------|------------------|
| linear               | Decays linearly to 0          | Safe default     |
| cosine               | Smooth cosine decay           | LLM training     |
| cosine_with_restarts | Cosine with periodic restarts | Long runs        |
| constant             | Fixed LR                      | Debugging        |
| constant_with_warmup | Warm-up, then fixed           | Short fine-tunes |
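The warmup-then-cosine shape is easy to reproduce with a plain LambdaLR, which is essentially what get_scheduler builds under the hood (a sketch of the schedule math, not the exact transformers implementation):

```python
import math
import torch
from torch.optim.lr_scheduler import LambdaLR

param = torch.nn.Parameter(torch.zeros(1))
optimizer = torch.optim.AdamW([param], lr=1.0)   # base LR of 1.0 so the multiplier is visible

warmup, total = 10, 100

def lr_lambda(step):
    if step < warmup:
        return step / max(1, warmup)                       # linear warmup: 0 -> 1
    progress = (step - warmup) / max(1, total - warmup)
    return 0.5 * (1.0 + math.cos(math.pi * progress))      # cosine decay: 1 -> 0

scheduler = LambdaLR(optimizer, lr_lambda)

lrs = []
for _ in range(total):
    lrs.append(scheduler.get_last_lr()[0])
    optimizer.step()
    scheduler.step()

print(lrs[0], max(lrs), lrs[-1])  # 0 at start, peaks at 1.0 right after warmup, near 0 at the end
```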

Visualising the Schedule

import matplotlib.pyplot as plt

lrs = []
for _ in range(num_training_steps):
    lrs.append(scheduler.get_last_lr()[0])
    optimizer.step()
    scheduler.step()

plt.plot(lrs)
plt.xlabel("Step")
plt.ylabel("Learning Rate")
plt.title("Cosine LR Schedule")
plt.show()

5. Mixed-Precision Training (fp16 / bf16)

Training in 16-bit roughly halves memory usage and increases throughput, usually with little to no accuracy loss when done correctly.

import torch
from torch.amp import GradScaler, autocast   # torch.cuda.amp is deprecated in recent PyTorch

scaler = GradScaler("cuda")   # only needed for fp16; bf16 doesn't need scaling

for batch in train_loader:
    optimizer.zero_grad()

    with autocast("cuda", dtype=torch.float16):
        outputs = model(**batch)
        loss = outputs.loss

    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)

    # Gradient clipping (crucial with mixed precision)
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

    scaler.step(optimizer)
    scaler.update()
    scheduler.step()

fp16 vs bf16

  • fp16: more mantissa bits but a narrow exponent range; needs GradScaler to stop small gradients underflowing to zero
  • bf16: same exponent range as fp32, no GradScaler needed. Preferred on Ampere and newer GPUs (A100, RTX 4090) and all TPUs.
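The range difference is visible on the CPU without any GPU: values that underflow or overflow in fp16 survive in bf16 (a toy illustration):

```python
import torch

tiny = torch.tensor(1e-8)
huge = torch.tensor(1e5)

print(tiny.to(torch.float16).item())   # 0.0: underflows (fp16 min subnormal is about 6e-8)
print(tiny.to(torch.bfloat16).item())  # ~1e-8: representable, bf16 shares fp32's exponent range
print(huge.to(torch.float16).item())   # inf: overflows (fp16 max is 65504)
print(huge.to(torch.bfloat16).item())  # ~1e5: fine in bf16
```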

6. Gradient Accumulation

Simulate larger batch sizes when GPU memory is limited.

ACCUMULATION_STEPS = 8   # effective batch = batch_size × 8
model.train()

optimizer.zero_grad()
for step, batch in enumerate(train_loader):

    with autocast(dtype=torch.bfloat16):
        outputs = model(**batch)
        loss = outputs.loss / ACCUMULATION_STEPS   # ← normalise

    loss.backward()

    if (step + 1) % ACCUMULATION_STEPS == 0:
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()

# Note: if len(train_loader) isn't divisible by ACCUMULATION_STEPS, the last
# few micro-batches leave unused gradients; either flush them with a final
# optimizer.step() after the loop or drop them.
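A quick sanity check that the 1/ACCUMULATION_STEPS normalisation reproduces the full-batch gradient exactly (toy linear model on CPU, names chosen for illustration):

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(4, 1)
x, y = torch.randn(8, 4), torch.randn(8, 1)
loss_fn = torch.nn.MSELoss()

# One full batch of 8
model.zero_grad()
loss_fn(model(x), y).backward()
full_grad = model.weight.grad.clone()

# Two micro-batches of 4, each loss divided by the number of accumulation steps
model.zero_grad()
for chunk in range(2):
    xb, yb = x[chunk * 4:(chunk + 1) * 4], y[chunk * 4:(chunk + 1) * 4]
    (loss_fn(model(xb), yb) / 2).backward()   # gradients accumulate across .backward() calls

print(torch.allclose(full_grad, model.weight.grad, atol=1e-6))  # True
```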

7. Gradient Clipping

Large gradients destabilise training. Clip them before every optimizer step.

# Clip by global norm — standard for transformers
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
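clip_grad_norm_ rescales all gradients jointly so their global L2 norm is at most max_norm, and returns the pre-clip norm (handy for logging). A toy check:

```python
import torch

p = torch.nn.Parameter(torch.zeros(2))
p.grad = torch.tensor([3.0, 4.0])        # global norm = 5

pre_clip = torch.nn.utils.clip_grad_norm_([p], max_norm=1.0)

print(pre_clip.item())  # 5.0, the norm before clipping
print(p.grad)           # tensor([0.6000, 0.8000]), rescaled to norm 1
```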

8. Production Training with HuggingFace Trainer

Trainer handles the training loop, mixed precision, gradient accumulation, checkpointing, and logging automatically.

from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,
    Trainer,
    DataCollatorForLanguageModeling,
)

model     = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

args = TrainingArguments(
    output_dir="checkpoints/gpt2-clm",

    # Batch & steps
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=8,        # effective batch = 32
    num_train_epochs=3,

    # Precision
    bf16=True,                            # use fp16=True on older GPUs

    # Optimisation
    learning_rate=2e-5,
    weight_decay=0.01,
    warmup_ratio=0.06,
    lr_scheduler_type="cosine",
    max_grad_norm=1.0,

    # Evaluation & checkpointing
    eval_strategy="epoch",                # "evaluation_strategy" in older transformers;
    save_strategy="epoch",                #   must match for load_best_model_at_end
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",

    # Logging
    logging_steps=50,
    report_to="wandb",                    # or "mlflow", "tensorboard", "none"
    run_name="gpt2-clm-run1",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_val,
    data_collator=collator,
)

trainer.train()
trainer.save_model("models/gpt2-clm-final")

9. Checkpointing & Resuming

# Save a checkpoint manually
trainer.save_model("checkpoints/epoch2")
tokenizer.save_pretrained("checkpoints/epoch2")

# Resume training from checkpoint
trainer.train(resume_from_checkpoint="checkpoints/epoch2")

Save the tokenizer alongside the model

Always save the tokenizer with the checkpoint. Mismatched tokenizers cause silent, hard-to-debug errors at inference time.


10. Monitoring Training Health

Watch these signals during training:

| Signal          | Healthy                  | Warning                                              |
|-----------------|--------------------------|------------------------------------------------------|
| train_loss      | Steadily decreasing      | Flat from step 1 (LR too low) or NaN                 |
| eval_loss       | Decreasing, tracks train | Diverges upward → overfitting                        |
| grad_norm       | < 5                      | Spikes → tighten clipping (lower max_norm), lower LR |
| learning_rate   | Follows schedule         | Drops to 0 too fast → reduce decay                   |
| GPU utilisation | > 80%                    | < 40% → increase batch size or DataLoader workers    |
# Quick NaN/Inf check after every backward pass (dev mode only)
if not torch.isfinite(loss):
    raise ValueError(f"Non-finite loss at step {step}. Last batch: {batch}")
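The grad_norm signal from the table above can be tracked in a few lines without clipping anything; `global_grad_norm` is a hypothetical helper name:

```python
import torch

def global_grad_norm(model: torch.nn.Module) -> float:
    """Global L2 norm over all parameter gradients: the number to log each step."""
    norms = [p.grad.norm(2) for p in model.parameters() if p.grad is not None]
    if not norms:
        return 0.0
    return torch.norm(torch.stack(norms), 2).item()

# Example: a single weight with gradient [3, 4] has global norm 5
m = torch.nn.Linear(2, 1, bias=False)
m.weight.grad = torch.tensor([[3.0, 4.0]])
print(global_grad_norm(m))  # 5.0
```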

11. Multi-GPU Training with Accelerate

# pip install accelerate
# accelerate config   ← run this once to set up your environment

from accelerate import Accelerator

# gradient_accumulation_steps is required for accelerator.accumulate() below
# to actually accumulate (it defaults to 1)
accelerator = Accelerator(mixed_precision="bf16", gradient_accumulation_steps=8)

model, optimizer, train_loader, scheduler = accelerator.prepare(
    model, optimizer, train_loader, scheduler
)

for batch in train_loader:
    with accelerator.accumulate(model):
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()

Launch with:

accelerate launch train.py
# or for 4 GPUs:
accelerate launch --num_processes 4 train.py