Part 2 — Training Methods & Process¶
Overview
This article covers the mechanics of training a neural network: the forward/backward pass, loss functions, optimizers, learning rate schedulers, mixed-precision training, gradient accumulation, and how to build a production-grade training loop with HuggingFace Trainer or raw PyTorch.
1. The Training Loop at a Glance¶
```
for each epoch:
    for each batch:
        1. Forward pass   → compute predictions
        2. Loss           → measure how wrong predictions are
        3. Backward pass  → compute gradients (∂loss / ∂weights)
        4. Optimizer step → update weights
        5. Zero grads     → clear gradients for next batch
```
```python
import torch
from torch import nn
from torch.optim import AdamW

model = MyModel()
optimizer = AdamW(model.parameters(), lr=3e-4)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(num_epochs):
    model.train()
    for batch in train_loader:
        inputs, labels = batch["input_ids"], batch["labels"]

        # 1 & 2 — forward + loss
        logits = model(inputs)
        loss = loss_fn(logits.view(-1, logits.size(-1)), labels.view(-1))

        # 3 — backward
        loss.backward()

        # 4 — update
        optimizer.step()

        # 5 — zero gradients
        optimizer.zero_grad()
```
2. Loss Functions¶
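For language models, cross-entropy over the vocabulary is the standard choice: it compares the predicted distribution at each position against the true token id. A minimal sketch (shapes and seed chosen purely for illustration) showing the causal-LM label shift:

```python
import torch
from torch import nn

torch.manual_seed(0)
loss_fn = nn.CrossEntropyLoss(ignore_index=-100)  # -100 masks out padding labels

batch, seq_len, vocab_size = 2, 8, 100
logits = torch.randn(batch, seq_len, vocab_size)           # model output
labels = torch.randint(0, vocab_size, (batch, seq_len))    # token ids

# For causal LM, shift so each position predicts the NEXT token
shift_logits = logits[:, :-1, :].contiguous()
shift_labels = labels[:, 1:].contiguous()

loss = loss_fn(shift_logits.view(-1, vocab_size), shift_labels.view(-1))
print(loss.item())  # for random logits, roughly ln(vocab_size) ≈ 4.6
```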
3. Optimizers¶
AdamW — The Default for Transformers¶
AdamW decouples weight decay from the adaptive learning rate, which makes it much more effective than plain Adam for fine-tuning.
```python
from torch.optim import AdamW

optimizer = AdamW(
    model.parameters(),
    lr=2e-5,
    betas=(0.9, 0.999),
    eps=1e-8,
    weight_decay=0.01,
)
```
Separate learning rates per layer group
Apply a lower LR to the pretrained backbone and a higher LR to the task head:
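One common pattern uses optimizer parameter groups, each with its own `lr`. The module names `backbone` and `head` below are placeholders for your own modules:

```python
import torch
from torch import nn
from torch.optim import AdamW

# Hypothetical model: a pretrained backbone plus a freshly initialised head
model = nn.ModuleDict({
    "backbone": nn.Linear(768, 768),
    "head": nn.Linear(768, 2),
})

optimizer = AdamW(
    [
        {"params": model["backbone"].parameters(), "lr": 1e-5},  # gentle updates
        {"params": model["head"].parameters(), "lr": 1e-3},      # faster updates
    ],
    weight_decay=0.01,  # applies to both groups unless overridden per group
)
```

Any key omitted from a group (here, `weight_decay`) falls back to the value passed to the optimizer itself.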
8-bit Adam (Memory Efficient)¶
8-bit optimizers store Adam's moment estimates in 8-bit instead of 32-bit, cutting optimizer-state memory roughly 4×.

```python
# pip install bitsandbytes
import bitsandbytes as bnb

optimizer = bnb.optim.AdamW8bit(
    model.parameters(),
    lr=2e-5,
    weight_decay=0.01,
)
```
4. Learning Rate Schedulers¶
The LR scheduler controls how the learning rate changes over time. A well-chosen schedule is crucial for convergence.
```python
from transformers import get_scheduler

num_training_steps = num_epochs * len(train_loader)
num_warmup_steps = int(0.06 * num_training_steps)  # 6% warmup is common

scheduler = get_scheduler(
    name="cosine",  # see table below
    optimizer=optimizer,
    num_warmup_steps=num_warmup_steps,
    num_training_steps=num_training_steps,
)

# Call after optimizer.step()
scheduler.step()
```
| Schedule | Behaviour | When to Use |
|---|---|---|
| `linear` | Decays to 0 linearly | Safe default |
| `cosine` | Smooth cosine decay | LLM training |
| `cosine_with_restarts` | Cosine with periodic restarts | Long runs |
| `constant` | Fixed LR | Debugging |
| `constant_with_warmup` | Warm-up then fixed | Short fine-tunes |
Visualising the Schedule¶
```python
import matplotlib.pyplot as plt
import torch
from transformers import get_scheduler

# Use a throwaway optimizer/scheduler so plotting doesn't consume
# the real training schedule.
dummy_opt = torch.optim.AdamW([torch.nn.Parameter(torch.zeros(1))], lr=2e-5)
dummy_sched = get_scheduler(
    "cosine",
    optimizer=dummy_opt,
    num_warmup_steps=num_warmup_steps,
    num_training_steps=num_training_steps,
)

lrs = []
for _ in range(num_training_steps):
    lrs.append(dummy_sched.get_last_lr()[0])
    dummy_opt.step()
    dummy_sched.step()

plt.plot(lrs)
plt.xlabel("Step")
plt.ylabel("Learning Rate")
plt.title("Cosine LR Schedule")
plt.show()
```
5. Mixed-Precision Training (fp16 / bf16)¶
Training in 16-bit roughly halves activation and weight memory and increases throughput, usually with little to no accuracy loss when done correctly.
```python
from torch.cuda.amp import GradScaler, autocast

scaler = GradScaler()  # only needed for fp16; bf16 doesn't need scaling

for batch in train_loader:
    optimizer.zero_grad()

    with autocast(dtype=torch.float16):
        outputs = model(**batch)
        loss = outputs.loss

    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)

    # Gradient clipping (crucial with mixed precision)
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

    scaler.step(optimizer)
    scaler.update()
    scheduler.step()
```
fp16 vs bf16

- **fp16**: more precision in the fraction (mantissa) but a narrow dynamic range; needs `GradScaler` to prevent gradient underflow
- **bf16**: same dynamic range as fp32, no `GradScaler` needed. Preferred on Ampere+ GPUs (A100, 4090) and all TPUs.
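A small sketch for picking the dtype at runtime, falling back to fp16 on older CUDA GPUs and to fp32 on CPU-only machines (the variable name `amp_dtype` is just illustrative):

```python
import torch

# Prefer bf16 where the hardware supports it; otherwise fall back.
if torch.cuda.is_available():
    amp_dtype = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16
else:
    amp_dtype = torch.float32

print(amp_dtype)
```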
6. Gradient Accumulation¶
Simulate larger batch sizes when GPU memory is limited.
```python
ACCUMULATION_STEPS = 8  # effective batch = batch_size × 8

model.train()
optimizer.zero_grad()

for step, batch in enumerate(train_loader):
    with autocast(dtype=torch.bfloat16):
        outputs = model(**batch)
        loss = outputs.loss / ACCUMULATION_STEPS  # ← normalise

    loss.backward()

    if (step + 1) % ACCUMULATION_STEPS == 0:
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
```
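The normalisation step matters: accumulating `loss / N` over N equal-size micro-batches yields the same gradients as one pass over the full batch. A toy check with a deterministically initialised linear layer (all names are illustrative):

```python
import torch
from torch import nn

torch.manual_seed(0)
x = torch.randn(8, 4)  # full batch of 8 examples
y = torch.randn(8, 1)

def grads(micro_batches):
    model = nn.Linear(4, 1)
    nn.init.constant_(model.weight, 0.5)  # deterministic init for comparison
    nn.init.constant_(model.bias, 0.0)
    n = len(micro_batches)
    for xb, yb in micro_batches:
        loss = nn.functional.mse_loss(model(xb), yb) / n  # normalise
        loss.backward()                                   # grads accumulate
    return model.weight.grad.clone()

full = grads([(x, y)])                           # one big batch
accum = grads([(x[:4], y[:4]), (x[4:], y[4:])])  # two micro-batches
print(torch.allclose(full, accum, atol=1e-6))    # gradients match
```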
7. Gradient Clipping¶
Large gradients destabilise training. Clip them before every optimizer step.
```python
# Clip by global norm — standard for transformers
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
```
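`clip_grad_norm_` rescales gradients in place and returns the pre-clip global norm, which is worth logging. A toy example with a hand-set gradient:

```python
import torch
from torch import nn

param = nn.Parameter(torch.zeros(3))
param.grad = torch.tensor([3.0, 4.0, 0.0])  # global norm = 5

total_norm = torch.nn.utils.clip_grad_norm_([param], max_norm=1.0)
print(total_norm.item())  # 5.0 — the norm BEFORE clipping
print(param.grad)         # rescaled to norm 1.0: ≈ [0.6, 0.8, 0.0]
```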
8. Production Training with HuggingFace Trainer¶
Trainer handles the training loop, mixed precision, gradient accumulation, checkpointing, and logging automatically.
```python
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,
    Trainer,
    DataCollatorForLanguageModeling,
)

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

args = TrainingArguments(
    output_dir="checkpoints/gpt2-clm",
    # Batch & steps
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=8,  # effective batch = 32
    num_train_epochs=3,
    # Precision
    bf16=True,                      # use fp16=True on older GPUs
    # Optimisation
    learning_rate=2e-5,
    weight_decay=0.01,
    warmup_ratio=0.06,
    lr_scheduler_type="cosine",
    max_grad_norm=1.0,
    # Checkpointing
    eval_strategy="epoch",          # must match save_strategy for best-model loading
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    # Logging
    logging_steps=50,
    report_to="wandb",              # or "mlflow", "tensorboard", "none"
    run_name="gpt2-clm-run1",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_val,
    data_collator=collator,
)

trainer.train()
trainer.save_model("models/gpt2-clm-final")
```
9. Checkpointing & Resuming¶
```python
# Save model weights manually
trainer.save_model("checkpoints/epoch2")
tokenizer.save_pretrained("checkpoints/epoch2")

# Resume training from the latest full Trainer checkpoint in output_dir.
# resume_from_checkpoint needs a directory written by the Trainer's own
# save_strategy (it contains optimizer/scheduler state); save_model alone doesn't.
trainer.train(resume_from_checkpoint=True)
```
Save the tokenizer alongside the model
Always save the tokenizer with the checkpoint. Mismatched tokenizers cause silent, hard-to-debug errors at inference time.
10. Monitoring Training Health¶
Watch these signals during training:
| Signal | Healthy | Warning |
|---|---|---|
| `train_loss` | Steadily decreasing | Flat from step 1 (LR too low) or NaN |
| `eval_loss` | Decreasing, tracks train | Diverges upward → overfitting |
| `grad_norm` | < 5 | Spikes → increase clipping, lower LR |
| `learning_rate` | Follows schedule | Drops to 0 too fast → reduce decay |
| GPU utilisation | > 80% | < 40% → increase batch or use DataLoader workers |
```python
# Quick NaN check after every backward pass (dev mode only)
if torch.isnan(loss):
    raise ValueError(f"NaN loss at step {step}. Last batch: {batch}")
```
11. Multi-GPU Training with Accelerate¶
```python
# pip install accelerate
# accelerate config  ← run this once to set up your environment
from accelerate import Accelerator

accelerator = Accelerator(mixed_precision="bf16")

model, optimizer, train_loader, scheduler = accelerator.prepare(
    model, optimizer, train_loader, scheduler
)

for batch in train_loader:
    with accelerator.accumulate(model):
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
```
Launch with:
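Assuming the loop above lives in a script named `train.py` (a placeholder name):

```shell
accelerate launch train.py

# Override parts of the saved config on the command line, e.g.:
accelerate launch --num_processes 2 train.py
```

`accelerate launch` reads the configuration created by `accelerate config` and spawns one process per device.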