Part 1 — Datasets¶
Overview
A model is only as good as the data it trains on. This article covers every stage of the dataset pipeline: collection, cleaning, deduplication, splitting, and tokenization — with runnable code for each step.
1. The Dataset Pipeline¶
Raw Sources (web, files, APIs)
│
▼
Collection & Loading
│
▼
Cleaning & Filtering
│
▼
Deduplication
│
▼
Train / Val / Test Split
│
▼
Tokenization & Encoding
│
▼
DataLoader → Training Loop
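Before reaching for any library, the whole pipeline fits in a few plain functions over a list of records. A minimal sketch with stub stages (stage names mirror the diagram; real implementations follow in the sections below):

```python
# Minimal pipeline sketch: every stage is a function from list[dict] to list[dict].
def clean(records):
    out = []
    for r in records:
        text = " ".join(r["text"].split())  # normalise whitespace
        if text:                            # drop empty documents
            out.append({**r, "text": text})
    return out

def deduplicate(records):
    seen, out = set(), []
    for r in records:
        key = r["text"].lower()
        if key not in seen:
            seen.add(key)
            out.append(r)
    return out

def split(records, val_frac=0.1):
    n_val = max(1, int(len(records) * val_frac))
    return records[n_val:], records[:n_val]  # (train, val)

raw = [{"text": "Hello   world"}, {"text": "hello world"}, {"text": "  "}, {"text": "Another doc"}]
train, val = split(deduplicate(clean(raw)))
print(len(train), len(val))  # 1 1
```

Each real stage below replaces one of these stubs; the shape of the pipeline stays the same.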
2. Loading Data¶
From HuggingFace Hub¶
from datasets import load_dataset
# Public dataset
ds = load_dataset("ag_news")
print(ds)
# DatasetDict({
# train: Dataset({features: ['text', 'label'], num_rows: 120000})
# test: Dataset({features: ['text', 'label'], num_rows: 7600})
# })
# Specific split
train_ds = load_dataset("ag_news", split="train")
From Local Files¶
from datasets import load_dataset
# CSV
ds = load_dataset("csv", data_files={"train": "data/train.csv", "test": "data/test.csv"})
# JSONL (one JSON object per line — most common format for LLM data)
ds = load_dataset("json", data_files="data/instructions.jsonl")
# Plain text (one document per line)
ds = load_dataset("text", data_files="data/corpus.txt")
From a Pandas DataFrame¶
from datasets import Dataset
import pandas as pd
df = pd.read_parquet("data/processed.parquet")
ds = Dataset.from_pandas(df)
Prefer JSONL for instruction data
JSONL is the de-facto format for fine-tuning datasets: each line is one complete, self-contained JSON example, so files can be streamed, appended to, and diffed easily.
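A round-trip with only the standard library shows why the format is convenient. The instruction/output field names here are illustrative (they match the chat-template examples later in this article):

```python
import json

examples = [
    {"instruction": "Summarise this article.", "output": "It covers the dataset pipeline."},
    {"instruction": "Translate 'hello' to French.", "output": "bonjour"},
]

# Write: one JSON object per line
with open("instructions.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex, ensure_ascii=False) + "\n")

# Read: parse line by line, no need to hold the whole file in memory
with open("instructions.jsonl") as f:
    loaded = [json.loads(line) for line in f]

print(loaded == examples)  # True
```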
3. Building Multimodal Datasets¶
Multimodal models — vision-language models (VLMs) like LLaVA or Qwen-VL, speech models like Whisper, and document-understanding models like Donut — require datasets that pair raw assets (images, PDFs, audio) with text. This section covers extracting each modality and structuring it into a training-ready Dataset.
Images¶
# pip install Pillow datasets
from PIL import Image
from datasets import Dataset, Image as HFImage
import os, json
def load_image_text_pairs(image_dir: str, annotations_jsonl: str) -> Dataset:
    """
    Expects one JSON object per line:
    {"file": "cat.jpg", "caption": "A tabby cat sitting on a mat"}
    """
    records = []
    with open(annotations_jsonl) as f:
        for line in f:
            ann = json.loads(line)
            records.append({
                "image": os.path.join(image_dir, ann["file"]),
                "text": ann["caption"],
            })
    # cast_column tells HuggingFace to lazy-load & decode images on access
    return Dataset.from_list(records).cast_column("image", HFImage())
ds = load_image_text_pairs("data/images/", "data/captions.jsonl")
print(ds[0]["image"]) # PIL Image object
For instruction-tuned VLMs, wrap each example in the model's chat template:
from transformers import AutoProcessor
processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")
def preprocess_vlm(example):
    conversation = [{
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": example["question"]},
        ],
    }]
    prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
    inputs = processor(images=example["image"], text=prompt, return_tensors="pt")
    inputs["labels"] = inputs["input_ids"].clone()
    return inputs
ds = ds.map(preprocess_vlm)
PDFs¶
PDFs are either text-based (selectable text) or scanned (image-only). Handle both:
# pip install pymupdf (imported as fitz)
import fitz # PyMuPDF
def extract_pdf_pages(pdf_path: str) -> list[dict]:
    """One record per page: selectable text + a rendered PNG of the page."""
    doc = fitz.open(pdf_path)
    pages = []
    for page_num, page in enumerate(doc):
        text = page.get_text("text").strip()
        # Render to image — useful for layout-aware models or scanned PDFs
        pix = page.get_pixmap(matrix=fitz.Matrix(2.0, 2.0))  # ~150 DPI
        pages.append({
            "source": pdf_path,
            "page": page_num,
            "text": text,
            "image": pix.tobytes("png"),  # raw PNG bytes
        })
    doc.close()
    return pages
For scanned PDFs that contain no selectable text, fall back to OCR:
# pip install pytesseract Pillow
import pytesseract
from PIL import Image
import io
def ocr_page(img_bytes: bytes) -> str:
    return pytesseract.image_to_string(Image.open(io.BytesIO(img_bytes)), lang="eng")
pages = extract_pdf_pages("report.pdf")
for p in pages:
    if len(p["text"].split()) < 20:  # likely scanned — no selectable text
        p["text"] = ocr_page(p["image"])
Build a dataset from a folder of PDFs:
import glob
from datasets import Dataset
all_pages = []
for pdf_path in glob.glob("data/pdfs/*.pdf"):
    all_pages.extend(extract_pdf_pages(pdf_path))
ds = Dataset.from_list(all_pages)
ds = ds.filter(lambda x: len(x["text"].split()) >= 20) # drop near-empty pages
Audio¶
# pip install torchaudio
import torchaudio
import torchaudio.transforms as T
import torch
def load_audio(path: str, target_sr: int = 16_000) -> tuple[torch.Tensor, int]:
    waveform, sr = torchaudio.load(path)
    if waveform.shape[0] > 1:  # stereo → mono
        waveform = waveform.mean(dim=0, keepdim=True)
    if sr != target_sr:
        waveform = T.Resample(sr, target_sr)(waveform)
    return waveform, target_sr

def extract_log_mel(waveform: torch.Tensor, sr: int = 16_000) -> torch.Tensor:
    mel = T.MelSpectrogram(sample_rate=sr, n_fft=1024, hop_length=160, n_mels=80)(waveform)
    return torch.log(mel.clamp(min=1e-9))  # [1, 80, time_frames]
For speech-to-text fine-tuning (e.g. Whisper), use the model's own feature extractor — it handles resampling, padding, and log-Mel internally:
from transformers import WhisperFeatureExtractor, WhisperTokenizer
from datasets import Dataset, Audio
feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-small")
tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-small", language="English", task="transcribe")
ds = Dataset.from_list([
    {"audio": "data/audio/clip_001.wav", "transcription": "Hello world"},
])
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))
def prepare_whisper(batch):
    audio = batch["audio"]["array"]
    batch["input_features"] = feature_extractor(
        audio, sampling_rate=16_000, return_tensors="pt"
    ).input_features[0]
    batch["labels"] = tokenizer(batch["transcription"]).input_ids
    return batch
ds = ds.map(prepare_whisper, remove_columns=ds.column_names)
Unified Multimodal JSONL Format¶
Keep heavy assets (images, audio, PDFs) as files on disk. Store only metadata and annotations in JSONL:
{"id": "img_001", "modality": "image", "file": "images/dog.jpg", "instruction": "Describe the animal.", "response": "A golden retriever playing fetch."}
{"id": "pdf_042", "modality": "pdf", "file": "docs/report.pdf", "page": 3, "instruction": "Summarise this page.", "response": "Q3 revenue grew 12% YoY..."}
{"id": "aud_007", "modality": "audio", "file": "audio/meeting.wav", "instruction": "Transcribe the speech.", "response": "Good morning everyone."}
Load and route by modality at training time:
from datasets import load_dataset
raw = load_dataset("json", data_files="data/multimodal.jsonl", split="train")
images = raw.filter(lambda x: x["modality"] == "image")
audio = raw.filter(lambda x: x["modality"] == "audio")
docs = raw.filter(lambda x: x["modality"] == "pdf")
Pack images for large-scale training
For datasets > 100k images, avoid millions of tiny files on disk. Pack them into WebDataset .tar shards or embed raw bytes directly in a Parquet column. Both formats support streaming and parallel loading without a filesystem bottleneck.
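A minimal packing sketch with only the standard library; WebDataset's own tooling is richer, but the shard layout is just a tar of NNN.jpg / NNN.txt pairs (the file names and bytes below are made up):

```python
import io
import tarfile

# Fake samples: (image bytes, caption); real code would read JPEG files from disk
samples = [
    (b"fake-jpeg-bytes-0", "A golden retriever"),
    (b"fake-jpeg-bytes-1", "A tabby cat"),
]

# Pack image + caption into one .tar shard, WebDataset-style: NNN.jpg / NNN.txt
with tarfile.open("shard-000000.tar", "w") as tar:
    for i, (img_bytes, caption) in enumerate(samples):
        for suffix, payload in ((".jpg", img_bytes), (".txt", caption.encode())):
            info = tarfile.TarInfo(name=f"{i:06d}{suffix}")
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))

# Read back sequentially, avoiding per-file filesystem overhead
with tarfile.open("shard-000000.tar") as tar:
    names = tar.getnames()
print(names)  # ['000000.jpg', '000000.txt', '000001.jpg', '000001.txt']
```

Samples that share a basename stay adjacent in the archive, which is what lets WebDataset reassemble them while streaming.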
4. Cleaning & Filtering¶
Raw data is always dirty. Cleaning is not optional.
Basic Filtering¶
from datasets import load_dataset
ds = load_dataset("json", data_files="raw.jsonl", split="train")
# Remove short examples
ds = ds.filter(lambda x: len(x["text"].split()) >= 50)
# Remove nulls
ds = ds.filter(lambda x: x["text"] is not None and x["output"] is not None)
# Normalise whitespace
def clean(example):
    example["text"] = " ".join(example["text"].split())
    return example
ds = ds.map(clean, num_proc=4)
Quality Heuristics¶
import re
def quality_filter(example):
    text = example["text"]
    # Reject if too many repeated characters
    if re.search(r"(.)\1{10,}", text):
        return False
    # Reject if word/char ratio looks like garbled text
    words = text.split()
    if not words:
        return False
    avg_word_len = sum(len(w) for w in words) / len(words)
    if avg_word_len > 15 or avg_word_len < 2:
        return False
    # Reject if too many non-ASCII characters (tune for your domain)
    non_ascii = sum(1 for c in text if ord(c) > 127)
    if non_ascii / len(text) > 0.3:
        return False
    return True
ds = ds.filter(quality_filter, num_proc=4)
Language Detection¶
# pip install langdetect
from langdetect import detect, LangDetectException
def is_english(example):
    try:
        return detect(example["text"]) == "en"
    except LangDetectException:
        return False
ds = ds.filter(is_english, num_proc=4)
5. Deduplication¶
Duplicates inflate metrics and waste compute. Near-duplicates are even more dangerous — they cause data leakage between train and test splits.
Exact Deduplication¶
import hashlib

seen = set()

def is_unique(example):
    # hashlib gives a stable key; Python's built-in hash() is salted per process
    key = hashlib.sha1(example["text"].strip().lower().encode("utf-8")).hexdigest()
    if key in seen:
        return False
    seen.add(key)
    return True

ds = ds.filter(is_unique)  # keep num_proc=1: the `seen` set is not shared across workers
MinHash Near-Deduplication¶
# pip install datasketch
from datasketch import MinHash, MinHashLSH
import re
def get_shingles(text, k=5):
    text = re.sub(r"\s+", " ", text.lower())
    return {text[i:i+k] for i in range(len(text) - k + 1)}

def build_minhash(shingles, num_perm=128):
    m = MinHash(num_perm=num_perm)
    for s in shingles:
        m.update(s.encode("utf-8"))
    return m

lsh = MinHashLSH(threshold=0.85, num_perm=128)
unique_indices = []
for i, example in enumerate(ds):
    shingles = get_shingles(example["text"])
    m = build_minhash(shingles)
    key = str(i)
    if not lsh.query(m):  # no similar doc found
        lsh.insert(key, m)
        unique_indices.append(i)
ds_deduped = ds.select(unique_indices)
print(f"Removed {len(ds) - len(ds_deduped)} near-duplicates")
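MinHash approximates the Jaccard similarity of the shingle sets, so threshold=0.85 is effectively a Jaccard cutoff. Computing it exactly on a toy pair builds intuition (a self-contained sketch with the same character shingling; note that even a one-word edit can land below a high threshold):

```python
import re

def shingles(text, k=5):
    # Same character shingling as get_shingles above
    text = re.sub(r"\s+", " ", text.lower())
    return {text[i:i+k] for i in range(len(text) - k + 1)}

a = "The quick brown fox jumps over the lazy dog."
b = "The quick brown fox jumped over the lazy dog."  # one-word edit

sa, sb = shingles(a), shingles(b)
jaccard = len(sa & sb) / len(sa | sb)
print(f"Jaccard: {jaccard:.2f}")  # high, but a single edit already costs several shingles
```

Tune the threshold on labelled duplicate pairs from your own data rather than trusting a default.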
Deduplicate before splitting
Always deduplicate the full dataset before creating train/val/test splits. If you split first, near-duplicates can land in both train and test, causing artificially high scores.
6. Train / Validation / Test Split¶
# 90% train, 5% val, 5% test
split = ds.train_test_split(test_size=0.1, seed=42)
val_test = split["test"].train_test_split(test_size=0.5, seed=42)
train_ds = split["train"]
val_ds = val_test["train"]
test_ds = val_test["test"]
print(f"Train: {len(train_ds)} | Val: {len(val_ds)} | Test: {len(test_ds)}")
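Whichever way you split, it is cheap to verify that no normalised text occurs in more than one split. A sketch over toy lists (with a Dataset, iterate each split's "text" column the same way):

```python
def norm(text):
    return " ".join(text.lower().split())

train_texts = ["The cat sat.", "Dogs bark loudly."]
val_texts = ["Birds sing."]
test_texts = ["the  cat  sat."]  # normalises to a duplicate of a train example

train_set = {norm(t) for t in train_texts}
leaks = [t for t in val_texts + test_texts if norm(t) in train_set]
print(leaks)  # ['the  cat  sat.']
```

Any non-empty `leaks` list means the dedup step above missed something; fix it before training.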
Stratified Split (Classification)¶
from sklearn.model_selection import train_test_split
from datasets import Dataset

df = train_ds.to_pandas()
train_df, val_df = train_test_split(
    df, test_size=0.1, stratify=df["label"], random_state=42
)
train_ds = Dataset.from_pandas(train_df, preserve_index=False)
val_ds = Dataset.from_pandas(val_df, preserve_index=False)
7. Tokenization¶
The tokenizer converts raw text into integer token IDs that the model consumes.
Basic Tokenization¶
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")
# Single example
tokens = tokenizer("Hello, world!", return_tensors="pt")
print(tokens)
# {'input_ids': tensor([[128000, 9906, 11, 1917, 0]]),
# 'attention_mask': tensor([[1, 1, 1, 1, 1]])}
# Decode back to text
print(tokenizer.decode(tokens["input_ids"][0]))
Batch Tokenization with Padding & Truncation¶
def tokenize(batch, max_length=512):
    return tokenizer(
        batch["text"],
        padding="max_length",  # pad short sequences to max_length
        truncation=True,       # cut sequences longer than max_length
        max_length=max_length,
        return_tensors=None,   # keep as Python lists for datasets
    )

tokenized_ds = train_ds.map(
    tokenize,
    batched=True,  # process many examples at once — much faster
    num_proc=4,
    remove_columns=["text"],
)
tokenized_ds.set_format("torch")
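What padding and truncation produce can be seen with a toy whitespace tokenizer (a sketch; real tokenizers use subword vocabularies, but input_ids and attention_mask have exactly this shape):

```python
def toy_tokenize(texts, vocab, max_length=6, pad_id=0):
    batch = {"input_ids": [], "attention_mask": []}
    for text in texts:
        ids = [vocab[w] for w in text.split()][:max_length]  # truncation
        mask = [1] * len(ids)
        ids += [pad_id] * (max_length - len(ids))            # padding
        mask += [0] * (max_length - len(mask))
        batch["input_ids"].append(ids)
        batch["attention_mask"].append(mask)
    return batch

vocab = {"the": 1, "cat": 2, "sat": 3, "on": 4, "a": 5, "mat": 6, "hi": 7}
out = toy_tokenize(["hi", "the cat sat on a mat"], vocab, max_length=4)
print(out["input_ids"])       # [[7, 0, 0, 0], [1, 2, 3, 4]]
print(out["attention_mask"])  # [[1, 0, 0, 0], [1, 1, 1, 1]]
```

The attention mask is what stops the model from attending to pad tokens.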
Causal LM Format (next-token prediction)¶
For language model pre-training or SFT, the labels are simply a copy of the input IDs; HuggingFace causal-LM models shift them one position internally when computing the loss, so you never shift them yourself.
def tokenize_clm(batch):
    tokenized = tokenizer(
        batch["text"],
        truncation=True,
        max_length=2048,
    )
    tokenized["labels"] = tokenized["input_ids"].copy()
    return tokenized

tokenized_ds = train_ds.map(tokenize_clm, batched=True, remove_columns=train_ds.column_names)
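To see what the internal shift does with labels that equal the input IDs, pair each position with the token it must predict (a pure-Python sketch using the token IDs from the earlier "Hello, world!" example):

```python
# labels is just a copy of input_ids; the shift happens inside the model's loss.
input_ids = [128000, 9906, 11, 1917, 0]  # ids from the "Hello, world!" example
labels = list(input_ids)                 # exactly what tokenize_clm assigns

# Internally: the logits at position t are scored against the label at t+1
pairs = list(zip(input_ids[:-1], labels[1:]))
print(pairs)  # [(128000, 9906), (9906, 11), (11, 1917), (1917, 0)]
```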
Instruction Format (Chat Template)¶
Modern instruction-tuned models expect a specific prompt format. Use the tokenizer's built-in chat template:
def format_instruction(example):
    messages = [
        {"role": "user", "content": example["instruction"]},
        {"role": "assistant", "content": example["output"]},
    ]
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=False,
    )
    return {"text": text}

formatted_ds = ds.map(format_instruction)
Only compute loss on the assistant turn
When fine-tuning, mask the loss on user tokens so the model only learns to predict the assistant's response. With trl's SFTTrainer, pass DataCollatorForCompletionOnlyLM as the data collator to get this masking.
8. DataLoader & VRAM Optimisation¶
Overloading GPU memory is the most common training crash. The techniques below work together — start with gradient accumulation and mixed precision, then layer on the rest as needed.
from torch.utils.data import DataLoader
from transformers import DataCollatorWithPadding
collator = DataCollatorWithPadding(tokenizer, return_tensors="pt")
train_loader = DataLoader(
    tokenized_ds,
    batch_size=16,
    shuffle=True,
    collate_fn=collator,
    num_workers=4,
    pin_memory=True,  # faster CPU→GPU transfer
)
# Inspect a batch
batch = next(iter(train_loader))
print(batch["input_ids"].shape) # [16, max_seq_len]
Gradient Accumulation¶
Simulate a larger effective batch size without fitting it all in VRAM at once.
from transformers import TrainingArguments
args = TrainingArguments(
    per_device_train_batch_size=4,  # what actually sits in VRAM
    gradient_accumulation_steps=8,  # effective batch = 4 × 8 = 32
    output_dir="out",
)
Gradients are accumulated over gradient_accumulation_steps mini-batches before a weight update. The model sees the same update as batch size 32, but peak VRAM usage is that of batch size 4.
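The equivalence is easy to check on a toy model y = w·x with squared-error loss, where gradients can be computed by hand (a sketch; each minibatch loss is scaled by 1/accumulation_steps, as Trainer does):

```python
# Toy model y_hat = w * x with squared-error loss; gradient has a closed form.
def grad(w, xs, ys):
    # d/dw mean((w*x - y)^2) = mean(2*x*(w*x - y))
    return sum(2 * x * (w * x - y) for x, y in zip(xs, ys)) / len(xs)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w = 0.5

full = grad(w, xs, ys)  # one gradient over the full batch of 4

accum_steps = 2
acc = 0.0
for i in range(accum_steps):  # two minibatches of size 2
    mb_x, mb_y = xs[i * 2:(i + 1) * 2], ys[i * 2:(i + 1) * 2]
    acc += grad(w, mb_x, mb_y) / accum_steps  # scale each minibatch loss
print(abs(full - acc) < 1e-12)  # True: same update, lower peak memory
```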
Mixed Precision (fp16 / bf16)¶
Halves the memory footprint of activations and gradients with minimal accuracy loss.
# bf16 is preferred on Ampere+ GPUs (A100, 3090, 4090); use fp16 on older hardware
args = TrainingArguments(
    bf16=True,  # or fp16=True
    per_device_train_batch_size=16,
    output_dir="out",
)
| Dtype | Memory | Stability | Best for |
|---|---|---|---|
| fp32 | 4 B/param | baseline | small models, debugging |
| fp16 | 2 B/param | can overflow | older GPUs (V100, T4) |
| bf16 | 2 B/param | stable | Ampere+ GPUs, LLMs |
Gradient Checkpointing¶
Trades compute for memory: activations are recomputed during the backward pass instead of stored. Typically cuts activation memory by ~60–70 % at a ~30 % speed cost.
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.gradient_checkpointing_enable()
# Or via TrainingArguments
args = TrainingArguments(
    gradient_checkpointing=True,
    output_dir="out",
)
Warning
Gradient checkpointing is incompatible with the KV cache used for generation (use_cache=True). Set model.config.use_cache = False during training; recent transformers versions disable the cache for you and log a warning.
Dynamic Padding / Sequence Bucketing¶
Padding every sequence to the global maximum length wastes memory. Pack sequences of similar length into the same batch so padding tokens are minimal.
from transformers import DataCollatorWithPadding

# Dynamic padding: pads only to the longest sequence in the batch
collator = DataCollatorWithPadding(tokenizer, padding="longest")

# Bucket-style: add a length column, sort by it, then batch
tokenized_ds = tokenized_ds.map(
    lambda x: {"input_ids_length": len(x["input_ids"])}
)
tokenized_ds = tokenized_ds.sort("input_ids_length")

train_loader = DataLoader(
    tokenized_ds,
    batch_size=16,
    collate_fn=collator,
    shuffle=False,  # must be False to preserve sorted order
)
Paged / 8-bit Optimizers (bitsandbytes)¶
Optimizer states (Adam momentum + variance) can consume 2× the model size in fp32. bitsandbytes quantises them to 8-bit and pages them to CPU RAM when not in use.
# pip install bitsandbytes
import bitsandbytes as bnb
optimizer = bnb.optim.PagedAdamW8bit(
    model.parameters(),
    lr=2e-5,
    weight_decay=0.01,
)
Alternatively, Trainer can build the optimizer for you via the optim field of TrainingArguments.
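A minimal sketch, assuming a transformers version recent enough that optim accepts "paged_adamw_8bit" and bitsandbytes is installed:

```python
from transformers import TrainingArguments

args = TrainingArguments(
    optim="paged_adamw_8bit",  # Trainer constructs the bitsandbytes optimizer
    learning_rate=2e-5,
    weight_decay=0.01,
    output_dir="out",
)
```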
Flash Attention 2¶
Reorders the attention computation to avoid materialising the full N×N attention matrix, cutting memory from O(N²) to O(N) and running faster on modern GPUs.
# pip install flash-attn --no-build-isolation
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
)
Note
Flash Attention 2 requires an Ampere or newer GPU (compute capability ≥ 8.0) and the model loaded in fp16 or bf16. For batched generation with padded inputs, set the tokenizer's padding_side="left".
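Quick arithmetic makes the O(N²) term concrete: the bf16 attention-score matrix alone, per head and per batch element, grows as follows (a sketch counting only that one buffer):

```python
def attn_matrix_mib(seq_len, bytes_per_el=2):
    # Standard attention materialises an N x N score matrix (bf16 = 2 bytes/element)
    return seq_len * seq_len * bytes_per_el / 2**20

for n in (2048, 8192, 32768):
    print(n, attn_matrix_mib(n), "MiB")
# 2048 8.0 MiB | 8192 128.0 MiB | 32768 2048.0 MiB
```

Multiply by the number of heads and the batch size and the quadratic term quickly dominates; Flash Attention never materialises this matrix at all.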
Quick Reference¶
| Technique | VRAM saving | Speed impact | When to use |
|---|---|---|---|
| Gradient accumulation | none (same peak) | neutral | fit large logical batch on small GPU |
| Mixed precision (bf16) | ~50 % | faster | almost always |
| Gradient checkpointing | ~60–70 % | −30 % | when activations OOM |
| Dynamic padding | varies | faster | variable-length text |
| 8-bit optimizer | ~75 % of optimizer state | slight | large models, limited RAM |
| Flash Attention 2 | O(N²)→O(N) | faster | long sequences, Ampere+ GPU |
9. Dataset Quality Checklist¶
- Removed nulls and empty strings
- Normalised whitespace and encoding (UTF-8)
- Applied domain-specific quality filters
- Deduplicated (exact + near-duplicate)
- Verified label distribution (classification) or output length distribution
- Split after deduplication
- Spot-checked 50–100 examples manually
- Stored final dataset in a versioned format (parquet or HuggingFace save_to_disk)
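For the last item, one lightweight convention is to write the export deterministically and record a content hash alongside it (a standard-library sketch; the same idea applies to parquet or save_to_disk output at the file level):

```python
import hashlib
import json

records = [{"text": "example one"}, {"text": "example two"}]

path = "dataset_v1.jsonl"
with open(path, "w") as f:
    for r in records:
        f.write(json.dumps(r, sort_keys=True) + "\n")  # sort_keys gives stable bytes

# A content hash pins this exact dataset version
with open(path, "rb") as f:
    digest = hashlib.sha256(f.read()).hexdigest()
with open(path + ".sha256", "w") as f:
    f.write(digest + "\n")
print(len(digest))  # 64
```

Logging the digest next to each training run makes "which data was this model trained on?" answerable later.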