
Part 1 — Datasets

Overview

A model is only as good as the data it trains on. This article covers every stage of the dataset pipeline: collection, cleaning, deduplication, splitting, and tokenization — with runnable code for each step.


1. The Dataset Pipeline

Raw Sources (web, files, APIs)
   → Collection & Loading
   → Cleaning & Filtering
   → Deduplication
   → Train / Val / Test Split
   → Tokenization & Encoding
   → DataLoader → Training Loop
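Before diving into each stage, the whole flow can be sketched as a dependency-free toy pipeline — field names and thresholds here are illustrative, not from any library:

```python
import random

def run_pipeline(records, seed=42):
    """Toy end-to-end pipeline over a list of {'text': ...} dicts."""
    # 1. Cleaning: drop nulls, normalise whitespace
    cleaned = [{"text": " ".join(r["text"].split())}
               for r in records if r.get("text")]
    # 2. Filtering: drop very short examples (threshold is illustrative)
    filtered = [r for r in cleaned if len(r["text"].split()) >= 3]
    # 3. Exact deduplication on normalised, lower-cased text
    seen, deduped = set(), []
    for r in filtered:
        key = r["text"].lower()
        if key not in seen:
            seen.add(key)
            deduped.append(r)
    # 4. Shuffled 80/10/10 train/val/test split
    rng = random.Random(seed)
    rng.shuffle(deduped)
    n = len(deduped)
    train = deduped[: int(n * 0.8)]
    val   = deduped[int(n * 0.8): int(n * 0.9)]
    test  = deduped[int(n * 0.9):]
    return train, val, test
```

The rest of the article replaces each toy stage with its production counterpart.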

2. Loading Data

From HuggingFace Hub

from datasets import load_dataset

# Public dataset
ds = load_dataset("ag_news")
print(ds)
# DatasetDict({
#     train: Dataset({features: ['text', 'label'], num_rows: 120000})
#     test:  Dataset({features: ['text', 'label'], num_rows: 7600})
# })

# Specific split
train_ds = load_dataset("ag_news", split="train")

From Local Files

from datasets import load_dataset

# CSV
ds = load_dataset("csv", data_files={"train": "data/train.csv", "test": "data/test.csv"})

# JSONL (one JSON object per line — most common format for LLM data)
ds = load_dataset("json", data_files="data/instructions.jsonl")

# Plain text (one document per line)
ds = load_dataset("text", data_files="data/corpus.txt")

From a Pandas DataFrame

from datasets import Dataset
import pandas as pd

df = pd.read_parquet("data/processed.parquet")
ds = Dataset.from_pandas(df)

Prefer JSONL for instruction data

JSONL is the de-facto format for fine-tuning datasets. Each line is a complete example:

{"instruction": "Summarize this text", "input": "...", "output": "..."}
{"instruction": "Translate to French", "input": "Hello", "output": "Bonjour"}
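A JSONL file is just `json.dumps` once per line. A minimal round-trip sketch (the temp-file path is for illustration only):

```python
import json, os, tempfile

examples = [
    {"instruction": "Summarize this text", "input": "...", "output": "..."},
    {"instruction": "Translate to French", "input": "Hello", "output": "Bonjour"},
]

path = os.path.join(tempfile.gettempdir(), "instructions.jsonl")

# Write: one JSON object per line, newline-terminated
with open(path, "w", encoding="utf-8") as f:
    for ex in examples:
        f.write(json.dumps(ex, ensure_ascii=False) + "\n")

# Read back, parsing each line independently — a corrupt line
# breaks only that example, not the whole file
with open(path, encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]
```

Line-by-line parsing is exactly why JSONL beats a single JSON array for large datasets: it streams, appends, and shards trivially.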


3. Building Multimodal Datasets

Multimodal models — vision-language models (VLMs) like LLaVA or Qwen-VL, speech models like Whisper, and document-understanding models like Donut — require datasets that pair raw assets (images, PDFs, audio) with text. This section covers extracting each modality and structuring it into a training-ready Dataset.

Images

# pip install Pillow datasets
from PIL import Image
from datasets import Dataset, Image as HFImage
import os, json

def load_image_text_pairs(image_dir: str, annotations_jsonl: str) -> Dataset:
    """
    Expects one JSON object per line:
      {"file": "cat.jpg", "caption": "A tabby cat sitting on a mat"}
    """
    records = []
    with open(annotations_jsonl) as f:
        for line in f:
            ann = json.loads(line)
            records.append({
                "image": os.path.join(image_dir, ann["file"]),
                "text":  ann["caption"],
            })
    # cast_column tells HuggingFace to lazy-load & decode images on access
    return Dataset.from_list(records).cast_column("image", HFImage())

ds = load_image_text_pairs("data/images/", "data/captions.jsonl")
print(ds[0]["image"])   # PIL Image object

For instruction-tuned VLMs, wrap each example in the model's chat template:

from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")

def preprocess_vlm(example):
    # Assumes VQA-style fields "question" and "answer" on each example
    conversation = [
        {
            "role": "user",
            "content": [
                {"type": "image"},
                {"type": "text", "text": example["question"]},
            ],
        },
        {
            "role": "assistant",
            "content": [{"type": "text", "text": example["answer"]}],
        },
    ]
    # The target answer is already in the sequence, so no generation prompt
    prompt = processor.apply_chat_template(conversation, add_generation_prompt=False)
    inputs = processor(images=example["image"], text=prompt, return_tensors="pt")
    # Simplest labelling: loss over the full sequence. Mask the user turn
    # with -100 to train on the answer only (see the tokenization section).
    inputs["labels"] = inputs["input_ids"].clone()
    return inputs

ds = ds.map(preprocess_vlm)

PDFs

PDFs are either text-based (selectable text) or scanned (image-only). Handle both:

# pip install pymupdf   (imported as fitz)
import fitz  # PyMuPDF

def extract_pdf_pages(pdf_path: str) -> list[dict]:
    """One record per page: selectable text + a rendered PNG of the page."""
    doc = fitz.open(pdf_path)
    pages = []
    for page_num, page in enumerate(doc):
        text = page.get_text("text").strip()

        # Render to image — useful for layout-aware models or scanned PDFs
        pix = page.get_pixmap(matrix=fitz.Matrix(2.0, 2.0))  # 2x zoom ≈ 144 DPI
        pages.append({
            "source": pdf_path,
            "page":   page_num,
            "text":   text,
            "image":  pix.tobytes("png"),   # raw PNG bytes
        })
    doc.close()
    return pages

For scanned PDFs that contain no selectable text, fall back to OCR:

# pip install pytesseract Pillow
import pytesseract
from PIL import Image
import io

def ocr_page(img_bytes: bytes) -> str:
    return pytesseract.image_to_string(Image.open(io.BytesIO(img_bytes)), lang="eng")

pages = extract_pdf_pages("report.pdf")
for p in pages:
    if len(p["text"].split()) < 20:     # likely scanned — no selectable text
        p["text"] = ocr_page(p["image"])

Build a dataset from a folder of PDFs:

import glob
from datasets import Dataset

all_pages = []
for pdf_path in glob.glob("data/pdfs/*.pdf"):
    all_pages.extend(extract_pdf_pages(pdf_path))

ds = Dataset.from_list(all_pages)
ds = ds.filter(lambda x: len(x["text"].split()) >= 20)  # drop near-empty pages

Audio

# pip install torchaudio
import torchaudio
import torchaudio.transforms as T
import torch

def load_audio(path: str, target_sr: int = 16_000) -> tuple[torch.Tensor, int]:
    waveform, sr = torchaudio.load(path)
    if waveform.shape[0] > 1:                        # stereo → mono
        waveform = waveform.mean(dim=0, keepdim=True)
    if sr != target_sr:
        waveform = T.Resample(sr, target_sr)(waveform)
    return waveform, target_sr

def extract_log_mel(waveform: torch.Tensor, sr: int = 16_000) -> torch.Tensor:
    mel = T.MelSpectrogram(sample_rate=sr, n_fft=1024, hop_length=160, n_mels=80)(waveform)
    return torch.log(mel.clamp(min=1e-9))   # [1, 80, time_frames]

For speech-to-text fine-tuning (e.g. Whisper), use the model's own feature extractor — it handles resampling, padding, and log-Mel internally:

from transformers import WhisperFeatureExtractor, WhisperTokenizer
from datasets import Dataset, Audio

feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-small")
tokenizer         = WhisperTokenizer.from_pretrained("openai/whisper-small", language="English")

ds = Dataset.from_list([
    {"audio": "data/audio/clip_001.wav", "transcription": "Hello world"},
])
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))

def prepare_whisper(batch):
    audio = batch["audio"]["array"]
    batch["input_features"] = feature_extractor(
        audio, sampling_rate=16_000, return_tensors="pt"
    ).input_features[0]
    batch["labels"] = tokenizer(batch["transcription"]).input_ids
    return batch

ds = ds.map(prepare_whisper, remove_columns=ds.column_names)

Unified Multimodal JSONL Format

Keep heavy assets (images, audio, PDFs) as files on disk. Store only metadata and annotations in JSONL:

{"id": "img_001", "modality": "image", "file": "images/dog.jpg",    "instruction": "Describe the animal.",       "response": "A golden retriever playing fetch."}
{"id": "pdf_042", "modality": "pdf",   "file": "docs/report.pdf",   "page": 3, "instruction": "Summarise this page.", "response": "Q3 revenue grew 12% YoY..."}
{"id": "aud_007", "modality": "audio", "file": "audio/meeting.wav", "instruction": "Transcribe the speech.",      "response": "Good morning everyone."}

Load and route by modality at training time:

from datasets import load_dataset

raw    = load_dataset("json", data_files="data/multimodal.jsonl", split="train")
images = raw.filter(lambda x: x["modality"] == "image")
audio  = raw.filter(lambda x: x["modality"] == "audio")
docs   = raw.filter(lambda x: x["modality"] == "pdf")

Pack images for large-scale training

For datasets > 100k images, avoid millions of tiny files on disk. Pack them into WebDataset .tar shards or embed raw bytes directly in a Parquet column. Both formats support streaming and parallel loading without a filesystem bottleneck.


4. Cleaning & Filtering

Raw data is always dirty. Cleaning is not optional.

Basic Filtering

from datasets import load_dataset

ds = load_dataset("json", data_files="raw.jsonl", split="train")

# Remove nulls first, so later filters can assume text exists
ds = ds.filter(lambda x: x["text"] is not None and x["output"] is not None)

# Remove short examples
ds = ds.filter(lambda x: len(x["text"].split()) >= 50)

# Normalise whitespace
def clean(example):
    example["text"] = " ".join(example["text"].split())
    return example

ds = ds.map(clean, num_proc=4)

Quality Heuristics

import re

def quality_filter(example):
    text = example["text"]

    # Reject if too many repeated characters
    if re.search(r"(.)\1{10,}", text):
        return False

    # Reject if word/char ratio looks like garbled text
    words = text.split()
    if not words:
        return False
    avg_word_len = sum(len(w) for w in words) / len(words)
    if avg_word_len > 15 or avg_word_len < 2:
        return False

    # Reject if too many non-ASCII characters (tune for your domain)
    non_ascii = sum(1 for c in text if ord(c) > 127)
    if non_ascii / len(text) > 0.3:
        return False

    return True

ds = ds.filter(quality_filter, num_proc=4)

Language Detection

# pip install langdetect
from langdetect import detect, LangDetectException

def is_english(example):
    try:
        return detect(example["text"]) == "en"
    except LangDetectException:
        return False

ds = ds.filter(is_english, num_proc=4)

5. Deduplication

Duplicates inflate metrics and waste compute. Near-duplicates are even more dangerous — they cause data leakage between train and test splits.

Exact Deduplication

import hashlib

seen = set()

def is_unique(example):
    # hashlib gives a digest that is stable across runs
    # (the built-in hash() is salted per process)
    key = hashlib.md5(example["text"].strip().lower().encode("utf-8")).hexdigest()
    if key in seen:
        return False
    seen.add(key)
    return True

ds = ds.filter(is_unique)   # stateful closure: keep num_proc=1 here

MinHash Near-Deduplication

# pip install datasketch
from datasketch import MinHash, MinHashLSH
import re

def get_shingles(text, k=5):
    text = re.sub(r"\s+", " ", text.lower())
    return {text[i:i+k] for i in range(len(text) - k + 1)}

def build_minhash(shingles, num_perm=128):
    m = MinHash(num_perm=num_perm)
    for s in shingles:
        m.update(s.encode("utf-8"))
    return m

lsh = MinHashLSH(threshold=0.85, num_perm=128)
unique_indices = []

for i, example in enumerate(ds):
    shingles = get_shingles(example["text"])
    m = build_minhash(shingles)
    key = str(i)
    if not lsh.query(m):          # no similar doc found
        lsh.insert(key, m)
        unique_indices.append(i)

ds_deduped = ds.select(unique_indices)
print(f"Removed {len(ds) - len(ds_deduped)} near-duplicates")

Deduplicate before splitting

Always deduplicate the full dataset before creating train/val/test splits. If you split first, near-duplicates can land in both train and test, causing artificially high scores.


6. Train / Validation / Test Split

# 90% train, 5% val, 5% test
split = ds.train_test_split(test_size=0.1, seed=42)
val_test = split["test"].train_test_split(test_size=0.5, seed=42)

train_ds = split["train"]
val_ds   = val_test["train"]
test_ds  = val_test["test"]

print(f"Train: {len(train_ds)} | Val: {len(val_ds)} | Test: {len(test_ds)}")

Stratified Split (Classification)

from sklearn.model_selection import train_test_split
from datasets import Dataset
import pandas as pd

df = train_ds.to_pandas()
train_df, val_df = train_test_split(
    df, test_size=0.1, stratify=df["label"], random_state=42
)
train_ds = Dataset.from_pandas(train_df)
val_ds   = Dataset.from_pandas(val_df)

7. Tokenization

The tokenizer converts raw text into integer token IDs that the model consumes.

Basic Tokenization

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")

# Single example
tokens = tokenizer("Hello, world!", return_tensors="pt")
print(tokens)
# {'input_ids': tensor([[128000,   9906,    11,   1917,      0]]),
#  'attention_mask': tensor([[1, 1, 1, 1, 1]])}

# Decode back to text
print(tokenizer.decode(tokens["input_ids"][0]))

Batch Tokenization with Padding & Truncation

def tokenize(batch, max_length=512):
    return tokenizer(
        batch["text"],
        padding="max_length",       # pad short sequences to max_length
        truncation=True,            # cut sequences longer than max_length
        max_length=max_length,
        return_tensors=None,        # keep as Python lists for datasets
    )

tokenized_ds = train_ds.map(
    tokenize,
    batched=True,         # process many examples at once — much faster
    num_proc=4,
    remove_columns=["text"],
)
tokenized_ds.set_format("torch")

Causal LM Format (next-token prediction)

For language model pre-training or SFT, the labels are a copy of the input IDs — the model shifts them by one position internally when computing the next-token loss.

def tokenize_clm(batch):
    tokenized = tokenizer(
        batch["text"],
        truncation=True,
        max_length=2048,
    )
    tokenized["labels"] = tokenized["input_ids"].copy()
    return tokenized

tokenized_ds = train_ds.map(tokenize_clm, batched=True, remove_columns=train_ds.column_names)

Instruction Format (Chat Template)

Modern instruction-tuned models expect a specific prompt format. Use the tokenizer's built-in chat template:

def format_instruction(example):
    messages = [
        {"role": "user",      "content": example["instruction"]},
        {"role": "assistant", "content": example["output"]},
    ]
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=False,
    )
    return {"text": text}

formatted_ds = ds.map(format_instruction)

Only compute loss on the assistant turn

When fine-tuning, mask the loss on user tokens so the model only learns to predict the assistant's response. trl's SFTTrainer does this automatically with DataCollatorForCompletionOnlyLM.
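If you are not using trl, the same masking can be done by hand: concatenate prompt and response token IDs and set the prompt positions of `labels` to -100, the index PyTorch's cross-entropy ignores. A dependency-free sketch with stand-in token IDs:

```python
IGNORE_INDEX = -100  # positions with this label are skipped by cross-entropy

def mask_prompt_tokens(prompt_ids, response_ids):
    """Build (input_ids, labels) so loss is computed only on the response."""
    input_ids = list(prompt_ids) + list(response_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return input_ids, labels

# e.g. prompt tokens [1, 2, 3], response tokens [4, 5]
ids, labels = mask_prompt_tokens([1, 2, 3], [4, 5])
# ids    == [1, 2, 3, 4, 5]
# labels == [-100, -100, -100, 4, 5]
```

In practice you tokenize the prompt alone to find its length, then apply this mask to the tokenized full conversation.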


8. DataLoader & VRAM Optimisation

Running out of GPU memory is the most common cause of training crashes. The techniques below work together — start with gradient accumulation and mixed precision, then layer on the rest as needed.

from torch.utils.data import DataLoader
from transformers import DataCollatorWithPadding

collator = DataCollatorWithPadding(tokenizer, return_tensors="pt")

train_loader = DataLoader(
    tokenized_ds,
    batch_size=16,
    shuffle=True,
    collate_fn=collator,
    num_workers=4,
    pin_memory=True,   # faster GPU transfer
)

# Inspect a batch
batch = next(iter(train_loader))
print(batch["input_ids"].shape)  # [16, max_seq_len]

Gradient Accumulation

Simulate a larger effective batch size without fitting it all in VRAM at once.

from transformers import TrainingArguments

args = TrainingArguments(
    per_device_train_batch_size=4,   # what actually sits in VRAM
    gradient_accumulation_steps=8,  # effective batch = 4 × 8 = 32
    output_dir="out",
)

Gradients are summed over gradient_accumulation_steps mini-batches before a weight update. The model sees the same update as batch size 32, but peak VRAM usage is that of batch size 4.
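That equivalence is easy to verify with plain numbers: for a loss that is a mean over examples, summing per-step gradients each scaled by 1/accumulation_steps reproduces the full-batch gradient. A dependency-free check using an analytic gradient:

```python
def grad(w, batch):
    """Gradient of the mean squared loss 0.5 * (w - x)^2 over a batch."""
    return sum(w - x for x in batch) / len(batch)

data = [float(i) for i in range(32)]   # 32 toy training examples
w = 0.5

# One full batch of 32
full_grad = grad(w, data)

# 8 accumulation steps of mini-batch 4, each step's gradient scaled by 1/8
accum_steps = 8
acc_grad = 0.0
for step in range(accum_steps):
    mini = data[step * 4:(step + 1) * 4]
    acc_grad += grad(w, mini) / accum_steps

# full_grad == acc_grad == -15.0 for this data
```

This is why training loops divide the loss by `gradient_accumulation_steps` before each `backward()` call — the summed gradients then match the large-batch update exactly.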


Mixed Precision (fp16 / bf16)

Halves the memory footprint of activations and gradients with minimal accuracy loss.

# bf16 is preferred on Ampere+ GPUs (A100, 3090, 4090); use fp16 on older hardware
args = TrainingArguments(
    bf16=True,          # or fp16=True
    per_device_train_batch_size=16,
    output_dir="out",
)

Dtype   Memory      Stability      Best for
fp32    4 B/param   baseline       small models, debugging
fp16    2 B/param   can overflow   older GPUs (V100, T4)
bf16    2 B/param   stable         Ampere+ GPUs, LLMs

Gradient Checkpointing

Trades compute for memory: activations are recomputed during the backward pass instead of stored. Typically cuts activation memory by ~60–70 % at a ~30 % speed cost.

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")
model.gradient_checkpointing_enable()

# Or via TrainingArguments
args = TrainingArguments(
    gradient_checkpointing=True,
    output_dir="out",
)

Warning

Gradient checkpointing is incompatible with the KV cache used for generation — transformers warns and forces use_cache=False when both are enabled. Call model.train() before the training loop so activations are actually recomputed with gradients enabled.


Dynamic Padding / Sequence Bucketing

Padding every sequence to the global maximum length wastes memory. Pack sequences of similar length into the same batch so padding tokens are minimal.

from transformers import DataCollatorWithPadding

# Dynamic padding: pads only to the longest sequence in the batch
collator = DataCollatorWithPadding(tokenizer, padding="longest")

# Bucket-style: add a length column, then sort so each batch
# holds sequences of similar length
tokenized_ds = tokenized_ds.map(
    lambda x: {"input_ids_length": len(x["input_ids"])}
)
tokenized_ds = tokenized_ds.sort("input_ids_length")

train_loader = DataLoader(
    tokenized_ds,
    batch_size=16,
    collate_fn=collator,
    shuffle=False,   # must be False to preserve sorted order
)

Paged / 8-bit Optimizers (bitsandbytes)

Optimizer states (Adam momentum + variance) can consume 2× the model size in fp32. bitsandbytes quantises them to 8-bit and pages them to CPU RAM when not in use.

# pip install bitsandbytes
import bitsandbytes as bnb

optimizer = bnb.optim.PagedAdamW8bit(
    model.parameters(),
    lr=2e-5,
    weight_decay=0.01,
)

Or pass it directly to Trainer:

args = TrainingArguments(
    optim="paged_adamw_8bit",
    output_dir="out",
)

Flash Attention 2

Reorders the attention computation to avoid materialising the full N×N attention matrix, cutting memory from O(N²) to O(N) and running faster on modern GPUs.

# pip install flash-attn --no-build-isolation
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
)

Note

Flash Attention 2 requires an Ampere or newer GPU (compute capability ≥ 8.0) and fp16/bf16 weights. For batched generation with padded inputs, set padding_side="left" on the tokenizer.


Quick Reference

Technique                VRAM saving                Speed impact   When to use
Gradient accumulation    none (same peak)           neutral        fit large logical batch on small GPU
Mixed precision (bf16)   ~50 %                      faster         almost always
Gradient checkpointing   ~60–70 % of activations    −30 %          when activations OOM
Dynamic padding          varies                     faster         variable-length text
8-bit optimizer          ~75 % of optimizer state   slight         large models, limited RAM
Flash Attention 2        O(N²) → O(N) attention     faster         long sequences, Ampere+ GPU

9. Dataset Quality Checklist

  • Removed nulls and empty strings
  • Normalised whitespace and encoding (UTF-8)
  • Applied domain-specific quality filters
  • Deduplicated (exact + near-duplicate)
  • Verified label distribution (classification) or output length distribution
  • Split after deduplication
  • Spot-checked 50–100 examples manually
  • Stored final dataset in a versioned format (parquet or HuggingFace save_to_disk)

# Save for reproducibility
tokenized_ds.save_to_disk("data/tokenized_train")

# Reload later
from datasets import load_from_disk
tokenized_ds = load_from_disk("data/tokenized_train")