Prepare a fine-tuning dataset

Prepare and validate an instruction fine-tuning dataset: format conversion, validation, train/val split, and JSONL output. No GPU required.

What you'll build

A small dataset prep script that does the boring part correctly.

Training is the flashy part, but this is where most of the real leverage sits. In this lab you take a handful of raw question-and-answer pairs, convert them into fine-tuning formats, validate them, split them into train and validation sets, and write them out as JSONL.

That sounds modest, and it is. It is also the part that determines whether the later training run is learning from clean examples or from quietly broken data.

Run it

cd ai_ecosystem_labs
python3 18-finetune-prep/finetune_prep.py

Starting here? Quick setup

git clone https://github.com/BanditF/ai_ecosystem_labs
cd ai_ecosystem_labs
python3 18-finetune-prep/finetune_prep.py

Requires Python 3. No GPU or extra packages needed.

Time guide. Setup: ~2 min. Working through it: 25–45 min, since dataset cleanup and validation are a little more involved than a toy one-file demo.

Why this piece exists

Fine-tuning fails quietly when the dataset shape is sloppy.

A lot of fine-tuning frustration is not really about the optimizer, rank setting, or GPU. It starts earlier. Fields are missing, message order is inconsistent, assistant responses are empty, or the train and validation split was never created cleanly enough to compare outcomes.

This lab keeps the scope narrow on purpose. You are not training a model here. You are learning the pipeline shape that has to exist before training is even worth paying for.

Walk-through

Two common fine-tuning formats

OpenAI JSONL

The script converts each raw Q/A pair into the chat-style JSONL shape used by OpenAI fine-tuning. Each line is one JSON object with a messages array containing system, user, and assistant turns.

{"messages": [{"role": "system", "content": "..."}, {"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
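As a rough illustration of that shape, here is a minimal sketch of what a converter like to_openai_format() might look like. The signature, the default system prompt, and the sample Q/A pair are assumptions for illustration, not the lab's actual code.

```python
import json

# Assumed placeholder; the lab's real system prompt is not shown on this page.
DEFAULT_SYSTEM_PROMPT = "You are a helpful assistant."


def to_openai_format(question, answer, system_prompt=DEFAULT_SYSTEM_PROMPT):
    """Wrap one Q/A pair in the chat-style messages structure."""
    return {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question},
            {"role": "assistant", "content": answer},
        ]
    }


record = to_openai_format("What is the capital of France?", "Paris.")

# JSONL means one JSON object per line, so each record is serialized
# without embedded newlines.
line = json.dumps(record, ensure_ascii=False)
print(line)
```

Keeping the system prompt as a parameter makes it easy to check that every record carries the same one.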

Alpaca format

The script also writes a second view of the same raw examples in Alpaca format. That format is common across open-source training stacks and keeps the record flatter: instruction, input, and output.

{"instruction": "What is the capital of France?", "input": "", "output": "Paris."}
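The Alpaca conversion can be sketched the same way. The function name matches the one this lab mentions, but the signature and the optional context argument are assumptions.

```python
import json


def to_alpaca_format(question, answer, context=""):
    """Flatten one Q/A pair into instruction / input / output fields.

    `context` is a hypothetical parameter: plain Q/A pairs have no extra
    input, so it defaults to an empty string.
    """
    return {"instruction": question, "input": context, "output": answer}


alpaca_record = to_alpaca_format("What is the capital of France?", "Paris.")
alpaca_line = json.dumps(alpaca_record)
print(alpaca_line)
```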

The code

finetune_prep.py

Walk through it

Four things worth noticing.

Format conversion is explicit

to_openai_format() and to_alpaca_format() keep the conversion logic small and obvious. That matters because dataset prep gets harder to trust when the shaping logic is hidden inside a larger training script.

Validation happens before writing files

validate_openai_example() checks for the messages key, makes sure user and assistant turns exist, and flags empty content. That is the difference between catching bad records now and discovering them after you have already paid for training.
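A minimal sketch of those three checks follows. The return convention (a list of error strings, empty when the record is fine) and the message wording are assumptions; the lab's script may report problems differently.

```python
def validate_openai_example(example):
    """Return a list of problems; an empty list means the record looks usable."""
    messages = example.get("messages") if isinstance(example, dict) else None
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' list"]

    errors = []

    # Both a user turn and an assistant turn must be present.
    roles = [m.get("role") for m in messages if isinstance(m, dict)]
    for required in ("user", "assistant"):
        if required not in roles:
            errors.append(f"no {required} turn")

    # Every message needs non-blank string content.
    for i, m in enumerate(messages):
        content = m.get("content") if isinstance(m, dict) else None
        if not isinstance(content, str) or not content.strip():
            errors.append(f"message {i} has empty content")

    return errors


good = {"messages": [
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "Paris."},
]}
bad = {"messages": [
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "   "},
]}

print(validate_openai_example(good))
print(validate_openai_example(bad))
```

Returning every problem at once, rather than stopping at the first, makes it easier to fix a bad record in one pass.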

Token counts are approximate, but still useful

The script uses simple word counts as a rough token estimate. That is not production-grade tokenization, and real tokenizers usually produce more tokens than words for English text, but it is enough to spot obvious outliers before you move into real cost estimates and context-window math.
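The word-count estimate can be sketched in a few lines. The helper name approx_token_count and the two sample records are hypothetical; only the idea of counting whitespace-separated words comes from the lab.

```python
def approx_token_count(example):
    """Rough size estimate: whitespace-separated words across all messages."""
    return sum(len(m["content"].split()) for m in example["messages"])


examples = [
    {"messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"},
        {"role": "assistant", "content": "Paris."},
    ]},
    {"messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Name one prime number."},
        {"role": "assistant", "content": "Seven is a prime number."},
    ]},
]

counts = [approx_token_count(e) for e in examples]
avg_tokens = sum(counts) / len(counts)
max_tokens = max(counts)
print(f"Avg tokens (approx): {avg_tokens:.0f}")
print(f"Max tokens (approx): {max_tokens}")
```

A record whose count is far above the average is worth opening by hand before it reaches a training job.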

Train/validation split is part of the pipeline

A lot of first fine-tuning attempts skip the validation split and then wonder why there is no clean way to tell whether behavior improved. This lab bakes that split in from the start.
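One way to sketch a shuffled holdout split is shown below. The function name train_val_split and the seed parameter are assumptions; val_ratio is the setting the lab refers to, with 0.2 assumed as its default because the sample run splits 10 examples into 8 and 2.

```python
import random


def train_val_split(examples, val_ratio=0.2, seed=None):
    """Shuffle a copy of the examples and carve off a validation slice."""
    shuffled = list(examples)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)

    n_val = int(round(len(shuffled) * val_ratio))
    # Keep at least one validation example when there is enough data to split.
    if len(shuffled) > 1:
        n_val = min(max(n_val, 1), len(shuffled) - 1)

    return shuffled[n_val:], shuffled[:n_val]


data = [{"id": i} for i in range(10)]
train, val = train_val_split(data, val_ratio=0.2, seed=0)
print(f"Split: {len(train)} train / {len(val)} validation")
```

Passing a fixed seed makes the split reproducible, which helps when you want to compare two runs on exactly the same holdout set.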

Expected output

What a successful prep run looks like.

With the included sample data, the script prepares 10 examples, validates them, creates an 8/2 train/validation split, and writes three files into 18-finetune-prep/output/.

Prepared 10 examples

Validation:
  Valid: 10/10
  Avg tokens (approx): 18
  Max tokens (approx): 21

Split: 8 train / 2 validation

Output:
  18-finetune-prep/output/train.jsonl (8 examples)
  18-finetune-prep/output/val.jsonl (2 examples)
  ...

Also wrote Alpaca format: 18-finetune-prep/output/alpaca_format.jsonl

The sample training example printed at the end may vary because the train split is shuffled before writing.
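To confirm that a written file really is valid JSONL, it helps to read it back line by line. The sketch below uses hypothetical write_jsonl() and read_jsonl() helpers and a temporary directory, so it does not touch the lab's output folder.

```python
import json
import os
import tempfile


def write_jsonl(path, records):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Parse each non-blank line as its own JSON object."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


records = [
    {"messages": [
        {"role": "user", "content": "What is the capital of France?"},
        {"role": "assistant", "content": "Paris."},
    ]},
    {"messages": [
        {"role": "user", "content": "Name one prime number."},
        {"role": "assistant", "content": "Seven."},
    ]},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "train.jsonl")
    write_jsonl(path, records)
    loaded = read_jsonl(path)

print(f"Round-tripped {len(loaded)} examples")
```

If any line fails to parse, json.loads raises an error at that line, which is a cheap early warning before uploading the file anywhere.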

Try this

Three things to try before moving on.

  1. Add five new Q/A pairs to RAW_EXAMPLES. Re-run the script and inspect how the train/validation counts and rough token stats change.
  2. Intentionally add an example with empty content. Re-run the script and confirm that the validation pass catches it before you trust the dataset.
  3. Change val_ratio to 0.3. Re-run and observe how the split changes. That is a simple way to make the idea of holdout data feel concrete.

Concepts behind this

Read Fine-tuning for the broader picture on when this technique makes sense and when it is the wrong tool.

Then read Eval and observability if you want the next layer: how to measure whether your fine-tuned model actually got better.