Prepare and validate an instruction fine-tuning dataset: format conversion, validation, train/val split, and JSONL output. No GPU required.
What you'll build
A small dataset prep script that does the boring part correctly.
Training is the flashy part, but this is where most of the real leverage sits. In this lab you take a handful of raw question-and-answer pairs, convert them into fine-tuning formats, validate them, split them into train and validation sets, and write them out as JSONL.
That sounds modest, and it is. It is also the part that determines whether the later training run is learning from clean examples or from quietly broken data.
Run it
cd ai_ecosystem_labs
python3 18-finetune-prep/finetune_prep.py
Starting here? Quick setup
git clone https://github.com/BanditF/ai_ecosystem_labs
cd ai_ecosystem_labs
python3 18-finetune-prep/finetune_prep.py
Requires Python 3. No GPU and no extra packages required.
Time guide. Setup: ~2 min. Working through it: 25–45 min because dataset cleanup and validation are a little more involved than a toy one-file demo.
Why this piece exists
Fine-tuning fails quietly when the dataset shape is sloppy.
A lot of fine-tuning frustration is not really about the optimizer, rank setting, or GPU. It starts earlier. Fields are missing, message order is inconsistent, assistant responses are empty, or the train and validation split was never created cleanly enough to compare outcomes.
This lab keeps the scope narrow on purpose. You are not training a model here. You are learning the pipeline shape that has to exist before training is even worth paying for.
Walk-through
Two common fine-tuning formats
OpenAI JSONL
The script converts each raw Q/A pair into the chat-style JSONL shape used by OpenAI fine-tuning. Each line is one JSON object with a messages array containing system, user, and assistant turns.
The script also writes a second view of the same raw examples in Alpaca format. That format is common across open-source training stacks and keeps the record flatter: instruction, input, and output.
{"instruction": "What is the capital of France?", "input": "", "output": "Paris."}
The code
finetune_prep.py
Walk through it
Four things worth noticing.
Format conversion is explicit
to_openai_format() and to_alpaca_format() keep the conversion logic small and obvious. That matters because dataset prep gets harder to trust when the shaping logic is hidden inside a larger training script.
Validation happens before writing files
validate_openai_example() checks for the messages key, makes sure user and assistant turns exist, and flags empty content. That is the difference between catching bad records now and discovering them after you have already paid for training.
Token counts are approximate, but still useful
The script uses simple word counts as a rough token estimate. That is not production-grade tokenization, but it is enough to spot obvious outliers before you move into real cost estimates and context-window math.
Train/validation split is part of the pipeline
A lot of first fine-tuning attempts skip the validation split and then wonder why there is no clean way to tell whether behavior improved. This lab bakes that split in from the start.
Expected output
What a successful prep run looks like.
With the included sample data, the script prepares 10 examples, validates them, creates an 8/2 train/validation split, and writes three files into 18-finetune-prep/output/.