Fine-tuning is how you take a pre-trained model and push it toward a narrower behavior: a style, a format, a workflow, or a domain-specific response pattern.
Last reviewed: spring 2026
It changes how the model behaves; it does not give the model new abilities.
Fine-tuning matters because prompting eventually hits a ceiling. If you need a model to answer in the same shape every time, stick to a house style, or act like a specialized assistant for one narrow task, changing the training data can be more reliable than stacking more instructions into the prompt.
Plain English first: you show the model lots of examples of the behavior you want. Technical version: you continue training from pre-trained weights on a smaller supervised dataset so the resulting model shifts toward those examples.
Useful default
Try prompt engineering first. If the model mostly knows what to do but will not do it consistently enough, that is when fine-tuning starts to make sense.
Section 1
What fine-tuning is
Fine-tuning continues training a pre-trained model on a smaller, task-specific dataset. The model's weights update to reflect the new examples. The result is a model that behaves differently from the base: it adopts new style, tone, format, or domain habits from the training data.
This is not magic. The model does not suddenly gain new reasoning ability. It learns patterns from your examples. Garbage data in, garbage model out.
Section 2
Three types
Full fine-tuning
Plain English: Retrain the whole model, not just a small add-on.
Technical view: Update all model weights. It is the most flexible option, but also the most expensive and the easiest to get wrong without strong data hygiene and serious GPU compute.
LoRA
Plain English: Keep most of the model frozen and learn a much smaller patch.
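Technical view: Low-Rank Adaptation freezes the base weights and adds small trainable adapter matrices. That cuts the number of trainable parameters, and with it the gradient and optimizer memory, by orders of magnitude versus full fine-tuning. The frozen base weights still have to be loaded and run in every forward pass, so total compute drops by less than the parameter count suggests. The memory savings are why LoRA became the default move for most practical work.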
QLoRA
Plain English: Do LoRA on a compressed model so normal hardware can participate.
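Technical view: QLoRA combines LoRA with a quantized base model: the frozen weights are stored in 4-bit precision while the small adapter matrices train in higher precision. That is the main reason fine-tuning became accessible to smaller teams: you can adapt a much larger model with far less memory than full-weight training would require.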
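To make the size of that "patch" concrete, here is a minimal sketch of the parameter arithmetic. It assumes a single dense weight matrix of shape d_out x d_in whose update is factored into two low-rank matrices, B (d_out x r) and A (r x d_in). The function names and the 4096 x 4096 layer size are illustrative assumptions, not figures from any specific model.

```python
def full_params(d_out: int, d_in: int) -> int:
    """Number of weights in one dense layer (bias ignored)."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable weights when the update is factored as B (d_out x r) @ A (r x d_in)."""
    return rank * (d_out + d_in)


if __name__ == "__main__":
    d = 4096  # hypothetical hidden size, chosen only for illustration
    for r in (8, 16, 64):
        full = full_params(d, d)
        lora = lora_params(d, d, r)
        print(f"rank={r}: {lora:,} trainable vs {full:,} full ({full / lora:.0f}x fewer)")
```

This counts trainable parameters only. The frozen base weights still occupy memory and still run in the forward pass, so the end-to-end savings are smaller than the ratio printed here.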
Section 3
When fine-tuning is the right choice
Good fit
You need consistent output style or format across thousands of calls, where prompt engineering gets flaky at scale.
The task requires behavior the model does not naturally show well yet. Not new facts: behavior.
You want a smaller, faster, cheaper model that performs like a larger one on one narrow task.
Probably not the right move
You just need the model to know new facts. That is usually a retrieval problem, not a fine-tuning problem.
You have not tried prompt engineering first.
Your dataset is tiny and you cannot grow it. Below roughly 100 examples, generalization is unlikely unless the task is very narrow and every example is high quality.
You do not have eval data, which means you will not know whether the run actually improved anything.
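One way to turn "consistent output format" into something you can measure before and after a run is a pass-rate check over a batch of model outputs. The sketch below assumes the target format is a JSON object with a fixed set of keys; the function name and the `label` and `confidence` keys are hypothetical and stand in for whatever your own format requires.

```python
import json


def format_pass_rate(outputs: list[str], required_keys: set[str]) -> float:
    """Fraction of outputs that parse as a JSON object containing every required key."""
    if not outputs:
        raise ValueError("need at least one output to score")
    passed = 0
    for text in outputs:
        try:
            parsed = json.loads(text)
        except json.JSONDecodeError:
            continue  # free-form text or truncated JSON counts as a failure
        if isinstance(parsed, dict) and required_keys.issubset(parsed):
            passed += 1
    return passed / len(outputs)


if __name__ == "__main__":
    # Hypothetical outputs collected from the same prompt template.
    sample = [
        '{"label": "spam", "confidence": 0.9}',
        "Sure! Here is the answer: spam",
        '{"label": "ham"}',
    ]
    print(format_pass_rate(sample, {"label", "confidence"}))
```

Running the same check on the prompted base model and on the fine-tuned model gives a simple baseline comparison for the "good fit" case above.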
Section 4
Dataset requirements
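Instruction fine-tuning is the common starting point. The usual shape is a conversation record with system, user, and assistant turns, written as JSONL so each line is one training example. The record below is pretty-printed across several lines for readability; in the actual file it must sit on a single line.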
{"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is the capital of France?"},
{"role": "assistant", "content": "Paris."}
]}
Quality beats quantity here. One hundred excellent examples will often beat ten thousand noisy ones. Diversity matters too: cover the kinds of inputs the model will actually see in the real task.
A reasonable starting point for a narrow task is 50-200 high-quality examples. If results are poor, scale the dataset and improve the coverage before you assume the method itself failed.
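A small validator catches most structural problems before they reach a training run. The following is a minimal sketch for the record shape shown above. The function name, the set of allowed roles, and the rule that the final turn must come from the assistant are assumptions for illustration; check them against the format your trainer or hosted API actually expects.

```python
import json

ALLOWED_ROLES = {"system", "user", "assistant"}  # assumed role set


def validate_record(line: str) -> list[str]:
    """Return a list of problems found in one JSONL line (empty list means OK)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return ["record is not a JSON object"]
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["'messages' must be a non-empty list"]

    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"message {i} is not an object")
            continue
        role = msg.get("role")
        if not isinstance(role, str) or role not in ALLOWED_ROLES:
            problems.append(f"message {i} has unexpected role {role!r}")
        content = msg.get("content")
        if not isinstance(content, str) or not content.strip():
            problems.append(f"message {i} has empty or non-string content")

    # Assumption: the model learns from the final assistant turn.
    last = messages[-1]
    if isinstance(last, dict) and last.get("role") != "assistant":
        problems.append("last message should be the assistant turn to learn from")
    return problems
```

Calling `validate_record` on every line of the file and refusing to train while any line returns problems keeps formatting bugs out of the run.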
Section 5
The training pipeline
Prepare the dataset as validated JSONL.
Choose a base model. Instruct-tuned starting points are usually better than raw base models.
Configure LoRA or QLoRA hyperparameters such as rank, alpha, learning rate, and epochs.
Train. Depending on data size and hardware, that may take minutes or hours.
Evaluate against your eval suite. If you do not have one yet, the eval and observability layer is the right place to start.
Merge the adapter weights or serve the base model with the adapter attached.
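The hyperparameters named in the pipeline above can be kept in one place and sanity-checked before a run. The sketch below uses a hypothetical `LoraRunConfig` dataclass, not part of any library, and its default values are placeholders rather than recommendations. The `alpha / rank` scaling follows the standard LoRA formulation; the step count assumes no gradient accumulation.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class LoraRunConfig:
    """Hypothetical container for the hyperparameters named in the pipeline."""

    rank: int = 16                # placeholder value, not a recommendation
    alpha: int = 32               # placeholder value, not a recommendation
    learning_rate: float = 2e-4   # placeholder value, not a recommendation
    epochs: int = 3
    batch_size: int = 8

    def __post_init__(self) -> None:
        if self.rank <= 0 or self.alpha <= 0:
            raise ValueError("rank and alpha must be positive")
        if self.learning_rate <= 0:
            raise ValueError("learning_rate must be positive")
        if self.epochs <= 0 or self.batch_size <= 0:
            raise ValueError("epochs and batch_size must be positive")

    @property
    def scaling(self) -> float:
        """Factor applied to the adapter update in standard LoRA (alpha / rank)."""
        return self.alpha / self.rank

    def optimizer_steps(self, num_examples: int) -> int:
        """Total update steps, assuming no gradient accumulation."""
        return math.ceil(num_examples / self.batch_size) * self.epochs


if __name__ == "__main__":
    cfg = LoraRunConfig()
    print(cfg.scaling, cfg.optimizer_steps(200))
```

Knowing the step count up front helps when estimating run time, and it makes clear how few updates a dataset of a couple hundred examples actually produces.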
Section 6
Tools
Open-source stack
Hugging Face transformers + peft — the standard Python stack.
Unsloth — faster LoRA and QLoRA training on consumer GPUs.
Axolotl — config-driven fine-tuning pipeline with less boilerplate.
Hosted path
OpenAI fine-tuning API — managed fine-tuning for selected OpenAI models, with no GPU setup. Model availability changes fast, so check the current docs.
Together AI and Replicate — hosted APIs for training and serving adapters without managing your own cluster.
Section 7
Compute reality
These are rough ballpark numbers, not guarantees. Exact hardware needs vary a lot by model family, sequence length, batch size, and training setup, so verify current GPU pricing and availability before planning a run.
QLoRA on a 7B model: roughly one 24GB consumer GPU, with runs often measured in hours.
QLoRA on a 70B model: roughly one 48GB to 80GB-class accelerator if the rest of the setup is sensible, with runs often measured in hours to days.
LoRA on a 70B model: usually a multi-GPU or very large VRAM setup, because 16-bit training needs meaningfully more memory than QLoRA.
Full fine-tuning a 7B model: usually several datacenter GPUs, which gets expensive quickly.
No GPU available: use a hosted path like OpenAI fine-tuning or Together AI.
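The tiers above follow from simple arithmetic on parameter count and bytes per parameter. The sketch below is a back-of-the-envelope estimate only. It ignores activations, adapter weights, and quantization overhead, and the 16 bytes per parameter for full fine-tuning is an assumption (16-bit weights and gradients plus a 32-bit master copy and two Adam moment buffers). The function names are hypothetical.

```python
GB = 1e9  # decimal gigabytes, close enough for a ballpark


def weight_memory_gb(params_billion: float, bits_per_param: float) -> float:
    """Memory for the model weights alone, ignoring everything else."""
    return params_billion * 1e9 * bits_per_param / 8 / GB


def full_finetune_memory_gb(params_billion: float, bytes_per_param: float = 16.0) -> float:
    """Weights + gradients + optimizer state under the assumed bytes per parameter."""
    return params_billion * 1e9 * bytes_per_param / GB


if __name__ == "__main__":
    print(f"7B weights at 4-bit:   {weight_memory_gb(7, 4):.1f} GB")
    print(f"70B weights at 4-bit:  {weight_memory_gb(70, 4):.1f} GB")
    print(f"70B weights at 16-bit: {weight_memory_gb(70, 16):.1f} GB")
    print(f"7B full fine-tuning:   {full_finetune_memory_gb(7):.1f} GB")
```

Under these assumptions, 4-bit weights come to about 3.5 GB for a 7B model and about 35 GB for a 70B model, 16-bit weights for a 70B model come to about 140 GB, and full fine-tuning a 7B model needs on the order of 112 GB before activations. That lines up with the single-GPU, large-accelerator, and multi-GPU tiers listed above.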
Next move
Start with the dataset pipeline before you touch training.
Build the boring but essential part first: format conversion, validation, train/validation split, and JSONL output.
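A minimal sketch of those pieces, using only the Python standard library, might look like the following. It assumes the raw data is a list of (prompt, answer) pairs and that the target is the chat-style record from the dataset section. All function names, the default system prompt, and the 10% validation fraction are illustrative choices.

```python
import json
import random
from pathlib import Path


def to_record(prompt: str, answer: str, system: str = "You are a helpful assistant.") -> dict:
    """Convert one (prompt, answer) pair into a chat-style training record."""
    return {"messages": [
        {"role": "system", "content": system},
        {"role": "user", "content": prompt},
        {"role": "assistant", "content": answer},
    ]}


def split_records(records: list[dict], val_fraction: float = 0.1, seed: int = 0):
    """Shuffle deterministically, then hold out a validation slice."""
    if not 0 < val_fraction < 1:
        raise ValueError("val_fraction must be between 0 and 1")
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, round(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]


def write_jsonl(records: list[dict], path: Path) -> None:
    """Write one JSON object per line."""
    with path.open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


if __name__ == "__main__":
    # Hypothetical raw data; replace with your own source.
    pairs = [(f"question {i}", f"answer {i}") for i in range(20)]
    train, val = split_records([to_record(p, a) for p, a in pairs])
    write_jsonl(train, Path("train.jsonl"))
    write_jsonl(val, Path("val.jsonl"))
```

Fixing the shuffle seed keeps the split reproducible, so a later run is evaluated on the same held-out examples as an earlier one.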
Why this order helps
If the dataset is messy, the training run mostly teaches you that messy data creates messy behavior. Lab 18 keeps the lesson focused on the part you can control first.