RAG — AI Tooling Field Guide

What RAG is

Retrieve first, answer second.

Retrieval-Augmented Generation means the model does not rely only on its training-time knowledge. At inference time, you retrieve relevant documents and inject them into the prompt so the model answers using that retrieved content.

This matters for three reasons. Models have a knowledge cutoff, they can hallucinate facts, and they cannot know your private data unless you supply it. RAG addresses the first and third directly by pulling in the right documents at answer time. It reduces the second, without eliminating it, by grounding answers in retrieved text, and it makes source citation possible.

The mental model

Plain English: bring the notes into the room before asking the question. Technical version: retrieve relevant context at query time and condition generation on that context.

The basic pattern

RAG is a two-stage system.

One stage prepares the corpus. The other stage answers the question.

Index time: chunk documents, embed chunks, then store those vectors in a vector store.

Query time: embed the user query, run similarity search, retrieve top-k chunks, inject them into the prompt, then generate.

Each of those steps sounds simple in outline, but each one has design choices that change the quality of the final answer. Most bad RAG systems are not failing because “RAG does not work.” They are failing because one of those choices was weak.
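
A minimal sketch of both stages in plain Python, to make the shape concrete. Everything here is an illustrative assumption: `embed` is a toy word-count stand-in for a real embedding model, the "vector store" is an in-memory list, and the function names are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words term count."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(chunks: list[str]) -> list[tuple[str, Counter]]:
    """Index time: embed every chunk and store (chunk, vector) pairs."""
    return [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(index: list[tuple[str, Counter]], query: str, k: int = 3) -> list[str]:
    """Query time: embed the query, rank chunks by similarity, keep top-k."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def build_prompt(query: str, context: list[str]) -> str:
    """Inject the retrieved chunks into the prompt ahead of the question."""
    joined = "\n\n".join(context)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{joined}\n\nQuestion: {query}"
    )


question = "How long is the refund window?"
index = build_index([
    "The refund window is 30 days from delivery.",
    "Support is available on weekdays from 9 to 5.",
    "Shipping to Canada takes 5 to 7 business days.",
])
top = retrieve(index, question, k=1)
prompt = build_prompt(question, top)  # this string is what you send to the model
```

Swapping `embed` for a real model and the list for a vector database changes the components, not the shape of the pipeline.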

Chunking strategies

How you split documents matters more than people expect.

Too small and you lose the surrounding meaning. Too large and you drag unrelated text into retrieval.

Fixed-size

Plain English: Cut every document into equally sized slices.

Technical view: Split every N characters or tokens. It is simple and predictable, but it ignores semantic boundaries.
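
A short sketch of character-based fixed-size splitting. The function name is hypothetical, and real pipelines often count tokens rather than characters.

```python
def chunk_fixed(text: str, size: int) -> list[str]:
    """Split text into consecutive slices of at most `size` characters."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [text[i:i + size] for i in range(0, len(text), size)]


chunks = chunk_fixed("Retrieval first, generation second.", size=10)
# ['Retrieval ', 'first, gen', 'eration se', 'cond.']
```

Notice that the word "generation" is cut in half. That is the cost of ignoring semantic boundaries.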

Sentence or paragraph

Plain English: Split where the writing naturally pauses.

Technical view: Split on sentence or paragraph boundaries. This works better for prose because chunks stay closer to complete thoughts.
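
One way to sketch this: split on sentence-ending punctuation, then pack whole sentences into chunks up to a size limit. The regex is a rough assumption (it will mis-split abbreviations like "e.g."), and the function name is hypothetical.

```python
import re


def chunk_sentences(text: str, max_chars: int = 200) -> list[str]:
    """Group whole sentences into chunks of at most max_chars where possible."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow: close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


sentence_chunks = chunk_sentences("One. Two two. Three three three.", max_chars=13)
# ['One. Two two.', 'Three three three.']
```

A single sentence longer than `max_chars` is kept whole here rather than cut, which is a deliberate simplification.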

Recursive

Plain English: Try big natural boundaries first, then fall back to smaller ones.

Technical view: Paragraphs first, then sentences, then characters if needed. This is often the best general-purpose default.
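
A simplified sketch of the recursive idea. It is not any particular library's splitter: the separator list, the greedy merging, and the function name are assumptions, and separators are dropped at chunk boundaries to keep the code short.

```python
def chunk_recursive(text: str, max_chars: int,
                    separators: tuple[str, ...] = ("\n\n", ". ", " ")) -> list[str]:
    """Split on the coarsest separator first; recurse with finer ones as needed."""
    if len(text) <= max_chars:
        return [text] if text.strip() else []
    if not separators:
        # No natural boundary left: fall back to a hard character split.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for part in text.split(sep):
        candidate = current + sep + part if current else part
        if len(candidate) <= max_chars:
            current = candidate  # still fits: keep merging neighbours
            continue
        if current:
            chunks.append(current)
        if len(part) <= max_chars:
            current = part
        else:
            # This part alone is too big: retry it with the next finer separator.
            chunks.extend(chunk_recursive(part, max_chars, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks


sample = "Alpha beta. Gamma delta epsilon zeta.\n\nShort."
recursive_chunks = chunk_recursive(sample, max_chars=20)
# ['Alpha beta', 'Gamma delta epsilon', 'zeta.', 'Short.']
```

The paragraph break is honored first, the long paragraph is then split by sentence, and only the sentence that is still too long gets split by word.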

Semantic

Plain English: Split where the meaning changes.

Technical view: Chunk boundaries track topic shifts more closely, but the approach is more complex. Usually you need an embedding model or another scoring method to detect the boundaries.
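
A rough sketch of boundary detection: start a new chunk whenever two adjacent sentences look dissimilar. The similarity function here is a toy word-overlap score standing in for embedding similarity, and the threshold and names are assumptions you would tune.

```python
import re
from typing import Callable


def word_overlap(a: str, b: str) -> float:
    """Toy similarity (Jaccard on words), standing in for embedding similarity."""
    wa = set(re.findall(r"\w+", a.lower()))
    wb = set(re.findall(r"\w+", b.lower()))
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0


def chunk_semantic(sentences: list[str],
                   similarity: Callable[[str, str], float] = word_overlap,
                   threshold: float = 0.1) -> list[str]:
    """Start a new chunk when adjacent sentences fall below the threshold."""
    groups: list[list[str]] = []
    for sentence in sentences:
        if groups and similarity(groups[-1][-1], sentence) >= threshold:
            groups[-1].append(sentence)
        else:
            groups.append([sentence])
    return [" ".join(group) for group in groups]


semantic_chunks = chunk_semantic([
    "Cats sleep most of the day.",
    "Cats also sleep at night.",
    "Interest rates rose again.",
])
# ['Cats sleep most of the day. Cats also sleep at night.', 'Interest rates rose again.']
```

With a real embedding model you would pass a function that embeds both sentences and returns their cosine similarity.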

Overlap

Plain English: Let chunks share a little text so important context does not fall through the cracks.

Technical view: A sliding overlap of roughly 10–15% of the chunk size is a common starting point. It costs extra storage and embedding compute, but it helps when a key sentence straddles a chunk boundary.
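
A sketch of a sliding window over characters, assuming a hypothetical helper name. Each chunk repeats the last `overlap` characters of the one before it.

```python
def chunk_with_overlap(text: str, size: int, overlap: int) -> list[str]:
    """Fixed-size chunks where neighbours share `overlap` characters."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be at least 0 and smaller than size")
    if not text:
        return []
    step = size - overlap
    # Stop once the remaining text is already covered by the previous chunk.
    last_start = max(len(text) - overlap, 1)
    return [text[i:i + size] for i in range(0, last_start, step)]


overlapped = chunk_with_overlap("abcdefghij", size=4, overlap=1)
# ['abcd', 'defg', 'ghij']
```

For a 1,000-character chunk, the 10–15% guideline would mean an overlap of about 100 to 150 characters.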

Embedding models

Embeddings turn text into coordinates.

Similar text should land near similar text, which is what makes vector retrieval possible.

An embedding model converts text into a dense vector. Once you have those vectors, you can compare them with cosine similarity or a similar metric to find semantically related chunks.
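
Cosine similarity itself is only a few lines. The three-dimensional vectors below are hand-written placeholders, not output from any real model.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Dot product divided by the product of the vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Assumed toy vectors standing in for real embeddings.
query_vec = [1.0, 0.0, 1.0]
same_direction = [2.0, 0.0, 2.0]   # points the same way, different length
orthogonal = [0.0, 1.0, 0.0]       # shares nothing with the query

close = cosine_similarity(query_vec, same_direction)   # approximately 1.0
far = cosine_similarity(query_vec, orthogonal)         # 0.0
```

The score depends on direction, not length, which is why a long chunk and a short query can still match well.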

`text-embedding-3-small`

Plain English: Cheap and solid.

Technical view: OpenAI's text-embedding-3-small is a common managed default, especially in OpenAI-centric stacks, when you want decent quality at low cost. Treat it as a representative example. Other hosted embedding APIs are serious options too.

`nomic-embed-text`

Plain English: Strong open-source option.

Technical view: A practical local-first choice when you want good retrieval quality without relying on a hosted embedding API.

`all-MiniLM-L6-v2`

Plain English: Fast and lightweight.

Technical view: Small enough for quick prototypes and experiments, even if you may later graduate to a stronger model.

Dimension count matters because it sets a tradeoff. More dimensions usually mean finer discrimination between texts, but also more memory per vector and more compute per comparison.
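
The memory side of that tradeoff is simple arithmetic. This sketch assumes float32 storage and the default output sizes commonly documented for these models (384, 768, and 1536), so check each model's documentation before relying on the numbers. It ignores index overhead and metadata.

```python
def index_size_bytes(num_vectors: int, dims: int, bytes_per_value: int = 4) -> int:
    """Raw vector storage only: vectors x dimensions x bytes per value."""
    return num_vectors * dims * bytes_per_value


# Assumed default dimensions; some models let you request shorter vectors.
models = [
    ("all-MiniLM-L6-v2", 384),
    ("nomic-embed-text", 768),
    ("text-embedding-3-small", 1536),
]
for name, dims in models:
    gb = index_size_bytes(1_000_000, dims) / 1e9
    print(f"{name}: {dims} dims, about {gb:.2f} GB per million chunks")
```

Under these assumptions, a million chunks cost roughly 1.54 GB, 3.07 GB, and 6.14 GB of raw vector storage respectively.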

Retrieval types

There is more than one way to decide what is relevant.

Dense retrieval

Plain English: Match by meaning.

Technical view: Embedding similarity search is good at semantic matches even when the wording differs.

Sparse retrieval (BM25)

Plain English: Match by terms.

Technical view: Keyword-based scoring is especially good when exact language matters, such as product codes, error messages, or proper names.
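
A compact sketch of BM25 scoring, to show what "match by terms" means in practice. It uses the common defaults `k1=1.5` and `b=0.75` and a non-negative IDF variant. The tokenizer and function names are assumptions, and a production system would use a search engine or library instead.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score every document against the query with Okapi BM25."""
    if not docs:
        return []
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    query_terms = set(tokenize(query))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Rare terms get a higher weight than common ones.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term frequency saturates, and long documents are penalized.
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


docs = [
    "Error E4012: disk quota exceeded",
    "How to increase your disk quota",
    "General troubleshooting guide",
]
scores = bm25_scores("E4012", docs)  # only the first document scores above zero
```

An exact identifier like "E4012" is the kind of query where term matching tends to beat semantic similarity.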

Hybrid retrieval

Plain English: Use both kinds of evidence.

Technical view: Combine dense and sparse retrieval, often with a re-ranking pass afterward. This is frequently the best-quality setup, at the cost of more system complexity.
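
The text does not say how to combine the two result lists. One widely used option is reciprocal rank fusion (RRF), sketched here as an assumption rather than as the only approach. It needs only the rank of each document in each list, so the dense and sparse scores never have to share a scale. The document IDs are made up.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists: each appearance contributes 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


dense_hits = ["doc_a", "doc_b", "doc_c"]    # hypothetical dense results, best first
sparse_hits = ["doc_c", "doc_a", "doc_d"]   # hypothetical BM25 results, best first
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
# ['doc_a', 'doc_c', 'doc_b', 'doc_d']
```

Documents that both retrievers agree on rise to the top, and `k=60` is the conventional default that damps the influence of any single first-place rank.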

Re-ranking

Plain English: Do a second, more careful sort.

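Technical view: A cross-encoder or similar model re-scores the first batch of candidates by reading the query and each candidate together. It is slower and more expensive per candidate, so it runs only on the short list, but it often improves precision substantially.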
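
The shape of a re-ranking pass, sketched with a pluggable scorer. `toy_score` is a hypothetical word-overlap function used only so the example runs. In a real system you would pass a function that calls a cross-encoder on each (query, passage) pair.

```python
from typing import Callable


def rerank(query: str, candidates: list[str],
           score: Callable[[str, str], float], top_n: int = 3) -> list[str]:
    """Re-score first-stage candidates and keep the best top_n."""
    ordered = sorted(candidates, key=lambda c: score(query, c), reverse=True)
    return ordered[:top_n]


def toy_score(query: str, passage: str) -> float:
    """Hypothetical scorer: fraction of query words present in the passage."""
    q = set(query.lower().split())
    p = set(passage.lower().split())
    return len(q & p) / len(q) if q else 0.0


first_stage = [
    "Billing and invoices overview",
    "How to reset your password",
    "Password policy and rotation rules",
]
best = rerank("reset password", first_stage, score=toy_score, top_n=2)
# ['How to reset your password', 'Password policy and rotation rules']
```

Because the scorer runs once per candidate, the first stage should hand it tens of candidates, not thousands.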

The context injection problem

Retrieval is only half the job.

The model still has to pay attention to the retrieved material once it is inside the prompt.

Lost in the middle

Models tend to pay more attention to the beginning and end of long context windows than to the middle. Put your most relevant chunks near the start, or at the very end, rather than burying them mid-prompt.
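
One way to act on this, assuming your chunks arrive ranked best-first: alternate them between the front and the back of the context so the weakest material ends up in the middle. The function name is hypothetical, and whether the reordering helps depends on the model, so measure it.

```python
def order_for_context(ranked_chunks: list[str]) -> list[str]:
    """Place the strongest chunks at the edges and the weakest in the middle."""
    front, back = [], []
    for position, chunk in enumerate(ranked_chunks):
        # Even positions (1st, 3rd, ...) go to the front, odd ones to the back.
        (front if position % 2 == 0 else back).append(chunk)
    return front + back[::-1]


ordered = order_for_context(["c1", "c2", "c3", "c4", "c5"])  # c1 is the best match
# ['c1', 'c3', 'c5', 'c4', 'c2']
```

The best chunk leads, the second best closes, and the lowest-ranked chunk sits in the middle.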

Context window limits

You can only inject so much text before cost and latency climb and relevance degrades. Top-3 or top-5 chunks is a common starting point.
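
A sketch of enforcing both a chunk cap and a size budget. Counting whitespace-separated words is a crude assumption standing in for a real tokenizer, and the names and default limits are illustrative.

```python
from typing import Callable


def select_within_budget(ranked_chunks: list[str], max_chunks: int = 5,
                         max_tokens: int = 1000,
                         count_tokens: Callable[[str], int] = lambda s: len(s.split()),
                         ) -> list[str]:
    """Take chunks best-first until the chunk cap or token budget is hit."""
    selected, used = [], 0
    for chunk in ranked_chunks[:max_chunks]:
        cost = count_tokens(chunk)
        if used + cost > max_tokens:
            break  # the next chunk would overflow the budget
        selected.append(chunk)
        used += cost
    return selected


picked = select_within_budget(["a b c", "d e", "f g h i"], max_tokens=5)
# ['a b c', 'd e']
```

Stopping at the first chunk that does not fit keeps the selection a clean prefix of the ranking, at the cost of sometimes leaving budget unused.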

Source attribution

Always keep source metadata with the chunk. That gives the model something to cite and gives the user a path back to the original document.
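
A small sketch of carrying metadata alongside each chunk and surfacing it in the prompt. The `Chunk` fields, the bracketed-number format, and the file names are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    source: str    # hypothetical: file path or URL of the original document
    locator: str   # hypothetical: page number or section heading


def format_context(chunks: list[Chunk]) -> str:
    """Number each chunk and label it so the model can cite it as [n]."""
    blocks = [
        f"[{i}] ({chunk.source}, {chunk.locator})\n{chunk.text}"
        for i, chunk in enumerate(chunks, start=1)
    ]
    return "\n\n".join(blocks)


context = format_context([
    Chunk("Refunds are accepted within 30 days.", "handbook.pdf", "page 12"),
    Chunk("Refunds are issued to the original payment method.", "faq.md", "Refunds"),
])
```

Asking the model to cite the bracketed numbers lets you map each citation back to a source and locator after generation.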

When RAG beats fine-tuning

Use retrieval when the problem is fresh knowledge, not new behavior.

RAG usually wins when…

knowledge changes frequently
knowledge is private or proprietary
you need source attribution
you do not have the budget or infrastructure for fine-tuning

Fine-tuning usually wins when…

The job is mainly about behavior rather than missing knowledge. If you need the model to behave differently, follow a very specific style, or stay consistent across a huge number of calls, that is closer to a fine-tuning problem than a retrieval problem.

Link to Lab 15

Build the smallest useful version yourself.

Lab 15: RAG Pipeline

If you want to feel the mechanics instead of just reading about them, this lab walks through chunking, indexing, retrieval, and prompt augmentation in plain Python.

Why this version is intentionally small

The lab uses simple term-frequency retrieval instead of a real embedding server so you can see the pipeline shape clearly before adding heavier infrastructure.