Deep dive
RAG is the move from “hope the model knows” to “hand the model the right material at answer time.” Once you see the pipeline, a lot of AI product behavior stops feeling mystical.
What RAG is
Retrieval-Augmented Generation means the model does not rely only on its training-time knowledge. At inference time, you retrieve relevant documents and inject them into the prompt so the model answers using that retrieved content.
This matters for three reasons. Models have a knowledge cutoff, they can hallucinate facts, and they cannot know your private data unless you give it to them. RAG mitigates all three by pulling in the right documents at answer time, which also makes source citation possible.
Plain English: bring the notes into the room before asking the question. Technical version: retrieve relevant context at query time and condition generation on that context.
The basic pattern
One stage prepares the corpus. The other stage answers the question.
Index time: chunk documents, embed chunks, then store those vectors in a vector store.
Query time: embed the user query, run similarity search, retrieve top-k chunks, inject them into the prompt, then generate.
Each of those steps sounds simple in outline, but each one has design choices that change the quality of the final answer. Most bad RAG systems are not failing because “RAG does not work.” They are failing because one of those choices was weak.
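The two stages can be sketched in a few lines of plain Python. This is a minimal illustration, not a production design: `embed` is a toy stand-in (lowercase word counts) for a real embedding model, the "vector store" is just a list, and every function name here is hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_index(documents: dict[str, str], chunk_size: int = 200) -> list[dict]:
    """Index time: chunk each document, embed each chunk, store the result."""
    index = []
    for source, text in documents.items():
        for start in range(0, len(text), chunk_size):
            chunk = text[start:start + chunk_size]
            index.append({"source": source, "text": chunk, "vector": embed(chunk)})
    return index


def retrieve(index: list[dict], query: str, k: int = 3) -> list[dict]:
    """Query time: embed the query, score every chunk, keep the top k."""
    query_vector = embed(query)
    ranked = sorted(index, key=lambda entry: cosine(query_vector, entry["vector"]), reverse=True)
    return ranked[:k]


def build_prompt(query: str, chunks: list[dict]) -> str:
    """Inject the retrieved chunks into the prompt ahead of the question."""
    context = "\n\n".join(f"[{c['source']}] {c['text']}" for c in chunks)
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"


# Made-up example corpus.
documents = {
    "refunds.txt": "Refunds are issued within 14 days of a return request.",
    "shipping.txt": "Standard shipping takes 3 to 5 business days.",
}
index = build_index(documents)
top = retrieve(index, "How long do refunds take?", k=1)
prompt = build_prompt("How long do refunds take?", top)
# The prompt would then be sent to whatever generation model you use.
```

Each function maps to one step in the outline, so swapping in a real embedding model or vector store changes a single function without changing the overall shape.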
Chunking strategies
Too small and you lose the surrounding meaning. Too large and you drag unrelated text into retrieval.
Fixed-size
Plain English: Cut every document into equally sized slices.
Technical view: Split every N characters or tokens. It is simple and predictable, but it ignores semantic boundaries.
Sentence or paragraph
Plain English: Split where the writing naturally pauses.
Technical view: Better for prose because chunks stay closer to complete thoughts.
Recursive
Plain English: Try big natural boundaries first, then fall back to smaller ones.
Technical view: Paragraphs first, then sentences, then characters if needed. This is often the best general-purpose default.
Semantic
Plain English: Split where the meaning changes.
Technical view: More accurate, but more complex. Usually you need an embedding model or another scoring method to detect the boundaries.
Overlap
Plain English: Let chunks share a little text so important context does not fall through the cracks.
Technical view: A sliding overlap of roughly 10–15% is a common starting point. It costs more storage, but it helps with boundary problems.
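Two of these strategies are easy to sketch in plain Python: the fixed-size split with a sliding overlap, and the paragraph-then-sentence-then-character fallback. The function names and the separator list are assumptions chosen for illustration. Sizes are counted in characters rather than tokens to keep the example dependency-free.

```python
def fixed_size_chunks(text: str, size: int, overlap: int = 0) -> list[str]:
    """Split every `size` characters; consecutive chunks share `overlap` characters."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    if not text:
        return []
    step = size - overlap
    # Stop early enough that the last chunk is not just a repeat of the overlap.
    last_start = max(len(text) - overlap, 1)
    return [text[start:start + size] for start in range(0, last_start, step)]


def recursive_chunks(
    text: str,
    max_size: int,
    separators: tuple[str, ...] = ("\n\n", ". ", " "),
) -> list[str]:
    """Try the coarsest separator first; recurse with finer ones for oversized pieces."""
    if len(text) <= max_size:
        return [text] if text.strip() else []
    if not separators:
        # No natural boundary left, so fall back to a fixed-size cut.
        return fixed_size_chunks(text, max_size)
    head, *rest = separators
    chunks: list[str] = []
    current = ""
    for piece in text.split(head):
        candidate = f"{current}{head}{piece}" if current else piece
        if len(candidate) <= max_size:
            current = candidate  # keep merging small pieces up to the limit
            continue
        if current:
            chunks.append(current)
        if len(piece) <= max_size:
            current = piece
        else:
            chunks.extend(recursive_chunks(piece, max_size, tuple(rest)))
            current = ""
    if current:
        chunks.append(current)
    return chunks


sample = "First paragraph here.\n\nSecond one. It has two sentences."
pieces = recursive_chunks(sample, max_size=25)
```

With 200-character chunks, an overlap of 25 characters (12.5%) falls inside the 10–15% range mentioned above. Note that this sketch drops the separator wherever it cuts, so a sentence that ends a chunk loses its trailing period; a more careful splitter would keep it.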
Embedding models
Similar text should land near similar text, which is what makes vector retrieval possible.
An embedding model converts text into a dense vector. Once you have those vectors, you can compare them with cosine similarity or a similar metric to find semantically related chunks.
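As a minimal sketch of that comparison, here is cosine similarity over plain Python lists. The three-dimensional vectors are made up for illustration; real embeddings have hundreds of dimensions or more and would come from an embedding model.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """cos(a, b) = (a . b) / (|a| * |b|); 1.0 means the vectors point the same way."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical vectors standing in for embedded chunks.
chunk_vectors = {
    "refund policy": [0.9, 0.1, 0.0],
    "shipping times": [0.1, 0.9, 0.1],
    "return window": [0.8, 0.2, 0.1],
}
query_vector = [1.0, 0.0, 0.0]

# Rank chunk names from most to least similar to the query.
ranked = sorted(
    chunk_vectors,
    key=lambda name: cosine_similarity(query_vector, chunk_vectors[name]),
    reverse=True,
)
```

Cosine similarity ignores vector length and compares direction only, which is why it is a common default for comparing embeddings.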
text-embedding-3-small
Plain English: Cheap and solid.
Technical view: OpenAI's text-embedding-3-small is a common managed default, especially in OpenAI-centric stacks, when you want decent quality without spending much. It is a useful shorthand example, not the only serious hosted option anymore.
nomic-embed-text
Plain English: Strong open-source option.
Technical view: A practical local-first choice when you want good retrieval quality without relying on a hosted embedding API.
all-MiniLM-L6-v2
Plain English: Fast and lightweight.
Technical view: Small enough for quick prototypes and experiments, even if you may later graduate to a stronger model.
Dimension count matters because it sets a tradeoff. More dimensions usually give the model more room to separate similar texts, but every stored vector then costs more memory and every comparison costs more compute.
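The memory side of that tradeoff is simple arithmetic. The sketch below assumes float32 storage (4 bytes per value) and ignores any index overhead. The dimension counts are the commonly documented defaults for each model and are stated here as assumptions; check the model card for the version you actually use.

```python
def index_size_mib(num_chunks: int, dimensions: int, bytes_per_value: int = 4) -> float:
    """Raw vector storage in MiB, assuming float32 values and no index overhead."""
    return num_chunks * dimensions * bytes_per_value / (1024 ** 2)


# Assumed default output dimensions for each model.
models = [
    ("all-MiniLM-L6-v2", 384),
    ("nomic-embed-text", 768),
    ("text-embedding-3-small", 1536),
]
for name, dims in models:
    size = index_size_mib(1_000_000, dims)
    print(f"{name}: about {size:,.0f} MiB per million chunks")
```

Doubling the dimension count doubles raw storage, and a brute-force similarity search does proportionally more arithmetic per comparison.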
Retrieval types
Dense
Plain English: Match by meaning.
Technical view: Embedding similarity search is good at semantic matches even when the wording differs.
Sparse
Plain English: Match by terms.
Technical view: Keyword-based scoring such as BM25 is especially good when exact language matters, for example product codes, error strings, or names.
Hybrid
Plain English: Use both kinds of evidence.
Technical view: Combine dense and sparse retrieval, often with a re-ranking pass afterward. This is frequently the best quality setup, at the cost of more system complexity.
Re-ranking
Plain English: Do a second, more careful sort.
Technical view: A cross-encoder or similar model re-scores the first batch of candidates. It is slower and more expensive, but it often improves precision a lot.
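One common way to combine dense and sparse results is reciprocal rank fusion (RRF). The text above does not prescribe a fusion method, so treat this as one reasonable option rather than the method. The chunk IDs are hypothetical, and `k = 60` is a conventional default, not a tuned value.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked ID lists: each list contributes 1 / (k + rank) per document."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


dense_ranking = ["c2", "c7", "c1"]   # hypothetical output of embedding search
sparse_ranking = ["c7", "c4", "c2"]  # hypothetical output of keyword search
fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
# A re-ranking model, if used, would then re-score only the top of `fused`.
```

RRF uses ranks rather than raw scores, so it avoids the problem that embedding similarities and keyword scores live on different scales. Chunks that both retrievers rank highly rise to the top.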
The context injection problem
The model still has to pay attention to the retrieved material once it is inside the prompt.
Models tend to pay more attention to the beginning and end of long context windows than to the middle. Put your most relevant chunks near the start rather than burying them mid-prompt.
You can only inject so much text before cost, latency, and relevance start to fall apart. Top-3 or top-5 chunks is a common starting point.
Always keep source metadata with the chunk. That gives the model something to cite and gives the user a path back to the original document.
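These three guidelines can be folded into one prompt-building step. The sketch below is illustrative only: `RetrievedChunk`, `build_context_prompt`, and the instruction wording are assumptions, not a fixed API.

```python
from dataclasses import dataclass


@dataclass
class RetrievedChunk:
    text: str
    source: str   # e.g. file name or URL of the original document
    score: float  # retrieval score, higher means more relevant


def build_context_prompt(question: str, chunks: list[RetrievedChunk], top_k: int = 3) -> str:
    """Keep the top_k chunks, most relevant first, each labelled with its source."""
    best = sorted(chunks, key=lambda c: c.score, reverse=True)[:top_k]
    blocks = [
        f"[{i}] (source: {c.source})\n{c.text}"
        for i, c in enumerate(best, start=1)
    ]
    context = "\n\n".join(blocks)
    return (
        "Answer the question using only the numbered sources below. "
        "Cite sources by number.\n\n"
        f"{context}\n\nQuestion: {question}"
    )


# Made-up retrieval results.
chunks = [
    RetrievedChunk("low relevance text", "a.md", 0.2),
    RetrievedChunk("high relevance text", "b.md", 0.9),
    RetrievedChunk("medium relevance text", "c.md", 0.5),
]
prompt = build_context_prompt("What is the refund window?", chunks, top_k=2)
```

Numbering each chunk and carrying its source gives the model a stable handle to cite, and lets the application map a cited number back to the original document.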
When RAG beats fine-tuning
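RAG is the better fit when the gap is knowledge: facts that are private, that changed after the training cutoff, or that need a citation. Fine-tuning is the better fit when the job is not mostly about missing knowledge. If you need the model to behave differently, follow a very specific style, or stay consistent across a huge number of calls, that is closer to a fine-tuning problem than a retrieval problem.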
Link to Lab 15
If you want to feel the mechanics instead of just reading about them, this lab walks through chunking, indexing, retrieval, and prompt augmentation in plain Python.
The lab uses simple term-frequency retrieval instead of a real embedding server so you can see the pipeline shape clearly before adding heavier infrastructure.