Lab 15
Build a minimal retrieval-augmented generation pipeline in plain Python: index a tiny corpus, retrieve the most relevant documents with TF-IDF, assemble an augmented prompt, and generate a grounded answer.
What you'll build
This lab keeps the moving parts small on purpose. There is no embedding server, no vector database, and no framework hiding the steps. You can see the pipeline shape directly: documents go in, an index gets built, a query retrieves the top matches, and those matches get stuffed into a prompt.
That makes it easier to understand what RAG really is before the infrastructure gets heavier. The retrieval here is simple, but the control flow is the same shape you would keep in a more serious system.
git clone https://github.com/BanditF/ai_ecosystem_labs
cd ai_ecosystem_labs
python3 15-rag-pipeline/rag_pipeline.py "Who created Python?"
No dependencies needed. Python 3 is enough.
Time guide. Setup: ~2 min. Working through it: 20–35 min because retrieval, prompt assembly, and grounding each add one more layer.
Walk through it
DOCUMENTS is the toy knowledge base. It mixes Python facts with a little noise from JavaScript and Rust so retrieval has something to rank instead of always returning the obvious answer.
tokenize() lowercases and splits text into word tokens. build_index() then computes term-frequency values for each document so later scoring can reward words that are both present and relatively distinctive.
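As a rough illustration of this step, here is a minimal sketch. The function names match the lab, but the bodies are assumptions: the tokenizer is assumed to keep runs of letters and digits, term frequency is assumed to be a term's count divided by the document's token count, and DOCS is a hypothetical two-document corpus rather than the lab's DOCUMENTS.

```python
import re
from collections import Counter

# Hypothetical corpus for illustration only (not the lab's DOCUMENTS).
DOCS = {
    "a": "Python is a high-level language.",
    "b": "pip installs Python packages. pip is a package manager.",
}

def tokenize(text):
    """Lowercase the text and return its alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Map each doc ID to {term: count / total tokens in that doc}."""
    index = {}
    for doc_id, text in docs.items():
        tokens = tokenize(text)
        counts = Counter(tokens)
        index[doc_id] = {term: n / len(tokens) for term, n in counts.items()}
    return index

index = build_index(DOCS)
print(tokenize("High-level, third-party!"))  # ['high', 'level', 'third', 'party']
print(round(index["b"]["pip"], 4))           # 0.2222 (2 of 9 tokens)
```

Dividing by document length keeps long documents from winning just because they contain more words, at the cost of favoring short documents that repeat a query term.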
idf(), score(), and retrieve() make up the retrieval layer. This is not semantic search. It is lexical scoring. That is useful here because you can see exactly why some documents rank above others.
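The retrieval layer might look something like the sketch below. The function names come from the lab, but the signatures, the smoothed IDF formula (the same smoothing scikit-learn's TfidfTransformer applies by default), the term_freqs helper, and the three-document corpus are all assumptions made for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical corpus for illustration only (not the lab's DOCUMENTS).
DOCS = {
    "d1": "Python was created by Guido van Rossum.",
    "d2": "pip is the package manager for Python.",
    "d3": "Rust focuses on memory safety.",
}

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def term_freqs(text):
    """Hypothetical helper: term -> count / document length."""
    tokens = tokenize(text)
    return {t: n / len(tokens) for t, n in Counter(tokens).items()}

INDEX = {doc_id: term_freqs(text) for doc_id, text in DOCS.items()}

def idf(term, index):
    """Smoothed inverse document frequency (assumed formula)."""
    df = sum(1 for tf in index.values() if term in tf)
    return math.log((1 + len(index)) / (1 + df)) + 1

def score(query, doc_id, index):
    """Sum TF * IDF over the query tokens; unmatched tokens add nothing."""
    tf = index[doc_id]
    return sum(tf.get(t, 0.0) * idf(t, index) for t in tokenize(query))

def retrieve(query, index, top_k=3):
    """Return up to top_k (doc_id, score) pairs, best first, skipping zero scores."""
    ranked = sorted(((score(query, d, index), d) for d in index), reverse=True)
    return [(d, round(s, 4)) for s, d in ranked[:top_k] if s > 0]

# d1 matches both "created" and "python"; d2 matches only "python".
print(retrieve("Who created Python?", INDEX, top_k=2))
```

Because every score is a sum of per-term products, you can always trace a ranking back to the specific words that produced it, which is the transparency this lab relies on.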
build_augmented_prompt() takes the top retrieved chunks and turns them into a prompt with explicit source IDs. That is the key RAG move: add external context at runtime instead of retraining the model.
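A minimal sketch of the prompt assembly, assuming the retrieved chunks arrive as (doc_id, text) pairs. The wording of the template follows the expected output of this lab; the signature and the blank-line layout are assumptions.

```python
def build_augmented_prompt(query, chunks):
    """Build a grounded prompt from (doc_id, text) pairs, best match first."""
    # Prefix each chunk with its source ID so the answer can cite it.
    context = "\n".join(f"[{doc_id}] {text}" for doc_id, text in chunks)
    return (
        "Answer the question using only the provided context. "
        "Cite the source ID.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\n"
        "Answer:"
    )

chunks = [
    ("doc1", "Python is a high-level programming language known for its clear "
             "syntax and readability. It was created by Guido van Rossum and "
             "first released in 1991."),
]
prompt = build_augmented_prompt("Who created Python?", chunks)
print(prompt)
```

Nothing about the model changes here. Swapping the corpus changes what the model can answer, because the knowledge travels inside the prompt.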
mock_generate() stands in for a real model call. It is intentionally simple, but it makes the pipeline runnable without keys or dependencies and keeps the focus on retrieval plus grounding.
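To make the idea of a stand-in generator concrete, here is one hypothetical way to write it. This is not the lab's implementation: it simply returns the context sentence whose words best match the question, giving more weight to words that appear in fewer sentences, and appends the source ID. The lab's own mock evidently words its answer differently.

```python
import re

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def mock_generate(query, chunks):
    """Hypothetical stand-in for a model call: pick the best-matching sentence."""
    # Split every chunk into sentences, remembering which doc each came from.
    sentences = [
        (doc_id, s)
        for doc_id, text in chunks
        for s in re.split(r"(?<=[.!?])\s+", text.strip())
        if s
    ]
    token_sets = [set(tokenize(s)) for _, s in sentences]

    def weight(term):
        # Terms found in fewer sentences are treated as more informative.
        hits = sum(1 for ts in token_sets if term in ts)
        return 1 / hits if hits else 0.0

    query_terms = set(tokenize(query))
    scores = [sum(weight(t) for t in query_terms & ts) for ts in token_sets]
    if not scores or max(scores) == 0:
        return "The provided context does not answer this question."
    doc_id, sentence = sentences[scores.index(max(scores))]
    return f"{sentence} [{doc_id}]"

chunks = [
    ("doc3", "The Python Package Index (PyPI) hosts thousands of third-party "
             "modules. pip is the standard package manager for Python."),
    ("doc1", "Python is a high-level programming language known for its clear "
             "syntax and readability. It was created by Guido van Rossum and "
             "first released in 1991."),
]
print(mock_generate("Who created Python?", chunks))
# It was created by Guido van Rossum and first released in 1991. [doc1]
```

Even a mock this crude shows the grounding contract: the answer text comes from the retrieved context, and the citation says which chunk supplied it.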
The code
Expected output
Query: Who created Python?
Indexed 6 documents
Top retrieved chunks:
[doc3] score=0.1407 The Python Package Index (PyPI) hosts thousands of third-party modules. pip is t...
[doc1] score=0.138 Python is a high-level programming language known for its clear syntax and reada...
[doc2] score=0.1114 Python supports multiple programming paradigms including procedural, object-orie...
Augmented prompt (529 chars):
────────────────────────────────────────
Answer the question using only the provided context. Cite the source ID.
Context:
[doc3] The Python Package Index (PyPI) hosts thousands of third-party modules. pip is the standard package manager for Python.
[doc1] Python is a high-level programming language known for its clear syntax and readability. It was created by Guido van Rossum and first released in 1991.
[doc2] Python supports multiple programming paradigms including procedural, object-oriented, and functional programming.
Question: Who created Python?
Answer:
────────────────────────────────────────
Generated answer:
Python was created by Guido van Rossum and first released in 1991. [doc1]
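A small but useful detail: lexical retrieval is not perfect. doc3 ranks slightly above doc1 even though doc1 contains the fact you actually need. doc3 is short and mentions Python twice, so its overlap with the query counts for more than doc1's matches on "Python" and "created", which are spread across a longer text.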
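One way to see why doc3 can edge out doc1 is to compare length-normalised term frequencies for the two documents shown in the output above. This sketch assumes a tokenizer that keeps runs of letters and digits and a term frequency of count divided by document length; the tf helper is hypothetical and the lab's script may compute these differently.

```python
import re

# Document texts copied from the expected output above.
DOC1 = ("Python is a high-level programming language known for its clear "
        "syntax and readability. It was created by Guido van Rossum and "
        "first released in 1991.")
DOC3 = ("The Python Package Index (PyPI) hosts thousands of third-party "
        "modules. pip is the standard package manager for Python.")

def tf(term, text):
    """Hypothetical helper: occurrences of term / number of tokens."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return tokens.count(term) / len(tokens)

print(round(tf("python", DOC3), 4))   # 0.1053 (2 of 19 tokens)
print(round(tf("python", DOC1), 4))   # 0.0385 (1 of 26 tokens)
print(round(tf("created", DOC1), 4))  # 0.0385 (1 of 26 tokens)
```

Under these assumptions, each match in doc3 carries nearly three times the weight of a match in doc1. Whether doc1's extra match on "created" makes up the difference depends on how heavily the IDF term rewards rarer words.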
Try this
Run python3 15-rag-pipeline/rag_pipeline.py "What is pip?" and compare which docs rise to the top.
Add a new document to DOCUMENTS. Re-run the script and see whether the new text gets retrieved for relevant questions.
Change top_k from 3 to 1. Observe how the prompt gets shorter and whether answer quality gets more brittle.