Infrastructure layer

The Stack Below the Stack

Most AI tooling guides start at the assistant layer. This page goes one layer lower: the engines, formats, proxies, eval platforms, and retrieval systems that the higher-level tools sit on top of.

Last reviewed: spring 2026

The part you use all the time without usually seeing.

A lot of what feels like one AI product is really a stack of quieter components underneath it. Copilot, Claude, Cursor, Ollama, and similar tools are all sitting on top of inference engines, model formats, serving layers, structured output helpers, observability systems, and retrieval plumbing.

You do not need to become an infrastructure engineer to use the tools above this layer. But understanding this layer helps you make better decisions: what is local versus remote, what scales well, what is mature versus still growing, and where the real tradeoffs actually live.

Plain English first: this is the machinery. Technical version: model execution, API mediation, artifact formats, constrained generation, tracing and evals, and vector retrieval.

A useful framing

When a higher-level tool feels magical, the next question is usually not magic at all. It is something like: what is serving the model, what format is it in, where are calls being routed, and how is output being validated or measured?

Section 1

Inference engines

This is the software that actually runs model weights on hardware.

llama.cpp

Plain English: The local model engine hiding underneath a lot of consumer-friendly apps.

Technical view: A C++ transformer inference runtime used by Ollama, LM Studio, LocalAI, and many other local runners. It runs well on CPUs and Apple Silicon, and GGUF is its native model format. If you have run a local model, there is a good chance you used llama.cpp indirectly.

URL: github.com/ggerganov/llama.cpp

Maturity: foundational

vLLM

Plain English: The fast production server people reach for when they need one GPU box to handle a lot more traffic.

Technical view: A Python GPU inference server built around PagedAttention, which makes concurrent serving much more efficient than naive request handling. In practice, vLLM is now the mainstream default for open-source production GPU inference, exposes an OpenAI-compatible API, and sets the baseline that fast-moving competitors like SGLang are compared against.

URL: github.com/vllm-project/vllm

Maturity: production

Text Generation Inference (TGI)

Plain English: Hugging Face's production serving layer for text models.

Technical view: TGI is Hugging Face's production inference server with continuous batching, quantization support, and token streaming. It still matters historically and inside existing Hugging Face-heavy deployments, but current momentum has shifted: Hugging Face now treats TGI as maintenance-mode infrastructure and points new serving work toward vLLM or SGLang.

URL: github.com/huggingface/text-generation-inference

Maturity: production

Section 2

API routing and proxies

Once you have more than one model provider, routing starts to matter almost immediately.

LiteLLM

Plain English: One front door for a lot of different model providers.

Technical view: LiteLLM is a unified API proxy that presents a long list of providers through the OpenAI API shape. In practice that means you can point an OpenAI-compatible client at LiteLLM and swap between Anthropic, Azure, Bedrock, local Ollama, and others without rewriting the client. It also adds logging, cost tracking, routing, and failover behavior.

URL: github.com/BerriAI/litellm

Maturity: widely adopted

Section 3

Model formats

Model files come in different formats depending on who made them and how they are optimized.

GGUF

Plain English: The all-in-one file format most local model runners expect.

Technical view: GGUF is the format used by llama.cpp and by most local desktop runners built on top of it. A GGUF file typically bundles weights, tokenizer data, and metadata together. When you download a model for Ollama or LM Studio, it is very often GGUF. Suffixes like Q4_K_M or Q8_0 indicate the quantization level.

safetensors

Plain English: The safer default file format for a lot of Hugging Face model distribution.

Technical view: safetensors is a safer alternative to pickle-based .pt files because loading it does not execute arbitrary code. It is becoming the default distribution format for many Hugging Face models.

Quantization

Plain English: Squeezing a model into less memory so it can run faster or on smaller hardware.

Technical view: Quantization reduces weight precision, for example from 32-bit floats down to 8-bit or 4-bit representations. Q8 is usually fairly close to full quality; Q4 is more compressed. The quality tradeoff is often smaller than people expect, especially for local experimentation.

Section 4

Structured output

Getting reliable structured data back from a language model is harder than it looks.

instructor

Plain English: Ask for a schema, then keep retrying until the answer actually fits it.

Technical view: instructor is a Python library that wraps LLM calls with Pydantic models. You define the shape you want, and instructor handles parsing, validation, and retries. It is one of the most practical and widely used ways to get dependable structured output in Python workflows.

URL: github.com/jxnl/instructor

Maturity: widely adopted

outlines

Plain English: Instead of hoping the model follows the format, limit what it is allowed to generate.

Technical view: outlines uses constrained generation at the sampling layer, with actual grammar or schema constraints on token generation rather than prompt-only instructions. It is more rigorous than instructor, but it depends on compatible backends and a bit more infrastructure awareness.

URL: github.com/dottxt-ai/outlines

Maturity: growing

Section 5

Eval and observability

You cannot improve what you cannot measure. This layer is about tracing LLM calls, scoring outputs, and catching regressions before they quietly become product behavior.

Langfuse

Plain English: Open-source tracing and evals for LLM apps, with your own data if you want it.

Technical view: Langfuse covers traces, prompt management, datasets, evals, and scoring. It is one of the main self-hosted alternatives to LangSmith and has strong adoption among teams that want full data ownership.

URL: langfuse.com

Maturity: production, widely adopted

LangSmith

Plain English: The tracing and eval product most closely tied to the LangChain world.

Technical view: LangSmith is LangChain's observability and evaluation platform. It is usually the best fit for teams already invested in LangChain or LangGraph, and it is widely referenced across LLM app discussions.

URL: smith.langchain.com

Maturity: production

promptfoo

Plain English: Unit tests for prompts.

Technical view: promptfoo is a CLI and CI-oriented tool for testing prompts and model behavior against expected outcomes. It is more prevention-focused than observability-focused, which makes it especially useful for catching regressions before a change ships.

URL: promptfoo.dev

Maturity: growing

Helicone

Plain English: Drop in a proxy, get logs and cost tracking with very little code change.

Technical view: Helicone sits between your client and providers like OpenAI or Anthropic. Because it works as a proxy, it can add request logging, cost tracking, caching, and observability with minimal application changes.

URL: helicone.ai

Maturity: production

Section 6

Vector and retrieval

This is the layer most people meet through RAG systems and semantic search.

ChromaDB

Plain English: The easy local starting point.

Technical view: ChromaDB is often the default choice for local prototyping because the Python API is simple and it can run in-process. It is not the strongest option at scale, but it is one of the lowest-friction ways to get retrieval working.

URL: trychroma.com

Maturity: widely used for prototyping

Qdrant

Plain English: A more serious vector database when you need performance and filtering, not just a demo.

Technical view: Qdrant is a production-grade vector database written in Rust, with fast similarity search, rich metadata filtering, and both cloud and self-hosted options. It is a common step up from Chroma when performance guarantees start to matter.

URL: qdrant.tech

Maturity: production

pgvector

Plain English: If you already run Postgres, this is often the boring correct answer.

Technical view: pgvector is a PostgreSQL extension for vector similarity search. It lets you keep embeddings and metadata in the same database you already operate, which is why it is often the right choice when a separate vector database would add more complexity than value.

URL: github.com/pgvector/pgvector

Maturity: widely adopted

FAISS

Plain English: The low-level search library a lot of higher-level tools are standing on.

Technical view: FAISS is a high-performance similarity search library from Facebook AI Research. It is more of a building block than a full product, so people often encounter it indirectly as a backend inside other retrieval systems.

URL: github.com/facebookresearch/faiss

Maturity: foundational