Building RAG Systems That Actually Work in Production
Most RAG demos look great. You embed a handful of documents, run a similarity search, feed the results to an LLM, and get a coherent answer. It takes an afternoon to build. Then you try to run it on real data, at real scale, with real users — and everything gets harder.
The gap between demo and production
A demo RAG system works because the documents are clean, the queries are well-formed, and you're the only user. Production breaks all three assumptions at once.
Messy documents. Real knowledge bases contain PDFs with broken formatting, tables that don't parse cleanly, headers that look like body text. Naive chunking destroys the semantic structure you're trying to preserve.
Unpredictable queries. Users don't ask clean questions. They make typos. They ask in the wrong language. They refer to things by nicknames or abbreviations that don't appear in the documents. They ask questions whose answers span multiple documents.
Scale and cost. Embedding thousands of documents is cheap once. Re-embedding when they change is an engineering problem. Running inference on every query adds up fast at volume.
What we've learned
Chunking strategy matters more than model choice
Before you pick your embedding model, design your chunking strategy. Sentence-level chunks preserve meaning better than fixed-length token windows. Paragraph-level chunks give more context per retrieval. For structured documents, section-aware chunking (respecting headers and list boundaries) dramatically improves retrieval quality.
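As a rough illustration of section-aware chunking, here is a minimal sketch. It assumes the documents have already been converted to Markdown-style text with "#" headers, which is our assumption for the example and not a requirement of the approach. The function name chunk_by_section is hypothetical.

```python
import re

# Assumption: headers look like "# Title" or "## Subtitle" (Markdown style).
HEADER = re.compile(r"^(#{1,6})\s+(.*)$")

def chunk_by_section(text: str) -> list[dict]:
    """Split text into one chunk per section, keeping each header as metadata."""
    sections = []
    title, lines = "", []

    def flush():
        # Only emit a chunk if the section has some non-blank body text.
        if any(line.strip() for line in lines):
            sections.append({"title": title, "text": "\n".join(lines).strip()})

    for line in text.splitlines():
        match = HEADER.match(line)
        if match:
            flush()  # close the previous section before starting a new one
            title, lines = match.group(2).strip(), []
        else:
            lines.append(line)  # list items stay with their section
    flush()
    return sections
```

Because a chunk never crosses a header, a list stays attached to the section that introduces it. Very long sections would still need a secondary split, which this sketch leaves out.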
We often use a hybrid: small chunks for retrieval, larger surrounding context sent to the LLM for generation.
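One way to sketch that hybrid: index sentences for retrieval, but remember which paragraph each sentence came from so the paragraph can be sent to the LLM. The names below (Chunk, build_chunks, expand_to_parents) are hypothetical, and the regex sentence splitter is deliberately naive.

```python
import re
from dataclasses import dataclass

@dataclass(frozen=True)
class Chunk:
    text: str       # small unit that gets embedded and searched
    parent_id: int  # index of the paragraph it came from

def build_chunks(document: str) -> tuple[list[str], list[Chunk]]:
    """Return (paragraphs, sentence-level chunks pointing back at them)."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", document) if p.strip()]
    chunks = []
    for pid, para in enumerate(paragraphs):
        # Naive split: break after ., ! or ? followed by whitespace.
        for sentence in re.split(r"(?<=[.!?])\s+", para):
            if sentence:
                chunks.append(Chunk(sentence, pid))
    return paragraphs, chunks

def expand_to_parents(hits: list[Chunk], paragraphs: list[str]) -> list[str]:
    """Swap matched sentences for their parent paragraphs, without duplicates."""
    seen = set()
    context = []
    for hit in hits:  # hits are assumed to arrive in ranked order
        if hit.parent_id not in seen:
            seen.add(hit.parent_id)
            context.append(paragraphs[hit.parent_id])
    return context
```

The de-duplication matters in practice: several top-ranked sentences often come from the same paragraph, and sending it twice wastes context window.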
Retrieval is a ranking problem
Don't treat RAG as a simple nearest-neighbor lookup. Layer multiple retrieval signals:
- Dense retrieval (embeddings) for semantic similarity
- Sparse retrieval (BM25) for keyword matching
- Metadata filtering for document type, date, access level
In our experience, reciprocal rank fusion across these signals consistently outperforms any single approach.
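Reciprocal rank fusion is short enough to sketch in full. Each retriever contributes 1 / (k + rank) for every document it returns, and the contributions are summed. The sketch assumes each retriever returns an ordered list of document IDs and that metadata filtering has already been applied. The constant k = 60 is a commonly used default, not a value taken from any particular system.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs into one ranking."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Higher-ranked documents (smaller rank) contribute more.
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

dense_hits = ["a", "b", "c"]   # e.g. from embedding search
sparse_hits = ["b", "d", "a"]  # e.g. from BM25
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
```

Because only ranks are used, the raw scores from the dense and sparse retrievers never need to be normalized against each other.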
Observability from day one
The most important thing we've added to every RAG system we've shipped: logging every query, every retrieved chunk, and every generated answer. Without this, you're debugging blind. With it, you can identify failure modes — missing documents, bad chunks, hallucinations — and fix them systematically.
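A minimal sketch of that kind of trace logging, writing one JSON object per line. The record fields and the function name log_rag_trace are assumptions for illustration. A real deployment would likely send these records to whatever logging or tracing backend is already in place.

```python
import json
import time
import uuid
from typing import TextIO

def log_rag_trace(sink: TextIO, query: str, chunks: list[dict], answer: str) -> str:
    """Append one JSON line describing a full query/retrieval/answer cycle."""
    record = {
        "trace_id": uuid.uuid4().hex,  # lets you join this line with user feedback later
        "timestamp": time.time(),
        "query": query,
        # Assumed chunk shape: {"id": ..., "score": ..., "text": ...}
        "chunks": chunks,
        "answer": answer,
    }
    sink.write(json.dumps(record, ensure_ascii=False) + "\n")
    return record["trace_id"]
```

Any writable text stream works as the sink, for example a file opened in append mode. Storing the retrieved chunk IDs alongside the answer is what makes it possible to tell a retrieval failure from a generation failure after the fact.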
When to re-rank
Add a re-ranking step between retrieval and generation when precision matters more than latency. Running a cross-encoder re-ranker over the top 10 retrieved chunks before passing them to the LLM meaningfully improves answer quality. The cost is 100-200 ms of extra latency. For most enterprise knowledge base use cases, that's worth it.
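The re-ranking step itself is small. The sketch below takes the scorer as a parameter: in a real pipeline it would wrap a cross-encoder that scores (query, passage) pairs, but here a toy word-overlap scorer stands in so the example runs without a model. Both rerank and toy_overlap_scorer are hypothetical names.

```python
from typing import Callable, Sequence

Pair = tuple[str, str]

def rerank(
    query: str,
    candidates: Sequence[str],
    score_pairs: Callable[[list[Pair]], Sequence[float]],
    top_n: int = 3,
) -> list[str]:
    """Re-order retrieved chunks by a pairwise relevance score and keep the best."""
    pairs = [(query, chunk) for chunk in candidates]
    scores = score_pairs(pairs)  # one batched call, as a cross-encoder would expect
    ranked = sorted(zip(candidates, scores), key=lambda item: item[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_n]]

def toy_overlap_scorer(pairs: list[Pair]) -> list[float]:
    """Stand-in for a cross-encoder: counts words shared by query and passage."""
    return [
        float(len(set(q.lower().split()) & set(p.lower().split())))
        for q, p in pairs
    ]

retrieved = ["billing faq", "how to reset your password", "password policy"]
best = rerank("reset password", retrieved, toy_overlap_scorer, top_n=2)
```

Scoring all pairs in one batched call is where most of the added latency goes, so the size of the candidate list is the main knob for trading precision against speed.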
The boring parts that matter
The infrastructure around RAG matters as much as the retrieval pipeline itself:
- Incremental indexing — update only changed documents, don't re-embed everything on every change
- Chunk-level caching — cache embeddings for unchanged content
- Graceful degradation — if retrieval returns nothing, say so rather than hallucinating
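Two of these patterns fit in a short sketch: deciding what to re-embed by comparing content hashes, and declining to answer when retrieval comes back empty. The function names, the in-memory hash store, and the fallback message are assumptions for illustration.

```python
import hashlib
from typing import Callable

NO_CONTEXT_MESSAGE = "I couldn't find anything relevant in the knowledge base."

def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def plan_reindex(
    documents: dict[str, str], indexed_hashes: dict[str, str]
) -> tuple[list[str], list[str]]:
    """Compare current documents with stored hashes.

    Returns (IDs to embed because they are new or changed,
             IDs to delete because the document no longer exists).
    """
    to_embed = [
        doc_id for doc_id, text in documents.items()
        if indexed_hashes.get(doc_id) != content_hash(text)
    ]
    to_delete = [doc_id for doc_id in indexed_hashes if doc_id not in documents]
    return to_embed, to_delete

def answer_or_decline(chunks: list[str], generate: Callable[[list[str]], str]) -> str:
    """Only call the LLM when retrieval actually returned something."""
    if not chunks:
        return NO_CONTEXT_MESSAGE
    return generate(chunks)
```

The same hashing idea applies at chunk level: keying the embedding cache on the hash of the chunk text means an edit to one paragraph does not invalidate the rest of the document.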
Boring Code ships RAG systems across finance, legal, and technical domains. The models change. The infrastructure patterns don't.