Apr 5, 2026

Building RAG Systems That Actually Work in Production

Boring Code · 6 min read

Most RAG demos look great. You embed a handful of documents, run a similarity search, feed the results to an LLM, and get a coherent answer. It takes an afternoon to build. Then you try to run it on real data, at real scale, with real users — and everything gets harder.

The gap between demo and production

A demo RAG system works because the documents are clean, the queries are well-formed, and you're the only user. Production breaks all three assumptions at once.

Messy documents. Real knowledge bases contain PDFs with broken formatting, tables that don't parse cleanly, and headers that look like body text. Naive chunking cuts straight through these boundaries and destroys the semantic structure you're trying to preserve.

Unpredictable queries. Users don't ask clean questions. They make typos. They ask in a different language from the one the documents are written in. They refer to things by nicknames or abbreviations that don't appear in the documents. They ask questions whose answers span multiple documents.

Scale and cost. Embedding thousands of documents once is cheap. Keeping those embeddings current as the documents change is an engineering problem. And running inference on every query adds up fast at volume.

What we've learned

Chunking strategy matters more than model choice

Before you pick your embedding model, design your chunking strategy. Sentence-level chunks preserve meaning better than fixed-length token windows, which can cut a sentence in half. Paragraph-level chunks give more context per retrieval, at the cost of less precise matches. For structured documents, section-aware chunking (respecting headers and list boundaries) dramatically improves retrieval quality.

We often use a hybrid: small chunks for retrieval, larger surrounding context sent to the LLM for generation.
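
The hybrid can be sketched in a few lines. This is a minimal illustration, not our production code: it assumes plain-text input with blank lines between paragraphs, uses a naive regex sentence splitter, and the names `Chunk`, `build_chunks`, and `expand_to_parents` are made up for the example.

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str       # small unit that gets embedded and searched (one sentence here)
    parent_id: int  # index of the paragraph the sentence came from


def split_paragraphs(document: str) -> list[str]:
    # Assumption: one or more blank lines mark a paragraph boundary.
    return [p.strip() for p in re.split(r"\n\s*\n", document) if p.strip()]


def split_sentences(paragraph: str) -> list[str]:
    # Naive splitter: break on whitespace that follows '.', '!' or '?'.
    # Real documents usually need a proper sentence tokenizer.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", paragraph) if s.strip()]


def build_chunks(document: str) -> tuple[list[str], list[Chunk]]:
    # Returns the parent paragraphs and the small chunks that point back to them.
    parents = split_paragraphs(document)
    chunks = [
        Chunk(text=sentence, parent_id=i)
        for i, paragraph in enumerate(parents)
        for sentence in split_sentences(paragraph)
    ]
    return parents, chunks


def expand_to_parents(hits: list[Chunk], parents: list[str]) -> list[str]:
    # Swap each retrieved sentence for its full paragraph, dropping duplicates
    # while keeping the retrieval order.
    seen: set[int] = set()
    context: list[str] = []
    for hit in hits:
        if hit.parent_id not in seen:
            seen.add(hit.parent_id)
            context.append(parents[hit.parent_id])
    return context
```

You embed and search over `chunk.text`, then pass the output of `expand_to_parents` to the LLM. Two hits from the same paragraph yield that paragraph only once, which keeps the prompt from filling up with repeated context.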

Retrieval is a ranking problem

Don't treat RAG as a simple nearest-neighbor lookup. Layer multiple retrieval signals:

  • Dense retrieval (embeddings) for semantic similarity
  • Sparse retrieval (BM25) for keyword matching
  • Metadata filtering for document type, date, access level

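Reciprocal rank fusion (RRF) merges the ranked lists using only each document's position in each list, so scores from different retrievers never need to be calibrated against each other. Fusing these signals this way consistently outperforms any single approach.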
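
For reference, reciprocal rank fusion fits in a few lines. The sketch below assumes each retriever returns a list of document IDs ordered best first, and that any metadata filter has already been applied to those lists. The constant `k = 60` is a commonly used default rather than a tuned value, and the function name is our own.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    # Each document earns 1 / (k + rank) from every list it appears in,
    # with rank starting at 1. Documents missing from a list earn nothing from it.
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID so the output is stable.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


dense_hits = ["a", "b", "c"]   # hypothetical embedding-search results
sparse_hits = ["b", "d", "a"]  # hypothetical BM25 results
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
```

In this toy example `b` ends up first because it ranks near the top of both lists, while `c` and `d`, which each appear in only one list, fall to the bottom.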

Observability from day one

The most important thing we've added to every RAG system we've shipped: logging every query, every retrieved chunk, and every generated answer. Without this, you're debugging blind. With it, you can identify failure modes — missing documents, bad chunks, hallucinations — and fix them systematically.
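
A trace record does not need to be elaborate. Here is one possible shape, using only the standard library. The field names and the `chunk_id` / `score` keys are assumptions made for the example, not a fixed schema.

```python
import json
import logging
import time
import uuid

# Attach whatever handler your log pipeline expects; none is configured here.
logger = logging.getLogger("rag.trace")


def build_trace(query: str, chunks: list[dict], answer: str) -> dict:
    # One record per request, tying the query to what was retrieved and generated.
    return {
        "trace_id": uuid.uuid4().hex,
        "timestamp": time.time(),
        "query": query,
        # Chunk IDs and scores keep the record small; log the chunk text too
        # if your storage and privacy constraints allow it.
        "retrieved": [
            {"chunk_id": c["chunk_id"], "score": c["score"]} for c in chunks
        ],
        "answer": answer,
    }


def log_trace(trace: dict) -> None:
    # One JSON object per line is easy to grep and to load into a notebook later.
    logger.info(json.dumps(trace))
```

With records like this, a missing document shows up as an empty or low-scoring `retrieved` list, and a hallucination shows up as an answer that none of the logged chunks support.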

When to re-rank

Add a re-ranking step between retrieval and generation when precision matters more than latency. Running a cross-encoder re-ranker over the top 10 retrieved chunks before passing them to the LLM meaningfully improves answer quality. The cost is 100 to 200 ms of extra latency. For most enterprise knowledge base use cases, that's worth it.
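
The re-ranking step itself is mostly plumbing around a scoring function. In the sketch below, `score_fn` is where a cross-encoder would go. The `token_overlap` scorer is only a stand-in so the example runs without a model, and the `top_n` and `keep` parameters are illustrative defaults.

```python
from typing import Callable


def rerank(
    query: str,
    candidates: list[str],
    score_fn: Callable[[str, str], float],
    top_n: int = 10,
    keep: int = 3,
) -> list[str]:
    # Only the first top_n candidates are scored, which bounds the added latency.
    pool = candidates[:top_n]
    scored = [(score_fn(query, text), i, text) for i, text in enumerate(pool)]
    # Highest score first; ties keep the original retrieval order.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [text for _, _, text in scored[:keep]]


def token_overlap(query: str, passage: str) -> float:
    # Stand-in scorer, NOT a cross-encoder: the fraction of query tokens
    # that also appear in the passage.
    q = set(query.lower().split())
    p = set(passage.lower().split())
    return len(q & p) / len(q) if q else 0.0


candidates = [
    "the office is closed on friday",
    "refund policy for annual plans",
    "annual plans renew automatically",
]
best = rerank("refund policy annual plans", candidates, token_overlap, keep=2)
```

Keeping the scorer behind a plain callable also makes it easy to measure the latency cost: time `rerank` with and without the real model on logged queries before deciding whether the trade is worth it.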

The boring parts that matter

The infrastructure around RAG matters as much as the retrieval pipeline itself:

  • Incremental indexing: update only the documents that changed instead of re-embedding everything on every change
  • Chunk-level caching: cache embeddings for unchanged content
  • Graceful degradation: if retrieval returns nothing, say so rather than hallucinating
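
All three patterns can be sketched with content hashes and a guard clause. The code below is a simplified in-memory illustration: `embed_fn` and `generate_fn` are placeholders for your embedding and generation calls, the index is a plain dict rather than a vector store, and every name is hypothetical.

```python
import hashlib
from typing import Callable

Embedding = list[float]


def content_key(text: str) -> str:
    # Identical text always maps to the same key, whichever document it sits in.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


class EmbeddingCache:
    """Chunk-level cache: each distinct chunk text is embedded at most once."""

    def __init__(self, embed_fn: Callable[[str], Embedding]) -> None:
        self._embed_fn = embed_fn
        self._store: dict[str, Embedding] = {}

    def get(self, text: str) -> Embedding:
        key = content_key(text)
        if key not in self._store:
            self._store[key] = self._embed_fn(text)
        return self._store[key]


def reindex(
    chunks: dict[str, str],
    index: dict[str, tuple[str, Embedding]],
    cache: EmbeddingCache,
) -> list[str]:
    """Incremental indexing: update `index` in place, return the IDs touched."""
    touched: list[str] = []
    for chunk_id, text in chunks.items():
        key = content_key(text)
        existing = index.get(chunk_id)
        # Embed only chunks that are new or whose content hash changed.
        if existing is None or existing[0] != key:
            index[chunk_id] = (key, cache.get(text))
            touched.append(chunk_id)
    # Drop chunks that no longer exist in the source.
    for chunk_id in list(index):
        if chunk_id not in chunks:
            del index[chunk_id]
            touched.append(chunk_id)
    return touched


NO_CONTEXT_MESSAGE = "I couldn't find anything relevant in the knowledge base."


def answer(
    query: str,
    retrieved: list[str],
    generate_fn: Callable[[str, list[str]], str],
) -> str:
    # Graceful degradation: with no retrieved context, skip the LLM call entirely.
    if not retrieved:
        return NO_CONTEXT_MESSAGE
    return generate_fn(query, retrieved)
```

Keying the cache on content rather than on chunk ID means a paragraph that moves between documents, or that reappears after a revert, is not embedded again.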

Boring Code ships RAG systems across finance, legal, and technical domains. The models change. The infrastructure patterns don't.