I have built three RAG systems that made it to production and maintained two others that someone else built. Here is what I learned: the demo is the easy part. Getting a prototype to answer questions from your documents takes a weekend. Getting it to answer correctly, consistently, at scale, without bankrupting your cloud budget -- that takes months of work that nobody warned you about.
Most RAG tutorials stop at the demo stage. They show you a happy path: upload a PDF, embed it, query it, get an answer. What they do not show is what happens when you have 50,000 documents in seven formats, half of them updated quarterly, and the system starts confidently citing last year's pricing in response to customer questions. That is when the real engineering begins.
I want to walk through what a production RAG system actually looks like, with specific attention to the part that causes the most failures: the document-processing pipeline. If you are building or maintaining a RAG system, this is the stuff that will save you from the 3 AM pages.
The Architecture Nobody Draws on Whiteboards
The whiteboard version of RAG is three boxes and two arrows: documents go in, vectors get stored, queries get answered. The production version has about fifteen more boxes, most of them dealing with things that went wrong.
A production RAG includes these stages, and each one is a potential failure point:
- Ingestion and normalization -- bringing raw text from PDFs, HTML, APIs, internal wikis, and databases into a consistent structure. This stage alone can take 40% of your total development time. I am not exaggerating.
- Chunking and embedding -- segmenting text into semantically meaningful units and converting them into vector representations. This is where most teams make their biggest mistakes.
- Vector storage and retrieval -- storing and searching embeddings efficiently, typically using approximate nearest-neighbor methods. The choice of index type determines your latency and accuracy tradeoffs.
- Reranking and prompt assembly -- refining search results and packaging them for the LLM within token limits. This is the difference between relevant answers and hallucinated ones.
- Monitoring, caching, and updates -- keeping latency, cost, and data freshness in check. Without this, your system degrades silently until someone notices the answers are wrong.
Each stage leaks assumptions into the next. Chunking affects retrieval quality. Retrieval quality affects prompt construction. Prompt construction affects model cost and accuracy. You cannot optimize any stage in isolation. The system is only as good as its weakest boundary.
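To make the coupling concrete, here is a minimal sketch of the online path, in Python. The callables (`retriever`, `reranker`, `llm`) are placeholders for whatever you run at each stage, not a specific library's API:

```python
def build_prompt(query: str, context: list[str]) -> str:
    joined = "\n\n".join(context)
    return f"Answer using only the context below.\n\nContext:\n{joined}\n\nQuestion: {query}"

def answer(query: str, retriever, reranker, llm) -> str:
    """End-to-end shape of the online path; each callable stands in for one stage above."""
    candidates = retriever(query, k=20)          # vector search over the ingested chunks
    context = reranker(query, candidates)[:5]    # refine results, trim to the token budget
    return llm(build_prompt(query, context))
```

Every decision upstream (how the chunks were cut, what metadata they carry) shows up here as either useful context or noise.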
The Document Processing Pipeline: Where Production Failures Begin
I spent three weeks debugging a RAG system that gave inconsistent answers to the same question. The model was fine. The vector database was fine. The problem was in the document-processing pipeline: two versions of the same policy document had been ingested, one from a PDF export and one from a Confluence page. They had different formatting, different metadata, and slightly different wording. The system was retrieving chunks from both, and the LLM was trying to reconcile contradictions that should not have existed.
This is typical. The document-processing pipeline is where your "knowledge" becomes machine-readable, and it is where most production failures start.
Metadata: The Context You Will Wish You Had
Capture timestamps, authors, section titles, document versions, and source identifiers during ingestion. Not later. During ingestion. I cannot stress this enough.
Metadata lets you filter, debug, and rank results. Without it, you end up with semantic collisions -- the same fact appearing twice, from two different time periods, with no way to determine which version is current. I have seen this cause a customer-facing system to quote a product feature that was deprecated six months earlier. The fix was not better prompting. It was better metadata.
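As a minimal sketch, this is roughly the shape of the metadata I attach to every chunk at ingestion. The field names are illustrative, not tied to any particular framework:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class ChunkMetadata:
    """Metadata captured at ingestion time. Field names are illustrative."""
    source_id: str                 # stable identifier of the source document
    source_system: str             # e.g. "confluence", "pdf-export", "wiki"
    document_version: str          # version label or content hash of the document
    section_title: str             # nearest heading above the chunk
    author: str | None = None
    language: str | None = None    # filled in by language detection, see below
    ingested_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

# Attached to every chunk before embedding, so retrieval can filter and rank on it.
meta = ChunkMetadata(
    source_id="travel-policy",
    source_system="confluence",
    document_version="v7",
    section_title="Reimbursement limits",
    author="finance-team",
)
```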
Format Normalization: Every Format Is a Dialect of Chaos
PDFs come with phantom line breaks that split sentences mid-word. HTML hides text behind CSS display rules. Word exports produce text that looks like a ransom note. Markdown from different tools has different conventions for tables and code blocks.
Normalize early. Extract text, flatten structure, but preserve hierarchy. You will need those heading boundaries later for hierarchical chunking. I learned this the hard way when I tried to add hierarchical chunking to a system that had already flattened all structure. We had to re-ingest everything.
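A rough sketch of the kind of normalization pass I mean, assuming Markdown-style headings as the hierarchy markers. The heuristics are illustrative; real pipelines need per-format rules:

```python
import re

def normalize_text(raw: str) -> str:
    """Join phantom line breaks inside paragraphs while keeping heading lines intact."""
    out: list[str] = []
    buffer: list[str] = []

    def flush() -> None:
        if buffer:
            out.append(" ".join(buffer))
            buffer.clear()

    for line in raw.splitlines():
        stripped = line.strip()
        if not stripped:                        # blank line: paragraph boundary
            flush()
        elif re.match(r"^#{1,6}\s", stripped):  # heading: keep as its own line for later chunking
            flush()
            out.append(stripped)
        else:
            buffer.append(stripped)             # re-join lines broken mid-sentence
    flush()
    return "\n\n".join(out)
```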
Language Detection for Multilingual Corpora
If your organization operates in multiple languages, mixing embeddings from different languages confuses the similarity space. I work in a bilingual environment -- English and Polish -- and learned early that you need to detect language, split by section, and tag accordingly. A Polish paragraph embedded next to an English one will produce unpredictable retrieval results.
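Here is what that tagging pass looks like, using the langdetect package purely as an example; any language-ID library will do:

```python
# pip install langdetect
from langdetect import detect

def tag_language(section_text: str) -> str:
    try:
        return detect(section_text)   # ISO 639-1 codes such as "en" or "pl"
    except Exception:                 # very short or non-alphabetic input
        return "unknown"

print(tag_language("Zasady zwrotu kosztów podróży służbowych."))  # -> "pl"
print(tag_language("Travel expense reimbursement policy."))       # -> "en"
```

The detected code goes into the chunk metadata, so retrieval can filter to the query's language.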
Deduplication: Duplicates Poison Everything
Duplicates inflate costs and degrade retrieval accuracy. Keep only canonical versions. Run hash-based deduplication on ingestion, and semantic deduplication periodically. One system I maintained had 30% duplicate content because the same documents were uploaded through three different channels. Cleaning that up improved retrieval precision by 15% with zero changes to the model or prompts.
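The hash-based pass is simple enough to sketch; semantic deduplication (flagging near-duplicates whose embeddings sit above a similarity threshold) runs separately as a periodic job:

```python
import hashlib

def content_hash(text: str) -> str:
    """Hash of lightly normalized text, so whitespace or casing differences do not defeat dedup."""
    canonical = " ".join(text.lower().split())
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

seen: dict[str, str] = {}   # content hash -> canonical document id

def is_duplicate(doc_id: str, text: str) -> bool:
    h = content_hash(text)
    if h in seen:
        return True          # same content already ingested through another channel
    seen[h] = doc_id
    return False
```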
Chunking: The Most Misunderstood Layer
Chunking determines how your documents breathe. It decides what the model can see in one glance. Too coarse, and retrieval returns blobs of unrelated context. Too fine, and you lose coherence. Every chunking decision is a tradeoff, and the right answer depends on your specific corpus.
Semantic Chunking
Split where meaning shifts, not where a token counter says so. Compute embeddings over small windows, measure cosine similarity between adjacent sections, and cut when similarity drops below a threshold. This is computationally heavier but yields contextually pure chunks. I use this for technical documentation and legal contracts where precision matters more than speed.
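A sketch of the idea, using single sentences as the small windows for brevity. The `embed` callable is whatever embedding model you use, and the threshold is illustrative and corpus-dependent:

```python
import numpy as np

def semantic_chunks(sentences: list[str], embed, threshold: float = 0.75) -> list[str]:
    """Start a new chunk where cosine similarity between adjacent sentences drops."""
    if not sentences:
        return []
    vectors = [np.asarray(embed(s), dtype=float) for s in sentences]
    chunks, current = [], [sentences[0]]
    for prev, curr, sentence in zip(vectors, vectors[1:], sentences[1:]):
        similarity = float(prev @ curr / (np.linalg.norm(prev) * np.linalg.norm(curr)))
        if similarity < threshold:        # meaning shifted: close the current chunk
            chunks.append(" ".join(current))
            current = [sentence]
        else:
            current.append(sentence)
    chunks.append(" ".join(current))
    return chunks
```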
Sliding Window with Overlap
The pragmatic workhorse for most use cases. Slice text into fixed-length segments with 20-30% overlap. The overlap ensures continuity across chunk boundaries -- critical when ideas span multiple paragraphs. You pay a small storage and embedding penalty, but you avoid losing context at the seams. For most internal knowledge bases, this is where I start.
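The whole strategy fits in a few lines. I am splitting on whitespace here for brevity; in practice you count model tokens, as covered in the token-aware section below:

```python
def sliding_window_chunks(words: list[str], size: int = 400, overlap: int = 100) -> list[str]:
    """Fixed-length chunks with 25% overlap so ideas spanning a boundary land in both chunks."""
    step = size - overlap
    return [" ".join(words[i:i + size])
            for i in range(0, max(len(words) - overlap, 1), step)]

# chunks = sliding_window_chunks(document_text.split())
```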
Hierarchical Chunking
Documents have structure, and that structure carries meaning. Hierarchical chunking captures it by creating parent-child relationships: section summaries at the top, detailed paragraphs below. Retrieval can start broad and zoom in, mimicking how humans actually search long documents. This approach works well for structured documentation like API references and compliance manuals.
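The data model is the important part, so here is a hypothetical shape for it. Retrieval matches the summaries first, then drills into that node's chunks and children:

```python
from dataclasses import dataclass, field

@dataclass
class SectionNode:
    """One level of the hierarchy: a summary for the broad pass, chunks for the zoom-in."""
    title: str
    summary: str                                       # embedded and searched first
    chunks: list[str] = field(default_factory=list)    # embedded for detailed retrieval
    children: list["SectionNode"] = field(default_factory=list)

api_doc = SectionNode(
    title="Payments API",
    summary="Endpoints for creating, capturing, and refunding payments.",
    children=[
        SectionNode(
            title="POST /payments",
            summary="Create a payment intent.",
            chunks=["Request fields ...", "Error codes ..."],
        )
    ],
)
```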
Token-Aware Chunking
All chunking strategies must respect the LLM's context window. For a model with an 8K context window, I allocate roughly 1K per chunk, leaving space for instructions, the query, and multiple retrieved contexts. Exceeding the window leads to silent truncation -- your system drops information without telling you, and the answers get subtly worse.
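A minimal token-budget splitter, sketched with tiktoken. The 1,000-token budget mirrors the 8K-context example above; it is an assumption, not a universal rule:

```python
# pip install tiktoken
import tiktoken

encoder = tiktoken.get_encoding("cl100k_base")

def split_by_token_budget(text: str, budget: int = 1000) -> list[str]:
    ids = encoder.encode(text)
    return [encoder.decode(ids[i:i + budget]) for i in range(0, len(ids), budget)]
```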
I use a combination of these strategies in practice. Technical manuals get semantic chunking. FAQs get sliding window. Structured policy documents get hierarchical. The worst approach is picking one strategy and applying it uniformly to everything.
Optimizing for Reality
Elegant architecture means nothing if your system cannot meet SLAs. Production RAGs demand balance between three forces that pull against one another: relevance, latency, and cost.
Chunk size versus retrieval quality -- Larger chunks dilute topical focus. Smaller ones fragment meaning. The sweet spot depends on your corpus: roughly 500-800 tokens for FAQs and conversational content, 1-1.5K for dense technical documents. I test different sizes on a representative query set before committing.
Retrieval speed -- Use approximate nearest-neighbor indexes like HNSW or IVFPQ. Retrieval latency above 200ms feels sluggish in a chat interface. Parallelize embedding generation and cache frequent queries. At one organization, caching the top 200 most common queries reduced average latency by 60%.
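The query cache does not need to be clever to pay for itself. A sketch of the in-process version; in production this is usually Redis, with the TTL tied to how often the underlying content changes:

```python
import hashlib
import time

class QueryCache:
    """Minimal cache keyed on the normalized query text."""

    def __init__(self, ttl_seconds: int = 3600):
        self.ttl = ttl_seconds
        self._store: dict[str, tuple[float, object]] = {}

    def _key(self, query: str) -> str:
        normalized = " ".join(query.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def get(self, query: str):
        entry = self._store.get(self._key(query))
        if entry and time.time() - entry[0] < self.ttl:
            return entry[1]
        return None

    def put(self, query: str, result) -> None:
        self._store[self._key(query)] = (time.time(), result)
```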
Cost control -- Re-embed only when content changes. Cache both embeddings and generation results. Use smaller models for reranking where possible. On one project, switching the reranker from a large model to a lightweight cross-encoder cut our monthly costs by 35% with minimal accuracy impact.
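"Re-embed only when content changes" boils down to keying embeddings by a content hash. A sketch, where `embed` stands in for whatever embedding call you pay for:

```python
import hashlib

_embedding_cache: dict[str, list[float]] = {}   # content hash -> embedding

def embed_with_cache(text: str, embed) -> list[float]:
    """Only call the embedding model when the content actually changed."""
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if key not in _embedding_cache:
        _embedding_cache[key] = embed(text)      # the only place we pay for embedding
    return _embedding_cache[key]
```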
Observability -- Track retrieval precision and recall, not just uptime. Log which chunks contribute to correct answers. In RAG systems, "it runs" is not the same as "it retrieves well." I set up a weekly sample review where we manually check 50 random query-answer pairs. It is tedious. It catches problems that automated metrics miss.
The Pragmatic Stack
For solo developers or small teams, the production setup can stay lean. Here is what I recommend based on what I have actually deployed:
- TypeScript or Python for the runtime, depending on your team's strengths.
- PostgreSQL with pgvector for local experiments and small-scale production.
- Qdrant or Pinecone when you need to scale beyond what pgvector handles comfortably.
- OpenAI text-embedding-3-large or Cohere embed-multilingual-v3.0 for embeddings, depending on whether you need multilingual support.
- GitHub Actions for automated ingestion pipelines.
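For a sense of how lean the pgvector path is, here is a sketch of the retrieval query via psycopg. It assumes a `chunks(content text, embedding vector(1536))` table with the pgvector extension installed; names and dimensions are illustrative:

```python
# pip install "psycopg[binary]"
import psycopg

def top_chunks(conn: psycopg.Connection, query_embedding: list[float], k: int = 5) -> list[str]:
    """Cosine-distance search over the chunks table."""
    vector_literal = "[" + ",".join(str(x) for x in query_embedding) + "]"
    with conn.cursor() as cur:
        cur.execute(
            "SELECT content FROM chunks ORDER BY embedding <=> %s::vector LIMIT %s",
            (vector_literal, k),
        )
        return [row[0] for row in cur.fetchall()]
```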
This stack keeps costs predictable, deployments simple, and latency low enough for real-time applications. You do not need a Kubernetes cluster to run a RAG system. You need clean data and good chunking.
What I Wish Someone Had Told Me
Production RAG is not a prompt-engineering problem. It is a systems-design problem that looks like a prompt-engineering problem until you try to scale it. The difference between a demo and a dependable system lives in the preprocessing scripts, the chunking logic, and the quiet discipline of metadata hygiene.
Every chunk is a paragraph in your organization's collective memory. If you slice, tag, and store them well, your LLM can answer truthfully without pretending to know more than it does. If you do it poorly, you get a system that sounds confident and is wrong -- the most dangerous kind of AI failure.
Good RAGs do not hallucinate less because the models are smarter. They hallucinate less because the data is well-behaved. Start there.
Damian Krawcewicz
AI strategy consultant and practitioner. 20 years in engineering, currently leading AI adoption for 100+ engineers.