RAG in Production: The Failure Modes Nobody Writes About

Jan 25, 2024 | 5 min read | Utso Sarkar

Every RAG tutorial ends with a green checkmark. Your vector database returns relevant chunks. Your LLM synthesizes a coherent answer. You ship it Friday afternoon and spend Monday explaining why the bot told a customer that your refund policy covers time travel.

I have deployed RAG systems in three production environments. Two of them required architectural rewrites within six weeks. The third worked because we treated retrieval as an engineering problem, not a LangChain recipe. Here is what actually breaks.

The Demo Lie

Demo RAG uses clean PDFs, short documents, and questions written by the person who indexed the corpus. Production RAG ingests Confluence pages from 2019, Slack exports with broken threading, and PDF tables that OCR turned into abstract art. The embedding model does not know your org chart. It knows cosine similarity between token sequences.

The failure is not hallucination. Hallucination is downstream. The failure is retrieval that looks confident and is wrong. Users trust retrieved context more than raw model output. That makes bad retrieval worse than no retrieval at all.

Chunking Is Not a Hyperparameter

Everyone treats chunk size like learning rate: sweep 256, 512, 1024, pick the best on a dev set of twelve questions, declare victory.

Real documents have structure. A chunking strategy that splits mid-paragraph destroys referential context. Split mid-table and you get numeric fragments that embed near unrelated financial data. Split API documentation by token count instead of at function boundaries and your retrieval returns half a signature with no return type.

We ran an audit on a legal corpus: 34% of chunks contained a pronoun whose antecedent lived in an adjacent chunk. No amount of re-ranking fixes that. You need structure-aware chunking: respect headings, tables, code blocks, and list hierarchies. For technical docs, chunk by semantic section, not token count.
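As a rough illustration of chunking by section rather than by token count, here is a minimal sketch. It assumes the source documents are Markdown with `#` headings and triple-backtick code fences, which will not hold for every corpus; the `Chunk` type and `chunk_by_section` function are hypothetical names, not part of any library. Each chunk keeps its heading path as metadata, and code blocks are never split.

```python
import re
from dataclasses import dataclass

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")
FENCE = "`" * 3  # Markdown code fence marker (assumed input convention)


@dataclass
class Chunk:
    heading_path: list[str]  # e.g. ["API", "get_user"]
    text: str


def chunk_by_section(markdown: str) -> list[Chunk]:
    """Split a Markdown document at headings, keeping code fences intact."""
    chunks: list[Chunk] = []
    path: list[str] = []
    buffer: list[str] = []
    in_fence = False

    def flush() -> None:
        # Emit the buffered lines as one chunk under the current heading path.
        body = "\n".join(buffer).strip()
        if body:
            chunks.append(Chunk(heading_path=list(path), text=body))
        buffer.clear()

    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        # Lines inside a code fence are never treated as headings.
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            del path[level - 1:]  # drop headings at this level or deeper
            path.append(match.group(2).strip())
        else:
            buffer.append(line)
    flush()
    return chunks
```

A section that is still too long after this split would need a second pass (for example, splitting at paragraph breaks), which this sketch leaves out.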

Overlap helps but is not a cure. Overlap without boundary awareness just gives you duplicate wrong answers with higher recall.

Hybrid Search: When Vectors Lie

Pure vector search fails on exact-match queries. A user asks for error code E-4471 and your embedding model returns chunks about “error handling best practices” because the semantics overlap. Hybrid search (BM25 + dense vectors) fixes this, but introduces new failure modes.

BM25 dominates on rare tokens. Dense retrieval dominates on paraphrase. Without score normalization and fusion tuning, one modality silently wins and you get bimodal behavior: some queries work perfectly, others fail consistently, and your logs look random until you plot query type against retrieval source.

We use reciprocal rank fusion with per-collection calibration. We also log which modality contributed the winning chunk for every query. That logging paid for itself in the first week when we discovered that 40% of support tickets referenced SKU numbers that vector search never surfaced.
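To make the fusion and logging idea concrete, here is a minimal sketch of reciprocal rank fusion that also records which modality contributed most to each fused result. The function name `rrf_fuse`, the per-source `weights` argument (a simple stand-in for calibration, not the calibration scheme described above), and the document IDs are all assumptions for illustration. The constant `k = 60` is a commonly used default for RRF.

```python
from collections import defaultdict
from typing import Optional


def rrf_fuse(
    rankings: dict[str, list[str]],
    k: int = 60,
    weights: Optional[dict[str, float]] = None,
) -> list[tuple[str, float, str]]:
    """Fuse ranked ID lists; return (doc_id, fused_score, top_contributing_source)."""
    weights = weights or {}
    scores: dict[str, float] = defaultdict(float)
    contributions: dict[str, dict[str, float]] = defaultdict(dict)

    for source, doc_ids in rankings.items():
        w = weights.get(source, 1.0)  # hypothetical per-source weight
        for rank, doc_id in enumerate(doc_ids, start=1):
            part = w / (k + rank)
            scores[doc_id] += part
            contributions[doc_id][source] = part

    ordered = sorted(scores, key=lambda d: scores[d], reverse=True)
    return [
        # The source with the largest share of the fused score is what gets logged.
        (d, scores[d], max(contributions[d], key=contributions[d].get))
        for d in ordered
    ]


fused = rrf_fuse({
    "bm25": ["sku-123", "doc-a", "doc-b"],
    "dense": ["doc-a", "doc-c", "doc-b"],
})
for doc_id, score, source in fused:
    print(f"{doc_id}\t{score:.4f}\ttop source: {source}")
```

Aggregating the third field by query type is one way to spot the case where a single modality silently wins.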

Re-Ranking: The Expensive Band-Aid

Cross-encoder re-rankers improve precision dramatically. They also add 200-800 ms of latency per query and do not fix garbage chunks. Re-ranking is a filter, not a foundation.

The production pattern that works: retrieve wide (top 50-100), re-rank narrow (top 5), then apply a confidence threshold before generation. If the top re-ranked score falls below that threshold, refuse to answer or escalate to a human. Most teams skip the check because it makes demos look bad. Production is not a demo.
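The retrieve-wide, re-rank-narrow, gate-before-generation pattern can be sketched roughly as follows. The re-ranker is passed in as a plain callable (`score_pair`), so this makes no claim about any particular cross-encoder library; `GateResult`, `rerank_and_gate`, and the default threshold of 0.5 are hypothetical and would need tuning against real score distributions.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class GateResult:
    answerable: bool            # False means refuse or escalate
    context: list[str]          # top re-ranked chunks, kept even on refusal
    top_score: Optional[float]  # None when there were no candidates


def rerank_and_gate(
    query: str,
    candidates: Sequence[str],
    score_pair: Callable[[str, str], float],
    top_n: int = 5,
    min_score: float = 0.5,  # assumed threshold; calibrate per re-ranker
) -> GateResult:
    """Score every (query, chunk) pair, keep the top_n, and gate on the best score."""
    scored = sorted(
        ((score_pair(query, chunk), chunk) for chunk in candidates),
        key=lambda pair: pair[0],
        reverse=True,
    )
    top = scored[:top_n]
    context = [chunk for _, chunk in top]
    if not top:
        return GateResult(answerable=False, context=[], top_score=None)
    best = top[0][0]
    # Below the threshold, the caller should refuse or hand off to a human.
    return GateResult(answerable=best >= min_score, context=context, top_score=best)
```

Returning the context even when `answerable` is false lets the refusal path hand the retrieved chunks to whoever picks up the escalation.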

We also learned to re-rank on the full query-chunk pair, not truncated pairs. Truncation for speed silently drops the constraints that matter in long enterprise queries.

Failure Points in the Pipeline

flowchart TD
    A[User Query] --> B[Query Preprocessing]
    B --> C{Embedding Model}
    C -->|Stale index| D[Wrong Vector Space]
    C -->|OK| E[Hybrid Retrieval]
    E --> F[BM25 Results]
    E --> G[Dense Results]
    F --> H[Score Fusion]
    G --> H
    H -->|Poor calibration| I[Wrong Modality Wins]
    H -->|OK| J[Re-Ranker]
    J -->|Low confidence| K[Should Refuse - Often Skipped]
    J -->|OK| L[Context Assembly]
    L -->|Bad chunking| M[Broken Context]
    L -->|OK| N[LLM Generation]
    N --> O[Confident Wrong Answer]
    D --> O
    I --> O
    K --> O
    M --> O

    style D fill:#8b0000,color:#fff
    style I fill:#8b0000,color:#fff
    style K fill:#8b0000,color:#fff
    style M fill:#8b0000,color:#fff
    style O fill:#8b0000,color:#fff

Every red node is a failure mode we hit in production. Most tutorials skip straight from score fusion (H) to LLM generation (N).

Evaluation That Actually Matters

Offline metrics on synthetic QA pairs lie. Build an evaluation set from production failures: every escalated ticket, every thumbs-down, every “that is not what our docs say” message. Label the failure stage (retrieval, re-rank, generation, chunking). Fix the stage, not the symptom.
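One possible shape for such an evaluation set, as a minimal sketch: each production failure becomes a record labeled with the stage that broke, and a simple tally shows where to spend effort first. The `Stage` enum, `FailureCase` fields, and the `source` values in the comments are hypothetical names chosen for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Stage(Enum):
    CHUNKING = "chunking"
    RETRIEVAL = "retrieval"
    RERANK = "re-rank"
    GENERATION = "generation"


@dataclass
class FailureCase:
    query: str
    source: str   # where the case came from, e.g. "escalated_ticket", "thumbs_down"
    stage: Stage  # the pipeline stage a reviewer judged to be at fault


def failures_by_stage(cases: list[FailureCase]) -> list[tuple[Stage, int]]:
    """Count labeled failures per stage, most frequent first."""
    return Counter(case.stage for case in cases).most_common()
```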

Track retrieval recall at k=10 separately from end-to-end answer quality. If recall is low, re-ranking and prompt engineering are theater.
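For reference, recall at k is the fraction of known-relevant chunks that appear in the top k retrieved results. A minimal sketch, assuming each evaluation example carries a list of retrieved chunk IDs and a set of IDs a reviewer marked as relevant (the function names are illustrative):

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 10) -> float:
    """Fraction of relevant IDs found among the first k retrieved IDs."""
    if not relevant:
        raise ValueError("relevant set must not be empty")
    hits = set(retrieved[:k]) & relevant
    return len(hits) / len(relevant)


def mean_recall_at_k(
    examples: list[tuple[list[str], set[str]]], k: int = 10
) -> float:
    """Average recall@k over (retrieved, relevant) pairs."""
    if not examples:
        raise ValueError("examples must not be empty")
    return sum(recall_at_k(r, rel, k) for r, rel in examples) / len(examples)
```

Computing this on retrieval output alone, before any re-ranking or generation, keeps the number independent of prompt changes.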

What We Changed

After the second rewrite we landed on: structure-aware chunking with metadata (source, section, timestamp), hybrid retrieval with logged modality contribution, cross-encoder re-ranking with confidence gating, and a refusal path that routes to human support with the retrieved context attached for faster resolution.

Latency went up 300ms. Wrong answers dropped 60%. Support escalations from the bot dropped because we stopped pretending low-confidence retrieval was good enough.

RAG in production is not an LLM problem. It is a search problem with an LLM stapled on the end. Treat it that way or your users will treat your product as broken. They will be right.
