The "RAG is dead" debate, looked at honestly
A common take through 2024 and into 2025 was that long context would make retrieval-augmented generation redundant. The argument had momentum: Gemini 1.5 shipped a 2M-token window, GPT-4.1 moved to a million tokens, Claude Sonnet pushed into the hundreds of thousands, prompt caching cut the cost of re-sending long documents, and a number of well-run evaluations (Needle-in-a-Haystack variants, LongBench, RULER) showed models recalling facts at unprecedented lengths.
The reality by Q1 2026 is more boring and more useful than "RAG is dead." Long context has genuinely retired some RAG use cases – especially single-document reasoning – but the shape of enterprise retrieval problems hasn't fundamentally changed. Most production knowledge bases are terabytes, not gigabytes. Most queries touch a small, unpredictable slice of that corpus. Freshness, permissioning, attribution, and audit still matter. Those constraints keep retrieval alive. What has changed is the geometry of a good pipeline.
Where long context has actually replaced RAG
Being specific helps. Long context and prompt caching have, in our experience, displaced retrieval in three fairly narrow situations:
- Single-document workflows. Summarising a 300-page contract, extracting structured data from a deck, answering questions over a specific technical manual – cases where the entire corpus fits in context and stays stable across sessions. With caching, re-sending the document costs less than standing up and maintaining an index.
- Small, slow-changing knowledge bases. An internal HR handbook or onboarding guide that updates quarterly and fits comfortably in 200-400k tokens is often cheaper to serve via prompt caching than via a retrieval pipeline, especially at small query volumes (see the caching sketch after this list).
- Evaluation and debugging flows. When a developer is iterating on how a model reasons over a corpus, pushing the corpus into the prompt removes retrieval as a variable. Once the reasoning stabilises, retrieval goes back in.
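To make the caching economics concrete, here is a minimal sketch of the pattern using Anthropic's Messages API. The file name, model string, and system prompt are illustrative, not prescriptive:

```python
# Minimal sketch: serving a small, stable document via prompt caching
# instead of a retrieval pipeline. Assumes the Anthropic Python SDK;
# the document path and model name are illustrative placeholders.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

with open("hr_handbook.md") as f:
    handbook = f.read()  # must fit comfortably inside the context window

def ask(question: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": "Answer strictly from the handbook below.",
            },
            {
                "type": "text",
                "text": handbook,
                # Marks the document as cacheable: subsequent calls that
                # resend the identical prefix hit the cache at reduced cost.
                "cache_control": {"type": "ephemeral"},
            },
        ],
        messages=[{"role": "user", "content": question}],
    )
    return response.content[0].text
```

The document is written to the cache on the first call; every later call that resends the identical prefix is billed at the cache-read rate, which is what makes this cheaper than an index at low query volumes.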
Where RAG still beats long context
For most enterprise workloads, long context is a complement, not a replacement. Four constraints consistently push us back to retrieval:
- Scale. A few hundred million tokens of documentation exceeds any current window, and pushing a filtered slice through retrieval is orders of magnitude cheaper than attempting to fit the whole corpus in the prompt. Cost per query, not window size, is the binding constraint.
- Freshness. If the answer depends on documents written yesterday – support tickets, pull requests, incident reports – a vector index (or hybrid index) updated on ingestion will beat a static prompt every time.
- Permissioning and attribution. Retrieval lets you filter on ACLs before the model sees content (see the sketch after this list). It also produces per-chunk provenance, which regulated teams (finance, health, legal) need for audit trails. Long context gives you neither.
- Evaluation leverage. A retrieval pipeline gives you two distinct knobs – retrieval quality and generation quality – and two evaluation loops. Collapsing them into one prompt means every regression is harder to diagnose.
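To illustrate the permissioning point, here is a minimal pure-Python sketch of ACL-aware retrieval; the Chunk structure and group model are illustrative, but the shape generalises to any vector store that supports metadata filters:

```python
# Minimal sketch of ACL-aware retrieval: permissions are enforced as a
# hard filter *before* similarity scoring, so unauthorised chunks never
# reach the model. Chunk fields and the group model are illustrative.
from dataclasses import dataclass, field

@dataclass
class Chunk:
    text: str
    source: str                      # provenance, kept for per-chunk attribution
    allowed_groups: frozenset[str]   # ACL: groups permitted to see this chunk
    embedding: list[float] = field(default_factory=list)

def cosine(a: list[float], b: list[float]) -> float:
    num = sum(x * y for x, y in zip(a, b))
    den = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return num / den if den else 0.0

def retrieve(query_emb: list[float], chunks: list[Chunk],
             user_groups: set[str], k: int = 5) -> list[tuple[str, str]]:
    # ACL filter first: a chunk is visible only if the user shares a group.
    visible = [c for c in chunks if c.allowed_groups & user_groups]
    # Rank only the visible slice; each hit carries its source for audit.
    ranked = sorted(visible, key=lambda c: cosine(query_emb, c.embedding),
                    reverse=True)
    return [(c.text, c.source) for c in ranked[:k]]
```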
What's genuinely new: contextual retrieval, GraphRAG, hybrid search
The retrieval stack has kept moving, and three patterns are now table stakes in serious production deployments.
Anthropic's contextual retrieval, published in late 2024, showed that prepending a short LLM-generated summary of each chunk's place in its parent document (before embedding and before BM25 indexing) reduced failed retrievals by around 35-50% on standard benchmarks, with further gains from reranking. The trick is inexpensive in 2026 – with prompt caching, the per-chunk context-generation cost is measured in cents per thousand chunks – and it delivers the largest quality jump of any single retrieval technique we've tested this year.
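In outline, the technique looks like the sketch below. The prompt wording is paraphrased from Anthropic's write-up rather than quoted, and the model name is one plausible choice; the key move is that the whole document sits in a cached prefix while each chunk is contextualised against it:

```python
# Sketch of contextual retrieval: before indexing, each chunk is prefixed
# with a short LLM-generated description of where it sits in its parent
# document. Prompt wording is paraphrased, model name illustrative.
import anthropic

client = anthropic.Anthropic()

CONTEXT_PROMPT = (
    "Here is a chunk from the document above:\n<chunk>\n{chunk}\n</chunk>\n"
    "Write a short context (1-2 sentences) situating this chunk within the "
    "overall document, to improve search retrieval of the chunk. "
    "Answer with only the context."
)

def contextualize(document: str, chunk: str) -> str:
    response = client.messages.create(
        model="claude-3-5-haiku-20241022",
        max_tokens=150,
        system=[{
            "type": "text",
            "text": f"<document>\n{document}\n</document>",
            # The full document is cached once and reused across every
            # chunk's context-generation call, which is what keeps the
            # per-chunk cost low.
            "cache_control": {"type": "ephemeral"},
        }],
        messages=[{"role": "user",
                   "content": CONTEXT_PROMPT.format(chunk=chunk)}],
    )
    context = response.content[0].text.strip()
    # The contextualised text feeds *both* the embedding model and BM25;
    # the raw chunk alone is never indexed.
    return f"{context}\n\n{chunk}"
```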
GraphRAG, pushed into the mainstream by Microsoft Research, addresses the other major RAG failure mode: questions that require reasoning across many related chunks rather than retrieving a single best one. Building a knowledge graph over the corpus (entities, relationships, communities) and routing multi-hop queries through the graph outperforms vector-only retrieval on global-summary and multi-entity questions by a wide margin. The cost is real – indexing is expensive – which means GraphRAG is the right answer for high-value, low-query-volume corpora (regulatory filings, research literature, M&A diligence) and overkill for customer support.
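A toy sketch of the routing idea (emphatically not Microsoft's implementation: entity extraction is stubbed, community detection omitted) shows why graph structure helps multi-hop questions:

```python
# Toy illustration of GraphRAG routing: index time builds an entity graph
# over chunks; query time walks relationships instead of relying on a
# single vector lookup. extract_entities stands in for an LLM call.
import itertools
import networkx as nx

def extract_entities(chunk: str) -> list[str]:
    """Placeholder for an LLM extraction call returning entities."""
    raise NotImplementedError

def build_graph(chunks: list[str]) -> nx.Graph:
    g = nx.Graph()
    for i, chunk in enumerate(chunks):
        for a, b in itertools.combinations(set(extract_entities(chunk)), 2):
            # Edge = two entities co-occur; remember which chunks attest it.
            g.add_edge(a, b)
            g[a][b].setdefault("chunks", set()).add(i)
    return g

def multi_hop_context(g: nx.Graph, start: str, end: str) -> set[int]:
    # Collect the chunks along the entity path linking the two query
    # entities: evidence a single top-k vector lookup tends to miss.
    path = nx.shortest_path(g, start, end)
    return set().union(*(g[a][b]["chunks"] for a, b in zip(path, path[1:])))
```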
Hybrid search – BM25 combined with dense vector retrieval, reranked by a cross-encoder – remains the baseline that beats every single-method approach in our benchmarks. If a team is still running vector-only, that is the first thing to fix before anything more ambitious.
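A minimal hybrid baseline, assuming rank_bm25 and sentence-transformers with commonly used checkpoints (swap models to taste):

```python
# Minimal hybrid search sketch: BM25 and dense retrieval fused with
# reciprocal rank fusion (RRF), then reranked by a cross-encoder.
# Library and model choices here are common defaults, not prescriptions.
from rank_bm25 import BM25Okapi
from sentence_transformers import CrossEncoder, SentenceTransformer

docs = ["..."]  # your chunked (ideally contextualised) corpus
embedder = SentenceTransformer("all-MiniLM-L6-v2")
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

bm25 = BM25Okapi([d.lower().split() for d in docs])
doc_embs = embedder.encode(docs, normalize_embeddings=True)

def hybrid_search(query: str, k: int = 5, pool: int = 50) -> list[str]:
    # Two independent rankings over the same corpus.
    sparse = bm25.get_scores(query.lower().split())
    dense = doc_embs @ embedder.encode(query, normalize_embeddings=True)
    order_s = sorted(range(len(docs)), key=lambda i: -sparse[i])
    order_d = sorted(range(len(docs)), key=lambda i: -dense[i])
    # Reciprocal rank fusion: robust to the two scorers' different scales.
    rrf = {i: 0.0 for i in range(len(docs))}
    for order in (order_s, order_d):
        for rank, i in enumerate(order):
            rrf[i] += 1.0 / (60 + rank + 1)
    candidates = sorted(rrf, key=rrf.get, reverse=True)[:pool]
    # Cross-encoder rerank of the fused candidate pool.
    scores = reranker.predict([(query, docs[i]) for i in candidates])
    reranked = sorted(zip(candidates, scores), key=lambda t: -t[1])
    return [docs[i] for i, _ in reranked[:k]]
```

RRF is used here because it needs no score normalisation across the two retrievers; weighted score blending also works but is fiddlier to tune.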
The agentic retrieval shift
The subtler change is that retrieval is increasingly a tool the agent calls, not a preprocessing step the application runs. Instead of a single top-k lookup before the model responds, the agent decides whether to retrieve, with what query, how many passes, and when to stop – sometimes issuing follow-up retrievals to clarify an earlier result.
This is a better fit for how real questions are asked. Users rarely phrase questions in a form that embeds cleanly. An agent that rephrases "what were we doing about the Q3 pricing issue" into two or three targeted retrievals (against different indexes, possibly against different time ranges) consistently beats a single-shot pipeline. The cost of the extra LLM calls is typically dwarfed by the cost of a wrong answer.
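In code, the shift is small but real: retrieval becomes a tool the model can call zero, one, or several times. A sketch using Anthropic's tool-use API, where search_index is whatever retrieval function you already have (the hybrid baseline above would do) and the model name is illustrative:

```python
# Sketch of retrieval-as-a-tool: the model decides whether to search,
# with what query, and how many rounds. search_index(query) -> list[str]
# is assumed to be your existing retrieval function.
import anthropic

client = anthropic.Anthropic()

SEARCH_TOOL = {
    "name": "search",
    "description": "Search the internal knowledge base. Call repeatedly "
                   "with refined queries until you can answer.",
    "input_schema": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
}

def answer(question: str, search_index, max_turns: int = 5) -> str:
    messages = [{"role": "user", "content": question}]
    for _ in range(max_turns):
        response = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=1024,
            tools=[SEARCH_TOOL],
            messages=messages,
        )
        if response.stop_reason != "tool_use":
            return response.content[0].text  # model chose to answer
        messages.append({"role": "assistant", "content": response.content})
        # Execute every search the model requested this turn.
        results = [
            {
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": "\n\n".join(search_index(block.input["query"])),
            }
            for block in response.content
            if block.type == "tool_use"
        ]
        messages.append({"role": "user", "content": results})
    return "Gave up after too many retrieval rounds."
```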
Designing for agentic retrieval shifts where your work goes. Index quality, chunk metadata (especially timestamps and source types), and latency per retrieval matter more than squeezing the last percentage point out of a reranker. And it shifts evaluation from "did we retrieve the right chunk?" to "did the agent construct the right retrieval plan?" – which most RAG evaluation suites are not yet measuring well.
Evaluation is still the hardest part
Most RAG systems we review are undermeasured. Teams ship accuracy numbers from a small golden set that was hand-crafted during the PoC and never updated. Six months later, the distribution has drifted, the corpus has doubled, and nobody trusts the number.
The minimum viable RAG evaluation in 2026 has four layers:
- A retrieval-only metric (recall@k, precision@k, MRR) on a regression set (sketched below).
- A generation-quality metric (faithfulness, answer relevance) computed with an LLM-as-judge approach calibrated against a small human-labelled slice.
- A failure-mode catalogue – hallucinations, over-refusals, stale answers, permission leaks – tracked over time.
- A production sample pipeline that rolls live traffic back into the regression set weekly.
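The first layer is the cheapest to stand up. A minimal sketch, assuming a regression set of (query, relevant-chunk-ids) pairs:

```python
# Sketch of the retrieval-only layer: recall@k and MRR over a regression
# set. The metric definitions are standard; the data format is illustrative.
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant) if relevant else 0.0

def mrr(retrieved: list[str], relevant: set[str]) -> float:
    # Reciprocal rank of the first relevant hit; 0 if nothing relevant.
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def run_regression(retriever, regression_set, k: int = 10) -> dict:
    # regression_set: iterable of (query, set_of_relevant_chunk_ids).
    # retriever(query, k) returns chunk ids, best first.
    recalls, mrrs = [], []
    for query, relevant in regression_set:
        retrieved = retriever(query, k)
        recalls.append(recall_at_k(retrieved, relevant, k))
        mrrs.append(mrr(retrieved, relevant))
    n = len(recalls)
    return {"recall@k": sum(recalls) / n, "mrr": sum(mrrs) / n}
```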
Ragas, TruLens, and adjacent internal tooling have become the usual starting points. The harder discipline is institutional: treating evaluation as the load-bearing asset, not as a checkbox on the pre-launch list.
What to retire, what to build next
If you're looking at a RAG system that shipped in 2023 or 2024, here's a sensible prune list:
- Naive single-chunk, single-vector retrieval is obsolete – upgrade to hybrid search plus reranking.
- Over-engineered custom chunk sizes are rarely worth the effort; structural chunking (paragraph, section, logical unit) plus contextual retrieval beats most hand-tuned schemes (sketched below).
- Single monolithic vector stores for everything tend to underperform a layered approach (different indexes for different content types, different freshness SLAs).
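For the chunking point, a minimal sketch of structural chunking for markdown sources: split on headings, fall back to paragraphs only when a section blows a size budget (the budget value is illustrative):

```python
# Sketch of structural chunking for markdown: each chunk is a coherent
# section, with paragraph-level splits only for oversized sections.
import re

def structural_chunks(markdown: str, max_chars: int = 4000) -> list[str]:
    # Split at heading boundaries, keeping each heading with its section.
    sections = re.split(r"\n(?=#{1,6} )", markdown)
    chunks = []
    for section in sections:
        if len(section) <= max_chars:
            chunks.append(section.strip())
            continue
        # Oversized section: fall back to paragraph-level splits.
        buf = ""
        for para in section.split("\n\n"):
            if buf and len(buf) + len(para) > max_chars:
                chunks.append(buf.strip())
                buf = ""
            buf += para + "\n\n"
        if buf.strip():
            chunks.append(buf.strip())
    return [c for c in chunks if c]
```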
What to build next depends on the pressure point. If cost is the binding constraint, lean into prompt caching, smaller embedding models (the quality gap between a 1B and a 7B embedding model has narrowed substantially in 2025-2026), and smarter retrieval planning. If quality is the binding constraint, contextual retrieval and reranking are the highest-leverage upgrades. If multi-hop reasoning is failing, GraphRAG is worth piloting on a scoped corpus. And in all cases, treat agentic retrieval – letting the model drive the search – as the direction of travel, not as an optional flourish.