Why Enterprise RAG Gets Retrieval Wrong From the Start

The Wrong Mental Model Is Doing Real Damage

There is a quiet assumption baked into most enterprise RAG implementations: that retrieval is fundamentally a search problem. You embed the query, embed the passages, compute cosine similarity, and surface the top-k results. The vector index becomes the centerpiece, and everything upstream - parsing, structuring, labeling - gets treated as preparation for that one similarity operation. The assumption feels reasonable. It is also wrong, or at least incomplete enough to fail regularly on documents that professionals navigate every day without difficulty.

Watch what actually happens when someone needs to find a policy detail inside a corporate PDF. They open the file, press Ctrl+F, type “vacation,” and jump through the hits. If the document uses “PTO” instead, Ctrl+F returns nothing. The expert tries a synonym, then another. When that still fails, they open the table of contents, scan section titles, click “Leave and Time Off,” and read the body. That two-step - keyword first, TOC navigation when keywords fail - has driven professional document work for thirty years. Embedding similarity appears nowhere in it.

What the Filtering Model Actually Looks Like

The mental model introduced in the Enterprise Document Intelligence series - a four-brick system covering parsing, question parsing, retrieval, and generation - reframes retrieval entirely. Retrieval is the third brick. Its job is not to find similar passages but to filter two structured tables: line_df, which holds the extracted text line by line, and toc_df, which holds the document’s table of contents as a navigable map. The query drives filters against those tables, not a similarity search against an embedding index.
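To make the two tables concrete, here is a minimal sketch of what line_df and toc_df might look like and what "filtering" them means. The column names (line_id, page, text, toc_id, level, title, start_line, end_line) and the helper names filter_lines and filter_toc are assumptions for illustration; the source does not specify a schema. The rows are hand-written toy data, not real extraction output.

```python
import pandas as pd

# Toy rows written by hand for illustration; not real extraction output.
# Column names are assumptions, not a schema given by the series.
line_df = pd.DataFrame(
    [
        {"line_id": 0, "page": 2, "text": "3 Model Architecture"},
        {"line_id": 1, "page": 2, "text": "The encoder and decoder are stacks of identical layers."},
        {"line_id": 2, "page": 3, "text": "3.2 Attention"},
        {"line_id": 3, "page": 3, "text": "An attention function maps a query and key-value pairs to an output."},
    ]
)

# One row per TOC entry; start_line/end_line (inclusive) tie each entry
# back to a span of line_df. Nested sections have nested spans.
toc_df = pd.DataFrame(
    [
        {"toc_id": 0, "level": 1, "title": "Model Architecture", "start_line": 0, "end_line": 3},
        {"toc_id": 1, "level": 2, "title": "Attention", "start_line": 2, "end_line": 3},
    ]
)


def filter_lines(df, keyword):
    """Case-insensitive literal keyword filter over extracted lines."""
    return df[df["text"].str.contains(keyword, case=False, regex=False)]


def filter_toc(df, keyword):
    """Case-insensitive literal keyword filter over TOC titles."""
    return df[df["title"].str.contains(keyword, case=False, regex=False)]
```

In this sketch, the query word "attention" becomes two boolean filters, one per table, rather than a vector to compare against an index.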

This matters because it changes what the system is trying to optimize. A similarity search tries to rank passages by semantic closeness to the query. A filtering operation on line_df and toc_df tries to replicate the expert’s navigation logic - keyword co-occurrence on the same page, TOC section matching, scoped body search within the right section. The end state is the same (a passage reaches the language model), but the path there is structurally different and handles common failure modes that embedding search misses entirely.

The article works through its examples using Attention Is All You Need (Vaswani et al., 2017, 15 pages, arXiv non-exclusive distribution license). The paper carries a clean native TOC in the PDF outline - 22 entries, 3 levels deep - and its content (encoder, decoder, attention, queries, keys, values) is familiar territory for any engineer working in RAG. That familiarity keeps attention on the retrieval mechanics rather than on decoding a domain-specific corpus.

Three Places Where the Expert Hits a Wall

The filtering model is framed explicitly around what the series calls the “amplify the expert” stance: codify what the skilled human does, then do it better than they can manually. It names three concrete lifts.

First, the expert types one keyword at a time. A programmatic system can detect co-occurrence of multiple keywords on the same page or within the same section in a single pass - something no manual Ctrl+F workflow can do efficiently. Second, the expert sees nothing when text is locked inside scanned images. If the parsing brick runs OCR at ingestion, image-bound text becomes searchable like any other line in line_df, removing the failure mode entirely. Third, the expert scans the TOC manually to orient themselves before reading. The system joins toc_df and line_df programmatically: identify the right section from the map, then scope the keyword filter inside that section’s body. Each of those three pain points maps to one programmatic operation, and none of them requires a vector database.
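The first and third lifts can be sketched as two small pandas functions. This is one possible reading, not the series' implementation: the function names, the column names, and the inclusive start_line/end_line span convention are all assumptions, and the rows are toy data.

```python
import pandas as pd

# Toy data; schema and values are illustrative assumptions.
line_df = pd.DataFrame({
    "line_id": [0, 1, 2, 3, 4, 5, 6],
    "page": [2, 2, 3, 3, 3, 5, 5],
    "text": [
        "3 Model Architecture",
        "The encoder and decoder are stacks of identical layers.",
        "3.2 Attention",
        "Queries, keys and values are all vectors.",
        "The output is a weighted sum of the values.",
        "3.5 Positional Encoding",
        "The encoder input has no notion of order, so positions are added.",
    ],
})
toc_df = pd.DataFrame({
    "level": [1, 2, 2],
    "title": ["Model Architecture", "Attention", "Positional Encoding"],
    "start_line": [0, 2, 5],
    "end_line": [6, 4, 6],
})


def pages_with_all_keywords(line_df, keywords):
    """Pages on which every keyword appears at least once (lift 1)."""
    text = line_df["text"].str.lower()
    # One boolean column per keyword, one row per line.
    hits = pd.DataFrame({kw: text.str.contains(kw.lower(), regex=False) for kw in keywords})
    per_page = hits.groupby(line_df["page"]).any()
    return sorted(per_page[per_page.all(axis=1)].index.tolist())


def keyword_in_section(line_df, toc_df, section_query, keyword):
    """Keyword filter scoped to lines inside matching TOC sections (lift 3)."""
    sections = toc_df[toc_df["title"].str.contains(section_query, case=False, regex=False)]
    in_scope = pd.Series(False, index=line_df.index)
    for sec in sections.itertuples():
        # between() is inclusive on both ends by default.
        in_scope |= line_df["line_id"].between(sec.start_line, sec.end_line)
    has_kw = line_df["text"].str.contains(keyword, case=False, regex=False)
    return line_df[in_scope & has_kw]
```

The second lift (OCR) needs no retrieval code in this sketch: if the parsing brick writes OCR output into line_df as ordinary rows, the same two functions cover it.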

Anchors and Context Are Not the Same Thing

One of the cleaner distinctions the model introduces is between the anchor and the context. The anchor is where the match lands - the specific line or passage that satisfies the filter condition. The context is what actually gets passed downstream to the generation step. Those two things should not be the same chunk.

The governing principle is to pick small anchors and expand to a large context. A keyword match might land on a single line that names the right concept, but a single line rarely gives the language model enough material to answer a question well. The right move is to detect the anchor at a fine-grained level - a line, a heading, a TOC entry - then expand outward to the surrounding section before passing anything to generation. This distinction will drive the mechanics in the two subsequent articles in the series: one covering parallel anchor detectors followed by a single LLM call, and another covering an arbiter pattern that ranks the retrieved results before generation sees them.
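A minimal sketch of the anchor-to-context step, assuming the anchor is a line_id and the context is the body of the deepest TOC section containing it. The function name expand_anchor, the schema, and the "deepest section" choice are illustrative assumptions; the series may expand context differently.

```python
import pandas as pd

# Toy data; schema and values are illustrative assumptions.
line_df = pd.DataFrame({
    "line_id": [0, 1, 2, 3, 4],
    "text": [
        "3 Model Architecture",
        "The encoder and decoder are stacks of identical layers.",
        "3.2 Attention",
        "Queries, keys and values are all vectors.",
        "The output is a weighted sum of the values.",
    ],
})
toc_df = pd.DataFrame({
    "level": [1, 2],
    "title": ["Model Architecture", "Attention"],
    "start_line": [0, 2],
    "end_line": [4, 4],
})


def expand_anchor(anchor_line_id, line_df, toc_df):
    """Return the text of the deepest TOC section containing the anchor line."""
    containing = toc_df[
        (toc_df["start_line"] <= anchor_line_id) & (toc_df["end_line"] >= anchor_line_id)
    ]
    if containing.empty:
        # No section covers the anchor: fall back to the anchor line alone.
        body = line_df[line_df["line_id"] == anchor_line_id]
    else:
        section = containing.sort_values("level").iloc[-1]  # highest level number = deepest
        body = line_df[line_df["line_id"].between(section["start_line"], section["end_line"])]
    return "\n".join(body["text"])
```

Here an anchor on line 3 expands to the three lines of the "Attention" section, while an anchor on line 1 expands to the whole "Model Architecture" section, because no deeper section contains it.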

Why Similarity Search Doesn’t Disappear

None of this means embedding-based similarity search is useless inside an enterprise RAG pipeline. It means it has been given the wrong job description. When a document has been parsed into line_df and toc_df, keyword and structural filters handle the cases that similarity search handles poorly - exact terminology, document structure, section-scoped queries, OCR-recovered text from scanned pages. Similarity search handles the cases where the query uses different vocabulary than the source, or where the relevant passage contains no keywords from the query at all. Those cases exist. They are not the majority of professional document queries, but they are real.

The filtering model does not replace similarity search. It repositions it as one detector among several, rather than the architectural center of gravity around which everything else orbits.

The Table Is the Index

The clearest implication of the filtering model is that the DataFrame is the index. Once the parsing brick produces clean, structured line_df and toc_df tables, retrieval becomes a series of pandas-style operations: filter by keyword, join on section boundaries, score by co-occurrence, scope by TOC entry. The LLM enters the pipeline late - after anchors have been detected and context has been expanded - not early, not as the ranker of raw passages, not as the retriever.

That sequencing matters for cost and latency. Running an LLM over raw candidate passages to rank them is expensive. Running structured filters over a DataFrame first, then passing a small set of anchors to an LLM arbiter, is substantially cheaper and more auditable. You can inspect the intermediate tables. You can see which lines matched which filters. You can debug a failure without probing a black-box vector index.
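The auditability point can be illustrated with a small intermediate table that records which filter matched which line. This is a sketch under assumptions: the function name audit_matches, the filter labels, and the schema are hypothetical, and the rows are toy data.

```python
import pandas as pd

# Toy data; schema and values are illustrative assumptions.
line_df = pd.DataFrame({
    "line_id": [0, 1, 2, 3, 4],
    "page": [2, 2, 3, 3, 3],
    "text": [
        "3 Model Architecture",
        "The encoder and decoder are stacks of identical layers.",
        "3.2 Attention",
        "Queries, keys and values are all vectors.",
        "The output is a weighted sum of the values.",
    ],
})


def audit_matches(line_df, filters):
    """Return one row per (line, filter) match so every hit can be traced."""
    records = []
    for name, mask_fn in filters.items():
        mask = mask_fn(line_df)  # boolean Series aligned with line_df
        for line_id in line_df.loc[mask, "line_id"]:
            records.append({"line_id": int(line_id), "filter": name})
    return pd.DataFrame(records, columns=["line_id", "filter"])


# Each filter is a named function from line_df to a boolean mask.
filters = {
    "kw:attention": lambda df: df["text"].str.contains("attention", case=False, regex=False),
    "page:3": lambda df: df["page"] == 3,
}
audit = audit_matches(line_df, filters)
```

When a query returns the wrong passage, a table like audit shows directly which filter let the line through, which is the kind of inspection a similarity score alone does not offer.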

The series keeps Attention Is All You Need as its working document throughout because the parsing assumptions (clean native PDF outline, no domain-specific terminology to decode) let the retrieval logic stand on its own. Recovering a TOC from raw text when no outline exists is noted explicitly as follow-up work outside the current scope.

One Constraint Worth Keeping in Mind

This framing assumes the document carries its own table of contents. That assumption holds for most enterprise PDFs - contracts, policy documents, technical specifications, annual reports - but not universally.

When the TOC is absent, toc_df cannot be populated from the PDF outline, and the section-scoping operations that make the filtering model precise have nothing to scope against. How to reconstruct a usable TOC from raw text lines - through heading detection, font-size analysis, or numbering patterns - is the open problem the series defers. The 22-entry, 3-level outline in Vaswani et al. is a clean case. Many documents in a real enterprise corpus are not.
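As a rough illustration of the numbering-pattern route only, the sketch below treats any line shaped like "3.2.1 Title" as a heading. The regex, the function name toc_from_numbering, and the output columns are assumptions; the series does not describe its approach, and this heuristic is deliberately naive.

```python
import re

import pandas as pd

# Assumed heading shape: dotted section number, whitespace, capitalized title.
HEADING_RE = re.compile(r"^(\d+(?:\.\d+)*)\s+([A-Z].*)$")


def toc_from_numbering(line_df):
    """Build a rough TOC from lines that look like numbered headings."""
    rows = []
    for line_id, text in zip(line_df["line_id"], line_df["text"]):
        match = HEADING_RE.match(text.strip())
        if match:
            number, title = match.groups()
            rows.append({
                "level": number.count(".") + 1,  # "3.2.1" -> level 3
                "title": title,
                "start_line": int(line_id),
            })
    return pd.DataFrame(rows, columns=["level", "title", "start_line"])


# Toy lines; headings follow the paper's section names, body lines are made up.
line_df = pd.DataFrame({
    "line_id": [0, 1, 2, 3, 4],
    "text": [
        "3 Model Architecture",
        "The encoder and decoder are stacks of identical layers.",
        "3.2 Attention",
        "3.2.1 Scaled Dot-Product Attention",
        "The output is a weighted sum of the values.",
    ],
})
toc_guess = toc_from_numbering(line_df)
```

A pattern like this over-matches easily: a body line such as "8 GPUs were used for training." would be picked up as a level-1 heading. That fragility is part of why TOC reconstruction is treated as a separate problem.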