Anchor Detection for RAG: Parallel Detectors and One LLM Call

A three-stage RAG retrieval pipeline runs keyword and embedding detectors in parallel, then resolves candidates with a single LLM call.

This article is part of Enterprise Document Intelligence, a series that builds an enterprise RAG system from four bricks: parsing, question parsing, retrieval, and generation. It is the second of the retrieval brick’s three parts. The previous part, Article 7A (retrieval as filtering), set the mental model; this one builds the machine: parallel anchor detectors, keyword always, embeddings alongside, and one LLM call at the end.

Where this article sits in the series: Article 7 (retrieval), the anchor-detection part, inside Part II (the four bricks) – Image by author

Retrieval in an enterprise RAG system is filtering on two structured tables (line_df and toc_df), and every candidate carries an anchor (where the match lands) plus a context (what gets expanded for generation). That mental model is the subject of Article 7A (retrieval as filtering). This article zooms into how the anchors are produced: a three-stage pipeline that runs keyword detection and embeddings in parallel, aggregates the hits to a structural unit, and ends with a single LLM call that ranks the candidates with reasons.

Suppose the user asks “How is attention computed?” about the Transformer paper. Six candidate pages match “attention”. The right one mentions softmax, query, key, and d_k together, and sits in the section the TOC calls “Scaled Dot-Product Attention”. Two retrievers, keyword and embedding, both spot the candidate set, but neither alone can tell which page actually answers the question. A third step has to read the candidates side by side, along with the section each one sits in, and pick the right one with a reason an auditor can read months later.

The three-stage pipeline that follows runs on three principles:

  • Keywords run always. Keyword detection is free. There’s no scenario where you wouldn’t want its signal. It runs on both line_df and toc_df from the first millisecond.
  • Embeddings run in parallel and are optional. When vocabulary mismatch is expected or the question is conceptual, embeddings catch what keyword matching misses. With pre-computed indices, the query-time cost is microseconds. Skip them when the keyword signal is already clean.
  • One LLM call at the end. No mid-pipeline LLM “TOC reasoning” step. The arbiter at stage 3 sees the TOC, the keyword hits, the embedding hits, and the structural attachment of each candidate, in a single call. It does the reasoning over the TOC implicitly as part of ranking.

This article walks the detectors on each table (Section 2 on toc_df, Section 3 on line_df), then the combinations across both tables (Section 4). The arbiter call itself, the decision tree, and the output JSON live in Article 7C (the LLM arbiter and the retrieval output JSON).

Throughout this article we work on a single document, Attention Is All You Need (Vaswani et al. 2017, 15 pages; arXiv non-exclusive distribution license, declared on the arXiv abstract page). It carries a clean native TOC in the PDF outline (22 entries, 3 levels deep), and the content is familiar territory for any engineer touching RAG: encoder, decoder, attention, queries, keys, values. That keeps the focus on the retrieval methods rather than on parsing a domain-specific corpus. This article also assumes the document carries its own TOC; recovering one from raw text is left to follow-up work.

Every method in this article starts from line_df and toc_df – Image by author

1. The Anchor-Detection Pipeline

Anchor detection runs in three stages. Stage 1 runs keyword detection and embedding similarity in parallel on line_df and toc_df. Stage 2 aggregates the hits to a structural unit (section via toc_df if available, otherwise page or chunk). Stage 3 hands the aggregated units to a single LLM call that ranks them and writes its reasoning per pick.
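As a rough sketch of stages 1 and 2, assume each detector is a function that takes the question and returns hit dicts with a score and a section_id (or only a page when no TOC attachment exists). The detector stubs, the field names, and the choice to sum scores per unit are illustrative assumptions, not the series' implementation; stage 3 is left to Article 7C.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def run_detectors(question, detectors):
    """Stage 1: run every detector concurrently and tag hits with their source."""
    with ThreadPoolExecutor(max_workers=len(detectors)) as pool:
        futures = {name: pool.submit(fn, question) for name, fn in detectors.items()}
        hits = []
        for name, future in futures.items():
            hits += [{**hit, "detector": name} for hit in future.result()]
    return hits

def aggregate(hits):
    """Stage 2: group hits by section when available, otherwise by page."""
    units = defaultdict(lambda: {"score": 0.0, "detectors": set()})
    for hit in hits:
        if hit.get("section_id") is not None:
            key = ("section", hit["section_id"])
        else:
            key = ("page", hit["page"])
        units[key]["score"] += hit["score"]  # assumption: scores are simply summed
        units[key]["detectors"].add(hit["detector"])
    # Highest combined score first.
    return sorted(units.items(), key=lambda kv: kv[1]["score"], reverse=True)

# Stub detectors returning toy hits (hypothetical values, not real output).
def keyword_stub(question):
    return [
        {"section_id": 9, "page": 4, "score": 2.5},
        {"section_id": None, "page": 11, "score": 1.0},
    ]

def embedding_stub(question):
    return [{"section_id": 9, "page": 4, "score": 0.8}]

hits = run_detectors(
    "How is attention computed?",
    {"keyword": keyword_stub, "embedding": embedding_stub},
)
units = aggregate(hits)
for key, info in units:
    print(key, round(info["score"], 2), sorted(info["detectors"]))
```

A unit flagged by both detectors rises to the top, and the set of detectors that fired travels with it, so the stage 3 call can see which signals agreed.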

Keyword detection is the always-on baseline. It matches rows whose text contains the question’s keywords, with co-occurrence boosts when several keywords land in the same line or page. It is cheap, deterministic, and auditable. There’s no reason not to run it: it costs nothing, and when it hits cleanly, it gives the LLM a strong signal at stage 3.
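A minimal sketch of such a keyword detector, assuming line_df rows expose line_id, page, and text. Rows are shown as plain dicts to keep the example dependency-free; the stopword list, the boost formula, and the sample lines are hypothetical.

```python
import re

# Hypothetical stopword list; a real pipeline would use a fuller one.
STOPWORDS = {"how", "is", "the", "a", "an", "of", "in", "what", "does", "are"}

def extract_keywords(question):
    """Lowercase the question, split on non-alphanumerics, drop stopwords."""
    tokens = re.findall(r"[a-z0-9_]+", question.lower())
    return [t for t in tokens if t not in STOPWORDS]

def keyword_hits(lines, keywords, boost=0.5):
    """Score each line by distinct keywords matched, plus a co-occurrence boost.

    score = n_matched + boost * (n_matched - 1), so one line matching two
    keywords outranks a line matching only one.
    """
    hits = []
    for row in lines:
        tokens = set(re.findall(r"[a-z0-9_]+", row["text"].lower()))
        matched = sorted(k for k in set(keywords) if k in tokens)
        if matched:
            score = len(matched) + boost * (len(matched) - 1)
            hits.append({**row, "matched": matched, "score": score})
    return sorted(hits, key=lambda h: h["score"], reverse=True)

# Toy rows for illustration, not extracted from the paper.
lines = [
    {"line_id": 0, "page": 3, "text": "Attention is computed with a softmax over scaled dot products."},
    {"line_id": 1, "page": 5, "text": "We use attention in three different ways."},
    {"line_id": 2, "page": 7, "text": "Training took several hours."},
]
hits = keyword_hits(lines, extract_keywords("How is attention computed?"))
for h in hits:
    print(h["line_id"], h["matched"], h["score"])
```

Every hit records which keywords matched, which is what keeps the detector auditable: the score can be recomputed by hand from the matched list.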

Embeddings run in parallel as an optional second signal. Useful when vocabulary mismatch is expected (the question says “prime”, the document says “montant annuel”), or when the question is conceptual rather than lexical. If you’ve pre-computed the embeddings, the marginal cost is microseconds at query time. If not, you can skip embeddings entirely on questions where the keyword signal is already clean.
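The query-time side of the embedding detector can be sketched as a cosine-similarity lookup against a pre-computed index. The toy 3-dimensional vectors stand in for real model output, the similarity threshold is an arbitrary assumption, and embedding the question itself (a model call) is not shown.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def embedding_hits(query_vec, index, top_k=2, min_sim=0.3):
    """Return up to top_k (row_id, similarity) pairs above min_sim, best first."""
    scored = [(row_id, cosine(query_vec, vec)) for row_id, vec in index.items()]
    scored = [(row_id, sim) for row_id, sim in scored if sim >= min_sim]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]

# Toy pre-computed index: row id -> vector (hypothetical values).
index = {
    "line_0": [1.0, 0.0, 0.0],
    "line_1": [0.6, 0.8, 0.0],
    "line_2": [0.0, 0.0, 1.0],
}
query_vec = [1.0, 0.0, 0.0]
top = embedding_hits(query_vec, index)
for row_id, sim in top:
    print(row_id, round(sim, 3))
```

In practice the index would be a matrix and the lookup a single vectorized product, but the logic is the same: score every row, drop weak matches, keep the best few.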

The LLM at the end sees everything: keyword hits, embedding hits, the structural unit each candidate belongs to. It ranks the units once, with reasons. Two design consequences of putting the LLM at the end rather than mid-pipeline:

  • The LLM does the reasoning over the TOC implicitly. Asked “what happens if we exit early?” against a document whose TOC has Termination and Penalties (and no Exit section), the LLM picks both at ranking time. There’s no separate “TOC reasoning” LLM step earlier in the pipeline; the arbiter does that work as part of its single call.
  • The LLM resolves subtle title matches. If the question is about “the premium” but the relevant section is titled “Summary of the contract”, no keyword will match the title. The LLM, given the keyword hits in the body lines plus the structural attachment to that section, will still pick it.
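To make “sees everything” concrete, here is one hypothetical way to lay out the arbiter’s input as text, pairing the TOC with each aggregated unit and the detectors that fired on it. The real prompt and output schema belong to Article 7C; the function name, field names, layout, and toy contract data below are assumptions for illustration only.

```python
def build_arbiter_input(question, toc, units):
    """Render the question, the TOC, and the candidate units into one prompt string."""
    titles = {s["section_id"]: s["title"] for s in toc}
    toc_lines = [f"[id={s['section_id']}] {s['title']}" for s in toc]
    cand_lines = []
    for unit in units:
        sid = unit["section_id"]
        title = titles.get(sid, "(no section)")
        detectors = "+".join(sorted(unit["detectors"]))
        sample = unit["sample"]
        cand_lines.append(
            f'- section {sid} "{title}" | detectors: {detectors} | sample: {sample}'
        )
    parts = [
        f"Question: {question}",
        "",
        "Table of contents:",
        *toc_lines,
        "",
        "Candidates:",
        *cand_lines,
    ]
    return "\n".join(parts)

# Toy data (hypothetical): the title never mentions the premium, the body does.
toc = [
    {"section_id": 1, "title": "Summary of the contract"},
    {"section_id": 2, "title": "Termination"},
]
units = [
    {
        "section_id": 1,
        "detectors": {"keyword"},
        "sample": "The annual premium is due on 1 January.",
    }
]
prompt = build_arbiter_input("What is the premium?", toc, units)
print(prompt)
```

Because the candidate line carries both the section title and a body sample, the model can pick “Summary of the contract” even though no keyword matches the title.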

Three stages: detection (parallel) → aggregate → one LLM call – Image by author

The rest of this article walks the detection methods (Section 2 on toc_df, Section 3 on line_df, Section 4 on how the two tables collaborate). Article 7C (the LLM arbiter) is where that final call lives: the single call that turns aggregated candidates into a ranked answer.

2. Filtering on toc_df

Two detectors run on the TOC: keyword match (always, free) and embedding match (optional, parallel). Both are pure scoring, with no LLM at this stage. The cognitive work (picking the right sections from a question like “what happens if we exit early?” when the relevant section is titled “Termination”) happens later, in the arbiter call, which sees the TOC and the keyword and embedding hits together.
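A minimal sketch of the keyword detector on the TOC, assuming toc_df rows expose section_id and title (shown here as plain dicts, with a hypothetical stopword list and a toy contract TOC). It also shows the limit of pure scoring: “exit early” shares no token with “Termination”, so that question returns nothing at this stage.

```python
import re

# Hypothetical stopword list, sized for the two example questions.
STOPWORDS = {"what", "happens", "if", "we", "the", "are", "there", "for", "of", "a", "is", "how"}

def tokenize(text):
    """Lowercase, split on non-alphanumerics, drop stopwords."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS}

def toc_keyword_hits(question, toc):
    """Score each TOC entry by the number of question tokens found in its title."""
    q_tokens = tokenize(question)
    hits = []
    for row in toc:
        overlap = q_tokens & tokenize(row["title"])
        if overlap:
            hits.append({
                "section_id": row["section_id"],
                "matched": sorted(overlap),
                "score": len(overlap),
            })
    return sorted(hits, key=lambda h: h["score"], reverse=True)

# Toy contract TOC (hypothetical section ids).
toc = [
    {"section_id": 1, "title": "Termination"},
    {"section_id": 2, "title": "Penalties"},
    {"section_id": 3, "title": "Schedule of fees"},
]
fee_hits = toc_keyword_hits("Are there fees for changing the coverage?", toc)
exit_hits = toc_keyword_hits("What happens if we exit early?", toc)
print(fee_hits)
print(exit_hits)
```

The empty result for the second question is not a failure of the detector; it is the gap the arbiter fills when it reads the TOC alongside the hits.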

A standalone reason_on_toc function is shown below as a pedagogical aside: it isolates what the arbiter does internally when it reasons over the TOC. In production you can either run it as a separate call (extra LLM cost, useful for debugging) or fold it into the arbiter (one LLM call total, the preferred default).

2.1 What the Arbiter Reasons About

The toc_df is small enough to pass in its entirety to an LLM. The arbiter (developed in Article 7C) exploits this: it reads the whole TOC and reasons about which sections answer the question. The standalone reason_on_toc function below isolates the same logic as a separate call, useful when you want to inspect or debug the TOC reasoning step on its own.

Why this matters. The LLM understands semantics, but more importantly it understands implications. “What happens if we exit early?” does not share vocabulary with “Termination”, but the LLM identifies that exiting a contract is what termination means. “How does the insurer handle a flood?” does not share vocabulary with “Claims procedure”, but the LLM identifies that handling damage is the claims process. “Are there fees for changing the coverage?” may match both “Coverage modification” and “Schedule of fees”, and the LLM picks both, with reasoning that explains why. A subtle case in production: a question about “the premium” lands on a section titled “Summary of the contract”. No keyword matches, but the LLM, given the body lines that mention premium amounts attached to that section, will still pick it.

An embedding model captures “exit early ≈ termination” through similarity, but it cannot capture “exit early implies penalties”. That is reasoning, not similarity.

Run as a standalone call, the cost is one mid-tier LLM call (a few thousand tokens for a typical TOC) and a few hundred milliseconds of latency. When folded into the arbiter, it costs nothing extra, because the arbiter would see the TOC anyway. The method is infeasible on line_df: passing 12,000 lines of content to an LLM and asking it to “pick the relevant ones” is too expensive, too slow, and too unreliable. The TOC’s small size is what unlocks this method.

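import pandas as pd
from pydantic import BaseModel

# `client` and `model_chat` are assumed to be defined earlier in the series
# (an OpenAI SDK client and the name of a chat model).

class SectionSelection(BaseModel):
    """Structured output: the picked section ids plus the model's justification."""
    section_ids: list[str]
    reasoning: str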

def reason_on_toc(question: str, toc_df: pd.DataFrame) -> SectionSelection:
    """Pass the full TOC to an LLM, ask which sections are relevant, with reasoning.

    The prompt uses [id=N] markers so the LLM returns our internal section_id, not
    the title's leading number (e.g. "5.2") which would not match line_df.
    """
    toc_text = "\n".join(
        f"[id={row.section_id}] {row.title} (level {row.level}, pp. {row.start_page}-{row.end_page})"
        for row in toc_df.itertuples()
    )
    prompt = (
        "Given this question and the document's table of contents, "
        "identify which sections most likely contain the answer. "
        "Consider implications and related concepts, not just keyword overlap.\n\n"
        "IMPORTANT: return the value inside the [id=...] brackets -- just the bare integer, "
        "e.g. \"9\" not \"id=9\" and not \"5.2\".\n\n"
        f"Question: {question}\n\nTable of contents:\n{toc_text}"
    )
    return client.responses.parse(
        model=model_chat,
        input=prompt,
        text_format=SectionSelection,
    ).output_parsed

# A reader's question about the paper.
selection = reason_on_toc(
    "How does the Transformer handle long-range dependencies between words?",
    toc_df,
)
print("Picked sections:", selection.section_ids)
print("Reasoning:", selection.reasoning)

The reason_on_toc function illustrates the reasoning the arbiter applies over the TOC, whether it runs as a standalone debugging tool or is folded into the single arbiter call that concludes the full pipeline in Article 7C.