Context Engineering for RAG: The Four Typed Inputs Behind Every Answer

This article is a companion to Enterprise Document Intelligence, a series whose stance is that enterprise RAG amplifies the expert — it does not replace them. The architecture follows from that: four bricks (document parsing, question parsing, retrieval, generation), each emitting typed pieces that converge on one LLM call. The industry now calls that practice context engineering. Scope here is the single-document case; corpus, conversation, and tool-call extensions are follow-up work.

where this article sits in the series: Article 7bis (context engineering), the reframing companion to the four bricks

📓 Runnable notebooks are on GitHub: doc-intel/notebooks-vol1.

The public companion-code repo at doc-intel/notebooks-vol1

By the time the four bricks of a single-document RAG are built, the assembly is settled. Parsing produces relational tables. Question parsing produces a typed ParsedQuestion. Retrieval produces a filtered subset of lines, plus an audit of how it picked them. Generation produces a Pydantic answer with cited evidence. The whole thing converges on one LLM call, with a fixed system prompt and user content assembled from upstream pieces.

That pipeline has a name now. In June 2025 Tobi Lütke tweeted that “prompt engineering” was the wrong frame, and proposed “context engineering” instead: “the art of providing all the context for the task to be plausibly solvable by the LLM.” Andrej Karpathy endorsed it a week later as “the delicate art and science of filling the context window with just the right information for the next step.” Within months the term was on the cover of an O’Reilly book and structured into a taxonomy by LangChain.

What follows reads the single-document RAG pipeline through that lens. Each brick emits typed pieces; the assembly stage threads them into the LLM call; the system prompt stays fixed for caching. Naming the practice does not change the architecture. It changes what to call it when an auditor asks how the system works, and it tells the reader that the architecture is the one production teams converged on in 2025.

1. The Name, and What It Covers

Prompt engineering used to mean two related things: tuning the wording of one prompt to coax better behaviour, and writing example shots so the model knew what good output looked like. Both are narrow. They concern one block of text sent to one call.

Context engineering covers everything that lands in the model’s context window for one call:

The system prompt (the role, the rules, the examples).
The retrieved documents or rows.
Conversation history when there is one.
Tool definitions and their outputs.
Memory, scratchpads, agent state.
Structured metadata about the document, the corpus, the project.
The actual user input.

In a long-running agent that calls the model dozens of times, the prompt is one of six or eight slots. The rest comes from somewhere upstream: a retriever, a tool, a memory store, a profile lookup. The discipline shifts from “what should I write in the prompt” to “what should I assemble in the context, where does each piece come from, and how do I keep the assembly stable across calls.”

That is engineering work. It looks like software architecture: typed objects, contracts between components, audit trails, caching. The 2025 term is overdue, because the practice was already there in working production systems. Lütke and Karpathy named what teams were already doing.

The series happens to have done it from the start, brick by brick. The next sections walk through what each brick contributes to a single-document RAG payload, then through the four typed pieces that land in the LLM call and the code that produces each one. The corpus, conversation, and tool-call cases come up at the end as out-of-scope work, with pointers to where in the series they will be addressed.

Seven typed bricks feeding the LLM's context window, grouped by source: question, documents, infrastructure.

2. Every Brick Emits Typed Context

The four bricks emit typed context channels that converge on the assembly band on top, where PromptContext, the fixed system prompt, and the user template combine before the LLM call.

The schema above is the recap of what the series shipped. Each brick is a typed-context emitter. The names on the boxes are the actual fields of the actual Pydantic classes and DataFrames the code produces.

Parsing emits relational tables and one synthesis dict. line_df carries one row per line with bbox. page_df carries one row per page with type and column count. toc_df carries the table-of-contents entries with start page and depth. image_df carries embedded images with phash and metadata. parsing_summary is the doc-level synthesis: doc_type, n_pages, typical_fields, summary, plus the mechanics fields. The retrieval brick consumes the per-row tables. The question parsing brick consumes the semantic subset of parsing_summary via DocContext.

Question parsing emits a ParsedQuestion. Its fields are not free-form. keywords is a short list of content noun phrases for retrieval. intent is a literal label from a fixed enum that drives shape dispatch in generation. structural_hints.pages_hint carries pinned pages when the user said “on page 3”. answer_shape carries the expected output shape (text, amount, date, list, table, address) for the generation schema lookup. Each field is consumed by a different downstream brick. None of them are passed as raw strings to the LLM. Three articles build this row, each worth reading for a different reason:

Article 6A (question parsing thesis): the case for parsing the question before searching, and the split into a retrieval brief and a generation brief.
Article 6B (question parsing extraction): the fields the parser reads from the user string — keywords, scope, shape, decomposition, and clarification.
Article 6C (question parsing dispatch): how the parsed row picks a chunk strategy, a model tier, and the activation flags.

Retrieval emits a filtered DataFrame and an audit dict. filtered_line_df is the subset of line_df the generation brick sees. anchor_pages is the page IDs that were kept and why. The retrieval_audit carries the method that won (keyword, TOC, LLM arbiter), the LLM TOC reasoning when applicable, and the selected sections. The filtered frame is what the LLM reads. The audit is what an auditor reads. Three articles build this brick, in the order the pieces run:

Article 7A (retrieval as filtering): the mental model — narrow the candidate set rather than search it.
Article 7B (anchor detection): the keyword, embedding, and TOC detectors run in parallel to find the anchor pages.
Article 7C (the LLM arbiter): one LLM call picks the final page and says why.

Generation is a consumer, not an emitter. It takes the question, the filtered lines, the PromptContext, and the answer schema. It calls the LLM. It returns a Pydantic typed answer. The dashed border on the Generation box signals that role.

The violet “PROMPT ASSEMBLY” zone is where context engineering happens as code. It is implemented via three primitives:

A PromptContext(BaseModel) aggregator with one field per upstream context source: doc_context, future corpus_context, future project_context.
A fixed MODULE_SYSTEM_PROMPT at the module level for each brick that calls the LLM.
A MODULE_USER_TEMPLATE with named placeholders the brick fills via str.format(...).

Article 1 (the minimal four-brick RAG) introduced the bricks as a flow. Article 6A (the question parsing thesis) made the question parser typed. Article 8A (the typed generation contract) makes the generation schema typed. This article reads the same four bricks through the lens of “what context does each one contribute, and how do they reach the LLM call without polluting each other” — same code, different lens.

3. The Four Typed Pieces of a Single-Document Payload

What lands in the LLM call for a single-document RAG is four pieces, each produced by a different piece of code, each with a different cost-and-cache profile. This section walks the four in the order they appear in the user content the LLM reads.

3.1 The Fixed System Prompt

The first piece is the system message: the role description, the rules, the examples. It does not change across calls. The series writes it as a Python constant at the module level, then exposes it as a kwarg with a default so a caller can override per domain without forking:

PARSE_QUESTION_SYSTEM_PROMPT = (
    "You extract content noun phrases from the user's question..."
)

def parse_question(question, *,
                   system_prompt=PARSE_QUESTION_SYSTEM_PROMPT):
    ...

Keeping the system prompt fixed is a caching decision as much as a design decision. Providers cache the prefix of the context window when the text is identical across calls. A system prompt that varies — because it embeds the document title or the user name — breaks that cache on every call. The rule is: everything that varies goes in the user content, not the system prompt.

3.2 The Document Context

The second piece is the document context: what the parser learned about the document before any question was asked. This is where parsing_summary enters the LLM call. It carries doc_type, n_pages, typical_fields, and a short summary. It tells the model what kind of document it is reading before it sees any retrieved lines.

The series routes this through a DocContext dataclass rather than passing parsing_summary directly. DocContext holds only the fields generation needs: the semantic fields, not the mechanics. The mechanics (bbox offsets, rendering flags) stay in the DataFrames. The distinction matters because the LLM context window is finite, and filling it with rendering metadata wastes tokens the model cannot use.

3.3 The Retrieved Lines

The third piece is the filtered subset of line_df the retrieval brick produced. This is the document evidence the model answers from. It is the largest piece by token count and the one with the highest per-call variance — a question about one clause retrieves a few lines; a summary question retrieves a full section.

Two things keep this piece well-formed. First, the retrieval brick returns a DataFrame, not a string. The assembly code serialises it just before the LLM call, so the format (plain text, JSON, markdown table) is a single decision in one place. Second, the retrieval_audit travels alongside the frame as a separate field. The audit does not go to the LLM. It goes to the caller, who can log it, display it, or pass it to a downstream compliance check.

3.4 The Parsed Question

The fourth piece is the ParsedQuestion fields the generation brick needs. Not the whole object — only the fields relevant to answering: the rewritten question text, the answer_shape, and any structural_hints that survived retrieval. The keywords field, which drove retrieval, is not forwarded; it has done its job.

Passing only the relevant slice of the parsed question is the same discipline as DocContext: send what the model can use, leave behind what it cannot. The assembly function has an explicit list of forwarded fields, which makes it easy to audit what the LLM sees versus what the pipeline computed.

4. The Assembly Step

The four pieces arrive at assemble_user_content, a pure function that takes typed inputs and returns a string. Pure means no side effects, no I/O, no model calls. It is the seam between the pipeline and the LLM call, and it is the right place to enforce the contract between them.

The function signature makes the contract explicit:

def assemble_user_content(
    parsed_question: ParsedQuestion,
    doc_context: DocContext,
    filtered_lines: pd.DataFrame,
    *,
    line_serialiser: Callable = default_line_serialiser,
) -> str:

The line_serialiser kwarg is the extension point for format changes. The default serialises lines as plain text with page numbers prepended. A caller that wants JSON or markdown can pass a different serialiser without touching the assembly logic.

The output is a single string the caller passes as the content of the user message. The system prompt is a separate message. The two never merge. That separation is what makes prefix caching work.

5. What Changes in the Corpus, Conversation, and Tool-Call Cases

The single-document case is the baseline. Three extensions change what lands in PromptContext without changing the assembly contract.

Corpus context adds a corpus_context field to PromptContext. The field carries the cross-document summary the corpus indexer produced: how many documents, what document types, what date range, what shared vocabulary. Generation uses it when the question spans more than one document. Retrieval changes too: instead of filtering one line_df, it filters a joined frame across documents, then re-ranks by document relevance before passing to assembly. The typed contract at the assembly seam does not change; only the inputs to PromptContext grow.

Conversation context adds a conversation_history field. The field carries the prior turns serialised to a format the model reads. The assembly function inserts them between the document context and the retrieved lines, so the model reads: what the document is, what was said before, what the document says about the current question. Question parsing gains a coreference pass that rewrites “what about the second clause” into a self-contained question before retrieval runs.

Tool-call context adds tool definitions and tool outputs. The system prompt gains a tools block. The user content gains an interleaved sequence of tool calls and results. The assembly function gains a mode that serialises that sequence. The ParsedQuestion gains an activations flag that tells the dispatcher whether to run tool-call mode. Article 6C covers the activation flags; the tool-call assembly is follow-up work.

6. Why the Typed Contract Matters for Production

The reason to be explicit about typed inputs is not academic. Three production concerns make it practical.

Auditability. When a user disputes an answer, the audit trail has to show exactly what the model saw. A typed PromptContext with named fields makes that straightforward: log the object, and the log contains the full context. An untyped pipeline that builds the prompt via string concatenation makes the audit a reverse-engineering exercise.

Stability under model upgrades. When the underlying model changes, the assembly contract stays fixed. The serialisers may need tuning; the typed fields do not. The test suite for the assembly step exercises the contract, not the model, so it runs fast and catches regressions without API calls.

Token budget management. A typed assembly function can measure the token count of each piece before the call and apply a budget. The document context is small and fixed. The retrieved lines are variable. The budget logic trims the retrieved lines last, because they are the piece with the most variance and the least semantic density per token. That decision is only possible when the pieces arrive as separate typed inputs, not as a pre-concatenated string.

These three concerns — auditability, stability, and budget management — are what make context engineering an engineering discipline rather than a prompt-writing craft. The 2025 framing is useful precisely because it names the gap between the two.

Summary

The single-document RAG pipeline is a context engineering system. Parsing emits relational tables and a synthesis dict. Question parsing emits a typed ParsedQuestion. Retrieval emits a filtered DataFrame and an audit dict. Assembly combines a fixed system prompt, a DocContext, the retrieved lines, and the relevant slices of the parsed question into a single user content string. Generation calls the LLM and returns a typed Pydantic answer.

Naming the practice context engineering does not change the code. It changes the vocabulary for explaining the code to auditors, to new team members, and to the practitioners who will extend it to corpus, conversation, and tool-call cases. The architecture described here is the one that production teams converged on in 2025 — and the one this series has been building, brick by brick, from the start.