Reconstructing a PDF Table of Contents for RAG Section Scoping

This article is a document parsing companion in Enterprise Document Intelligence, the series that builds an enterprise RAG system from four bricks. It extends Article 5 (document parsing) on one table: toc_df, the document’s section structure, which Article 5 fills from the PDF’s native outline (PyMuPDF’s doc.get_toc) when there is one. This part is about the case where there isn’t — reconstructing that structure from what the document still shows on the page.

Figure: where this companion sits. It extends Article 5 (document parsing), inside Part II (the four bricks), and reconstructs the table of contents when the PDF ships none.

Open NIST FIPS 202, the SHA-3 standard (a US Government work, public domain, see the NIST copyright statement), and turn to page seven. There is a clean table of contents: section titles on the left, page numbers on the right. Now open the same file in any PDF viewer and look at the bookmarks pane. Empty. The contents page is ink on a page, not structure the machine can use. The author wrote a perfectly good table of contents, and the file shipped without exposing it.

Article 5 (document parsing) and Article 5B (the relational data model) leaned on doc.get_toc(), the PDF’s native outline, to fill toc_df. It is exact when it exists. It often does not. Plenty of real documents — papers exported straight from LaTeX, contracts printed to PDF, government standards — carry a printed contents page but no outline. For those, toc_df comes back empty, even though the document is telling you its structure in plain sight on page seven.

That structure is not a nicety. Retrieval scopes by section (Article 7). The chunker cuts on heading boundaries (Article 5B). Summarization walks the document section by section. Every one of those steps reads toc_df. When it is empty, retrieval falls back to scanning every page, the chunker splits on blind page breaks, and the answer loses the document’s own structure. So the question this article answers is narrow and practical: when the file ships no outline but prints a contents page, how do you turn that page back into a toc_df?

One thing up front, because it is easy to conflate. This is about documents that have a contents page. A document with no contents page at all — a paper that just opens with “1. Introduction”, a five-page memo, an export that stripped every heading — is a different problem. Recovering a skeleton from the body of an unstructured document is summarization, a separate intent that builds the map from the chunks rather than reading one off a page. Here we only ever read a contents page the document already has.

1. Two Halves: Read the Entries, Then Find Their Real Pages

It helps to separate two things a contents page gives you. The first is a list of sections with titles and a hierarchy: what the document is about, in what order. The second is a map from each section to where it physically starts in the file. The native outline hands you both for free. Reading a printed contents page hands you the first directly, but the second only as printed labels, which are not physical pages. The two halves have different failure modes, so the rest of this article keeps them separate: first read the entries, then align them to physical pages.

In: a PDF whose doc.get_toc() returns nothing but that prints a contents page. Out: a toc_df with the same shape Article 5B defined (level, title, start_page, end_page, breadcrumb), so everything downstream keeps working unchanged.
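To make the target shape concrete, here is a minimal sketch of how end_page and breadcrumb could be derived once an entry has a level, a title and a start_page. The function name finish_entries, the last_page parameter, and the conventions used (a section ends on the page before the next section at the same or a shallower level; breadcrumbs join ancestor titles with " > ") are assumptions for illustration, not the definitions from Article 5B.

```python
def finish_entries(entries, last_page):
    """Add end_page and breadcrumb to entries carrying level, title, start_page.

    Hypothetical helper: the end_page and breadcrumb conventions are assumptions.
    """
    rows, ancestors = [], []          # ancestors: (level, title) of open sections
    for i, entry in enumerate(entries):
        level, start = entry["level"], entry["start_page"]
        # Close every open section at the same or a deeper level.
        while ancestors and ancestors[-1][0] >= level:
            ancestors.pop()
        ancestors.append((level, entry["title"]))
        # The section runs until the next entry at the same or a shallower level.
        end = last_page
        for later in entries[i + 1:]:
            if later["level"] <= level:
                end = max(start, later["start_page"] - 1)
                break
        rows.append({**entry,
                     "end_page": end,
                     "breadcrumb": " > ".join(title for _, title in ancestors)})
    return rows
```

The rows it returns can be passed to a DataFrame constructor to get a table with the five columns named above.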

The contents page comes in two flavours, and they cost different amounts to read.

2. Three Cases, by Ascending Cost

The cascade tries each case in turn and stops at the first that yields a usable TOC.

Each case has a detection step and an extraction step, and falls through to the next when it fails or returns too little.

Case 1, native outline. Handled in Article 5 by build_toc_df. Free, exact, hierarchical. When it works there is nothing to do. We recap it only to set the cost baseline.
Case 2, contents page with links. No outline, but an early page lists titles as hyperlinks pointing inside the file. The link target is the physical page, so this case skips the alignment problem entirely.
Case 3, contents page without links. A page that looks like a printed contents (titles, dot leaders, right-aligned page numbers) but carries no links. The page numbers it prints are labels in the document’s own numbering, not physical pages, so this case needs the alignment step.

All of this lives in a module of its own, separate from the native path so Article 5 stays readable. The entry point is reconstruct_toc_df.
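The fall-through logic can be sketched independently of any PDF library. This is an illustration of the cascade idea only, not the actual body of reconstruct_toc_df: the name run_cascade, the (name, reader) pairs, and the min_entries threshold for "too little" are all assumptions.

```python
def run_cascade(readers, min_entries=3):
    """Try each (name, reader) pair in order and stop at the first usable TOC.

    Hypothetical sketch: `readers` is a list of (name, zero-argument callable)
    pairs ordered from cheapest to most expensive; each callable returns a list
    of entries. `min_entries` is an assumed threshold for "too little".
    """
    for name, reader in readers:
        try:
            entries = reader()
        except Exception:
            entries = []              # a failing case falls through to the next
        if len(entries) >= min_entries:
            return name, entries
    return None, []                   # no case produced a usable TOC
```

Returning the name of the winning case alongside the entries makes it easy to log which path a given document took.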

3. Follow the Links

Case 2 is the lucky one. Some documents have no outline but do ship a clickable contents page. The NIST Cybersecurity Framework is one: page two lists every section as a hyperlink that jumps into the document. PyMuPDF exposes those links per page, and each internal link carries its target page directly.

In: the PDF (links are not in line_df, so this reader opens the file). Out: entries with a title and the physical target page, already resolved.

The detection is a density check: a page with five or more internal links is a navigation page, not a body page with the odd footnote link. The extraction joins each link’s clickable rectangle back to the text under it, then strips the leaders and the trailing page label.

import fitz   # PyMuPDF

def text_under_rect(page, rect):
    """Text inside a link's clickable rectangle (a minimal stand-in)."""
    return page.get_textbox(rect)

def extract_toc_from_links(pdf_path, min_links=5):
    """The contents page is the page carrying the most internal links."""
    best = []
    with fitz.open(pdf_path) as doc:                  # close the file when done
        for page in doc:
            entries = []
            for link in page.get_links():
                if link["kind"] != fitz.LINK_GOTO:    # internal jump only
                    continue
                # clean() strips the dot leaders and the trailing page label
                label = clean(text_under_rect(page, link["from"]))
                if label:
                    entries.append({"title": label,
                                    # link["page"] is 0-based; +1 gives the 1-based target page
                                    "start_page": link["page"] + 1,
                                    "level": 1})
            if len(entries) >= min_links and len(entries) > len(best):
                best = entries                        # richest link page wins
    return best
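The clean helper called in the listing is not shown in the article. One possible version, assuming it only has to strip dot leaders and a trailing numeric page label, could look like the sketch below. The two regular expressions are assumptions, and the second one will also eat a title that legitimately ends in a number after whitespace (for example "FIPS 202"), so treat it as a starting point.

```python
import re

# Two or more leader characters, then whatever follows (usually the page label).
LEADER_TAIL = re.compile(r"\s*[.…](?:\s*[.…])+\s*\S*\s*$")
# A bare numeric page label separated from the title by whitespace.
LABEL_TAIL = re.compile(r"\s+\d{1,4}$")

def clean(raw):
    """Hypothetical version of clean(): collapse whitespace, drop leaders and label."""
    text = " ".join(raw.split())      # the link rectangle may span several lines
    text = LEADER_TAIL.sub("", text)
    text = LABEL_TAIL.sub("", text)
    return text.strip()
```

A single dot inside a section number such as "2.1" is left alone, because the leader pattern needs at least two leader characters in a row.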

Run it on the Framework and the recovered contents come back clean.

Every title resolved to a real page, no LLM, no guesswork.

Put the detector’s output next to the page it read and you can check it by eye. The Framework’s contents page lists each section, then a List of Figures and a List of Tables; the detector recovers all three groups, titles and target pages matching line for line.

Left, the document's own contents page; right, what the detector returns.

This is the case to hope for. It is deterministic, it is exact, and the page mapping is solved by the document itself. The catch is that most documents that lack a native outline also lack clickable links, which takes us to the harder case.

4. Read the Printed Contents Page, Then Find Its Real Pages

Case 3 is the common one: a printed table of contents with no links behind it — a page headed “Contents” or “Table of contents”, a column of titles, a column of page numbers, often joined by dot leaders. FIPS 202 has exactly this. A human reads it at a glance. Parsing it has two distinct steps, and the second is the one people skip.

4.1 Detecting and Reading the Contents Page

First, find the contents page. The signal that actually separates a contents page from prose is dot-leader density: several lines of the shape Some title .......... 42. A keyword like “contents” raises confidence but is not required, and on its own is a weak signal (a sentence can say “table of contents”). The reader works on line_df alone, so it is engine-agnostic.

In: line_df. Out: entries with a title and a displayed_page, the page number as printed on the line.

import re
# "Introduction ......... 12"             "Introduction       12"
# The \s* after the title keeps the first leader dot out of group 1
# when a space separates the title from the leaders.
DOTTED   = re.compile(r"^(.*?\S)\s*[.…](?:[.…\s]){2,}(\d{1,3})$")
TRAILING = re.compile(r"^(.{2,70}?\S)\s{2,}(\d{1,3})$")

def extract_toc_from_contents(line_df):
    entries = []
    for page in find_contents_pages(line_df):    # pages dense in dot leaders
        for line in lines_of(line_df, page):
            line = line.strip()                  # "$" must sit right after the label
            m = DOTTED.match(line) or TRAILING.match(line)
            if m:
                title, label = m.group(1).strip(), int(m.group(2))
                entries.append({"title": title,
                                "displayed_page": label,      # printed label
                                "level": infer_level(title)}) # "2.3.1" -> 3
    return entries
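The listing relies on two helpers that are not shown, find_contents_pages and infer_level. The sketch below is one way they might work, under stated assumptions: rows arrive as (page, text) pairs rather than as line_df itself (its column names are not given here), a page counts as a contents page once it has min_leader_lines dot-leader lines (the threshold of 5 is a guess at "several"), and the level is read off the depth of the leading section number.

```python
import re
from collections import Counter

# A line that ends in dot leaders followed by a short page label.
LEADER_LINE = re.compile(r"\S\s*[.…](?:[.…\s]){2,}\d{1,3}\s*$")
# A leading section number such as "2", "2.3" or "2.3.1".
NUMBERING = re.compile(r"^(\d+(?:\.\d+)*)\.?(?:\s|$)")

def find_contents_pages(rows, min_leader_lines=5):
    """Pages dense in dot-leader lines. `rows` is an iterable of (page, text)."""
    hits = Counter(page for page, text in rows if LEADER_LINE.search(text))
    return sorted(page for page, n in hits.items() if n >= min_leader_lines)

def infer_level(title):
    """Depth of the section number: "2.3.1 Padding" -> 3; unnumbered -> 1."""
    m = NUMBERING.match(title)
    return m.group(1).count(".") + 1 if m else 1
```

Because the detector counts lines per page, a single sentence in the body that happens to end in a number after an ellipsis does not promote its page to a contents page.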

4.2 The Label Is Not the Page

Here is the subtlety. The contents page says Introduction .... 1. Page 1 of the file is the cover, not the introduction. Front matter (a cover, a foreword, the contents page itself) sits in front of the body, so the printed label and the physical page live in different numbering spaces. The alignment step resolves the gap between them: it scans the document for pages whose extracted text carries the printed label, anchors the printed numbering to the file’s physical page indices, and shifts every entry by the resulting offset. Once that offset is known, every displayed_page maps cleanly to a start_page the rest of the pipeline can use.
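One way to read the alignment step is as a vote. Every physical page that prints a bare number on a line of its own proposes an offset (physical page minus printed number), and the most common proposal wins. The sketch below assumes that reading. The names estimate_offset and align, the page_lines input (a dict from 1-based physical page to its line texts), and the restriction to Arabic page labels are illustrative assumptions, not the article's implementation.

```python
import re
from collections import Counter

BARE_NUMBER = re.compile(r"^\d{1,3}$")   # a line that is only a page label

def estimate_offset(page_lines):
    """Most common (physical page - printed label) across the document.

    `page_lines` maps a 1-based physical page to its list of line texts.
    Returns None when no page prints a bare number.
    """
    votes = Counter()
    for physical, lines in page_lines.items():
        for text in lines:
            text = text.strip()
            if BARE_NUMBER.match(text):
                votes[physical - int(text)] += 1
    return votes.most_common(1)[0][0] if votes else None

def align(entries, offset):
    """Shift every printed label by the offset to get a physical start_page."""
    return [{**e, "start_page": e["displayed_page"] + offset} for e in entries]

# Usage: body pages print 1, 2, 3 on physical pages 9, 10, 11.
page_lines = {1: ["Cover"], 9: ["1 Introduction", "1"], 10: ["text", "2"],
              11: ["text", "3"], 12: ["24"]}          # "24" is a stray number
offset = estimate_offset(page_lines)                  # 8
toc = align([{"title": "Introduction", "displayed_page": 1}], offset)
```

Voting rather than trusting a single page keeps a stray number in a table or figure from setting the offset. A document whose front matter uses roman numerals would need those labels handled separately.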