Building a Cost-Ordered Image Reading Cascade for PDF RAG

This article is a companion in Enterprise Document Intelligence, the series that builds an enterprise RAG system from four bricks. It extends Article 5 (document parsing) on one table: image_df, which locates every picture in the PDF without reading any of them. This part builds the reading toolbox: a cost-ordered cascade — a cheap filter, a type check, classic OCR, a vision model — that turns the few images worth paying for into searchable text.

Figure: Where this companion sits. It extends Article 5 (document parsing), inside Part II (the four bricks), reading the images the parser only located.

The parsing brick gives you image_df: one row per image in the PDF, with its page, its bounding box, its size, and a content hash. That locates every picture. It does not say what any of them shows. For retrieval, that is the same as not having them: a bounding box is not something a user can search, and the image’s text slot — the place a description would live — is empty.

The reflex is to throw a vision model at every image and be done. That is the wrong default. A real document is full of images that carry nothing a reader would ever search for: the company logo in every page header, a horizontal rule drawn as a 2-pixel-tall picture, a bullet glyph, a decorative banner. Captioning those with a vision LLM is paying a model to describe a logo three hundred times.

So the job splits in two. First, the methods that turn an image into text, and what each one costs: a cheap filter, a type check, classic OCR, a vision model. Second, which images are actually worth spending on in a given run. That second half is driven by context. A body line that reads “Figure 3 below shows…” is the cue to read that figure with a vision model, and not its neighbours; the question being asked narrows it further. This article lays down the methods and shows what each returns, ordered by cost. Choosing which images to pay for, per document and per query, is adaptive parsing, and it has its own article (Article 10). Here we build the toolbox.

Figure: One extracted image in, a searchable description out, paying the cheapest method that can read it.

1. Most Images Are Not Worth a Model Call

The first step spends nothing. Before any OCR or vision call, a cheap filter looks at signals already in image_df plus a couple of pixel statistics, and drops the images with no retrieval value:

Too small. An image whose shortest side is a few dozen pixels, or whose total area is below a small floor, is an icon or a bullet, not a figure. A size threshold removes most of them.
The wrong shape. A picture that is very long and very thin is a rule or a divider, not content. An aspect-ratio guard catches those.
Repeated everywhere. The same content hash on most pages of the document is chrome: a header logo, a footer mark, a watermark. Counting how many pages an image hash appears on flags it as decoration, not information.

is_worth_analyzing applies these size and shape rules per image, and flag_worth_analyzing first derives the per-page repeat frequency from the content hash, then adds a worth_analyzing column to image_df. Both live in docintel.parsing.pdf.images. The thresholds are deliberately loose: a false keep costs one model call later, a false drop loses content with no trace, so when in doubt the filter keeps the image. Flat, contentless images that are too big to fail the size test (a solid colour panel, say) are not forced through here; they are caught one step later as decorative and skipped just the same.

In: image_df (+ per-image pixel stats). Out: the same table with a worth_analyzing flag.

On a typical report, this alone removes the large majority of images before a single model runs. What’s left is the handful that actually carry meaning.
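The filter described in this section can be sketched in a few lines. This is a minimal illustration, not the code in docintel.parsing.pdf.images: the ImageRow class, the function names passes_size_and_shape and flag_rows, and every threshold value are assumptions made for the example, and rows are plain objects rather than a DataFrame so the sketch has no dependencies.

```python
from collections import Counter
from dataclasses import dataclass

# Illustrative thresholds only; the article does not give the real values.
MIN_SIDE_PX = 32          # shortest side below this: icon or bullet
MIN_AREA_PX = 4096        # total area floor
MAX_ASPECT_RATIO = 12.0   # longer side / shorter side above this: rule or divider
MIN_REPEAT_PAGES = 3      # never call something chrome on fewer pages than this
MAX_PAGE_FRACTION = 0.5   # hash seen on more than half the pages: chrome


@dataclass(frozen=True)
class ImageRow:
    """One row of image_df, reduced to the fields the filter needs (hypothetical)."""
    page: int
    width: int
    height: int
    content_hash: str


def passes_size_and_shape(row: ImageRow) -> bool:
    """Per-image rules: drop icons, bullets, rules, and dividers."""
    short, long_ = sorted((row.width, row.height))
    if short < MIN_SIDE_PX or row.width * row.height < MIN_AREA_PX:
        return False
    return long_ / short <= MAX_ASPECT_RATIO


def flag_rows(rows: list, n_pages: int) -> list:
    """Return one worth-analysing flag per row, in the same order as rows."""
    # Count distinct pages per hash, so two copies on one page count once.
    distinct = {(r.page, r.content_hash) for r in rows}
    pages_per_hash = Counter(content_hash for _, content_hash in distinct)
    flags = []
    for r in rows:
        pages = pages_per_hash[r.content_hash]
        repeated = pages >= MIN_REPEAT_PAGES and pages / n_pages > MAX_PAGE_FRACTION
        flags.append(passes_size_and_shape(r) and not repeated)
    return flags
```

The MIN_REPEAT_PAGES guard is one way to honour the "when in doubt, keep" rule: on a one-page or two-page document every image trivially appears on a large share of pages, and without the guard the repeat test would drop real figures.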

2. What Kind of Image Is It?

The images that survive the filter are not all read the same way. A screenshot of a table is text: classic OCR reads it cheaply and exactly. A line chart is not text at all; its meaning is in the axes and the trend, and only a vision model can put that into words. Sending the chart to OCR returns a few stray axis labels; sending the screenshot to a vision model pays chart prices for something OCR does for free.

So the second step classifies each kept image into one type:

decorative: a blank or near-uniform panel. Skip.
text: a screenshot, a scanned region, a table rendered as an image. Reads with OCR.
chart / diagram / photo: the meaning is visual. Reads with a vision model.

classify_image returns one ImageType from cheap pixel signals: how much the pixels vary, how saturated they are, how much of the image is near-white background, how dense its edges are. A near-uniform panel is decorative. The test there is worth dwelling on, because the obvious version is wrong: you cannot detect a blank panel by counting its colours. A real “all-black” or “all-white” region is never pixel-perfect; sensor noise and JPEG compression give it hundreds of near-identical colours, so a colour count sails right past it. What stays near zero on a blank panel, noise and all, is the dispersion of the pixel values — their standard deviation. Low dispersion means blank, whatever the colour count, so that is the signal. Black ink on a white page, near-zero saturation with real stroke structure, is text. A saturated, full-bleed image with no white margins is a photo. Everything else — every uncertain case — falls through to chart.
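The blank-panel argument above is easy to check on synthetic pixels. The sketch below builds a near-white panel with small per-channel noise (standing in for sensor noise or JPEG artefacts) and a two-colour bilevel region (standing in for scanned text), then compares a colour count with the standard deviation of the grey level. The helper names and the noise level are assumptions for illustration.

```python
import random
import statistics


def noisy_panel(base, noise, n_pixels, seed=0):
    """A flat panel with small per-channel noise, as a list of (r, g, b) tuples."""
    rng = random.Random(seed)
    return [
        tuple(min(255, max(0, c + rng.randint(-noise, noise))) for c in base)
        for _ in range(n_pixels)
    ]


def colour_count(pixels):
    """Number of distinct RGB colours: the tempting but misleading signal."""
    return len(set(pixels))


def grey_dispersion(pixels):
    """Population standard deviation of the per-pixel grey level."""
    return statistics.pstdev([sum(p) / 3 for p in pixels])


# A "blank" panel that is not pixel-perfect, and a bilevel region with real content.
blank = noisy_panel(base=(250, 250, 250), noise=3, n_pixels=10_000)
bilevel_text = [(0, 0, 0)] * 2_000 + [(255, 255, 255)] * 8_000

for name, pixels in (("blank", blank), ("bilevel_text", bilevel_text)):
    print(name, colour_count(pixels), round(grey_dispersion(pixels), 1))
```

The noisy panel shows hundreds of distinct colours while its dispersion stays near zero; the bilevel region shows exactly two colours and a large dispersion. A colour-count rule would get both cases backwards.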

Notice what is not in that list: a step that decides “this looks like a logo.” That is on purpose, and it is the same lesson as the blank panel. A logo can be two flat colours, a black wordmark on white, or a full-colour gradient with soft edges. Counting colours catches the flat two-colour marks and misses the gradient, and worse, the same two-colour test also catches a bilevel scan of real text you wanted to read. Appearance does not tell you it is a logo. Behaviour does: a logo is chrome because it repeats, the same mark in every page header. That signal already ran, back in the filter, which drops an image whose content hash recurs across pages no matter how many colours it has. A logo that appears only once, such as a mark on a cover page, is not worth a special case; it gets read like anything else: a wordmark goes to free OCR, a graphic to a single vision call. The rule throughout is the same: skip only what you are sure is empty or chrome, and read everything else, because a wrong skip loses content silently.

That fall-through to chart is the other important design choice. Classifying a chart against a diagram against a photo on cheap signals alone is not reliable, so the classifier does not try to be clever: it only diverts an image to cheap OCR when it is confident the image is clean monochrome text, and sends everything else to the vision model, which reads charts, diagrams, photos, and any text they happen to contain. The bias is asymmetric on purpose. A missed OCR shortcut costs one vision call; OCR run on a diagram returns a handful of stray axis labels and nonsense. So when in doubt, the classifier pays for vision. Classification itself stays cheap — no model call — because it has to be cheaper than the analysis it is there to avoid.

In: an image that passed the filter. Out: its ImageType.
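The classification rules in this section, including the fall-through to chart, can be sketched as follows. This is not classify_image itself: the Signals fields, the measure and classify_sketch functions, the ImageType stand-in enum, and all thresholds are assumptions. The pixel loops are pure Python for readability; a real implementation would likely vectorise them.

```python
import colorsys
import statistics
from dataclasses import dataclass
from enum import Enum


class ImageType(Enum):  # stand-in for the series' ImageType
    DECORATIVE = "decorative"
    TEXT = "text"
    PHOTO = "photo"
    CHART = "chart"


@dataclass(frozen=True)
class Signals:
    dispersion: float      # std dev of grey levels
    saturation: float      # mean HSV saturation, 0..1
    white_fraction: float  # share of near-white pixels, 0..1
    edge_density: float    # share of horizontal neighbours with a sharp jump


def measure(pixels, width):
    """pixels: row-major list of (r, g, b) tuples; width: pixels per row."""
    grey = [sum(p) / 3 for p in pixels]
    sat = [colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)[1] for r, g, b in pixels]
    # Neighbour pairs within a row only: skip pairs that straddle a row break.
    pairs = [(i, i + 1) for i in range(len(grey) - 1) if (i + 1) % width]
    jumps = sum(1 for a, b in pairs if abs(grey[a] - grey[b]) > 64)
    return Signals(
        dispersion=statistics.pstdev(grey),
        saturation=statistics.fmean(sat),
        white_fraction=sum(1 for g in grey if g >= 240) / len(grey),
        edge_density=jumps / len(pairs) if pairs else 0.0,
    )


def classify_sketch(s: Signals) -> ImageType:
    if s.dispersion < 5.0:
        return ImageType.DECORATIVE  # near-uniform panel
    if s.saturation < 0.05 and s.white_fraction > 0.5 and s.edge_density > 0.02:
        return ImageType.TEXT        # confident: dark strokes on a white page
    if s.saturation > 0.3 and s.white_fraction < 0.05:
        return ImageType.PHOTO       # saturated, full-bleed
    return ImageType.CHART           # every uncertain case pays for vision
```

Only the TEXT branch diverts an image away from the vision model, and it requires three signals to agree. That is the asymmetric bias described above, expressed as a conjunction.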

3. The Cascade: The Cheapest Method That Can Read It

Type decides method. METHOD_BY_TYPE maps each type to one of three actions, ordered by cost, and describe_figure dispatches on it. The whole decision, for the cases you actually meet in a document, fits in one table: what catches the image, what it costs, and what you get back.

Table: The cascade decision for every image kind you meet in a real document, from free to paid.

Read it top to bottom and you read the cascade in order. The first three rows never reach a model at all: the filter throws them out on size, shape, or repetition. The next row is caught by the classifier as a blank panel and skipped too. Only the bottom five are read at all, and of those only the genuine text-image takes the free OCR path. The rest reach the vision model, which is exactly where you want your money going.
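The type-to-method dispatch can be sketched as a lookup plus three branches. The name METHOD_BY_TYPE comes from the text, but its contents here, the Method enum, the ImageType stand-in, and the describe function with injected ocr and vision callables are all assumptions for illustration, not the signature of describe_figure.

```python
from enum import Enum


class ImageType(Enum):  # stand-in for the series' ImageType
    DECORATIVE = "decorative"
    TEXT = "text"
    CHART = "chart"
    DIAGRAM = "diagram"
    PHOTO = "photo"


class Method(Enum):  # the three actions, ordered by cost
    SKIP = 0
    OCR = 1
    VISION = 2


METHOD_BY_TYPE = {
    ImageType.DECORATIVE: Method.SKIP,
    ImageType.TEXT: Method.OCR,
    ImageType.CHART: Method.VISION,
    ImageType.DIAGRAM: Method.VISION,
    ImageType.PHOTO: Method.VISION,
}


def describe(image, image_type, ocr, vision):
    """Return a text description, or None when the image is skipped.

    ocr and vision are callables taking the image and returning a string;
    they are injected so the dispatch can be exercised without a model.
    """
    # Unknown types default to vision, matching the "when in doubt" bias.
    method = METHOD_BY_TYPE.get(image_type, Method.VISION)
    if method is Method.SKIP:
        return None
    if method is Method.OCR:
        return ocr(image)
    return vision(image)
```

Passing the readers in as arguments keeps the dispatch testable with stubs and makes it cheap to swap the OCR engine or the vision model later.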

Watch out: sideways figures. A wide table or a landscape chart is often laid at 90 degrees to fit a portrait page. The turn rarely shows up where you would look first: the page’s rotation flag stays at 0, and the angle sits in the image’s own placement matrix instead. Rendered as-is, the figure reaches OCR or the vision model on its side, where OCR returns noise and a vision model reads it with misplaced confidence and no warning that it struggled. So the cascade reads the placement angle and counter-rotates the region before either method sees it: automatic, exact, no orientation-guessing. The one residual case is a scan with the turn baked into its pixels, with no matrix to read; there the OCR branch retries the quarter-turns and keeps the best-scoring one.
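Both halves of that fix can be sketched on a plain pixel grid. The first function reads the turn from the a and b entries of a placement matrix [a b c d e f] and snaps it to a quarter-turn; the last one is the fallback that tries all four orientations and keeps the best-scoring one. Function names are hypothetical, the score callable stands in for something like mean OCR confidence, and the sign convention (which way a positive angle turns the image on screen) depends on the PDF library's coordinate system, so it is worth verifying on a known sample.

```python
import math


def quarter_turns_from_matrix(a: float, b: float) -> int:
    """Rotation encoded in a placement matrix, snapped to 0..3 quarter-turns.

    For a rotation-plus-scale matrix, atan2(b, a) is the rotation angle.
    """
    angle = math.degrees(math.atan2(b, a))
    return round(angle / 90) % 4


def rotate_cw(grid, quarter_turns: int):
    """Rotate a row-major 2D grid clockwise by the given number of quarter-turns."""
    for _ in range(quarter_turns % 4):
        grid = [list(row) for row in zip(*grid[::-1])]
    return grid


def best_orientation(grid, score):
    """Fallback for scans with the turn baked into the pixels.

    Tries all four quarter-turns and returns (turns, rotated_grid) for the
    orientation with the highest score. score is any callable mapping a grid
    to a number.
    """
    candidates = [(k, rotate_cw(grid, k)) for k in range(4)]
    return max(candidates, key=lambda item: score(item[1]))


# Usage: a region placed with a 90-degree turn is counter-rotated before reading.
turns = quarter_turns_from_matrix(a=0.0, b=1.0)
upright = rotate_cw([[1, 2], [3, 4]], turns)
```

The matrix path costs nothing and involves no guessing; the scoring path runs the reader up to four times, which is why it is reserved for the residual case with no matrix to read.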

3.1. Skip: Pay Nothing for the Noise

decorative: no call. A blank or near-uniform panel keeps its empty text slot. Together with the images the filter already dropped — the too-small, the wrong-shaped, the repeated chrome — this is where most of a clean document’s images end up, which is the point.

3.2. Classic OCR for Text-Images

text: a screenshot, a scanned table, a figure that is really rendered text. Classic OCR reads it locally, in milliseconds, with no per-call fee. The series uses EasyOCR (docintel.parsing.pdf.easyocr); Tesseract is the other common choice. OCR is reliable on clean printed text and transcribes what is on the page rather than generating a description, which makes it the right tool whenever the classifier is confident the image contains only monochrome text, and the wrong tool for anything else, where it returns noise instead of meaning.
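A minimal sketch of the OCR branch, assuming EasyOCR's public API directly rather than the series' docintel.parsing.pdf.easyocr wrapper, whose interface is not shown here. EasyOCR's readtext returns a list of (bounding box, text, confidence) detections; the helper below, whose name and thresholds are assumptions, drops low-confidence detections and joins the rest in reading order.

```python
def lines_from_ocr(results, min_confidence=0.3, line_tolerance=10):
    """Join OCR detections into text, top-to-bottom then left-to-right.

    results: iterable of (bbox, text, confidence), where bbox is four
    [x, y] corner points, which is the shape EasyOCR's readtext returns.
    """
    kept = [(bbox, text) for bbox, text, conf in results if conf >= min_confidence]
    kept.sort(key=lambda item: min(point[1] for point in item[0]))

    lines, current, line_top = [], [], None
    for bbox, text in kept:
        top = min(point[1] for point in bbox)
        left = min(point[0] for point in bbox)
        # A vertical jump beyond the tolerance starts a new line.
        if current and abs(top - line_top) > line_tolerance:
            lines.append(current)
            current = []
        if not current:
            line_top = top
        current.append((left, text))
    if current:
        lines.append(current)

    return "\n".join(" ".join(text for _, text in sorted(line)) for line in lines)


def read_text_image(path, languages=("en",)):
    """Run EasyOCR on one image file and return its text (hypothetical helper)."""
    import easyocr  # third-party; imported lazily so lines_from_ocr works without it

    # Building a Reader loads the models, so reuse one across images in practice.
    reader = easyocr.Reader(list(languages))
    return lines_from_ocr(reader.readtext(path))
```

Grouping by a fixed pixel tolerance is a simplification; for a scanned table with uneven row heights, a tolerance scaled to the detected box height would be a reasonable refinement.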