Parsing Scanned PDFs for RAG with EasyOCR: Text Without Layout

This article is part of Enterprise Document Intelligence, the series that builds an enterprise RAG system from four bricks. Article 5 (document parsing) built the parser with PyMuPDF (fitz), which returns empty on a scanned page with no text layer. This companion swaps the engine for EasyOCR, a free OCR package that recovers that text. It is the one case in this family where the new engine gives you less, not more: it recovers the text and nothing around it, and that gap is the lesson.

Where this companion sits: it extends Article 5 (document parsing), inside Part II (the four bricks), with a different parsing engine

Scanned PDFs are not solved by “just throw OCR at it”. The OCR step recovers text; that’s necessary but not sufficient for an enterprise RAG pipeline. What the pipeline also needs is everything around the text: where the page boundaries are, which lines are section headings, what is a figure, what is a table row vs a free paragraph. “Traditional OCR” (the term of art for text-detection + text-recognition engines like EasyOCR, Tesseract, PaddleOCR) gives you the text. It gives you nothing else. The rest is the layout problem, and the layout problem is the harder half.

This article makes that distinction concrete. The traditional-OCR engine is EasyOCR, JaidedAI’s simple, fast, free text-detection + recognition library (Apache 2.0, declared in the project’s LICENSE file). The layout-aware engine is Docling (Article 5ter; MIT license, declared in the project’s LICENSE file). Both can OCR a scanned page. They differ in what they do with the result. The whole article is a setup for the head-to-head on a real public-domain 1974 scan in section 5.

EasyOCR is the OCR floor: line_df only, no layout. The rest of the family adds structure

1. What “Traditional OCR” Does (and Doesn’t)

Traditional OCR reads pixels and returns text rectangles. Everything else — sections, tables, figures, reading order — is a separate layout problem the engine refuses to look at. The two models behind it are text detection (find rectangular regions of the image that contain text) and text recognition (read each region’s pixels and return characters with a confidence score). The output is a flat list of (bbox, text, confidence) per detected region.

That is everything EasyOCR (or Tesseract, or PaddleOCR) does. A two-column page comes back as a flat list of left and right text boxes intermixed by y-coordinate; the engine does not know there are two columns. A table comes back as a grid of disconnected cells that the engine cannot tell apart from regular paragraphs. A figure caption is just another text box. The page header, page footer, and marginalia all show up as boxes too.
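The two-column behaviour is easy to reproduce without running any OCR. The sketch below uses hypothetical detections (the Detection type, the coordinates, and the strings are all made up for illustration, not EasyOCR output) and orders them by their top edge, which is the only ordering a flat list of boxes offers:

```python
from typing import NamedTuple


class Detection(NamedTuple):
    bbox: tuple[float, float, float, float]  # (x0, y0, x1, y1) in pixels
    text: str
    confidence: float


# Hypothetical two-column page: left column near x=50, right column near x=350.
detections = [
    Detection((50, 100, 300, 120), "left line 1", 0.98),
    Detection((50, 130, 300, 150), "left line 2", 0.97),
    Detection((350, 102, 600, 122), "right line 1", 0.95),
    Detection((350, 132, 600, 152), "right line 2", 0.96),
]

# Sorting by the top edge alone knows nothing about columns,
# so lines from the two columns end up interleaved.
by_top_edge = sorted(detections, key=lambda d: d.bbox[1])
flat_text = [d.text for d in by_top_edge]
print(flat_text)
```

The printed order alternates between the columns, which is the text a chunker would receive if nothing reorders the boxes first.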

Anything that needs “this text is a section heading” or “these four boxes are one table row” needs a second model on top — a layout model. The layout model reads the OCR output plus the page image and classifies each region (heading, paragraph, table cell, figure, caption, footer…) and groups them into a reading order. That is what Article 5bis (Azure DI), Article 5ter (Docling), and Article 5quater (vision LLM) all add over the OCR step. Without one, you have “OCR output”, not “a parsed document”.

2. EasyOCR: The Canonical Traditional OCR

EasyOCR is the cleanest demonstration of “traditional OCR” as a class. The library is small (~150 MB of model weights cached on first call), free, runs locally, and needs no GPU (pass gpu=False to stay on CPU). The whole library API is two calls: build a Reader for the languages you need, then hand readtext an image. Each detection comes back as a triple: the polygon around the text, the recognised string, and the recogniser’s own confidence.

import easyocr
import fitz
import numpy as np

reader = easyocr.Reader(["en"], gpu=False)        # first call downloads ~150 MB

# render page 1 of a scanned PDF to a numpy array EasyOCR can read
doc = fitz.open("data/contracts/scanned_amendment.pdf")   # keep a reference to the document while the page is in use
page = doc[0]
pix = page.get_pixmap(matrix=fitz.Matrix(2.0, 2.0))   # 2x zoom = ~144 DPI
img = np.frombuffer(pix.samples, dtype=np.uint8).reshape(
    pix.height, pix.width, pix.n,
)

# the recogniser: one image in, one triple per detected text region out
detections = reader.readtext(img)
for quad, text, conf in detections:
    # quad = [[x0,y0], [x1,y0], [x1,y1], [x0,y1]]  in pixel coords
    print(round(conf, 2), text)

parse_pdf_easyocr wraps that loop. It walks every page of the PDF, renders each to a numpy array, calls readtext, converts the pixel-space polygons back to PDF coordinates, and packs the detections into the same dict-of-tables contract as the other parsers — same line_df, same parsing_summary, same downstream consumers — except that only those two keys carry data. Every other slot (page_df, image_df, toc_df, span_df, object_registry, cross_ref_df) comes back as an empty DataFrame. That isn’t a missing-feature bug; it’s exactly what “traditional OCR” means.
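The pixel-to-PDF conversion can be sketched as follows. This is an illustration, not the body of parse_pdf_easyocr: the helper name quad_to_pdf_bbox is hypothetical, and it assumes the page was rendered with a uniform scale and no rotation or crop offset, so that dividing by render_scale is enough.

```python
def quad_to_pdf_bbox(quad, render_scale):
    """Collapse a four-point pixel polygon into an axis-aligned bbox in PDF points.

    Assumes the page was rasterised with the same scale on both axes and
    no rotation, so pixel coordinates are PDF coordinates times render_scale.
    """
    xs = [float(x) for x, _ in quad]
    ys = [float(y) for _, y in quad]
    return (
        min(xs) / render_scale,
        min(ys) / render_scale,
        max(xs) / render_scale,
        max(ys) / render_scale,
    )


# A polygon as a detector might return it, in pixel coordinates at 2x zoom.
quad = [[100, 200], [500, 200], [500, 240], [100, 240]]
bbox = quad_to_pdf_bbox(quad, render_scale=2.0)
print(bbox)  # (50.0, 100.0, 250.0, 120.0)
```

Taking the min and max of the corners also handles slightly skewed polygons, at the cost of a bbox a little larger than the text.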

parsed = parse_pdf_easyocr(
    "data/contracts/scanned_amendment.pdf",
    languages=("en",),         # add "fr", "de", ... for multilingual scans
    render_scale=2.0,          # 2.0 = ~144 DPI ; raise for small fonts
    gpu=False,                 # CPU-only by default ; set True if CUDA available
    confidence_threshold=0.0,  # filter low-confidence detections if needed
)

parsed["line_df"]              # text + bbox + confidence per detection
parsed["parsing_summary"]      # method, page count, line count, render scale
# Every other key (page_df, image_df, toc_df, span_df, object_registry,
# cross_ref_df) is an empty DataFrame ; EasyOCR has nothing to put there.

The signature kwargs are the only knobs:

languages: tuple of EasyOCR language codes, mostly ISO 639-1 style (en, fr, de, …), with a few library-specific ones such as ch_sim for Simplified Chinese. A multilingual corpus loads one Reader per language set; the @lru_cache in get_easyocr_reader keeps a handful of these in memory across calls.
render_scale: how many pixels per PDF unit when rasterising each page. 1.0 is native (~72 DPI, often too small). 2.0 is the sweet spot for body text. Raise to 3.0 for tiny fonts; lower if you’re memory-bound.
gpu: CPU is the default so the module works on any machine. CUDA gives a 3–5x speedup on text-heavy pages.
confidence_threshold: drop low-confidence detections. 0.0 keeps everything (the column is preserved so downstream code can filter); 0.3 cuts most noise on degraded scans.
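The arithmetic behind render_scale and confidence_threshold can be sketched in a few lines. The helper names below are hypothetical and the numbers are approximations: the sketch assumes 72 PDF units per inch, an uncompressed RGB raster, and a filter that keeps detections at or above the threshold (the real wrapper may compare differently).

```python
PDF_UNITS_PER_INCH = 72


def effective_dpi(render_scale):
    """Approximate DPI of a page rasterised at render_scale."""
    return render_scale * PDF_UNITS_PER_INCH


def raster_bytes(width_pt, height_pt, render_scale, channels=3):
    """Approximate size in bytes of the uncompressed page image."""
    width_px = round(width_pt * render_scale)
    height_px = round(height_pt * render_scale)
    return width_px * height_px * channels


def keep_confident(detections, confidence_threshold=0.0):
    """Keep (bbox, text, confidence) triples at or above the threshold."""
    return [d for d in detections if d[2] >= confidence_threshold]


# US Letter is 612 x 792 PDF units.
print(effective_dpi(2.0))            # 144.0
print(raster_bytes(612, 792, 2.0))   # 5816448  (about 5.8 MB per page)
print(raster_bytes(612, 792, 3.0))   # 13087008 (about 13 MB per page)

# Hypothetical detections: the low-confidence speck is dropped at 0.3.
sample = [(None, "Section 1", 0.91), (None, "~", 0.12)]
print([text for _, text, _ in keep_confident(sample, 0.3)])  # ['Section 1']
```

Memory grows with the square of render_scale, which is why going from 2.0 to 3.0 more than doubles the per-page image size.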

3. What line_df Looks Like

Sample rows from the NIST FIPS 199 cover (US Government work, public domain in the US; see NIST copyright statement), one per detected text region: the page coordinate, the OCR’d text, and the recogniser’s own confidence score. That is the whole output.

Same column shape as fitz's line_df, plus a confidence column EasyOCR adds for free

The shape is deliberately small:

text + bbox: the recogniser’s payload, one row per detected text region.
confidence: float between 0 and 1, EasyOCR’s self-score. Useful both as a filter (drop below 0.3 on noisy scans) and as a feedback signal — downstream generation can flag low-confidence passages to the user.
character_count: kept for symmetry with the other parsers; on EasyOCR it’s just len(text).
No column or reading-order field. A two-column page comes back as a flat list, left-and-right boxes intermixed by y-coordinate.
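As a rough illustration of that shape, one detection could be turned into one row like this. The function name and the page key are assumptions made for the sketch, and the sample values are invented; only text, bbox, confidence, and character_count come from the description above.

```python
def detection_to_row(page, bbox, text, confidence):
    """Build one line_df-style record from a single detection (illustrative only)."""
    return {
        "page": page,                              # hypothetical key name
        "bbox": tuple(float(v) for v in bbox),     # PDF coordinates
        "text": text,
        "confidence": float(confidence),           # recogniser self-score, 0..1
        "character_count": len(text),
    }


row = detection_to_row(1, (50, 100, 250, 120), "sample text", 0.93)
print(row)
```

A list of such records can be handed to pandas.DataFrame to obtain the table; note that nothing in the record says which column, section, or table the text belongs to.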

Every other key in the returned dict (page_df, image_df, toc_df, span_df, object_registry, cross_ref_df) is an empty DataFrame. A consumer that calls parsed["image_df"] does not crash; it iterates an empty frame.

4. What Traditional OCR Misses: The Layout Gap

The RAG pipeline needs five structural artefacts that traditional OCR cannot produce, regardless of how large the recognition model is. Each missing artefact breaks a downstream operation the rest of the series relies on.

TOC / sections. Cross-reference resolution and section-scoped corpus retrieval both rely on toc_df. EasyOCR returns zero rows. The dispatcher cannot route “answer in Section 3.2” questions because Section 3.2 has no boundary.
Individual figures inside the page. A scanned 30-page contract may contain six chart screenshots embedded in the body text. EasyOCR treats the whole page as one image and returns text from around the figures; the figures themselves never become rows. A downstream pipeline that needs to retrieve “the chart on page 14” has no handle.
Reading order on multi-column / multi-zone pages. A page from a two-column scientific paper comes back top-to-bottom with both columns intermixed: left-line-1, right-line-1, left-line-2, right-line-2… Generation reads garbage. Sidebars, footnotes, and marginalia all leak into the main flow.
Table cells. A scanned schedule of charges or premium table returns as a flat sequence of disconnected text boxes. The engine has no way to group boxes into rows, columns, or headers — the table structure is entirely lost.
Semantic region labels. There is no mechanism to distinguish a heading from a paragraph, a caption from body text, or a footer from content. Every detected region is equal in the output, leaving all classification work to whatever comes next in the pipeline.

These gaps are not limitations of EasyOCR specifically — they apply equally to Tesseract, PaddleOCR, and any other engine in the traditional-OCR class. The ceiling is the architecture, not the implementation. When layout structure matters to the RAG pipeline, a layout-aware engine such as Docling is the appropriate next step.
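To give a sense of the work a layout step takes on, here is a deliberately naive row-grouping heuristic over OCR boxes. It is not what Docling or any layout model does; the function name, the tolerance, and the sample boxes are assumptions chosen for illustration.

```python
def group_into_rows(boxes, y_tolerance=5.0):
    """Group (x0, y0, x1, y1, text) boxes into rows by vertical centre.

    A box joins the current row when its centre is within y_tolerance
    of the centre of the first box placed in that row.
    """
    rows = []
    for box in sorted(boxes, key=lambda b: (b[1] + b[3]) / 2):
        centre = (box[1] + box[3]) / 2
        if rows and abs(centre - rows[-1]["centre"]) <= y_tolerance:
            rows[-1]["cells"].append(box)
        else:
            rows.append({"centre": centre, "cells": [box]})
    # Within each row, order the cells left to right.
    return [
        [cell[4] for cell in sorted(r["cells"], key=lambda b: b[0])]
        for r in rows
    ]


# Hypothetical boxes from a small two-column table, in arbitrary order.
boxes = [
    (300, 100, 360, 115, "Premium"),
    (50, 101, 120, 116, "Plan"),
    (50, 130, 120, 145, "Basic"),
    (300, 131, 360, 146, "12.00"),
]
table = group_into_rows(boxes)
print(table)  # [['Plan', 'Premium'], ['Basic', '12.00']]
```

The heuristic works on this clean example and breaks as soon as the scan is skewed, a cell wraps onto two lines, or a paragraph sits beside the table, which is the kind of case a layout model is trained to handle.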
