ADR-0019: PDF extraction pipeline
Status: Accepted
Date: 2026-05-20
Context
Every resource class in CONTEXT.md’s citation grammar table (DP, CSG, SMM, BR, future UC-secondary works) starts life as a PDF in resources-raw/{Class}/. ADR-0015 defined the target shape of a resource note (frontmatter + ## section headings + cleaned prose + footnotes) but left the path from a PDF to that shape implicit. Existing BR and CSG chapters were produced via a since-discarded messy script plus heavy ad-hoc AI cleanup. As more classes get ingested (PDC, 365-Tao, additional SMM speech archives), an unspecified pipeline becomes a recurring cost.
Constraints:
- Local-only. No paid OCR or PDF-API services.
- Quality > speed (multi-minute per-PDF runtimes are acceptable).
- Available compute: AMD Radeon RX 6650 XT (gfx1032) on ROCm, plus Ryzen 7 5800X CPU fallback.
- PDFs vary: digitally-generated with text layer (BR, CSG, PDC) and Archive.org scans needing OCR (365-Tao). Bookmark coverage varies (CSG has full hierarchy, BR has none).
Decision
Two-stage ingestion. Stage 1 is mechanical: a deterministic script (scripts/ingest_pdf.py) that turns a PDF into one-or-many markdown files under resources-raw/{Class}/extracted/. Stage 2 is semantic: a staging batch (per ADR-0009) where AI reads the extracted markdown and produces the final cleaned chapter notes with frontmatter + section headings + footnote reshape, landing in resources/{Class}/.
Stage 1 specifics:
- Engine: marker-pdf. Chosen for strong handling of book-like prose with footnotes and blockquotes, native + scanned PDF support via Surya OCR, and active development.
- GPU: ROCm via HSA_OVERRIDE_GFX_VERSION=10.3.0. gfx1032 (RX 6650 XT) is not in PyTorch’s official ROCm target list, but the override tells PyTorch to treat the device as gfx1030 (RX 6800 XT), which is supported. Torch detects the GPU correctly (hip: 6.4, cuda.is_available: True). However, the 8 GB VRAM is below Marker’s working set for Surya’s layout vision encoder: empirically the encoder tries to allocate ~10 GB for a single SDPA attention call and OOMs. LAYOUT_BATCH_SIZE=2 does not help because the bottleneck is sequence length per image, not batch dimension. CPU fallback (--no-gpu) is the current recommended default on this hardware; ~40 min for a 63-page native PDF on Ryzen 7 5800X. GPU path is retained for future use on a card with ≥12 GB VRAM or a future Marker release with reduced encoder footprint.
- Splitting: PDF bookmarks only. --split-level N selects the bookmark depth; the script emits one file per bookmark at that depth. If a PDF lacks bookmarks (BR, 365-Tao), the script writes a single extracted/full.md and downstream stage-2 work handles the split. No heuristic heading-detection split: too easy to misidentify and corrupt boundaries silently.
- Output: resources-raw/{Class}/extracted/. resources-raw/ is in .gitignore, so extracted artefacts never bloat the repo.
- Cleanup in stage 1: minimal and mechanical. Strip orphan page-number lines, strip image refs, demote H1→H2 when bookmark-split (chapter title will live in frontmatter once). Footnotes, blockquotes, tables passed through as Marker emits.
- Dependencies: PEP 723 inline metadata, executed via uv run. No requirements.txt, no venv ceremony.
- CLI: uv run scripts/ingest_pdf.py <pdf-path> [--split-level N] [--force] [--no-gpu]. Class is inferred from the parent folder name.
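The bookmark-split and stage-1 cleanup rules above can be sketched roughly as follows. This is a minimal illustration, not the contents of scripts/ingest_pdf.py: chunk_ranges and clean_chunk are hypothetical names, the input is assumed to be a [level, title, page] list with 1-indexed pages (the shape PyMuPDF's Document.get_toc() returns), and the rule that a chunk runs up to the page before the next bookmark at the same depth is a simplifying assumption.

```python
import re


def chunk_ranges(toc, split_level, page_count):
    """Return (title, first_page, last_page) per bookmark at split_level.

    Pages are 1-indexed and inclusive. With no bookmark at that depth the
    whole PDF becomes one "full" chunk, mirroring the full.md fallback.
    """
    starts = [(title, page) for level, title, page in toc if level == split_level]
    if not starts:
        return [("full", 1, page_count)]
    ranges = []
    for i, (title, first) in enumerate(starts):
        # Assumed rule: a chunk ends on the page before the next bookmark.
        last = starts[i + 1][1] - 1 if i + 1 < len(starts) else page_count
        ranges.append((title, first, max(first, last)))
    return ranges


PAGE_NUMBER_LINE = re.compile(r"^\s*\d{1,4}\s*$")
IMAGE_REF = re.compile(r"!\[[^\]]*\]\([^)]*\)")


def clean_chunk(markdown, bookmark_split):
    """Mechanical cleanup only: no rewording, no heading detection."""
    kept = []
    for line in markdown.splitlines():
        if PAGE_NUMBER_LINE.match(line):
            continue  # orphan page-number line
        stripped = IMAGE_REF.sub("", line)
        if stripped != line and not stripped.strip():
            continue  # the line held nothing but an image reference
        if bookmark_split and stripped.startswith("# "):
            stripped = "#" + stripped  # demote H1 to H2
        kept.append(stripped)
    return "\n".join(kept)
```

Footnotes, blockquotes, and tables are deliberately untouched in this sketch, matching the pass-through rule above.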
Alternatives considered
- Docling (IBM): user’s prior attempt. CPU-friendly, MIT-licensed, but weaker on book-prose footnotes and blockquotes in recent head-to-heads. Rejected.
- pymupdf4llm + heuristics: lightweight and deterministic but no ML layout detection — produces paragraphs without headings, drops semantic structure. Quality below the threshold for this vault.
- Hosted API (Mistral OCR, LlamaParse): highest quality but requires API keys and ongoing cost. Rejected per user constraint.
- Marker + Docling fallback chain: two ML stacks to maintain for marginal robustness gain. Rejected as over-engineering.
- Script calls Claude API to do stage 2 in one shot: deferred. User explicitly does not want per-run API costs at this scale. Stage 2 stays as a staging-batch step that runs in Claude Code chat (or a future scripted batch when token economics warrant it).
- Output into staging/batch-XXX/: ties extraction to a specific batch ID and forces re-extraction to spawn a new batch. Rejected: extraction is upstream of batches.
- Heuristic heading-detection split for bookmark-less PDFs: false splits would propagate silently into citations. Rejected; better to surface “no bookmarks → one file” honestly and let humans/AI do the boundary call.
Consequences
- (+) New resource classes get a one-command extraction step that doesn’t require remembering tool flags or model setup.
- (+) Separation of stage-1 (deterministic) and stage-2 (judgment) keeps re-runs cheap. Re-extracting a PDF doesn’t re-spend AI tokens.
- (+) resources-raw/ gitignore means extracted artefacts are local-only, which is fine since the canonical artefact is the cleaned resources/{Class}/ file.
- (+) HSA_OVERRIDE workaround is documented so future readers don’t trip on it when a card with sufficient VRAM is available.
- (−) On this 8 GB RX 6650 XT, GPU path OOMs in Surya’s vision encoder; we run on CPU. A 63-page native PDF (BR) extracts in ~40 min on Ryzen 7 5800X. Acceptable given the rarity of ingestion runs.
- (−) Marker’s PyTorch + Surya install is heavy (~5 GB across torch + ROCm libs + Surya weights). First uv run will be slow; subsequent runs reuse the uv-managed cache.
Update 2026-05-20: engine made switchable
Added --docling flag alongside --marker (default). The script now supports IBM Docling as an alternative engine; both share the same bookmark-split, post-process, and output-path code paths. Page-range translation is engine-internal (Marker takes 0-indexed "start-end" string; Docling takes 1-indexed (start, end) tuple).
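The page-range translation can be illustrated with a small sketch. The function names are hypothetical, and treating both ends of the range as inclusive is an assumption; only the 0-indexed "start-end" string for Marker and the 1-indexed (start, end) tuple for Docling come from the text above.

```python
def marker_page_range(first_page, last_page):
    """1-indexed inclusive pages -> 0-indexed "start-end" string (Marker)."""
    return f"{first_page - 1}-{last_page - 1}"


def docling_page_range(first_page, last_page):
    """1-indexed inclusive pages -> 1-indexed (start, end) tuple (Docling)."""
    return (first_page, last_page)


def engine_page_range(engine, first_page, last_page):
    """Keep the translation inside the engine layer so callers stay 1-indexed."""
    if engine == "marker":
        return marker_page_range(first_page, last_page)
    if engine == "docling":
        return docling_page_range(first_page, last_page)
    raise ValueError(f"unknown engine: {engine}")
```

Keeping callers on one convention (1-indexed here, matching bookmark page numbers) means the bookmark-split code never needs to know which engine is active.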
Rationale: side-by-side quality comparison against the user’s prior tool, and a fallback if Marker’s footprint becomes a problem (notably, Marker’s Surya encoder OOMs the 8 GB RX 6650 XT — Docling’s smaller layout-model stack is expected to fit). Marker stays default; behavior is unchanged for existing invocations.
Empirical comparison on BR (63 pages, 2026-05-20): Marker on laptop CPU = ~42 min, high-fidelity output (em-dashes, smart quotes, parseable <sup>N</sup> footnotes). Docling on laptop GPU (RX 6650 XT, ROCm) = ~55 sec, but flattens em-dashes to hyphens, smart quotes to straight quotes, promotes styled pull-quotes to spurious H2 headings, and duplicates a few sentences at column/page boundaries. Marker chosen as the production engine; Docling kept as a fast-preview / regression alternate.
Update 2026-05-20: remote-GPU execution
The laptop’s 8 GB AMD GPU can’t fit Marker (Surya encoder OOMs); CPU runs are ~42 min for a 63-page native PDF. A 20 GB NVIDIA GPU lives in a Proxmox container on the user’s LAN, normally hosting an LLM. Added scripts/ingest_remote.sh: rsyncs the PDF and scripts/ingest_pdf.py to ~/zettle-scratch/ on $ZETTLE_GPU_HOST, runs uv run --no-sources scripts/ingest_pdf.py … there, rsyncs the extracted/ directory back.
The --no-sources flag is the load-bearing trick: it makes uv ignore the [tool.uv.sources] redirection in the script’s PEP 723 block (which points torch at the ROCm wheel index for the laptop), causing torch to resolve from default PyPI — which serves the CUDA build. One script, one dependency block, two execution targets.
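For illustration, a PEP 723 block of the kind described might look like the sketch below. The Python version, dependency list, index name, and index URL are assumptions, not copied from scripts/ingest_pdf.py; only the idea of a [tool.uv.sources] entry redirecting torch to a ROCm wheel index comes from the text.

```python
# /// script
# requires-python = ">=3.10"
# dependencies = [
#     "marker-pdf",
#     "torch",
# ]
#
# # Laptop path: resolve torch from a ROCm wheel index.
# # `uv run --no-sources` ignores this table, so torch comes from PyPI instead.
# [tool.uv.sources]
# torch = { index = "pytorch-rocm" }
#
# [[tool.uv.index]]
# name = "pytorch-rocm"                              # hypothetical index name
# url = "https://download.pytorch.org/whl/rocm6.4"   # assumed URL; match the installed ROCm
# explicit = true                                    # used only by packages pinned to it
# ///
```

Marking the index explicit keeps every other dependency on the default index, so dropping the sources table changes only where torch is resolved from.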
The wrapper exports larger Marker batch sizes (LAYOUT_BATCH_SIZE=8, DETECTOR_BATCH_SIZE=8, RECOGNITION_BATCH_SIZE=64, TABLE_REC_BATCH_SIZE=8, OCR_ERROR_BATCH_SIZE=8) tuned for the 20 GB card; the script’s os.environ.setdefault(...) respects these. User can override per-invocation by setting the env vars in front of the wrapper.
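The setdefault behavior is what makes this layering work: values exported by the wrapper (or typed in front of it) win, and the script only fills in what is missing. A minimal sketch, where apply_batch_defaults and the LOCAL_DEFAULTS values are hypothetical and the remote values are the ones quoted above:

```python
import os

# Values the wrapper exports for the 20 GB card (quoted from the text).
REMOTE_BATCH_SIZES = {
    "LAYOUT_BATCH_SIZE": "8",
    "DETECTOR_BATCH_SIZE": "8",
    "RECOGNITION_BATCH_SIZE": "64",
    "TABLE_REC_BATCH_SIZE": "8",
    "OCR_ERROR_BATCH_SIZE": "8",
}

# Hypothetical conservative in-script defaults for the 8 GB laptop.
LOCAL_DEFAULTS = {
    "LAYOUT_BATCH_SIZE": "2",
    "DETECTOR_BATCH_SIZE": "2",
    "RECOGNITION_BATCH_SIZE": "16",
    "TABLE_REC_BATCH_SIZE": "2",
    "OCR_ERROR_BATCH_SIZE": "2",
}


def apply_batch_defaults(env=os.environ, defaults=LOCAL_DEFAULTS):
    """Fill in batch sizes only where the caller has not already set one."""
    for name, value in defaults.items():
        env.setdefault(name, value)
    return env
```

Such a call would need to run before Marker is imported if Marker reads the variables at import time, which is an assumption worth checking against the installed version.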
Remote prerequisites (documented; not automated by the wrapper): SSH key access from the laptop, uv installed in the container, NVIDIA driver + CUDA 12.x runtime, rsync on both sides, the LLM stopped before ingest. No changes to ingest_pdf.py itself.
- (−) PEP 723 inline-metadata scripts are still a relatively new uv pattern; some contributors may need to install uv first.
- (−) Bookmark-less PDFs (BR, 365-Tao) get only a single full.md; human/AI must do the chapter-split downstream. Acceptable cost: these PDFs are rare and the boundaries are interpretive anyway.
Update 2026-05-21: TOC-driven bookmark addition for bookmark-less PDFs
The “bookmark-less → single full.md” fallback works for short PDFs (BR is 63 pages) but is infeasible for long ones. Unification Thought (~640 pages, printed TOC, no publisher bookmarks) cannot be extracted as a single file — Marker’s per-call working set OOMs even the remote 20 GB NVIDIA GPU when the page range spans the whole book. Splitting is required, but the existing --split-level N path needs an outline.
Added a small pre-ingest sub-pipeline: parse the printed TOC pages → produce a draft bookmarks TSV → AI sanity-checks → apply bookmarks → ingest via existing --split-level path. Two pymupdf-only scripts (no Marker dep):
- scripts/extract_toc.py <pdf> --pages X-Y --offset N: scans the TOC pages, regex-matches dotted/wide-space leader patterns, buckets rows by font size, merges multi-line continuation rows, assigns levels by leading marker (arabic/roman/letter), and emits both bookmarks.tsv (consumable by add_bookmarks) and a sidecar bookmarks.review.md for AI verification.
- scripts/add_bookmarks.py <pdf> <tsv>: validates (level monotonicity hard-fail, page monotonicity warning), applies via pymupdf’s Document.set_toc(), prints a one-line-per-bookmark checklist to stdout so the user can eyeball alongside the PDF viewer’s outline pane.
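A rough sketch of the leader-pattern matching and offset handling follows. The regex, the marker-to-level mapping (roman = 1, arabic = 2, letter = 3), the level/title/page TSV column order, and all function names are assumptions for illustration; font-size bucketing and continuation-row merging are left out.

```python
import re

# Title, then dotted leaders or a wide run of spaces, then the printed page.
TOC_ROW = re.compile(
    r"^(?P<title>.+?)\s*(?:(?:\.\s?){3,}|\s{2,})\s*(?P<page>\d+)\s*$"
)


def level_for(title):
    """Assumed mapping from leading marker to outline level."""
    if re.match(r"[IVX]+\.\s", title):
        return 1  # roman numeral
    if re.match(r"\d+\.\s", title):
        return 2  # arabic numeral
    if re.match(r"[A-Z]\.\s", title):
        return 3  # letter
    return 1


def parse_toc_lines(lines, offset):
    """Return (level, title, pdf_page) rows; pdf_page = printed page + offset."""
    rows = []
    for line in lines:
        match = TOC_ROW.match(line.strip())
        if not match:
            continue  # not a TOC row: heading, blank line, page furniture
        title = match.group("title").strip()
        rows.append((level_for(title), title, int(match.group("page")) + offset))
    return rows


def to_tsv(rows):
    """Serialize rows as tab-separated level, title, page (assumed layout)."""
    return "\n".join(f"{level}\t{title}\t{page}" for level, title, page in rows)
```

Checking the roman pattern before the letter pattern means a row starting with "I." or "V." is always read as roman, which is exactly the kind of ambiguity the sidecar review is there to catch.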
The sidecar markdown has four sections: (1) verbatim pymupdf text of the scanned TOC pages, (2) the draft TSV in a fenced block, (3) a target-page probe (first ~100 chars of each bookmarked page, marked ✓ if the page’s first line contains the TSV title’s leading marker), (4) a flags list (non-monotonic page numbers, running-header detection, out-of-range targets). AI in chat reads this and patches the TSV via the Edit tool; user reviews the diff.
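The target-page probe in section (3) can be sketched as below. The marker regex and the probe function are hypothetical; the ✓/⚠ symbols and the ~100-character window follow the description above.

```python
import re

# Leading marker of a TOC title: "2.", "IV.", "B." (assumed shapes).
LEADING_MARKER = re.compile(r"^\s*((?:\d+|[IVXLC]+|[A-Z])\.)")


def probe(title, page_text, width=100):
    """Return (mark, snippet) for one bookmarked page.

    The mark is a check when the page's first line contains the title's
    leading marker, and a warning sign otherwise.
    """
    snippet = page_text.strip()[:width]
    first_line = snippet.splitlines()[0] if snippet else ""
    match = LEADING_MARKER.match(title)
    hit = match is not None and match.group(1) in first_line
    return ("✓" if hit else "⚠", snippet)
```

A running header pushes the real heading off the first line, so every page of such a book reports ⚠ even when the bookmark target is correct.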
Two-phase verification:
- Pre-bookmark — AI sanity-checks the sidecar. Catches missing rows, level errors, big offsets, continuation orphans. ~90% reliable on UT-class books.
- Post-bookmark — user opens bookmarked PDF in viewer, clicks 5-10 outline entries against the printed checklist. Catches residual errors AI missed (running-header probe failures, source-TOC omissions).
Why this doesn’t contradict the original “rejected heuristic heading-detection split” decision: that rejection was about unreviewed heuristic boundaries derived from Marker’s noisy markdown output, propagating silently into citations. The new path derives boundaries from the source PDF’s printed TOC (cleaner signal), reviews them at two checkpoints before applying, and lets stage-2 chapter cleanup absorb ±1-page residual errors. That makes two review checkpoints plus the stage-2 cleanup pass, versus one silent heuristic.
Script-vs-AI split: scripts fix only what they can do deterministically (continuation-line merge, level-jump validation). Detection-only flags (page non-monotonicity, running headers) are surfaced for AI/human resolution — the script can detect them but can’t pick which row is wrong without context.
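The hard-fail versus warning distinction can be sketched as a validator that returns two lists. validate_rows is a hypothetical name and the row shape (level, title, page) is assumed; the level rule itself (start at 1, never jump down the hierarchy by more than one level at a time) is the constraint PyMuPDF's set_toc() enforces.

```python
def validate_rows(rows, page_count):
    """Return (errors, warnings) for a list of (level, title, page) rows.

    Errors block applying the bookmarks; warnings are surfaced for AI or
    human resolution because the script cannot tell which row is wrong.
    """
    errors, warnings = [], []
    prev_level, prev_page = 0, 0
    for n, (level, title, page) in enumerate(rows, start=1):
        if level < 1 or level > prev_level + 1:
            errors.append(f"row {n}: level jumps {prev_level} -> {level} ({title})")
        if page < prev_page:
            warnings.append(f"row {n}: page {page} precedes {prev_page} ({title})")
        if not 1 <= page <= page_count:
            warnings.append(f"row {n}: page {page} outside 1..{page_count} ({title})")
        prev_level, prev_page = level, page
    return errors, warnings
```

A caller would refuse to apply bookmarks when errors is non-empty and print warnings alongside the checklist.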
- (+) Long printed-TOC PDFs (UT, future similar UC-secondary works) become ingestible. Without this they’d be stuck at the “would extract as single 640-page file” --max-chunk-pages guard.
- (+) Stage-1 deterministic / stage-2 judgment separation preserved: bookmark addition is mechanical; the AI review is constrained to TSV editing, not free-form prose.
- (+) Sidecar is gitignored (resources-raw/ is gitignored), so no repo bloat.
- (−) AI sanity check is per-book manual work in chat. ~5-10 min of editing per UT-sized book. Acceptable given ingestion is infrequent.
- (−) Books with running headers (chapter title repeated on every page — UT has them) defeat the target-page probe’s ✓/⚠ signal. Sidecar flags this; user falls back to viewer spot-check.
- (−) Source-TOC omissions (a section that exists in the book but was left out of the printed TOC) are invisible to the pre-bookmark pipeline. Caught only at viewer check or stage-2 markdown reading.
Update 2026-05-21: cross-file section-boundary normalizer
Bookmark-driven splitting cuts on PDF page boundaries, but section prose rarely respects page boundaries. Empirically across DP’s 35 section files: ~25 had a tail paragraph from section-(K-1) leaking into the top of section-K’s file, several had a heading split across a blank line (## **Section 3** + blank + ## **<Title>**), two had Marker emit the entire prior section verbatim at the head of the next section file (chapter 7 part-1 / chapter 7 part-2), and one had section-K’s bookmark range bleed into section-(K+1) territory at its tail. All four classes degrade stage-2 atomic extraction by misattributing prose to the wrong section.
Added normalize_sections() to scripts/ingest_pdf.py. Runs after the chunk-write loop (or as standalone --normalize-only EXTRACTED_DIR for backfill). Three passes:
- Heading-rejoin: collapse ## **Section N** + blank + ## **<Title>** to one line.
- Leading-orphan: for each adjacent pair (section-(K-1), section-K), find the first ## Section K anchor in section-K (anchor regex enforces expected K; otherwise a duplicated prior heading at offset 0 would fool the matcher into a no-op). Everything before the anchor is orphan. Compare orphan paragraphs (SHA-256 hashes) against pristine section-(K-1). If they appear as a contiguous run there, the orphan is a Marker duplicate → trim. Otherwise it’s a real leak → move to section-(K-1) tail.
- Trailing-overflow: scan section-K for any ## Section J with J > K. If the tail-from-that-anchor’s first ≥2 paragraphs appear as a contiguous run in pristine section-J, trim the entire tail.
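As an illustration of the heading-rejoin pass, a regex along these lines would collapse the split heading. The merged form ("Section N: Title" inside one bold span) is an assumption, since the text only says the two lines become one; rejoin_headings is a hypothetical name.

```python
import re

SPLIT_HEADING = re.compile(
    r"^(## \*\*Section \d+)\*\*[ \t]*\n[ \t]*\n## \*\*(.+?)\*\*[ \t]*$",
    re.MULTILINE,
)


def rejoin_headings(markdown):
    """Collapse '## **Section N**' + blank + '## **Title**' into one line."""
    # Assumed output format: '## **Section N: Title**'.
    return SPLIT_HEADING.sub(r"\1: \2**", markdown)
```

Once a heading is merged it no longer matches the pattern, so a second run leaves the file unchanged.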
All comparisons use pristine snapshots taken before any mutation, so per-pair decisions don’t see state another pair already changed. Writes happen at the end. Pass-1 modifies one file at a time; pass-2 modifies at most two; pass-3 modifies one. Per-action stdout log: moved N paragraph(s) from … → … tail / trimmed N-paragraph duplicate prefix in … / trimmed N-paragraph trailing overflow in …. Idempotent.
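The trim-versus-move decision in the leading-orphan pass might look roughly like this. Function names are hypothetical, paragraphs are assumed to be blank-line separated, and the anchor regex is a simplified stand-in; both inputs are meant to be the pristine snapshots described above.

```python
import hashlib
import re


def paragraph_hashes(text):
    """SHA-256 of each non-empty, whitespace-trimmed paragraph."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    return [hashlib.sha256(p.encode("utf-8")).hexdigest() for p in paragraphs]


def is_contiguous_run(needle, haystack):
    """True when needle appears as an unbroken slice of haystack."""
    n = len(needle)
    if n == 0:
        return False
    return any(haystack[i:i + n] == needle for i in range(len(haystack) - n + 1))


def classify_orphan(section_text, previous_text, k):
    """Return 'skip', 'none', 'trim' or 'move' for section K's leading text."""
    # The anchor must name the expected K, so a duplicated earlier heading
    # at offset 0 cannot be mistaken for this section's start.
    anchor = re.search(rf"^##\s+(?:\*\*)?Section {k}\b", section_text, re.MULTILINE)
    if anchor is None:
        return "skip"  # expected anchor missing: log a warning, change nothing
    orphan = section_text[:anchor.start()]
    if not orphan.strip():
        return "none"
    if is_contiguous_run(paragraph_hashes(orphan), paragraph_hashes(previous_text)):
        return "trim"  # duplicate of content already in section K-1
    return "move"  # real leak: belongs at the tail of section K-1
```

Hashing whole paragraphs keeps the comparison exact, so a near-duplicate with even one changed character is treated as a leak and moved instead of trimmed.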
Why this doesn’t contradict the original “rejected heuristic heading-detection split” decision: that rejection was about guessing boundaries from unstructured prose. The signals here are explicit markers in the text — named ## Section K anchors, exact paragraph hashes against the neighboring file’s content. The same script-vs-AI split rule from the bookmark-addition sub-pipeline applies: the script trims/moves only when the match is deterministic (anchor regex matches, paragraph hashes line up); it logs a warning and skips when the expected anchor is missing.
- (+) Backfill of existing DP (52 actions across 35 files in two runs) unblocks atomic extraction without re-running Marker. Per-action stdout log gives a sanity-check trail.
- (+) Pass-1 in ingest_pdf.py means future extractions of DP, CSG, and any other class with section-K-*.md or chapter-K-*.md files auto-normalize. --no-normalize available for debug. --normalize-only EXTRACTED_DIR lets you re-run normalization standalone (or apply to other classes retroactively).
- (+) Snapshot-based cross-file decisions: each pair’s trim-vs-move decision compares against pristine content, not against state already mutated by an earlier pair. Avoids order-dependent miscategorization.
- (−) Heading-depth inconsistencies (e.g., DP chapter 7 part-1 has ## **4.1 Rebirth** in section-3’s overflow but ### **4.1 Rebirth** in section-4 proper) prevent full-tail hash match. Trailing-overflow trim accepts partial-prefix match (≥2 paragraphs) and drops the whole tail from the anchor onward, which loses any residual non-matching content along with the overflow. Acceptable since the residual is itself Marker-bug output.
- (−) Chapter-heading inconsistencies (### Chapter 2. vs ## Chapter 2, missing chapter title in part-2/01) and last-section truncations are not in scope. Picked up separately if they become recurring.
Update 2026-05-21: front/back-matter subfolder routing
A PDF’s flat bookmark list typically intermixes front matter (about-the-author, foreword, copyright), body chapters, and back matter (resources, appendices, index). Until now the script ran numbered_slug() over every chunk title: titles with a leading number (“1. The Saddleback Story”) produced 01-the-saddleback-story.md, but titles without a number got plain slugify() and no prefix — so the directory listing for The Purpose-Driven Church alphabetized the unnumbered chunks, putting “resources” (a back-matter appendix) above “surfing-spiritual-waves” (an intro chapter) and scrambling reading order. The user couldn’t tell at a glance what to read first.
Added classify_chunks() to tag each chunk as front / body / back by position. Within each parent_chain group, the first and last chunks whose title matches ^\s*\d+\.\s+ (numbered chapter) bracket the body span. Chunks before the first → front; chunks after the last → back. Groups with no numbered chapter classify all-as-body, preserving current behavior for unnumbered TOCs.
chunk_outpath() now routes:
- front → 00-front-matter/<position>-<slug>.md
- back → 99-back-matter/<position>-<slug>.md
- body → unchanged: <numbered_slug>.md at the parent_chain folder root
The position prefix is the chunk’s 1-indexed order within its class, so the front/back subfolders preserve PDF reading order. Body chunks keep their canonical chapter-number filename — citation grammar ([[Class/NN-slug#anchor|anchor]]) is unchanged.
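A compact sketch of the position-based classification and routing, for a single flat parent_chain group. The slugify and numbered_slug bodies here are simplified stand-ins for the script's real helpers, and chunk_outpaths is a hypothetical wrapper around the routing rule.

```python
import re

NUMBERED = re.compile(r"^\s*\d+\.\s+")


def slugify(title):
    """Stand-in: lowercase, runs of non-alphanumerics become a hyphen."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def numbered_slug(title):
    """Stand-in: '1. The Saddleback Story' -> '01-the-saddleback-story'."""
    match = re.match(r"^\s*(\d+)\.\s+(.*)$", title)
    if not match:
        return slugify(title)
    return f"{int(match.group(1)):02d}-{slugify(match.group(2))}"


def classify_chunks(titles):
    """Tag each title front/body/back by position around numbered chapters."""
    numbered = [i for i, title in enumerate(titles) if NUMBERED.match(title)]
    if not numbered:
        return ["body"] * len(titles)  # unnumbered TOC: behavior unchanged
    first, last = numbered[0], numbered[-1]
    return [
        "front" if i < first else "back" if i > last else "body"
        for i in range(len(titles))
    ]


def chunk_outpaths(titles):
    """Relative output path per chunk; position is 1-indexed within its class."""
    paths, position = [], {"front": 0, "back": 0}
    for title, kind in zip(titles, classify_chunks(titles)):
        if kind == "body":
            paths.append(f"{numbered_slug(title)}.md")
            continue
        position[kind] += 1
        folder = "00-front-matter" if kind == "front" else "99-back-matter"
        paths.append(f"{folder}/{position[kind]}-{slugify(title)}.md")
    return paths
```

An unnumbered title sitting between two numbered chapters stays in the body and keeps its unprefixed slug, which is the Part-divider case noted in the consequences below.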
Why position-based, not vocabulary-based: order-aware. PDC’s “About the Author” and “About the Publisher” actually sit after chapter 20 in the PDF, so they’re back matter — vocabulary matching on “about” would have mis-tagged them as front. Position is correct for any layout. The ^\d+\.\s+ anchor is the deterministic signal; we trust the source PDF’s chapter numbering.
No auto-skip. Obvious chaff (copyright, publisher boilerplate) still gets extracted into the matching subfolder. Subfolder grouping makes the user’s manual cleanup a single rm per category, consistent with the project’s flag-don’t-fix principle: the script orders, the user prunes.
Grouping by parent_chain generalizes to nested resources (CSG Book → Chapter). For PDC’s flat bookmarks every chunk has empty parent_chain[1:], so they form a single top-level group and the front/back subfolders sit at the extracted/ root. For CSG the classification runs per-Book, but since CSG’s bookmark hierarchy has no front-matter siblings at the chapter level, the layout is unchanged in practice.
- (+) New extractions produce a directory tree where ls reveals reading order at a glance.
- (+) Existing numbered_slug() body behavior preserved: re-running on CSG, BR, DP produces the same chapter filenames.
- (+) --inspect shows the classification preview so the user can dry-run before committing Marker time.
- (−) Part-divider bookmarks at the chapter level (PDC has “Part One • Seeing the Big Picture” etc. at the same bookmark depth as chapters) classify as body and emit unprefixed slugified files that alphabetize at the tail of the body listing. The user still does the manual Part-folder restructure for those; out of scope for this change.
- (−) Books where front/back matter sits in nested folders (e.g. a Book-level “Book 0: Preface” in a CSG-style hierarchy) would need a different signal to detect. Not encountered yet; address when it appears.
Update 2026-05-21: stage_resources.py — extracted/ → staging/ onramp
Stage-2 (“AI reads extracted markdown and produces cleaned chapter notes in resources/{Class}/”) was previously all-AI-in-chat. Comparing a finalized CSG chapter (resources/CSG/Book01/csg-01-01-the-original-being-of-god.md) against its source (resources-raw/CSG/extracted/contents/book-1-true-god/01-the-original-being-of-god.md) reveals three tiers of work happening:
- Tier 1: class-agnostic deterministic. File/folder rename into the resources/{Class}/ layout, frontmatter inject, [Text](#page-N) page-anchor link strip, ASCII → curly quote normalization.
- Tier 2: class-specific deterministic. Heading-depth normalization (CSG ## [Section 1] → # Section 1, #### 1.1. → ## 1.1.), citation line-break (CSG paragraph. (35-156, 1970.10.13) → break onto own line).
- Tier 3: interpretive. Section-break decisions, footnote reshape, OCR-error fixes. Per ADR-0015.
Doing tier-1 via Edit calls per chapter for every new ingest is repetitive and token-expensive. ADR-0015 anticipated this: “Per-chapter cleaning effort is real; for sources with consistent OCR quality, this can be partially scripted later.”
Added scripts/stage_resources.py. One invocation per class:
uv run scripts/stage_resources.py <class> [--force]
Walks resources-raw/{class}/extracted/, matches each .md against a per-class regex (hardcoded CLASSES registry: CSG, BR, PDC, WorldScripture initially), and writes a staging batch at staging/{class-slug}-resource-ingest/ mirroring the resources/{Class}/ shape. Each staged file has:
- YAML frontmatter (type:, class:, book:/part:, chapter:, *-title:, plus class-specific metadata like author/year). Titles derived from slug (the-saddleback-story → "The Saddleback Story"); lossy on possessives (god-s → God S), user fixes during review before promotion.
- Body text with [Text](#page-N(-N)?) page-anchor link wrappers stripped.
- Body text with ASCII "/' normalized to curly “”‘’ via a stateful single-pass walk (opening if preceded by start/whitespace/punctuation, closing otherwise).
REVIEW.md is generated listing every staged file (book, chapter, derived title, target path), every unmatched extracted file (skipped, not silently promoted), and a stub for the standard “Proposed work” section. Files under the reserved 00-front-matter/ and 99-back-matter/ subdirs (from the front/back-matter classifier above) skip the regex match entirely and route to “unmatched” — keeps a Part-folder-shaped class regex (e.g. PDC’s) from false-positiving on 00-front-matter/1-foo.md as Part 00.
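The three per-file transforms listed above (slug-derived titles, page-anchor stripping, quote curling) can be sketched as follows. Function names are hypothetical, and "punctuation" in the opening-quote rule is narrowed here to opening brackets, dashes, and an already-open quote, which is an assumption about the intended behavior.

```python
import re

PAGE_ANCHOR = re.compile(r"\[([^\]]+)\]\(#page-\d+(?:-\d+)?\)")
# Characters after which a quote is treated as opening (assumed set).
OPENING_CONTEXT = "([{\u2014\u2013\u201c\u2018"


def title_from_slug(slug):
    """Lossy by design: 'god-s-love' becomes 'God S Love'."""
    return " ".join(word.capitalize() for word in slug.split("-") if word)


def strip_page_anchors(text):
    """Replace [Text](#page-N) and [Text](#page-N-M) with plain Text."""
    return PAGE_ANCHOR.sub(r"\1", text)


def curl_quotes(text):
    """Single pass: a quote opens at start, after whitespace, or after an opener."""
    out = []
    for ch in text:
        if ch in "\"'":
            opening = not out or out[-1].isspace() or out[-1] in OPENING_CONTEXT
            if ch == '"':
                ch = "\u201c" if opening else "\u201d"
            else:
                ch = "\u2018" if opening else "\u2019"
        out.append(ch)
    return "".join(out)
```

Because apostrophes inside words are preceded by a letter, they come out as closing single quotes, which is the typographically correct form.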
Tier 2 (class-specific deterministic) is intentionally left to AI in chat. Reasons: (a) it’s small per-class, so encoding rules in the script means writing per-class rule modules anyway; (b) the user explicitly chose “no class-specific cleanups” during the grilling pass. Tier 3 stays AI-in-chat per ADR-0015’s review gate.
Why hardcoded CLASSES dict instead of per-class config files: matches the precedent of scripts/add-csg-frontmatter.js, and adding a class is already heavy enough (new ADR per CONTEXT.md) that one more registry entry is noise-level cost. The dict has the regex for extracted/, the target_layout template, the frontmatter schema, and the resource_folder name. No external config to load, no schema versioning to worry about.
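To make the registry shape concrete, here is a hedged sketch of one entry and the matching step. The regex, the target_layout template, and match_extracted are all hypothetical; the real PDC entry may differ. The reserved-subfolder check reflects the rule that 00-front-matter/ and 99-back-matter/ files skip the regex entirely.

```python
import re
from pathlib import PurePosixPath

RESERVED_DIRS = {"00-front-matter", "99-back-matter"}

# Hypothetical registry entry; field names follow the description above.
CLASSES = {
    "PDC": {
        "pattern": re.compile(
            r"^(?P<part>\d{2})-(?P<part_slug>[a-z0-9-]+)/"
            r"(?P<chapter>\d+)-(?P<slug>[a-z0-9-]+)\.md$"
        ),
        "target_layout": "Part{part}/pdc-{part}-{chapter:0>2}-{slug}.md",
        "resource_folder": "PDC",
    },
}


def match_extracted(class_name, relative_path):
    """Return the staged target path, or None when the file is unmatched."""
    if RESERVED_DIRS.intersection(PurePosixPath(relative_path).parts):
        return None  # reserved subfolder: never run the class regex
    entry = CLASSES[class_name]
    match = entry["pattern"].match(relative_path)
    if match is None:
        return None  # listed under "unmatched" in REVIEW.md
    return entry["target_layout"].format(**match.groupdict())
```

Without the reserved-folder check, 00-front-matter/1-foo.md would satisfy this Part-shaped regex and be staged as a chapter of a nonexistent Part 00.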
Why staging (staging/{class}-resource-ingest/) instead of writing directly to resources/{Class}/: preserves the ADR-0009 review gate. AI in chat does tiers 2-3 in place on the staged files, user reviews, then a plain mv promotes the directory.
Why derive titles from slug instead of authoring hardcoded title maps per class (like add-csg-frontmatter.js does): bulk-ingest doesn’t justify hand-typing every chapter title up front. The review pass already touches each chapter for tier-2/3 work, so fixing 2-3 lossy titles per chapter is the same swing.
- (+) Per-ingest tier-1 work is now uv run scripts/stage_resources.py <class>: sub-second, no token spend.
- (+) Unmatched files surface in REVIEW.md instead of silently being skipped or wrongly classified.
- (+) PDC’s manual-Part-folder restructure (per the front/back-matter update above) is encoded in the class regex; the user’s Part-folder names become part:/part-title: frontmatter on every chapter automatically.
- (+) --force reuses the rmtree-and-rewrite semantics established for ingest_pdf.py --force.
- (−) add-csg-frontmatter.js is superseded for new CSG additions but retained for now; existing CSG files already have frontmatter, no need to retrofit.
- (−) Title derivation is lossy on possessives, hyphenated compounds, and stop-word capitalization ("twenty-first" → "Twenty First"). User fixes in review. Acceptable since the alternative is hand-typed per-class maps for every chapter.
- (−) Classes whose extracted/ layout doesn’t fit the NN-slug.md or NN-name/NN-slug.md shape (e.g. SMM speeches, Transcripts) need a fresh registry entry with a different regex. Add when first ingested.