ADR-0023: Parallel atomization pipeline via Batch API
Status: Accepted Date: 2026-05-24
Context
ADR-0021 specified a foundation-pass workflow targeted at ~200 chapter-sessions across DP, CSG Books 2-16, and WS. Pace in practice: ~10-15 minutes per chapter when chat-driven, which is faster than the ADR’s budget but still produces ~30-50 hours of wall-time work for the remaining ~170 chapters plus the SMM speech corpus (~800 units) and tparents.org books in queue. The user reports the bottleneck is sequential atomization: each atomic in a section references earlier atomics in the same section, so AI must produce atomics one-at-a-time, even though the writing itself is AI-driven. Reviewing has also degraded — for foundation material the user has been rubber-stamp approving REVIEW.md because the attention budget is reserved for branches and threads, not the trunk.
A separate user concern is bulk reference ingestion. Hundreds of Sun Myung Moon speeches and other tparents.org books need to enter the vault. Ingesting these one chapter at a time via chat will not complete in any reasonable timeframe.
ADR-0011 anticipated a productionized scripts/process_batch.py for AI batch work but the file does not yet exist; pilot mode (pure chat per ADR-0011) has continued past the originally-projected 5 batches. The pre-conditions ADR-0011 listed for productionization (stable prompts, known shape) are met for atomization specifically.
A bad earlier draft of this plan attempted to mitigate review burden by introducing a “reference-tier” classification with auto-finalization for low-priority material. Domain grilling rejected this: SMM speeches are not low-priority — they are quoted in sermons, contain insights not in DP, and SMM’s teaching evolves over decades (early: foundation-laying / indemnity emphasis; later: sustainability / long-term emphasis). Tier classification by source class is therefore not the right axis. The right axis is atomization style (per-class prompt profile) plus staging-as-leisure-zone for the review attention problem.
Decision
Implement scripts/process_batch.py as the productionized AI batch script ADR-0011 anticipated. First subcommand: atomize. Future subcommands (weave, sermon) follow the same pattern.
Pipeline shape
Single linear script. No “Phase 1 / Phase 2” framing, no separate workflow concept beyond the existing ADR-0009 batch shape.
- Discover atomization units inside the resource_dir per class config (citation grammar in CONTEXT.md determines unit granularity).
- Pre-flight summary (interpretive amendment to ADR-0009 Gate 1). Script validates heading structure of all discovered units against the class config and prints a summary: N units, proposed batch carving (see “Batching by semantic boundary” below), sample heading shapes from 3 random files, citation grammar to be used, estimated batch cost in $. User says go/no-go once. Aborts with offending files listed if shape-validation fails. This honors Gate 1’s spirit (no broken-citation cascades from misformatted sources) without requiring per-file user confirmation.
- Atomize step (parallel). Submit one Anthropic Batch API request per atomization unit, Opus 4.7, current atomization prompt (chat-pilot prompts frozen and parameterized by class). Each unit produces atomics with
sources:frontmatter + inline source[[wikilinks]](dual citation per ADR-0005);related:left empty. - Linking step (single pass) over staged atomics:
- Adds
related:wikilinks between atomics in the same batch where claim-relationships warrant (Opus judgment). - Adds inline
[[glossary-term]]wikilinks in atomic bodies where the body mentions a registered glossary term (Haiku grep + apply). - Flags suspected duplicates against
/reference/,/experience/, and other open/staging/folders. Surfaced in REVIEW.md; not auto-resolved. - Proposes new glossary stubs when a recurring term emerges; user approves at review.
- Adds
- Stage atomics + REVIEW.md under
/staging/{batch-id}/. - User review (ADR-0009 Gate 2) at leisure. Multiple batches may sit in staging concurrently; Obsidian wikilinks are filename-based, so staging vs final is incidental during review.
- Finalize (ADR-0009 + ADR-0021). Move atomics to final locations, regenerate Person/Glossary
## Referenced byfor every index entry touched by this batch (validated by pre-commit hook per ADR-0017 rule 7), copy REVIEW.md to_meta/batch-reviews/{batch-id}.md, append parked items to_meta/parking/*.md, delete the staging folder, commit.
Batching by semantic boundary
Batches are not sized by a fixed unit count. They honor semantic boundaries (chapter / section / subsection / whole-speech / date+location-range) so all atomics in a batch share one mental context for review.
The script walks down the class hierarchy until the unit count for a container is ≤ a soft ceiling (default ~14 units, user-confirmed acceptable). Examples:
- A DP Part with 30 sections → split into chapter-level batches.
- A DP Chapter with 5 sections, each with 3 atomization units → batch the whole chapter (15 units, close to ceiling).
- A DP Subsection with 2 units → batch at the subsection level if its parent section would otherwise exceed the ceiling.
- SMM speeches → group by year + location (no chapter container exists; date + location is the cohesion axis).
The pre-flight summary reports the proposed carving; user can adjust before submission.
Per-class atomization-style profiles
The script carries a style profile per resource class, affecting the atomization prompt (not the review workflow):
- DP — methodically written doctrine; tight extraction; expect clean atomic boundaries.
- CSG — anthology of SMM teachings, structured by editors; attribution to underlying speech where indicated.
- WS — anthology of multiple traditions; atomics may cite the underlying original tradition per ADR-0020 hybrid policy.
- SMM — oral, repetitive, teaching evolves over decades; dedup within-speech repetition; era-context is recoverable from the speech date already in the citation.
- BR — secondary literature, prose argument; atomize claims plus the author’s evidential structure.
- tparents books — varied per book; default to BR-style, profile-tune per book as added.
Style profiles live in the script’s class config alongside the existing citation-grammar mapping from CONTEXT.md.
Model routing per ADR-0011
- Opus 4.7 for: atomization (Generate),
related:link reasoning, duplicate semantic judgment, new-glossary-stub proposal. - Haiku 4.5 for: heading-shape validation (Check), registered-term grep, inline wikilink syntax application, REVIEW.md cost-report summation, schema validation, index regeneration.
- Chat mode (existing pure-chat batches) still uses Opus only per ADR-0011.
Embedding-based dedup (amends ADR-0011)
ADR-0011 deferred embeddings “until 500+ atomics.” A parallel atomization run produces a corpus crossing that threshold (current 343 + run output). AI-grep dedup at this scale is O(n²) calls and impractically expensive. The deferral is lifted:
- Embed every atomic via an external embedding API (OpenAI
text-embedding-3-smallat ~1 for full-corpus recompute) or Anthropic embedding when available. - Persist as
_meta/embeddings/atomics.npz(or a sqlite-vss table — implementation choice during script build). Excluded from Quartz publish viaignorePatterns. Tracked in git for portability. - Linking step uses cosine-similarity top-K (default K=10) per new atomic, then Opus judges semantic equivalence on the shortlist only. AI-grep is replaced by AI-rerank.
REVIEW.md shape
ADR-0021 lean foundation-pass shape, with one additional sub-section:
## Linking applied
- `related:` additions: N (table: source atomic | target atomic | reason-1-line)
- Inline glossary wikilinks added: N (table: atomic | glossary term)
- Suspected duplicates flagged: N (table: new atomic | candidate-match | similarity score)
- New glossary stubs proposed: N (table: term | trigger atomic)
All other lean-shape sections (Atomics, Glossary/Person updates, Suspected duplicates, Tag-registry requests, Parked items) are preserved. The ADR-0011 cost report appended (productionized phase makes per-batch cost reports load-bearing).
Concurrent staging — ADR-0009 Hard Rule 6 reinterpretation
“No commit that touches files outside the active staging batch (during a batch run)” interpreted with parallel batches in flight: each finalize commit touches only its own staging folder + final-folder moves for atomics from that batch + Person/Glossary index updates triggered by that batch. Concurrent batches do not share files in staging; finalize is per-batch atomic. No structural change to ADR-0009 needed. Linking-step dedup scan includes all open /staging/ folders so cross-batch duplicate proposals are surfaced before they finalize.
ADR-0014 default-park behavior
ADR-0021 inverted ADR-0014’s default-promote rule to default-park for foundation-pass questions. process_batch.py atomize inherits this: any questions surfaced during atomization default-park to _meta/parking/questions.md. User can promote interactively during Gate 2 review.
Pilot before scaling
Pilot inputs span the class shapes the script must handle:
- 5 SMM speeches — mix of date ranges (Belvedere 1970s, Day of Hope, Hoon Dok Hae morning, Cheongpyeong, post-2012 Sanctuary if available). Tests: whole-unit class, oral/repetitive style profile, era-context handling, batch of independent files.
- 1 small CSG book or 1 DP part with 3–6 sections. Tests: multi-section book, in-book cross-references handled by linking step.
- 1 BR chapter. Tests: chapter-unit class, regression-parity with the existing manually-driven BR atomics (existing baseline for side-by-side comparison).
Side-by-side compare against existing manually-driven Opus output for equivalent material on:
- Atomics-per-unit granularity (target: comparable count)
- Claim quality (1-sentence claim + 1-2 paragraph elaboration, within 400-word cap per ADR-0005)
- Citation accuracy (every
sources:resolvable; inline citations match frontmatter) - Cross-link coverage after linking step (every atomic has ≥1
related:or glossary backlink where warranted) - Pre-commit hook (ADR-0017) clean — no tag violations, no Person/Glossary backlink failures
Pilot cost target: under $5 total. If atomic-quality is indistinguishable for all three pilot shapes, scale to the remaining ~170 foundation chapters + SMM/tparents corpora. If a specific class shape underperforms, diagnose (prompt? context? model? chunking?) and re-pilot just that shape.
Alternatives considered
- A. Keep chat-driven sequential atomization; just preach patience. Rejected — ~17 days of subscription wall-time per ~800-unit corpus, with no leverage on the bottleneck (within-section serial cross-referencing). The work is not actually time-sensitive but the user attention required to drive it is.
- B. Fine-tune a UC-specific model to “know” the corpus. Rejected — fine-tuning teaches style/patterns, not facts. Fine-tuned models hallucinate confidently-styled quotes. The Bible Q&A quality everyone references is a pre-training + RLHF coverage artifact, not replicable cheaply. RAG / structured retrieval over the citation-grammared vault is what produces real-quote accuracy, and the vault is already RAG-ready.
- C. Skip atomization entirely; query raw resources via RAG. Rejected — atomic-note distillation is a different epistemic act than retrieval. The atomic note is the user’s prior interpretation of a passage, hand-curated into the vault. RAG gives raw material; atomic notes give synthesized claims. Also, the published vault is read by humans following wikilinks — atomic notes are the human-readable surface, not internal scaffolding.
- D. Introduce “reference-tier” classification with auto-finalization for low-priority sources. Rejected during domain grilling — SMM speeches are not low-priority. The right axis is per-class atomization style (prompt profile), not content tier.
- E. “Review-light” mode bypassing ADR-0009 Gate 2. Rejected — staging-as-leisure-zone solves the attention problem without bypassing review. ADR-0009 Gate 2 stays intact.
- F. Build a separate
ingest_batch_parallel.pyscript instead of honoring ADR-0011’sprocess_batch.pyname. Rejected —process_batch.pyis the script ADR-0011 specified for productionized AI batch work; this IS that script (first subcommand:atomize). Two scripts would proliferate names and obscure the architectural intent. - G. Two-phase parallel + sweep with explicit “Phase 1 / Phase 2” framing. Rejected — the framing collides connotatively with ADR-0022 weaving-pass (also “linking”). The pipeline is a single linear script with named steps (atomize → link → stage → review → finalize); no separate workflow concept.
Consequences
- (+) Foundation-pass wall time drops from weeks of sequential chat-driven work to days of batch-API-driven work, at fixed marginal cost (~50-100 for SMM+tparents corpora).
- (+) Per-class style profiles capture domain-real differences (SMM oral-and-repetitive vs DP methodically-written) that were implicit in chat-pilot prompts but not codified.
- (+) Embedding-based dedup scales sub-linearly with corpus size; AI-grep was about to break at the new scale.
- (+) Concurrent staging matches user’s actual review pattern (multiple bites at leisure) without changing ADR-0009 review-gate principles.
- (+) Cost-report integration per ADR-0011 makes per-batch spend visible again (chat pilot used dashboard only).
- (+) Productionized script unblocks future
weaveandsermonsubcommands on the same infrastructure (subcommand pattern, prompt caching, model routing, cost reports). - (−) Adds an external embedding API dependency to ADR-0011’s stdlib-leaning posture. Mitigated by treating embeddings as a cache (recomputable from atomics + the prompt template), persisted but not load-bearing.
- (−) Requires
ANTHROPIC_API_KEYand the embedding-provider key to run the script. Chat-pilot path remains available for one-off batches without env setup. - (−) Pilot before full rollout adds ~24-48 hours of wall-time delay before the full SMM corpus can be atomized. Acceptable cost to validate quality before committing batch budget.
- (−) Multiple concurrent staging batches make
git statusnoisier during active ingestion. Acceptable — finalize commits clear staging folders on approval.