ADR-0023: Parallel atomization pipeline via Batch API

Status: Accepted Date: 2026-05-24

Context

ADR-0021 specified a foundation-pass workflow targeted at ~200 chapter-sessions across DP, CSG Books 2-16, and WS. Pace in practice: ~10-15 minutes per chapter when chat-driven, which is faster than the ADR’s budget but still produces ~30-50 hours of wall-time work for the remaining ~170 chapters plus the SMM speech corpus (~800 units) and tparents.org books in queue. The user reports the bottleneck is sequential atomization: each atomic in a section references earlier atomics in the same section, so AI must produce atomics one-at-a-time, even though the writing itself is AI-driven. Reviewing has also degraded — for foundation material the user has been rubber-stamp approving REVIEW.md because the attention budget is reserved for branches and threads, not the trunk.

A separate user concern is bulk reference ingestion. Hundreds of Sun Myung Moon speeches and other tparents.org books need to enter the vault. Ingesting these one chapter at a time via chat will not complete in any reasonable timeframe.

ADR-0011 anticipated a productionized scripts/process_batch.py for AI batch work but the file does not yet exist; pilot mode (pure chat per ADR-0011) has continued past the originally-projected 5 batches. The pre-conditions ADR-0011 listed for productionization (stable prompts, known shape) are met for atomization specifically.

A bad earlier draft of this plan attempted to mitigate review burden by introducing a “reference-tier” classification with auto-finalization for low-priority material. Domain grilling rejected this: SMM speeches are not low-priority — they are quoted in sermons, contain insights not in DP, and SMM’s teaching evolves over decades (early: foundation-laying / indemnity emphasis; later: sustainability / long-term emphasis). Tier classification by source class is therefore not the right axis. The right axis is atomization style (per-class prompt profile) plus staging-as-leisure-zone for the review attention problem.

Decision

Implement scripts/process_batch.py as the productionized AI batch script ADR-0011 anticipated. First subcommand: atomize. Future subcommands (weave, sermon) follow the same pattern.

Pipeline shape

Single linear script. No “Phase 1 / Phase 2” framing, no separate workflow concept beyond the existing ADR-0009 batch shape.

  1. Discover atomization units inside the resource_dir per class config (citation grammar in CONTEXT.md determines unit granularity).
  2. Pre-flight summary (interpretive amendment to ADR-0009 Gate 1). Script validates heading structure of all discovered units against the class config and prints a summary: N units, proposed batch carving (see “Batching by semantic boundary” below), sample heading shapes from 3 random files, citation grammar to be used, estimated batch cost in $. User says go/no-go once. Aborts with offending files listed if shape-validation fails. This honors Gate 1’s spirit (no broken-citation cascades from misformatted sources) without requiring per-file user confirmation.
  3. Atomize step (parallel). Submit one Anthropic Batch API request per atomization unit, Opus 4.7, current atomization prompt (chat-pilot prompts frozen and parameterized by class). Each unit produces atomics with sources: frontmatter + inline source [[wikilinks]] (dual citation per ADR-0005); related: left empty.
  4. Linking step (single pass) over staged atomics:
    • Adds related: wikilinks between atomics in the same batch where claim-relationships warrant (Opus judgment).
    • Adds inline [[glossary-term]] wikilinks in atomic bodies where the body mentions a registered glossary term (Haiku grep + apply).
    • Flags suspected duplicates against /reference/, /experience/, and other open /staging/ folders. Surfaced in REVIEW.md; not auto-resolved.
    • Proposes new glossary stubs when a recurring term emerges; user approves at review.
  5. Stage atomics + REVIEW.md under /staging/{batch-id}/.
  6. User review (ADR-0009 Gate 2) at leisure. Multiple batches may sit in staging concurrently; Obsidian wikilinks are filename-based, so staging vs final is incidental during review.
  7. Finalize (ADR-0009 + ADR-0021). Move atomics to final locations, regenerate Person/Glossary ## Referenced by for every index entry touched by this batch (validated by pre-commit hook per ADR-0017 rule 7), copy REVIEW.md to _meta/batch-reviews/{batch-id}.md, append parked items to _meta/parking/*.md, delete the staging folder, commit.

Batching by semantic boundary

Batches are not sized by a fixed unit count. They honor semantic boundaries (chapter / section / subsection / whole-speech / date+location-range) so all atomics in a batch share one mental context for review.

The script walks down the class hierarchy until the unit count for a container is ≤ a soft ceiling (default ~14 units, user-confirmed acceptable). Examples:

  • A DP Part with 30 sections → split into chapter-level batches.
  • A DP Chapter with 5 sections, each with 3 atomization units → batch the whole chapter (15 units, close to ceiling).
  • A DP Subsection with 2 units → batch at the subsection level if its parent section would otherwise exceed the ceiling.
  • SMM speeches → group by year + location (no chapter container exists; date + location is the cohesion axis).

The pre-flight summary reports the proposed carving; user can adjust before submission.

Per-class atomization-style profiles

The script carries a style profile per resource class, affecting the atomization prompt (not the review workflow):

  • DP — methodically written doctrine; tight extraction; expect clean atomic boundaries.
  • CSG — anthology of SMM teachings, structured by editors; attribution to underlying speech where indicated.
  • WS — anthology of multiple traditions; atomics may cite the underlying original tradition per ADR-0020 hybrid policy.
  • SMM — oral, repetitive, teaching evolves over decades; dedup within-speech repetition; era-context is recoverable from the speech date already in the citation.
  • BR — secondary literature, prose argument; atomize claims plus the author’s evidential structure.
  • tparents books — varied per book; default to BR-style, profile-tune per book as added.

Style profiles live in the script’s class config alongside the existing citation-grammar mapping from CONTEXT.md.

Model routing per ADR-0011

  • Opus 4.7 for: atomization (Generate), related: link reasoning, duplicate semantic judgment, new-glossary-stub proposal.
  • Haiku 4.5 for: heading-shape validation (Check), registered-term grep, inline wikilink syntax application, REVIEW.md cost-report summation, schema validation, index regeneration.
  • Chat mode (existing pure-chat batches) still uses Opus only per ADR-0011.

Embedding-based dedup (amends ADR-0011)

ADR-0011 deferred embeddings “until 500+ atomics.” A parallel atomization run produces a corpus crossing that threshold (current 343 + run output). AI-grep dedup at this scale is O(n²) calls and impractically expensive. The deferral is lifted:

  • Embed every atomic via an external embedding API (OpenAI text-embedding-3-small at ~1 for full-corpus recompute) or Anthropic embedding when available.
  • Persist as _meta/embeddings/atomics.npz (or a sqlite-vss table — implementation choice during script build). Excluded from Quartz publish via ignorePatterns. Tracked in git for portability.
  • Linking step uses cosine-similarity top-K (default K=10) per new atomic, then Opus judges semantic equivalence on the shortlist only. AI-grep is replaced by AI-rerank.

REVIEW.md shape

ADR-0021 lean foundation-pass shape, with one additional sub-section:

## Linking applied
- `related:` additions: N (table: source atomic | target atomic | reason-1-line)
- Inline glossary wikilinks added: N (table: atomic | glossary term)
- Suspected duplicates flagged: N (table: new atomic | candidate-match | similarity score)
- New glossary stubs proposed: N (table: term | trigger atomic)

All other lean-shape sections (Atomics, Glossary/Person updates, Suspected duplicates, Tag-registry requests, Parked items) are preserved. The ADR-0011 cost report appended (productionized phase makes per-batch cost reports load-bearing).

Concurrent staging — ADR-0009 Hard Rule 6 reinterpretation

“No commit that touches files outside the active staging batch (during a batch run)” interpreted with parallel batches in flight: each finalize commit touches only its own staging folder + final-folder moves for atomics from that batch + Person/Glossary index updates triggered by that batch. Concurrent batches do not share files in staging; finalize is per-batch atomic. No structural change to ADR-0009 needed. Linking-step dedup scan includes all open /staging/ folders so cross-batch duplicate proposals are surfaced before they finalize.

ADR-0014 default-park behavior

ADR-0021 inverted ADR-0014’s default-promote rule to default-park for foundation-pass questions. process_batch.py atomize inherits this: any questions surfaced during atomization default-park to _meta/parking/questions.md. User can promote interactively during Gate 2 review.

Pilot before scaling

Pilot inputs span the class shapes the script must handle:

  • 5 SMM speeches — mix of date ranges (Belvedere 1970s, Day of Hope, Hoon Dok Hae morning, Cheongpyeong, post-2012 Sanctuary if available). Tests: whole-unit class, oral/repetitive style profile, era-context handling, batch of independent files.
  • 1 small CSG book or 1 DP part with 3–6 sections. Tests: multi-section book, in-book cross-references handled by linking step.
  • 1 BR chapter. Tests: chapter-unit class, regression-parity with the existing manually-driven BR atomics (existing baseline for side-by-side comparison).

Side-by-side compare against existing manually-driven Opus output for equivalent material on:

  • Atomics-per-unit granularity (target: comparable count)
  • Claim quality (1-sentence claim + 1-2 paragraph elaboration, within 400-word cap per ADR-0005)
  • Citation accuracy (every sources: resolvable; inline citations match frontmatter)
  • Cross-link coverage after linking step (every atomic has ≥1 related: or glossary backlink where warranted)
  • Pre-commit hook (ADR-0017) clean — no tag violations, no Person/Glossary backlink failures

Pilot cost target: under $5 total. If atomic-quality is indistinguishable for all three pilot shapes, scale to the remaining ~170 foundation chapters + SMM/tparents corpora. If a specific class shape underperforms, diagnose (prompt? context? model? chunking?) and re-pilot just that shape.

Alternatives considered

  • A. Keep chat-driven sequential atomization; just preach patience. Rejected — ~17 days of subscription wall-time per ~800-unit corpus, with no leverage on the bottleneck (within-section serial cross-referencing). The work is not actually time-sensitive but the user attention required to drive it is.
  • B. Fine-tune a UC-specific model to “know” the corpus. Rejected — fine-tuning teaches style/patterns, not facts. Fine-tuned models hallucinate confidently-styled quotes. The Bible Q&A quality everyone references is a pre-training + RLHF coverage artifact, not replicable cheaply. RAG / structured retrieval over the citation-grammared vault is what produces real-quote accuracy, and the vault is already RAG-ready.
  • C. Skip atomization entirely; query raw resources via RAG. Rejected — atomic-note distillation is a different epistemic act than retrieval. The atomic note is the user’s prior interpretation of a passage, hand-curated into the vault. RAG gives raw material; atomic notes give synthesized claims. Also, the published vault is read by humans following wikilinks — atomic notes are the human-readable surface, not internal scaffolding.
  • D. Introduce “reference-tier” classification with auto-finalization for low-priority sources. Rejected during domain grilling — SMM speeches are not low-priority. The right axis is per-class atomization style (prompt profile), not content tier.
  • E. “Review-light” mode bypassing ADR-0009 Gate 2. Rejected — staging-as-leisure-zone solves the attention problem without bypassing review. ADR-0009 Gate 2 stays intact.
  • F. Build a separate ingest_batch_parallel.py script instead of honoring ADR-0011’s process_batch.py name. Rejected — process_batch.py is the script ADR-0011 specified for productionized AI batch work; this IS that script (first subcommand: atomize). Two scripts would proliferate names and obscure the architectural intent.
  • G. Two-phase parallel + sweep with explicit “Phase 1 / Phase 2” framing. Rejected — the framing collides connotatively with ADR-0022 weaving-pass (also “linking”). The pipeline is a single linear script with named steps (atomize → link → stage → review → finalize); no separate workflow concept.

Consequences

  • (+) Foundation-pass wall time drops from weeks of sequential chat-driven work to days of batch-API-driven work, at fixed marginal cost (~50-100 for SMM+tparents corpora).
  • (+) Per-class style profiles capture domain-real differences (SMM oral-and-repetitive vs DP methodically-written) that were implicit in chat-pilot prompts but not codified.
  • (+) Embedding-based dedup scales sub-linearly with corpus size; AI-grep was about to break at the new scale.
  • (+) Concurrent staging matches user’s actual review pattern (multiple bites at leisure) without changing ADR-0009 review-gate principles.
  • (+) Cost-report integration per ADR-0011 makes per-batch spend visible again (chat pilot used dashboard only).
  • (+) Productionized script unblocks future weave and sermon subcommands on the same infrastructure (subcommand pattern, prompt caching, model routing, cost reports).
  • (−) Adds an external embedding API dependency to ADR-0011’s stdlib-leaning posture. Mitigated by treating embeddings as a cache (recomputable from atomics + the prompt template), persisted but not load-bearing.
  • (−) Requires ANTHROPIC_API_KEY and the embedding-provider key to run the script. Chat-pilot path remains available for one-off batches without env setup.
  • (−) Pilot before full rollout adds ~24-48 hours of wall-time delay before the full SMM corpus can be atomized. Acceptable cost to validate quality before committing batch budget.
  • (−) Multiple concurrent staging batches make git status noisier during active ingestion. Acceptable — finalize commits clear staging folders on approval.