ADR-0011: Token economy — defer embeddings, hybrid chat→script, Haiku routing

Status: Accepted Date: 2026-05-19

Context

User flagged AI cost as a top concern. Need a strategy that minimizes spend while preserving the AI-first interface and content quality.

Decision

Pipeline architecture: hybrid by phase

  • Pilot phase (first ~5 batches): pure chat, run in Claude Code conversationally. Goal = discover the right pipeline shape. Cost visibility via Anthropic dashboard.
  • Productionized phase: freeze working prompts into scripts/process_batch.py using the Anthropic SDK with prompt caching, model routing, per-batch cost reports. Chat used for exploratory/ad-hoc work only.

Model routing (script only)

  • Opus 4.7 for Generate tasks: atomization, thread synthesis, counter research, sermon drafting
  • Haiku 4.5 for Check tasks: schema validation, word count, tag lookup, index regeneration, citation grep, cost report sum
  • Mnemonic: “Generate = Opus. Check = Haiku.”
  • Chat mode just uses Opus (no manual switching mid-conversation)

Cost reporting

Scripts use Anthropic SDK usage data to produce a per-batch cost table in REVIEW.md: input/output/cached tokens per phase + $ total. Pure-chat batches omit this (dashboard view only).

Deduplication: defer embeddings

  • Start with _meta/index.md (one-line-per-note: title | type | tags | sources) + grep-based duplicate detection in AI
  • Human catches semantic dupes during batch review (already required workflow)
  • If at 500+ atomics AI is producing real dupes, add SQLite + sqlite-vss or local Chroma as a 1-day retrofit

Other levers (cumulative)

  • Prompt caching on CONTEXT.md — fixed across sessions, ~90% savings on system-prompt portion
  • Pre-computed index file — AI reads _meta/index.md instead of loading every atomic for dedup checks
  • Search-before-generate — AI greps the index for likely overlaps before writing a new atomic; flags candidates in REVIEW.md
  • Bounded chapter processing — one chapter per batch; threads pull from index, not chapter source

Alternatives considered

  • Embeddings from day one: rejected — premature; we don’t yet know AI’s tendency to produce dupes, so can’t tune the similarity threshold.
  • All-Opus, no model split: rejected — Haiku is ~15× cheaper for the mechanical 80%; routing is trivial in a script.
  • Pure-script from day one: rejected — building the pipeline before pilot teaches us what it should be is over-engineering.
  • Pure-chat forever: rejected — chat hides per-call usage; can’t tune costs precisely.

Consequences

  • (+) Pilot cost is bounded by user’s review pace, not script efficiency
  • (+) Script phase delivers precise cost reports per batch
  • (+) Model split delivers large cost savings on routine validation work
  • (−) Two operating modes to maintain (pilot vs script) — acceptable; pilot becomes pure exploration territory once script lands