ADR-0011: Token economy — defer embeddings, hybrid chat→script, Haiku routing
Status: Accepted Date: 2026-05-19
Context
User flagged AI cost as a top concern. Need a strategy that minimizes spend while preserving the AI-first interface and content quality.
Decision
Pipeline architecture: hybrid by phase
- Pilot phase (first ~5 batches): pure chat, run in Claude Code conversationally. Goal = discover the right pipeline shape. Cost visibility via Anthropic dashboard.
- Productionized phase: freeze working prompts into
scripts/process_batch.pyusing the Anthropic SDK with prompt caching, model routing, per-batch cost reports. Chat used for exploratory/ad-hoc work only.
Model routing (script only)
- Opus 4.7 for Generate tasks: atomization, thread synthesis, counter research, sermon drafting
- Haiku 4.5 for Check tasks: schema validation, word count, tag lookup, index regeneration, citation grep, cost report sum
- Mnemonic: “Generate = Opus. Check = Haiku.”
- Chat mode just uses Opus (no manual switching mid-conversation)
Cost reporting
Scripts use Anthropic SDK usage data to produce a per-batch cost table in REVIEW.md: input/output/cached tokens per phase + $ total. Pure-chat batches omit this (dashboard view only).
Deduplication: defer embeddings
- Start with
_meta/index.md(one-line-per-note: title | type | tags | sources) + grep-based duplicate detection in AI - Human catches semantic dupes during batch review (already required workflow)
- If at 500+ atomics AI is producing real dupes, add SQLite + sqlite-vss or local Chroma as a 1-day retrofit
Other levers (cumulative)
- Prompt caching on CONTEXT.md — fixed across sessions, ~90% savings on system-prompt portion
- Pre-computed index file — AI reads
_meta/index.mdinstead of loading every atomic for dedup checks - Search-before-generate — AI greps the index for likely overlaps before writing a new atomic; flags candidates in REVIEW.md
- Bounded chapter processing — one chapter per batch; threads pull from index, not chapter source
Alternatives considered
- Embeddings from day one: rejected — premature; we don’t yet know AI’s tendency to produce dupes, so can’t tune the similarity threshold.
- All-Opus, no model split: rejected — Haiku is ~15× cheaper for the mechanical 80%; routing is trivial in a script.
- Pure-script from day one: rejected — building the pipeline before pilot teaches us what it should be is over-engineering.
- Pure-chat forever: rejected — chat hides per-call usage; can’t tune costs precisely.
Consequences
- (+) Pilot cost is bounded by user’s review pace, not script efficiency
- (+) Script phase delivers precise cost reports per batch
- (+) Model split delivers large cost savings on routine validation work
- (−) Two operating modes to maintain (pilot vs script) — acceptable; pilot becomes pure exploration territory once script lands