ADR-0011: Token economy — defer embeddings, hybrid chat→script, Haiku routing

Status: Accepted Date: 2026-05-19

Context

User flagged AI cost as a top concern. Need a strategy that minimizes spend while preserving the AI-first interface and content quality.

Decision

Pipeline architecture: hybrid by phase

Pilot phase (first ~5 batches): pure chat, run in Claude Code conversationally. Goal = discover the right pipeline shape. Cost visibility via Anthropic dashboard.
Productionized phase: freeze working prompts into scripts/process_batch.py using the Anthropic SDK with prompt caching, model routing, per-batch cost reports. Chat used for exploratory/ad-hoc work only.

Model routing (script only)

Opus 4.7 for Generate tasks: atomization, thread synthesis, counter research, sermon drafting
Haiku 4.5 for Check tasks: schema validation, word count, tag lookup, index regeneration, citation grep, cost report sum
Mnemonic: “Generate = Opus. Check = Haiku.”
Chat mode just uses Opus (no manual switching mid-conversation)

Cost reporting

Scripts use Anthropic SDK usage data to produce a per-batch cost table in REVIEW.md: input/output/cached tokens per phase + $ total. Pure-chat batches omit this (dashboard view only).

Deduplication: defer embeddings

Start with _meta/index.md (one-line-per-note: title | type | tags | sources) + grep-based duplicate detection in AI
Human catches semantic dupes during batch review (already required workflow)
If at 500+ atomics AI is producing real dupes, add SQLite + sqlite-vss or local Chroma as a 1-day retrofit

Other levers (cumulative)

Prompt caching on CONTEXT.md — fixed across sessions, ~90% savings on system-prompt portion
Pre-computed index file — AI reads _meta/index.md instead of loading every atomic for dedup checks
Search-before-generate — AI greps the index for likely overlaps before writing a new atomic; flags candidates in REVIEW.md
Bounded chapter processing — one chapter per batch; threads pull from index, not chapter source

Alternatives considered

Embeddings from day one: rejected — premature; we don’t yet know AI’s tendency to produce dupes, so can’t tune the similarity threshold.
All-Opus, no model split: rejected — Haiku is ~15× cheaper for the mechanical 80%; routing is trivial in a script.
Pure-script from day one: rejected — building the pipeline before pilot teaches us what it should be is over-engineering.
Pure-chat forever: rejected — chat hides per-call usage; can’t tune costs precisely.

Consequences

(+) Pilot cost is bounded by user’s review pace, not script efficiency
(+) Script phase delivers precise cost reports per batch
(+) Model split delivers large cost savings on routine validation work
(−) Two operating modes to maintain (pilot vs script) — acceptable; pilot becomes pure exploration territory once script lands

Zettle

Explorer

0011-token-economy