Batch ws-notes — REVIEW

Content-cleanup variant per ADR-0009. Phase A of the WS-II bulk-processing handoff — produce the full back-matter notes file so that the splitter’s --notes-file argument resolves real footnote text (not TBD placeholders) for every chapter going forward, and the pilot’s staging/ws-notes-ch1/ partial can be retired.

What changed

  • New (1 file): staging/ws-notes/resources/WorldScripture/ws-notes.md — 707 footnote definitions covering Preface (2), Invocation (6), and Chapters 1–22, in standard [^N]: markdown footnote syntax, grouped under ## Preface, ## Invocation, and ## Chapter N H2 headings. Replaces the Chapter 1-only partial at staging/ws-notes-ch1/.
  • New script: scripts/clean_ws_notes.py — deterministic stage-2.5 cleaner (Marker raw notes.md → grouped + ordered footnote file). Will be reused if WS-II notes are re-extracted, or if a sibling anthology’s notes file lands in the same shape.
  • New (1 file): staging/ws-notes/flag-log.txt — every parser flag emitted during the run, mirrored from stderr.

Per-section note counts

| Section | Count | Notes |
|---|---|---|
| Preface | 2 | |
| Invocation | 6 | |
| Chapter 1 | 39 | Matches the partial staging/ws-notes-ch1/ already in flight. |
| Chapter 2 | 23 | |
| Chapter 3 | 15 | |
| Chapter 4 | 16 | |
| Chapter 5 | 37 | |
| Chapter 6 | 34 | Notes 29–34 misordered into Ch7 area by Marker (two-column print bleed); script reassigned by numeric continuation. |
| Chapter 7 | 48 | Notes 47–48 misordered into Ch8 area; reassigned. |
| Chapter 8 | 40 | |
| Chapter 9 | 48 | |
| Chapter 10 | 27 | |
| Chapter 11 | 21 | |
| Chapter 12 | 28 | |
| Chapter 13 | 33 | |
| Chapter 14 | 27 | Note 25 was emitted by Marker as a #### 25. **James 1.22-24** heading instead of a list item; script stripped the heading prefix and recorded it normally. |
| Chapter 15 | 23 | |
| Chapter 16 | 66 | |
| Chapter 17 | 7 | |
| Chapter 18 | 26 | |
| Chapter 19 | 28 | Notes 22–28 misordered into Ch20 area; reassigned. |
| Chapter 20 | 63 | |
| Chapter 21 | 26 | |
| Chapter 22 | 24 | |
| Total | 707 | |

Cleanup decisions (applied deterministically by scripts/clean_ws_notes.py)

  • Footnote syntax. Each bullet - N. **citation:** text rewritten as [^N]: **citation:** text.
  • Chapter headings. Marker preserved only Preface, Invocation, Ch4, and Ch19. All other chapter starts are detected via numeric restart (- 1. after a higher number) and emitted as ## Chapter N.
  • Misorder reassignment. A note whose number is not prior_in_section + 1 is treated as misordered if a prior chapter’s tail ended at number − 1. When several chapters end at that number, the highest chapter number wins (the most recently ended chapter is overwhelmingly the source of a column bleed). All seven reassigned spans flagged.
  • Continuation paragraphs. Lines without a - N. prefix (either bare paragraphs after a page break, or - bullets without a number) are merged into the preceding note’s body. Crucially, the anchor only advances when the just-recorded note was NOT reassigned — otherwise a misordered run would shadow the raw section’s real last note (the column’s true tail). This was caught in pilot review when ch6 note 34 was originally absorbing the continuation of ch7 note 4.
  • N**. artifact (9**. ..., 10**. ... in raw): script re-emits the leading ** so the citation’s opening bold marker is preserved.
  • Word-break artifacts. Joined halves of PDF line-break hyphenations restored via a small word-fix dictionary in the script: naturereligion → nature-religion, selfbegotten → self-begotten, bodyform → body-form, life-anddeath → life-and-death, birth-anddeath → birth-and-death, giveand-receive → give-and-receive. Mid-word hyphen splits with whitespace (nat- ural, treach- erous) collapsed by regex.
  • CJK spacing. (眞 如) → (眞如) (width-space inside parens collapsed).
  • Frontmatter. type: resource, class: WorldScripture, title “Notes”, book/author/publisher per the existing staging/ws-notes-ch1/ partial, ingested: <today>.
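
The structural rules above (chapter detection by numeric restart, misorder reassignment, and the anchor rule for continuations) can be pictured with a minimal sketch. This is an illustration only, not the code in scripts/clean_ws_notes.py: the function name assign_sections, the event encoding (an int for a numbered note, a str for a continuation line), and the fallback for unmatched gaps are assumptions made for the example, and Preface/Invocation handling is left out.

```python
from typing import Dict, List, Optional, Tuple, Union

Key = Tuple[int, int]  # (chapter, note number)


def assign_sections(events: List[Union[int, str]]) -> Tuple[Dict[Key, List[str]], List[Key]]:
    """Assign raw-order notes to chapters (hypothetical helper).

    An int event is a numbered note; a str event is a continuation line.
    Returns (notes, reassigned), where notes maps each key to the
    continuation lines merged into that note.
    """
    tails: Dict[int, int] = {}      # chapter -> last note number recorded
    current = 0                     # chapter being read in raw order
    anchor: Optional[Key] = None    # note that receives continuation lines
    notes: Dict[Key, List[str]] = {}
    reassigned: List[Key] = []

    for ev in events:
        if isinstance(ev, str):
            if anchor is not None:
                notes[anchor].append(ev)
            continue

        n = ev
        if n == 1:
            # Numeric restart: a "1" cannot continue a section.
            current += 1
            chapter = current
        elif tails.get(current) == n - 1:
            chapter = current
        else:
            # Misordered: look for an earlier chapter whose tail is n - 1;
            # on a tie, the highest chapter number wins.
            candidates = [ch for ch, tail in tails.items()
                          if ch != current and tail == n - 1]
            # Assumption: an unmatched gap stays in the current chapter.
            chapter = max(candidates) if candidates else current

        key = (chapter, n)
        notes[key] = []
        tails[chapter] = n
        if chapter == current:
            anchor = key            # advance only for non-reassigned notes
        else:
            reassigned.append(key)

    return notes, reassigned


events = [1, 2, 3, 1, 2, 4, "tail of note 2", 3]
notes, reassigned = assign_sections(events)
print(reassigned)      # [(1, 4)]
print(notes[(2, 2)])   # ['tail of note 2']
```

In this toy input, note 4 arrives while chapter 2 is being read and is sent back to chapter 1. The continuation line still attaches to chapter 2 note 2 because the anchor did not move, which is the behavior the ch6 note 34 / ch7 note 4 fix relies on.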

Flagged for human review

  1. low-minded in the ch7 note 4 continuation is already correct in the raw Marker output (no break), so it is preserved as-is. Mentioned for completeness because it sits next to the corrected nature-religion, body-form, etc. and could look inconsistent.

  2. Multi-paragraph notes (preface note 2, ch6 note 3, ch6 note 9, ch6 note 16, ch7 note 4 [now fixed], ch12 note 28, ch13 note 31). The script preserves paragraph breaks with a blank line + 4-space indent (CommonMark footnote-continuation grammar). Spot-checked preface and ch6; render in Quartz preview to confirm.

  3. Ch6 note 16 stray italics. Marker output: Vatican II, *Guadium et* *Spes* (split italics). The script keeps as-is. Should be one italic run *Guadium et Spes* — flagging for manual fix.

  4. Ch7 note 17 quoted Kabbalistic doctrine includes nested italics inside Hebrew transliteration that Marker may have split. Worth eyeballing.

  5. naturereligion and friends were fixed via an explicit dictionary in the script. If a future PDF re-extraction surfaces a different word break (e.g. cosmicreligion, Buddhanature), the dictionary will need extending. The script will silently miss any break that is not in the dictionary; only Quartz preview or the pre-commit prose checks would catch it.

  6. The raw notes.md is gitignored, so the flag-log’s line N: references (Marker line numbers in that file) will go stale if notes.md is re-extracted. The script is the source of truth; re-running it after a re-extract regenerates a fresh flag log.

  7. staging/ws-notes-ch1/ can be deleted at finalize time. The pilot’s --notes-file arg points at it now; after this batch lands at resources/WorldScripture/ws-notes.md the splitter should point there instead. Deferred to Phase D pilot finalize so nothing breaks mid-flight.
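
Item 5 is easier to judge with the word-break pass written out. A minimal sketch, assuming the fixes are whole-word replacements followed by one regex for hyphen-plus-space splits; the names WORD_FIXES, HYPHEN_SPLIT, and fix_word_breaks are hypothetical, and only the six dictionary entries are taken from this review.

```python
import re

# Entries copied from the cleanup decisions in this review.
WORD_FIXES = {
    "naturereligion": "nature-religion",
    "selfbegotten": "self-begotten",
    "bodyform": "body-form",
    "life-anddeath": "life-and-death",
    "birth-anddeath": "birth-and-death",
    "giveand-receive": "give-and-receive",
}

# Mid-word split left by a PDF line break: "nat- ural", "treach- erous".
HYPHEN_SPLIT = re.compile(r"(?<=[a-z])- (?=[a-z])")


def fix_word_breaks(text: str) -> str:
    """Apply the dictionary fixes, then collapse hyphen-space splits."""
    for broken, fixed in WORD_FIXES.items():
        # Assumption: match whole words only, so "bodyform" inside a
        # longer token is left alone.
        text = re.sub(r"\b" + re.escape(broken) + r"\b", fixed, text)
    return HYPHEN_SPLIT.sub("", text)


print(fix_word_breaks("a naturereligion with nat- ural law"))
# a nature-religion with natural law
print(fix_word_breaks("cosmicreligion"))
# cosmicreligion  (not in the dictionary, so it passes through unchanged)
```

The second call shows the silent miss described in item 5. Note also that a regex this simple would collapse a genuine suspended hyphen such as "pre- and post-war", so the real script may well guard against that case.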

Verification

# Total footnote-definition count: should be 707.
grep -cE '^\[\^[0-9]+\]:' staging/ws-notes/resources/WorldScripture/ws-notes.md
 
# All 22 chapter headings present.
grep -cE '^## Chapter [0-9]+$' staging/ws-notes/resources/WorldScripture/ws-notes.md   # → 22
 
# Preface + Invocation headings present.
grep -cE '^## (Preface|Invocation)$' staging/ws-notes/resources/WorldScripture/ws-notes.md  # → 2
 
# Per-chapter note counts match the table above.
for ch in $(seq 1 22); do
  count=$(awk "/^## Chapter $ch\$/,/^## Chapter $((ch+1))\$/" \
    staging/ws-notes/resources/WorldScripture/ws-notes.md | grep -cE '^\[\^')
  echo "Ch$ch: $count"
done
 
# Re-run the cleaner to confirm reproducibility.
uv run scripts/clean_ws_notes.py \
  --input resources-raw/WorldScripture/extracted/99-back-matter/notes.md \
  --output /tmp/ws-notes-check.md \
  --flag-log /tmp/ws-notes-flag-check.txt
diff staging/ws-notes/resources/WorldScripture/ws-notes.md /tmp/ws-notes-check.md
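
The shell loop above prints the per-chapter counts but leaves the comparison with the table to the reader. A small Python sketch of the same check, also covering Preface and Invocation, might look like this; the names EXPECTED, count_notes, and mismatches are hypothetical, and the expected counts are copied from the per-section table.

```python
import re
from collections import Counter
from typing import Dict, Iterable, Tuple

# Chapter 1..22 counts, in order, from the per-section table.
CHAPTER_COUNTS = [39, 23, 15, 16, 37, 34, 48, 40, 48, 27, 21,
                  28, 33, 27, 23, 66, 7, 26, 28, 63, 26, 24]
EXPECTED: Dict[str, int] = {
    "Preface": 2,
    "Invocation": 6,
    **{f"Chapter {i}": n for i, n in enumerate(CHAPTER_COUNTS, start=1)},
}

HEADING = re.compile(r"^## (Preface|Invocation|Chapter [0-9]+)$")
FOOTNOTE = re.compile(r"^\[\^[0-9]+\]:")


def count_notes(lines: Iterable[str]) -> Dict[str, int]:
    """Count footnote definitions under each H2 section heading."""
    counts: Counter = Counter()
    section = None
    for raw in lines:
        line = raw.rstrip("\n")
        m = HEADING.match(line)
        if m:
            section = m.group(1)
            counts[section] += 0  # register the section even if empty
        elif section is not None and FOOTNOTE.match(line):
            counts[section] += 1
    return dict(counts)


def mismatches(counts: Dict[str, int],
               expected: Dict[str, int]) -> Dict[str, Tuple[int, int]]:
    """Return {section: (found, expected)} for every section that differs."""
    return {name: (counts.get(name, 0), want)
            for name, want in expected.items()
            if counts.get(name, 0) != want}


# Usage (path as in the commands above):
# with open("staging/ws-notes/resources/WorldScripture/ws-notes.md",
#           encoding="utf-8") as fh:
#     print(mismatches(count_notes(fh), EXPECTED))  # empty dict if all match
```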

Finalize plan

When the user approves this batch:

  1. cp staging/ws-notes/REVIEW.md _meta/batch-reviews/ws-notes.md
  2. mv staging/ws-notes/resources/WorldScripture/ws-notes.md resources/WorldScripture/ws-notes.md
  3. The staging/ws-notes/flag-log.txt is reproducible from the script + raw; do NOT commit. (It’s already only in staging/, which is gitignored at finalize.)
  4. Delete staging/ws-notes-ch1/ once the pilot’s --notes-file reference has been switched to the final location (this happens during Phase D pilot finalize per the handoff).
  5. Subsequent Phase B chapters (2–22) call split_ws_chapter.py with --notes-file resources/WorldScripture/ws-notes.md (the final path, since notes ship before chapters).

Out of scope (deferred)

  • Ch6 note 16 split italics, ch7 note 17 nested italics — flagged above for human pass.
  • Cross-reference link validation — many notes reference other chapters by name (e.g. “see Chapter 7: Reversal and Restoration”). The pre-commit hook validates wikilinks but not free-text chapter references; checking would require knowing each chapter’s title (deferred until all 22 chapters are staged so titles are derivable).
  • Quartz preview — confirm footnote definitions render correctly under H2 chapter headings, and that the per-chapter file’s local ## Footnotes section can still reference them when split_ws_chapter.py wires up the per-sub-theme files.