Codex — Orchestration Protocol (Layer 5)
The execution layer. Specifies the state machine, integration mechanics, failure recovery, batch coordination, quality sampling, cost ledger, and watchdog protocols that turn the five-layer scaffold (units / DAG / equivalence / flows / continuity) into a running autonomous system. Without this layer, the scaffolds remain plans; with it, they become executable across the 64-book Fast Track campaign without human intervention.
Read after docs/plans/FASTTRACK_EQUIVALENCE_PLAN.md, docs/specs/FASTTRACK_FLOW_SCAFFOLD.md, docs/specs/CONTINUITY_SCAFFOLD.md.
1. The orchestration model
The orchestrator is a long-running operator that holds campaign state and dispatches agent work. In the current setup, the operator is the conversation-level claude (this assistant). In a fully-automated future setup, it would be a driver agent loop continuously dispatched by a cron or watchdog. The protocols below are written to work for either model — they're operator-agnostic.
The operator's responsibilities:
- Maintain the campaign ledger (`manifests/campaign.json`) — a single source of truth for every book's pass status, last-run timestamp, owning agent, and outstanding deliverables.
- Watch the ledger for state transitions and dispatch the appropriate next-pass agent.
- Receive returned agent outputs, run integration (Section 4) before any subsequent agent reads shared state.
- Run watchdog checks (Section 7) periodically.
- Run quality sampling (Section 8) every batch.
- Track cost (Section 9) and enforce budget tier ceilings.
- Sign off final transitions (`equivalence-covered`, `continuity-verified`) only when all gates pass.
The operator does not write content. It dispatches and integrates. Content is produced by the per-pass agents specified in Layers 2–4.
2. The campaign ledger
Single artifact at manifests/campaign.json. Holds every book and flow's lifecycle state. Append-only audit log in a sibling file manifests/campaign.log.jsonl records every state transition with timestamp and rationale.
Schema:
{
"version": 1,
"books": {
"lawson-michelsohn-spin-geometry": {
"tier": "α",
"fasttrack_entry": "3.35",
"status": "audited",
"passes": {
"P1_audit": { "status": "completed", "agent": "agent-a1", "started": "...", "completed": "...", "cost_usd": 0.18 },
"P2_gap": { "status": "pending", "blocked_by": [] },
"P3_plan": { "status": "pending" },
"P4_production": { "status": "pending" },
"P5_verification": { "status": "pending" }
},
"current_owner": null,
"plan_path": "plans/fasttrack/lawson-michelsohn-spin-geometry.md",
"outstanding_deliverables": []
},
"hartshorne-algebraic-geometry": {
"tier": "α",
"fasttrack_entry": "3.21",
"status": "draft",
...
}
},
"flows": {
"lawson-michelsohn-pt1": {
"kind": "author",
"source_book": "lawson-michelsohn-spin-geometry",
"status": "draft",
"passes": {
"PA_trajectory": { "status": "pending" },
"PB_bridges": { "status": "pending" },
"PC_pauses": { "status": "pending" },
"PD_verification": { "status": "pending" },
"PE_weaving": { "status": "pending" }
},
"flow_path": "flows/lawson-michelsohn-pt1.md"
}
},
"totals": {
"books_total": 64,
"books_equivalence_covered": 0,
"flows_total": 0,
"flows_attenuation_verified": 0,
"cost_usd_total": 0.0,
"agent_invocations_total": 0
}
}
The ledger is the operator's working memory. Every state transition touches it. The append-only log lets any future operator reconstruct the entire campaign history.
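A state transition can be recorded with a small helper; the sketch below is illustrative (the helper name and the exact log fields beyond timestamp, transition, and rationale are assumptions, and an in-memory buffer stands in for `manifests/campaign.log.jsonl`):

```python
import json
import datetime
import io

def record_transition(ledger, log, book_id, pass_id, new_status, rationale):
    """Update the in-memory ledger and append one audit line to the JSONL log."""
    entry = ledger["books"][book_id]["passes"][pass_id]
    old_status = entry["status"]
    entry["status"] = new_status
    log.write(json.dumps({
        "ts": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "book": book_id,
        "pass": pass_id,
        "from": old_status,
        "to": new_status,
        "rationale": rationale,
    }) + "\n")

# Usage: an in-memory log stands in for the campaign.log.jsonl file.
ledger = {"books": {"lawson-michelsohn-spin-geometry": {
    "passes": {"P1_audit": {"status": "pending"}}}}}
log = io.StringIO()
record_transition(ledger, log, "lawson-michelsohn-spin-geometry",
                  "P1_audit", "completed", "audit agent returned; >= 3 sources cited")
```

Because every mutation goes through one helper, the JSONL log and the ledger cannot drift apart silently.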
3. The state machine
3.1 Per-book states
draft
│
▼
audited ◄── Pass 1 complete
│
▼
gap-analysed ◄── Pass 2 complete
│
▼
production-ready ◄── Pass 3 complete (plan written)
│
▼
in-production ◄── Pass 4 dispatched (agents running)
│
▼
production-complete ◄── Pass 4 all units validated 27/27
│
▼
verified ◄── Pass 5 complete (per-book equivalence sign-off)
│
▼
woven ◄── Pass W complete (continuity weaving on this book's batch)
│
▼
equivalence-covered ◄── final sign-off; book contributes to v0.x.0 release
Each transition has explicit pre/post conditions and a triggering agent, specified in §3.3.
3.2 Per-flow states
draft → trajectory-extracted → bridged → paused → verified → woven → attenuation-verified
(PA) (PB) (PC) (PD) (PW global) (PV)
3.3 Transition table
| From → To | Trigger | Pre-condition | Agent dispatched | Post-condition |
|---|---|---|---|---|
| `draft` → `audited` | Operator includes book in active batch | `book.passes.P1_audit.status == "pending"` | Audit agent (FASTTRACK_EQUIVALENCE_PLAN §4 Pass 1) | `plan_path` §1.1–1.7 filled; ≥3 sources cited |
| `audited` → `gap-analysed` | Operator detects `audited` status | Plan §1 complete; docs/catalogs/CONCEPT_CATALOG.md, deps.json, connections.json accessible | Gap-analysis agent (Pass 2) | `plan_path` §2 filled; scorecard populated |
| `gap-analysed` → `production-ready` | Operator detects `gap-analysed` status | Plan §2 complete | Production-plan agent (Pass 3) | `plan_path` §3 filled; agent decomposition specified |
| `production-ready` → `in-production` | Operator dispatches production batch | Plan §3 lists ≥1 unit to produce; budget available | Production agent batch (Pass 4 — N parallel agents) | New unit/Lean files exist; pending integration |
| `in-production` → `production-complete` | Operator processes returned outputs | All Pass-4 agents returned successfully | (operator runs integration §4) | All produced units validate 27/27; CONCEPT_CATALOG/deps integrated |
| `production-complete` → `verified` | Operator detects `production-complete` | Pass-4 outputs integrated | Verification agent (Pass 5) | `plan_path` §4 filled; all 7 layer scorecards ≥ thresholds |
| `verified` → `woven` | Operator detects `verified` AND batch of related books all verified | Production-complete on adjacent books; weaving threshold met | Connection-weaving agent (Pass W on the batch) | Continuity gaps stitched; weaving_report exists |
| `woven` → `equivalence-covered` | Pass V continuity verification clears | Continuity metrics ≥ thresholds | Continuity-verification agent (Pass V) | Book signed off; ledger updated |
3.4 Failure transitions
Any pass can fail. Fail states are explicit:
| From | Fail trigger | Action |
|---|---|---|
| Any pass | Agent returns error or timeout | Status: <pass>.status = "failed"; operator inspects log; retry or escalate per §6 |
| Pass 4 | Unit fails 27/27 validation after 3 retry attempts | Status failed; route to "production-needs-rework" with specific defect list |
| Pass 5 | Layer scorecard < threshold | Status failed; route back to production-ready with specific gap list |
| Pass V | Continuity metrics below threshold | Route back to woven for further weaving; if 2nd weaving also fails, escalate |
4. The integration agent
When parallel production agents return outputs, shared-resource serialization is mandatory. The integration agent is a dedicated, single-threaded role that:
- Reads each returned agent's output (proposed CONCEPT_CATALOG entries, deps.json registrations, new unit files at known paths).
- Validates each new unit individually with `scripts/validate_unit.py`. Rejects (routes back to production) any that fail 27/27.
- Integrates accepted CONCEPT_CATALOG entries by inserting at the correct alphabetic / namespace position. Reads the file fresh before each edit (no stale caches).
- Integrates deps.json `nodes[]` and `_notes{}` entries. Adds DAG edges per the production plan §3.5.
- Integrates connection annotations: reads each unit's `connections:` frontmatter block, writes them into `manifests/connections.json` (proposing new entries via `connection_proposals.md` for orchestrator review if not already registered).
- Re-runs `scripts/validate_all.py` after each batch's integration to confirm global validity.
- Updates the campaign ledger.
The integration agent is always single-threaded across batches. Multiple production agents can run in parallel; only one integration agent runs at a time. This is the lock mechanism for shared resources without filesystem-level locks.
In the current setup (claude as operator), the integration role is filled by the operator manually editing the shared files. In an automated setup, it's a dedicated agent task with prompt template:
You are the integration agent for Codex.
Inputs:
- A batch of recently-returned agent outputs at <paths>
- The campaign ledger at manifests/campaign.json
- Current shared files: docs/catalogs/CONCEPT_CATALOG.md, manifests/deps.json, manifests/connections.json
For each returned output:
1. Validate the unit with ./.venv/bin/python scripts/validate_unit.py <path>
2. If validation fails (not 27/27), record failure in campaign log; do not integrate
3. If validation passes, insert proposed CONCEPT_CATALOG entry at the correct namespace/alphabetic position
4. Insert proposed deps.json node entry into nodes[] and _notes{} blocks
5. Add connection annotations to manifests/connections.json
6. Append unregistered connections to manifests/connection_proposals.md for orchestrator review
After all outputs processed:
7. Run ./.venv/bin/python scripts/validate_all.py
8. If 100% pass, update campaign ledger; else file rollback report
Output: integration_report.md listing every action taken and any failures.
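The control flow of steps 1–8 can be sketched as a single serialized function; the callables are stand-ins for `scripts/validate_unit.py`, the shared-file edits, and `scripts/validate_all.py` (the function name and signature are illustrative):

```python
def integrate_batch(outputs, validate, integrate_one, validate_all):
    """Single-threaded integration: validate each returned unit, integrate
    only the ones that pass, then confirm global validity at the end."""
    accepted, rejected = [], []
    for path in outputs:
        # Step 1-2: per-unit validation; failures are recorded, not integrated.
        (accepted if validate(path) else rejected).append(path)
    for path in accepted:
        # Steps 3-6: serialized writes to the shared catalog/manifest files.
        integrate_one(path)
    # Steps 7-8: global re-validation gates the ledger update.
    if not validate_all():
        raise RuntimeError("global validation failed; file rollback report")
    return accepted, rejected
```

Running this in one thread per batch is what gives the "lock without locks" property: no two writers ever touch the shared files concurrently.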
5. Per-batch coordination
5.1 Locking semantics
- Production agents can run in unlimited parallel — they write to their own unit files at disjoint paths (`content/<section>/<chapter>/<id>-<slug>.md`). They never touch shared files directly.
- The integration agent is the sole writer to shared files. Runs single-threaded.
- Watchdog and quality-sampling agents are read-only; they can run concurrently with anything.
- Verification and weaving agents modify per-unit / per-flow files but coordinate via the campaign ledger (only one weaving agent runs per book at a time, recorded as `current_owner`).
5.2 Batch composition
A batch is the unit of orchestrator dispatch. Sized empirically:
- Production batch: 4–8 parallel production agents, all working on units within the same per-book plan or related plans. Sized so the integration step takes ≤ 10 minutes after batch return.
- Audit batch: up to 12 parallel audit agents on different books (Tier α can be one batch; Tier β can be three).
- Verification batch: typically 1 per book, single-threaded.
5.3 Inter-batch sequencing
The state machine enforces sequencing per book (audit → gap → plan → production → verify → weave). Across books, batches can interleave. The campaign ledger holds the truth.
6. Failure recovery protocols
Each pass has explicit recovery rules. The operator detects failures via the ledger and runs the recovery.
Pass 1 — Audit failure
- Symptom: Agent returns "insufficient public sources" or audit §1 sections incomplete after timeout.
- Recovery:
  - Tier α/β: dispatch a second audit agent with an explicitly broader source list (course pages, PhD theses, reviews on JSTOR). Allow 2× the normal token budget.
  - Tier γ: if the second audit also fails, mark the book as `audit-degraded` — proceed with reduced expectations (topic-equivalent only, no reproduction-equivalent).
- Escalation: Human review at the 3rd failure.
Pass 2 — Gap analysis failure
- Symptom: Agent can't reconcile audit theorem inventory against shipped Codex.
- Recovery:
  - Most often resolved by an operator-side re-grep (the agent missed a unit). The operator manually greps the codebase and re-dispatches with a hint.
  - If there is genuinely no Codex coverage: this is the expected state for unfilled books, not a failure.
- Escalation: only if a Codex grep shows clear coverage that the agent's gap report doesn't reflect (an agent reading failure).
Pass 3 — Production plan failure
- Symptom: Agent produces decomposition that's poorly sized (one agent gets 30 units, others get 2).
- Recovery: Re-dispatch with explicit size guidance: "each sub-agent gets 4–8 units."
- Escalation: If 2nd attempt also imbalanced, operator manually re-balances.
Pass 4 — Production failure
- Symptom: Unit fails 27/27 validation.
- Recovery:
  - Common causes: prohibited phrasings, `lean_status: stub`, Beginner-tier symbols, missing Master sections. Re-dispatch the agent with a specific defect list and 1.5× token budget.
  - 2nd failure: reduce scope (instead of producing the full unit, produce just the failing section).
  - 3rd failure: escalate to the operator. Often a sign the unit topic is genuinely outside the agent's training-data range.
- Escalation: After the 3rd failure, the operator manually drafts the unit (or marks it as deferred to v1.x).
Pass 5 — Verification failure
- Symptom: Layer scorecard below threshold (e.g., theorem coverage < 95%).
- Recovery:
  - The verification agent emits a specific gap list: "theorems X, Y, Z not covered in any unit."
  - The operator routes the book back to `production-ready` with the gap list as additional Pass 4 input.
  - Passes 4 and 5 are re-run.
- Escalation: If the 2nd verification still fails, accept a reduced threshold for this book and document the deferral.
Pass W — Weaving failure
- Symptom: Agent returns "no continuity gaps found" but continuity metrics still below threshold.
- Recovery: Re-dispatch with specific metric failures highlighted; agent must address each named metric.
- Escalation: Multiple repeated weavings without metric improvement → operator inspects manually.
Pass V — Continuity verification failure
- Symptom: Continuity metrics below threshold despite weaving.
- Recovery: Loop weaving + verification up to 3 times. Then escalate to operator.
- Escalation: Operator may add explicit per-unit synthesis_claims that the weaving missed; re-runs Pass V.
7. The watchdog agent
A periodic agent that audits the campaign ledger for stuck books, dropped batches, and integration drift. Runs every ~12 hours (or every batch boundary) in autonomous mode.
Watchdog tasks:
- Identify books with `current_owner` set but no progress for > 4 hours → likely a stuck agent. Reset the owner; re-dispatch.
- Identify books in `in-production` for > 24 hours without batch return → orphaned batch. Investigate.
- Identify newly-created connection proposals in `manifests/connection_proposals.md` → flag for orchestrator integration.
- Compare `validate_all.py` output to the ledger's `equivalence-covered` count; flag discrepancies.
- Identify books that have sat at `verified` for > 7 days without weaving → schedule weaving.
- Compute total cost vs. the budget tier ceiling; alert if approaching the threshold.
- Identify quality-sampling reports flagging drift; schedule rework on flagged units.
Output: manifests/watchdog_report.md — list of issues + recommended actions.
The operator (or driver script) reads the watchdog report and acts on its recommendations.
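The first watchdog task (stuck-owner detection) might look like the sketch below, assuming the operator keeps a last-progress timestamp per book (not part of the ledger schema above; the function name is illustrative):

```python
import datetime

STUCK_AFTER = datetime.timedelta(hours=4)

def find_stuck_books(ledger, last_progress, now):
    """Return (book_id, owner) pairs where an owner is set but the book
    has shown no ledger progress for longer than STUCK_AFTER."""
    stuck = []
    for book_id, book in ledger["books"].items():
        owner = book.get("current_owner")
        if owner and now - last_progress[book_id] > STUCK_AFTER:
            stuck.append((book_id, owner))
    return stuck
```

The watchdog report would then list these pairs with the recommended action (reset owner, re-dispatch), leaving the decision to the operator.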
8. Quality sampling
Continuity verification (Pass V) catches structural defects but not subtle quality issues. The quality-sampling agent reads random samples of recent production and flags drift.
Quality-sampling agent:
- Samples 5% of recently-shipped units (random selection from last 7 days).
- For each sampled unit, checks:
- Prose density — does the Master section feel synthesized or list-y?
- Genius-prose presence — is the originator's framing actually quoted/contextualised, not just cited?
- Bridge coherence (for flow steps) — does the bridge advance an argument, or just announce the next topic?
- Connection invocation — are registered connections actually invoked in the prose, with anchor phrases?
- Tier appropriateness — is the Beginner section actually accessible, the Master section actually deep?
- Produces `manifests/quality_report.md` with the sampled units, scores per dimension (1–5), and specific defects.
Threshold: average sampled score ≥ 4.0/5 across dimensions. Below 4.0 → trigger rework on flagged units; investigate systematic causes.
The quality-sampling agent does not have authority to edit. It only reports. The operator decides on rework.
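The sampling rate and rework threshold can be expressed directly; a sketch (function names are illustrative, and scores are modeled as per-unit dicts over the five dimensions above):

```python
import random

def sample_units(recent_units, rate=0.05, seed=None):
    """Pick ~5% of recently-shipped units for quality review (at least one)."""
    k = max(1, round(len(recent_units) * rate))
    return random.Random(seed).sample(recent_units, k)

def needs_rework(scores):
    """True if the average sampled score across all dimensions is < 4.0/5."""
    flat = [s for unit_scores in scores for s in unit_scores.values()]
    return sum(flat) / len(flat) < 4.0
```

Seeding the sampler makes a given sampling run reproducible in the audit log; the agent only reports, so `needs_rework` gates an operator decision, not an edit.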
9. Cost ledger
Every agent invocation logs to manifests/cost_ledger.jsonl:
{"timestamp": "2026-04-29T15:23:00Z", "book": "lawson-michelsohn-spin-geometry", "pass": "P1_audit", "agent_id": "agent-a1", "model": "claude-opus-4-7", "input_tokens": 8500, "output_tokens": 3200, "cost_usd": 0.18, "duration_s": 215}
Aggregated rolling totals are reported in the campaign ledger's totals block. The watchdog watches for budget tier overruns:
| Tier | Budget ceiling per book | Total tier budget |
|---|---|---|
| α | $50 USD | $600 (12 books) |
| β | $30 USD | $720 (24 books) |
| γ | $20 USD | $560 (28 books) |
| Total campaign budget | | ~$2,256 |
Real numbers depend on model pricing and batch sizes; they will be calibrated empirically from the Lawson-Michelsohn pilot. Tiers above their ceiling either pause for operator review or auto-defer to a later phase.
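Aggregating the JSONL cost ledger against the per-book ceilings takes only a few lines; a sketch (the book-to-tier mapping is assumed to come from the campaign ledger, and the function names are illustrative):

```python
import json
from collections import defaultdict

# Per-book ceilings from the budget table above.
TIER_CEILING_PER_BOOK = {"α": 50.0, "β": 30.0, "γ": 20.0}

def per_book_cost(ledger_lines):
    """Sum cost_usd per book from cost_ledger.jsonl lines."""
    totals = defaultdict(float)
    for line in ledger_lines:
        rec = json.loads(line)
        totals[rec["book"]] += rec["cost_usd"]
    return dict(totals)

def over_ceiling(costs, tiers):
    """Books whose accumulated spend exceeds their tier's per-book ceiling."""
    return [b for b, c in costs.items() if c > TIER_CEILING_PER_BOOK[tiers[b]]]
```

The watchdog would run this aggregation each cycle and surface any `over_ceiling` books in its report.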
10. Convergence criteria
The campaign converges when:
- All Tier α books are `equivalence-covered` AND `continuity-verified`.
- All Tier α flows are `attenuation-verified`.
- Strand flows for all 9 strands are `attenuation-verified`.
- The synthesis flow `fasttrack-canonical` is `attenuation-verified`.
- Continuity metrics across the curriculum meet thresholds globally.
Per-tier convergence:
- Tier α complete = condition for v0.7 release
- Tier β complete = condition for v0.9 release
- Tier γ complete = condition for v1.0 release
Tier δ is optional per the equivalence plan.
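A per-tier convergence check against the campaign ledger might look like this sketch (only the book condition is shown; the flow conditions would be checked the same way against the `flows` block):

```python
def tier_complete(ledger, tier):
    """True when every book in the given tier has reached equivalence-covered."""
    books = [b for b in ledger["books"].values() if b["tier"] == tier]
    return bool(books) and all(
        b["status"] == "equivalence-covered" for b in books
    )
```

Release gating then reduces to `tier_complete(ledger, "α")` for v0.7, and so on down the tiers.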
11. The pilot — Lawson-Michelsohn
The first end-to-end autonomous run. Goal: validate the entire 7-pass protocol (P1 audit → P2 gap → P3 plan → P4 production → P5 verify → PW weave → PV continuity) on one book, end-to-end, with the operator (claude in this conversation) running each transition manually but without external intervention.
Why Lawson-Michelsohn:
- Tier α
- Codex already covers ~80% (the v0.x spin-geometry pilot strand)
- Estimated ~10 new units to produce, smaller than e.g. Hartshorne's 50+
- Subject is well-represented in agent training data
- The book is widely available; even web research can pull from solution sets, course pages, reviews
11.1 Pilot pass schedule
| Pass | Estimated time | Notes |
|---|---|---|
| P1 audit | 30–60 min | Background agent; web research |
| P2 gap analysis | 30 min | Read audit + Codex; produce coverage diff |
| P3 production plan | 30 min | Decompose into 2–4 production agents |
| P4 production (parallel) | 1–2 hours | 2–4 agents producing ~10 units total |
| P5 verification | 30 min | Per-book equivalence sign-off |
| PW weaving | 30 min | Stitch seams across new + existing |
| PV continuity verification | 30 min | Six metrics + samples |
| Total wall time | ~6 hours | Operator-active |
11.2 What the pilot validates
- Whether Pass 1 audit can produce a usable §1 from web sources alone (no book PDF)
- Whether Pass 4 production agents respect the connection brief and continuity scaffolding
- Whether the integration agent (operator) handles parallel returns cleanly
- Whether Pass V continuity metrics actually reflect quality
- Where the failure points are — informs failure-recovery refinement before scale
11.3 Pilot success criteria
- All 7 passes complete; book reaches
equivalence-coveredANDcontinuity-verified - New units produced validate 27/27
- Continuity metrics meet thresholds
- Total cost < $50 (Tier α budget)
- Total wall time < 8 hours operator-active
- Quality-sample score ≥ 4.0/5 on the produced units
If all criteria met → protocol scales to Hartshorne next, then the rest of Tier α in batches. If any fail → patch the gap, re-pilot before scaling.
12. The full stack — final
╔═════════════════════════════════════════════════════════════════════╗
║ Layer 5: Orchestration ← THIS DOCUMENT ║
║ Campaign ledger · State machine · Integration agent · Watchdog ║
║ Failure recovery · Quality sampling · Cost ledger ║
║ ↓ executes the layers below autonomously ║
╠═════════════════════════════════════════════════════════════════════╣
║ Layer 4: Continuity & connection scaffold ║
║ Connection registry · Per-unit / per-flow annotations · Briefs ║
║ ↓ binds parallel agent output into continuous teaching ║
╠═════════════════════════════════════════════════════════════════════╣
║ Layer 3: Flows ║
║ Author · Strand · Synthesis · Themed ║
╠═════════════════════════════════════════════════════════════════════╣
║ Layer 2: Equivalence plans ║
║ Per-book audit + coverage + production ║
╠═════════════════════════════════════════════════════════════════════╣
║ Layer 1: DAG ║
║ manifests/deps.json ║
╠═════════════════════════════════════════════════════════════════════╣
║ Layer 0: Units ║
║ Three-tier reference articles ║
╚═════════════════════════════════════════════════════════════════════╝
Layer 5 is the executive. With it, the campaign runs autonomously across 64 books, with the operator intervening only for escalations the watchdog surfaces. Without it, the layers below are well-designed but stationary.
The pilot starts with Lawson-Michelsohn, runs through the 7 passes operator-driven (claude as operator), produces an equivalence-covered, continuity-verified Lawson-Michelsohn equivalent, and informs every subsequent book's run with the empirical lessons. From there, Tier α, β, γ in turn — bounded by budget, watchdog'd for stalls, sampled for quality, audit-logged for accountability.
Plan v1. To be refined after Lawson-Michelsohn pilot conclusion.