scafld

Spec-driven orchestration for AI coding agents. Every task becomes a YAML spec before a line of code changes.

TypeScript · AI Agents · Orchestration

scafld enforces a constraint that should be obvious: think before you type.

Every non-trivial task becomes a YAML specification before a single line of code changes. The spec defines what will change, in what order, with what acceptance criteria, and how to roll it back if it breaks. A human reviews and approves the spec. Only then does the agent execute: phase by phase, validated at every checkpoint, auditable after the fact.

This is the same separation of planning from execution that every serious engineering discipline has always required, applied to AI coding agents. The spec is a YAML file. The CLI is a Python script. The prompts are markdown. Any agent that can read files and execute shell commands can run the full workflow: Claude Code, Cursor, Copilot, Windsurf, whatever.

The spec as contract

The spec is a contract between what was requested and what gets delivered.

Before any code changes, there must be a machine-readable YAML artifact that defines precisely what will change. The human approves the plan, not the outcome. Once approved, the agent operates autonomously within those bounds. Another agent, or a human, can pick up the same spec and execute it identically. Prompts are ephemeral. Specs are artifacts.

A spec declares:

  • Task definition - objectives, scope boundaries (in/out), assumptions, size, risk level
  • Context - packages affected, files impacted with line ranges, architectural invariants that must be preserved
  • Touchpoints - every system, module, or adapter the change will touch
  • Risks - what could go wrong, impact level, mitigation plan
  • Phases - ordered execution steps, each with file-level change declarations and acceptance criteria
  • Rollback - per-phase undo commands so failure is recoverable, not catastrophic
  • Definition of done - explicit checklist items that get checked off during execution
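
Under those headings, a spec skeleton might look roughly like this. This is a hedged sketch: the field names and layout are illustrative, not scafld's actual schema.

```yaml
# Hypothetical spec skeleton -- field names are illustrative.
task:
  id: add-error-codes
  objective: Return structured error codes from the public API
  scope: { in: [api], out: [frontend, sdk] }
  size: small
  risk: medium
context:
  files:
    - { path: src/api/errors.py, lines: "1-80" }
  invariants: [public_api_stability]
risks:
  - what: callers parse raw error messages
    impact: medium
    mitigation: keep the message field alongside the new code
phases:
  - id: 1
    changes: [src/api/errors.py]
    acceptance: ["pytest tests/api -q"]
    rollback: "git checkout -- src/api/errors.py"
definition_of_done:
  - all callers updated to read the code field
```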

Planning mode

Planning is a structured exploration cycle that the agent runs conversationally with the developer:

  1. THOUGHT - interpret the request in repo terms, identify unknowns
  2. ACTION - search the codebase, read files, check diffs to answer those unknowns
  3. OBSERVATION - capture what was learned: files, invariants, risks, dependencies
  4. THOUGHT - update the spec, ask clarifying questions when information is missing
  5. REPEAT until all required fields are filled and assumptions are explicit

The agent is in read-only mode during planning. It can explore anything but change nothing outside .ai/specs/. If planning gets blocked on missing information, the spec saves with status: under_review and the agent tells you exactly what it needs. Max 20 cycles; if it’s still uncertain, it documents assumptions and moves on.

This is how you get specs that are executable by another agent without any additional back-and-forth. The planning loop does the work upfront so execution doesn’t have to guess.
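
The control flow of that cycle can be sketched in a few lines of Python. This is a toy model under stated assumptions: the spec shape, the `explore` callback, and the status strings are illustrative, not scafld internals.

```python
# Toy sketch of the planning cycle. THOUGHT/ACTION/OBSERVATION map onto
# the numbered steps above; `explore` stands in for the agent's repo search.
MAX_CYCLES = 20  # matches the documented cycle budget

def plan(spec, explore):
    unknowns = []
    for _ in range(MAX_CYCLES):
        # THOUGHT: which required fields are still unknown?
        unknowns = [f for f in spec["required"] if spec["fields"].get(f) is None]
        if not unknowns:
            return spec, "ready"  # all required fields filled
        for field in unknowns:
            # ACTION -> OBSERVATION: search the codebase to answer the unknown
            answer = explore(field)
            if answer is not None:
                spec["fields"][field] = answer  # THOUGHT: update the spec
    # Cycle budget exhausted: document assumptions and move on.
    spec["assumptions"] = [f"assumed default for {f}" for f in unknowns]
    return spec, "under_review"
```

Note how the two exit paths mirror the text: a fully filled spec is ready, while a blocked one saves as under_review with its assumptions made explicit.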

Lifecycle and execution

                    ┌─── request changes ───┐
                    ▾                       │
  draft ──▸ under_review ──▸ harden ──▸ approved ──▸ in_progress ──▸ review ──▸ completed
                             (optional)     │             │            │
                                          HUMAN       phase loop   adversarial
                                          GATE    ┌───────────────┐  review
                                                  │ apply changes │    │
                                                  │ run criteria  │  failed ──▸ fix
                                                  │ record result │    │          │
                                                  └───────────────┘    └─────────┘

                                                  failed ──▸ rollback ──▸ resume

The filesystem is the state machine. Specs physically move between directories as they progress:

.ai/specs/
  drafts/          planning in progress
  approved/        human-reviewed, ready for execution
  active/          currently executing
  archive/YYYY-MM/ completed, failed, or cancelled

Each transition is enforced by the CLI. You can’t skip the approval gate. You can’t execute a draft.

Execution is phase-by-phase. Each phase reads the spec, applies changes, runs every acceptance criterion, and records pass/fail results with timestamps directly into the spec file. If a criterion fails, the phase rolls back independently; completed phases stay intact. If execution gets interrupted, the resume protocol picks up from the first pending or failed phase, not from scratch.
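
The per-phase protocol can be modeled as a small loop. A toy sketch under stated assumptions: the phase shape (callables for apply, criteria, and rollback) and the status strings are illustrative, not scafld's schema.

```python
# Phase loop: apply, check criteria, record result. A failed phase rolls
# back alone; passed phases are skipped on resume.
def execute(spec):
    for phase in spec["phases"]:
        if phase.get("status") == "passed":
            continue  # resume protocol: completed phases stay intact
        phase["apply"]()
        if all(criterion() for criterion in phase["criteria"]):
            phase["status"] = "passed"
        else:
            phase["rollback"]()  # this phase only; earlier phases keep
            phase["status"] = "failed"
            return "failed"
    return "completed"
```

Calling `execute` again after a failure re-runs only the failed phase, which is the resume behavior described above.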

Harden

Between planning and approval, scafld harden <task-id> interrogates the draft one grounded question at a time. The agent walks down the design tree and resolves upstream decisions before downstream ones, so questions are never wasted on premises that may shift. Every question and every recommended answer must cite its source using one of three patterns: a spec gap at a named field, a verified code location (file and line), or an archived spec precedent. Ungrounded questions are forbidden; invented citations are forbidden.

This is optional and operator-driven. scafld approve does not consult harden status. Run it on high-risk or ambiguous specs; skip it on trivial ones.

Validation

Not all changes deserve the same level of verification. scafld scales validation in proportion to risk:

  • Light (micro/small, low risk) - compile check + acceptance criteria only
  • Standard (medium risk) - add targeted tests per phase, full test suite + linter + typecheck + security scan before commit
  • Strict (high risk) - everything in standard, plus boundary checks per phase to catch cross-module side effects

The profile derives from the task’s risk level or can be set explicitly.
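
That derivation fits in one function. A sketch mirroring the bullets above; the mapping and the explicit-override behavior are read from the text, but the function itself is illustrative.

```python
# Risk level -> validation profile, with an explicit setting taking priority.
def validation_profile(risk, explicit=None):
    if explicit is not None:
        return explicit  # operator set the profile directly
    return {"low": "light", "medium": "standard", "high": "strict"}[risk]
```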

Scope auditing

scafld audit diffs actual git changes against the files declared in the spec. If the agent touched files it didn’t declare, the audit flags it:

scafld audit add-error-codes -b main
# Scope drift: 12% (3/25 files undeclared) ── exit 1

Three categories: declared and changed (green), changed but not in spec (red: scope creep), and in spec but not changed (yellow: incomplete).
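
The audit is set arithmetic over two file lists. A sketch of the comparison; the drift formula (undeclared over total changed) is an assumption inferred from the example output above.

```python
# Diff the declared file set against the files git actually saw change.
def audit(declared, changed):
    matched    = declared & changed   # green: declared and changed
    undeclared = changed - declared   # red: scope creep
    untouched  = declared - changed   # yellow: incomplete
    drift = len(undeclared) / len(changed) if changed else 0.0
    return matched, undeclared, untouched, drift
```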

Adversarial review

Ask an AI “how did you do?” and it says great. Always 8 or 9 out of 10.

Ask it “what’s wrong with this?” and it actually finds things. Real things: a missing null check on line 47, a caller that still assumes the old shape of a parameter that just changed, a hardcoded value that should come from config. The same model that rubber-stamps its own work will genuinely tear it apart when you frame the task as critique instead of self-assessment.

scafld structures this. After execution, scafld review runs automated passes, scaffolds a machine-validated review artifact, and records review provenance. The agent reads its own diff through three lenses:

  1. Regression hunt - for each modified file, check all callers and importers. What breaks?
  2. Convention violations - read the project’s documented rules. What did you violate?
  3. Defect scan - hardcoded values, off-by-one errors, missing boundary checks, race conditions, copy-paste bugs, unhandled error paths.

Every finding cites a file and line number. Findings are blocking (must fix) or non-blocking (should fix). The review produces a verdict.
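
A review round artifact might carry findings shaped like the following. This is a hypothetical sketch; scafld's actual review schema may differ.

```yaml
# Illustrative review-round entry -- field names are assumptions.
round: 1
findings:
  - lens: regression_hunt
    file: src/api/errors.py
    line: 47
    blocking: true
    note: missing null check on the new parameter
verdict: failed
```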

scafld complete reads the latest review round, validates its structure, records the verdict into the spec, and archives. If the round is missing, malformed, incomplete, or failed, it refuses. The only bypass is an exceptional human-reviewed override with an audited reason and interactive confirmation.

This works because critique is cognitively easier than creation. When building, the agent optimises for completion. When reviewing, it optimises for finding flaws. The separation is what makes the honesty structural, not a prompt trick.

Self-evaluation

Agents also score their work against a weighted rubric (completeness, architecture fidelity, spec alignment, validation depth). Below 7/10 triggers a mandatory second pass. Scores above 8 without noting deviations get flagged; a perfect score with no self-criticism is a rubber stamp.
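
As a sketch, the rubric logic might look like this. The dimension names and thresholds come from the text; the weights and return labels are assumptions.

```python
# Weighted self-evaluation with the two documented tripwires: a low total
# forces a second pass, and a high total with no noted deviations is flagged.
WEIGHTS = {"completeness": 0.3, "architecture_fidelity": 0.25,
           "spec_alignment": 0.25, "validation_depth": 0.2}

def self_eval(scores, deviations):
    total = sum(WEIGHTS[k] * scores[k] for k in WEIGHTS)
    if total < 7:
        return total, "second_pass_required"
    if total > 8 and not deviations:
        return total, "flagged_rubber_stamp"  # perfect score, no self-criticism
    return total, "accepted"
```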

Guardrails

Safety controls

Some actions require human approval regardless of the spec: schema migrations, public API changes, data deletion, production deployments. These are defined in config.yaml and enforced during execution.

scafld also automatically prevents common security violations: hardcoded secrets, unbounded queries, SQL injection patterns, XSS vulnerabilities. The security scan runs as part of the standard and strict validation profiles.

Invariants

Non-negotiable architectural rules the agent cannot violate regardless of the task:

  • Domain boundaries - services stay in their layers, no circular dependencies
  • No legacy fallbacks - no dual-reads, dual-writes, or runtime shims. Migrate immediately with a one-off script
  • Public API stability - HTTP contracts and event schemas don’t change without explicit approval
  • Config from environment - never hardcoded
  • No test logic in production - fixtures and mocks stay in test files

These are customisable per project. You define your own invariants in AGENTS.md and reference them by name in config.yaml. Every spec declares which invariants it must preserve.

If the task requires violating an invariant, the agent pauses and asks. Non-optional.
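
Pulling the guardrails together, a config.yaml fragment might look like this. The key names are illustrative assumptions, not scafld's documented schema.

```yaml
# Hypothetical config.yaml fragment combining safety controls and invariants.
approval_required:
  - schema_migration
  - public_api_change
  - data_deletion
  - production_deploy
invariants:            # names defined in AGENTS.md, referenced here
  - domain_boundaries
  - no_legacy_fallbacks
  - public_api_stability
  - config_from_environment
  - no_test_logic_in_production
```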

Workspace support

For projects with multiple codebases (an API, a frontend, an SDK, an MCP server), the workspace pattern gives the agent visibility across all of them from a single root. Create a root repo, add your codebases as git submodules, run scafld init. The root holds the orchestration layer and the agent sees the whole picture.

scafld init                 Scaffold workspace (copies templates, creates directories)
scafld new <task>           Create a spec (scaffold in drafts/)
scafld harden <task>        Optional: interrogate the draft with grounded questions
scafld approve <task>       Human approval gate (drafts/ -> approved/)
scafld start <task>         Begin execution (approved/ -> active/)
scafld exec <task>          Run acceptance criteria, record results
scafld exec <task> -p phase Run criteria for a specific phase
scafld audit <task> -b main Scope drift check against git
scafld diff <task>          Show git history for a spec
scafld review <task>        Run automated passes + generate adversarial review prompt
scafld complete <task>      Read review, record verdict, archive (requires passing review)
scafld complete <task> --human-reviewed --reason "manual audit"
                             Exceptional audited override when the review gate is blocked
scafld fail <task>          Archive as failed
scafld cancel <task>        Archive as cancelled
scafld status <task>        Review spec details and progress
scafld list [filter]        List specs by state
scafld validate <task>      Check spec against JSON schema
scafld report               Aggregate stats across all specs

scafld report aggregates pass rates, self-eval scores, scope drift, size/risk distributions, and monthly activity across your entire spec history. Completed specs archive to .ai/specs/archive/ with full execution logs, self-evaluation scores, acceptance criteria results, and git diffs. Permanent audit trail: when something breaks in production six months from now, you trace back to the spec that approved it.