What a Real Agentic Delivery Cycle Looks Like

"AI agents can ship software autonomously." That claim gets made a lot. It usually comes with a screenshot of a chat interface, a vague description of the workflow, and no verifiable artifacts. You cannot check the commit history. You cannot inspect the ticket. You cannot run the verification commands yourself.

This post is the other kind. I am going to trace one complete delivery cycle from start to finish — ticket selection through merged commit — using real artifacts from this repo. Every YAML block, every commit hash, every command output is from an actual run. The example is H6-23, which corrected inaccurate language in two files on this site. It is a small, well-scoped ticket. That is exactly why it is a good example.

The Setup

Autopilot runs on a loop. Each cycle is one task: select the highest-priority unblocked ticket, deliver it, verify the result, merge to main, write a checkpoint, update state. Then stop.

The control files are:

autopilot/goal.yaml — what the project is for and what "done" looks like
autopilot/policy.yaml — operational parameters (branch strategy, min ready tasks, escalation triggers)
autopilot/status.json — mutable run state written after every cycle
autopilot/approvals.jsonl — append-only log of human escalations

The tickets live in .tickets/open/ as YAML. Execution plans live in .plans/. After delivery, tickets move to .tickets/closed/. The paper trail is the system.

Step 1: Ticket Selection

At the start of each cycle, the agent reads every open ticket and picks the highest-priority unblocked one. "Unblocked" means three things: the ticket has status: ready, none of its depends_on entries are still in .tickets/open/, and its task_failures count in status.json is below 2.

The priority ordering, applied in sequence:

Which task unblocks the most downstream work?
Which task has the most concrete execution plan?
Which task has no escalation-risk markers?
Which task is the smallest safe slice?

For that cycle, H6-23 was the clear pick. Priority p1 (the highest), no blockers, concrete scope, content change with low implementation risk.
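The selection rule can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the repo's actual selector: the Ticket class and its scoring fields (unblocks, plan_concreteness, escalation_risk, size) are hypothetical stand-ins for the four ordering questions, and the ticket's priority field is left out.

```python
from dataclasses import dataclass, field


@dataclass
class Ticket:
    id: str
    status: str
    depends_on: list = field(default_factory=list)
    # Hypothetical scoring fields, one per ordering question.
    unblocks: int = 0            # downstream tickets waiting on this one
    plan_concreteness: int = 0   # higher means a more concrete plan
    escalation_risk: bool = False
    size: int = 1                # smaller means a smaller safe slice


def is_unblocked(ticket, open_ids, task_failures):
    """Apply the three conditions: ready, no open dependency, fewer than 2 failures."""
    if ticket.status != "ready":
        return False
    if any(dep in open_ids for dep in ticket.depends_on):
        return False
    return task_failures.get(ticket.id, 0) < 2


def select_ticket(tickets, task_failures):
    """Return the best unblocked ticket, or None when nothing is eligible."""
    open_ids = {t.id for t in tickets}
    candidates = [t for t in tickets if is_unblocked(t, open_ids, task_failures)]
    if not candidates:
        return None
    # Tuple comparison applies the four questions in sequence.
    return min(
        candidates,
        key=lambda t: (-t.unblocks, -t.plan_concreteness, t.escalation_risk, t.size),
    )
```

Sorting on a tuple keeps the "applied in sequence" behavior: a later question only matters when every earlier one ties.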

Here is the actual ticket (abbreviated to the key fields):

id: H6-23
title: "Reframe how-i-work page and blog posts: autonomous agent, human-on-the-loop"
type: content
priority: p1
status: ready
 
problem: |
  The /how-i-work page uses 'human-in-the-loop' language throughout — implying
  the human reviews every PR and approves normal delivery. This is inaccurate.
  The real model is human-on-the-loop: fully autonomous delivery by default,
  human only invoked on genuine escalation triggers.
 
  src/content/blog/how-this-site-was-built.mdx also contains a factually wrong
  paragraph claiming a human reviews and approves every PR. This is no longer true.
 
scope_in:
  - "src/app/how-i-work/page.tsx — full language pass"
  - "src/content/blog/how-this-site-was-built.mdx — fix the wrong paragraph"
 
acceptance_criteria:
  - id: AC-1
    text: "No remaining instances of 'human-in-the-loop' in how-i-work page"
    verify: "! grep -i 'human-in-the-loop' src/app/how-i-work/page.tsx"
  - id: AC-2
    text: "Page uses 'human-on-the-loop' framing in at least 2 places"
    verify: "grep -ic 'human-on-the-loop' src/app/how-i-work/page.tsx | grep -v '^0$'"
  - id: AC-3
    text: "Workflow step 05 no longer says 'Human Review'"
    verify: "! grep 'Human Review' src/app/how-i-work/page.tsx"
  - id: AC-4
    text: "Blog post no longer contains the inaccurate PR review paragraph"
    verify: "! grep 'Hank.*reviews and approves' src/content/blog/how-this-site-was-built.mdx"
  - id: AC-5
    text: "Build passes"
    verify: "npm run build"
 
verification_required:
  - "npm run build"

Three things worth noting here. First, each acceptance criterion has a machine-runnable verify command. Not "check that it looks right" — an actual shell command that returns 0 or nonzero. Second, the scope_in is explicit. The ticket does not say "fix the language on the site." It names the files. Third, verification_required is separate from the AC list — it is the build verification that has to pass before anything merges.
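One detail about AC-2 is worth spelling out. Its text asks for "at least 2 places", but grep -c counts matching lines, and the trailing grep -v '^0$' only rejects a count of zero, so the command as written passes with a single matching line. A stricter check might look like the sketch below, assuming "places" means individual occurrences. The function names are hypothetical and not part of the ticket.

```python
import re


def count_occurrences(text, phrase):
    """Count case-insensitive occurrences of phrase, including repeats on one line."""
    return len(re.findall(re.escape(phrase), text, flags=re.IGNORECASE))


def meets_minimum(text, phrase, minimum=2):
    """True when phrase appears at least `minimum` times in text."""
    return count_occurrences(text, phrase) >= minimum
```

For H6-23 the difference did not matter, since the verification run reported a count of 4.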

Step 2: Execution Plan

Before writing a line of code, the agent reads the execution plan in .plans/H6-23-execution-plan.md. The plan is more concrete than the ticket: specific file targets, pre-work to do first, exact language suggestions for the rewrites, slice boundaries.

A section from the actual plan:

## Slice Execution
 
### S1 — Update how-i-work page language to human-on-the-loop, autonomous-by-default
 
Tone guidance: confident and direct. Not 'the agent tries to work autonomously.'
Say 'the agent ships autonomously.' Lead with the strong claim.
 
Suggested intro subtitle: "Fully autonomous delivery — the agent plans, implements,
verifies, and merges to production. The human sets direction and is available for
genuine blockers. That's it."
 
Step 05 suggestion: title "Escalation", description "When autopilot hits a defined
trigger — product ambiguity, scope expansion, repeated failure, credentials — it
stops and queues a pending approval. Routine delivery skips this step entirely
and merges directly to main."
 
### S2 — Fix inaccurate PR review paragraph in how-this-site-was-built.mdx
 
The paragraph to replace is at line ~90. Replace with something like:
"After verification passes, the agent commits, merges to main, and pushes directly
— no PR review queue, no human approval gate."

The plan has two slices: S1 for the page rewrite, S2 for the blog post fix. They are independent — no dependency between them. Each has a clear file target and language guidance. There is no ambiguity about what constitutes a correct implementation.

The plan is the contract. If something outside the plan appears necessary during implementation, the correct move is to stop and update the ticket, not to silently expand scope.
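A scope guard along these lines could enforce that rule mechanically. It is a minimal sketch that assumes scope_in entries keep the "path — description" shape shown in the ticket. Both function names are hypothetical.

```python
def scoped_paths(scope_in):
    """Extract the file path from entries shaped like 'path — description'."""
    return {entry.split(" — ", 1)[0].strip() for entry in scope_in}


def out_of_scope(changed_files, scope_in):
    """Return changed files that the ticket's scope_in does not name."""
    allowed = scoped_paths(scope_in)
    return sorted(set(changed_files) - allowed)
```

The changed_files list could come from something like git diff --name-only main. A non-empty result would be the signal to stop and update the ticket.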

Step 3: Implement

Implementation follows the plan. The agent reads the target files first — the full page component and the blog post — to understand what needs to change. Then it makes the minimal diff to satisfy the acceptance criteria.

For S1, the changes were:

Replaced all instances of "human-in-the-loop" with "human-on-the-loop" in src/app/how-i-work/page.tsx
Rewrote step 05 from "Human Review" to "Escalation" with an updated description clarifying it only fires on defined triggers
Rewrote the principle card body to be direct about autonomous delivery
Updated the section intro paragraph to lead with the strong claim

For S2, the change was a surgical replacement of one wrong paragraph in how-this-site-was-built.mdx. The old paragraph claimed the agent pushes a branch, creates a PR, and waits for Hank's review. The new paragraph describes what actually happens: the agent verifies, merges to main, and pushes autonomously.

Total diff: two files, roughly 40 lines changed. No new files. No new dependencies.

Step 4: Verify

After implementation, every command in verification_required runs. Then every verify command in acceptance_criteria runs. All must pass before anything is staged or committed.

Actual verification run for H6-23:

# AC-1: no human-in-the-loop in page
! grep -i 'human-in-the-loop' src/app/how-i-work/page.tsx
# exit 0 ✓
 
# AC-2: at least 2 uses of human-on-the-loop
grep -ic 'human-on-the-loop' src/app/how-i-work/page.tsx | grep -v '^0$'
# 4
# exit 0 ✓
 
# AC-3: step 05 not labeled Human Review
! grep 'Human Review' src/app/how-i-work/page.tsx
# exit 0 ✓
 
# AC-4: wrong paragraph gone from blog post
! grep 'Hank.*reviews and approves' src/content/blog/how-this-site-was-built.mdx
# exit 0 ✓
 
# AC-5 + verification_required: build passes
npm run build
# ✓ Compiled successfully
# Route (app)                              Size     First Load JS
# ...22 routes compiled...
# exit 0 ✓

All green. This is when the changes get staged and committed. Not before.
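The gate itself is simple to express. The sketch below assumes the ticket has already been loaded into a dict (for example with a YAML parser) and that commands run under the system shell on a POSIX machine. run_check and verify_ticket are hypothetical names, not the repo's actual runner.

```python
import subprocess


def run_check(command):
    """Run one shell command and report whether it exited 0."""
    # shell=True is needed because the ticket's commands use `!` negation and pipes.
    result = subprocess.run(command, shell=True, capture_output=True, text=True)
    return result.returncode == 0


def verify_ticket(ticket):
    """Return the failed commands and AC ids; an empty list means safe to commit."""
    failed = []
    for command in ticket.get("verification_required", []):
        if not run_check(command):
            failed.append(command)
    for criterion in ticket.get("acceptance_criteria", []):
        if not run_check(criterion["verify"]):
            failed.append(criterion["id"])
    return failed
```

With the H6-23 ticket this sketch would run npm run build twice, once for verification_required and once for AC-5. A real runner might cache results for repeated commands.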

The acceptance criteria are not a checklist for a human to tick off after reading the diff. They are machine assertions that run and either pass or fail. If any fail, the cycle stops: the failure is recorded in .memory/failures/, task_failures[H6-23] is incremented in status.json, and the ticket stays ready for the next run.

If a task fails twice, it is blocked until a human resolves it. That is the two-strike policy. It exists because a task that fails twice probably has something wrong with the approach, not just the implementation.
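The failure bookkeeping can be sketched as follows. This assumes status.json holds a task_failures object keyed by ticket id, as in the excerpt later in this post. The helper names and the MAX_FAILURES constant are hypothetical.

```python
import json
from pathlib import Path

MAX_FAILURES = 2  # the two-strike threshold


def record_failure(status_path, task_id):
    """Increment task_failures[task_id] in status.json and return the new count."""
    path = Path(status_path)
    status = json.loads(path.read_text())
    failures = status.setdefault("task_failures", {})
    failures[task_id] = failures.get(task_id, 0) + 1
    path.write_text(json.dumps(status, indent=2) + "\n")
    return failures[task_id]


def is_blocked(status, task_id):
    """True once a task has reached the two-strike threshold."""
    return status.get("task_failures", {}).get(task_id, 0) >= MAX_FAILURES
```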

Step 5: Merge

Verification passed. The agent commits and merges.

git add src/app/how-i-work/page.tsx src/content/blog/how-this-site-was-built.mdx
git commit -m "feat(content): reframe how-i-work and blog post as human-on-the-loop autonomous delivery (H6-23)"
git checkout main
git merge --no-ff h6-23-human-on-the-loop-language
git push origin main

From the actual git log:

b36d74a merge: H6-23 human-on-the-loop language reframe
a290f78 feat(content): reframe how-i-work and blog post as human-on-the-loop autonomous delivery (H6-23)

No PR was created. No review was requested. No human approved the merge. The ticket defined the acceptance criteria, the verification commands ran and passed, and the agent merged.

This is the part that makes people uncomfortable. "You don't get a second opinion?" The second opinion is the acceptance criteria. They are written before implementation starts, by whoever defined the ticket — which in this case is the same system, operating from the project goal. If the criteria are wrong, that is an upstream problem with the goal definition, not a reason to add a manual gate to every routine delivery.

Step 6: Checkpoint

After merge, the agent writes two things.

First, the checkpoint at .memory/session-checkpoints/2026-03-20T17-38.md:

# Autopilot Checkpoint — 2026-03-20T17:38:00Z
 
## Run Result
success
 
## Action Taken
deliver H6-23
 
## Task
H6-23
 
## Notes
Corrected human-in-the-loop language in how-i-work page and how-this-site-was-built.mdx.
All ACs passed. Build clean. Merged to main. 4 uses of human-on-the-loop in page.
 
## Next Run Hint
Continue with next ready ticket. H6-24 or H6-25 likely candidates.

Second, status.json:

{
  "last_delivered_task": "H6-23",
  "completed_since_reflect": 3,
  "task_failures": {},
  "last_run_at": "2026-03-20T17:38:00Z",
  "last_run_result": "delivered",
  "last_run_summary": "Delivered H6-23: human-on-the-loop language reframe. Build passed, merged to main."
}

The checkpoint is the paper trail. If something goes wrong in a subsequent run, the checkpoint history shows exactly what each run did, in what order, and with what result. If there is a gap in the timeline, that is itself a signal.
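Both writes are easy to derive from a single timestamp. The sketch below infers the checkpoint filename pattern from the one example above, and it assumes completed_since_reflect goes up by one per delivery. Both are assumptions, and the function names are hypothetical.

```python
from datetime import datetime, timezone


def checkpoint_name(now):
    """Build a filename like 2026-03-20T17-38.md from a UTC datetime."""
    return now.strftime("%Y-%m-%dT%H-%M") + ".md"


def update_status(status, task_id, summary, now=None):
    """Return a new status dict reflecting one successful delivery."""
    now = now or datetime.now(timezone.utc)
    updated = dict(status)
    updated["last_delivered_task"] = task_id
    # Assumption: the counter increases by one per delivered task.
    updated["completed_since_reflect"] = status.get("completed_since_reflect", 0) + 1
    updated["last_run_at"] = now.strftime("%Y-%m-%dT%H:%M:%SZ")
    updated["last_run_result"] = "delivered"
    updated["last_run_summary"] = summary
    return updated
```

Returning a new dict instead of mutating the input keeps the previous state available for comparison if a write fails partway.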

The Skeptic's Question

Can the agent get it wrong?

Yes. That is why the failure path exists.

If a verification command fails, the task is not silently skipped. The failure is recorded in .memory/failures/H6-23-<timestamp>.md with the error output and diagnostic context. task_failures["H6-23"] is incremented. The ticket stays ready. The next cycle retries.

If it fails a second time, the two-strike policy fires. A pending entry is written to autopilot/approvals.jsonl:

{
  "id": "appr_<timestamp>",
  "status": "pending",
  "task_id": "H6-23",
  "reason": "Second failed delivery attempt. Needs human review.",
  "created_at": "2026-03-20T17:38:00Z"
}

Every subsequent cycle checks this file first. A pending approval is a hard stop. The system does not continue until a human sets the entry to "status": "resolved" and resets the failure count.
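The start-of-cycle gate might look like the sketch below. It assumes a resolution can appear either as an in-place edit or as a later line with the same id, so the last line recorded for each id wins. That reading is an assumption, and pending_approvals is a hypothetical name.

```python
import json


def pending_approvals(jsonl_text):
    """Return entries whose latest recorded status is still 'pending'."""
    latest = {}
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        latest[entry["id"]] = entry  # later lines for the same id win
    return [e for e in latest.values() if e.get("status") == "pending"]
```

A cycle would call this first and stop immediately if the returned list is non-empty.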

There are other escalation triggers besides repeated failure: product ambiguity that would require spec changes, scope expansion beyond the ticket, credential work, destructive migrations, prod infrastructure risk. Any of these stops the cycle and adds a pending approval.

The human is not reviewing routine commits. The human is the escalation target for the edge cases that actually need judgment. That is a meaningful distinction.

What This Changes

The common framing for AI-assisted development is: "the AI writes code, the human reviews and approves." That model is sensible and safe. It is also a bottleneck.

The autopilot model is: "the AI selects, plans, implements, verifies, and ships. The human sets direction and is the escalation target for defined triggers." The human is not removed from the loop. The human is on the loop: overseeing the work without being the bottleneck.

In practice this means: Hank does not review the H6-23 diff before it ships. He reads the commit history, reads the posts, and steers future direction based on what he observes. If something is wrong, he updates the goal or policy, writes a corrective ticket, or edits the approval log. The feedback mechanism is real. It just does not require synchronous review of every routine delivery.

For a portfolio site with well-defined acceptance criteria and low-stakes content changes, that trade is clearly correct. The same model can extend to higher-stakes domains, but only with correspondingly tighter verification: the commands have to get more thorough as the cost of a wrong merge goes up.

Every claim made in this post is verifiable. The ticket is in .tickets/closed/H6-23.yaml. The commits are in the repo's git log. The checkpoint is in .memory/session-checkpoints/. The acceptance criteria commands run against the live codebase. If anything here looks wrong, check the source.

That is what "real agentic delivery" means.