This is a follow-up to Claude Agents + Worktrees. There I got agents running in parallel, each in its own worktree with the right context. That solved the input side. It said nothing about what happens to the code once it lands.

Four agents producing four PRs drops four reviews on one person at once, and review doesn’t parallelize. On x.com I’ve seen posts about “loop engineering” that point at the same problem. So I sat down with Claude and built two skills around it. ship-task ships one task end to end. ship-loop runs ship-task over and over until it hits a stop condition.

Where the Leverage Actually Is

Before building, I wanted to be clear-eyed about this. I think it’s easy to get wrong.

The real bottleneck isn’t writing code. It’s reviewing it. Parallelism scales with how independent the tasks are, not how many you can launch. Genuinely independent work is rarer in a real codebase than it looks. So agentic parallelism mostly moves my bottleneck from coding to reviewing. That’s fine, but it’s the reason a harness matters more than more agents.

A review agent built on the same model as the developer agent shares its blind spots. It’ll happily rubber-stamp the exact mistakes the developer just made. So pure “another agent reads the diff” review has low yield. The leverage is in checking against ground truth: does it compile, do the tests pass, does the live endpoint return the right thing? The testing leg is load-bearing. The harness is only as strong as its weakest verification path.

Ship Task

ship-task takes one task and drives it into shipped PRs. The task comes from a roadmap doc or a GitHub issue. Either way the code lives across three sibling repos: a Cloudflare Worker backend, an iOS app, and a web admin console.

It splits the task into lanes, one per repo, capped at three subagents. Each lane runs in its own git worktree on its own feature branch. That’s the worktrees post paying off. The agents never touch my checkout and never touch each other’s files. Each lane gets a written file boundary, the files it owns and the files it must not touch, so parallel lanes don’t collide on merge.
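A minimal sketch of what checking a lane against its written file boundary could look like. The `Boundary` type, the glob patterns, and the file names are all hypothetical and not taken from the skill itself. The idea is only that a lane's changed files can be checked mechanically against what it owns and what it must not touch.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatch


@dataclass
class Boundary:
    """Hypothetical written boundary for one lane, as glob patterns."""
    owns: list[str]
    must_not_touch: list[str] = field(default_factory=list)


def boundary_violations(boundary: Boundary, changed_files: list[str]) -> list[str]:
    """Return the changed files that fall outside the lane's boundary.

    A file is a violation if it matches a forbidden pattern, or if it
    matches none of the owned patterns. Note that fnmatch's "*" also
    matches "/" so "src/routes/*" covers nested paths too.
    """
    bad = []
    for path in changed_files:
        owned = any(fnmatch(path, pattern) for pattern in boundary.owns)
        forbidden = any(fnmatch(path, pattern) for pattern in boundary.must_not_touch)
        if forbidden or not owned:
            bad.append(path)
    return bad


# Example with made-up paths: one lane owns the routes, except auth.
lane = Boundary(owns=["src/routes/*"], must_not_touch=["src/routes/auth*"])
print(boundary_violations(lane, ["src/routes/tasks.ts", "src/routes/auth.ts", "package.json"]))
```

In practice the changed-file list could come from something like `git diff --name-only` against the base branch, run inside the lane's worktree.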

The lanes are sequenced by dependency, not convenience. The backend defines the typed contract first. The iOS and web lanes consume it. No client lane uses an entity before its contract lands.
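Sequencing by dependency is a topological sort. The sketch below uses Python's standard `graphlib` to group lanes into waves, where every lane in a wave can run in parallel once the previous wave has landed. The lane names and the dependency map are assumptions that mirror the three repos described above, not the skill's real configuration.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: lane -> lanes whose contract it consumes.
DEPENDS_ON = {
    "worker": set(),
    "ios": {"worker"},
    "web": {"worker"},
}


def lane_waves(depends_on: dict[str, set[str]]) -> list[list[str]]:
    """Group lanes into waves; a wave starts only after the previous one finishes."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises graphlib.CycleError if lanes depend on each other in a loop
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only to make the output stable
        waves.append(ready)
        sorter.done(*ready)
    return waves


print(lane_waves(DEPENDS_ON))  # [['worker'], ['ios', 'web']]
```

The backend lane lands alone in the first wave, and both client lanes become runnable together in the second.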

Every lane runs that repo’s tests and quotes the real pass or fail counts. The worker runs tsc then vitest. iOS runs a simulator script. Web runs its test command. If a test can’t run, the lane says so out loud instead of skipping quietly.
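One way to make "couldn't run" a first-class outcome instead of a silent skip is to give it its own status. This is a sketch under assumptions: the `run_checks` helper and its status strings are hypothetical, and it reports only pass or fail per step. Quoting real test counts would mean parsing each runner's output, which is left out here.

```python
import subprocess
import sys


def run_checks(steps):
    """Run (name, command) steps in order and stop at the first problem.

    Returns a list of (name, status) where status is one of
    "passed", "failed", "could not run", or "not reached".
    """
    results = []
    blocked = False
    for name, cmd in steps:
        if blocked:
            # An earlier step failed, so this one is reported, not hidden.
            results.append((name, "not reached"))
            continue
        try:
            proc = subprocess.run(cmd, capture_output=True, text=True)
        except FileNotFoundError:
            # The tool is not installed or not on PATH: say so explicitly.
            results.append((name, "could not run"))
            blocked = True
            continue
        status = "passed" if proc.returncode == 0 else "failed"
        results.append((name, status))
        blocked = status == "failed"
    return results


# Demo with the current interpreter standing in for real tools.
demo = [
    ("typecheck", [sys.executable, "-c", "pass"]),
    ("unit tests", [sys.executable, "-c", "raise SystemExit(1)"]),
    ("lint", [sys.executable, "-c", "pass"]),
]
print(run_checks(demo))
```

For the worker lane the steps might be something like `["npx", "tsc", "--noEmit"]` followed by `["npx", "vitest", "run"]`, assuming that is how the repo invokes them.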

The iOS lane does one more thing. If it changes a screen, it renders that screen to a PNG with a snapshot test, commits the image, and embeds it in the PR, so I can eyeball the result without checking out the branch. This is the snapshot idea from the last post, now wired into the flow. It’s a regression signal, not a correctness oracle, but paired with the diff it gives me something to look at fast.

One gotcha cost me a few blank-image PRs. The iOS repo is private. The plain raw.githubusercontent.com link 404s for everyone because that domain never sees your GitHub login. The first-party github.com/.../blob/<sha>/...?raw=true link renders inline. Small thing, real time lost.
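A small sketch of building the embed line in the form that worked. The function name is hypothetical, and the owner, repo, commit, and path in the example are placeholders. Only the URL shape, a `blob` link pinned to a commit SHA with `?raw=true`, comes from the paragraph above.

```python
from urllib.parse import quote


def snapshot_image_markdown(owner: str, repo: str, sha: str, path: str, alt: str = "snapshot") -> str:
    """Build a Markdown image tag that points at the github.com blob URL.

    Pinning to a commit SHA keeps the image stable even if the branch moves.
    quote() escapes spaces and similar characters but leaves "/" alone.
    """
    url = f"https://github.com/{owner}/{repo}/blob/{sha}/{quote(path)}?raw=true"
    return f"![{alt}]({url})"


# Placeholder values, purely for illustration.
print(snapshot_image_markdown("acme", "ios-app", "abc123", "Snapshots/Home Screen.png"))
```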

Each lane lands a documented PR covering what it does, why, how, the critical decisions, what human support is needed, test status, the version bump, and screenshots. The skill does not merge and does not deploy. That stays mine.

Ship Loop

ship-loop is a thin wrapper. It picks the next eligible task, ships it with ship-task, records the result, checks the stop conditions, and repeats. It doesn’t reimplement any shipping. It only adds three things: picking the next task, tracking state across iterations, and knowing when to stop.

A task is eligible only if it isn’t already shipped, isn’t already in flight, and doesn’t overlap the files of an open PR from an earlier iteration. Ineligible tasks get skipped, not stopped on.
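The eligibility rule reads naturally as a predicate. In this sketch the `Task` shape and the idea of tracking shipped and in-flight task IDs as sets are assumptions made for illustration. The three conditions themselves are the ones listed above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Task:
    """Hypothetical task record: an ID plus the files it expects to touch."""
    id: str
    files: frozenset


def is_eligible(task: Task, shipped: set, in_flight: set, open_pr_files: set) -> bool:
    """Eligible = not shipped, not in flight, no file overlap with open PRs."""
    if task.id in shipped or task.id in in_flight:
        return False
    return task.files.isdisjoint(open_pr_files)


def next_eligible(tasks, shipped, in_flight, open_pr_files) -> Optional[Task]:
    """Return the first eligible task; ineligible ones are skipped, not stopped on."""
    for task in tasks:
        if is_eligible(task, shipped, in_flight, open_pr_files):
            return task
    return None


tasks = [
    Task("t1", frozenset({"src/a.ts"})),  # already shipped
    Task("t2", frozenset({"src/b.ts"})),  # overlaps an open PR
    Task("t3", frozenset({"src/c.ts"})),  # free to go
]
picked = next_eligible(tasks, shipped={"t1"}, in_flight=set(), open_pr_files={"src/b.ts"})
print(picked.id if picked else None)  # t3
```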

The stop conditions are the part I care about most. A loop that doesn’t know when to quit is worse than no loop. Three things end it:

  • Nothing eligible left. Clean finish.
  • The next task needs a human. A secret to set, a migration, a destructive step, or a deploy the next task depends on. The loop never deploys, so it can’t unblock itself. It stops.
  • Contract collision. Only one unmerged contract change in flight at a time. A second one would fight the first on merge.
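The three stop conditions can be sketched as a loop that returns a reason whenever it ends. Everything here is illustrative: the `Task` flags, the `Stop` enum, and the `ship_task` callable are hypothetical stand-ins, and the queue is assumed to hold only eligible tasks in pick order. Since nothing merges during a run, a contract change stays in flight once it ships.

```python
from dataclasses import dataclass
from enum import Enum


class Stop(Enum):
    NOTHING_ELIGIBLE = "nothing eligible left"
    NEEDS_HUMAN = "next task needs a human"
    CONTRACT_COLLISION = "contract change already in flight"


@dataclass
class Task:
    """Hypothetical task record with the two flags the stop rules need."""
    id: str
    needs_human: bool = False
    changes_contract: bool = False


def ship_loop(queue, ship_task):
    """Ship tasks until a stop condition hits.

    Returns (shipped_ids, stop_reason, id_of_task_to_pick_up_next_or_None).
    """
    shipped = []
    contract_in_flight = False
    for task in queue:
        if task.needs_human:
            return shipped, Stop.NEEDS_HUMAN, task.id
        if task.changes_contract and contract_in_flight:
            return shipped, Stop.CONTRACT_COLLISION, task.id
        ship_task(task)
        shipped.append(task.id)
        contract_in_flight = contract_in_flight or task.changes_contract
    return shipped, Stop.NOTHING_ELIGIBLE, None


queue = [
    Task("a", changes_contract=True),
    Task("b"),
    Task("c", changes_contract=True),  # second contract change: loop stops here
    Task("d"),
]
print(ship_loop(queue, ship_task=lambda task: None))
```

Every exit path returns a reason and the next task, which is what makes stopping reportable as a normal outcome.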

Stopping is success here, not failure. When the loop ends it writes up everything it shipped, the merge and deploy order across PRs, the human to-do list, and the exact task it would pick up next. I merge and deploy. Then I run it again.

Where I Trust It

The harness is only as honest as its verification, and mine is uneven. That shapes how much rope I give the loop.

On the backend, the agent has a staging environment and a CLI to hit the deployed API. That’s real ground truth. I let the loop run hot here.

On the frontend, I’m still figuring out good autonomous testing. I’ve been trying the Chrome DevTools MCP to let the agent drive the console and watch the UI. So far it’s hit or miss. I expect it to improve, and I may be using it wrong. For now I wouldn’t gate much on it.

On mobile, validation is harder. You have to build the app and drive a simulator, and I haven’t found a tool that lets Claude interact with one. The snapshot PNGs are the stopgap. Claude reading a screenshot will cheerfully tell you it looks right, so I treat the image as a regression signal next to a deterministic snapshot diff, never as proof on its own.
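The simplest deterministic diff is a byte-for-byte comparison of the committed PNGs against a baseline. The sketch below is that simplification, not what a snapshot-testing library does: real tools usually compare pixels, often with a tolerance, and byte equality can flag a re-encoded image that looks identical. The directory layout and function names are assumptions.

```python
import hashlib
from pathlib import Path


def file_digest(path) -> str:
    """SHA-256 of a file's bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def changed_snapshots(baseline_dir, candidate_dir) -> list[str]:
    """Names of PNGs in candidate_dir that are new or differ from the baseline.

    Purely byte-based: no model judgment involved, so the answer is the
    same on every run for the same inputs.
    """
    baseline = {p.name: file_digest(p) for p in Path(baseline_dir).glob("*.png")}
    changed = []
    for p in sorted(Path(candidate_dir).glob("*.png")):
        if baseline.get(p.name) != file_digest(p):
            changed.append(p.name)
    return changed
```

A changed name here says only that the screen moved. Deciding whether it moved correctly is still a human look at the image next to the diff.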

So the loop runs hot on the backend and on a short leash everywhere else. That’s the honest state of it. The next frontier is verification, not more agents.