I ended “Shipping Tasks on a Loop” on an honest note. The backend had real ground truth. Mobile had none. A screenshot is not a test. Claude reading a screenshot will cheerfully tell you it looks right. So I built the thing I said I didn’t have. It’s an MCP server that drives the iOS simulator, public and MIT at github.com/CodeJonesW/ios-agent-driver. Fourteen tools over simctl and Meta’s idb: boot, install, launch, read the screen, tap, type, swipe.
The design that makes it work is accessibility-tree-first. The agent calls describe_ui, gets back labeled elements, and taps by label, not by pixel. A tap on a missing label doesn’t no-op. It returns the nearest labels so the agent can re-read and pick the right one. Then it reads the tree again after the tap, because the post-action read is the assertion. There’s a skill called test-ios that wraps this into a loop: build the app, log in, perceive, decide, act, observe toward one goal with one success predicate. It validates and posts a verdict. It never merges.
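Here is a minimal sketch of that tap-by-label contract, not the driver's actual code. FakeScreen, press, and tap_by_label are hypothetical names made up for illustration, the screen is an in-memory list of labels rather than a real accessibility tree, and the "nearest labels" are approximated with Python's difflib.

```python
from difflib import get_close_matches


class FakeScreen:
    """Hypothetical stand-in for a simulator screen; not the real driver."""

    def __init__(self):
        self.labels = ["Log In", "Sign Up", "Forgot Password"]

    def describe_ui(self):
        # The real tool returns labeled elements; here it is just label strings.
        return list(self.labels)

    def press(self, label):
        # Pretend that tapping "Log In" navigates to a home screen.
        if label == "Log In":
            self.labels = ["Today", "Log Lift", "Settings"]


def tap_by_label(screen, label):
    """Tap by label. A miss returns suggestions instead of silently doing nothing."""
    labels = screen.describe_ui()
    if label not in labels:
        nearest = get_close_matches(label, labels, n=3, cutoff=0.4)
        return {"ok": False, "nearest": nearest}
    screen.press(label)
    # Re-read the tree after acting: this read is what the agent asserts on.
    return {"ok": True, "after": screen.describe_ui()}


screen = FakeScreen()
miss = tap_by_label(screen, "Login")   # label is slightly wrong
hit = tap_by_label(screen, "Log In")   # corrected from the suggestions
```

The miss gives the agent something to act on, and the hit comes back with the new tree, so the success predicate can be checked against what the app now shows.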
Every tool takes an explicit udid. That one parameter is what lets four agents run at once. One agent drives one sim. Four agents, each passing its own udid, never touch each other’s simulator. So I boot four simulators, hand each a persona and a goal, and run them at the same time. An athlete mid-training-block moving tomorrow’s session. A brand-new user who doesn’t know the vocabulary. A lapsed user back after two weeks. Someone who just wants to log a lift and leave. Four genuinely different walks through the same app, on real staging accounts, finding bugs a single happy path never would.
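A rough sketch of why the explicit udid makes this safe, under loud assumptions: the udids below are placeholders, run_persona is a hypothetical stub that only records what it would do, and the threads stand in for separate agents. None of this is the driver's real API.

```python
from concurrent.futures import ThreadPoolExecutor

# Placeholder udids mapped to the four personas from the run described above.
PERSONAS = {
    "SIM-A": "athlete mid-training-block moving tomorrow's session",
    "SIM-B": "brand-new user who doesn't know the vocabulary",
    "SIM-C": "lapsed user back after two weeks",
    "SIM-D": "someone who just wants to log a lift and leave",
}


def run_persona(udid, persona):
    """Hypothetical agent loop: every step names the udid it targets."""
    # A real agent would pass this same udid to every driver tool call,
    # so it has no way to reach a simulator it was not handed.
    actions = [f"{udid}: describe_ui", f"{udid}: tap", f"{udid}: describe_ui"]
    return {"udid": udid, "persona": persona, "actions": actions}


# One worker per simulator; no state is shared between them.
with ThreadPoolExecutor(max_workers=len(PERSONAS)) as pool:
    results = list(pool.map(run_persona, PERSONAS.keys(), PERSONAS.values()))
```

Because the target is an argument rather than ambient state like "the booted simulator", running four at once needs no locking or coordination.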


The machine feels it. On an M4 MacBook with 16GB, four sims and four agents sit around 78% CPU (system plus user), with 6,700 threads and 1,500 processes. The run was slow. Four was about the ceiling on this machine, and honestly a bit much for it. And a four-persona run is not a proof. Four people find four people’s worth of bugs, not the absence of the fifth. Still, the gap I named at the end of the loop post is closed. Mobile drives a real device now, asserts on what the app shows, and does it four people wide.
Cheers,
Will