Tyler Schultz Verification

The most useful thing I can give a coding agent is not a perfect prompt. It is a way to check its own work. Let's give our agents a lab: a small environment that allows real verification.

Think of it this way: an agent that can only edit files is no better than autocomplete. An agent that can edit, build, run, and inspect the result is starting to engineer.

If the agent only writes the diff, I am the one who has to prove it works. That human verification layer is what we want to remove.

The idea

Do not tell me what you changed. Tell me what you proved.

"I think I fixed it" is a guess. "I built it, opened the app, reproduced the flow, and the failing test case now passes" is evidence.

A coding agent verification loop

01 Make the claim: State what should change.
02 Build it: Let compiler errors become feedback.
03 Run checks: Tests, API calls, previews, logs.
04 Look at it: Use the real UI or real data.
05 Adjust: Use failures as the next prompt.
06 Show proof: Say what passed and what is uncertain.
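To make the loop concrete, here is a minimal Python sketch of steps 02 through 06. It is only an illustration under my own assumptions: each check is an ordinary shell command that exits non-zero on failure, and the names `run_check`, `verify`, and `proof_report` are hypothetical, not part of any agent framework.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class CheckResult:
    name: str
    passed: bool
    output: str


def run_check(name, command, timeout=300):
    """Run one command and record whether it exited cleanly."""
    try:
        done = subprocess.run(command, capture_output=True, text=True, timeout=timeout)
    except (OSError, subprocess.TimeoutExpired) as exc:
        # A missing tool or a hang counts as a failed check, not a crash.
        return CheckResult(name, False, str(exc))
    return CheckResult(name, done.returncode == 0, (done.stdout + done.stderr).strip())


def verify(checks):
    """Run checks in order and stop at the first failure.

    The failing output is what feeds the next attempt (step 05).
    """
    results = []
    for name, command in checks:
        result = run_check(name, command)
        results.append(result)
        if not result.passed:
            break
    return results


def proof_report(results, planned):
    """Summarize what passed, what failed, and what never ran (step 06)."""
    lines = [f"{'PASS' if r.passed else 'FAIL'}  {r.name}" for r in results]
    ran = {r.name for r in results}
    lines += [f"SKIP  {name} (not run, so unverified)" for name, _ in planned if name not in ran]
    return "\n".join(lines)


if __name__ == "__main__":
    # Placeholder commands; a real lab would call the project's build and test tools.
    planned = [
        ("build", [sys.executable, "-c", "print('compiled')"]),
        ("tests", [sys.executable, "-c", "print('all tests passed')"]),
    ]
    print(proof_report(verify(planned), planned))
```

The point of the SKIP line is that the report separates what was proved from what was never checked, rather than letting an early failure hide the rest.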

Labs

The lab looks different depending on the work. Here are the three I care about right now.

APIs. A lot of teams already have the lab sitting in Postman: collections, environments, example requests, auth, and test scripts. That is a map of the whole system.

Postman's CLI generator makes that map runnable. Instead of asking an agent to read docs and guess, I can ask something more direct:

Look at the trip endpoints and tell me which call explains why this aircraft is missing crew assignments.

The agent can inspect the CLI, run requests, compare responses, and answer from what the API actually returned. No giant prompt full of API docs. It just goes and looks.

iOS. Here the lab is the app itself. Apple's WWDC26 sessions on Xcode and agents, and on Device Hub, point in this direction: build the feature, run the app, use previews, interact with simulators, capture screenshots, and validate inside the development surface.

Build the app, open the simulator, walk through the trip details flow, and show me screenshots proving the empty, loading, and success states still work.

Tools like XcodeBuildMCP make that practical for agents today. They can build, test, boot a simulator, launch the app, read logs, inspect UI structure, tap through flows, and capture screenshots.

This matters because iOS bugs hide from compilers. A SwiftUI view can build cleanly and still clip. A flow can work on one device and break on another. A screen can look perfect until the data is empty, delayed, localized, or permission-gated. The only way to catch those is to actually look.

SnapshotPreviews fits here too. If previews can become images, and images can be diffed, then visual checks can become part of the agent's proof.
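As a rough sketch of what "images can be diffed" means, the Python below compares two snapshots pixel by pixel. It assumes the images have already been decoded into rows of (r, g, b) tuples by some other tool; it does not use SnapshotPreviews' own API, and the function names and the tolerance parameter are my own illustration.

```python
def diff_ratio(baseline, candidate):
    """Return the fraction of pixels that differ between two same-sized images.

    Each image is a list of rows; each pixel is an (r, g, b) tuple.
    """
    if len(baseline) != len(candidate) or any(
        len(row_a) != len(row_b) for row_a, row_b in zip(baseline, candidate)
    ):
        raise ValueError("images have different dimensions")
    total = sum(len(row) for row in baseline)
    if total == 0:
        return 0.0
    changed = sum(
        1
        for row_a, row_b in zip(baseline, candidate)
        for px_a, px_b in zip(row_a, row_b)
        if px_a != px_b
    )
    return changed / total


def snapshot_matches(baseline, candidate, tolerance=0.0):
    """True when the share of changed pixels is within the allowed tolerance."""
    return diff_ratio(baseline, candidate) <= tolerance
```

An agent could run a comparison like this against a stored baseline for each preview and include the ratio in its proof. A small non-zero tolerance is one way to absorb anti-aliasing noise, though the right threshold depends on the project.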

Debugging. Imagine a bug report comes in through the helpdesk: a user says a trip page will not load after they tap into an upcoming flight. The report has a timestamp, a user ID, maybe an aircraft tail number, and a sentence or two of human context. That is useful, but it is not a diagnosis.

The question for the agent is: where should it look first? It can search Sentry for crashes or errors near that user and release, check Datadog logs and traces for the backend request, and use Mixpanel to see the product path that led to the failure.

Triage this helpdesk report. Find the matching app error, backend trace, and product events. Tell me whether this is one user, one trip, one release, or a broader pattern.

Now the lab is not just the repo. It is the connected debugging surface. The agent can turn a vague report into a timeline, decide whether the failure is client-side or server-side, identify the smallest likely fix, and then verify against the same signals that exposed the bug.
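Turning a report into a timeline can be sketched in a few lines of Python. This assumes the agent has already pulled events out of each tool and normalized them into one shape; the `Event` fields and the "client", "backend", and "product" labels are hypothetical, and nothing here calls a real Sentry, Datadog, or Mixpanel API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Event:
    source: str      # hypothetical label: "client", "backend", or "product"
    at: datetime
    user_id: str
    summary: str
    is_error: bool


def build_timeline(events, user_id, reported_at, window=timedelta(minutes=10)):
    """Keep one user's events near the report time, oldest first."""
    nearby = [
        e for e in events
        if e.user_id == user_id and abs(e.at - reported_at) <= window
    ]
    return sorted(nearby, key=lambda e: e.at)


def first_error_side(timeline):
    """Return the source of the earliest error in the timeline, or None.

    This is only a starting hint for where to look first, not a diagnosis.
    """
    for event in timeline:
        if event.is_error:
            return event.source
    return None
```

If the earliest error sits on the backend and the client error follows it, the agent has a reason to start with the server trace. After a fix, rebuilding the same timeline for a fresh attempt is one way to verify against the signals that exposed the bug.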

Why

This matters most when agent sessions get longer. If an agent is only making a small edit, I can afford to be the verification layer. But once the task runs for hours, that stops working. The agent needs a way to inspect its own work, learn from failures, and keep going until the result is verified.

That changes the ambition of the tasks we can hand over. Instead of asking for a tiny patch, I can ask for an outcome:

Improve the performance of this data flow by 20% without changing the output.

That is a multi-hour task. The agent has to measure the baseline, form a theory, make a change, run the benchmark, compare the result, and repeat. Without a lab, it eventually comes back to me with a diff and asks me to decide whether it helped. With a lab, it can keep looping until it has proof: the same output, faster runtime, and a clear account of what changed.
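The check at the end of each iteration of that loop might look like the Python sketch below. It rests on assumptions the prompt leaves open: I read "20%" as 20% less runtime, take the best of several runs to damp noise, and assume the functions do not mutate their input. The names `best_time` and `verify_speedup` are hypothetical.

```python
import time


def best_time(func, data, repeats=5, timer=time.perf_counter):
    """Return the fastest of several timed runs, which damps scheduling noise."""
    best = float("inf")
    for _ in range(repeats):
        start = timer()
        func(data)
        best = min(best, timer() - start)
    return best


def verify_speedup(baseline, candidate, data, target=0.20, repeats=5,
                   timer=time.perf_counter):
    """Check that candidate returns the same output and cuts runtime by `target`."""
    # Correctness first: a faster function with different output proves nothing.
    same_output = baseline(data) == candidate(data)
    old = best_time(baseline, data, repeats, timer)
    new = best_time(candidate, data, repeats, timer)
    improvement = (old - new) / old if old > 0 else 0.0
    return {
        "same_output": same_output,
        "baseline_s": old,
        "candidate_s": new,
        "improvement": improvement,
        "proved": same_output and improvement >= target,
    }
```

The returned dictionary is the proof: the output comparison, both timings, and whether the target was met. A real benchmark would also want representative input sizes and a quiet machine, which this sketch does not address.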

The more complete the lab, the more ambitious the assignment can be.