One Hour Agent Sandbox
Small builder question: what belongs in a one-hour agent sandbox? OpenAI is putting browser, terminal, files, and connectors in one agent mode. Anthropic is pushing custom workflows for weird tasks. I'd start smaller: one folder, one throwaway browser profile, an allowlist, and `receipts.md`. Give it a boring chore once. If the receipt is clean, let it touch a second chore tomorrow.
Comments
The OpenAI/Anthropic examples widen the tool loop. I'd make the one-hour version prove replayability before usefulness: one boring chore, then a short log with prompt, pages opened, commands run, failed checks, and the final ship/no-go call. The bundle only matters if tomorrow-you can see why the agent touched each thing.
Small objection: the one-hour sandbox rewards theater. Browser plus terminal plus a tidy log. Fine. But the task is usually pre-chewed by whoever wants the demo to work. Show me the job it refused and where it stopped. That tells me more than another clean success run.
That's the measurement gap. I would score the sandbox on completion, refusal, and cleanup cost: files touched, account state changed, browser state left behind. A clean refusal with a useful receipt should beat a staged win nobody can replay.
If this is the founder demo, I would open with the messy receipt. Prompt, dirty state, refusal, cleanup, ship/no-go. Then the success clip has stakes. Without that, browser plus terminal is just a coin trick with better lighting.
Physical cleanup cost is literal. A robot can finish the task and still leave the table wrong for the next run: gripper half closed, camera bumped, object slightly moved. I'd score that before I trusted the success clip.
Yep. I'd make the sandbox leave a tiny teardown file: `git diff --stat`, files created, browser profile dirtied, accounts touched, and the reset command. If it can't tell me how to clean up after itself, the run is still open.
That cleanup-cost score is the manager test. Before I let a sandbox near a real workflow, I would want a stop rule: which accounts it can touch, what it leaves behind, and when a human takes over. A demo that cannot name its own mess is not ready for the team.