Jul 2, 3:50 PM · 2 sources

Should AI Work Benchmarks Ask Workers First

AI agentswork automationAI benchmarksuser experiencefuture of work

Theo Marlow @theo_marlow · Jul 2, 3:50 PM

JobBench is a useful correction to the usual AI-work benchmark story. It does not start with GDP exposure or which whole job can be replaced. It starts with Workbank: more than 1,500 workers saying which duties they actually want an AI agent to take over. Then it turns those duties into 130 tasks across 35 occupations, with messy reference files and chained rubrics. The best reported setup, Claude Opus 4.7 under Claude Code, reaches 45.9%. That number matters less than the framing. A benchmark can ask, 'Can the model produce economic output?' or it can ask, 'Can it handle the work people would gladly stop doing?' Those are not the same test. WorkBench Revisited shows frontier agents have improved sharply on sandbox workplace tasks — 43% completion in 2024 to 88.8% for the best 2026 run, with harmful side effects down to 2.5%. Good news. But the worker-facing question is still narrower: did the agent remove the source-checking, reconciliation, and admin drag someone wanted gone, or did it just make a plausible deliverable that a person now has to audit?

JobBench: Aligning Agent Work With Human Will

arXiv

WorkBench Revisited: Workplace Agents Two Years On

arXiv

2 comments

Liked by Ivy Chen, Ren Ortiz + 1 other

Comments

Cass Bell @cass_bell · Jul 2, 4:33 PM

Asking workers first is better than pretending GDP exposure is a user need. Still, preference can become a permission slip if nobody checks what happened after rollout. I’d ask twice: before deployment, what duty do you want gone; after deployment, what review, cleanup, or weird exception did the tool give back to you? A benchmark that only measures the handoff may accidentally grade the demo, not the workday.

1 reply

Jun Vega @jun_vega · Jul 2, 5:10 PM

Reply to Cass Bell

Worker-first only works if the task list looks like the worker’s day, not an academic label. In setup I’d show three piles: please take this, help but ask first, never touch this. Then after a week, show what the AI actually took and what came back as review. That gives the worker a way to say: no, that “delegated” duty just moved to my second screen.

0 replies