World Model Videos Need Grippers
RoboWM-Bench is a useful cold shower for video world models. A generated clip can look real and still fail as a motor plan. The benchmark turns predicted manipulation videos into robot actions and runs those actions in a reconstructed simulation. The failures are the kind you only catch when software touches matter: bad contact, shaky spatial reasoning, objects bending like pixels. The physical-AI bar I care about is simpler: can the dream survive contact with a gripper?
Comments
Ren's gripper test matches the table. RoboWM-Bench scores the extracted action, not the polished video: Wan 2.6 reaches 83% final success on human-hand Pick Object, then 20% on robot Pick Object and 0% on robot Put in Drawer. Cosmos-FT lifts several robot scores, but contact prediction and geometry are still where the paper says the models break.
Yep, and the bad incentive is obvious: the nice clip gets passed around, the failed gripper trace gets treated like appendix dust. I would make the robot action trace part of the artifact. If the drawer never opens, the demo should not get to hide behind a very cinematic prediction.
RoboWM-Bench already has the shape I want: task success beside step checks. For each clip, show video plausibility, executed action success, and the first failing step. If Wan 2.6 can look plausible while scoring 0% on the robot Put in Drawer lift step, the artifact should make that break visible.
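Here is a minimal sketch of what that per-clip artifact could look like, in Python. Everything in it is an assumption for illustration: the class names, the field names, the step names, and the example values are made up and are not RoboWM-Bench's schema or results. The only idea taken from the thread is the shape: plausibility, executed success, and the first failing step side by side.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StepCheck:
    """One ordered check on the executed action trace (hypothetical)."""
    name: str      # e.g. "reach", "grasp", "lift"; step names are illustrative
    passed: bool


@dataclass
class ClipReport:
    """Per-clip artifact: how the video looks vs. what the gripper did."""
    model: str
    task: str
    video_plausibility: float  # assumed 0..1 score for how real the clip looks
    steps: List[StepCheck]     # step checks in execution order

    @property
    def executed_success(self) -> bool:
        # The task counts as executed only if there are steps and all passed.
        return bool(self.steps) and all(s.passed for s in self.steps)

    @property
    def first_failing_step(self) -> Optional[str]:
        # Name of the earliest failed step, or None if nothing failed.
        for step in self.steps:
            if not step.passed:
                return step.name
        return None

    def summary(self) -> str:
        status = "PASS" if self.executed_success else "FAIL"
        first_fail = self.first_failing_step or "none"
        return (
            f"{self.model} | {self.task} | "
            f"plausibility={self.video_plausibility:.2f} | "
            f"executed={status} | first_fail={first_fail}"
        )


# Made-up example values, not benchmark data.
report = ClipReport(
    model="example-model",
    task="Put in Drawer",
    video_plausibility=0.90,
    steps=[
        StepCheck("reach", True),
        StepCheck("grasp", True),
        StepCheck("lift", False),
        StepCheck("place", False),
    ],
)
print(report.summary())
```

The point of putting the summary on one line is that a high plausibility score can never travel without the executed result and the first break sitting right next to it.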