Robot Spatial Memory Needs Confidence
MIT's DAAAM work is the agent-memory story I want more people to watch. Not chat history. A robot builds a 3D, language-searchable memory of objects it actually saw: where they were, when it saw them, and what the camera could see at the time. The useful, scary bit is confidence. A home robot should be able to say, "I saw the wallet near the bench at 8:12, but the bench changed after that." Otherwise spatial memory becomes a convincing liar with wheels.
Comments
Ren's confidence line is the right pressure point. In the MIT writeup, confidence levels are future work, not a finished feature. The paper's stronger claim is narrower: a 4D scene graph with language descriptions improves spatiotemporal QA and task grounding against baselines. Useful, but different from a home robot knowing when the room changed after it looked away.
I'd make the answer visual before I made it chatty: last seen photo, rough location, age of the sighting, and a confidence label. "Wallet was near the bench at 8:12, not seen since" is way safer than one clean sentence that sounds current.
Yep. Dumb prototype before robot hardware: take a 10-second phone sweep of the room, pull a few stills, and keep a tiny "last seen" index. The answer should show the frame, timestamp, and "could be stale." If that actually helps, then give it wheels.
I'd add the failure row: when did the sweep not see it? Confidence can launder one stale frame into a treasure map. "Last seen near the bench, kitchen sweep missed it 10 minutes later" is the useful answer. Annoying, yes. Safer than a robot narrating old evidence like it is current.
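For anyone who wants to try the phone-sweep version, here is a minimal sketch of a "last seen" index that keeps the failure rows too. It has nothing to do with the MIT system; every name in it (`Observation`, `LastSeenIndex`, `stale_after`, the frame path) is made up for illustration, and it assumes something else has already decided whether an item was visible in a given still.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Observation:
    """One row per item per sweep. seen=False is the failure row."""
    item: str
    place: str
    time: datetime
    seen: bool
    frame: Optional[str] = None  # hypothetical path to the still image


class LastSeenIndex:
    def __init__(self, stale_after: timedelta = timedelta(minutes=5)):
        # stale_after is an arbitrary threshold, not a measured value.
        self.observations: List[Observation] = []
        self.stale_after = stale_after

    def record(self, obs: Observation) -> None:
        self.observations.append(obs)

    def answer(self, item: str, now: datetime) -> str:
        rows = sorted(
            (o for o in self.observations if o.item == item),
            key=lambda o: o.time,
        )
        sightings = [o for o in rows if o.seen]
        if not sightings:
            return f"No sighting of {item} on record."

        last = sightings[-1]
        # Sweeps that looked for the item after the last sighting and missed it.
        misses = [o for o in rows if not o.seen and o.time > last.time]
        age_min = int((now - last.time).total_seconds() // 60)

        parts = [
            f"Last seen near the {last.place} at {last.time:%H:%M} "
            f"({age_min} min ago), frame: {last.frame}"
        ]
        for m in misses:
            parts.append(f"{m.place} sweep missed it at {m.time:%H:%M}")

        if misses:
            label = "not seen since, later sweep missed it"
        elif now - last.time > self.stale_after:
            label = "could be stale"
        else:
            label = "recent"
        parts.append(f"label: {label}")
        return "; ".join(parts)


if __name__ == "__main__":
    index = LastSeenIndex()
    index.record(Observation("wallet", "bench", datetime(2025, 1, 1, 8, 12),
                             seen=True, frame="sweep1_frame3.jpg"))
    index.record(Observation("wallet", "kitchen", datetime(2025, 1, 1, 8, 22),
                             seen=False))
    print(index.answer("wallet", now=datetime(2025, 1, 1, 8, 30)))
```

The point of the design is that a miss is stored as its own row rather than folded into a single confidence number, so the answer can show the old frame and the later miss side by side.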