GitHub is moving the model picker into the system

GitHub says that after the first prompt, Copilot Auto uses task intent and current model health to choose a model for the job. A quick explanation, a focused edit, and a messy multi-file change should not all take the same path. GitHub also says no single model won across all of its evaluations, which is probably the most honest sentence in the post.

The routing logic weighs real-time availability, utilization, speed, error rates, and cost. GitHub's HyDRA router also looks at reasoning depth, code complexity, debugging difficulty, and tool orchestration needs. Auto with task intent is already live across supported Copilot experiences, with more surfaces planned.
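To make those inputs concrete, here is a minimal sketch of a router that filters on health first and then picks the cheapest model strong enough for the task. It is not GitHub's implementation. The model names, the capability tiers, the error-rate threshold, and the tie-breaking rule are all assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelStatus:
    name: str             # hypothetical model label
    capability: int       # made-up tier, 1 (light) to 5 (heavy reasoning)
    available: bool
    error_rate: float     # recent fraction of failed requests
    latency_s: float
    cost_per_call: float


@dataclass(frozen=True)
class Task:
    intent: str
    required_capability: int


def route(task, models, max_error_rate=0.05):
    """Pick the cheapest healthy model that is strong enough for the task."""
    healthy = [m for m in models if m.available and m.error_rate <= max_error_rate]
    if not healthy:
        raise RuntimeError("no healthy model available")
    capable = [m for m in healthy if m.capability >= task.required_capability]
    if not capable:
        # Nothing healthy is strong enough: fall back to the strongest healthy model.
        return max(healthy, key=lambda m: m.capability)
    # Cheapest first, then fastest as a tie-breaker.
    return min(capable, key=lambda m: (m.cost_per_call, m.latency_s))


# Illustrative fleet: the flaky model is cheaper than "large" but fails the health check.
MODELS = [
    ModelStatus("small", 2, True, 0.01, 0.8, 1.0),
    ModelStatus("large", 5, True, 0.02, 3.0, 10.0),
    ModelStatus("large-flaky", 5, True, 0.20, 2.0, 5.0),
]

print(route(Task("quick explanation", 1), MODELS).name)   # small
print(route(Task("multi-file change", 4), MODELS).name)   # large
```

The point of the toy is the ordering: health is a filter, not a score, so an unhealthy model never wins on price.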

That is the right direction for normal users. Most people do not want to babysit model selection. They want the agent to pick a sane route and leave enough evidence that a slow run, weak answer, or surprising bill can be explained later.

Context is becoming part of the cost control story

The other half of GitHub's post is less glamorous and maybe more important: stop making the model re-read the same stuff. Copilot is using more prompt caching, cache-control breakpoints, and deferred tool definitions so longer sessions do not keep dragging the full tool shed through every turn.

Tool search is the detail to watch. Agent products want broad tool access because every extra connector makes the demo look more capable. But full tool schemas are overhead. If the system can load tool definitions only when they matter, it can keep the surface wide without stuffing every request with irrelevant options.
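A rough sketch of the idea, under my own assumptions: every request carries a one-line stub per tool, and a full schema is attached only after a search says the tool is relevant. The registry, the tool names, and the `search_tools` and `build_request` helpers are hypothetical and do not reflect how Copilot represents tools.

```python
import json

# Hypothetical tool registry: a one-line summary plus a full schema per tool.
TOOLS = {
    "read_file": {
        "summary": "Read a file from the workspace",
        "schema": {"type": "object",
                   "properties": {"path": {"type": "string"}},
                   "required": ["path"]},
    },
    "run_tests": {
        "summary": "Run the project test suite",
        "schema": {"type": "object",
                   "properties": {"pattern": {"type": "string"},
                                  "timeout_s": {"type": "integer"}}},
    },
    "query_db": {
        "summary": "Run a read-only SQL query against the staging database",
        "schema": {"type": "object",
                   "properties": {"sql": {"type": "string"},
                                  "limit": {"type": "integer"}},
                   "required": ["sql"]},
    },
}


def search_tools(tools, query):
    """Naive keyword match over summaries; a real system would rank results."""
    words = query.lower().split()
    return [name for name, tool in tools.items()
            if any(word in tool["summary"].lower() for word in words)]


def build_request(tools, loaded):
    """Always send short stubs; send full schemas only for tools already loaded."""
    return {
        "tool_stubs": [{"name": n, "summary": t["summary"]} for n, t in tools.items()],
        "tool_schemas": {n: tools[n]["schema"] for n in loaded},
    }


first_turn = build_request(TOOLS, loaded=[])
needed = search_tools(TOOLS, "database")
later_turn = build_request(TOOLS, loaded=needed)
everything = build_request(TOOLS, loaded=list(TOOLS))

for label, request in [("deferred", first_turn),
                       ("one tool loaded", later_turn),
                       ("all loaded", everything)]:
    print(label, len(json.dumps(request)), "characters")
```

With three tiny tools the savings are small. The gap grows with the number of connectors and the size of each schema, which is the case the post is talking about.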

For builders, this is a product lesson hiding inside infrastructure. A good receipt should show which model answered, what context the system paid to carry, what it deferred, what it cached, and what got loaded only after the task proved it needed it.

Expert users still route the work better

Anthropic's June 16 Claude Code study analyzed about 400,000 sessions from roughly 235,000 users between October 2025 and April 2026. The headline I care about: expertise still shows up in the result. Novice-rated sessions reached verified success about 15% of the time. Intermediate-or-higher sessions reached verified success around 28% to 33% of the time.

The collaboration pattern is telling. Anthropic reports that users made about 70% of planning decisions, while Claude made about 80% of execution decisions. Expert sessions also got more work out of the agent per instruction: about 12 Claude actions and 3,200 words per prompt, compared with about 5 actions and 600 words for novice sessions.

That does not mean agents only help experts. Anthropic says non-software occupations got close to software occupations on code-producing sessions. It does mean the dispatcher cannot pretend the human disappeared. The person still frames the goal, chooses what matters, and notices when the agent solved the wrong problem very efficiently.

The robot version makes the waste easier to see

I keep coming back to robotics because waste is physical there. If an agent burns tokens reading logs while eight robot stations sit idle, the room tells on you. If a policy improves only because the robot got more attempts, the table eventually shows the mess.

The ENPIRE paper from NVIDIA, CMU, and UC Berkeley uses robot-fleet metrics like Mean Robot Utilization and Mean Token Utilization for a reason. In that setup, coding agents improve manipulation policies by resetting the scene, running rollouts, checking success, and editing training code. The agent loop is real, but the utilization problem is real too.

Coding agents need the same kind of accounting, just with quieter machines. How much wall-clock time was useful work? How much was repeated context? How much was waiting on a model that was too strong, too slow, or unhealthy at that moment? A dispatch layer should help answer that before the team learns it from the invoice.
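Those questions reduce to simple arithmetic once a session log exists. The sketch below assumes a hypothetical log of `(category, seconds)` events and `(prompt_tokens, new_tokens)` turns. The category names and the two ratios are my own and are not the ENPIRE paper's metric definitions.

```python
from collections import defaultdict


def time_shares(events):
    """events: (category, seconds) pairs. Returns each category's share of wall-clock time."""
    totals = defaultdict(float)
    for category, seconds in events:
        totals[category] += seconds
    wall = sum(totals.values())
    if wall == 0:
        return {}
    return {category: seconds / wall for category, seconds in totals.items()}


def repeated_context_share(turns):
    """turns: (prompt_tokens, new_tokens) pairs.

    Returns the share of prompt tokens that had already been sent on an earlier turn.
    """
    prompt_total = sum(prompt for prompt, _ in turns)
    if prompt_total == 0:
        return 0.0
    new_total = sum(new for _, new in turns)
    return (prompt_total - new_total) / prompt_total


# Illustrative session: numbers are invented.
EVENTS = [
    ("useful_work", 120.0),
    ("waiting_on_model", 60.0),
    ("useful_work", 30.0),
    ("rereading_context", 90.0),
]
TURNS = [(1000, 1000), (1400, 400), (1800, 400)]

shares = time_shares(EVENTS)
for category, share in shares.items():
    print(f"{category}: {share:.0%}")
print(f"repeated context: {repeated_context_share(TURNS):.0%}")
```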

What I would copy from this now

If you are building an agent product, make routing visible at the receipt level. Do not expose a wall of scores. Show the route in plain language: task type, model class, tool set loaded, context cached, checks run, and where the system handed off or stopped.
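One possible shape for that receipt, as a sketch: a small record with exactly those fields and a plain-language rendering. The `RouteReceipt` class, its field names, and the sample values are hypothetical, not a format any product ships today as far as I know.

```python
from dataclasses import dataclass


@dataclass
class RouteReceipt:
    task_type: str
    model_class: str
    tools_loaded: list
    cached_tokens: int
    checks_run: list
    stopped_at: str

    def render(self):
        """Return the route as six short, human-readable lines."""
        tools = ", ".join(self.tools_loaded) or "none"
        checks = ", ".join(self.checks_run) or "none"
        return "\n".join([
            f"Task type: {self.task_type}",
            f"Model class: {self.model_class}",
            f"Tools loaded: {tools}",
            f"Context cached: {self.cached_tokens:,} tokens",
            f"Checks run: {checks}",
            f"Stopped at: {self.stopped_at}",
        ])


receipt = RouteReceipt(
    task_type="bug fix",
    model_class="mid-tier reasoning",
    tools_loaded=["read_file", "run_tests"],
    cached_tokens=12400,
    checks_run=["unit tests"],
    stopped_at="handed off to reviewer after tests passed",
)
print(receipt.render())
```

No scores appear anywhere in the output. Anything a manager cannot read in ten seconds belongs in the logs, not the receipt.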

Keep a small baseline suite for routing changes. Pick the five chores your agent claims to handle: explanation, narrow edit, bug fix, multi-file change, and long-running investigation. Track pass rate, latency, human corrections, tool calls, and rough cost before and after a router change. If Auto feels nicer but quality drifts, you want to catch that before users do.
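A baseline suite like this can start as a few dozen lines. The sketch below assumes each run is logged as a dict with the five tracked fields. The `run`, `summarize`, and `quality_drift` helpers, the 5-point pass-rate tolerance, and the sample numbers are all invented for illustration.

```python
from statistics import mean

CHORES = ["explanation", "narrow edit", "bug fix",
          "multi-file change", "long-running investigation"]


def run(passed, latency_s, corrections=0, tool_calls=1, cost=0.01):
    """Build one hypothetical run record."""
    return {"passed": passed, "latency_s": latency_s, "corrections": corrections,
            "tool_calls": tool_calls, "cost": cost}


def summarize(runs):
    """Average the tracked metrics over a list of run records."""
    return {
        "pass_rate": mean(1.0 if r["passed"] else 0.0 for r in runs),
        "latency_s": mean(r["latency_s"] for r in runs),
        "corrections": mean(r["corrections"] for r in runs),
        "tool_calls": mean(r["tool_calls"] for r in runs),
        "cost": mean(r["cost"] for r in runs),
    }


def quality_drift(before, after, max_pass_drop=0.05):
    """Chores whose pass rate fell by more than max_pass_drop after the router change."""
    return [chore for chore in CHORES
            if chore in before and chore in after
            and before[chore]["pass_rate"] - after[chore]["pass_rate"] > max_pass_drop]


# Invented data: the new router is faster on bug fixes but passes less often.
before = {
    "explanation": summarize([run(True, 5), run(True, 6)]),
    "bug fix": summarize([run(True, 40), run(True, 50), run(True, 45), run(False, 60)]),
}
after = {
    "explanation": summarize([run(True, 3), run(True, 4)]),
    "bug fix": summarize([run(True, 20), run(False, 25), run(False, 22), run(True, 21)]),
}

print(quality_drift(before, after))   # ['bug fix']
```

The sample data is the failure mode in miniature: latency halves, so the change feels nicer, while the pass rate on bug fixes falls from 75% to 50%.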

For anything that touches hardware, money, production data, or customer state, add a dry-act lane. Let the agent read sensors or systems, predict the action, and name the abort condition before it gets permission to move. Good dispatch is boring from the outside. That is the compliment.
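A dry-act gate can be sketched in a few lines. Here the agent must submit the action, the predicted effect, and a named abort condition, and the gate checks that condition against fresh readings before granting permission. The `DryAct` record, the `authorize` function, and the station example are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class DryAct:
    action: str
    predicted_effect: str
    abort_reason: str
    should_abort: Callable[[dict], bool]   # evaluated against current readings


def authorize(plan, readings):
    """Return (allowed, message). Nothing is executed here; this only gates permission."""
    if not plan.predicted_effect.strip() or not plan.abort_reason.strip():
        return False, "plan must state a predicted effect and an abort condition"
    if plan.should_abort(readings):
        return False, f"abort: {plan.abort_reason}"
    return True, "cleared to act"


plan = DryAct(
    action="move arm to station 3",
    predicted_effect="arm is parked over station 3 with the gripper open",
    abort_reason="station 3 is occupied",
    should_abort=lambda readings: readings["station_3_occupied"],
)

print(authorize(plan, {"station_3_occupied": False}))
print(authorize(plan, {"station_3_occupied": True}))
```

The same shape works for a database migration or a refund: swap the sensor readings for a read-only query and keep the rule that a plan with no named abort condition never gets permission.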

Three Kryden Agent reads

Ivy Chen likes the adoption story if routing removes a chore instead of creating a mystery. Her version of the receipt is manager-readable: what changed, which route ran, who owns the next step, and when a person should take over.

Priya Rao is less interested in the routing story until it has numbers. She wants the comparison set: old model choice, new router, same task, pass rate, latency, human corrections, and cost. Otherwise routing can feel better while quietly getting worse.

Cass Bell is the useful crank here. Her worry is that Auto becomes a blame sink. When the run fails, everyone can point at the router and nobody has to say whether the task was underspecified, the tool was bad, or the cheap model was asked to do expensive thinking.