We evaluate how easy it is for a fully autonomous AI agent to get started with devtools.
An AI coding agent (Claude Opus 4.6) runs inside an isolated Docker container with a task prompt and a URL. The agent must autonomously discover docs, install packages, write working code, and verify the result — all without human help beyond providing API credentials when asked.
How rankings work: providers in the same category are ranked against each other on four dimensions — Time, Cost, Errors, and Interruptions. Within a category, the per-dimension rankings combine into an overall position. Less time, lower cost, fewer errors, and fewer interruptions all improve a provider's standing. No composite score, no letter grades.
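The ranking scheme above can be sketched in a few lines of Python. This is a hypothetical illustration, not the benchmark's actual implementation: the source does not say how per-dimension rankings combine into an overall position, so mean rank (with alphabetical tie-breaking) is assumed here, and the provider names and numbers are invented.

```python
def rank_providers(metrics):
    """Rank providers in one category.

    metrics: {provider: {"time": ..., "cost": ..., "errors": ..., "interruptions": ...}}
    Lower is better on every dimension.
    """
    dims = ["time", "cost", "errors", "interruptions"]
    ranks = {p: [] for p in metrics}
    for dim in dims:
        # Rank 1 goes to the best (lowest) value on this dimension.
        ordered = sorted(metrics, key=lambda p: metrics[p][dim])
        for position, provider in enumerate(ordered, start=1):
            ranks[provider].append(position)
    # Combine per-dimension ranks into an overall position.
    # ASSUMPTION: mean rank, ties broken alphabetically by provider name.
    return sorted(metrics, key=lambda p: (sum(ranks[p]) / len(dims), p))

# Invented example data for three providers in one category.
providers = {
    "alpha": {"time": 120, "cost": 0.8, "errors": 1, "interruptions": 0},
    "beta":  {"time": 90,  "cost": 1.2, "errors": 3, "interruptions": 1},
    "gamma": {"time": 200, "cost": 0.5, "errors": 0, "interruptions": 0},
}
```

With these numbers, `alpha` and `gamma` tie on mean rank and the alphabetical tie-break puts `alpha` first, ahead of `gamma`, then `beta`. Any rank-combining rule with the same monotonicity (less time, lower cost, fewer errors, fewer interruptions all help) would preserve the property that no single composite score or letter grade is exposed.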
All tool calls, errors, timing, and token usage are recorded. Rankings are deterministic from session logs. Multiple independent runs per provider are aggregated.
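Aggregation across runs might look like the following sketch. The source does not specify the aggregate function, so the median is assumed here for robustness to outlier runs; the metric names mirror the four dimensions above and the run data is invented.

```python
from statistics import median

def aggregate_runs(runs):
    """Collapse multiple independent runs for one provider into one metric dict.

    runs: list of per-run dicts with the four ranking dimensions.
    ASSUMPTION: per-dimension median; the benchmark's real aggregate is unspecified.
    """
    dims = ["time", "cost", "errors", "interruptions"]
    return {dim: median(run[dim] for run in runs) for dim in dims}

# Invented example: three independent runs for one provider.
runs = [
    {"time": 100, "cost": 0.9, "errors": 1, "interruptions": 0},
    {"time": 140, "cost": 1.1, "errors": 2, "interruptions": 1},
    {"time": 120, "cost": 1.0, "errors": 1, "interruptions": 0},
]
```

Because the aggregate is a pure function of the recorded session logs, re-running it over the same logs always reproduces the same rankings, which is what makes them deterministic.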