We evaluate how easily AI agents can get started with devtools, fully autonomously.
| Rank | Provider | Time | Tool Calls | Interruptions | Errors | Cost | Discoverability | Doc Quality (AI-judged) | Score | Notes |
|---:|---|---:|---:|---:|---:|---:|---|---|---|---|
| 1 | E2B | 43s | 13 | 1 | 0 | $0.47 | 4/5 | 9/10 | 88 (B) | Zero errors, first-try success, 43s from key to working sandbox |
| 2 | Vercel Sandbox | 1m 37s | 19 | 1 | 0 | $1.23 | 4/5 | 8/10 | 81 (B) | Excellent docs, zero code errors, first-run success — but OIDC auth model requires human OAuth device flow |
| 3 | Daytona | 2m 8s | 19 | 1 | 1 | $0.52 | 4/5 | 6/10 | 75 (B) | Clean core SDK, one fs.uploadFile doc gap, 4/5 discoverability |
| 4 | Freestyle | 2m 33s | 22 | 1 | 0 | $0.96 | 4.5/5 | 7/10 | 75 (B) | Zero errors, clean first-run success — but 22 tool calls and incomplete VM lifecycle docs drag the score |
| 5 | Cloudflare Sandbox | 5m 5s | 29 | 0 | 1 | $1.66 | 3.5/5 | 7/10 | 69 (C) | Zero interruptions and excellent template scaffold — but bare Ubuntu container without Python cost 4 debug cycles |
| 6 | Blaxel | 3m 46s | 34 | 1 | 1 | $1.01 | 5/5 | 7/10 | 66 (C) | Clean SDK, scattered docs, image catalog gap caused first-run failure |
| 7 | Modal | 2m 50s | 31 | 1 | 2 | $0.63 | 4/5 | 6/10 | 66 (C) | pipInstall SDK mismatch between Python and Node SDK, fragmented Node docs, 31 tool calls |
| 8 | CodeSandbox | 3m 25s | 32 | 2 | 1 | $2.11 | 4/5 | 6/10 | 62 (C) | Typed SDK with clean quickstart, but underdocumented return types and auth friction |
An AI coding agent (Claude Opus 4.6) runs inside an isolated Docker container with a task prompt and a URL. The agent must autonomously discover docs, install packages, write working code, and verify the result — all without human help beyond providing API credentials when asked.
Score formula: Setup Friction (25%) + Speed (20%) + Efficiency (20%) + Error Recovery (15%) + Doc Quality (20%) → 0–100. Each dimension is normalized independently. Multiple independent runs per provider are averaged.
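The weighted sum above can be sketched as a few lines of Python. This is a minimal illustration of the published formula, assuming each dimension has already been normalized to the 0–1 range; the function and field names are ours, not the benchmark's actual code.

```python
# Weights taken directly from the published score formula (sum to 1.0).
WEIGHTS = {
    "setup_friction": 0.25,
    "speed": 0.20,
    "efficiency": 0.20,
    "error_recovery": 0.15,
    "doc_quality": 0.20,
}

def score(dimensions: dict[str, float]) -> float:
    """Weighted sum of normalized (0-1) dimensions, scaled to 0-100."""
    return round(100 * sum(WEIGHTS[k] * v for k, v in dimensions.items()), 1)

def average_score(runs: list[dict[str, float]]) -> float:
    """Average per-run scores across multiple independent runs of one provider."""
    return round(sum(score(r) for r in runs) / len(runs), 1)
```

A provider scoring perfectly on every dimension in every run would land at 100; averaging across runs smooths out single-run flukes like a transient auth failure.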
All tool calls, errors, timing, and token usage are recorded. Scores are deterministic from session logs. Doc Quality is AI-judged per run. Discoverability is a 5-point objective checklist (llms.txt, MCP server, Context7, OpenAPI, typed SDK/CLI).
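The discoverability checklist can be sketched as a simple tally. Since the leaderboard shows half-point scores (4.5/5, 3.5/5), this sketch assumes partial credit of 0, 0.5, or 1 per item; the item keys are illustrative, not the benchmark's actual schema.

```python
# The five objective checklist items named in the methodology.
CHECKLIST = ("llms_txt", "mcp_server", "context7", "openapi_spec", "typed_sdk_or_cli")

def discoverability(items: dict[str, float]) -> float:
    """Sum per-item credit (0, 0.5, or 1 each), capped at the 5-point maximum."""
    return min(5.0, sum(items.get(k, 0.0) for k in CHECKLIST))
```

Because the checklist is objective, this part of the grade is reproducible without re-running the agent, unlike the AI-judged Doc Quality dimension.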