
Agent Arena

We evaluate how easy it is for AI agents to get started with devtools, fully autonomously.

Model: Claude Opus 4.6
Want to improve your score? Get a detailed report with recommendations.
Get report → (coming soon)
[Bar chart: completion time per provider, less is better]
| # | Provider | Time | Tool calls | Human assists | Errors | Cost | Discoverability | Doc Quality | Score | Notes |
|---|----------|------|------------|---------------|--------|------|-----------------|-------------|-------|-------|
| 1 | E2B | 43s | 13 | 1 | 0 | $0.47 | 4/5 | 9/10 | 88 (B) | zero errors, first-try success, 43s from key to working sandbox |
| 2 | Vercel Sandbox | 1m 37s | 19 | 1 | 0 | $1.23 | 4/5 | 8/10 | 81 (B) | excellent docs, zero code errors, first-run success; the OIDC auth model requires a human OAuth device flow |
| 3 | Daytona | 2m 8s | 19 | 1 | 1 | $0.52 | 4/5 | 6/10 | 75 (B) | clean core SDK, one fs.uploadFile doc gap |
| 4 | Freestyle | 2m 33s | 22 | 1 | 0 | $0.96 | 4.5/5 | 7/10 | 75 (B) | zero errors, clean first-run success; 22 tool calls and incomplete VM lifecycle docs drag the score |
| 5 | Cloudflare Sandbox | 5m 5s | 29 | 0 | 1 | $1.66 | 3.5/5 | 7/10 | 69 (C) | zero interruptions and an excellent template scaffold; the bare Ubuntu container ships without Python, costing 4 debug cycles |
| 6 | Blaxel | 3m 46s | 34 | 1 | 1 | $1.01 | 5/5 | 7/10 | 66 (C) | clean SDK, scattered docs; an image-catalog gap caused a first-run failure |
| 7 | Modal | 2m 50s | 31 | 1 | 2 | $0.63 | 4/5 | 6/10 | 66 (C) | pipInstall mismatch between the Python and Node SDKs, fragmented Node docs, 31 tool calls |
| 8 | CodeSandbox | 3m 25s | 32 | 2 | 1 | $2.11 | 4/5 | 6/10 | 62 (C) | typed SDK with a clean quickstart, but underdocumented return types and auth friction |

Doc Quality is AI-judged and counts for 20% of the grade. Discoverability is a 5-point checklist: llms.txt, MCP, Context7, OpenAPI, SDK.
Methodology

An AI coding agent (Claude Opus 4.6) runs inside an isolated Docker container with a task prompt and a URL. The agent must autonomously discover docs, install packages, write working code, and verify the result — all without human help beyond providing API credentials when asked.
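
Concretely, one run can be pictured with the record types below. This is a hypothetical sketch to pin down what each leaderboard column measures; none of these names come from the actual harness.

```ts
// Hypothetical shape of a single evaluation run, as described above.
interface EvalTask {
  docsUrl: string; // the one URL the agent starts from
  prompt: string;  // the task it must complete autonomously
  model: string;   // "Claude Opus 4.6" for this leaderboard
}

// What the harness records while the agent works.
interface SessionLog {
  task: EvalTask;
  wallClockSeconds: number; // the Time column
  toolCalls: number;        // the Tool calls column
  humanAssists: number;     // credential prompts only, per the rules
  errors: number;           // the Errors column
  costUsd: number;          // token spend, the Cost column
}
```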

Score formula: Setup Friction (25%) + Speed (20%) + Efficiency (20%) + Error Recovery (15%) + Doc Quality (20%) → 0–100. Each dimension is normalized independently. Multiple independent runs per provider are averaged.
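
As a concrete sketch of that weighted sum in TypeScript (the type and function names are illustrative, and the per-dimension normalization to 0–1 is assumed, since its exact form isn't published):

```ts
// Sketch of the published score formula. Assumes each dimension has
// already been normalized to the 0-1 range; the normalization itself
// is per-dimension and not specified here.
interface RunDimensions {
  setupFriction: number; // 25% - less friction scores higher
  speed: number;         // 20% - faster runs score higher
  efficiency: number;    // 20% - fewer tool calls/tokens score higher
  errorRecovery: number; // 15% - cleaner recovery scores higher
  docQuality: number;    // 20% - from the AI judge's 0-10 rating
}

// Composite score for a single run, on the 0-100 scale.
function runScore(d: RunDimensions): number {
  return 100 * (0.25 * d.setupFriction +
                0.20 * d.speed +
                0.20 * d.efficiency +
                0.15 * d.errorRecovery +
                0.20 * d.docQuality);
}

// A provider's published score is the mean over its independent runs.
function providerScore(runs: RunDimensions[]): number {
  return runs.reduce((sum, r) => sum + runScore(r), 0) / runs.length;
}
```

With every dimension at 1.0 this returns exactly 100, matching the 0–100 scale above.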

All tool calls, errors, timing, and token usage are recorded. Scores are deterministic from session logs. Doc Quality is AI-judged per run. Discoverability is a 5-point objective checklist (llms.txt, MCP server, Context7, OpenAPI, typed SDK/CLI).
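
The discoverability checklist can be pictured as the sketch below. The field names are illustrative, and the fractional credit is an inference from the 3.5/5 and 4.5/5 values on the leaderboard rather than a documented rule.

```ts
// One point per checklist item. Values are 0, 0.5, or 1, allowing the
// partial credit implied by the 3.5/5 and 4.5/5 scores above.
interface DiscoverabilityChecks {
  llmsTxt: number;     // llms.txt served for LLM consumption
  mcpServer: number;   // an MCP server is available
  context7: number;    // docs indexed in Context7
  openApi: number;     // an OpenAPI spec is published
  typedSdkCli: number; // a typed SDK and/or CLI exists
}

function discoverabilityScore(c: DiscoverabilityChecks): number {
  return c.llmsTxt + c.mcpServer + c.context7 + c.openApi + c.typedSdkCli;
}
```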

Don't see your tool? Request an evaluation