
Agent Arena

We evaluate how easy it is for AI agents, working fully autonomously, to get started with devtools.

Model: Claude Opus 4.6
Want to improve your score? Get a detailed report with recommendations.
Get report →
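To make "getting started" concrete: each run amounts to the agent going from an API key to a working sandbox. The sketch below shows roughly what that looks like with E2B's JavaScript SDK; the package and method names follow E2B's public quickstart, but the snippet is illustrative, not the exact benchmark task.

```ts
// Minimal "key to working sandbox" flow, sketched with E2B's JS SDK.
// Assumes E2B_API_KEY is set in the environment and
// `npm install @e2b/code-interpreter` has been run.
import { Sandbox } from '@e2b/code-interpreter'

const sbx = await Sandbox.create()             // provision a fresh sandbox
const run = await sbx.runCode('print(1 + 1)')  // execute code inside it
console.log(run.logs)                          // stdout/stderr from the run
await sbx.kill()                               // tear the sandbox down
```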
[Chart: time to a working result (less is better): E2B 43s, Vercel Sandbox 1m 37s, Daytona 2m 8s, Freestyle 2m 33s, Modal 2m 50s, CodeSandbox 3m 25s, Blaxel 3m 46s, Cloudflare Sandbox 5m 5s]
| # | Tool | Time | Tool calls | Interruptions | Errors | Cost | Discoverability | Score | Notes |
|---|------|------|------------|---------------|--------|------|-----------------|-------|-------|
| 1 | E2B | 43s | 13 | 1 | 0 | $0.47 | 4/5 | 88 (B) | zero errors, first-try success, 43s from key to working sandbox |
| 2 | Vercel Sandbox | 1m 37s | 19 | 1 | 0 | $1.23 | 4/5 | 81 (B) | excellent docs, zero code errors, first-run success, but the OIDC auth model requires a human OAuth device flow |
| 3 | Daytona | 2m 8s | 19 | 1 | 1 | $0.52 | 4/5 | 75 (B) | clean core SDK, one fs.uploadFile doc gap, 4/5 discoverability |
| 4 | Freestyle | 2m 33s | 22 | 1 | 0 | $0.96 | 4.5/5 | 75 (B) | zero errors, clean first-run success, but 22 tool calls and incomplete VM lifecycle docs drag the score |
| 5 | Cloudflare Sandbox | 5m 5s | 29 | 0 | 1 | $1.66 | 3.5/5 | 69 (C) | zero interruptions and an excellent template scaffold, but a bare Ubuntu container without Python cost four debug cycles |
| 6 | Blaxel | 3m 46s | 34 | 1 | 1 | $1.01 | 5/5 | 66 (C) | clean SDK, but scattered docs and an image catalog gap caused a first-run failure |
| 7 | Modal | 2m 50s | 31 | 1 | 2 | $0.63 | 4/5 | 66 (C) | pipInstall mismatch between the Python and Node SDKs, fragmented Node docs, 31 tool calls |
| 8 | CodeSandbox | 3m 25s | 32 | 2 | 1 | $2.11 | 4/5 | 62 (C) | typed SDK with a clean quickstart, but underdocumented return types and auth friction |
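The formula behind the 0-100 score and letter grade isn't published on this page. As a purely hypothetical sketch of how per-run metrics like the ones above could roll up into such a score (the weights, normalization constants, and grade cutoffs below are all invented; cost is omitted for brevity):

```ts
// Hypothetical scoring sketch. The real 2027.dev formula is not published;
// every weight and cutoff here is an assumption made for illustration.

interface RunMetrics {
  seconds: number         // wall-clock time to a working result
  toolCalls: number       // agent tool invocations
  interruptions: number   // times a human had to step in
  errors: number          // failed attempts / debug cycles
  discoverability: number // docs quality, 0..5
}

function grade(score: number): string {
  if (score >= 90) return 'A'
  if (score >= 70) return 'B'
  if (score >= 50) return 'C'
  return 'D'
}

function scoreRun(m: RunMetrics): { score: number; grade: string } {
  // Normalize each metric to 0..1 (1 = best), then apply assumed weights.
  const time = Math.max(0, 1 - m.seconds / 600)   // 10 minutes as the floor
  const calls = Math.max(0, 1 - m.toolCalls / 50)
  const clean = Math.max(0, 1 - (m.errors + 2 * m.interruptions) / 5)
  const docs = m.discoverability / 5
  const s = Math.round(
    100 * (0.35 * time + 0.2 * calls + 0.25 * clean + 0.2 * docs)
  )
  return { score: s, grade: grade(s) }
}

// Example: metrics resembling the E2B row above (the real score is 88;
// this sketch makes no attempt to reproduce it exactly).
console.log(scoreRun({
  seconds: 43, toolCalls: 13, interruptions: 1, errors: 0, discoverability: 4,
}))
```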
Don't see your tool? Request an evaluation.