Make something agents want

Mika Sagindyk · Mar 12, 2026

Marc Andreessen famously said that software was eating the world. Today, it's agents. Still software, but much more intelligent and far more autonomous; software so capable it can complete complex tasks without human oversight.

We're increasingly moving towards a reality where agents operate on behalf of humans. And while it's astonishing to see what AI can accomplish, agents can find it surprisingly hard to operate in systems made for humans. In a world where agents act on our behalf, what do good systems even look like?

We noticed something strange

Earlier this year, my co-founder and I were experimenting with coding agents. We would prompt them to complete tasks for us, in different scenarios and with different tools. After a few experiments, we started noticing a pattern -- coding agents recommended devtools to us, and it wasn't clear why they selected some over others. It wasn't just SEO -- one agent literally picked a less well-known auth solution because it had "better DX". We were mind-blown.

We wanted to know whether this was pure coincidence, so we rolled up our sleeves and tested how easily AI agents can get started with different tools, fully autonomously and without human oversight. The idea was simple: if an agent can't set up a tool on its own, it simply won't be able to use it. For this experiment, we focused on six factors that measure AX (Agent Experience): setup speed, tool calls made, setup friction, error recovery, docs discoverability, and API cost.

After analyzing 60+ devtools across databases, auth, sandboxes, video/image gen, and search, we started noticing patterns. Tools like Neon, E2B and Tavily scored top grades: agents set them up in under two minutes with zero human intervention. Others took 10+ minutes, threw cryptic errors, and required multiple human rescues before limping across the finish line. It was obvious that certain teams had invested in a better experience for coding agents, and that effort compounded.

AX evaluations of sandboxes

It's not as easy as it seems

We also learned that much of this knowledge isn't obvious. Designing systems for agents isn't something our human brains have been trained to do. We have what some would call the "curse of knowledge": it's innately hard for us to unlearn what we already know, to unsee what we already see.

But agents are wired completely differently from us. They have no prior experience to rely on, and they get stuck in unexpected ways. When a human hits an error, we experiment and google it. When an agent hits an unknown error, it gets stuck in a loop. We watched one agent retry the same failing command 14 times before giving up, burning through $3 in API costs on a task that should have cost $0.50.

The AX tradeoff

The tech community seems to be split on whether AX is even a good idea.

@biilmann, the founder of Netlify, coined the term Agent Experience in early 2025, defining it as "the holistic experience AI agents will have as the user of a product or platform". @nikitabase started assembling a team to ensure @neondatabase caters to AI agents around that time too; similarly, many operators highlight the importance of frictionless onboarding for agents.

Tweets from @biilmann and @nikitabase about Agent Experience

There's a second camp of people saying that friction in onboarding is a feature, not a bug. When we dug deeper, we uncovered a certain fear: autonomous signup can also lead to autonomous spam. If an agent can sign up, so can a bot farm, as @zeeg points out.

@zeeg's tweet about bot/spam concerns

It's a very valid point that needs a thoughtful approach. The question is, how do you make setup easy for agents without making it easy for attackers?

@JavaSquip of @Netlify offers a useful distinction: bots extract value without giving back, while agents maintain mutual value exchange -- they're extensions of real users, not external actors. AX is centered around these user-delegated agents, not bots.

The technical challenge of enforcing this distinction isn't fully solved. But one thing is clear: not serving agents means not serving users who use agents.

And in running 60+ evals, we noticed something: the patterns that make agents successful, like clear auth and explicit errors, also create better audit trails and boundaries. It turns out that good AX and good security aren't opposites.

What we have learned about AX

We decided to publish our benchmark and learnings on the Agent Arena, and it blew up. Within a week, dozens of developers and devtools were messaging us, asking for best practices on how to evaluate AX and optimize for agents. So here is what we learned from running 60+ evals:

1/ Docs are priority #1

68% of agent failures started with documentation problems.

Unlike humans, agents read your docs literally. Every broken link, outdated example, or scattered guide creates friction. We saw agents fail because examples on GitHub didn't match the released SDK: someone had written sample code for a past version and never updated it. Outdated examples turned out to be worse than no examples at all.

The best-performing tools had one source of truth: not information scattered across GitHub and landing pages or conflicting instructions between the quickstart and the reference docs – just one clear path from zero to working.

2/ Authentication is the biggest friction point

Auth issues caused 40% of all agent interruptions.

The smoothest tools accepted API keys through environment variables, while OAuth flows often paused agents mid-task; multi-step authentication caused the most failures. The tools that said "no setup, no API keys, just install and go" consistently outperformed everything else. If your getting-started flow requires a human to click through a browser, you've already lost the agent.

3/ Error messages make or break recovery

Agents recovered from clear errors in under 30 seconds. Cryptic errors caused 5+ minute spirals.

We watched one agent retry the same failing command 14 times before giving up -- burning through API costs on a task that should have been trivial. The error gave it nothing to work with. No cause, no fix, no link.

Then we watched a different agent hit an error, read the message, follow the link, fix the issue, and move on in half a minute. The difference wasn't the agent, it was the error.

Actionable errors have three things: 1/ what went wrong, 2/ how to fix it, and 3/ a link to relevant docs. If you get those three right, agents become surprisingly resilient users of your tool.
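As a sketch of that three-part structure, here's a hypothetical error type (the `SetupError` name and URLs are illustrative, not from any real SDK) that bundles cause, fix, and docs link into one message an agent can act on:

```python
class SetupError(Exception):
    """Error carrying the three things agents need: cause, fix, docs link."""

    def __init__(self, cause: str, fix: str, docs_url: str):
        self.cause = cause
        self.fix = fix
        self.docs_url = docs_url
        # One message, three actionable parts -- no dead ends.
        super().__init__(f"{cause}\nFix: {fix}\nDocs: {docs_url}")


err = SetupError(
    cause="Database 'main' does not exist.",
    fix="Run `example db create main` or set EXAMPLE_DB to an existing database.",
    docs_url="https://example.com/docs/errors#db-not-found",
)
```

An agent that sees this message has a command to run, a variable to set, and a page to read; the 30-second recovery path is right there in the string.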

4/ CLIs often assume a human is present

Most command-line tools don't support non-interactive mode; they expect someone to answer prompts, make choices, and confirm actions. We watched agents try to reverse-engineer CLIs because the interactive flows couldn't be automated.

We believe making non-interactive mode the default is the way to go. If your CLI can't run without a human present, it can't run with an agent either.
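One way to get there, sketched here with Python's standard `argparse` (the `example` CLI and its flags are hypothetical): make every choice available as a flag with a sensible default, and only prompt when a human explicitly opts in and a terminal is actually attached.

```python
import argparse
import sys


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="example")
    # Non-interactive is the default; humans opt in to prompts.
    parser.add_argument("--interactive", action="store_true",
                        help="prompt for confirmation (off by default)")
    # Every choice a prompt would ask for is also a flag with a default.
    parser.add_argument("--project", default="default",
                        help="project name, so no prompt is needed to pick one")
    return parser


def confirm(args: argparse.Namespace, question: str) -> bool:
    # Never hang waiting for input when an agent or CI is driving.
    if not args.interactive or not sys.stdin.isatty():
        return True
    return input(f"{question} [y/N] ").lower().startswith("y")
```

With this shape, `example --project myapp` runs end to end unattended, and `example --interactive` still gives humans the guided flow.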

5/ Discoverability determines whether agents find you at all

Only 23% of tools we evaluated had an llms.txt that agents could actually find -- and those tools scored 35% higher on average.

Machine-readable formats like llms.txt, OpenAPI specs, and MCP servers are only useful if an agent can locate them in the first place. We watched agents fail not because documentation didn't exist, but because they landed on the wrong page and never found the right one. The file was there, but the agent just couldn't reach it.

It turns out, machine readability and findability are two inherently different problems. You can have a perfect llms.txt and still lose agents if it's buried on a docs subdomain instead of your root. The tools that scored highest solved both: structured formats that were also easy to discover.
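To make the findability problem concrete, here's a small sketch of the probe order an agent might use when hunting for an llms.txt (the candidate paths are our assumptions about common placements, not a standard): the root comes first, which is exactly why burying the file on a docs subpath loses agents.

```python
from urllib.parse import urljoin


def llms_txt_candidates(base_url: str) -> list[str]:
    """Ordered locations an agent might probe for llms.txt, root first."""
    return [
        urljoin(base_url, "/llms.txt"),       # serve it here: root of your domain
        urljoin(base_url, "/llms-full.txt"),  # expanded variant, also at root
        urljoin(base_url, "/docs/llms.txt"),  # common placement, easy to miss
    ]
```

Whatever page the agent lands on, resolving against the root first means a root-level file is found immediately; a file that only exists deep in a docs tree depends on the agent guessing the right subpath.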

If an agent can't find you, it can't use you. And if it can't use you, it won't recommend you.

Optimize for zero friction

The point of the exercise is to measure autonomous completion rate, so we focused on a single task: can an agent complete setup without asking a human for help?

To evaluate this, we counted interruptions -- every moment the agent pauses and turns to the user with "I need an API key" or "this error doesn't make sense to me." The best tools have zero interruptions, where the agent reads the docs, writes the code, handles errors, and finishes without human involvement.
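The metric itself is easy to sketch. This is a simplified, hypothetical version of the kind of run log we kept (field names and event format are illustrative, not our actual harness): every time the agent turns to the human counts as one interruption, and zero is the goal.

```python
from dataclasses import dataclass, field


@dataclass
class SetupRun:
    """Minimal log of one autonomous setup attempt."""
    tool: str
    events: list[str] = field(default_factory=list)

    def interruptions(self) -> int:
        # Each moment the agent turned to the human counts as one interruption.
        return sum(1 for e in self.events if e.startswith("ask_human:"))


run = SetupRun(tool="example-db", events=[
    "read_docs",
    "write_code",
    "ask_human:need API key",
    "run_tests",
])
```

A perfect run has no `ask_human:` events at all; averaging `interruptions()` across runs gives a per-tool score like the ones below.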

For the record: the top 10 tools averaged 0.2 interruptions, while the bottom 10 averaged 4.7.

The truth is, if an agent can't set up your tool autonomously, it won't be able to use you, and it won't recommend you to users in the future. And agents are increasingly becoming how developers discover tools.

Better for agents, better for everyone

What surprised us most while running AX evals is one simple insight: optimizing for agents makes documentation better for humans too.

It turns out that concrete examples, clearer instructions, and error messages that actually tell you what to do don't just make AX great; they also tremendously improve DX.

The things that trip up agents, like outdated examples or scattered docs, also confuse human developers. The difference is that humans power through: we're simply used to bad docs and looking up error messages.

Agents don't have that resilience; they expose every friction point, assumption, and gap.

In a strange way, agents are the most honest feedback you can get about your DX. They can't make excuses for you.

What's next

We're entering an era where developer experience isn't just for developers. Agents are reading your docs and calling your APIs. And when things get hard, they will recommend your competitors instead.

The tools that thrive will be the ones that are easy to discover and simple to set up, while being thoughtful about the abuse risks that come with smooth access.

But one thing remains clear: the bar for developer experience just got higher, and meeting it makes your tools better for everyone.

You can find the live ranking of AX evals at 2027.dev/arena. If you're optimizing for agents, we'd love to hear what's working and what's not. The field is still early: if you have ideas on improving our benchmark or measuring AX more accurately, we're all ears! My DMs are open.

Let's make something agents want. We humans will benefit from it too.