TLDR: We built a public docs benchmark that measures how readable your documentation is for AI agents. We had internal tools for this, but decided to make it transparent and vendor-neutral. The benchmark tests discovery, structure, task-readiness, and clean delivery—following patterns from the Agent-Friendly Documentation Spec.
AI agents are reading documentation more than humans now.
Every day, agents crawl docs sites to understand APIs, find integration examples, and figure out authentication flows. When they can't parse your docs cleanly, they fail at tasks that should be straightforward. The problem isn't model capability—it's interface design.
We've been tracking this at Docsalot through internal agent traffic analysis. Most documentation failures happen at basic structural levels: no machine-readable entry points, HTML-only content, missing examples, unclear auth requirements.
We had closed internal benchmarks for analyzing this, but realized the broader ecosystem needed transparent measurement tools.
Most AI benchmarks have the same problem: vendors run them and also want to win them. This doesn't make them dishonest, but it makes them hard to trust. If you define the scoring, pick the tests, and rank your own framework first, you're writing a product page, not running a benchmark.
We ran into this while building Docsalot's public docs benchmark. The obvious move: ship a flattering leaderboard, rank our stack high, call ourselves the best. The incentives are clear: benchmarks are free distribution.
That felt wrong.
Software has machine users now. We need measurements that reveal actual machine-readability, not which vendor wrote the test.
Narrow claims are useful claims
This isn't a benchmark for "how good AI agents are" or "whether an agent can use your product end to end."
It tests one thing: can an agent discover your docs, fetch them in machine-readable format, understand the structure, and find enough information to proceed without guessing?
Many "AI foundation model failures" are actually interface failures:
- no machine-readable entry point
- HTML-only docs with no clean markdown path
- navigation and cookie banners mixed into machine-facing content
- no clear prerequisites or auth constraints
- no examples
- no troubleshooting or recovery information
Humans skim and infer. Agents don't.
What we actually score
The public benchmark uses four weighted buckets.
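Roughly, each bucket produces a sub-score and the overall number is a weighted combination of the four. The sketch below uses made-up weights purely for illustration; the real weights live in the published methodology version.

```python
# Illustrative only: these weights are placeholders, not the published ones.
BUCKET_WEIGHTS = {
    "discovery_delivery": 0.30,
    "content_structure": 0.25,
    "task_readiness": 0.25,
    "access_recovery": 0.20,
}

def overall_score(bucket_scores: dict[str, float]) -> float:
    """Combine per-bucket scores (each 0-100) into one weighted total."""
    return sum(BUCKET_WEIGHTS[name] * bucket_scores[name] for name in BUCKET_WEIGHTS)
```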
1. Discovery and delivery
This bucket checks whether the docs expose basic machine-readable entry points and retrievable text:
- does /llms.txt exist
- does /llms-full.txt exist
- do sampled documentation pages have valid .md versions
We sample from llms.txt links when available, otherwise fall back to /getting-started, /guide, /api, /reference.
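A minimal sketch of this bucket's checks, assuming the requests library; the helper names and the crude link extraction are ours for illustration, not the benchmark's actual code.

```python
import re
import requests

FALLBACK_PATHS = ["/getting-started", "/guide", "/api", "/reference"]

def ok(url: str) -> bool:
    """True if the URL returns a 2xx response."""
    try:
        return requests.get(url, timeout=10).ok
    except requests.RequestException:
        return False

def check_discovery(base: str) -> dict:
    """Check machine-readable entry points and .md versions of sampled pages."""
    result = {
        "llms_txt": ok(f"{base}/llms.txt"),
        "llms_full_txt": ok(f"{base}/llms-full.txt"),
        "md_pages": {},
    }
    # Sample links from llms.txt when it exists; otherwise use common doc paths.
    urls = [f"{base}{p}" for p in FALLBACK_PATHS]
    if result["llms_txt"]:
        body = requests.get(f"{base}/llms.txt", timeout=10).text
        linked = re.findall(r"\]\((\S+?)\)", body)  # markdown link targets
        urls = [u if u.startswith("http") else f"{base}{u}" for u in linked[:5]] or urls
    for url in urls:
        result["md_pages"][url] = ok(url.rstrip("/") + ".md")
    return result
```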
2. Content structure
This bucket checks whether llms.txt is actually usable, not merely present:
- does it have an H1 title
- does it include a blockquote description
- does it use meaningful H2 sections
- do list items contain links
- do those links include descriptions
- is there an ## Optional section for lower-priority material
This follows patterns from The Agent-Friendly Documentation Spec (AFDocs) and its GitHub repository. The idea is simple: if a machine-readable index exists, it should help agents prioritize instead of forcing blind exploration.
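As a sketch, here is what those structural checks might look like over a fetched llms.txt body. The field names and regexes are illustrative; AFDocs is the authority on the expected format.

```python
import re

def check_llms_txt_structure(body: str) -> dict:
    """Heuristic structural checks over an llms.txt body (illustrative only)."""
    lines = body.splitlines()
    list_items = [ln for ln in lines if ln.lstrip().startswith("- ")]
    linked_items = [ln for ln in list_items if re.search(r"\[.+?\]\(.+?\)", ln)]
    return {
        "has_h1_title": any(re.match(r"^# \S", ln) for ln in lines),
        "has_blockquote_description": any(ln.startswith("> ") for ln in lines),
        "has_h2_sections": any(re.match(r"^## \S", ln) for ln in lines),
        "list_items_have_links": bool(list_items) and len(linked_items) == len(list_items),
        # A described link looks like "- [Title](url): short description".
        "links_have_descriptions": bool(linked_items) and all(
            re.search(r"\)\s*:\s*\S", ln) for ln in linked_items
        ),
        "has_optional_section": any(re.match(r"^## Optional\b", ln) for ln in lines),
    }
```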
3. Task readiness
This bucket checks whether the docs contain the kinds of details agents repeatedly need to complete real tasks:
- a clear opening description of what the product actually does
- prerequisites and constraints
- integration information
- troubleshooting or error-handling material
- concrete examples or code blocks
Pretty docs often fail here. Sites look polished but still force agents to guess about auth, setup, limits, and errors.
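These signals are necessarily heuristic. Here is a sketch using keyword and code-block detection over a page's markdown, with placeholder keyword lists of our own:

```python
import re

# Placeholder keyword lists; the real benchmark's signals are defined in its methodology.
SIGNALS = {
    "prerequisites": ["prerequisite", "before you begin", "requirements"],
    "auth_constraints": ["api key", "token", "authentication", "rate limit"],
    "integration": ["install", "sdk", "endpoint", "webhook"],
    "troubleshooting": ["troubleshoot", "error", "status code", "retry"],
}

def check_task_readiness(markdown: str) -> dict:
    """Heuristic signals that a doc gives agents enough to proceed without guessing."""
    text = markdown.lower()
    result = {name: any(kw in text for kw in kws) for name, kws in SIGNALS.items()}
    # A clear opening description: some real prose before the first H2 section.
    intro = markdown.split("\n## ", 1)[0]
    result["has_opening_description"] = len(intro.strip()) > 100
    # Concrete examples: fenced code blocks are the strongest signal.
    result["has_code_examples"] = bool(re.search(r"```", markdown))
    return result
```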
4. Access and recovery
This bucket checks whether the content is delivered in a form agents can actually use:
- do markdown pages return valid markdown rather than HTML or error pages
- are responses reasonably fast
- is the main content present in the initial HTML, or does it effectively require client-side JavaScript
- do pages support Accept: text/markdown
- is markdown output clean, or polluted with nav, footer, and consent cruft
Clean matters. Markdown with nav cruft is technically readable but practically noisy.
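A sketch of the delivery checks: fetch a page's markdown variant with an Accept header, time the response, confirm it is markdown rather than HTML, and scan for navigation cruft. The threshold and cruft markers below are assumptions, not the benchmark's actual values.

```python
import time
import requests

# Assumed threshold and cruft markers, for illustration only.
SLOW_SECONDS = 3.0
CRUFT_MARKERS = ["cookie", "accept all", "skip to content", "sign in"]

def check_delivery(md_url: str) -> dict:
    """Check that a page's markdown variant is fast, really markdown, and clean."""
    start = time.monotonic()
    try:
        resp = requests.get(md_url, headers={"Accept": "text/markdown"}, timeout=10)
    except requests.RequestException:
        return {"reachable": False}
    elapsed = time.monotonic() - start
    body = resp.text
    return {
        "reachable": resp.ok,
        "fast_enough": elapsed < SLOW_SECONDS,
        "honors_accept_header": "markdown" in resp.headers.get("Content-Type", ""),
        # HTML leaking through usually means there is no real markdown path.
        "is_markdown_not_html": not body.lstrip().lower().startswith(("<!doctype", "<html")),
        "clean_of_cruft": not any(m in body.lower() for m in CRUFT_MARKERS),
    }
```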
The methodology has to be inspectable
Benchmarks need transparency. We store breakdowns, individual checks, and methodology versions. External scores stay labeled as external.
Leaderboards collapse everything into a single number, but the failure pattern matters more than the score. Did you lose points for a missing llms.txt, dirty markdown, or no examples?
We disable domain-specific adjustments in benchmark mode. Our score widget has host-specific tweaks for user experience, but public benchmarks can't have hidden exceptions.
Benchmark mode is stricter by design.
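For illustration, this is the kind of record worth persisting per run so that failure patterns stay inspectable; the field names are ours, not the stored schema.

```python
from dataclasses import dataclass, field

@dataclass
class BenchmarkRun:
    """One stored benchmark result: enough to see why a score is what it is."""
    domain: str
    methodology_version: str   # which version of the checks produced this
    benchmark_mode: bool       # True means no domain-specific adjustments
    external: bool             # externally submitted scores stay labeled as such
    bucket_scores: dict[str, float] = field(default_factory=dict)
    individual_checks: dict[str, bool] = field(default_factory=dict)
```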
Why vendor neutrality matters
Neutrality isn't about moral purity—it's about diagnostic value. If your stack always wins, teams learn what you sell, not what works.
A credible benchmark needs to be willing to produce uncomfortable outputs:
- another docs stack can beat yours
- your own customers can outrank you
- open-source sites can outperform commercial platforms
- a site with plain design can beat a visually polished one because the machine-facing delivery is better
That is useful.
It creates a feedback loop around operability rather than brand preference.
What this benchmark still misses
The benchmark has limits. We don't run full workflows, validate code samples, or exercise live APIs. It measures docs operability, not software operability.
Still worth doing—docs are often the first interface agents hit. But we need separate benchmarks for CLI, API, auth, and recovery flows.
Seeing agent failures from missing structure makes the broader pattern obvious: this will happen in auth, setup, and recovery too.
Why this matters now
Software has machine users now. Buying, onboarding, support workflows start with agents reading docs.
If that first contact fails, adoption suffers before humans get involved.
We need benchmarks that are narrow, explicit, and neutral. Not because scores matter, but because visible failure modes give teams something to fix.