TLDR: We built a public docs benchmark that measures how readable your documentation is for AI agents. We had internal tools for this, but decided to make it transparent and vendor-neutral. The benchmark tests discovery, structure, task-readiness, and clean delivery—following patterns from the Agent-Friendly Documentation Spec.
AI agents are reading documentation more than humans now.
Every day, agents crawl docs sites to understand APIs, find integration examples, and figure out authentication flows. When they can't parse your docs cleanly, they fail at tasks that should be straightforward. The problem isn't model capability—it's interface design.
We've been tracking this at Docsalot through internal agent traffic analysis. Most documentation failures happen at basic structural levels: no machine-readable entry points, HTML-only content, missing examples, unclear auth requirements.
We had closed internal benchmarks for analyzing this, but realized the broader ecosystem needed transparent measurement tools.
Most AI benchmarks have the same problem: vendors run them and also want to win them. This doesn't make them dishonest, but it makes them hard to trust. If you define the scoring, pick the tests, and rank your own framework first, you're writing a product page, not running a benchmark.
We ran into this while building Docsalot's public docs benchmark. The obvious move: ship a flattering leaderboard, rank our stack high, call ourselves the best. The incentives are clear: benchmarks are free distribution.
That felt wrong.
Software has machine users now. We need measurements that reveal actual machine-readability, not which vendor wrote the test.
Narrow claims are useful claims
This isn't a benchmark for "how good AI agents are" or "whether an agent can use your product end to end."
It tests one thing: can an agent discover your docs, fetch them in machine-readable format, understand the structure, and find enough information to proceed without guessing?
Many "AI foundation model failures" are actually interface failures:
- no machine-readable entry point
- HTML-only docs with no clean markdown path
- navigation and cookie banners mixed into machine-facing content
- no clear prerequisites or auth constraints
- no examples
- no troubleshooting or recovery information
Humans skim and infer. Agents don't.
What we actually score
The public benchmark uses four weighted buckets.
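Roughly, each bucket produces a sub-score and the overall number is a weighted combination of the four. The sketch below uses made-up weights purely for illustration; the real weights live in the published methodology version.

```python
# Illustrative only: these weights are placeholders, not the published ones.
BUCKET_WEIGHTS = {
    "discovery_delivery": 0.30,
    "content_structure": 0.25,
    "task_readiness": 0.25,
    "access_recovery": 0.20,
}

def overall_score(bucket_scores: dict[str, float]) -> float:
    """Combine per-bucket scores (each 0-100) into one weighted total."""
    return sum(BUCKET_WEIGHTS[name] * bucket_scores[name] for name in BUCKET_WEIGHTS)
```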
1. Discovery and delivery
This bucket checks whether the docs expose basic machine-readable entry points and retrievable text:
- does /llms.txt exist
- does /llms-full.txt exist
- do sampled documentation pages have valid .md versions
We sample from llms.txt links when available, otherwise fall back to /getting-started, /guide, /api, /reference.
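A minimal sketch of this bucket's checks, assuming the requests library; the helper names and the crude link extraction are ours for illustration, not the benchmark's actual code.

```python
import re
import requests

FALLBACK_PATHS = ["/getting-started", "/guide", "/api", "/reference"]

def ok(url: str) -> bool:
    """True if the URL returns a 2xx response."""
    try:
        return requests.get(url, timeout=10).ok
    except requests.RequestException:
        return False

def check_discovery(base: str) -> dict:
    """Check machine-readable entry points and .md versions of sampled pages."""
    result = {
        "llms_txt": ok(f"{base}/llms.txt"),
        "llms_full_txt": ok(f"{base}/llms-full.txt"),
        "md_pages": {},
    }
    # Sample links from llms.txt when it exists; otherwise use common doc paths.
    urls = [f"{base}{p}" for p in FALLBACK_PATHS]
    if result["llms_txt"]:
        body = requests.get(f"{base}/llms.txt", timeout=10).text
        linked = re.findall(r"\]\((\S+?)\)", body)  # markdown link targets
        urls = [u if u.startswith("http") else f"{base}{u}" for u in linked[:5]] or urls
    for url in urls:
        result["md_pages"][url] = ok(url.rstrip("/") + ".md")
    return result
```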
2. Content structure
This bucket checks whether llms.txt is actually usable, not merely present:
- does it have an H1 title
- does it include a blockquote description
- does it use meaningful H2 sections
- do list items contain links
- do those links include descriptions
- is there an ## Optional section for lower-priority material
This follows patterns from The Agent-Friendly Documentation Spec (AFDocs) and its GitHub repository. The idea is simple: if a machine-readable index exists, it should help agents prioritize instead of forcing blind exploration.
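As a sketch, here is what those structural checks might look like over a fetched llms.txt body. The field names and regexes are illustrative; AFDocs is the authority on the expected format.

```python
import re

def check_llms_txt_structure(body: str) -> dict:
    """Heuristic structural checks over an llms.txt body (illustrative only)."""
    lines = body.splitlines()
    list_items = [ln for ln in lines if ln.lstrip().startswith("- ")]
    linked_items = [ln for ln in list_items if re.search(r"\[.+?\]\(.+?\)", ln)]
    return {
        "has_h1_title": any(re.match(r"^# \S", ln) for ln in lines),
        "has_blockquote_description": any(ln.startswith("> ") for ln in lines),
        "has_h2_sections": any(re.match(r"^## \S", ln) for ln in lines),
        "list_items_have_links": bool(list_items) and len(linked_items) == len(list_items),
        # A described link looks like "- [Title](url): short description".
        "links_have_descriptions": bool(linked_items) and all(
            re.search(r"\)\s*:\s*\S", ln) for ln in linked_items
        ),
        "has_optional_section": any(re.match(r"^## Optional\b", ln) for ln in lines),
    }
```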
3. Task readiness
This bucket checks whether the docs contain the kinds of details agents repeatedly need to complete real tasks:
- a clear opening description of what the product actually does
- prerequisites and constraints
- integration information
- troubleshooting or error-handling material
- concrete examples or code blocks
Pretty docs often fail here. Sites look polished but still force agents to guess about auth, setup, limits, and errors.
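These signals are necessarily heuristic. Here is a sketch using keyword and code-block detection over a page's markdown, with placeholder keyword lists of our own:

```python
import re

# Placeholder keyword lists; the real benchmark's signals are defined in its methodology.
SIGNALS = {
    "prerequisites": ["prerequisite", "before you begin", "requirements"],
    "auth_constraints": ["api key", "token", "authentication", "rate limit"],
    "integration": ["install", "sdk", "endpoint", "webhook"],
    "troubleshooting": ["troubleshoot", "error", "status code", "retry"],
}

def check_task_readiness(markdown: str) -> dict:
    """Heuristic signals that a doc gives agents enough to proceed without guessing."""
    text = markdown.lower()
    result = {name: any(kw in text for kw in kws) for name, kws in SIGNALS.items()}
    # A clear opening description: some real prose before the first H2 section.
    intro = markdown.split("\n## ", 1)[0]
    result["has_opening_description"] = len(intro.strip()) > 100
    # Concrete examples: fenced code blocks are the strongest signal.
    result["has_code_examples"] = bool(re.search(r"```", markdown))
    return result
```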
4. Access and recovery
This bucket checks whether the content is delivered in a form agents can actually use:
- do markdown pages return valid markdown rather than HTML or error pages
- are responses reasonably fast
- is the main content present in the initial HTML, or does it effectively require client-side JavaScript
- do pages support Accept: text/markdown
- is markdown output clean, or polluted with nav, footer, and consent cruft
Clean matters. Markdown with nav cruft is technically readable but practically noisy.
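A sketch of the delivery checks: fetch a page's markdown variant with an Accept header, time the response, confirm it is markdown rather than HTML, and scan for navigation cruft. The threshold and cruft markers below are assumptions, not the benchmark's actual values.

```python
import time
import requests

# Assumed threshold and cruft markers, for illustration only.
SLOW_SECONDS = 3.0
CRUFT_MARKERS = ["cookie", "accept all", "skip to content", "sign in"]

def check_delivery(md_url: str) -> dict:
    """Check that a page's markdown variant is fast, really markdown, and clean."""
    start = time.monotonic()
    try:
        resp = requests.get(md_url, headers={"Accept": "text/markdown"}, timeout=10)
    except requests.RequestException:
        return {"reachable": False}
    elapsed = time.monotonic() - start
    body = resp.text
    return {
        "reachable": resp.ok,
        "fast_enough": elapsed < SLOW_SECONDS,
        "honors_accept_header": "markdown" in resp.headers.get("Content-Type", ""),
        # HTML leaking through usually means there is no real markdown path.
        "is_markdown_not_html": not body.lstrip().lower().startswith(("<!doctype", "<html")),
        "clean_of_cruft": not any(m in body.lower() for m in CRUFT_MARKERS),
    }
```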
The methodology has to be inspectable
Benchmarks need transparency. We store breakdowns, individual checks, and methodology versions. External scores stay labeled as external.
Leaderboards collapse everything into a single number, but the failure pattern matters more than the score. Did you lose points for a missing llms.txt, dirty markdown, or no examples?
We disable domain-specific adjustments in benchmark mode. Our score widget has host-specific tweaks for user experience, but public benchmarks can't have hidden exceptions.
Benchmark mode is stricter by design.
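For illustration, this is the kind of record worth persisting per run so that failure patterns stay inspectable; the field names are ours, not the stored schema.

```python
from dataclasses import dataclass, field

@dataclass
class BenchmarkRun:
    """One stored benchmark result: enough to see why a score is what it is."""
    domain: str
    methodology_version: str   # which version of the checks produced this
    benchmark_mode: bool       # True means no domain-specific adjustments
    external: bool             # externally submitted scores stay labeled as such
    bucket_scores: dict[str, float] = field(default_factory=dict)
    individual_checks: dict[str, bool] = field(default_factory=dict)
```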
Why vendor neutrality matters
Neutrality isn't about moral purity—it's about diagnostic value. If your stack always wins, teams learn what you sell, not what works.
A credible benchmark needs to be willing to produce uncomfortable outputs:
- another docs stack can beat yours
- your own customers can outrank you
- open-source sites can outperform commercial platforms
- a site with plain design can beat a visually polished one because the machine-facing delivery is better
That is useful.
It creates a feedback loop around operability rather than brand preference.
What this benchmark still misses
The benchmark has limits. We don't run full workflows, validate code samples, or exercise live APIs. It measures docs operability, not software operability.
Still worth doing—docs are often the first interface agents hit. But we need separate benchmarks for CLI, API, auth, and recovery flows.
Seeing agent failures from missing structure makes the broader pattern obvious: this will happen in auth, setup, and recovery too.
Why this matters now
Software has machine users now. Buying, onboarding, support workflows start with agents reading docs.
If that first contact fails, adoption suffers before humans get involved.
We need benchmarks that are narrow, explicit, and neutral. Not because scores matter, but because visible failure modes give teams something to fix.