Methodology
crawlbench is a fixed corpus of pages with known, intentionally introduced defects. Each defect has a label. A tool's score is the share of labelled defects it surfaces. Findings that don't map to any label are counted as false positives and reported separately rather than subtracted from the score.
The corpus
The corpus is a self-hostable static site (one Docker image, one domain) with planted issues distributed across all routes. Every issue has:
- A unique issue-id (e.g. idx.canonical.self-referential.missing)
- A category (one of: crawlability, indexability, on-page, structured-data, internationalization, performance, ai-readiness)
- A severity (critical / warning / notice)
- A human-readable description and a link to the standards reference it violates
- A list of synonym strings used by major tools (so the matcher tolerates vendor naming variations)
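As a rough illustration, a planted issue could be modelled as a record like the one below. This is a sketch, not the corpus's actual schema: the class name, the field names, and every example value other than the issue-id are assumptions made here. Only the list of categories and the three severity levels come from the description above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet


class Severity(Enum):
    CRITICAL = "critical"
    WARNING = "warning"
    NOTICE = "notice"


# Category names as listed in the methodology.
CATEGORIES = frozenset({
    "crawlability", "indexability", "on-page", "structured-data",
    "internationalization", "performance", "ai-readiness",
})


@dataclass(frozen=True)
class PlantedIssue:
    """Hypothetical record for one planted defect (field names are assumptions)."""
    issue_id: str
    category: str
    severity: Severity
    description: str
    reference_url: str
    synonyms: FrozenSet[str]

    def __post_init__(self) -> None:
        # Reject categories outside the documented set.
        if self.category not in CATEGORIES:
            raise ValueError(f"unknown category: {self.category!r}")


# All values except the issue-id are illustrative placeholders.
example = PlantedIssue(
    issue_id="idx.canonical.self-referential.missing",
    category="indexability",
    severity=Severity.WARNING,
    description="Page lacks a self-referential canonical link.",
    reference_url="https://example.com/standards/canonical",
    synonyms=frozenset({"missing canonical", "canonical tag missing"}),
)
```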
Versioning
The corpus is versioned (v0.1, v0.2, ...). Once a version is
published, its planted issues are frozen. Older versions stay queryable so trends are
visible across runs. A private holdout set rotates each release to
discourage tools from training detectors against the public corpus.
Scoring
For a single tool against a single corpus version:
- Detected: the tool emits a finding whose normalised name matches the issue's synonym set, and whose URL pattern matches. A single finding credits at most one planted issue, so a tool that emits one generic signal on a page hosting several related defects is credited once, not once per defect.
- Missed: planted issue with no matching finding.
- False positive: a finding that doesn't map to any planted issue and isn't on the documented allowlist of unrelated-but-legitimate signals.
- Ignored: a finding that maps to no planted issue but does match the allowlist (e.g. a site-wide security-header warning, a spelling check, an indexability status row). Counted as neither detected nor a false positive — the tool is right to raise it, we just didn't seed it as a scored defect.
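The four outcomes above can be sketched as a small classifier. This is a minimal illustration under stated assumptions, not the benchmark's real matcher: the `Planted` and `Finding` types, the `normalise` rule, the use of a regular expression as the URL pattern, and the handling of repeat findings (no extra credit, not a false positive) are all hypothetical choices made for the example.

```python
import re
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Set


@dataclass(frozen=True)
class Planted:
    issue_id: str
    synonyms: FrozenSet[str]  # already-normalised vendor names
    url_pattern: str          # assumed to be a regex over the full URL


@dataclass(frozen=True)
class Finding:
    name: str
    url: str


def normalise(name: str) -> str:
    """Assumed normalisation: lowercase, collapse non-alphanumerics to spaces."""
    return re.sub(r"[^a-z0-9]+", " ", name.lower()).strip()


def classify(planted: List[Planted], findings: List[Finding],
             allowlist: Set[str]) -> Dict[str, int]:
    """Count detected / missed / false_positive / ignored for one tool run."""
    credited: Set[str] = set()
    false_positive = 0
    ignored = 0
    for f in findings:
        name = normalise(f.name)
        matches = [p for p in planted
                   if name in p.synonyms and re.fullmatch(p.url_pattern, f.url)]
        if matches:
            # A finding credits at most one planted issue: the first uncredited one.
            # If all matches are already credited, the finding is a repeat and
            # (by assumption) earns nothing and is not a false positive.
            for p in matches:
                if p.issue_id not in credited:
                    credited.add(p.issue_id)
                    break
        elif name in allowlist:
            ignored += 1
        else:
            false_positive += 1
    return {
        "detected": len(credited),
        "missed": len(planted) - len(credited),
        "false_positive": false_positive,
        "ignored": ignored,
    }


# Two related defects on one page, one generic finding: credited once.
page = r"https://bench\.example/p1"
planted = [
    Planted("a", frozenset({"canonical issue"}), page),
    Planted("b", frozenset({"canonical issue", "canonical missing"}), page),
]
findings = [
    Finding("Canonical Issue", "https://bench.example/p1"),
    Finding("Missing HSTS header", "https://bench.example/"),
    Finding("Mystery warning", "https://bench.example/p2"),
]
result = classify(planted, findings, allowlist={"missing hsts header"})
```

With these inputs the single generic finding credits only one of the two planted issues, the security-header warning is absorbed by the allowlist, and the unmatched warning counts as a false positive.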
The headline number is detected / total_planted, expressed as a percent.
Per-category scores use the same formula scoped to that category. False positives are
reported separately rather than netted against the score, so a tool can't raise its
score by staying silent on edge cases.
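The headline and per-category numbers can be computed along these lines. The function name and the input shape (pairs of issue-id and category, plus a set of detected ids) are assumptions for this sketch. False positives are deliberately not an input, since they are reported separately.

```python
from collections import Counter
from typing import Dict, Iterable, Set, Tuple


def detection_rates(planted: Iterable[Tuple[str, str]],
                    detected_ids: Set[str]) -> Dict[str, float]:
    """Return percent detected overall and per category.

    planted: (issue_id, category) pairs for one corpus version.
    """
    total: Counter = Counter()
    hit: Counter = Counter()
    for issue_id, category in planted:
        # Each issue counts once toward the overall figure and once toward its category.
        for key in ("overall", category):
            total[key] += 1
            if issue_id in detected_ids:
                hit[key] += 1
    return {key: 100.0 * hit[key] / total[key] for key in total}


# Hypothetical four-issue corpus; three issues detected.
rates = detection_rates(
    [("a", "crawlability"), ("b", "crawlability"),
     ("c", "on-page"), ("d", "on-page")],
    detected_ids={"a", "c", "d"},
)
# rates == {"overall": 75.0, "crawlability": 50.0, "on-page": 100.0}
```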
Every run also reports the ignored count alongside false positives. Read the two together: a low or zero false-positive number on its own can be misleading, because the corpus is broken everywhere and a thorough tool emits many legitimate site-wide signals that the allowlist absorbs. A result of 0 false positives with several thousand ignored means the tool was both accurate and noisy on this corpus — the ignored count is what makes the zero honest. The allowlist is scoped per tool, so one tool's calibration entries never suppress another tool's false positives.
Detection capability is plan-gated for many tools: a free tier or entry plan may disable JavaScript rendering, custom checks, or crawl scope that a higher tier enables. Every score therefore records the subscription plan and tool version it was produced on, so a result is reproducible and the same tool can be re-run on a higher plan to show the delta. A score is only ever comparable within the plan it was measured on.
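One way to keep results tied to the plan they were measured on is to store the plan and tool version with every score, and to compute a delta only between runs of the same tool on the same corpus version. The record layout and the `plan_delta` helper below are hypothetical, and the figures are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunRecord:
    """Hypothetical record of one measured run (field names are assumptions)."""
    tool: str
    tool_version: str
    plan: str
    corpus_version: str
    detected: int
    total_planted: int

    @property
    def score(self) -> float:
        # Headline number: detected / total_planted, as a percent.
        return 100.0 * self.detected / self.total_planted


def plan_delta(lower: RunRecord, higher: RunRecord) -> float:
    """Percentage-point change when the same tool is re-run on another plan."""
    if (lower.tool, lower.corpus_version) != (higher.tool, higher.corpus_version):
        raise ValueError("delta requires the same tool and corpus version")
    return higher.score - lower.score


# Illustrative numbers only, not measured results.
free = RunRecord("example-crawler", "1.0", "free", "v0.1", 40, 100)
pro = RunRecord("example-crawler", "1.0", "pro", "v0.1", 65, 100)
delta = plan_delta(free, pro)  # 25.0 percentage points
```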
Claimed vs measured
Separately from the measured score, we publish a capability matrix of what each vendor claims its tool detects, by category and the plan that unlocks it, sourced from marketing and pricing pages. It is kept strictly apart from the detection rate: a claim is what the vendor says, a score is what we measured. Cells with no explicit source are marked unverified. The interesting signal is the gap between the two — a tool that claims a category but misses it in the corpus.
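The gap between claim and measurement could be surfaced with a comparison like the following. It is a sketch under assumptions: claims are modelled as `True` (claimed, with a source), `False` (not claimed), or `None` (unverified), and a category counts as "missed" when its measured rate is at or below a threshold that defaults to zero. The two inputs stay separate structures, mirroring how the matrix is kept apart from the detection rate.

```python
from typing import Dict, List, Optional


def claim_gaps(claims: Dict[str, Optional[bool]],
               measured: Dict[str, float],
               threshold: float = 0.0) -> List[str]:
    """Categories a vendor claims (with a source) but the run missed.

    claims: category -> True / False / None (None means unverified).
    measured: category -> percent detected; absent categories count as 0.
    """
    return sorted(
        category
        for category, claimed in claims.items()
        # Unverified (None) and unclaimed (False) cells are never flagged.
        if claimed is True and measured.get(category, 0.0) <= threshold
    )


# Illustrative inputs, not real vendor data.
claims = {"structured-data": True, "ai-readiness": True,
          "performance": None, "on-page": False}
measured = {"structured-data": 80.0, "ai-readiness": 0.0, "performance": 0.0}
gaps = claim_gaps(claims, measured)  # ["ai-readiness"]
```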
Tool inclusion
A tool is eligible if it (a) crawls a site, (b) emits a list of issues, and (c) is
operable by a non-employee — i.e. has a free trial, public API, or open source build.
Submissions are accepted via pull request to the tools/ directory and
must include a runnable invocation. The benchmark operators run the actual tests; we
don't accept vendor-submitted scores.
What we don't measure
- Crawl speed or scale
- UX, reporting, or integrations
- Backlink intelligence, keyword data, or anything outside the on-site audit surface
- Real-traffic signals (log file analysis is a separate axis we may add later)
Open methodology questions
These are decisions still under review. Comments and proposals welcome via GitHub issues.
- Whether to weight issues by severity (e.g. critical vs notice) or stay binary.
- Whether to publish a separate AI-readiness leaderboard given how few classic crawlers test it.
- Whether vendor self-runs should be accepted alongside operator runs, with a label.