Methodology

crawlbench is a fixed corpus of pages with known, intentionally introduced defects. Each defect has a label. A tool's score is the share of labelled defects it surfaces. Findings that don't map to any label are counted separately rather than subtracted from the score.

The corpus

The corpus is a self-hostable static site (one Docker image, one domain) with planted issues distributed across all routes. Every issue has:

Versioning

The corpus is versioned (v0.1, v0.2, ...). Once a version is published, its planted issues are frozen. Older versions stay queryable so trends are visible across runs. A private holdout set rotates each release to discourage tools from training detectors against the public corpus.

Scoring

For a single tool against a single corpus version:

The headline number is detected / total_planted, expressed as a percentage. Per-category scores use the same formula scoped to that category. False positives are reported separately rather than netted against the score, so a tool can't raise its score simply by staying silent on edge cases.
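The headline and per-category formulas can be sketched as follows. This is a minimal illustration, not the benchmark's actual scorer: the function name, the issue ids, and the category names ("links", "metadata") are hypothetical, and it assumes each planted issue carries exactly one category.

```python
from collections import Counter


def detection_rates(planted, detected_ids):
    """Return (overall_percent, per_category_percent).

    planted:      dict mapping planted issue id -> category (assumed shape)
    detected_ids: set of planted issue ids the tool surfaced
    """
    if not planted:
        raise ValueError("corpus has no planted issues")
    total = Counter(planted.values())
    hit = Counter(cat for issue_id, cat in planted.items() if issue_id in detected_ids)
    # Headline number: detected / total_planted, as a percentage.
    overall = 100.0 * sum(hit.values()) / len(planted)
    # Same formula, scoped to each category.
    per_category = {cat: 100.0 * hit[cat] / n for cat, n in total.items()}
    return overall, per_category


planted = {"p1": "links", "p2": "links", "p3": "metadata", "p4": "metadata"}
overall, by_category = detection_rates(planted, {"p1", "p2", "p3"})
print(overall)      # 75.0
print(by_category)  # {'links': 100.0, 'metadata': 50.0}
```

False positives do not appear anywhere in this calculation, which matches the rule that they are reported separately.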

Every run also reports the ignored count (findings absorbed by the allowlist) alongside false positives. Read the two together: a low or zero false-positive number on its own can be misleading, because the corpus is broken everywhere and a thorough tool emits many legitimate site-wide signals that the allowlist absorbs. A result of 0 false positives with several thousand ignored means the tool was both accurate and noisy on this corpus; the ignored count is what makes the zero honest. The allowlist is scoped per tool, so one tool's calibration entries never suppress another tool's false positives.
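One way to picture the three buckets is the sketch below. It rests on assumptions the text does not spell out: findings and planted labels are modelled as hypothetical (route, rule) pairs, and the allowlist is a dict from tool name to a set of rule names. The real matching logic may differ.

```python
def classify_findings(findings, planted, allowlist, tool):
    """Count detected, ignored, and false-positive findings for one tool.

    findings:  iterable of (route, rule) pairs emitted by the tool (assumed shape)
    planted:   set of (route, rule) pairs for labelled defects (assumed shape)
    allowlist: dict mapping tool name -> set of rule names to absorb
    """
    tool_allow = allowlist.get(tool, set())  # per-tool scope: other tools' entries are never consulted
    matched = set()
    ignored = 0
    false_positives = 0
    for route, rule in findings:
        if (route, rule) in planted:
            matched.add((route, rule))  # duplicates of one label count once
        elif rule in tool_allow:
            ignored += 1
        else:
            false_positives += 1
    return {"detected": len(matched), "ignored": ignored, "false_positives": false_positives}


planted = {("/a", "broken-link")}
allowlist = {"tool_x": {"missing-hsts"}}
findings = [("/a", "broken-link"), ("/a", "missing-hsts"), ("/b", "missing-hsts"), ("/b", "made-up")]

print(classify_findings(findings, planted, allowlist, "tool_x"))
# {'detected': 1, 'ignored': 2, 'false_positives': 1}
print(classify_findings(findings, planted, allowlist, "tool_y"))
# {'detected': 1, 'ignored': 0, 'false_positives': 3}
```

The second call shows the per-tool scoping: the same findings from a tool with no allowlist entries all land in false positives.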

Detection capability is plan-gated for many tools: a free tier or entry plan may disable JavaScript rendering, custom checks, or crawl scope that a higher tier enables. Every score therefore records the subscription plan and tool version it was produced on, so a result is reproducible and the same tool can be re-run on a higher plan to show the delta. A score is only ever comparable within the plan it was measured on.
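A run record that carries the plan and tool version might look like the sketch below. The field names, the RunRecord class, and the plan_delta helper are all hypothetical; the only points taken from the text are that plan and tool version are stored with every score and that a delta is meaningful for the same tool re-run on a different plan.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunRecord:
    tool: str
    tool_version: str
    plan: str            # subscription plan the run was produced on
    corpus_version: str
    detected: int
    total_planted: int

    @property
    def score_percent(self) -> float:
        return 100.0 * self.detected / self.total_planted


def plan_delta(base: RunRecord, upgraded: RunRecord) -> float:
    """Percentage-point change when the same tool is re-run on another plan."""
    if base.tool != upgraded.tool:
        raise ValueError("plan deltas are only defined for the same tool")
    if base.corpus_version != upgraded.corpus_version:
        raise ValueError("both runs must use the same corpus version")
    return upgraded.score_percent - base.score_percent


free = RunRecord("tool_x", "1.0", "free", "v0.1", detected=40, total_planted=100)
pro = RunRecord("tool_x", "1.0", "pro", "v0.1", detected=65, total_planted=100)
print(plan_delta(free, pro))  # 25.0
```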

Claimed vs measured

Separately from the measured score, we publish a capability matrix of what each vendor claims its tool detects, by category and by the plan that unlocks it, sourced from marketing and pricing pages. It is kept strictly apart from the detection rate: a claim is what the vendor says, a score is what we measured. Cells with no explicit source are marked unverified. The interesting signal is the gap between the two: a tool that claims a category but misses it in the corpus.
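The gap between claims and measurements can be computed with a simple join, sketched here under stated assumptions: the claim matrix is modelled as a hypothetical dict of category to {"claimed", "source"}, and "misses it" is read as a measured detection rate of exactly 0%. The text does not define a threshold, so that reading is an assumption.

```python
def claim_gaps(claims, measured_percent):
    """List categories a vendor claims but the tool missed in the corpus.

    claims:           dict category -> {"claimed": bool, "source": str or None}
    measured_percent: dict category -> measured detection rate in percent
    Returns (category, status) pairs; status is "unverified" when the
    claim has no explicit source, otherwise "claimed".
    """
    gaps = []
    for category, claim in claims.items():
        if not claim["claimed"]:
            continue
        # Assumption: a category with no measurement is treated as missed.
        if measured_percent.get(category, 0.0) == 0.0:
            status = "claimed" if claim.get("source") else "unverified"
            gaps.append((category, status))
    return gaps


claims = {
    "links": {"claimed": True, "source": "pricing page"},
    "metadata": {"claimed": True, "source": None},
    "a11y": {"claimed": False, "source": None},
}
measured = {"links": 80.0, "metadata": 0.0, "a11y": 0.0}
print(claim_gaps(claims, measured))  # [('metadata', 'unverified')]
```

The claim data and the measured scores stay in separate structures and are only joined at report time, mirroring the rule that the matrix is kept apart from the detection rate.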

Tool inclusion

A tool is eligible if it (a) crawls a site, (b) emits a list of issues, and (c) is operable by a non-employee, i.e. has a free trial, a public API, or an open-source build. Submissions are accepted via pull request to the tools/ directory and must include a runnable invocation. The benchmark operators run the actual tests; we don't accept vendor-submitted scores.

What we don't measure

Open methodology questions

These are decisions still under review. Comments and proposals welcome via GitHub issues.