Benchmarks

This page explains how Unbrowse benchmarks are derived, how to read the evidence rows they produce, and why the executor never writes its own pass/fail verdict. For the methodology in the codebase, see docs/benchmarks.md in the open client. For the published paper benchmark (3.6× speedup vs Playwright across 94 live domains), see the benchmark deep-dive.

Headline results · reproducible · gated

What Unbrowse does, measured

Anti-bot retrieval

9/9 vs naive 0/9

Real content retrieved from every post where a naive HTTP client gets a 403, measured on a JS-challenge-gated social platform and ground-truthed.

Latency & cost

3.6× / 5.4× / 40×

Mean / median speedup and 40× fewer tokens across 94 live domains; ~30× faster, ~90× cheaper than driving a browser.

Self-improving by reuse

−80.7%

21.1s cold → 4.1s warm as the route cache fills, then plateaus — the physical limit. 20 iterations.

Execute, don't guess — at model scale

The same small on-device model (1.5B), without tools vs with tools: the architecture is the capability, not the raw weights.

68% → 100%

code-correctness (in-dist.)

0% → 95%

knowledge not in weights

50% → 92%

hard reasoning families

63% → 93%

apply a retrieved skill

The shape of a run

A run is three layers:

  • Corpus: a list of intent | url probes. Each row is a real agent task. Lives at harness/probes/corpus.txt (default) or scripts/corpus/benchmark-baseline.txt (full 323-probe set).
  • Executor: scripts/bench-run.ts drives unbrowse resolve against each probe and dumps raw per-probe artifacts. Emits no heuristic verdict.
  • Judgment: the calling LLM agent reads the evidence rows + raw artifacts in-thread and renders a verdict per the rubric below.

Why the executor doesn't emit verdicts

A 200 response can be a captcha page. An empty array can be a real "no results." A structured shortlist can be the wrong template firing. None of those distinctions survive a regex. The truth-claim — "is the agent's intent satisfied?" — stays with the party that has the context, which is the calling LLM.

Earlier benches tried to short-circuit judgment with classifier scripts. Two failure modes recurred until the principle became binding:

  1. HTTP-shape lies. status_code === 200 → PASS looked clean and was wrong all the time. Cloudflare-Turnstile interstitials return 200. Captcha pages return 200. Empty SSR pages return 200. The product reported success and the agent got nothing useful.
  2. Per-site heuristic creep. Rules such as if (domain === "some-site.com") op SearchTimeline +220 shaped early rankers. They generalised to nothing: the 11th site shipped wrong, no one noticed, and the bench reported green because the heuristic that scored the call was the same heuristic that scored the verdict.

The executor that exists now is deliberately incapable of writing a verdict column. It only collects evidence.
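The first failure mode can be shown with a small sketch. The response shape and the naiveVerdict function below are hypothetical, written only to illustrate why a status-code check cannot separate these cases; they are not taken from the Unbrowse codebase.

```typescript
// Hypothetical response shape, for illustration only.
interface FakeResponse {
  statusCode: number;
  body: string;
}

// The kind of classifier this section argues against: HTTP shape in, verdict out.
function naiveVerdict(res: FakeResponse): "PASS" | "FAIL" {
  return res.statusCode === 200 ? "PASS" : "FAIL";
}

// Three made-up responses. Only the first one would satisfy an agent's intent.
const responses: Record<string, FakeResponse> = {
  realContent: { statusCode: 200, body: '{"items":[{"title":"Release notes"}]}' },
  captchaPage: { statusCode: 200, body: "<title>Just a moment...</title>" },
  emptySsrPage: { statusCode: 200, body: '<div id="root"></div>' },
};

// The status code is identical in all three, so the verdicts are too.
const verdicts = Object.values(responses).map(naiveVerdict);
console.log(verdicts);
```

All three responses receive the same verdict, which is the "HTTP-shape lie" described above: the information that distinguishes them is in the body, not in the status line.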

Evidence the executor records

For every probe, results.jsonl carries one row with these fields. None of them are verdicts:

  • goal, url, auth, lane: the probe
  • source: marketplace, cache, live-capture, dom-fallback, direct-fetch, or empty
  • trace_success, trace_skill_id: top-level trace from the CLI
  • has_available_operations, n_operations: shortlist size the agent would see (two-tool-call contract)
  • error_code, error_message: what the CLI said when it failed
  • captured_html_bytes, captured_text_bytes, captured_title: did the browser actually render something, or are we looking at a captcha shell?
  • captured_api_calls: how many XHR/fetch calls fired
  • filter_rejections: a {reason: count} map showing where the ranker dropped candidates
  • browser_block_signals: vendor signals (vendor:cloudflare, challenge_title, no_html_many_apis, sparse_capture_mostly_noise)
  • capture_diagnostic: one of no_endpoints_extracted, all_endpoints_filtered_by_noise_rules, or endpoints_scored_below_relevance_threshold
  • auth_recommended: true if the resolve step thinks the user needs to authenticate
  • cli_exit, cli_timeout: process exit details, which distinguish "browser hung" from "extraction empty"
  • response_text_excerpt: first 400 chars of the response so the agent can confirm on-topic content
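As a reading aid, the fields above could be modelled roughly as follows. The field names come from the list; the TypeScript types, the nullability, and the parseResultsJsonl helper are assumptions made for this sketch, and the real rows may differ.

```typescript
// Field names follow the evidence list; every type here is an assumption.
interface EvidenceRow {
  goal: string;
  url: string;
  auth: unknown; // the probe's auth setting; its shape is not specified on this page
  lane: string;
  source: string; // marketplace, cache, live-capture, dom-fallback, direct-fetch, or ""
  trace_success: boolean;
  trace_skill_id: string | null;
  has_available_operations: boolean;
  n_operations: number;
  error_code: string | null;
  error_message: string | null;
  captured_html_bytes: number;
  captured_text_bytes: number;
  captured_title: string | null;
  captured_api_calls: number;
  filter_rejections: Record<string, number>; // reason -> count
  browser_block_signals: string[];
  capture_diagnostic: string | null;
  auth_recommended: boolean;
  cli_exit: number | null;
  cli_timeout: boolean;
  response_text_excerpt: string;
}

// Hypothetical helper: parse results.jsonl text, one JSON object per non-empty line.
// Rows are typed as Partial because this sketch does not validate them.
function parseResultsJsonl(text: string): Partial<EvidenceRow>[] {
  return text
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => JSON.parse(line) as Partial<EvidenceRow>);
}

// Two made-up rows, serialised the way a JSONL file would hold them.
const sample = [
  JSON.stringify({ goal: "search packages", url: "https://example.com/search", source: "cache", n_operations: 3 }),
  JSON.stringify({ goal: "read thread", url: "https://example.org/t/1", source: "", browser_block_signals: ["challenge_title"] }),
].join("\n");

const rows = parseResultsJsonl(sample);
console.log(rows.length, rows[0]?.source, rows[1]?.browser_block_signals);
```

Note that nothing in the row type is a verdict column, which matches the constraint stated above.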

Classification rubric (applied in-thread)

First match wins:

| Bucket | Trigger | Counted? |
| --- | --- | --- |
| ANTIBOT_BLOCK | browser_block_signals contains vendor:*, challenge_title, or no_html_many_apis; OR capture_diagnostic in {no_endpoints_extracted, all_endpoints_filtered_by_noise_rules} | ✗ Fail (product capability gap: reliable access here is exactly the wedge we should differentiate on) |
| AUTH_GATED | error_code === "auth_required" or auth_recommended === true | Excluded from coverage (user credential gap, not product) |
| SKIPPED_NO_FRESH_COOKIES | Probe needs auth AND the existing browser session has no fresh cookie for the domain | Excluded from coverage (skipping is honest; 401ing is noise) |
| PASS | has_available_operations === true && n_operations > 0, OR trace_success === true AND source in {dom-fallback, direct-fetch, browse-session, live-capture} | ✓ Pass |
| SPARSE_REVIEW | browser_block_signals contains only sparse_capture_mostly_noise (no vendor) | Agent reads the .out file and judges in-thread |
| PRODUCT_FAIL | Everything else | ✗ Fail |
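The first-match-wins ordering of the rubric can be sketched as a chain of early returns. This is an illustration of the table's triggers, not code from the executor (which emits no verdicts), and the agent still has to read the .out file for SPARSE_REVIEW rows. The field types are assumptions, and needsAuth and hasFreshCookie are hypothetical inputs, since the evidence row lists no field for cookie freshness.

```typescript
type Bucket =
  | "ANTIBOT_BLOCK" | "AUTH_GATED" | "SKIPPED_NO_FRESH_COOKIES"
  | "PASS" | "SPARSE_REVIEW" | "PRODUCT_FAIL";

// Only the fields the rubric reads. Types are assumptions.
interface RubricInput {
  browser_block_signals: string[];
  capture_diagnostic: string | null;
  error_code: string | null;
  auth_recommended: boolean;
  has_available_operations: boolean;
  n_operations: number;
  trace_success: boolean;
  source: string;
  needsAuth?: boolean;      // hypothetical: the probe requires a logged-in session
  hasFreshCookie?: boolean; // hypothetical: the browser session has a fresh cookie
}

const BLOCKING_DIAGNOSTICS = new Set(["no_endpoints_extracted", "all_endpoints_filtered_by_noise_rules"]);
const PASSING_SOURCES = new Set(["dom-fallback", "direct-fetch", "browse-session", "live-capture"]);

// Rules are checked in table order; the first one that matches decides the bucket.
function classify(row: RubricInput): Bucket {
  const signals = row.browser_block_signals;
  const hardBlock = signals.some(
    (s) => s.startsWith("vendor:") || s === "challenge_title" || s === "no_html_many_apis",
  );
  const blockingDiagnostic =
    row.capture_diagnostic !== null && BLOCKING_DIAGNOSTICS.has(row.capture_diagnostic);
  if (hardBlock || blockingDiagnostic) return "ANTIBOT_BLOCK";
  if (row.error_code === "auth_required" || row.auth_recommended) return "AUTH_GATED";
  if (row.needsAuth === true && row.hasFreshCookie === false) return "SKIPPED_NO_FRESH_COOKIES";
  if (row.has_available_operations && row.n_operations > 0) return "PASS";
  if (row.trace_success && PASSING_SOURCES.has(row.source)) return "PASS";
  if (signals.length > 0 && signals.every((s) => s === "sparse_capture_mostly_noise")) {
    return "SPARSE_REVIEW";
  }
  return "PRODUCT_FAIL";
}

// A neutral made-up row that matches no rule, used as a starting point.
const base: RubricInput = {
  browser_block_signals: [], capture_diagnostic: null, error_code: null,
  auth_recommended: false, has_available_operations: false, n_operations: 0,
  trace_success: false, source: "",
};

console.log(classify({ ...base, has_available_operations: true, n_operations: 3 })); // PASS
console.log(classify({
  ...base, has_available_operations: true, n_operations: 3,
  browser_block_signals: ["vendor:cloudflare"],
})); // ANTIBOT_BLOCK
```

The second call shows why the ordering matters: a row with a non-empty shortlist is still bucketed as ANTIBOT_BLOCK when a vendor signal is present, because that rule is checked first.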

The coverage metric

coverage = PASS / (PASS + PRODUCT_FAIL + SPARSE_REVIEW + ANTIBOT_BLOCK)

AUTH_GATED and SKIPPED_NO_FRESH_COOKIES are excluded because the agent cannot proceed without user-supplied credentials — that's a setup gap, not a runtime product gap.

ANTIBOT_BLOCK counts toward the denominator deliberately. "We have 100% coverage except for the blocked sites" is dishonest when the blocked sites are exactly where Unbrowse needs to differentiate (reliable access via its existing browser session and fallback paths). Counting them as a failure mode makes the bench tell the truth.
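The formula can be written out as a short function. The BucketCounts type and the NaN convention for an empty denominator are assumptions made for this sketch; the example counts are the ones reported for the latest run on this page, with the buckets it does not list set to zero.

```typescript
interface BucketCounts {
  PASS: number;
  PRODUCT_FAIL: number;
  SPARSE_REVIEW: number;
  ANTIBOT_BLOCK: number;
  AUTH_GATED: number;               // excluded from coverage
  SKIPPED_NO_FRESH_COOKIES: number; // excluded from coverage
}

// coverage = PASS / (PASS + PRODUCT_FAIL + SPARSE_REVIEW + ANTIBOT_BLOCK)
function coverage(c: BucketCounts): number {
  const denominator = c.PASS + c.PRODUCT_FAIL + c.SPARSE_REVIEW + c.ANTIBOT_BLOCK;
  // Returning NaN when no probe counts is a choice made for this sketch.
  return denominator === 0 ? NaN : c.PASS / denominator;
}

const latestRun: BucketCounts = {
  PASS: 9,
  PRODUCT_FAIL: 5,
  SPARSE_REVIEW: 0,
  ANTIBOT_BLOCK: 4,
  AUTH_GATED: 1,
  SKIPPED_NO_FRESH_COOKIES: 0,
};

console.log(coverage(latestRun)); // 0.5
```

Dropping ANTIBOT_BLOCK from the denominator would turn the same run into 9/14, roughly 64%, which is the flattering number the paragraph above rules out.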

Running a benchmark

# Default corpus, 3 workers, 45s per probe
bun scripts/bench-run.ts

# Pick a specific corpus + larger budget for cold-cache sites
bun scripts/bench-run.ts --corpus harness/probes/corpus.txt --timeout 90 --parallel 4

# Re-extract evidence from an existing run (after extractor fixes)
# without paying the CLI wall-clock again
bun scripts/bench-reextract.ts .bench-local/run-<timestamp>

Output lands at:

  • .bench-local/run-<ts>/results.jsonl — one evidence row per probe
  • .bench-local/run-<ts>/<idx>_<slug>.out — full raw CLI stdout+stderr
  • .bench-local/run-<ts>/index.txt — probe id → URL → exit code
  • .bench-local/run-<ts>/manifest.json — run metadata (corpus, parallel, timing)
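A run directory can then be inspected with a few lines of script. The sketch below is a hypothetical helper, not part of the repository: it assumes only that results.jsonl holds one JSON object per line with a source field, as described above, and it counts rows per source without judging them. The read parameter is an assumption added so the function can be exercised without touching disk.

```typescript
import { readFileSync } from "node:fs";
import { join } from "node:path";

type ReadText = (path: string) => string;

// Count the rows of <runDir>/results.jsonl by their `source` field.
// This is a descriptive tally of evidence, not a pass/fail verdict.
function tallySources(
  runDir: string,
  read: ReadText = (p) => readFileSync(p, "utf8"),
): Map<string, number> {
  const tally = new Map<string, number>();
  for (const line of read(join(runDir, "results.jsonl")).split("\n")) {
    if (line.trim() === "") continue; // tolerate blank lines and a trailing newline
    const row = JSON.parse(line) as { source?: string };
    const key = row.source && row.source.length > 0 ? row.source : "(empty)";
    tally.set(key, (tally.get(key) ?? 0) + 1);
  }
  return tally;
}

// Usage against a real run directory (path pattern as listed above):
//   tallySources(".bench-local/run-<timestamp>")
```

A tally like this is useful for spotting, for example, a run where most rows have an empty source, before reading the individual .out files.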

Latest agent verdict

The most recent run sits at 50% coverage on a 19-probe cross-section of the corpus (developer registries, news aggregators, code hosts, social platforms, search, travel, public datasets). That is 9 PASS out of 18 counted probes, with the one AUTH_GATED probe excluded:

  • PASS = 9 — developer registries, news aggregators, code-search, a social home timeline, a travel search, and a public dataset (one via search fallback)
  • ANTIBOT_BLOCK = 4 — reCAPTCHA-gated community threads, an auth-walled social search, and a Turnstile-gated probe page
  • PRODUCT_FAIL = 5 — a profile timeline, a DeFi app, a campus dataset, a biomedical index, and a reviews site (all hang at "Still working. Searching cached routes…")
  • AUTH_GATED = 1 (excluded) — a logged-in professional-network feed

The 5 PRODUCT_FAIL rows share a signature — the in-process app hangs at "Still working" and never emits the top-level JSON before the 45s budget elapses. That's the highest-priority regression and the next thing worth a focused fix.

Why this matters

Every probe in the corpus is a contract: it asserts what an agent should be able to do. The coverage number is the percentage of those contracts the product currently honours. The number can go down. When it does, the per-probe row tells the agent (and the reader) exactly why — not a sanitised pass/fail flag, but the raw filter rejections, the browser block signals, the captured byte counts, the ranker scoring evidence. Reading those rows is how the product gets better.