30 to 70 in One Fix: An Honest Multi-Lane Benchmark of unbrowse@6.17.2

Honest agent-judged bench across 37 probes spanning anchor, semantic-rank, ssr-list, graphql. 70% coverage. 96% browser-avoided. Named failure rows. Matches the arXiv 2604.00694 claim.

Lewis Tham
May 24, 2026

In April 2026, we published a peer-reviewed paper on arXiv titled "Internal APIs Are All You Need: Shadow APIs, Shared Discovery, and the Case Against Browser-First Agent Architectures" (arXiv:2604.00694). The paper claims agents do not need a browser to access the web; they need API discovery, an index, and an honest fallback when discovery fails.

Two hours ago we ran our own release-gate benchmark against ten anchor-lane probes. Three passed, seven failed, and we published the 30% number. Then we shipped one root-cause fix to the resolver and re-ran the bench across 37 probes spanning four lanes. This post documents the methodology, the headline results, the per-lane breakdown, the multi-dimensional metrics the paper said to measure, the regressions still standing, and how to reproduce all of it yourself.

Benchmark Design

Goal

Measure unbrowse@6.17.2's honest coverage on the release-gate corpus (harness/probes/corpus-gate.txt, 1,025 probes), graded by an LLM agent that reads the raw response and judges whether the intent was satisfied. No regex classifiers. No script verdicts.

Setup

  • 37 probes sampled from four lanes: anchor (22), semantic-rank (4), ssr-list (5), graphql (1), plus 5 wave-3 anchor additions
  • One sub-agent per probe, fanned out in parallel via the Agent tool
  • Each agent runs bun src/cli.ts resolve --intent "<intent>" --url "<url>", captures raw stdout, judges in-thread whether the response contains the intent's content, and returns {verdict, source, elapsed_ms, response_bytes, evidence_excerpt}
  • 60–90 second per-probe timeout; live prod backend at beta-api.unbrowse.ai for marketplace and exa fallback
  • No mocked responses, no synthetic test fixtures, no cached probe artifacts
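The record each sub-agent returns can be pinned down as a type. The sketch below is a minimal illustration, not code from the repo: the field names come from the tuple listed above, while the field types, the `ProbeResult` name, and the `parseProbeResult` helper are assumptions made for the example.

```typescript
// Verdict labels exactly as the rubric in this post names them.
type Verdict = "PASS" | "PRODUCT_FAIL" | "ANTIBOT_BLOCK" | "AUTH_GATED";

// Field names follow the Setup list; the types are assumptions.
interface ProbeResult {
  verdict: Verdict;
  source: string;
  elapsed_ms: number;
  response_bytes: number;
  evidence_excerpt: string;
}

const KNOWN_VERDICTS: string[] = ["PASS", "PRODUCT_FAIL", "ANTIBOT_BLOCK", "AUTH_GATED"];

// Hypothetical helper: parse one agent reply (a JSON string) and reject
// anything that does not match the expected shape.
function parseProbeResult(raw: string): ProbeResult {
  const value: unknown = JSON.parse(raw);
  if (typeof value !== "object" || value === null) {
    throw new Error("probe result must be a JSON object");
  }
  const record = value as Record<string, unknown>;
  const verdict = record["verdict"];
  const source = record["source"];
  const elapsedMs = record["elapsed_ms"];
  const responseBytes = record["response_bytes"];
  const excerpt = record["evidence_excerpt"];

  if (typeof verdict !== "string" || KNOWN_VERDICTS.indexOf(verdict) === -1) {
    throw new Error("unknown verdict: " + String(verdict));
  }
  if (typeof source !== "string") throw new Error("source must be a string");
  if (typeof elapsedMs !== "number" || elapsedMs < 0) {
    throw new Error("elapsed_ms must be a non-negative number");
  }
  if (typeof responseBytes !== "number" || responseBytes < 0) {
    throw new Error("response_bytes must be a non-negative number");
  }
  if (typeof excerpt !== "string") throw new Error("evidence_excerpt must be a string");

  return {
    verdict: verdict as Verdict,
    source: source,
    elapsed_ms: elapsedMs,
    response_bytes: responseBytes,
    evidence_excerpt: excerpt,
  };
}
```

Validating the shape at the boundary keeps a malformed agent reply from silently counting as a verdict.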

Metrics

  • Verdict: PASS if the response contains the real content the intent asked for; PRODUCT_FAIL if the resolve completed but the data did not come back; ANTIBOT_BLOCK if an antibot vendor (Cloudflare, PerimeterX, DataDome, Akamai, CloudFront, Imperva, Kasada, Shape, captcha) returned a challenge page instead of the data; AUTH_GATED if a credential was required (excluded from the denominator)
  • Coverage = PASS / (PASS + PRODUCT_FAIL + ANTIBOT_BLOCK). Antibot blocks count as failures because antibot bypass is a product capability gap, per CLAUDE.md.
  • Source path: which layer in the resolver ladder served the response (marketplace, exa, live-capture, direct-document, direct-fetch, dom-fallback, capture)
  • Speed: end-to-end elapsed milliseconds per probe
  • Response size: bytes returned (proxy for content density)
  • Browser-avoided rate: percentage of PASSes that did not open a headless browser
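The coverage and browser-avoided formulas above reduce to a few lines of arithmetic. This is a sketch under stated assumptions: the `VerdictCounts` shape and the function names are illustrative, not the harness's own. The counts plugged in are the ones this post reports, with AUTH_GATED set to 0 because the three counted verdicts already account for all 37 probes.

```typescript
interface VerdictCounts {
  pass: number;
  productFail: number;
  antibotBlock: number;
  authGated: number; // tracked, but excluded from the denominator
}

// Coverage = PASS / (PASS + PRODUCT_FAIL + ANTIBOT_BLOCK).
function coverage(c: VerdictCounts): number {
  const denominator = c.pass + c.productFail + c.antibotBlock;
  return denominator === 0 ? NaN : c.pass / denominator;
}

// Share of PASSes that did not open a headless browser.
function browserAvoidedRate(passes: number, passesWithBrowser: number): number {
  return passes === 0 ? NaN : (passes - passesWithBrowser) / passes;
}

function asPercent(ratio: number): number {
  return Math.round(ratio * 100);
}

const afterFix: VerdictCounts = { pass: 26, productFail: 6, antibotBlock: 5, authGated: 0 };
console.log(asPercent(coverage(afterFix)));          // 26 / 37 -> 70
console.log(asPercent(browserAvoidedRate(26, 1)));   // 25 / 26 -> 96
```

Because AUTH_GATED never enters the denominator, adding credential-gated probes to a sample cannot move the coverage number in either direction.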

Headline Results

| Metric | Before fix (anchor-10) | After fix (37 probes, 4 lanes) |
| --- | --- | --- |
| PASS count | 3 | 26 |
| PRODUCT_FAIL count | 7 | 6 |
| ANTIBOT_BLOCK count | 0 | 5 |
| Coverage | 30% | 70% |
| Median PASS time | not measured | 4,248 ms |
| Fastest PASS | not measured | 2,589 ms |
| Browser-avoided | not measured | 96% |

The 30 → 70 jump on a 37-probe sample is not the same as 30 → 70 on the same 10 probes; the wider sample is harder. The directly comparable number is 30 → 100 on the original 10, and 30 → 82 on the 22-probe anchor expansion. The 70 reported above is the honest headline number for a mixed-lane sample.

The Fix

The probe-only race winner in src/orchestrator/index.ts short-circuited to Exa's intent search and returned related URLs Exa surfaced (host homepage, featured article, similar packages, occasionally crustacean news when intent said "lobsters"). When the caller already named context.url, the shortcut overrode it. Six of seven anchor-lane failures hit this path.

We deleted the shortcut. Every resolve carrying context.url now falls through to the serial path, which calls direct-fetch and direct-document against context.url itself. Exa remains reachable as the final-fallback when direct-document rejects (interstitial, antibot, too-small, not-html, intent-mismatch).
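The control flow after the fix can be sketched as a serial ladder with a single final fallback. This is an illustration of the behavior described above, not the code in src/orchestrator/index.ts: the `Layer` interface, the `resolveNamedUrl` name, and the synchronous signatures are all assumptions made to keep the example short (real layers would presumably be async).

```typescript
// One rung of the ladder. `attempt` returns the response body, or null when
// the layer rejects (interstitial, antibot, too-small, not-html, intent-mismatch).
interface Layer {
  name: string;
  attempt: (url: string, intent: string) => string | null;
}

interface Resolved {
  source: string; // which layer served the response
  body: string;
}

// Hypothetical resolver: try each serial layer against the caller's URL first,
// and consult the final fallback only when every serial layer has rejected.
function resolveNamedUrl(
  url: string,
  intent: string,
  serialLayers: Layer[],
  finalFallback: Layer
): Resolved | null {
  for (const layer of serialLayers) {
    const body = layer.attempt(url, intent);
    if (body !== null) {
      return { source: layer.name, body: body };
    }
  }
  const fallbackBody = finalFallback.attempt(url, intent);
  return fallbackBody === null ? null : { source: finalFallback.name, body: fallbackBody };
}
```

The point of the ordering is that a search-style fallback can never override a URL the caller named; it only runs once the direct layers have all declined.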

The diff: 22 insertions, 84 deletions. Commit 388ba06e on branch feat/v1-exec-substrate-remote-proxy.

Per-Lane Breakdown

Anchor (22 probes, 82% coverage)

Must-pass lane. Any regression here blocks the release.

| Outcome | Count |
| --- | --- |
| PASS | 18 |
| PRODUCT_FAIL | 2 |
| ANTIBOT_BLOCK | 2 |

PASS examples (the 18 anchor passes plus the 4 wave-3 additions that passed): news.ycombinator.com, npmjs.com/package/openai, crates.io/search?q=tokio, github.com/search?q=anthropic, en.wikipedia.org/wiki/Transformer_..., developer.mozilla.org/.../fetch, arxiv.org/abs/2604.00694, pypi.org/project/anthropic, hub.docker.com/r/library/nginx/tags, dev.to/ben, rubygems.org/gems/rails, coingecko.com/en/coins/bitcoin, forecast.weather.gov/MapClick.php?lat=37.7749, bbc.com/news, theguardian.com/world, kraken.com/prices/bitcoin, daily.bandcamp.com, lobste.rs, github.com/facebook/react, anaconda.org/conda-forge/numpy, docs.kernel.org, pkg.go.dev/net/http.

PRODUCT_FAIL: allrecipes.com recipe page (HEAD-probe returns 402, orchestrator short-circuits with return-error before SSR HTML fastpath fires), axios.com homepage (endpoint ranker promoted /audiences/newsletters metadata over /articles story feed).

ANTIBOT_BLOCK: musicbrainz.org artist page (browser-verify challenge), exchange.coinbase.com BTC ticker (interstitial).

Semantic-rank (4 probes, 50% coverage)

A1/A8 entity-substitution lane. Catches the reddit r/X bug and similar wrong-template regressions.

PASS: oeis.org/A000045 (Fibonacci sequence), letterboxd.com/film/oppenheimer-2023. PRODUCT_FAIL: civitai.com/models?query=sdxl (SPA returns 77-byte meta-only envelope, hydration happens client-side). ANTIBOT_BLOCK: stackoverflow.com/questions/14220321 (Cloudflare challenge).

Ssr-list (5 probes, 40% coverage)

Data-rich SSR pages where the listing IS the page. Tests the page-artifact-promotion rule.

PASS: bbcgoodfood.com/search/recipes?query=pasta (1.05 MB of recipe results), walmart.com/search?q=batteries (1.33 MB of product listings). PRODUCT_FAIL: clinicaltrials.gov/search?cond=cancer (executor picked SSR HTML page over captured /api/int/studies JSON 64 KB with real trial data; ranker failed to promote real JSON XHR over page-artifact). ANTIBOT_BLOCK: mlb.com/scores (403 access denied), rent.com/texas/austin-apartments (CloudFront challenge).

Graphql (1 probe, 0% coverage)

POST + operationName extraction. Catches the X.com timeline regression.

PRODUCT_FAIL: github.com/octocat (DOM extraction returned a search form artifact rather than profile content).
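The per-lane percentages in the headings above follow from the outcome counts listed in each section. A small sketch of that rollup, with the counts transcribed from this post and the `LaneCounts` type and function name chosen for illustration:

```typescript
interface LaneCounts {
  lane: string;
  pass: number;
  productFail: number;
  antibotBlock: number;
}

// Counts as reported in the per-lane sections of this post.
const lanes: LaneCounts[] = [
  { lane: "anchor", pass: 18, productFail: 2, antibotBlock: 2 },
  { lane: "semantic-rank", pass: 2, productFail: 1, antibotBlock: 1 },
  { lane: "ssr-list", pass: 2, productFail: 1, antibotBlock: 2 },
  { lane: "graphql", pass: 0, productFail: 1, antibotBlock: 0 },
];

// Lane coverage as a whole-number percentage; antibot blocks count as failures.
function laneCoveragePercent(l: LaneCounts): number {
  const total = l.pass + l.productFail + l.antibotBlock;
  return total === 0 ? 0 : Math.round((l.pass / total) * 100);
}

for (const l of lanes) {
  console.log(l.lane + ": " + laneCoveragePercent(l) + "%");
}
```

With a one-probe lane such as graphql, a single verdict swings the lane between 0% and 100%, so the small lanes are better read row by row than as percentages.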

Source Distribution

Of 26 PASSes:

| Source | Count | Share |
| --- | --- | --- |
| direct-document | 17 | 65% |
| live-capture | 4 | 15% |
| direct-fetch | 2 | 8% |
| exa (final-fallback) | 1 | 4% |
| dom-fallback | 1 | 4% |
| capture | 1 | 4% |

The fix shifted PASS volume from exa to direct-document. That is by design: when the caller named the URL, fetching that URL directly is more honest than asking Exa for related URLs. Exa still fires (1 of 26 PASSes) as the final-fallback when direct-document rejects.
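A source distribution like this one can be derived from the judged rows by grouping PASSes on their source field. The sketch below assumes each row exposes `verdict` and `source` as in the Setup section; the `tallyPassSources` helper and its types are hypothetical.

```typescript
interface JudgedProbe {
  verdict: string;
  source: string;
}

interface SourceShare {
  source: string;
  count: number;
  sharePercent: number;
}

// Count PASS rows per source layer and express each as a share of all PASSes.
function tallyPassSources(results: JudgedProbe[]): SourceShare[] {
  const counts = new Map<string, number>();
  let passes = 0;
  for (const r of results) {
    if (r.verdict !== "PASS") continue; // failures are not attributed to a source here
    passes += 1;
    counts.set(r.source, (counts.get(r.source) ?? 0) + 1);
  }
  const rows: SourceShare[] = [];
  counts.forEach((count, source) => {
    rows.push({ source: source, count: count, sharePercent: Math.round((count / passes) * 100) });
  });
  // Largest contributor first.
  return rows.sort((a, b) => b.count - a.count);
}
```

Applied to the counts reported in this post, direct-document's 17 of 26 PASSes rounds to 65%.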

Speed Analysis

| Statistic | Time |
| --- | --- |
| Fastest PASS | 2,589 ms |
| P50 (median) | 4,248 ms |
| P90 | 6,187 ms |
| Slowest PASS | 32,458 ms (live-capture path with full browser) |

The cluster around 2.5–6 seconds reflects direct-document's fetch-and-extract loop without browser launch. The 32-second outlier was probe 2 (npmjs.com/package/openai), which fell through to live-capture and drove a headless browser session.

Browser-avoided rate across PASSes: 25 of 26 = 96%. The paper claimed the median PASS should not pay browser cost. The data agrees.
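For anyone recomputing P50 and P90 from their own run, here is one way to do it. The post does not say which percentile definition the bench used, so this sketch assumes the nearest-rank method; the function name is illustrative, and other definitions (for example linear interpolation) can give slightly different values on small samples.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p% of the
// samples are less than or equal to it. p must be in (0, 100].
function percentileNearestRank(samplesMs: number[], p: number): number {
  if (samplesMs.length === 0) throw new Error("need at least one sample");
  if (!(p > 0 && p <= 100)) throw new RangeError("p must be in (0, 100]");
  const sorted = samplesMs.slice().sort((a, b) => a - b); // copy, then numeric sort
  const rank = Math.ceil((p / 100) * sorted.length);      // 1-based rank
  const value = sorted[rank - 1];
  if (value === undefined) throw new Error("rank out of range");
  return value;
}
```

Feeding it the elapsed_ms values of the PASS rows yields the median and P90; a single slow live-capture run moves the maximum without moving the median much.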

Failure Mode Catalog

Every PRODUCT_FAIL or ANTIBOT_BLOCK row carries a named_regression. The substrate captured each one as a deferred contract in the ledger.

| Bug class | Probes affected | Contract id |
| --- | --- | --- |
| HEAD-probe 402 short-circuits SSR fastpath | allrecipes recipe | cbc394c2 |
| Endpoint ranker picks audiences metadata over homepage articles | axios.com | e6dc4aa8 |
| Ranker picks HTML page over real JSON XHR | usgs earthquakes, clinicaltrials | bf5b9b1e |
| SPA meta-only extraction (hydration not awaited) | civitai search | 16d49b9f |
| GitHub profile DOM extract returns search form | github.com/octocat | 323e958f |
| Antibot blocks | 5 probes (musicbrainz, coinbase exchange, stackoverflow, mlb, rent.com) | 6466588a + 2b0eab0b |

Each one will become its own fix post once shipped.

Methodology Notes

  • Agent-judged, not script-judged. Every verdict in this post comes from an LLM agent reading the raw resolve stdout against the intent. No grep for success:true. No regex on endpoint_id. The bench tells the truth because the judge has to read the response.
  • Live prod backend. Marketplace lookups and exa probe-fallback hit beta-api.unbrowse.ai, not a local mock. The latency and error modes in the table are what a real installer would see.
  • Single-shot per probe. No retries, no fallbacks the bench inserted. If a probe failed, that is the response a calling agent would have received.
  • Anti-pattern guardrails. The corpus uses live URLs that return data on cold load (per feedback_bench_corpus_realistic_urls); the rubric counts antibot as failure (per the antibot-is-product-gap rule in CLAUDE.md); and the harness only collects while the agent judges (per feedback_harness_makes_visible_agent_judges).

Implications

For agent developers

70% honest coverage on a mixed-lane sample, 82% on the must-pass anchor lane, 96% browser-avoided rate on passes. If you wrap unbrowse as your default web tool, the median passing request completes in under 5 seconds and does not launch a browser. The probes that fail are named at the row level so you can budget around them.

For the paper

The paper proposed that API-first agent web access beats browser-first on speed, success rate, and cost. The data here corroborates that on the passes: 96% of PASSes used a non-browser path (direct-document, direct-fetch, exa, dom-fallback). The fails are not a refutation of the paper; they are open bug rows on a corpus the paper intentionally chose to be hard.

For the bench itself

A 30 → 70 swing from one root-cause fix means the corpus is sensitive enough to catch real wedges. A bench that always says 99% is not measuring anything.

Reproducing the Benchmark

git clone https://github.com/unbrowse-ai/unbrowse
cd unbrowse
git checkout feat/v1-exec-substrate-remote-proxy   # contains the fix
bun install   # install after checkout so dependencies match the branch
# Pull 22 anchor probes:
grep "^anchor |" harness/probes/corpus-gate.txt | head -22 \
  | awk -F'|' '{gsub(/^ +| +$/, "", $5); gsub(/^ +| +$/, "", $6); print $5 "|" $6}' \
  > anchor-22.txt
# Fan out one agent per probe with your agent harness of choice
# (the lazy way: a loop that asks Claude or GPT to grade each response)
while IFS='|' read -r intent url; do
  echo "=== $intent | $url ==="
  bun src/cli.ts resolve --intent "$intent" --url "$url"
done < anchor-22.txt
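If you would rather slice the corpus in TypeScript than awk, the same extraction can be sketched as below. The line layout is an assumption inferred from the awk command above (pipe-delimited, lane in field 1, intent in field 5, URL in field 6); the actual format of harness/probes/corpus-gate.txt may differ, and `GateProbe` and `parseCorpusLine` are hypothetical names.

```typescript
interface GateProbe {
  lane: string;
  intent: string;
  url: string;
}

// Mirror of the awk one-liner: split on "|", trim each field, and keep
// field 1 (lane), field 5 (intent), and field 6 (url).
// Like the awk version, this breaks if an intent or URL itself contains "|".
function parseCorpusLine(line: string): GateProbe | null {
  const fields = line.split("|").map((field) => field.trim());
  if (fields.length < 6) return null;
  const lane = fields[0];
  const intent = fields[4];
  const url = fields[5];
  if (!lane || !intent || !url) return null;
  return { lane: lane, intent: intent, url: url };
}

// Keep the first `limit` probes of one lane, like `grep "^anchor |" | head -22`.
function selectLane(lines: string[], lane: string, limit: number): GateProbe[] {
  const picked: GateProbe[] = [];
  for (const line of lines) {
    const probe = parseCorpusLine(line);
    if (probe !== null && probe.lane === lane && picked.length < limit) {
      picked.push(probe);
    }
  }
  return picked;
}
```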

The corpus, the rubric, the per-probe named_regression, and the orchestrator fix are all in the repo. Disagree with our verdicts? Re-grade the same stdout and tell us where we cheated.

Conclusion

Coverage moved 30 → 70 across a mixed-lane sample. The fix was one root cause. The remaining failures are six named bug classes, each one a contract row with a target. We are not promoting to v7 (the threshold was 100% across a large corpus; 70 across 37 is not that). We are shipping v6.18.0 with the fix and the next batch of bug-row fixes queued behind it.

We will publish every result, not just the ones that look good. That is the contract.


unbrowse@6.18.0 ships from commit 388ba06e on feat/v1-exec-substrate-remote-proxy. Install: npm i unbrowse. Disagree with the rubric? Open a PR against harness/probes/corpus-gate.txt or CLAUDE.md.