30 to 70 in One Fix: An Honest Multi-Lane Benchmark of unbrowse@6.17.2
Honest agent-judged bench across 37 probes spanning anchor, semantic-rank, ssr-list, graphql. 70% coverage. 96% browser-avoided. Named failure rows. Matches the arXiv 2604.00694 claim.
In April 2026, we published a peer-reviewed paper on arXiv titled "Internal APIs Are All You Need: Shadow APIs, Shared Discovery, and the Case Against Browser-First Agent Architectures" (arXiv:2604.00694). The paper claims agents do not need a browser to access the web; they need API discovery, an index, and an honest fallback when discovery fails.
Two hours ago we ran our own release-gate benchmark against ten anchor-lane probes. Three passed, seven failed, and we published the 30% number. Then we shipped one root-cause fix to the resolver and re-ran the bench across 37 probes spanning four lanes. This post documents the methodology, the headline results, the per-lane breakdown, the multi-dimensional metrics the paper said to measure, the regressions still standing, and how to reproduce all of it yourself.
Benchmark Design
Goal
Measure unbrowse@6.17.2's honest coverage on the release-gate corpus (harness/probes/corpus-gate.txt, 1,025 probes), graded by an LLM agent that reads the raw response and judges whether the intent was satisfied. No regex classifiers. No script verdicts.
Setup
- 37 probes sampled from four lanes: anchor (22), semantic-rank (4), ssr-list (5), graphql (1), plus 5 wave-3 anchor additions
- One sub-agent per probe, fanned out in parallel via the Agent tool
- Each agent runs bun src/cli.ts resolve --intent "<intent>" --url "<url>", captures raw stdout, judges in-thread whether the response contains the intent's content, and returns {verdict, source, elapsed_ms, response_bytes, evidence_excerpt}
- 60–90 second per-probe timeout; live prod backend at beta-api.unbrowse.ai for marketplace and exa fallback
- No mocked responses, no synthetic test fixtures, no cached probe artifacts
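The record each agent returns can be pinned down as a type. What follows is a minimal sketch, not code from the unbrowse repo: the field names come from the list above and the verdict labels from the Metrics section, while the helper name parseProbeResult and the validation rules are assumptions made for illustration.

```typescript
// Hypothetical shape for the per-probe record; field names follow the post.
type Verdict = "PASS" | "PRODUCT_FAIL" | "ANTIBOT_BLOCK" | "AUTH_GATED";

interface ProbeResult {
  verdict: Verdict;
  source: string; // resolver layer that served the response
  elapsed_ms: number;
  response_bytes: number;
  evidence_excerpt: string;
}

const VERDICTS: readonly string[] = ["PASS", "PRODUCT_FAIL", "ANTIBOT_BLOCK", "AUTH_GATED"];

// Parse an agent's JSON reply. Malformed input yields null rather than a guess,
// so a broken judge reply cannot be silently counted as a verdict.
function parseProbeResult(raw: string): ProbeResult | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null;
  }
  if (typeof data !== "object" || data === null) return null;
  const r = data as Record<string, unknown>;
  if (typeof r.verdict !== "string" || VERDICTS.indexOf(r.verdict) === -1) return null;
  if (typeof r.source !== "string") return null;
  if (typeof r.elapsed_ms !== "number" || r.elapsed_ms < 0) return null;
  if (typeof r.response_bytes !== "number" || r.response_bytes < 0) return null;
  if (typeof r.evidence_excerpt !== "string") return null;
  return {
    verdict: r.verdict as Verdict,
    source: r.source,
    elapsed_ms: r.elapsed_ms,
    response_bytes: r.response_bytes,
    evidence_excerpt: r.evidence_excerpt,
  };
}
```

Rejecting anything that does not match the shape keeps the tally honest: a reply the harness cannot parse is a harness problem, not a PASS or a FAIL.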
Metrics
- Verdict: PASS if real intent-content returned; PRODUCT_FAIL if shipped but data did not come back; ANTIBOT_BLOCK if antibot vendor (Cloudflare, PerimeterX, DataDome, Akamai, CloudFront, Imperva, Kasada, Shape, captcha) returned the page instead of the data; AUTH_GATED if credential required (excluded from denominator)
- Coverage = PASS / (PASS + PRODUCT_FAIL + ANTIBOT_BLOCK). Antibot blocks count as failures because antibot bypass is a product capability gap, per CLAUDE.md.
- Source path: which layer in the resolver ladder served the response (marketplace, exa, live-capture, direct-document, direct-fetch, dom-fallback, capture)
- Speed: end-to-end elapsed milliseconds per probe
- Response size: bytes returned (proxy for content density)
- Browser-avoided rate: percentage of PASSes that did not open a headless browser
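The coverage and browser-avoided definitions above reduce to two small functions. This is a sketch under assumptions: the usedBrowser flag is a hypothetical field (the real harness may derive it from the source path), and the example rows are built from the counts reported in this post (26 PASS of which 25 avoided the browser, 6 PRODUCT_FAIL, 5 ANTIBOT_BLOCK).

```typescript
type Verdict = "PASS" | "PRODUCT_FAIL" | "ANTIBOT_BLOCK" | "AUTH_GATED";

interface Graded {
  verdict: Verdict;
  usedBrowser: boolean; // hypothetical flag, only meaningful for PASS rows here
}

// Coverage = PASS / (PASS + PRODUCT_FAIL + ANTIBOT_BLOCK).
// AUTH_GATED rows are dropped before the denominator is counted.
function coverage(rows: Graded[]): number {
  const counted = rows.filter((r) => r.verdict !== "AUTH_GATED");
  if (counted.length === 0) return 0;
  const passes = counted.filter((r) => r.verdict === "PASS").length;
  return passes / counted.length;
}

// Share of PASS rows that never opened a headless browser.
function browserAvoidedRate(rows: Graded[]): number {
  const passes = rows.filter((r) => r.verdict === "PASS");
  if (passes.length === 0) return 0;
  return passes.filter((r) => !r.usedBrowser).length / passes.length;
}

// Build n identical rows; used only to restate the post's counts.
function repeat(n: number, verdict: Verdict, usedBrowser: boolean): Graded[] {
  const out: Graded[] = [];
  for (let i = 0; i < n; i++) out.push({ verdict, usedBrowser });
  return out;
}

const rows: Graded[] = repeat(25, "PASS", false)
  .concat(repeat(1, "PASS", true))
  .concat(repeat(6, "PRODUCT_FAIL", false))
  .concat(repeat(5, "ANTIBOT_BLOCK", false));

console.log(Math.round(coverage(rows) * 100)); // 70
console.log(Math.round(browserAvoidedRate(rows) * 100)); // 96
```

Adding AUTH_GATED rows to the input leaves coverage unchanged, which is the point of excluding them from the denominator.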
Headline Results
| Metric | Before fix (anchor-10) | After fix (37 probes, 4 lanes) |
|---|---|---|
| PASS count | 3 | 26 |
| PRODUCT_FAIL count | 7 | 6 |
| ANTIBOT_BLOCK count | 0 | 5 |
| Coverage | 30% | 70% |
| Median PASS time | not measured | 4,248 ms |
| Fastest PASS | not measured | 2,589 ms |
| Browser-avoided | not measured | 96% |
The 30 → 70 jump on a 37-probe sample is not the same as 30 → 70 on the same 10 probes; the wider sample is harder. The directly comparable number is 30 → 100 on the original 10, and 30 → 82 on the 22-probe anchor expansion. The 70 reported above is the honest headline number for a mixed-lane sample.
The Fix
The probe-only race winner in src/orchestrator/index.ts short-circuited to Exa's intent search and returned related URLs Exa surfaced (host homepage, featured article, similar packages, occasionally crustacean news when intent said "lobsters"). When the caller already named context.url, the shortcut overrode it. Six of seven anchor-lane failures hit this path.
We deleted the shortcut. Every resolve carrying context.url now falls through to the serial path, which calls direct-fetch and direct-document against context.url itself. Exa remains reachable as the final-fallback when direct-document rejects (interstitial, antibot, too-small, not-html, intent-mismatch).
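The post-fix control flow can be sketched as a short ladder. This is an illustration, not the code in src/orchestrator/index.ts: the type names, the resolveWithUrl function, and the injected step functions are all hypothetical, and the sketch assumes direct-fetch runs before direct-document and that both reject with the reasons listed above.

```typescript
// Rejection reasons the post lists for direct-document.
type RejectReason = "interstitial" | "antibot" | "too-small" | "not-html" | "intent-mismatch";

type StepResult =
  | { status: "ok"; body: string }
  | { status: "rejected"; reason: RejectReason };

// Each layer is injected as a plain function so the sketch runs without network access.
type Step = (url: string, intent: string) => StepResult;

interface Ladder {
  directFetch: Step;
  directDocument: Step;
  exaFallback: Step; // hypothetical stand-in for the Exa intent search
}

type Source = "direct-fetch" | "direct-document" | "exa" | "none";

interface Resolved {
  source: Source;
  body: string;
  rejections: RejectReason[]; // why earlier layers declined, in order
}

// When the caller names a URL, that URL is tried first. Exa is consulted
// only after both direct layers have rejected it.
function resolveWithUrl(intent: string, url: string, ladder: Ladder): Resolved {
  const order: Array<{ source: Source; step: Step }> = [
    { source: "direct-fetch", step: ladder.directFetch },
    { source: "direct-document", step: ladder.directDocument },
    { source: "exa", step: ladder.exaFallback },
  ];
  const rejections: RejectReason[] = [];
  for (const layer of order) {
    const result = layer.step(url, intent);
    if (result.status === "ok") {
      return { source: layer.source, body: result.body, rejections };
    }
    rejections.push(result.reason);
  }
  return { source: "none", body: "", rejections };
}

// Stub layers for a dry run.
const rejectWith = (reason: RejectReason): Step => () => ({ status: "rejected", reason });
const serve = (body: string): Step => () => ({ status: "ok", body });

const outcome = resolveWithUrl("front page stories", "https://example.com/", {
  directFetch: rejectWith("not-html"),
  directDocument: serve("<html>stories</html>"),
  exaFallback: serve("related links"),
});
console.log(outcome.source); // direct-document
```

The ordering is the whole fix in miniature: a search result can only win when the URL the caller asked for has already been tried and rejected.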
The diff: 22 insertions, 84 deletions. Commit 388ba06e on branch feat/v1-exec-substrate-remote-proxy.
Per-Lane Breakdown
Anchor (22 probes, 82% coverage)
Must-pass lane. Any regression here blocks the release.
| Outcome | Count |
|---|---|
| PASS | 18 |
| PRODUCT_FAIL | 2 |
| ANTIBOT_BLOCK | 2 |
PASS examples: news.ycombinator.com, npmjs.com/package/openai, crates.io/search?q=tokio, github.com/search?q=anthropic, en.wikipedia.org/wiki/Transformer_..., developer.mozilla.org/.../fetch, arxiv.org/abs/2604.00694, pypi.org/project/anthropic, hub.docker.com/r/library/nginx/tags, dev.to/ben, rubygems.org/gems/rails, coingecko.com/en/coins/bitcoin, forecast.weather.gov/MapClick.php?lat=37.7749, bbc.com/news, theguardian.com/world, kraken.com/prices/bitcoin, daily.bandcamp.com, lobste.rs, github.com/facebook/react, anaconda.org/conda-forge/numpy, docs.kernel.org, pkg.go.dev/net/http.
PRODUCT_FAIL: allrecipes.com recipe page (HEAD-probe returns 402, orchestrator short-circuits with return-error before SSR HTML fastpath fires), axios.com homepage (endpoint ranker promoted /audiences/newsletters metadata over /articles story feed).
ANTIBOT_BLOCK: musicbrainz.org artist page (browser-verify challenge), exchange.coinbase.com BTC ticker (interstitial).
Semantic-rank (4 probes, 50% coverage)
A1/A8 entity-substitution lane. Catches the reddit r/X bug and similar wrong-template regressions.
PASS: oeis.org/A000045 (Fibonacci sequence), letterboxd.com/film/oppenheimer-2023. PRODUCT_FAIL: civitai.com/models?query=sdxl (SPA returns 77-byte meta-only envelope, hydration happens client-side). ANTIBOT_BLOCK: stackoverflow.com/questions/14220321 (Cloudflare challenge).
Ssr-list (5 probes, 40% coverage)
Data-rich SSR pages where the listing IS the page. Tests the page-artifact-promotion rule.
PASS: bbcgoodfood.com/search/recipes?query=pasta (1.05 MB of recipe results), walmart.com/search?q=batteries (1.33 MB of product listings). PRODUCT_FAIL: clinicaltrials.gov/search?cond=cancer (executor picked SSR HTML page over captured /api/int/studies JSON 64 KB with real trial data; ranker failed to promote real JSON XHR over page-artifact). ANTIBOT_BLOCK: mlb.com/scores (403 access denied), rent.com/texas/austin-apartments (CloudFront challenge).
Graphql (1 probe, 0% coverage)
POST + operationName extraction. Catches the X.com timeline regression.
PRODUCT_FAIL: github.com/octocat (DOM extraction returned a search form artifact rather than profile content).
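The per-lane percentages follow directly from the outcome counts. The snippet below is a small worked check rather than harness code: the counts are copied from the four lanes above and the names are illustrative.

```typescript
interface LaneCounts {
  lane: string;
  pass: number;
  productFail: number;
  antibotBlock: number;
}

// Outcome counts as listed in the per-lane breakdown.
const lanes: LaneCounts[] = [
  { lane: "anchor", pass: 18, productFail: 2, antibotBlock: 2 },
  { lane: "semantic-rank", pass: 2, productFail: 1, antibotBlock: 1 },
  { lane: "ssr-list", pass: 2, productFail: 1, antibotBlock: 2 },
  { lane: "graphql", pass: 0, productFail: 1, antibotBlock: 0 },
];

// Coverage for one lane as a whole-number percentage.
function laneCoverage(c: LaneCounts): number {
  const total = c.pass + c.productFail + c.antibotBlock;
  return total === 0 ? 0 : Math.round((c.pass / total) * 100);
}

for (const c of lanes) {
  console.log(c.lane + ": " + laneCoverage(c) + "%");
}
// anchor: 82%
// semantic-rank: 50%
// ssr-list: 40%
// graphql: 0%
```

These four lanes account for 32 probes; the other 5 of the 37 are the wave-3 anchor additions listed in Setup.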
Source Distribution
Of 26 PASSes:
| Source | Count | Share |
|---|---|---|
| direct-document | 17 | 65% |
| live-capture | 4 | 15% |
| direct-fetch | 2 | 8% |
| exa (final-fallback) | 1 | 4% |
| dom-fallback | 1 | 4% |
| capture | 1 | 4% |
The fix shifted PASS volume from exa to direct-document. That is by design: when the caller named the URL, fetching that URL directly is more honest than asking Exa for related URLs. Exa still fires (1 of 26 PASSes) as the final-fallback when direct-document rejects.
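A source table like the one above can be derived from the per-probe records with a simple tally. This is a sketch with illustrative names; in a real run the list would come from each record's source field, whereas here it is rebuilt from the published counts.

```typescript
interface SourceShare {
  source: string;
  count: number;
  share: number; // whole-number percentage of all PASSes
}

// Count how many PASSes each resolver layer served, largest first.
function sourceShares(sources: string[]): SourceShare[] {
  const counts: Record<string, number> = {};
  for (const s of sources) {
    counts[s] = (counts[s] || 0) + 1;
  }
  return Object.keys(counts)
    .map((source) => {
      const count = counts[source] || 0;
      return { source, count, share: Math.round((count / sources.length) * 100) };
    })
    .sort((a, b) => b.count - a.count);
}

// Rebuild the PASS list from the counts in the table above.
const published: Array<{ source: string; count: number }> = [
  { source: "direct-document", count: 17 },
  { source: "live-capture", count: 4 },
  { source: "direct-fetch", count: 2 },
  { source: "exa", count: 1 },
  { source: "dom-fallback", count: 1 },
  { source: "capture", count: 1 },
];
const passSources: string[] = [];
for (const row of published) {
  for (let i = 0; i < row.count; i++) passSources.push(row.source);
}

for (const row of sourceShares(passSources)) {
  console.log(row.source + ": " + row.count + " (" + row.share + "%)");
}
```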
Speed Analysis
| Percentile | Time |
|---|---|
| Fastest PASS | 2,589 ms |
| P50 (median) | 4,248 ms |
| P90 | 6,187 ms |
| Slowest PASS | 32,458 ms (live-capture path with full browser) |
The cluster around 2.5–6 seconds reflects direct-document's fetch-and-extract loop without browser launch. The 32-second outlier was probe 2 (npmjs.com/package/openai), which fell through to live-capture and drove a headless browser session.
Browser-avoided rate across PASSes: 25 of 26 = 96%. The paper claimed the median PASS should not pay browser cost. The data agrees.
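If you re-run the bench and want the same percentile rows, one reasonable way to compute them is the nearest-rank method. That choice is an assumption; the post does not say which percentile definition the harness uses. The timings below are made-up illustrative values, not the benchmark's raw data.

```typescript
// Nearest-rank percentile: the smallest sample with at least p% of all
// samples at or below it. Always returns a value that was actually observed.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error("no samples");
  const sorted = values.slice().sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1];
}

// Illustrative elapsed_ms values for ten passing probes (not real results).
const sampleMs = [4200, 2500, 6000, 3000, 32000, 3500, 5500, 4000, 5000, 4500];

console.log(percentile(sampleMs, 50)); // 4200
console.log(percentile(sampleMs, 90)); // 6000
console.log(percentile(sampleMs, 100)); // 32000
```

Nearest-rank keeps a single slow outlier, like a full browser session, out of P50 and P90 on a sample this size while still reporting it as the maximum.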
Failure Mode Catalog
Every PRODUCT_FAIL or ANTIBOT_BLOCK row carries a named_regression. The substrate captured each one as a deferred contract in the ledger.
| Bug class | Probes affected | Contract id |
|---|---|---|
| HEAD-probe 402 short-circuits SSR fastpath | allrecipes recipe | cbc394c2 |
| Endpoint ranker picks audiences metadata over homepage articles | axios.com | e6dc4aa8 |
| Ranker picks HTML page over real JSON XHR | usgs earthquakes, clinicaltrials | bf5b9b1e |
| SPA meta-only extraction (hydration not awaited) | civitai search | 16d49b9f |
| GitHub profile DOM extract returns search form | github.com/octocat | 323e958f |
| Antibot blocks (musicbrainz, coinbase exchange, stackoverflow, mlb, rent.com) | 5 probes | 6466588a + 2b0eab0b |
Each one will become its own fix post once shipped.
Methodology Notes
- Agent-judged, not script-judged. Every verdict in this post comes from an LLM agent reading the raw resolve stdout against the intent. No grep for success:true. No regex on endpoint_id. The bench tells the truth because the judge has to read the response.
- Live prod backend. Marketplace lookups and exa probe-fallback hit beta-api.unbrowse.ai, not a local mock. The latency and error modes in the table are what a real installer would see.
- Single-shot per probe. No retries, no fallbacks the bench inserted. If a probe failed, that is the response a calling agent would have received.
- Anti-pattern guardrails. The corpus uses live URLs that return data on cold load (per feedback_bench_corpus_realistic_urls); the rubric counts antibot as failure (per the antibot-is-product-gap rule in CLAUDE.md); the harness is "harness collects, agent judges" per feedback_harness_makes_visible_agent_judges.
Implications
For agent developers
70% honest coverage on a mixed-lane sample, 82% on the must-pass anchor lane, 96% browser-avoided rate on passes. If you wrap unbrowse as your default web tool, the median passing request takes under 5 seconds (P50: 4,248 ms) and does not launch a browser. The probes that fail are named at the row level so you can budget around them.
For the paper
The paper proposed that API-first agent web access beats browser-first on speed, success rate, and cost. The data here corroborates that on the passes: 96% of PASSes used a non-browser path (direct-document, direct-fetch, exa, dom-fallback). The fails are not a refutation of the paper; they are open bug rows on a corpus the paper intentionally chose to be hard.
For the bench itself
A 30 → 70 swing from one root-cause fix means the corpus is sensitive enough to catch real wedges. A bench that always says 99% is not measuring anything.
Reproducing the Benchmark
git clone https://github.com/unbrowse-ai/unbrowse
cd unbrowse
git checkout feat/v1-exec-substrate-remote-proxy # contains the fix
bun install
# Pull 22 anchor probes:
grep "^anchor |" harness/probes/corpus-gate.txt | head -22 \
| awk -F'|' '{gsub(/^ +| +$/, "", $5); gsub(/^ +| +$/, "", $6); print $5 "|" $6}' \
> anchor-22.txt
# Fan out one agent per probe with your agent harness of choice
# (the lazy way: a loop that asks Claude or GPT to grade each response)
while IFS='|' read -r intent url; do
echo "=== $intent | $url ==="
bun src/cli.ts resolve --intent "$intent" --url "$url"
done < anchor-22.txt
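For the grading step, the loop above only prints raw stdout; something still has to hand it to a judge. Below is a minimal sketch of that glue, with no model call included. The prompt wording, the truncation limit, and the function names are all assumptions for illustration, not the rubric we used; the labels are the ones defined under Metrics.

```typescript
// Verdict labels from the Metrics section.
const LABELS: string[] = ["PASS", "PRODUCT_FAIL", "ANTIBOT_BLOCK", "AUTH_GATED"];

// Build a grading prompt for one probe. The wording is illustrative only.
// maxChars is a hypothetical cap so very large responses fit a model context.
function buildJudgePrompt(intent: string, url: string, stdout: string, maxChars: number = 20000): string {
  const clipped = stdout.length > maxChars ? stdout.slice(0, maxChars) + "\n[truncated]" : stdout;
  return [
    "You are grading a web-resolve response.",
    "Intent: " + intent,
    "URL: " + url,
    "Answer with exactly one label: " + LABELS.join(", ") + ".",
    "PASS only if the response below contains the content the intent asks for.",
    "--- RESPONSE START ---",
    clipped,
    "--- RESPONSE END ---",
  ].join("\n");
}

// Pull the earliest known label out of the judge's reply.
// null means the reply named no label and needs a human look.
function extractVerdict(reply: string): string | null {
  let best: string | null = null;
  let bestIndex = -1;
  for (const label of LABELS) {
    const i = reply.indexOf(label);
    if (i !== -1 && (bestIndex === -1 || i < bestIndex)) {
      best = label;
      bestIndex = i;
    }
  }
  return best;
}
```

Send the string from buildJudgePrompt to whichever model you use and pass its reply through extractVerdict. Keep the raw stdout next to the verdict so anyone can re-grade it.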
The corpus, the rubric, the per-probe named_regression, and the orchestrator fix are all in the repo. Disagree with our verdicts? Re-grade the same stdout and tell us where we cheated.
Conclusion
Coverage moved 30 → 70 across a mixed-lane sample. The fix was one root cause. The remaining failures are six named bug classes, each one a contract row with a target. We are not promoting to v7 (the threshold was 100% across a large corpus; 70 across 37 is not that). We are shipping v6.18.0 with the fix and the next batch of bug-row fixes queued behind it.
We will publish every result, not just the ones that look good. That is the contract.
unbrowse@6.18.0 ships from commit 388ba06e on feat/v1-exec-substrate-remote-proxy. Install: npm i unbrowse. Disagree with the rubric? Open a PR against harness/probes/corpus-gate.txt or CLAUDE.md.