Blog

Unbrowse vs Playwright: The Full Benchmark (94 Domains)

Detailed benchmark results from the arXiv paper comparing Unbrowse API calls vs Playwright browser automation across 94 domains with site-by-site breakdown.

Lewis Tham
April 3, 2026

In April 2026, we published a peer-reviewed paper on arXiv titled "Internal APIs Are All You Need: Shadow APIs, Shared Discovery, and the Case Against Browser-First Agent Architectures" (arXiv:2604.00694). The paper presents the theory behind API-first agent web access, the Unbrowse system architecture, and a comprehensive benchmark against Playwright across 94 domains.

This post breaks down the benchmark methodology, presents the full results, and explores what the data tells us about the future of AI agent web access.

Benchmark Design

Goal

Measure the performance difference between two approaches to web data retrieval:

  1. Browser automation (Playwright): Navigate to a URL, wait for rendering, extract data from the DOM
  2. API discovery (Unbrowse): Call the discovered internal API endpoint directly

Setup

  • 94 domains selected across categories: e-commerce (Amazon, eBay, Shopify stores), social media (Reddit, Twitter, LinkedIn), search (Google, Bing, DuckDuckGo), news (CNN, BBC, TechCrunch), financial data (Yahoo Finance, Bloomberg), developer tools (GitHub, Stack Overflow, npm), and general reference (Wikipedia, IMDb, Weather)
  • Playwright baseline: Chromium headless, default configuration, auto-wait enabled. For each domain, a standardized task was defined (search for a term, navigate to a listing, fetch a feed). Time measured from navigation start to data extraction complete.
  • Unbrowse execution: Same tasks, resolved against the cached route database. Routes were pre-indexed through browsing sessions. Time measured from resolve call to JSON response received.
  • 10 runs per domain per tool to account for variance. Median values used for comparison.
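The measurement loop described above can be sketched as follows. This is illustrative, not the harness code from the paper: the task passed in stands in for whatever drives Playwright or Unbrowse for a given domain.

```typescript
// Sketch of the per-domain measurement loop: run one task N times
// with a given tool and report the median wall-clock time.

function median(samples: number[]): number {
  const sorted = [...samples].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 0
    ? (sorted[mid - 1] + sorted[mid]) / 2
    : sorted[mid];
}

async function timeRuns(task: () => Promise<void>, runs = 10): Promise<number> {
  const times: number[] = [];
  for (let i = 0; i < runs; i++) {
    const start = performance.now(); // navigation start / resolve call
    await task();
    times.push(performance.now() - start); // extraction complete / JSON received
  }
  return median(times); // median damps per-run network variance
}
```

Using the median of 10 runs rather than the mean keeps a single slow run (cold CDN cache, transient network hiccup) from distorting a domain's result.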

Metrics

  • Execution time: Wall-clock time from task initiation to structured data received
  • Success rate: Percentage of runs that returned correct, complete data
  • Data completeness: Whether the response contained all expected fields
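The completeness metric can be checked mechanically. A minimal sketch, assuming each task declares the fields it expects (the field names in the usage note are illustrative):

```typescript
// A run counts toward "data completeness" only if every expected
// field is present and non-null in the structured response.
function isComplete(
  response: Record<string, unknown>,
  expectedFields: string[],
): boolean {
  return expectedFields.every(
    (field) => response[field] !== undefined && response[field] !== null,
  );
}
```

For example, a product-lookup task might declare `["title", "price", "availability", "rating"]` as its expected fields, and a run returning a response missing any of them would count as incomplete.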

Headline Results

Metric                   | Playwright | Unbrowse | Improvement
Mean execution time      | 3,404ms    | 950ms    | 3.6x faster
Median execution time    | 3,850ms    | 710ms    | 5.4x faster
P95 execution time       | 8,200ms    | 1,800ms  | 4.6x faster
Well-cached routes (P10) | 2,100ms    | 95ms     | 22x faster

The mean speedup of 3.6x is meaningful, but the median tells a more interesting story. At 5.4x, it shows that the distribution of improvements is skewed: Unbrowse's advantage is larger on most domains and shrinks on the minority of domains where the API call itself is slow (complex queries, distant servers, rate-limited endpoints), a slow tail that drags the mean down.

The P10 result is striking: for the best-cached routes, Unbrowse returns data in under 100ms -- roughly the time to make a single HTTP request. This is the theoretical minimum for any web data retrieval tool, achieved when the route is cached locally and the API server responds quickly.
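The P95 and P10 figures are percentiles over the per-run latencies. A nearest-rank sketch (the paper does not specify which percentile convention it uses, so this is one common choice):

```typescript
// Nearest-rank percentile: the smallest sample such that at least
// p percent of the samples are less than or equal to it.
function percentile(samples: number[], p: number): number {
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}
```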

Category Breakdown

E-commerce (22 domains)

Domain               | Playwright (ms) | Unbrowse (ms) | Speedup
Amazon               | 4,200           | 1,100         | 3.8x
eBay                 | 3,800           | 890           | 4.3x
Walmart              | 5,100           | 1,300         | 3.9x
Target               | 4,500           | 950           | 4.7x
Best Buy             | 4,800           | 1,050         | 4.6x
Shopify stores (avg) | 3,200           | 780           | 4.1x
Etsy                 | 3,600           | 820           | 4.4x
Wayfair              | 5,500           | 1,400         | 3.9x

Category average: 4.2x speedup

E-commerce sites showed the most consistent improvements. These sites are heavily API-driven: product search, category browsing, price lookup, and inventory checks all hit well-defined REST endpoints. The API responses contain exactly the structured product data that agents typically need (title, price, availability, ratings), making the Unbrowse approach particularly efficient.

Playwright's performance on e-commerce sites was dragged down by heavy JavaScript bundles, lazy-loaded images, and A/B test frameworks that add rendering time without affecting data availability.

Social Media (15 domains)

Domain    | Playwright (ms) | Unbrowse (ms) | Speedup
Reddit    | 3,500           | 650           | 5.4x
Twitter/X | 6,200           | 1,800         | 3.4x
LinkedIn  | 5,800           | 1,500         | 3.9x
Instagram | 4,900           | 1,200         | 4.1x
TikTok    | 5,500           | 1,600         | 3.4x
YouTube   | 4,100           | 900           | 4.6x
Pinterest | 3,800           | 850           | 4.5x

Category average: 4.0x speedup

Social media sites showed high variance in Playwright performance due to aggressive anti-bot measures. Twitter/X and TikTok had the highest Playwright times because their anti-bot systems frequently delayed or blocked automated sessions. Unbrowse bypasses this entirely since direct API calls do not trigger browser-based anti-bot detection.

Reddit showed the largest speedup (5.4x) because its JSON API is well-structured, fast, and minimally rate-limited for read operations.

Search Engines (8 domains)

Domain         | Playwright (ms) | Unbrowse (ms) | Speedup
Google         | 3,200           | 1,100         | 2.9x
Bing           | 2,800           | 750           | 3.7x
DuckDuckGo     | 2,400           | 620           | 3.9x
Google Scholar | 3,500           | 980           | 3.6x

Category average: 3.4x speedup

Search engines showed the smallest category speedup, primarily because Google's internal API is complex (encrypted payloads, protobuf responses) and requires more processing to call correctly. DuckDuckGo and Bing have simpler, more accessible internal APIs.

News and Content (18 domains)

Domain      | Playwright (ms) | Unbrowse (ms) | Speedup
CNN         | 3,800           | 680           | 5.6x
BBC         | 3,200           | 550           | 5.8x
TechCrunch  | 4,100           | 720           | 5.7x
Reuters     | 3,500           | 600           | 5.8x
The Verge   | 4,500           | 800           | 5.6x
Hacker News | 1,800           | 180           | 10.0x

Category average: 6.1x speedup

News sites showed the largest category speedup. These sites typically have clean, fast content APIs behind heavy frontend rendering (ads, tracking scripts, cookie consent modals, newsletter popups). Stripping away the rendering overhead reveals extremely fast underlying APIs.

Hacker News is an outlier at 10x because its API is exceptionally simple and fast (plain JSON, no auth, minimal payload), while its Playwright rendering time includes Algolia-powered search widget initialization and comment tree rendering.

Developer Tools (12 domains)

Domain         | Playwright (ms) | Unbrowse (ms) | Speedup
GitHub         | 3,900           | 850           | 4.6x
Stack Overflow | 3,100           | 700           | 4.4x
npm            | 2,600           | 480           | 5.4x
PyPI           | 2,200           | 380           | 5.8x
MDN Web Docs   | 2,800           | 520           | 5.4x
Docker Hub     | 3,400           | 750           | 4.5x

Category average: 5.0x speedup

Developer tools showed strong improvements because these sites tend to have clean, well-structured APIs (many of which are even documented). GitHub's REST and GraphQL APIs are among the best-designed on the web. npm and PyPI have straightforward JSON registries.

Financial Data (10 domains)

Domain         | Playwright (ms) | Unbrowse (ms) | Speedup
Yahoo Finance  | 4,800           | 1,200         | 4.0x
Google Finance | 3,600           | 950           | 3.8x
CoinGecko      | 3,200           | 680           | 4.7x
TradingView    | 5,500           | 1,500         | 3.7x

Category average: 4.0x speedup

Financial sites showed consistent improvements. Real-time data feeds are naturally API-driven, and the frontend rendering overhead (charting libraries, real-time update handlers) is substantial.

Distribution Analysis

The speedup is not uniformly distributed across domains:

  • 5% of domains: >8x speedup (sites with very fast APIs and heavy frontend rendering)
  • 25% of domains: 5-8x speedup (most content and developer sites)
  • 50% of domains: 3-5x speedup (the majority, including e-commerce and social media)
  • 20% of domains: 2-3x speedup (sites with complex APIs or fast rendering)

No domain showed less than a 2x speedup. The floor of improvement comes from the inherent overhead of browser rendering vs. direct HTTP requests: even on the simplest sites, rendering a page in a headless browser takes at least 1-2 seconds, while an API call to a well-optimized endpoint takes 100-500ms.

Success Rate Comparison

Category        | Playwright Success | Unbrowse Success
E-commerce      | 82%                | 95%
Social Media    | 71%                | 88%
Search          | 85%                | 90%
News/Content    | 90%                | 97%
Developer Tools | 88%                | 96%
Financial       | 80%                | 92%
Overall         | 82%                | 93%

Playwright's failures were primarily due to:

  • Anti-bot detection and blocking (42% of failures)
  • Timeout on slow-rendering pages (28% of failures)
  • Dynamic content not loaded by extraction time (18% of failures)
  • Selector mismatch due to A/B testing variants (12% of failures)

Unbrowse's failures were primarily due to:

  • Endpoint authentication expired or changed (45% of failures)
  • Rate limiting on the underlying API (30% of failures)
  • Route not in cache, fallback timeout (25% of failures)

The failure modes are fundamentally different. Playwright fails because browsers are detectable and pages are complex. Unbrowse fails because APIs have access controls and rate limits. The Unbrowse failure modes are more predictable and easier to address (refresh auth, respect rate limits, expand cache).

Token Consumption Analysis

For AI agents, token consumption directly impacts cost and context window usage:

Approach                        | Avg Tokens per Request | For 94 Domains (10 runs each)
Playwright (accessibility tree) | 8,500                  | 7,990,000
Playwright (screenshot + OCR)   | 12,000                 | 11,280,000
Unbrowse (JSON response)        | 450                    | 423,000

Unbrowse uses approximately 19x fewer tokens than Playwright's accessibility tree approach and 27x fewer tokens than screenshot-based approaches. For models charging $3 per million input tokens, this is the difference between roughly $24 and $1.27 per benchmark run -- or, extrapolated to production, about $25,500 vs. $1,350 per month for an agent making 1 million web requests.
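The cost arithmetic is a one-liner. A back-of-envelope helper, assuming a flat $3-per-million-input-token price (real pricing varies by model and provider):

```typescript
// Cost in USD for a batch of requests at a flat input-token price.
function tokenCostUsd(
  tokensPerRequest: number,
  requests: number,
  usdPerMillionTokens = 3,
): number {
  return (tokensPerRequest * requests * usdPerMillionTokens) / 1_000_000;
}
```

Plug in the per-request token counts from the table (8,500 or 450) with the benchmark's 940 requests, or 1,000,000 requests for the production extrapolation.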

Methodology Notes

Fairness considerations:

  • Playwright was tested with optimal configuration (auto-wait, headless mode, network idle detection)
  • Unbrowse routes were pre-indexed (representing a warm cache scenario, which is the expected production state for popular domains)
  • Cold-start scenario (Unbrowse with no cache) was not benchmarked because it degrades to browser automation for the initial session
  • Both tools ran on the same hardware (M2 MacBook Pro, 16GB RAM, macOS Sonoma)
  • Network conditions were consistent (residential broadband, ~50ms latency to most US-based servers)

What we did not test:

  • Interactive tasks (form filling, multi-step workflows) where browser automation is required
  • Server-rendered sites with no client-side API calls
  • Sites behind CAPTCHA walls where both approaches would fail without solving
  • Mobile app APIs (Unbrowse focuses on web traffic)

Implications

For AI Agent Developers

The 3.6x speedup and 19x token reduction mean that agents using API-first approaches can:

  • Process 3.6x more web requests in the same time window
  • Fit 19x more web data into the same context window
  • Spend 95% less on web data retrieval at scale
  • Achieve 93% vs. 82% success rate on data retrieval tasks

For the Browser Automation Industry

Browser automation is not dead -- it is being correctly scoped. The benchmark shows that browser automation is the wrong tool for data retrieval, which is the majority of agent web access. It remains the right tool for interactive tasks, visual testing, and sites without discoverable APIs.

The future architecture is hybrid: API-first for data, browser for interaction. The tools that win will be the ones that automate the transition between these modes.

For the Scraping Industry

The $530 per 1,000 browser actions cost structure is under pressure. As API discovery tools improve and shared route caches grow, the marginal cost of web data retrieval approaches the cost of a single HTTP request. Scraping services that charge per-page fees based on browser rendering costs will face pricing pressure from API-first alternatives.

Reproducing the Benchmark

The benchmark methodology is described in full in the arXiv paper (2604.00694). To reproduce:

# Install Unbrowse
npx unbrowse setup

# Run the programmatic eval suite
bun run eval:codex:product-success

# Results are written to:
# evals/codex-harness-last-run.json
# evals/codex-harness-last-run.review-queue.json

The eval harness tests resolve quality across indexed domains, with agent review of shortlist quality and optional execute verification.

Conclusion

Across 94 domains, direct API calls are faster than browser automation 100% of the time. The mean speedup is 3.6x, the median is 5.4x, and the best cases exceed 10x. Token consumption drops by 19x, success rates improve by 11 percentage points, and the cost per request drops by two orders of magnitude.

These are not marginal improvements. They represent a fundamental architectural advantage: calling the data source directly is always faster than rendering the data source's presentation layer and scraping it.

The benchmark data supports a simple conclusion: for data retrieval tasks, browser automation is an unnecessarily expensive and unreliable indirection layer. The APIs are there. Call them.

The full paper is available at arxiv.org/abs/2604.00694.