
Unbrowse vs Scrapy: Traditional Scraping vs API-First

Scrapy parses the web's presentation layer. Unbrowse taps the data layer directly. Compare Python scraping frameworks vs API-native resolution for modern web data.

Lewis Tham
April 3, 2026

Scrapy has been the Python web scraping workhorse for over a decade. It is fast, battle-tested, and powers production crawlers at thousands of companies. But Scrapy was built for a web that no longer exists — a web of static HTML documents where data lived in the markup. Today, nearly every website is a JavaScript application that fetches data from APIs and renders it client-side. Unbrowse targets those APIs directly, skipping both the rendering and the scraping.

TL;DR Comparison

| Feature | Scrapy | Unbrowse |
| --- | --- | --- |
| Approach | CSS/XPath selectors on HTML | API-native — discovers and calls shadow APIs |
| Speed | Fast for static HTML (~1-3s per page) | Sub-100ms cached, ~3,400ms first pass |
| JavaScript support | None (needs Splash/Playwright plugin) | Full — Kuri browser handles JS-rendered pages |
| Token cost | N/A (not LLM-native) | Structured JSON, 40x fewer tokens than DOM |
| Auth handling | Manual cookie/session management | Automatic browser cookie extraction |
| Pricing | Free (OSS) | Free tier + x402 micropayments |
| Best for | Large-scale static HTML crawling | AI agents, dynamic sites, API-first data retrieval |

What is Scrapy?

Scrapy is a Python framework for web crawling and scraping at scale. Released in 2008, it provides a complete toolkit for building spiders that crawl websites, extract data using CSS selectors or XPath expressions, and store results in various formats. It handles request scheduling, rate limiting, retry logic, and data pipelines out of the box.

Scrapy's architecture is built around the concept of spiders — classes that define how to navigate a site and what data to extract. You write selectors that target specific HTML elements, define item schemas for structured output, and configure pipelines to process and store the data. The framework's asynchronous engine (built on Twisted) can handle thousands of concurrent requests efficiently.
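Pipelines in particular are just plain Python classes exposing a process_item() hook. A minimal sketch (the pipeline name and price format are illustrative, not taken from Scrapy's docs):

```python
class PriceNormalizationPipeline:
    """A Scrapy-style item pipeline: strip currency formatting from prices."""

    def process_item(self, item, spider):
        # Turn a scraped string like '$1,299.99' into a float.
        raw = str(item.get('price', '')).lstrip('$').replace(',', '')
        item['price'] = float(raw)
        return item

# Pipelines are normally wired up via ITEM_PIPELINES in settings.py;
# calling one directly shows what the framework does per item.
pipeline = PriceNormalizationPipeline()
item = pipeline.process_item({'name': 'Widget Pro', 'price': '$1,299.99'}, spider=None)
# item['price'] is now 1299.99
```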

The ecosystem is extensive. Scrapy-splash adds JavaScript rendering via a headless browser. Scrapy-playwright integrates Playwright for modern SPAs. Scrapy Cloud (from Zyte, formerly Scrapinghub) provides managed infrastructure. The community has built middleware for proxy rotation, CAPTCHA solving, and anti-bot evasion.

Scrapy's fundamental limitation in 2026 is its HTML-centric worldview. It parses rendered DOM trees to find data. But on modern websites, the data visible on the page was fetched from a backend API before the page rendered. Scrapy parses the output of that API call — the rendered HTML — instead of intercepting the API call itself.

What is Unbrowse?

Unbrowse is an API-native agent browser. Rather than parsing HTML to extract data, it discovers the APIs that websites use internally (shadow APIs) and calls them directly.

The process works in two phases. In the discovery phase, Unbrowse opens a real browser session using Kuri — a Zig-native CDP broker at 464KB with approximately 3ms cold start. As the page loads and renders, Unbrowse captures all network traffic: every API call, fetch request, and XHR. The enrichment pipeline reverse-engineers endpoint signatures, extracts auth patterns, and stores everything as reusable route definitions.

In the resolution phase, Unbrowse matches an intent to a cached route and makes a direct HTTP call. No browser, no HTML parsing, no CSS selectors. The API returns structured JSON — the same payload the website's frontend originally consumed. Response times are sub-100ms for cached routes.

A shared marketplace multiplies this: routes discovered by any user are available to all. Contributors earn x402 micropayments when their routes are used by others.
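The two-phase model can be sketched in a few lines. Everything here is hypothetical — the cache key, route schema, and function name are illustrative, not Unbrowse's actual internals:

```python
# Hypothetical route cache populated by a discovery pass, keyed by (host, intent).
ROUTE_CACHE = {
    ("example.com", "product listings"): {
        "method": "GET",
        "url": "https://example.com/api/products?category=electronics&page=1",
    },
}

def resolve(host: str, intent: str) -> dict:
    """Match an intent to a cached route; fail fast if discovery hasn't run."""
    route = ROUTE_CACHE.get((host, intent))
    if route is None:
        raise LookupError("no cached route; run a browser discovery pass first")
    # A real resolver would now issue the HTTP call (with extracted auth
    # headers) and return the JSON body. Here we just return the route.
    return route
```

The point of the sketch: resolution is a dictionary lookup plus one HTTP request, which is why cached routes answer in tens of milliseconds rather than seconds.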

Key Differences

Data Extraction: Parsing vs. Interception

Scrapy's entire paradigm is extraction from rendered content. You inspect a page's HTML, find the elements containing your target data, write selectors to extract them, and handle edge cases when the HTML structure changes.

# Scrapy: parse rendered HTML
def parse(self, response):
    for product in response.css('.product-card'):
        yield {
            'name': product.css('.title::text').get(),
            'price': product.css('.price::text').get(),
            'url': product.css('a::attr(href)').get(),
        }

Unbrowse skips this entirely. The product data was loaded via an API call like GET /api/products?category=electronics&page=1. Unbrowse captures that call and returns the raw JSON response:

{
  "products": [
    {"name": "Widget Pro", "price": 29.99, "url": "/products/widget-pro"},
    {"name": "Gadget Max", "price": 49.99, "url": "/products/gadget-max"}
  ]
}

No selectors to write. No selectors to break when the frontend redesigns.
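For contrast with the selector code above, the same fields come straight out of the JSON payload with plain key access:

```python
import json

# The API response shown earlier, consumed directly: no CSS selectors,
# no DOM traversal, just dictionary and list access.
response_body = '''{
  "products": [
    {"name": "Widget Pro", "price": 29.99, "url": "/products/widget-pro"},
    {"name": "Gadget Max", "price": 49.99, "url": "/products/gadget-max"}
  ]
}'''

products = json.loads(response_body)["products"]
names = [p["name"] for p in products]  # ['Widget Pro', 'Gadget Max']
```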

JavaScript-Rendered Content

Scrapy cannot execute JavaScript. For modern SPAs — which constitute the majority of the web in 2026 — Scrapy sees an empty shell: a <div id="root"></div> with some script tags. Getting actual content requires adding Splash (a headless browser rendering service) or integrating Playwright, adding significant complexity and latency.

Unbrowse handles JavaScript natively because it uses a real browser (Kuri) for discovery. But more importantly, JavaScript rendering is irrelevant for the resolution phase. The data loaded by JavaScript came from an API. Unbrowse calls that API directly. Whether the frontend uses React, Vue, Svelte, or vanilla JavaScript is immaterial.

Maintenance Burden

Scrapy spiders are notoriously fragile. A CSS class rename, a layout restructure, a switch from server-side to client-side rendering — any of these breaks your selectors. Teams running Scrapy at scale report spending 30-50% of their engineering time maintaining existing spiders against website changes.

Unbrowse targets APIs, not DOM elements. APIs are versioned and backward-compatible by convention. When they do change, the next browser discovery pass automatically captures the new endpoint. The maintenance burden drops from constant selector repair to occasional re-discovery.

Concurrency Model

Scrapy excels at concurrent crawling. Its Twisted-based engine can manage thousands of parallel requests with configurable rate limiting. For large-scale static HTML crawling — sitemaps, documentation sites, content archives — this is efficient.

Unbrowse is not a crawler. It is a resolver. It does not traverse links or build sitemaps. It answers specific data requests: "get the search results for X," "fetch the product details for Y." For AI agents, this is the right abstraction. Agents do not need to crawl — they need answers.

Integration with AI Agents

Scrapy was built before the LLM era. Using it in an agent pipeline means calling Scrapy to fetch HTML, converting that HTML to text, and sending the text to the LLM. The conversion loses structure and wastes tokens.

Unbrowse is agent-native. It integrates as an MCP server that any AI agent can call. The resolve endpoint accepts natural language intents and returns structured JSON. Token consumption is minimal because the response is already the structured data the agent needs.

Getting Started

Scrapy

# Install
pip install scrapy

# Create project
scrapy startproject myproject
cd myproject

# Write spider (product_spider.py)
import scrapy

class ProductSpider(scrapy.Spider):
    name = 'products'
    start_urls = ['https://example.com/products']

    def parse(self, response):
        for product in response.css('.product-card'):
            yield {
                'name': product.css('.title::text').get(),
                'price': product.css('.price::text').get(),
            }

# Run
scrapy crawl products -o products.json

Unbrowse

# Install
npx unbrowse setup

# Resolve — no selectors, no spider, no project scaffolding
npx unbrowse resolve "product listings on example.com"

Or via MCP:

{
  "tool": "unbrowse_resolve",
  "input": {
    "intent": "product listings",
    "url": "https://example.com/products"
  }
}

When to Use Scrapy

Scrapy remains excellent for specific use cases: large-scale crawling of static HTML sites (documentation, government portals, academic archives) where you need to traverse thousands of pages by following links; building structured datasets from server-rendered content where the HTML is the canonical source; and compliance or archival work that requires the full rendered page.

Scrapy also has an advantage in pipeline maturity. Its item pipelines, feed exports, and middleware ecosystem are battle-hardened by over 15 years of production use. If your data engineering team already has Scrapy infrastructure, the switching cost is real.

But if you are building AI agents that need web data, writing CSS selectors and maintaining spiders is the wrong abstraction. The data your agent needs is already structured — it just got rendered into HTML before you arrived.

The Bottom Line

Scrapy parses the web's presentation layer. Unbrowse taps the data layer directly.

For AI agents and modern data retrieval, the advantages are clear: sub-100ms responses versus multi-second crawls, structured JSON versus HTML parsing, zero selector maintenance, automatic auth, and a marketplace that grows with every user. Scrapy solved web scraping for the HTML era. Unbrowse solves data retrieval for the API era.

Get started at unbrowse.ai or read the paper at arXiv.