Build an AI Research Assistant That Reads the Whole Web

Build an AI research assistant that collects data from multiple websites, extracts structured information via shadow APIs, and produces comprehensive summaries. No scraping frameworks needed.

Lewis Tham
April 3, 2026

Research agents are the killer app for AI. You describe what you want to know, and the agent fans out across the web, collects information from dozens of sources, and returns a structured summary. The problem is that building one has traditionally required a fragile stack of scrapers, parsers, and API clients.

Unbrowse changes the equation. Instead of writing custom code for every website, your research agent uses a single interface — unbrowse.resolve() — to extract structured data from any site. Unbrowse discovers the shadow APIs behind every page and calls them directly.

The result: a research assistant you can build in an afternoon that works across the entire web.

Why Traditional Research Agents Break

Most research agents follow this pattern:

  1. Search Google for a topic
  2. Visit each result page
  3. Parse the HTML to extract relevant text
  4. Feed the text to an LLM for summarization

Step 3 is where everything falls apart. Every website has different HTML structure. News sites use one layout, academic papers another, forums a third. You end up writing dozens of parsers, and they all break when sites update their markup.

Even tools like BeautifulSoup or Cheerio require you to understand each site's DOM structure. And if you use a headless browser to render JavaScript-heavy pages, you are paying the cost of launching Chromium for every single page.
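
To make the fragility concrete, here is a minimal sketch of what a hand-written parser looks like: a regex tied to one site's markup (the class names are invented for illustration). One redesign and it silently returns nothing:

```javascript
// A parser hard-coded to one site's markup. The "article-title"
// class name is an assumption about one particular site's HTML.
function extractTitles(html) {
  return [...html.matchAll(/<h2 class="article-title">([^<]+)<\/h2>/g)].map((m) => m[1]);
}

// Works against the markup it was written for...
const v1 = `<h2 class="article-title">Attention Is All You Need</h2>`;
extractTitles(v1); // → ["Attention Is All You Need"]

// ...and silently returns nothing after a redesign renames the class.
const v2 = `<h2 class="post-heading">Attention Is All You Need</h2>`;
extractTitles(v2); // → []
```

Multiply this by every site your agent touches and you have a maintenance treadmill.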

The Unbrowse Research Pattern

With Unbrowse, your research agent does not need to understand HTML at all. Every website has internal APIs that return structured data — article text, metadata, comments, related content. Unbrowse discovers these APIs and lets your agent call them directly.
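
A concrete, public example of such an API: Hacker News search is backed by Algolia's JSON endpoint, which you can call directly. The `hnSearchUrl` helper here is ours, not part of any SDK:

```javascript
// Hacker News search is served by a public JSON API at
// hn.algolia.com/api/v1/search — structured hits, no HTML.
function hnSearchUrl(query) {
  return `https://hn.algolia.com/api/v1/search?query=${encodeURIComponent(query)}`;
}

// Fetching hnSearchUrl("transformer architectures") returns JSON with
// a `hits` array (titles, URLs, points) — nothing to parse.
```

Most sites have endpoints like this behind their pages; the difference is that they are usually undocumented, and discovering them is the work Unbrowse automates.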

Here is a research assistant that investigates any topic across multiple sources:

import { Unbrowse } from "@unbrowse/sdk";

const unbrowse = new Unbrowse();

async function research(topic) {
  const sources = [
    { name: "Hacker News", url: `https://hn.algolia.com/?q=${encodeURIComponent(topic)}` },
    { name: "Reddit", url: `https://www.reddit.com/search/?q=${encodeURIComponent(topic)}` },
    { name: "Wikipedia", url: `https://en.wikipedia.org/wiki/${encodeURIComponent(topic)}` },
    { name: "ArXiv", url: `https://arxiv.org/search/?query=${encodeURIComponent(topic)}` },
  ];

  const findings = await Promise.all(
    sources.map(async (source) => {
      try {
        const result = await unbrowse.resolve({
          intent: `find information about ${topic}`,
          url: source.url,
        });
        return {
          source: source.name,
          data: result.data,
          endpoints: result.endpoints?.length || 0,
        };
      } catch (err) {
        // One unreachable source should not reject the whole fan-out.
        return { source: source.name, error: err.message };
      }
    })
  );

  return findings;
}

const results = await research("transformer architecture improvements 2025");
console.log(JSON.stringify(results, null, 2));

This fans out to four major sources in parallel. Each resolve() call discovers and uses the shadow APIs behind that site's search functionality. The results come back as structured JSON — not raw HTML you need to parse.

Deep Dive: Multi-Source Data Collection

For serious research, you want to go deeper than search results. Here is how to collect detailed information from each source:

import { Unbrowse } from "@unbrowse/sdk";

const unbrowse = new Unbrowse();

async function deepResearch(topic) {
  // Step 1: Collect search results from multiple sources
  const searchResults = await unbrowse.resolve({
    intent: `search for recent discussions and articles about ${topic}`,
    url: `https://www.reddit.com/search/?q=${encodeURIComponent(topic)}&sort=relevance&t=month`,
  });

  const posts = searchResults.data?.posts || searchResults.data?.results || [];

  // Step 2: For each top result, get the full content
  const detailed = [];
  for (const post of posts.slice(0, 5)) {
    if (post.url || post.permalink) {
      const fullPost = await unbrowse.resolve({
        intent: `get full content, comments, and discussion`,
        url: post.url || `https://www.reddit.com${post.permalink}`,
      });
      detailed.push({
        title: post.title,
        content: fullPost.data,
      });
    }
  }

  return { topic, sources: detailed };
}

Notice the pattern: first resolve gets search results, then we drill into each result for full content. Unbrowse caches routes it discovers, so the second and subsequent calls to the same domain are nearly instant.
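
That caching behavior can be approximated with a small sketch. This is not the SDK's actual internals — `discoverRoutes` and `callRoute` are hypothetical stand-ins — but it shows why only the first visit to a domain pays the discovery cost:

```javascript
// Split resolution into two phases: route discovery (slow, once per
// domain) and the data call itself (fast). discoverRoutes and
// callRoute are stand-ins, not real SDK functions.
function makeResolver({ discoverRoutes, callRoute }) {
  const routeCache = new Map(); // hostname -> discovered routes
  return async function resolve({ intent, url }) {
    const host = new URL(url).hostname;
    if (!routeCache.has(host)) {
      routeCache.set(host, await discoverRoutes(host)); // paid once per domain
    }
    return callRoute(routeCache.get(host), { intent, url }); // cheap every time
  };
}
```

In the deep-research loop above, the first Reddit call pays discovery; the five drill-down calls reuse the cached routes.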

Building a Research Report Generator

Combine Unbrowse with an LLM to produce polished research reports:

import { Unbrowse } from "@unbrowse/sdk";

const unbrowse = new Unbrowse();

async function generateReport(topic) {
  // Collect from diverse source types
  const sources = {
    academic: await unbrowse.resolve({
      intent: `find recent academic papers about ${topic}`,
      url: `https://arxiv.org/search/?query=${encodeURIComponent(topic)}&searchtype=all`,
    }),
    community: await unbrowse.resolve({
      intent: `find community discussions about ${topic}`,
      url: `https://news.ycombinator.com/`,
      params: { q: topic },
    }),
    encyclopedia: await unbrowse.resolve({
      intent: `get overview of ${topic}`,
      url: `https://en.wikipedia.org/wiki/${encodeURIComponent(topic)}`,
    }),
  };

  // Structure the findings
  const report = {
    topic,
    generated: new Date().toISOString(),
    sections: {
      overview: sources.encyclopedia.data,
      recentResearch: sources.academic.data,
      communityPerspective: sources.community.data,
    },
  };

  return report;
}
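
To hand this report to an LLM, you need to flatten it into a prompt. A minimal sketch — the section headings and the 4,000-character truncation limit are arbitrary choices:

```javascript
// Flatten a report object into a plain-text prompt for an LLM.
// Non-string section data is serialized as JSON and truncated.
function reportToPrompt(report, maxCharsPerSection = 4000) {
  const lines = [`Write a research summary on "${report.topic}".`, ""];
  for (const [name, data] of Object.entries(report.sections)) {
    const text = typeof data === "string" ? data : JSON.stringify(data);
    lines.push(`## ${name}`, text.slice(0, maxCharsPerSection), "");
  }
  return lines.join("\n");
}
```

Truncation matters: raw resolved payloads can be large, and you want the prompt to fit comfortably in the model's context window.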

Monitoring Research Topics Over Time

Research is not a one-shot task. For ongoing topics, set up continuous monitoring:

import { Unbrowse } from "@unbrowse/sdk";
import { writeFileSync, readFileSync, existsSync } from "fs";

const unbrowse = new Unbrowse();
const STATE_FILE = "./research-state.json";

function loadState() {
  return existsSync(STATE_FILE)
    ? JSON.parse(readFileSync(STATE_FILE, "utf-8"))
    : { topics: {}, lastRun: null };
}

function saveState(state) {
  writeFileSync(STATE_FILE, JSON.stringify(state, null, 2));
}

async function monitorTopic(topic) {
  const state = loadState();

  const result = await unbrowse.resolve({
    intent: `find new developments about ${topic} in the last week`,
    url: `https://news.ycombinator.com/`,
    params: { q: topic },
  });

  const newItems = result.data?.hits || [];
  const previousIds = state.topics[topic]?.seenIds || [];
  const unseen = newItems.filter((item) => !previousIds.includes(item.objectID));

  state.topics[topic] = {
    seenIds: [...previousIds, ...unseen.map((i) => i.objectID)],
    lastCheck: new Date().toISOString(),
    newCount: unseen.length,
  };

  saveState(state);

  if (unseen.length > 0) {
    console.log(`Found ${unseen.length} new items for "${topic}"`);
    unseen.forEach((item) => console.log(` - ${item.title}`));
  }

  return unseen;
}

The state file saves progress incrementally, so the monitor can be interrupted and resumed without missing items or re-processing old results.
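
One caveat with the pattern above: seenIds grows without bound. A small helper that merges new IDs and keeps only the most recent entries (the cap of 500 is arbitrary):

```javascript
// Merge newly seen IDs into the history, dropping duplicates and
// keeping only the most recent maxIds entries so the state file
// stays small. The default cap of 500 is an arbitrary choice.
function updateSeen(previousIds, newIds, maxIds = 500) {
  const merged = [...previousIds, ...newIds.filter((id) => !previousIds.includes(id))];
  return merged.slice(-maxIds);
}
```

Swapping this in for the seenIds spread in monitorTopic keeps research-state.json from growing with every run.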

Cross-Referencing Sources

The real power of a research assistant is cross-referencing information from multiple sources. Here is a pattern for finding consensus and contradictions:

import { Unbrowse } from "@unbrowse/sdk";

const unbrowse = new Unbrowse();

async function crossReference(topic) {
  const sources = [
    { name: "Wikipedia", url: `https://en.wikipedia.org/wiki/${encodeURIComponent(topic)}` },
    { name: "Reddit", url: `https://www.reddit.com/search/?q=${encodeURIComponent(topic)}` },
    { name: "HackerNews", url: `https://hn.algolia.com/?q=${encodeURIComponent(topic)}` },
  ];

  const results = await Promise.all(
    sources.map(async (s) => {
      const r = await unbrowse.resolve({ intent: `get key facts about ${topic}`, url: s.url });
      return { source: s.name, data: r.data };
    })
  );

  return {
    topic,
    sourceCount: results.length,
    findings: results,
    // Feed this to an LLM to identify consensus vs. contradictions
  };
}
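
Before handing the findings to an LLM, a cheap pre-pass can flag titles that show up in more than one source. The `data.items` and `title` field names here are assumptions about the resolved shape, not guarantees:

```javascript
// Naive overlap check: which normalized titles appear in more than
// one source's findings. Field names are illustrative assumptions.
function findOverlap(findings) {
  const bySource = new Map(); // normalized title -> set of source names
  for (const { source, data } of findings) {
    for (const item of data?.items ?? []) {
      const key = item.title?.trim().toLowerCase();
      if (!key) continue;
      if (!bySource.has(key)) bySource.set(key, new Set());
      bySource.get(key).add(source);
    }
  }
  return [...bySource]
    .filter(([, sources]) => sources.size > 1)
    .map(([title, sources]) => ({ title, sources: [...sources] }));
}
```

Items that surface in multiple independent sources are good candidates to highlight as consensus; the LLM can then focus on the contradictions.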

Performance at Scale

Research agents need to be fast. Waiting 30 seconds per page load kills any workflow that touches dozens of sources.

| Metric | Playwright + Parser | Unbrowse |
| --- | --- | --- |
| Per-page extraction | 3-10 seconds | 50-200 ms (cached) |
| Memory per source | 200+ MB | ~15 MB |
| New domain, first visit | N/A (needs custom parser) | 3-8 seconds (auto-discovery) |
| Structured output | Requires custom parsing | Automatic |
| Cross-site consistency | Every site is different | Uniform JSON |

For a research task hitting 20 sources with cached routes, Unbrowse completes in about 4 seconds total. The same task with Playwright would take 2-3 minutes and require 4+ GB of RAM.
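
When fanning out to 20+ sources, you may also want to bound concurrency rather than launch every request at once. A small, dependency-free limiter sketch:

```javascript
// Run fn over items with at most `limit` tasks in flight.
// Results come back in input order regardless of completion order.
async function mapWithLimit(items, limit, fn) {
  const results = new Array(items.length);
  let next = 0;
  async function worker() {
    while (next < items.length) {
      const i = next++; // claim the next index (single-threaded JS, safe)
      results[i] = await fn(items[i], i);
    }
  }
  await Promise.all(Array.from({ length: Math.min(limit, items.length) }, worker));
  return results;
}

// e.g. mapWithLimit(sources, 5, (s) => unbrowse.resolve({ intent, url: s.url }))
```

Five in flight is usually enough to saturate cached routes without hammering any domain that still needs first-visit discovery.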

The Marketplace Effect

Every source your research agent touches gets indexed in the Unbrowse marketplace. This means:

  • The first time you research on a new site, Unbrowse discovers and publishes the routes
  • Every subsequent research task — by you or any other agent — uses those cached routes
  • You earn x402 micropayments when other agents use routes you discovered
  • The research assistant gets faster over time as the marketplace grows

This is the network effect at work. A research agent is not just consuming data — it is building infrastructure that benefits every agent on the network.

Getting Started

git clone --single-branch --depth 1 https://github.com/unbrowse-ai/unbrowse.git ~/unbrowse
cd ~/unbrowse && ./setup --host off

Or for SDK-based development:

npm install @unbrowse/sdk

Start with a single topic and a few sources. As you add more sources, the agent gets more powerful — and thanks to the marketplace, faster. Every research task makes the next one easier.