A scraper-first, Pi-native, and local-first extension for the Pi ecosystem.
pi-scraper reads known URLs and sites. Use it to scrape, summarize one page, crawl, map URLs, diff snapshots, retrieve stored results, or download/extract deterministic/structured data, including CloakBrowser-backed browser mode with C++ fingerprint patches and persistent sessions.
Install the extension via the Pi CLI:

```bash
pi install npm:pi-scraper
```

Ask naturally; Pi can choose the right web tool automatically:
- "Read https://example.com as markdown."
- "List all URLs available from https://example.com."
- "Crawl https://example.com, up to 25 pages."
- "Compare https://example.com against my homepage snapshot."
- "Open https://example.com/login in browser mode, save the session, then scrape /dashboard."
pi-scraper intelligently escalates its scraping strategy to balance speed and capability.
| Mode | Best Use Case |
|---|---|
| `fast` | Static HTML, documentation, and quick text extraction. |
| `fingerprint` | Sites that block simple bots (uses TLS fingerprinting). |
| `readable` | Articles and blogs where noise reduction is critical. |
| `browser` | Heavily JS-rendered sites (uses CloakBrowser by default). |
| `auto` | Default. Automatically selects the best path based on signals. |
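The signals that `auto` mode uses are not described here. As a rough mental model only, escalation can be pictured as trying the cheapest mode first and moving to a heavier one when an attempt fails. Every name in this sketch (`Attempt`, `scrapeWithEscalation`, `stubFetch`) is hypothetical and not part of pi-scraper's API, and the real selection logic may differ.

```typescript
// Hypothetical sketch: none of these names belong to pi-scraper.
type Mode = "fast" | "fingerprint" | "browser";

interface Attempt {
  ok: boolean; // true when usable content came back
  content: string;
}

// Assumed order: cheapest strategy first, full browser last.
const ESCALATION_ORDER: Mode[] = ["fast", "fingerprint", "browser"];

function scrapeWithEscalation(
  url: string,
  fetchWith: (mode: Mode, url: string) => Attempt,
): { mode: Mode; content: string } {
  for (const mode of ESCALATION_ORDER) {
    const attempt = fetchWith(mode, url);
    if (attempt.ok) {
      return { mode, content: attempt.content };
    }
  }
  throw new Error(`All modes failed for ${url}`);
}

// Stub fetcher for illustration: pretends the site blocks the plain client.
const stubFetch = (mode: Mode, _url: string): Attempt =>
  mode === "fast"
    ? { ok: false, content: "" }
    : { ok: true, content: `<html via ${mode}>` };

const escalated = scrapeWithEscalation("https://example.com", stubFetch);
// escalated.mode is "fingerprint": the first mode whose attempt succeeded.
```

Passing the fetcher in as a parameter keeps the sketch free of network calls, so the escalation order can be checked in isolation.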
| Tool | Capability | Best For | Contract |
|---|---|---|---|
| `web_scrape` | Local | Reading a single URL as Markdown, Text, or HTML. | 308 tokens |
| `web_crawl` | Resumable | BFS crawling to build local datasets or context packages. | 158 tokens |
| `web_map` | Discovery | Inventorying URLs via robots.txt, sitemaps, and llms.txt. | 58 tokens |
| `web_batch` | Bulk | Scraping multiple independent URLs concurrently. | 195 tokens |
| `web_extract` | Structured | Deterministic, selector-based, or LLM-backed extraction. | 337 tokens |
| `web_get_result` | Retrieval | Accessing stored results, job manifests, or snapshots. | 56 tokens |

Note: Contract is the total number of tokens in the tool declaration.
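`web_crawl` is described as a BFS (breadth-first) crawler. The sketch below shows the general technique with page and depth limits over an in-memory link graph. It is an illustration under stated assumptions, not pi-scraper's implementation: `bfsCrawl` and the `links` map are hypothetical, and only the parameter names `maxPages` and `maxDepth` are borrowed from the tool.

```typescript
// Hypothetical sketch of breadth-first crawling with page and depth limits.
// `links` stands in for the network: it maps each URL to the URLs it links to.
function bfsCrawl(
  seed: string,
  links: Map<string, string[]>,
  maxPages: number,
  maxDepth: number,
): string[] {
  const visited = new Set<string>([seed]);
  const order: string[] = [];
  let frontier: string[] = [seed];

  for (let depth = 0; depth <= maxDepth && frontier.length > 0; depth++) {
    const next: string[] = [];
    for (const url of frontier) {
      if (order.length >= maxPages) return order;
      order.push(url); // stands in for fetching the page
      for (const target of links.get(url) ?? []) {
        if (!visited.has(target)) {
          visited.add(target);
          next.push(target);
        }
      }
    }
    frontier = next; // move one level deeper
  }
  return order;
}

const site = new Map<string, string[]>([
  ["/", ["/docs", "/blog"]],
  ["/docs", ["/docs/api", "/"]],
  ["/blog", ["/blog/post-1"]],
]);

const crawled = bfsCrawl("/", site, 25, 1);
// crawled is ["/", "/docs", "/blog"]: depth 1 stops before "/docs/api".
```

Processing one full level before the next is what makes the crawl breadth-first, and it lets the depth limit be enforced with a simple loop counter.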
| Area | Parameters | Description |
|---|---|---|
| Shared | `sessionId`, `saveSession`, `clearSession`, `stealth`, `autoWait`, `browserBackend`, `proxy`, `headers`, `provider` | Sessions, browser controls, and LLM provider selection. |
| Scrape | `url`, `urls`, `content`, `task`, `mode`, `format`, `refresh`, `respectRobots`, `timeoutSeconds` | Targets, tasks (read/summarize), and fetch behavior. |
| Limits | `maxBytes`, `maxChars`, `onlyMainContent` | Size limits and content cleaning. |
| Filtering | `include`, `exclude`, `linesMatching`, `contextLines`, `caseSensitive` | Glob patterns and line-based content filtering. |
| Redirection | `followAlternates`, `followMetaRefresh` | Controls for non-standard redirects. |
| Snapshots | `snapshotName`, `snapshotTag`, `diff`, `compareTag`, `maxSnapshotAgeSeconds` | Versioning and diffing baselines. |
| Crawl | `action`, `maxPages`, `maxDepth`, `sameOrigin`, `concurrency`, `resume`, `crawlId`, `compile`, `seed`, `seedSitemap`, `status`, `limit`, `extract` | BFS discovery, limits, and state management. |
| Extract | `action`, `extractor`, `prompt`, `schema`, `selector`, `selectorType`, `attribute`, `adaptive`, `bullets`, `sentences`, `identifier`, `autoSave`, `threshold`, `extractSchema` | Vertical, ad-hoc, and selector extraction. |
| Patterns | `markers`, `contains`, `excerpts`, `regexes`, `sections`, `jsonPaths`, `sourceFormat`, `length` | Deterministic inspection: strings, regex, and ranges. |
| Map | `url`, `maxSitemaps` | Site-wide discovery of robots.txt and sitemaps. |
| Storage | `saveToFile` | `true` or `{dir, filename, maxBytes}` for disk storage. |
| Retrieval | `responseId`, `jobId`, `snapshotUrl`, `snapshotName`, `snapshotTag` | Retrieve stored payloads and job manifests. |
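The exact semantics of `linesMatching` and `contextLines` are not spelled out in the table. Assuming they behave like `grep -C` (keep matching lines plus N neighbouring lines on each side), line-based filtering could look like the sketch below. `filterLines` is a hypothetical helper written for illustration, not a pi-scraper function.

```typescript
// Hypothetical sketch: keep lines matching a pattern plus N lines of context,
// in the spirit of `grep -C`. Pass a non-global RegExp, because `test()` on a
// global RegExp keeps state between calls.
function filterLines(
  text: string,
  pattern: RegExp,
  contextLines: number,
): string[] {
  const lines = text.split("\n");
  const keep = new Set<number>();
  lines.forEach((line, i) => {
    if (pattern.test(line)) {
      const start = Math.max(0, i - contextLines);
      const end = Math.min(lines.length - 1, i + contextLines);
      for (let j = start; j <= end; j++) keep.add(j);
    }
  });
  // Preserve document order and avoid duplicating overlapping context.
  return lines.filter((_, i) => keep.has(i));
}

const page = [
  "# Install",
  "Run the installer.",
  "",
  "# Pricing",
  "Free tier available.",
  "Contact sales.",
].join("\n");

// The "i" flag plays the role of a case-insensitive match.
const kept = filterLines(page, /pricing/i, 1);
// kept is ["", "# Pricing", "Free tier available."]
```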
pi-scraper is stateless by default. Use sessionId when you need to maintain state (cookies, login, cart) across multiple calls.
- `sessionId`: A unique key for the session.
- `saveSession`: Persist cookies to disk (useful across Pi reloads).
- `clearSession`: Wipe the session state.
- `fingerprint`: Use `mode: "fingerprint"` to bypass basic bot blocks using browser-grade TLS fingerprints without the overhead of a full browser.
```js
// Example: Log in and then scrape a protected page
web_scrape({ url: "https://example.com/login", sessionId: "user-1", saveSession: true })
web_scrape({ url: "https://example.com/dashboard", sessionId: "user-1" })
```
Extract structured data using CSS selectors, XPath, or plain text search.
| Parameter | Description |
|---|---|
| `selector` | The CSS/XPath/Text to find. |
| `attribute` | Extract a specific attribute (e.g., `href`) instead of text. |
| `adaptive` | Enable relocation if the page layout changes. |
| `limit` | Maximum elements to return. |
```json
{
  "url": "https://example.com/products",
  "selector": ".product-card",
  "identifier": "products-v1",
  "autoSave": true,
  "limit": 5
}
```

`mode: "browser"` uses CloakBrowser by default: a patched Chromium binary with 48 C++-level fingerprint patches.
| Backend | Default | Browser | Stealth level | Requirement |
|---|---|---|---|---|
| `"cloak"` | Yes | CloakBrowser Chromium 145 | C++ source-level (48 patches) | Bundled |
| `"playwright"` | No | Stock Playwright Chromium | JS `page.evaluate()` via `stealth=true` | `npm install playwright` |
CloakBrowser does not need `stealth=true`. All anti-detection patches (navigator.webdriver, canvas, WebGL, audio, fonts, GPU, screen, WebRTC, network timing) are applied at the C++ binary level, so they are not exposed to JS-level bot detection.
Test results from CloakBrowser:
- reCAPTCHA v3 score: 0.9 (human)
- Cloudflare Turnstile: PASS
- FingerprintJS: PASS
- BrowserScan: NORMAL (4/4)
- 30+ detection sites: passed
When using CloakBrowser with sessionId + saveSession=true:
```
web_scrape url="https://example.com" mode=browser sessionId="my-session" saveSession=true
```

CloakBrowser uses `launchPersistentContext()`, which writes cookies, localStorage, and session state to a disk profile at `~/.pi/browser-sessions/<sessionId>/`. This:
- Avoids incognito/private-mode detection (BrowserScan penalizes incognito by ~10%)
- Survives Pi restarts and process reloads
- Keeps login state across multiple scrape calls
To persist an authenticated login flow:

- Log in and save the session. Open the login page in browser mode. Specifying `saveSession=true` writes the cookies and session state to your local profile.

  ```
  web_scrape url="https://example.com/login" mode=browser sessionId="site-session" saveSession=true
  ```

- Scrape authenticated content. Subsequent calls using the same `sessionId` automatically inherit the authenticated state (cookies, local storage, etc.).

  ```
  web_scrape url="https://example.com/dashboard" mode=browser sessionId="site-session"
  ```

- Clear the session when done (optional). Wipe the saved session and context from your local disk.

  ```
  web_scrape url="https://example.com" mode=browser sessionId="site-session" clearSession=true
  ```
| Option | Type | Description |
|---|---|---|
| `timezone` | string | IANA timezone (e.g. `"America/New_York"`). Set via binary flag; undetectable. |
| `locale` | string | BCP 47 locale (e.g. `"en-US"`). Set via `--lang` binary flag. |
| `proxy` | string | HTTP or SOCKS5 proxy URL. |
These are safe to set even with the Playwright backend (ignored or applied via JS patches).
For well-known sites, pi-scraper uses optimized "vertical" extractors that hit APIs directly, bypassing slow HTML scraping.
| Vertical | Platforms / Sites | Extracted Data / Possibilities |
|---|---|---|
| GitHub Repo | GitHub | Metadata, README, File Tree, Languages, Topics. |
| GitHub Issue | GitHub | Issue body, comments, participants, labels, status. |
| GitHub PR | GitHub | Pull request body, diff stats, reviews, comments. |
| GitHub Release | GitHub | Release notes, tag info, assets, author metadata. |
| npm Package | npmjs.com | Manifest JSON, versions, dependencies, README. |
| PyPI Package | pypi.org | Package metadata, versions, author, description. |
| crates.io | crates.io | Rust crate metadata, versions, dependencies. |
| Docker Hub | hub.docker.com | Image metadata, tags, architectures, layers. |
| HF Model | huggingface.co | Model cards, metadata, files, community stats. |
| HF Dataset | huggingface.co | Dataset cards, configuration, metadata, previews. |
| Hacker News | ycombinator.com | Story/Comment trees via Firebase API. |
| arXiv | arxiv.org | Academic paper metadata and Atom feeds. |
| DeepWiki | deepwiki.io | Structured wiki content and metadata. |
| Docs Site | Docusaurus, RTD | Sections, sidebar navigation, and page metadata. |
| docstrings | TS/JS/Py/Rs | Exported symbols, types, and function signatures. |
| YouTube Metadata | youtube.com | Video title, views, channel name, duration, and description. |
| YouTube Transcripts | youtube.com | Full transcripts in plain-text and timed segments. |
| YouTube Comments | youtube.com | Preview of top video comments and engagement stats. |
| Reddit Post | reddit.com | Post content, scoring, flairs, and author metadata. |
| Reddit Thread | reddit.com | Full nested comment trees (retains original thread depth). |
| Reddit List | reddit.com | Subreddit listings (hot/new/top) and search results. |
| OSS Analytics | ossinsight.io | Real-time repository metrics, stars, and contribution trends. |
| OSS Trending | ossinsight.io | Daily/weekly trending repositories and collections. |
| OSS Rankings | ossinsight.io | Collection-based rankings and ecosystem comparison data. |
```js
// Get structured data for an npm package
web_extract({ action: "vertical", url: "https://www.npmjs.com/package/undici" })
// Get YouTube video metadata, transcript, and comment preview
web_extract({ action: "vertical", extractor: "youtube", url: "https://www.youtube.com/watch?v=arj7oStGLkU" })
```
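To illustrate what "hitting APIs directly" can mean, the sketch below maps an npmjs.com package page to the matching public npm registry metadata URL (`https://registry.npmjs.org/<name>`). Whether pi-scraper's npm vertical uses this exact endpoint is an assumption, and `npmRegistryUrl` is a hypothetical helper, not part of the extension.

```typescript
// Hypothetical sketch: translate an npmjs.com package page into the public
// registry metadata URL. Returns null for URLs that are not package pages.
function npmRegistryUrl(pageUrl: string): string | null {
  const match =
    /^https?:\/\/(?:www\.)?npmjs\.com\/package\/((?:@[^/]+\/)?[^/?#]+)/.exec(pageUrl);
  if (!match) return null;
  // Scoped names keep the "@" but need their single "/" percent-encoded.
  const name = match[1].replace("/", "%2F");
  return `https://registry.npmjs.org/${name}`;
}

const plain = npmRegistryUrl("https://www.npmjs.com/package/undici");
// plain is "https://registry.npmjs.org/undici"

const scoped = npmRegistryUrl("https://www.npmjs.com/package/@types/node");
// scoped is "https://registry.npmjs.org/@types%2Fnode"
```

The registry responds with JSON, which is why a vertical of this kind can skip HTML parsing entirely.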
Large results are stored automatically. You can retrieve them later using web_get_result.
| Data | Path |
|---|---|
| SQLite Index | ~/.pi/scraper/index.db |
| Payload Blobs | ~/.pi/scraper/blobs/ |
| Downloads | ~/.pi/scraper/downloads/ |
Add `saveToFile: true` to persist PDFs, images, or archives to disk.

```json
{ "url": "https://arxiv.org/pdf/1706.03762", "saveToFile": true }
```

Control the fetch limit per request (default: 30 MB).

```json
{ "url": "https://example.com/large.zip", "maxBytes": 104857600 }
```

Use `web_map` for fast discovery of a domain's structure without downloading full page bodies. It is an "inventory-only" tool.

What it discovers:

- `robots.txt`: Respects crawl delays and discovers sitemap links.
- Sitemaps: Automatically parses `sitemap.xml` and gzipped sitemaps.
- `llms.txt`: Finds specialized manifests designed for AI consumption.
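As an illustration of the first discovery step, sitemap links are conventionally advertised in `robots.txt` through `Sitemap:` lines. The sketch below extracts them from a robots.txt body. `sitemapsFromRobots` is a hypothetical helper, and pi-scraper's own parser may handle more cases than this.

```typescript
// Hypothetical sketch: collect `Sitemap:` entries from a robots.txt body.
function sitemapsFromRobots(robotsTxt: string): string[] {
  const found: string[] = [];
  for (const rawLine of robotsTxt.split(/\r?\n/)) {
    // Drop trailing comments, then surrounding whitespace.
    const line = (rawLine.split("#")[0] ?? "").trim();
    // The field name is matched case-insensitively.
    const match = /^sitemap\s*:\s*(\S+)/i.exec(line);
    if (match && match[1]) found.push(match[1]);
  }
  return found;
}

const robots = [
  "User-agent: *",
  "Crawl-delay: 2",
  "Disallow: /admin",
  "Sitemap: https://example.com/sitemap.xml",
  "sitemap: https://example.com/sitemap-news.xml.gz  # gzipped",
].join("\n");

const sitemaps = sitemapsFromRobots(robots);
// sitemaps is ["https://example.com/sitemap.xml",
//              "https://example.com/sitemap-news.xml.gz"]
```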
```jsonc
// Inventory all known URLs for a domain
{ "url": "https://example.com", "action": "inventory" }
```

- SSRF Protection: Built-in validation at the connect and redirect layers.
- Robots.txt: Full respect for site crawling rules (configurable).
- Memory Efficient: Large responses are streamed and stored locally.
- Incremental Enforcement: `maxBytes` limits are enforced during the stream.
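Enforcing a byte limit during the stream means the download is aborted as soon as the running total crosses the cap, rather than after the whole body has been buffered. The sketch below shows that idea over an iterable of chunks. `collectWithLimit` and `fakeBody` are hypothetical names, and reading the 30 MB default as 30 × 1024 × 1024 bytes is an assumption (the `maxBytes` example value 104857600 equals 100 × 1024 × 1024).

```typescript
// Hypothetical sketch of enforcing a byte cap while a response streams in.
const DEFAULT_MAX_BYTES = 30 * 1024 * 1024; // assumes "30 MB" means 30 MiB

function collectWithLimit(
  chunks: Iterable<Uint8Array>,
  maxBytes: number = DEFAULT_MAX_BYTES,
): Uint8Array {
  const parts: Uint8Array[] = [];
  let total = 0;
  for (const chunk of chunks) {
    total += chunk.byteLength;
    if (total > maxBytes) {
      // Abort as soon as the cap is crossed, without reading the rest.
      throw new Error(`Response exceeded maxBytes (${maxBytes})`);
    }
    parts.push(chunk);
  }
  // Concatenate the accepted chunks into one buffer.
  const out = new Uint8Array(total);
  let offset = 0;
  for (const part of parts) {
    out.set(part, offset);
    offset += part.byteLength;
  }
  return out;
}

// Fake 40-byte body delivered in ten 4-byte chunks; counts chunks produced.
let pulled = 0;
function* fakeBody(): Generator<Uint8Array> {
  for (let i = 0; i < 10; i++) {
    pulled++;
    yield new Uint8Array(4);
  }
}

let aborted = false;
try {
  collectWithLimit(fakeBody(), 10);
} catch {
  aborted = true;
}
// aborted is true and pulled is 3: the third chunk (12 bytes total) crossed
// the 10-byte cap, so the remaining seven chunks were never produced.
```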
Use the `/scrape-config` slash command to manage your settings interactively or via the CLI:

```
/scrape-config status              # View current settings
/scrape-config scrape-mode browser # Set default mode to browser
/scrape-config robots off          # Disable robots.txt respect
/scrape-config cache clear         # Wipe the local response cache
```

If you are contributing to or building on top of pi-scraper:

- Node.js: `>=22.19.0`
- Pi: `>=0.74.0`

```bash
npm install          # Install dependencies
npm run typecheck    # Verify types
npm test             # Run unit tests
npm run test:tools   # Run tool smoke tests
```

To use stock Playwright Chromium instead of CloakBrowser:

```bash
npm install playwright
npx playwright install chromium
```

```
web_scrape url="https://example.com" mode=browser browserBackend=playwright stealth=true
```
This project is licensed under the MIT License. See LICENSE for details.