The intelligent backbone for large-scale web data extraction, powered by LLMs and high-performance recursive crawling.
- Overview
- Features
- Architecture
- Quick Start
- Usage & Examples
- Configuration
- API Reference
- Development
- Troubleshooting
- Performance & Security
- Contributing
- Roadmap & Known Issues
- License & Credits
Modern web scraping faces two major hurdles: fragile CSS-based extraction and the inability to "understand" content context. The Enterprise AI Recursive Web Scraper addresses these by combining robust recursive crawling with Large Language Models (LLMs) like Google Gemini and Groq. It doesn't just scrape text; it comprehends the semantic structure of a website to navigate knowledge bases, documentation, and complex web apps autonomously.
By utilizing semantic chunking and cosine similarity clustering, the engine avoids redundant data processing and generates structured, RAG-ready (Retrieval-Augmented Generation) datasets.
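As a rough illustration of the clustering step (illustrative only, not the library's exported API), near-duplicate chunks can be dropped by comparing their embedding vectors with cosine similarity:

```typescript
// Illustrative sketch -- the library's internal deduplication is not exposed under these names.
// Cosine similarity between two embedding vectors: dot(a, b) / (|a| * |b|).
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Keep a chunk only if it is not too similar to anything already kept.
function deduplicate(embeddings: number[][], threshold = 0.9): number[] {
  const keptIndices: number[] = [];
  embeddings.forEach((vec, i) => {
    const isDuplicate = keptIndices.some((j) => cosineSimilarity(vec, embeddings[j]) >= threshold);
    if (!isDuplicate) keptIndices.push(i);
  });
  return keptIndices;
}
```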
Who is this for?
- AI Engineers building context-aware datasets for LLM fine-tuning.
- Market Researchers tracking competitor updates across dynamic domains.
- Data Scientists requiring structured intelligence from unstructured web sources.
- Enterprise Developers needing a scalable, proxy-aware crawling solution.
- ✨ Recursive Discovery: Deep-crawling logic that maps and follows internal links.
- ⚡ Concurrent Processing: Multi-threaded execution with configurable concurrency.
- 🎯 Smart Retries: Automated recovery from transient network errors and rate limits (see the backoff sketch after this list).
- 🧠 Semantic Analysis: Native LLM integration for summarization and data categorization.
- 🧩 Smart Chunking: Automatically breaks long-form content into context-aware segments.
- 🚫 Deduplication: Cosine similarity clustering to ensure data uniqueness.
- 🛡️ Stealth Mode: Built-in proxy rotation and user-agent spoofing.
- 🖼️ Visual Capture: Integrated screenshot generation for visual verification.
- 🧼 NSFW Filtering: Automated content validation to ensure dataset safety.
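The retry behaviour follows the usual exponential-backoff pattern. A minimal standalone sketch of that idea (assumed behaviour, not the library's internal code):

```typescript
// Illustrative retry-with-backoff sketch -- assumed behaviour, not an exported API of this package.
async function fetchWithRetry(url: string, maxRetries = 3): Promise<Response> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      const response = await fetch(url);
      // Retry on rate limiting or transient server errors.
      if (response.status === 429 || response.status >= 500) {
        throw new Error(`Transient HTTP ${response.status}`);
      }
      return response;
    } catch (error) {
      lastError = error;
      if (attempt === maxRetries) break;
      // Exponential backoff: 1s, 2s, 4s, ...
      await new Promise((resolve) => setTimeout(resolve, 1000 * 2 ** attempt));
    }
  }
  throw lastError;
}
```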
The system follows a modular orchestration pattern, separating the concerns of browser automation, content analysis, and data persistence.
```mermaid
graph TD
    subgraph "Interface Layer"
        CLI[CLI Utility]
        SDK[Library API]
    end
    subgraph "Orchestration Layer"
        Scraper[Scraper Orchestrator]
        Queue[URL Queue Manager]
        Rate[Rate Limiter]
    end
    subgraph "Execution Layer"
        Web[Web Controller]
        AI[Content Analyzer]
        Val[Content Validator]
    end
    subgraph "External Providers"
        Playwright[[Playwright / Browser]]
        LLM{{Gemini / Groq API}}
        Proxy{{Proxy Service}}
    end
    CLI --> Scraper
    SDK --> Scraper
    Scraper --> Queue
    Scraper --> Rate
    Queue --> Web
    Web --> Playwright
    Web --> Proxy
    Web --> Val
    Val --> AI
    AI --> LLM
```
```mermaid
flowchart LR
    Start([Seed URL]) --> Fetch[Web Fetcher]
    Fetch --> |"Raw HTML/DOM"| Clean[Cleaner & Validator]
    Clean --> |"Sanitized Text"| AI{AI Analyzer}
    AI --> |"Embeddings"| Cluster[Deduplication]
    AI --> |"Links Found"| Filter[Domain Filter]
    Filter --> |"New Links"| Start
    Cluster --> |"Unique Data"| Export[(JSON/CSV/MD)]
    Clean -.-> |"NSFW/Invalid"| Drop([Drop Record])
```
| Layer | Technology | Purpose |
|---|---|---|
| Language | TypeScript | Type-safety and enterprise maintainability |
| Runtime | Node.js (v18+) | Core execution environment |
| Automation | Playwright / Puppeteer | Browser rendering and DOM interaction |
| AI Models | Google Gemini / Groq | Semantic analysis and extraction |
| Tooling | Vitest / Biome | Testing and code quality |
- Node.js: `v18.19.0` or higher.
- API Key: Access to Google AI Studio (Gemini) or Groq.
```bash
# Global installation for CLI usage
npm install -g enterprise-ai-recursive-web-scraper

# Local installation as a project dependency
npm install enterprise-ai-recursive-web-scraper
```

Create a file named `scrape.ts`:
```typescript
import { WebScraper } from "enterprise-ai-recursive-web-scraper";

const scraper = new WebScraper({
  apiKey: "YOUR_GEMINI_API_KEY",
});

const results = await scraper.crawl("https://example.com", {
  maxDepth: 1
});

console.log(`Scraped ${results.length} pages!`);
```

Run it via `npx ts-node scrape.ts`.
Execute a deep crawl with markdown output and proxy settings.
```bash
web-scraper --url "https://docs.elysium.com" \
  --api-key "GEMINI_KEY_HERE" \
  --depth 3 \
  --concurrency 10 \
  --format md \
  --proxy "http://user:pass@proxy.net:8080" \
  --output ./output
```

Custom Rate Limiting & Schema Extraction
```typescript
import { WebScraper, RateLimiter } from "enterprise-ai-recursive-web-scraper";

const scraper = new WebScraper({
  apiKey: process.env.GEMINI_API_KEY,
  rateLimiter: new RateLimiter({
    maxTokens: 5,
    refillRate: 1 // 1 request per second
  }),
  browserConfig: {
    headless: true,
    executablePath: "/usr/bin/google-chrome"
  }
});

const results = await scraper.crawl("https://news.ycombinator.com", {
  maxDepth: 2,
  selectors: {
    titles: ".titleline > a",
    scores: ".score"
  },
  onPageProcessed: (url, data) => {
    console.log(`Finished processing: ${url}`);
  }
});
```

```mermaid
sequenceDiagram
    participant U as User/CLI
    participant S as Scraper
    participant W as WebController
    participant A as ContentAnalyzer
    U->>S: crawl(url, depth=2)
    S->>W: navigate(url)
    W-->>S: pageContent, links[]
    S->>A: analyze(pageContent)
    A-->>S: structuredData
    S->>S: Filter internal links
    loop for each Link
        S->>W: navigate(sublink)
        W-->>S: subPageContent
    end
    S-->>U: Final Dataset
```
| Variable | Required | Default | Description |
|---|---|---|---|
| `GEMINI_API_KEY` | Yes | - | API key for Google Gemini |
| `GROQ_API_KEY` | No | - | Optional alternative LLM provider |
| `PROXY_URL` | No | - | Proxy connection string |
| `LOG_LEVEL` | No | `info` | Logging verbosity (debug/info/warn/error) |
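A minimal sketch of wiring these variables into library usage (only the documented `apiKey` constructor option is assumed here):

```typescript
import { WebScraper } from "enterprise-ai-recursive-web-scraper";

// Read credentials from the environment instead of hardcoding them (see the security notes below).
const apiKey = process.env.GEMINI_API_KEY ?? process.env.GROQ_API_KEY;
if (!apiKey) {
  throw new Error("Set GEMINI_API_KEY or GROQ_API_KEY before running the scraper.");
}

const scraper = new WebScraper({ apiKey });
```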
| Option | Type | Default | Description |
|---|---|---|---|
| `concurrency` | number | `5` | Simultaneous page processing |
| `maxDepth` | number | `3` | Recursion limit for links |
| `timeout` | number | `30000` | Page load timeout in ms |
| `userAgent` | string | Custom | Spoofed browser user agent |
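A hedged sketch of passing these options: only `maxDepth` and `selectors` appear in the documented examples, so the placement of the remaining options is an assumption to verify against the package typings.

```typescript
import { WebScraper } from "enterprise-ai-recursive-web-scraper";

const scraper = new WebScraper({ apiKey: process.env.GEMINI_API_KEY });

// Assumed placement: only maxDepth and selectors are shown in the documented examples.
const results = await scraper.crawl("https://example.com", {
  maxDepth: 3,      // recursion limit for links
  concurrency: 5,   // simultaneous page processing (assumed option placement)
  timeout: 30_000,  // page load timeout in ms (assumed option placement)
  userAgent: "Mozilla/5.0 (compatible; ExampleBot/1.0)", // spoofed user agent (assumed)
});
console.log(`Scraped ${results.length} pages`);
```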
`new WebScraper(config)` initializes the scraper instance.
`crawl(url, options)` is the primary entry point for extraction.
- Input: `url` (valid HTTP/HTTPS string).
- Return: Array of objects containing `url`, `content`, `metadata`, and `timestamp`.
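A minimal consumption sketch, assuming the result fields listed above:

```typescript
import { WebScraper } from "enterprise-ai-recursive-web-scraper";

const scraper = new WebScraper({ apiKey: process.env.GEMINI_API_KEY });

// Iterate over the documented result shape: url, content, metadata, timestamp.
const results = await scraper.crawl("https://example.com", { maxDepth: 1 });
for (const page of results) {
  console.log(`[${page.timestamp}] ${page.url} (${page.content.length} chars)`);
  console.log("Metadata:", page.metadata);
}
```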
| Parameter | Type | Description |
|---|---|---|
| `maxTokens` | number | Bucket size for burst requests |
| `refillRate` | number | Tokens added per second |
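These parameters follow standard token-bucket semantics. A self-contained sketch of that behaviour (illustrative only, not the library's internal `RateLimiter` implementation):

```typescript
// Generic token bucket: up to maxTokens requests may burst, then refillRate tokens accrue per second.
class TokenBucket {
  private tokens: number;
  private lastRefill = Date.now();

  constructor(private maxTokens: number, private refillRate: number) {
    this.tokens = maxTokens;
  }

  tryConsume(): boolean {
    const now = Date.now();
    // Add refillRate tokens per elapsed second, capped at the bucket size.
    this.tokens = Math.min(
      this.maxTokens,
      this.tokens + ((now - this.lastRefill) / 1000) * this.refillRate
    );
    this.lastRefill = now;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}
```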
```bash
# Clone and install dependencies
git clone https://github.com/ElysiumOSS/enterprise-ai-recursive-web-scraper.git
cd enterprise-ai-recursive-web-scraper
npm install

# Run tests
npm test

# Build the project
npm run build
```

- `src/classes/`: Core logic (Scraper, AI, Web controllers).
- `src/data/`: Static datasets (e.g., NSFW keyword lists).
- `src/cli.ts`: Command-line interface entry point.
- `test/`: Unit and integration test suites.
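A hypothetical Vitest smoke test as it might live under `test/`, shown only to illustrate the tooling (the import path is an assumption):

```typescript
// test/scraper.smoke.test.ts -- hypothetical smoke test; adjust the import path to the repo layout.
import { describe, expect, it } from "vitest";
import { WebScraper } from "enterprise-ai-recursive-web-scraper";

describe("WebScraper", () => {
  it("constructs with an API key", () => {
    const scraper = new WebScraper({ apiKey: "test-key" });
    expect(scraper).toBeDefined();
  });
});
```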
| Error Message | Cause | Solution |
|---|---|---|
| `API_KEY_INVALID` | Incorrect Gemini/Groq key | Verify environment variables and key permissions. |
| `ERR_PROXY_CONNECTION_FAILED` | Proxy server unreachable | Check proxy credentials and firewall settings. |
| `MAX_DEPTH_REACHED` | Scraper stopped at limit | Increase the `maxDepth` parameter in config. |
| `NSFW_CONTENT_DETECTED` | Page failed safety check | Adjust `ContentValidator` settings or white-list the URL. |
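When embedding the scraper in a larger pipeline, these failures can be handled programmatically. A sketch under the assumption that the codes above surface in thrown `Error` messages:

```typescript
import { WebScraper } from "enterprise-ai-recursive-web-scraper";

const scraper = new WebScraper({ apiKey: process.env.GEMINI_API_KEY });

try {
  const results = await scraper.crawl("https://example.com", { maxDepth: 2 });
  console.log(`Scraped ${results.length} pages`);
} catch (error) {
  // Assumed: the documented error codes appear in the thrown error's message.
  const message = error instanceof Error ? error.message : String(error);
  if (message.includes("API_KEY_INVALID")) {
    console.error("Check GEMINI_API_KEY / GROQ_API_KEY and key permissions.");
  } else if (message.includes("ERR_PROXY_CONNECTION_FAILED")) {
    console.error("Check proxy credentials and firewall settings.");
  } else {
    throw error;
  }
}
```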
- Memory Management: For large crawls (>1000 pages), ensure Node.js is started with `--max-old-space-size=4096`.
- Concurrency: High concurrency (20+) may trigger anti-bot protections. Use residential proxies if scaling aggressively.
- API Keys: Never commit `.env` files. Use secret management (AWS Secrets Manager, GitHub Secrets).
- Sandboxing: Playwright runs in a sandbox by default; do not disable this in production environments.
We welcome contributions! Please see our CONTRIBUTING.md for detailed guidelines.
- Fork the repository.
- Create a feature branch (`git checkout -b feature/cool-new-logic`).
- Commit changes using Conventional Commits.
- Push to the branch and open a Pull Request.
- Support for local LLMs via Ollama.
- Distributed crawling using Redis as a backplane.
- Export to Vector Databases (Pinecone, Weaviate).
⚠️ Known Issue: Certain SPAs (Single Page Apps) with heavy obfuscation may require custom delay settings.
- License: MIT - see LICENSE.md for details.
- Maintained by: ElysiumOSS Team.
- Inspirations: Built on the shoulders of Playwright and the Google Generative AI SDK.