Enterprise AI Recursive Web Scraper

The intelligent backbone for large-scale web data extraction, powered by LLMs and high-performance recursive crawling.


Table of Contents

  • Overview
  • Features
  • Architecture
  • Tech Stack
  • Quick Start
  • Usage & Examples
  • Configuration
  • API Reference
  • Development
  • Troubleshooting
  • Performance & Security
  • Contributing
  • Roadmap & Known Issues
  • License & Credits

Overview

Modern web scraping faces two major hurdles: fragile CSS-based extraction and the inability to "understand" content context. The Enterprise AI Recursive Web Scraper addresses these by combining robust recursive crawling with Large Language Models (LLMs) like Google Gemini and Groq. It doesn't just scrape text; it comprehends the semantic structure of a website to navigate knowledge bases, documentation, and complex web apps autonomously.

By utilizing semantic chunking and cosine similarity clustering, the engine avoids redundant data processing and generates structured, RAG-ready (Retrieval-Augmented Generation) datasets.
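
To make the deduplication step concrete, here is a minimal illustrative sketch of cosine-similarity filtering over embedding vectors. It is not the library's internal implementation; the Embedded shape and the 0.95 threshold are assumptions chosen for the example.

// Illustrative only: cosine-similarity deduplication over embedding vectors.
type Embedded = { text: string; vector: number[] };

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Keep a chunk only if it is not too similar to anything already kept.
function deduplicate(chunks: Embedded[], threshold = 0.95): Embedded[] {
  const unique: Embedded[] = [];
  for (const chunk of chunks) {
    const isDuplicate = unique.some(
      (kept) => cosineSimilarity(kept.vector, chunk.vector) >= threshold
    );
    if (!isDuplicate) unique.push(chunk);
  }
  return unique;
}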

Who is this for?

  • AI Engineers building context-aware datasets for LLM fine-tuning.
  • Market Researchers tracking competitor updates across dynamic domains.
  • Data Scientists requiring structured intelligence from unstructured web sources.
  • Enterprise Developers needing a scalable, proxy-aware crawling solution.

Features

🚀 High-Performance Engine

  • Recursive Discovery: Deep-crawling logic that maps and follows internal links.
  • Concurrent Processing: Multi-threaded execution with configurable concurrency.
  • 🎯 Smart Retries: Automated recovery from transient network errors and rate limits.
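
For intuition, the retry behaviour amounts to re-running a failed task with exponential backoff, roughly as in the sketch below. The attempt count and delays are arbitrary example values, not the scraper's defaults.

// Illustrative retry helper with exponential backoff (not the library's internals).
async function withRetries<T>(task: () => Promise<T>, maxAttempts = 3): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await task();
    } catch (error) {
      lastError = error;
      // Wait 1s, 2s, 4s, ... before the next attempt.
      const delayMs = 1000 * 2 ** (attempt - 1);
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
  throw lastError;
}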

🤖 AI-Driven Intelligence

  • 🧠 Semantic Analysis: Native LLM integration for summarization and data categorization.
  • 🧩 Smart Chunking: Automatically breaks long-form content into context-aware segments.
  • 🚫 Deduplication: Cosine similarity clustering to ensure data uniqueness.
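
As a rough sketch of what context-aware chunking can look like, the helper below groups sentences into segments of bounded size; the analyzer's actual chunking strategy and limits may differ.

// Illustrative chunker: groups sentences into segments of roughly maxChars characters.
function chunkText(text: string, maxChars = 1500): string[] {
  const sentences = text.split(/(?<=[.!?])\s+/);
  const chunks: string[] = [];
  let current = "";
  for (const sentence of sentences) {
    if (current.length + sentence.length > maxChars && current.length > 0) {
      chunks.push(current.trim());
      current = "";
    }
    current += sentence + " ";
  }
  if (current.trim().length > 0) chunks.push(current.trim());
  return chunks;
}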

🌐 Advanced Web Capabilities

  • 🛡️ Stealth Mode: Built-in proxy rotation and user-agent spoofing.
  • 🖼️ Visual Capture: Integrated screenshot generation for visual verification.
  • 🧼 NSFW Filtering: Automated content validation to ensure dataset safety.
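
As a simple mental model for proxy rotation and user-agent spoofing, a round-robin rotator like the one below cycles through a pool on each request. The scraper handles this internally through its proxy and user-agent options; this sketch is only illustrative, and the pool values are placeholders.

// Illustrative round-robin rotation of proxies and user agents.
class Rotator<T> {
  private index = 0;
  constructor(private readonly items: T[]) {}
  next(): T {
    const item = this.items[this.index % this.items.length];
    this.index++;
    return item;
  }
}

const proxies = new Rotator(["http://proxy-a.example:8080", "http://proxy-b.example:8080"]);
const userAgents = new Rotator(["example-agent/1.0", "example-agent/2.0"]);

// Each outgoing request would pick the next proxy and user agent:
// const proxy = proxies.next(); const ua = userAgents.next();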

Architecture

The system follows a modular orchestration pattern, separating the concerns of browser automation, content analysis, and data persistence.

Component Relationship

graph TD
    subgraph "Interface Layer"
        CLI[CLI Utility]
        SDK[Library API]
    end

    subgraph "Orchestration Layer"
        Scraper[Scraper Orchestrator]
        Queue[URL Queue Manager]
        Rate[Rate Limiter]
    end

    subgraph "Execution Layer"
        Web[Web Controller]
        AI[Content Analyzer]
        Val[Content Validator]
    end

    subgraph "External Providers"
        Playwright[[Playwright / Browser]]
        LLM{{Gemini / Groq API}}
        Proxy{{Proxy Service}}
    end

    CLI --> Scraper
    SDK --> Scraper
    Scraper --> Queue
    Scraper --> Rate
    Queue --> Web
    Web --> Playwright
    Web --> Proxy
    Web --> Val
    Val --> AI
    AI --> LLM

Data Flow Diagram

flowchart LR
    Start([Seed URL]) --> Fetch[Web Fetcher]
    Fetch --> |"Raw HTML/DOM"| Clean[Cleaner & Validator]
    Clean --> |"Sanitized Text"| AI{AI Analyzer}
    AI --> |"Embeddings"| Cluster[Deduplication]
    AI --> |"Links Found"| Filter[Domain Filter]
    Filter --> |"New Links"| Start
    Cluster --> |"Unique Data"| Export[(JSON/CSV/MD)]
    
    Clean -.-> |"NSFW/Invalid"| Drop([Drop Record])

Tech Stack

Layer | Technology | Purpose
Language | TypeScript | Type safety and enterprise maintainability
Runtime | Node.js (v18+) | Core execution environment
Automation | Playwright / Puppeteer | Browser rendering and DOM interaction
AI Models | Google Gemini / Groq | Semantic analysis and extraction
Tooling | Vitest / Biome | Testing and code quality

Quick Start

Prerequisites

  • Node.js: v18.19.0 or higher.
  • API Key: Access to Google AI Studio (Gemini) or Groq.

Installation

# Global installation for CLI usage
npm install -g enterprise-ai-recursive-web-scraper

# Local installation as a project dependency
npm install enterprise-ai-recursive-web-scraper

2-Minute Hello World

Create a file named scrape.ts:

import { WebScraper } from "enterprise-ai-recursive-web-scraper";

const scraper = new WebScraper({
  apiKey: "YOUR_GEMINI_API_KEY",
});

const results = await scraper.crawl("https://example.com", {
  maxDepth: 1
});

console.log(`Scraped ${results.length} pages!`);

Run it with npx ts-node scrape.ts. If your project is not configured for ES modules, wrap the top-level await in an async function first.

Usage & Examples

CLI Deep Crawl

Execute a deep crawl with markdown output and proxy settings.

web-scraper --url "https://docs.elysium.com" \
            --api-key "GEMINI_KEY_HERE" \
            --depth 3 \
            --concurrency 10 \
            --format md \
            --proxy "http://user:pass@proxy.net:8080" \
            --output ./output

Advanced Programmatic Usage

Custom Rate Limiting & Schema Extraction

import { WebScraper, RateLimiter } from "enterprise-ai-recursive-web-scraper";

const scraper = new WebScraper({
  apiKey: process.env.GEMINI_API_KEY,
  rateLimiter: new RateLimiter({
    maxTokens: 5,
    refillRate: 1 // 1 request per second
  }),
  browserConfig: {
    headless: true,
    executablePath: "/usr/bin/google-chrome"
  }
});

const results = await scraper.crawl("https://news.ycombinator.com", {
  maxDepth: 2,
  selectors: {
    titles: ".titleline > a",
    scores: ".score"
  },
  onPageProcessed: (url, data) => {
    console.log(`Finished processing: ${url}`);
  }
});

Sequence Diagram: Recursive Crawl

sequenceDiagram
    participant U as User/CLI
    participant S as Scraper
    participant W as WebController
    participant A as ContentAnalyzer
    
    U->>S: crawl(url, depth=2)
    S->>W: navigate(url)
    W-->>S: pageContent, links[]
    S->>A: analyze(pageContent)
    A-->>S: structuredData
    S->>S: Filter internal links
    loop for each Link
        S->>W: navigate(sublink)
        W-->>S: subPageContent
    end
    S-->>U: Final Dataset

Configuration

Environment Variables

Variable | Required | Default | Description
GEMINI_API_KEY | Yes | - | API key for Google Gemini
GROQ_API_KEY | No | - | Optional alternative LLM provider
PROXY_URL | No | - | Proxy connection string
LOG_LEVEL | No | info | Logging verbosity (debug/info/warn/error)

Scraper Options

Option | Type | Default | Description
concurrency | number | 5 | Simultaneous pages processed in parallel
maxDepth | number | 3 | Recursion limit for followed links
timeout | number | 30000 | Page load timeout in ms
userAgent | string | - | Custom spoofed browser user agent
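
A hedged example of combining these options: placing concurrency, timeout, and userAgent on the constructor is an assumption based on the examples above, so check the exported ScraperConfig and CrawlOptions types for the exact shape.

import { WebScraper } from "enterprise-ai-recursive-web-scraper";

// Assumed option placement, shown for illustration only.
const scraper = new WebScraper({
  apiKey: process.env.GEMINI_API_KEY,
  concurrency: 10,  // simultaneous pages (default 5)
  timeout: 60_000,  // page load timeout in ms (default 30000)
  userAgent: "example-agent/1.0",
});

const results = await scraper.crawl("https://example.com", { maxDepth: 2 });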

API Reference

WebScraper Class

constructor(config: ScraperConfig)

Initializes the scraper instance.

crawl(url: string, options?: CrawlOptions): Promise<ScrapedData[]>

The primary entry point for extraction.

  • Input: url (valid HTTP/HTTPS string).
  • Return: Array of objects containing url, content, metadata, and timestamp.
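
Assuming a scraper instance configured as in the Quick Start, each returned record can be consumed like this (field names as documented above):

const results = await scraper.crawl("https://example.com", { maxDepth: 1 });

for (const page of results) {
  console.log(page.url);       // the page address
  console.log(page.timestamp); // when the page was scraped
  console.log(page.metadata);  // extracted metadata
  console.log(page.content);   // analyzed page content
}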

RateLimiter Class

constructor(options: RateLimitOptions)

Parameter | Type | Description
maxTokens | number | Bucket size for burst requests
refillRate | number | Tokens added per second
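
The two parameters describe a token bucket: each request consumes a token, and tokens refill at refillRate per second up to maxTokens. The sketch below only illustrates that semantics and is not the library's implementation.

// Illustrative token bucket matching the maxTokens / refillRate semantics.
class SimpleTokenBucket {
  private tokens: number;
  private lastRefill = Date.now();

  constructor(private maxTokens: number, private refillRate: number) {
    this.tokens = maxTokens;
  }

  tryConsume(): boolean {
    const now = Date.now();
    const elapsedSeconds = (now - this.lastRefill) / 1000;
    this.tokens = Math.min(this.maxTokens, this.tokens + elapsedSeconds * this.refillRate);
    this.lastRefill = now;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}

// new SimpleTokenBucket(5, 1) allows bursts of 5 requests, then roughly 1 request per second.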

Development

Setup

# Clone and install dependencies
git clone https://github.com/ElysiumOSS/enterprise-ai-recursive-web-scraper.git
cd enterprise-ai-recursive-web-scraper
npm install

# Run tests
npm test

# Build the project
npm run build

Project Structure

  • src/classes/: Core logic (Scraper, AI, Web controllers).
  • src/data/: Static datasets (e.g., NSFW keyword lists).
  • src/cli.ts: Command-line interface entry point.
  • test/: Unit and integration test suites.

Troubleshooting

Error Message | Cause | Solution
API_KEY_INVALID | Incorrect Gemini/Groq key | Verify environment variables and key permissions.
ERR_PROXY_CONNECTION_FAILED | Proxy server unreachable | Check proxy credentials and firewall settings.
MAX_DEPTH_REACHED | Scraper stopped at the configured limit | Increase the maxDepth option.
NSFW_CONTENT_DETECTED | Page failed the safety check | Adjust ContentValidator settings or allowlist the URL.

Performance & Security

Performance Considerations

  • Memory Management: For large crawls (>1000 pages), ensure Node.js is started with --max-old-space-size=4096.
  • Concurrency: High concurrency (20+) may trigger anti-bot protections. Use residential proxies if scaling aggressively.

Security Notes

  • API Keys: Never commit .env files. Use secret management (AWS Secrets Manager, GitHub Secrets).
  • Sandboxing: Playwright runs in a sandbox by default; do not disable this in production environments.

Contributing

We welcome contributions! Please see our CONTRIBUTING.md for detailed guidelines.

  1. Fork the repository.
  2. Create a feature branch (git checkout -b feature/cool-new-logic).
  3. Commit changes using Conventional Commits.
  4. Push to the branch and open a Pull Request.

Roadmap & Known Issues

  • Support for local LLMs via Ollama.
  • Distributed crawling using Redis as a backplane.
  • Export to Vector Databases (Pinecone, Weaviate).
  • ⚠️ Known Issue: Certain SPAs (Single Page Apps) with heavy obfuscation may require custom delay settings.

License & Credits

  • License: MIT - see LICENSE.md for details.
  • Maintained by: ElysiumOSS Team.
  • Inspirations: Built on the shoulders of Playwright and the Google Generative AI SDK.
