Enterprise AI Recursive Web Scraper

The intelligent backbone for large-scale web data extraction, powered by LLMs and high-performance recursive crawling.


Table of Contents

  • Overview
  • Features
  • Architecture
  • Tech Stack
  • Quick Start
  • Usage & Examples
  • Configuration
  • API Reference
  • Development
  • Troubleshooting
  • Performance & Security
  • Contributing
  • Roadmap & Known Issues
  • License & Credits

Overview

Modern web scraping faces two major hurdles: fragile CSS-based extraction and the inability to "understand" content context. The Enterprise AI Recursive Web Scraper addresses these by combining robust recursive crawling with Large Language Models (LLMs) like Google Gemini and Groq. It doesn't just scrape text; it comprehends the semantic structure of a website to navigate knowledge bases, documentation, and complex web apps autonomously.

By utilizing semantic chunking and cosine similarity clustering, the engine avoids redundant data processing and generates structured, RAG-ready (Retrieval-Augmented Generation) datasets.
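
To make the deduplication step concrete, here is a minimal illustrative sketch of cosine-similarity filtering over embedding vectors. It is not the library's internal implementation; the Embedded shape and the 0.95 threshold are assumptions chosen for the example.

// Illustrative only: cosine-similarity deduplication over embedding vectors.
type Embedded = { text: string; vector: number[] };

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Keep a chunk only if it is not too similar to anything already kept.
function deduplicate(chunks: Embedded[], threshold = 0.95): Embedded[] {
  const unique: Embedded[] = [];
  for (const chunk of chunks) {
    const isDuplicate = unique.some(
      (kept) => cosineSimilarity(kept.vector, chunk.vector) >= threshold
    );
    if (!isDuplicate) unique.push(chunk);
  }
  return unique;
}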

Who is this for?

  • AI Engineers building context-aware datasets for LLM fine-tuning.
  • Market Researchers tracking competitor updates across dynamic domains.
  • Data Scientists requiring structured intelligence from unstructured web sources.
  • Enterprise Developers needing a scalable, proxy-aware crawling solution.

Features

🚀 High-Performance Engine

  • Recursive Discovery: Deep-crawling logic that maps and follows internal links.
  • Concurrent Processing: Multi-threaded execution with configurable concurrency.
  • 🎯 Smart Retries: Automated recovery from transient network errors and rate limits.
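
For intuition, the retry behaviour amounts to re-running a failed task with exponential backoff, roughly as in the sketch below. The attempt count and delays are arbitrary example values, not the scraper's defaults.

// Illustrative retry helper with exponential backoff (not the library's internals).
async function withRetries<T>(task: () => Promise<T>, maxAttempts = 3): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await task();
    } catch (error) {
      lastError = error;
      // Wait 1s, 2s, 4s, ... before the next attempt.
      const delayMs = 1000 * 2 ** (attempt - 1);
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
  throw lastError;
}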

🤖 AI-Driven Intelligence

  • 🧠 Semantic Analysis: Native LLM integration for summarization and data categorization.
  • 🧩 Smart Chunking: Automatically breaks long-form content into context-aware segments.
  • 🚫 Deduplication: Cosine similarity clustering to ensure data uniqueness.
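
As a rough sketch of what context-aware chunking can look like, the helper below groups sentences into segments of bounded size; the analyzer's actual chunking strategy and limits may differ.

// Illustrative chunker: groups sentences into segments of roughly maxChars characters.
function chunkText(text: string, maxChars = 1500): string[] {
  const sentences = text.split(/(?<=[.!?])\s+/);
  const chunks: string[] = [];
  let current = "";
  for (const sentence of sentences) {
    if (current.length + sentence.length > maxChars && current.length > 0) {
      chunks.push(current.trim());
      current = "";
    }
    current += sentence + " ";
  }
  if (current.trim().length > 0) chunks.push(current.trim());
  return chunks;
}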

🌐 Advanced Web Capabilities

  • 🛡️ Stealth Mode: Built-in proxy rotation and user-agent spoofing.
  • 🖼️ Visual Capture: Integrated screenshot generation for visual verification.
  • 🧼 NSFW Filtering: Automated content validation to ensure dataset safety.
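
As a simple mental model for proxy rotation and user-agent spoofing, a round-robin rotator like the one below cycles through a pool on each request. The scraper handles this internally through its proxy and user-agent options; this sketch is only illustrative, and the pool values are placeholders.

// Illustrative round-robin rotation of proxies and user agents.
class Rotator<T> {
  private index = 0;
  constructor(private readonly items: T[]) {}
  next(): T {
    const item = this.items[this.index % this.items.length];
    this.index++;
    return item;
  }
}

const proxies = new Rotator(["http://proxy-a.example:8080", "http://proxy-b.example:8080"]);
const userAgents = new Rotator(["example-agent/1.0", "example-agent/2.0"]);

// Each outgoing request would pick the next proxy and user agent:
// const proxy = proxies.next(); const ua = userAgents.next();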

Architecture

The system follows a modular orchestration pattern, separating the concerns of browser automation, content analysis, and data persistence.

Component Relationship

graph TD
    subgraph "Interface Layer"
        CLI[CLI Utility]
        SDK[Library API]
    end

    subgraph "Orchestration Layer"
        Scraper[Scraper Orchestrator]
        Queue[URL Queue Manager]
        Rate[Rate Limiter]
    end

    subgraph "Execution Layer"
        Web[Web Controller]
        AI[Content Analyzer]
        Val[Content Validator]
    end

    subgraph "External Providers"
        Playwright[[Playwright / Browser]]
        LLM{{Gemini / Groq API}}
        Proxy{{Proxy Service}}
    end

    CLI --> Scraper
    SDK --> Scraper
    Scraper --> Queue
    Scraper --> Rate
    Queue --> Web
    Web --> Playwright
    Web --> Proxy
    Web --> Val
    Val --> AI
    AI --> LLM

Data Flow Diagram

flowchart LR
    Start([Seed URL]) --> Fetch[Web Fetcher]
    Fetch --> |"Raw HTML/DOM"| Clean[Cleaner & Validator]
    Clean --> |"Sanitized Text"| AI{AI Analyzer}
    AI --> |"Embeddings"| Cluster[Deduplication]
    AI --> |"Links Found"| Filter[Domain Filter]
    Filter --> |"New Links"| Start
    Cluster --> |"Unique Data"| Export[(JSON/CSV/MD)]
    
    Clean -.-> |"NSFW/Invalid"| Drop([Drop Record])

Tech Stack

Layer | Technology | Purpose
Language | TypeScript | Type safety and enterprise maintainability
Runtime | Node.js (v18+) | Core execution environment
Automation | Playwright / Puppeteer | Browser rendering and DOM interaction
AI Models | Google Gemini / Groq | Semantic analysis and extraction
Tooling | Vitest / Biome | Testing and code quality

Quick Start

Prerequisites

  • Node.js: v18.19.0 or higher.
  • API Key: Access to Google AI Studio (Gemini) or Groq.

Installation

# Global installation for CLI usage
npm install -g enterprise-ai-recursive-web-scraper

# Local installation as a project dependency
npm install enterprise-ai-recursive-web-scraper

2-Minute Hello World

Create a file named scrape.ts:

import { WebScraper } from "enterprise-ai-recursive-web-scraper";

const scraper = new WebScraper({
  apiKey: "YOUR_GEMINI_API_KEY",
});

const results = await scraper.crawl("https://example.com", {
  maxDepth: 1
});

console.log(`Scraped ${results.length} pages!`);

Run it with npx ts-node scrape.ts. If your project is not configured for ES modules, wrap the top-level await in an async function first.

Usage & Examples

CLI Deep Crawl

Execute a deep crawl with markdown output and proxy settings.

web-scraper --url "https://docs.elysium.com" \
            --api-key "GEMINI_KEY_HERE" \
            --depth 3 \
            --concurrency 10 \
            --format md \
            --proxy "http://user:pass@proxy.net:8080" \
            --output ./output

Advanced Programmatic Usage

Custom Rate Limiting & Schema Extraction

import { WebScraper, RateLimiter } from "enterprise-ai-recursive-web-scraper";

const scraper = new WebScraper({
  apiKey: process.env.GEMINI_API_KEY,
  rateLimiter: new RateLimiter({
    maxTokens: 5,
    refillRate: 1 // 1 request per second
  }),
  browserConfig: {
    headless: true,
    executablePath: "/usr/bin/google-chrome"
  }
});

const results = await scraper.crawl("https://news.ycombinator.com", {
  maxDepth: 2,
  selectors: {
    titles: ".titleline > a",
    scores: ".score"
  },
  onPageProcessed: (url, data) => {
    console.log(`Finished processing: ${url}`);
  }
});

Sequence Diagram: Recursive Crawl

sequenceDiagram
    participant U as User/CLI
    participant S as Scraper
    participant W as WebController
    participant A as ContentAnalyzer
    
    U->>S: crawl(url, depth=2)
    S->>W: navigate(url)
    W-->>S: pageContent, links[]
    S->>A: analyze(pageContent)
    A-->>S: structuredData
    S->>S: Filter internal links
    loop for each Link
        S->>W: navigate(sublink)
        W-->>S: subPageContent
    end
    S-->>U: Final Dataset

Configuration

Environment Variables

Variable | Required | Default | Description
GEMINI_API_KEY | Yes | - | API key for Google Gemini
GROQ_API_KEY | No | - | Optional alternative LLM provider
PROXY_URL | No | - | Proxy connection string
LOG_LEVEL | No | info | Logging verbosity (debug/info/warn/error)

Scraper Options

Option | Type | Default | Description
concurrency | number | 5 | Simultaneous pages processed in parallel
maxDepth | number | 3 | Recursion limit for followed links
timeout | number | 30000 | Page load timeout in ms
userAgent | string | - | Custom spoofed browser user agent
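
A hedged example of combining these options: placing concurrency, timeout, and userAgent on the constructor is an assumption based on the examples above, so check the exported ScraperConfig and CrawlOptions types for the exact shape.

import { WebScraper } from "enterprise-ai-recursive-web-scraper";

// Assumed option placement, shown for illustration only.
const scraper = new WebScraper({
  apiKey: process.env.GEMINI_API_KEY,
  concurrency: 10,  // simultaneous pages (default 5)
  timeout: 60_000,  // page load timeout in ms (default 30000)
  userAgent: "example-agent/1.0",
});

const results = await scraper.crawl("https://example.com", { maxDepth: 2 });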

API Reference

WebScraper Class

constructor(config: ScraperConfig)

Initializes the scraper instance.

crawl(url: string, options?: CrawlOptions): Promise<ScrapedData[]>

The primary entry point for extraction.

  • Input: url (valid HTTP/HTTPS string).
  • Return: Array of objects containing url, content, metadata, and timestamp.
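
Assuming a scraper instance configured as in the Quick Start, each returned record can be consumed like this (field names as documented above):

const results = await scraper.crawl("https://example.com", { maxDepth: 1 });

for (const page of results) {
  console.log(page.url);       // the page address
  console.log(page.timestamp); // when the page was scraped
  console.log(page.metadata);  // extracted metadata
  console.log(page.content);   // analyzed page content
}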

RateLimiter Class

constructor(options: RateLimitOptions)

Parameter | Type | Description
maxTokens | number | Bucket size for burst requests
refillRate | number | Tokens added per second
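
The two parameters describe a token bucket: each request consumes a token, and tokens refill at refillRate per second up to maxTokens. The sketch below only illustrates that semantics and is not the library's implementation.

// Illustrative token bucket matching the maxTokens / refillRate semantics.
class SimpleTokenBucket {
  private tokens: number;
  private lastRefill = Date.now();

  constructor(private maxTokens: number, private refillRate: number) {
    this.tokens = maxTokens;
  }

  tryConsume(): boolean {
    const now = Date.now();
    const elapsedSeconds = (now - this.lastRefill) / 1000;
    this.tokens = Math.min(this.maxTokens, this.tokens + elapsedSeconds * this.refillRate);
    this.lastRefill = now;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}

// new SimpleTokenBucket(5, 1) allows bursts of 5 requests, then roughly 1 request per second.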

Development

Setup

# Clone and install dependencies
git clone https://github.com/ElysiumOSS/enterprise-ai-recursive-web-scraper.git
cd enterprise-ai-recursive-web-scraper
npm install

# Run tests
npm test

# Build the project
npm run build

Project Structure

  • src/classes/: Core logic (Scraper, AI, Web controllers).
  • src/data/: Static datasets (e.g., NSFW keyword lists).
  • src/cli.ts: Command-line interface entry point.
  • test/: Unit and integration test suites.

Troubleshooting

Error Message | Cause | Solution
API_KEY_INVALID | Incorrect Gemini/Groq key | Verify environment variables and key permissions.
ERR_PROXY_CONNECTION_FAILED | Proxy server unreachable | Check proxy credentials and firewall settings.
MAX_DEPTH_REACHED | Scraper stopped at the configured limit | Increase the maxDepth option.
NSFW_CONTENT_DETECTED | Page failed the safety check | Adjust ContentValidator settings or allowlist the URL.

Performance & Security

Performance Considerations

  • Memory Management: For large crawls (>1000 pages), ensure Node.js is started with --max-old-space-size=4096.
  • Concurrency: High concurrency (20+) may trigger anti-bot protections. Use residential proxies if scaling aggressively.

Security Notes

  • API Keys: Never commit .env files. Use secret management (AWS Secrets Manager, GitHub Secrets).
  • Sandboxing: Playwright runs in a sandbox by default; do not disable this in production environments.

Contributing

We welcome contributions! Please see our CONTRIBUTING.md for detailed guidelines.

  1. Fork the repository.
  2. Create a feature branch (git checkout -b feature/cool-new-logic).
  3. Commit changes using Conventional Commits.
  4. Push to the branch and open a Pull Request.

Roadmap & Known Issues

  • Support for local LLMs via Ollama.
  • Distributed crawling using Redis as a backplane.
  • Export to Vector Databases (Pinecone, Weaviate).
  • ⚠️ Known Issue: Certain SPAs (Single Page Apps) with heavy obfuscation may require custom delay settings.

License & Credits

  • License: MIT - see LICENSE.md for details.
  • Maintained by: ElysiumOSS Team.
  • Inspirations: Built on the shoulders of Playwright and the Google Generative AI SDK.
