WebCrawl - Distributed Web Crawling System

A distributed web crawling and search system built with Python. It implements an end-to-end search engine: distributed crawling, full-text indexing with Whoosh, and a web-based search interface.

Features

Distributed Architecture

Master Node: Coordinates crawling tasks and manages the URL frontier
Multiple Crawler Nodes: Independently fetch and process web pages in parallel
Indexer Node: Processes crawled content for efficient search
AWS Integration: Optional SQS queues for task distribution and S3 for content storage
Horizontal Scaling: Add more crawler nodes dynamically to increase throughput
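
The URL frontier managed by the master node can be pictured as a FIFO queue paired with a set of already-seen URLs. The following is a minimal sketch under that assumption; the names `Frontier`, `add`, and `next_url` are hypothetical and are not taken from master_node.py.

```python
from collections import deque
from urllib.parse import urldefrag


class Frontier:
    """FIFO URL frontier that never hands out the same URL twice."""

    def __init__(self):
        self._queue = deque()   # URLs waiting to be crawled, oldest first
        self._seen = set()      # every URL ever accepted, for de-duplication

    def add(self, url):
        """Queue a URL unless it was already seen; return True if queued."""
        url, _fragment = urldefrag(url)  # drop '#fragment' so duplicates collapse
        if not url or url in self._seen:
            return False
        self._seen.add(url)
        self._queue.append(url)
        return True

    def next_url(self):
        """Return the next URL to crawl, or None when the frontier is empty."""
        return self._queue.popleft() if self._queue else None

    def __len__(self):
        return len(self._queue)
```

Keeping the seen-set separate from the queue means a URL that was already crawled is still rejected after it has left the queue.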

Crawler Features

Robots.txt Compliance: Respects website crawling policies
Politeness Mechanisms: Rate limiting and crawl delays to avoid overloading websites
Recursive Crawling: Discovers new URLs and adds them to the crawl frontier
Content Extraction: Parses HTML to extract valuable text and links
Error Handling: Robust error recovery and retry mechanisms
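
Robots.txt compliance and crawl delays can be illustrated with Python's standard `urllib.robotparser` module. This is a sketch only: the user-agent string `WebCrawlBot`, the helper names, and the one-second default delay are assumptions, and the project's crawler may handle these rules differently.

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "WebCrawlBot"  # hypothetical user-agent string


def build_policy(robots_txt):
    """Parse the text of a robots.txt file into a reusable policy object."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def crawl_decision(parser, url, default_delay=1.0):
    """Return (allowed, delay_seconds) for a URL under the parsed policy."""
    allowed = parser.can_fetch(USER_AGENT, url)
    delay = parser.crawl_delay(USER_AGENT)  # None when no Crawl-delay is set
    return allowed, float(delay) if delay is not None else default_delay


if __name__ == "__main__":
    policy = build_policy("User-agent: *\nCrawl-delay: 2\nDisallow: /private/\n")
    print(crawl_decision(policy, "https://example.com/index.html"))
    print(crawl_decision(policy, "https://example.com/private/page"))
```

In a live crawler the robots.txt text would be fetched once per host and the parsed policy cached, so that each page request does not trigger a second request for the policy file.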

Indexing & Search

Full-Text Indexing: Uses Whoosh for efficient content indexing
Relevance Ranking: BM25F algorithm for high-quality search results
Field-Specific Search: Supports querying by title, content, or domain
Query Highlighting: Shows matching terms in context
Rich Search Results: Displays titles, snippets, domains, and metadata
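
To give a feel for how relevance ranking works, here is a standalone sketch of single-field BM25 scoring written from the textbook formula. It is a simplification: Whoosh computes BM25F (the multi-field variant) internally, and this code is not Whoosh's implementation. The function name and the parameter values `k1=1.2` and `b=0.75` are illustrative assumptions, and the document list is assumed to be non-empty.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document (a list of tokens) against a list of query terms."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in query:
            if tf[term] == 0:
                continue
            # Non-negative IDF variant: rare terms contribute more.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalisation: long documents need more hits to score equally.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


if __name__ == "__main__":
    corpus = [["web", "crawler", "python"], ["search", "engine"], ["python", "python", "web"]]
    print(bm25_scores(["python"], corpus))
```

BM25F extends this idea by weighting term frequencies per field (for example title versus content) before the saturation step.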

Web Interface

Clean Modern UI: Responsive search interface with clear result display
Real-Time Statistics: Shows system performance metrics
Minimalist Design: Focused on search functionality
Cross-Browser Compatible: Works across modern browsers

System Features

Fault Tolerance: Automatic recovery from node failures
Heartbeat Mechanism: Detects and handles crawler node failures
Task Requeuing: Failed tasks are automatically retried
Comprehensive Logging: Detailed logs for monitoring and debugging
Scalability Testing: Built-in tools to measure performance with different configurations
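
The heartbeat and task-requeuing behaviour can be sketched as a small bookkeeping class. Everything here is an assumption for illustration: the class and method names, the 30-second timeout, and the in-memory data structures are hypothetical and do not describe the actual code in master_node.py.

```python
import time
from collections import deque


class HeartbeatMonitor:
    """Tracks crawler heartbeats and requeues tasks held by silent nodes."""

    def __init__(self, timeout=30.0, clock=time.monotonic):
        self.timeout = timeout   # seconds of silence before a node counts as failed
        self.clock = clock       # injectable so the logic can be tested without sleeping
        self.last_seen = {}      # node_id -> time of last heartbeat
        self.in_flight = {}      # node_id -> URLs currently assigned to that node
        self.pending = deque()   # URLs waiting to be assigned

    def heartbeat(self, node_id):
        """Record that a node is alive right now."""
        self.last_seen[node_id] = self.clock()

    def assign(self, node_id):
        """Hand the next pending URL to a node, or None if nothing is pending."""
        if not self.pending:
            return None
        url = self.pending.popleft()
        self.in_flight.setdefault(node_id, []).append(url)
        return url

    def complete(self, node_id, url):
        """Mark a task as finished so it is not requeued later."""
        tasks = self.in_flight.get(node_id, [])
        if url in tasks:
            tasks.remove(url)

    def reap(self):
        """Requeue tasks of nodes whose last heartbeat is older than the timeout."""
        now = self.clock()
        dead = [n for n, seen in self.last_seen.items() if now - seen > self.timeout]
        for node_id in dead:
            self.pending.extend(self.in_flight.pop(node_id, []))
            del self.last_seen[node_id]
        return dead
```

Calling `reap()` periodically from the coordinator returns the list of nodes considered failed and puts their unfinished URLs back at the end of the pending queue.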

Architecture

┌─────────────────┐     ┌──────────────────┐     ┌──────────────────┐
│                 │     │                  │     │                  │
│  Master Node    │◄────┤  Crawler Node 1  │     │  Crawler Node 2  │
│  (Coordinator)  │     │                  │     │                  │
│                 │────►│                  │     │                  │
└────────┬────────┘     └──────────┬───────┘     └──────────┬───────┘
         │                         │                        │
         │                         │                        │
         ▼                         ▼                        ▼
┌─────────────────┐      ┌──────────────────┐     ┌──────────────────┐
│                 │      │                  │     │                  │
│  Amazon SQS     │      │  Amazon S3       │     │  Indexer Node    │
│  (Task Queues)  │      │  (Content Store) │     │  (Search Engine) │
│                 │      │                  │     │                  │
└─────────────────┘      └──────────────────┘     └──────────────────┘

Setup

Prerequisites

Python 3.7+
AWS account (optional, for S3 and SQS features)
boto3 Python library (for AWS integration)
Flask, Whoosh, Beautiful Soup, Requests

Installation

Clone the repository

git clone https://github.com/yourusername/webCrawl.git
cd webCrawl

Install dependencies

pip install -r requirements.txt

Configure AWS credentials (optional)

aws configure
# Follow prompts to enter your AWS Access Key, Secret Key, and region

Starting the System

Start the Master Node

python master/master_node.py --port 5000

Start the Indexer Node

python indexer/indexer_node.py --port 5002 --crawler-api http://<MASTER_IP>:5000

Start Crawler Nodes (can be on separate machines)

python crawler/crawler_node.py --master http://<MASTER_IP>:5000 --s3-bucket your-crawler-bucket --use-sqs

To run multiple crawler processes on the same machine:

# In separate terminal windows:
python crawler/crawler_node.py --master http://<MASTER_IP>:5000 --s3-bucket your-crawler-bucket --use-sqs
python crawler/crawler_node.py --master http://<MASTER_IP>:5000 --s3-bucket your-crawler-bucket --use-sqs
python crawler/crawler_node.py --master http://<MASTER_IP>:5000 --s3-bucket your-crawler-bucket --use-sqs

Usage

Adding Seed URLs

curl -X POST http://<MASTER_IP>:5000/add_urls -H "Content-Type: application/json" -d '{"urls":["https://en.wikipedia.org", "https://news.ycombinator.com"]}'
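
The same request can be issued from Python using only the standard library. This sketch mirrors the curl command above (the `/add_urls` endpoint and the `{"urls": [...]}` payload); the function names and the 10-second timeout are assumptions, and because the response format is not documented here, the body is returned as raw text.

```python
import json
import urllib.request


def build_add_urls_request(master_url, urls):
    """Build (but do not send) the POST request for the /add_urls endpoint."""
    body = json.dumps({"urls": list(urls)}).encode("utf-8")
    return urllib.request.Request(
        master_url.rstrip("/") + "/add_urls",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def add_seed_urls(master_url, urls, timeout=10):
    """Send the request and return the raw response body as text."""
    request = build_add_urls_request(master_url, urls)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

For example, `add_seed_urls("http://localhost:5000", ["https://en.wikipedia.org"])` would submit one seed URL to a master node running locally on port 5000.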

Using the Search Interface

Open your browser and navigate to http://<INDEXER_IP>:5002
Enter your search query in the search box
Review the results with titles, snippets, and metadata

Checking System Status

# Master node status
curl http://<MASTER_IP>:5000/status

# Indexer node status
curl http://<INDEXER_IP>:5002/status

# Run scalability tests
python test_system.py scalability --master http://<MASTER_IP>:5000 --indexer http://<INDEXER_IP>:5002 --crawlers <CRAWLER_IP_1> <CRAWLER_IP_2> <CRAWLER_IP_3>

Performance Tuning

Crawler Performance

Adjust the number of crawler nodes based on desired throughput
Adjust the politeness settings in crawler_node.py (rate limits and crawl delays) to balance crawl speed against website etiquette
Ensure sufficient bandwidth and CPU resources on crawler machines
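
The trade-off behind the politeness settings can be seen in a per-domain rate limiter: a larger minimum delay is gentler on each site but lowers throughput per domain. The sketch below is illustrative only; `DomainRateLimiter`, `min_delay`, and `wait_time` are hypothetical names and do not correspond to actual settings in crawler_node.py.

```python
import time
from urllib.parse import urlparse


class DomainRateLimiter:
    """Spaces out requests to the same domain by at least min_delay seconds."""

    def __init__(self, min_delay=1.0, clock=time.monotonic):
        self.min_delay = min_delay
        self.clock = clock          # injectable so the logic can be tested without sleeping
        self._next_allowed = {}     # domain -> earliest time of the next request

    def wait_time(self, url):
        """Return seconds to wait before fetching url, and reserve that slot."""
        domain = urlparse(url).netloc
        now = self.clock()
        ready_at = max(now, self._next_allowed.get(domain, now))
        self._next_allowed[domain] = ready_at + self.min_delay
        return ready_at - now


if __name__ == "__main__":
    limiter = DomainRateLimiter(min_delay=2.0)
    for link in ["https://example.com/a", "https://example.com/b", "https://example.org/"]:
        print(link, "wait", round(limiter.wait_time(link), 2), "seconds")
```

Because the delay is tracked per domain, a crawler can stay busy by interleaving requests to different sites while still respecting the delay for each one.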

Indexer Performance

Use a larger instance (more CPU and memory) for the indexer node to improve indexing and query performance
Configure Whoosh indexing parameters in indexer_node.py for optimal search

AWS Configuration

Set the SQS visibility timeout longer than the time a crawler needs to process one task, so that in-flight tasks are not redelivered to another node
Choose S3 storage class based on access patterns
Consider using AWS Elastic Beanstalk or ECS for easier scaling

Testing

The system includes comprehensive testing tools:

# Run all tests
python test_system.py all --master http://<MASTER_IP>:5000 --indexer http://<INDEXER_IP>:5002

# Run just functional tests
python test_system.py functional --master http://<MASTER_IP>:5000 --indexer http://<INDEXER_IP>:5002

# Run fault tolerance tests
python test_system.py fault --master http://<MASTER_IP>:5000 --indexer http://<INDEXER_IP>:5002

# Run scalability tests with specific crawler counts
python test_system.py scalability --master http://<MASTER_IP>:5000 --indexer http://<INDEXER_IP>:5002 --counts 1 2 4 8

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License - see the LICENSE file for details.
