A high-performance, distributed web crawling and search system built with Python. This project implements a complete search engine solution with distributed crawling, content indexing, and a modern search interface.
- Master Node: Coordinates crawling tasks and manages the URL frontier
- Multiple Crawler Nodes: Independently fetch and process web pages in parallel
- Indexer Node: Processes crawled content for efficient search
- AWS Integration: Optional SQS queues for task distribution and S3 for content storage
- Horizontal Scaling: Add more crawler nodes dynamically to increase throughput
- Robots.txt Compliance: Respects website crawling policies
- Politeness Mechanisms: Rate limiting and crawl delays to avoid overloading websites
- Recursive Crawling: Discovers new URLs and adds them to the crawl frontier
- Content Extraction: Parses HTML to extract valuable text and links
- Error Handling: Robust error recovery and retry mechanisms
- Full-Text Indexing: Uses Whoosh for efficient content indexing
- Relevance Ranking: BM25F algorithm for high-quality search results
- Field-Specific Search: Supports querying by title, content, or domain
- Query Highlighting: Shows matching terms in context
- Rich Search Results: Displays titles, snippets, domains, and metadata
- Clean Modern UI: Responsive search interface with clear result display
- Real-Time Statistics: Shows system performance metrics
- Minimalist Design: Focused on search functionality
- Cross-Browser Compatible: Works across modern browsers
- Fault Tolerance: Automatic recovery from node failures
- Heartbeat Mechanism: Detects and handles crawler node failures
- Task Requeuing: Failed tasks are automatically retried
- Comprehensive Logging: Detailed logs for monitoring and debugging
- Scalability Testing: Built-in tools to measure performance with different configurations
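The heartbeat and task-requeuing features listed above can be illustrated with a small sketch. This is not the project's actual master-node code: the class name `HeartbeatMonitor`, its methods, and the 30-second timeout are assumptions chosen for illustration. It shows one plausible way a coordinator could notice a silent crawler and return its in-flight URLs to the queue.

```python
import time
from collections import deque


class HeartbeatMonitor:
    """Hypothetical sketch: track crawler heartbeats and requeue orphaned tasks."""

    def __init__(self, timeout_seconds=30.0, clock=time.monotonic):
        self.timeout_seconds = timeout_seconds  # assumed value, not from the project
        self.clock = clock                      # injectable so the logic is testable
        self.last_seen = {}                     # crawler_id -> time of last heartbeat
        self.assigned = {}                      # crawler_id -> set of URLs in progress
        self.pending = deque()                  # URLs waiting to be handed out

    def heartbeat(self, crawler_id):
        """Record that a crawler is alive (also registers new crawlers)."""
        self.last_seen[crawler_id] = self.clock()
        self.assigned.setdefault(crawler_id, set())

    def assign(self, crawler_id):
        """Hand the next pending URL to a crawler, or None if nothing is queued."""
        if not self.pending:
            return None
        url = self.pending.popleft()
        self.assigned.setdefault(crawler_id, set()).add(url)
        return url

    def complete(self, crawler_id, url):
        """Mark a URL as finished so it is not requeued later."""
        self.assigned.get(crawler_id, set()).discard(url)

    def reap_dead_crawlers(self):
        """Requeue tasks of crawlers whose last heartbeat is older than the timeout."""
        now = self.clock()
        dead = [cid for cid, seen in self.last_seen.items()
                if now - seen > self.timeout_seconds]
        for cid in dead:
            # Sort only to make the requeue order deterministic.
            for url in sorted(self.assigned.pop(cid, set())):
                self.pending.append(url)
            del self.last_seen[cid]
        return dead
```

In a running system, `reap_dead_crawlers()` would be called periodically (for example from a background thread), and crawlers are assumed to call `heartbeat()` before requesting work.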
┌─────────────────┐     ┌──────────────────┐     ┌──────────────────┐
│                 │     │                  │     │                  │
│   Master Node   │◄────┤  Crawler Node 1  │     │  Crawler Node 2  │
│  (Coordinator)  │     │                  │     │                  │
│                 │────►│                  │     │                  │
└────────┬────────┘     └──────────┬───────┘     └──────────┬───────┘
         │                         │                        │
         │                         │                        │
         ▼                         ▼                        ▼
┌─────────────────┐     ┌──────────────────┐     ┌──────────────────┐
│                 │     │                  │     │                  │
│   Amazon SQS    │     │    Amazon S3     │     │   Indexer Node   │
│  (Task Queues)  │     │ (Content Store)  │     │ (Search Engine)  │
│                 │     │                  │     │                  │
└─────────────────┘     └──────────────────┘     └──────────────────┘
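To make the Master Node's role in the diagram more concrete, here is a minimal sketch of a URL frontier that deduplicates URLs and spaces out requests to the same domain. The class `UrlFrontier`, its method names, and the one-second default delay are assumptions for illustration, not the project's actual data structure, which may instead rely on SQS queues.

```python
import time
from collections import deque
from urllib.parse import urldefrag, urlparse


class UrlFrontier:
    """Hypothetical sketch of a deduplicating, per-domain rate-limited frontier."""

    def __init__(self, per_domain_delay=1.0, clock=time.monotonic):
        self.per_domain_delay = per_domain_delay  # assumed default, in seconds
        self.clock = clock                        # injectable for testing
        self.seen = set()                         # every URL ever accepted
        self.queue = deque()                      # URLs waiting to be crawled
        self.next_allowed = {}                    # domain -> earliest next fetch time

    def add(self, url):
        """Queue a URL once; return False for duplicates and non-HTTP schemes."""
        url, _fragment = urldefrag(url)           # "#section" does not change the page
        if urlparse(url).scheme not in ("http", "https") or url in self.seen:
            return False
        self.seen.add(url)
        self.queue.append(url)
        return True

    def next_task(self):
        """Return the first queued URL whose domain is not cooling down, else None."""
        now = self.clock()
        for _ in range(len(self.queue)):
            url = self.queue.popleft()
            domain = urlparse(url).netloc
            if self.next_allowed.get(domain, 0.0) <= now:
                self.next_allowed[domain] = now + self.per_domain_delay
                return url
            self.queue.append(url)                # still cooling down: rotate to back
        return None
```

Links discovered by crawler nodes would be passed to `add()`, which is how recursive crawling extends the frontier without revisiting pages.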
- Python 3.7+
- AWS account (optional, for S3 and SQS features)
- boto3 Python library (for AWS integration)
- Flask, Whoosh, Beautiful Soup, Requests
- Clone the repository
git clone https://github.com/yourusername/webCrawl.git
cd webCrawl
- Install dependencies
pip install -r requirements.txt
- Configure AWS credentials (optional)
aws configure
# Follow prompts to enter your AWS Access Key, Secret Key, and region
- Start the Master Node
python master/master_node.py --port 5000
- Start the Indexer Node
python indexer/indexer_node.py --port 5002 --crawler-api http://<MASTER_IP>:5000
- Start Crawler Nodes (can be on separate machines)
python crawler/crawler_node.py --master http://<MASTER_IP>:5000 --s3-bucket your-crawler-bucket --use-sqs
To run multiple crawler processes on the same machine:
# In separate terminal windows:
python crawler/crawler_node.py --master http://<MASTER_IP>:5000 --s3-bucket your-crawler-bucket --use-sqs
python crawler/crawler_node.py --master http://<MASTER_IP>:5000 --s3-bucket your-crawler-bucket --use-sqs
python crawler/crawler_node.py --master http://<MASTER_IP>:5000 --s3-bucket your-crawler-bucket --use-sqs
- Add seed URLs to start crawling
curl -X POST http://<MASTER_IP>:5000/add_urls -H "Content-Type: application/json" -d '{"urls":["https://en.wikipedia.org", "https://news.ycombinator.com"]}'
- Open your browser and navigate to http://<INDEXER_IP>:5002
- Enter your search query in the search box
- Review the results with titles, snippets, and metadata
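The `curl` call to `/add_urls` in the steps above can also be issued from Python, which is convenient when seeding from a script. The sketch below uses only the standard library and mirrors the endpoint, header, and JSON body of that `curl` command. The function names are hypothetical, and the shape of the master node's response is not documented here, so the raw status code and body text are returned unchanged.

```python
import json
import urllib.request


def build_add_urls_request(master_base_url, urls):
    """Build the POST request that mirrors the curl example (hypothetical helper)."""
    body = json.dumps({"urls": list(urls)}).encode("utf-8")
    return urllib.request.Request(
        master_base_url.rstrip("/") + "/add_urls",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def submit_seed_urls(master_base_url, urls, timeout=10):
    """Send the seed URLs and return (HTTP status, response text) as-is."""
    request = build_add_urls_request(master_base_url, urls)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status, response.read().decode("utf-8", errors="replace")


if __name__ == "__main__":
    # Replace the placeholder address with your master node's address.
    status, text = submit_seed_urls(
        "http://127.0.0.1:5000",
        ["https://en.wikipedia.org", "https://news.ycombinator.com"],
    )
    print(status, text)
```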
# Master node status
curl http://<MASTER_IP>:5000/status
# Indexer node status
curl http://<INDEXER_IP>:5002/status
# Run scalability tests
python test_system.py scalability --master http://<MASTER_IP>:5000 --indexer http://<INDEXER_IP>:5002 --crawlers <CRAWLER_IP_1> <CRAWLER_IP_2> <CRAWLER_IP_3>
- Adjust the number of crawler nodes based on desired throughput
- Modify crawler_node.py politeness settings to balance speed with website etiquette
- Ensure sufficient bandwidth and CPU resources on crawler machines
- Increase instance size for better search performance
- Configure Whoosh indexing parameters in indexer_node.py for optimal search
- Set the SQS visibility timeout longer than the time a crawler needs to finish one task, so in-flight messages are not redelivered to another crawler
- Choose S3 storage class based on access patterns
- Consider using AWS Elastic Beanstalk or ECS for easier scaling
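When tuning politeness settings, it can help to see how robots.txt rules and crawl delays combine into a per-request decision. The following sketch uses Python's standard `urllib.robotparser`. The helper names, the `webCrawlBot` user agent used in the usage note, and the one-second default delay are assumptions for illustration; the actual settings in crawler_node.py may be structured differently.

```python
from urllib.robotparser import RobotFileParser

DEFAULT_DELAY_SECONDS = 1.0  # assumed fallback, not taken from crawler_node.py


def load_rules(robots_txt_text):
    """Parse the text of a robots.txt file that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt_text.splitlines())
    return parser


def fetch_decision(parser, user_agent, url, default_delay=DEFAULT_DELAY_SECONDS):
    """Return (allowed, delay_seconds) for one URL.

    The delay is the larger of the site's Crawl-delay (if declared) and the
    crawler's own default, so lowering the default never overrides the site.
    """
    allowed = parser.can_fetch(user_agent, url)
    declared = parser.crawl_delay(user_agent)
    if declared is None:
        delay = default_delay
    else:
        delay = max(default_delay, float(declared))
    return allowed, delay
```

For example, with a robots.txt containing `Disallow: /private/` and `Crawl-delay: 5` for all user agents, `fetch_decision(rules, "webCrawlBot", "https://example.com/private/x")` reports the URL as disallowed and a delay of five seconds. Taking the maximum of the two delays is a conservative design choice: raising throughput then has to come from adding crawler nodes or crawling more domains, not from hitting one site faster than it asks.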
The system includes comprehensive testing tools:
# Run all tests
python test_system.py all --master http://<MASTER_IP>:5000 --indexer http://<INDEXER_IP>:5002
# Run just functional tests
python test_system.py functional --master http://<MASTER_IP>:5000 --indexer http://<INDEXER_IP>:5002
# Run fault tolerance tests
python test_system.py fault --master http://<MASTER_IP>:5000 --indexer http://<INDEXER_IP>:5002
# Run scalability tests with specific crawler counts
python test_system.py scalability --master http://<MASTER_IP>:5000 --indexer http://<INDEXER_IP>:5002 --counts 1 2 4 8
Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the MIT License - see the LICENSE file for details.

