A fully functional search engine built in Python that mimics Google's core functionality, including web crawling, indexing, ranking, and a web interface.
Live Demo | Full Documentation | Deploy Now
- Web Crawling: Automated web scraping with respectful crawling (delays, robots.txt respect)
- Text Indexing: TF-IDF based indexing with stemming and stopword removal
- Ranking Algorithm: Combines content relevance with simple PageRank-style scoring
- Query Processing: Advanced query preprocessing with spelling suggestions
- Result Snippets: Smart extraction of relevant text snippets with query highlighting
- Google-like UI: Clean, responsive design similar to Google's interface
- Search Results: Paginated results with relevance scores
- Spelling Suggestions: "Did you mean?" functionality for misspelled queries
- Mobile Responsive: Works on desktop and mobile devices
- SQLite Database: Efficient storage and retrieval of crawled pages
- RESTful API: JSON API endpoints for integration with other applications
- Modular Architecture: Separate components for crawling, indexing, and serving
- Error Handling: Robust error handling and logging
search_engine/
├── main.py            # Main entry point and CLI
├── crawler.py         # Web crawler implementation
├── search_engine.py   # Core search and ranking logic
├── web_app.py         # Flask web application
├── requirements.txt   # Python dependencies
├── templates/
│   └── search.html    # Web interface template
├── database.db        # SQLite database (created after crawling)
└── README.md          # This file
cd search_engine
pip install -r requirements.txt
For a quick demo with pre-configured URLs:
python main.py
This will:
- Crawl sample websites (Wikipedia, Python docs, etc.)
- Build the search index
- Provide instructions to start the web server
python main.py server
Then visit http://127.0.0.1:5000 in your browser.
# Crawl specific URLs
python main.py crawl https://example.com https://another-site.com
# Crawl with custom settings
python main.py crawl https://example.com --max-pages 100 --delay 1
# Test the search engine
python main.py test
# Default settings (localhost:5000)
python main.py server
# Custom host and port
python main.py server --host 0.0.0.0 --port 8080
# Production mode (no debug)
python main.py server --no-debug
# Basic search
curl "http://localhost:5000/api/search?q=python+programming"
# Limit results
curl "http://localhost:5000/api/search?q=web+development&max_results=5"curl "http://localhost:5000/api/suggest?q=pythong"from search_engine import SearchEngine
# Initialize search engine
engine = SearchEngine('database.db')
# Perform search
results = engine.search("python programming", max_results=10)
# Process results
for result in results:
    print(f"Title: {result['title']}")
    print(f"URL: {result['url']}")
    print(f"Score: {result['final_score']}")
    print(f"Snippet: {result['content_snippet']}")
    print("---")
- Respectful Crawling: Includes delays between requests and respects robots.txt
- Content Extraction: Uses BeautifulSoup to parse HTML and extract text content
- Link Discovery: Follows links to discover new pages
- Duplicate Prevention: Tracks visited URLs to avoid crawling the same page twice
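Put together, the crawl loop above can be sketched in a few lines. This is a simplified illustration, not the exact code in crawler.py (the function name and signature here are assumptions):

```python
import time
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser

import requests
from bs4 import BeautifulSoup

def crawl(seed_urls, max_pages=50, delay=1.0):
    visited, queue, pages = set(), list(seed_urls), []
    while queue and len(pages) < max_pages:
        url = queue.pop(0)
        if url in visited:          # duplicate prevention
            continue
        visited.add(url)
        robots = RobotFileParser(urljoin(url, "/robots.txt"))
        try:
            robots.read()
            if not robots.can_fetch("*", url):  # respect robots.txt
                continue
        except OSError:
            continue
        response = requests.get(url, timeout=10)
        soup = BeautifulSoup(response.text, "html.parser")
        pages.append((url, soup.get_text(" ", strip=True)))
        for link in soup.find_all("a", href=True):  # link discovery
            queue.append(urljoin(url, link["href"]))
        time.sleep(delay)           # respectful crawling delay
    return pages
```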
- Text Preprocessing: Tokenization, stemming, and stopword removal using NLTK
- TF-IDF Vectorization: Creates numerical representations of documents using scikit-learn
- Efficient Storage: Stores processed content in SQLite database
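For orientation, the indexing pipeline roughly corresponds to the sketch below; the actual organization in the repo may differ:

```python
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download("stopwords", quiet=True)
stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))

def preprocess(text):
    # Tokenize, drop stopwords, and stem each remaining token
    tokens = re.findall(r"[a-z]+", text.lower())
    return " ".join(stemmer.stem(t) for t in tokens if t not in stop_words)

documents = ["Python is a programming language.", "Flask is a web framework."]
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform([preprocess(d) for d in documents])
```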
- Query Processing: Applies same preprocessing to search queries
- Similarity Calculation: Uses cosine similarity between query and document vectors
- PageRank-style Scoring: Boosts authoritative domains and HTTPS sites
- Result Ranking: Combines content relevance with authority signals
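In code, the ranking step looks roughly like this, reusing `vectorizer`, `tfidf_matrix`, and `preprocess` from the indexing sketch above. `authority_score` is a hypothetical helper; one version is sketched under the authority boosts below:

```python
from sklearn.metrics.pairwise import cosine_similarity

def rank(query, urls):
    query_vec = vectorizer.transform([preprocess(query)])
    relevance = cosine_similarity(query_vec, tfidf_matrix).ravel()
    # Blend content relevance with per-URL authority signals
    final_scores = [rel * authority_score(url)
                    for rel, url in zip(relevance, urls)]
    return sorted(zip(urls, final_scores), key=lambda pair: pair[1], reverse=True)
```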
- Flask Backend: Lightweight web framework for serving search results
- Responsive Design: HTML/CSS interface that works on all devices
- Pagination: Handles large result sets with page navigation
- Real-time Search: Fast response times through efficient indexing
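A minimal version of the Flask layer might look like this; web_app.py in this repo is more complete, but the route and parameters here mirror the curl examples above:

```python
from flask import Flask, jsonify, request
from search_engine import SearchEngine

app = Flask(__name__)
engine = SearchEngine("database.db")

@app.route("/api/search")
def api_search():
    query = request.args.get("q", "")
    max_results = int(request.args.get("max_results", 10))
    return jsonify(engine.search(query, max_results=max_results))

if __name__ == "__main__":
    app.run(host="127.0.0.1", port=5000)
```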
CREATE TABLE pages (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    url TEXT UNIQUE,
    title TEXT,
    content TEXT,
    keywords TEXT,
    crawl_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
- Term Frequency: How often a term appears in a document
- Inverse Document Frequency: How rare a term is across all documents
- Combined Score: TF × IDF gives the relevance score for each term
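A quick worked example with the classic definition (scikit-learn's TfidfVectorizer uses a smoothed variant, so its exact numbers differ):

```python
import math

tf = 3 / 100               # term appears 3 times in a 100-word document
idf = math.log(1000 / 10)  # term appears in 10 of 1,000 documents
print(tf * idf)            # ~0.138: the term's relevance weight
```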
- HTTPS sites get 10% boost
- Shorter URLs get 5% boost
- Authoritative domains (Wikipedia, official docs) get 20% boost
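One way to express these boosts as a helper (illustrative only; in this repo they are applied inside calculate_page_rank(), shown under Customization below, and the "shorter URL" cutoff here is an assumption):

```python
AUTHORITATIVE_DOMAINS = ("wikipedia.org", "docs.python.org")

def authority_score(url):
    score = 1.0
    if url.startswith("https://"):
        score *= 1.10  # HTTPS sites: +10%
    if len(url) < 50:
        score *= 1.05  # shorter URLs: +5% (length cutoff is illustrative)
    if any(domain in url for domain in AUTHORITATIVE_DOMAINS):
        score *= 1.20  # authoritative domains: +20%
    return score
```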
- Edit distance algorithm to find similar words in vocabulary
- Suggests alternatives for misspelled query terms
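The standard library's difflib gives a compact approximation of this behavior (shown for illustration; the repo may compute edit distance directly):

```python
import difflib

vocabulary = ["python", "programming", "flask", "crawler"]

def suggest(term, cutoff=0.75):
    # Return the closest vocabulary word, or None if nothing is similar
    matches = difflib.get_close_matches(term.lower(), vocabulary,
                                        n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(suggest("pythong"))  # -> "python"
```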
This project includes configurations for one-click deployment to popular cloud platforms:
- Fork this repository
- Connect your GitHub account to Render
- Create a new "Web Service" from your forked repository
- Render will automatically detect the render.yaml configuration
- Your search engine will be live with pre-populated data!
- Fork this repository
- Connect your GitHub account to Railway
- Deploy from your forked repository
- Railway will use the railway.json configuration
- Fork this repository
- Connect your Heroku account to GitHub
- Create a new app and connect it to your forked repository
- The Procfile will handle the deployment
The deploy.py script automatically:
- Installs dependencies from requirements.txt
- Crawls and indexes content from popular programming sites:
- Python documentation and Wikipedia
- JavaScript, Machine Learning, Web Development topics
- Algorithm and Computer Science resources
- Starts the web server on the platform's assigned port
- Provides search functionality immediately after deployment
# Clone and setup
git clone https://github.com/ShubhamPhapale/search-engine-proto.git
cd search-engine-proto
pip install -r requirements.txt
# Initialize with sample data
python deploy.py
Edit the calculate_page_rank() method in search_engine.py:
def calculate_page_rank(self, urls):
    scores = []
    for url in urls:
        score = 1.0
        # Add your custom ranking factors here
        if 'github.com' in url:
            score *= 1.3  # Boost GitHub repos
        if url.count('/') < 4:  # Homepage or top-level page
            score *= 1.1
        scores.append(score)
    return scores
Modify crawler.py to:
- Respect robots.txt
- Add custom headers
- Filter specific content types
- Implement rate limiting per domain
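For example, custom headers, a content-type filter, and per-domain rate limiting could start from a sketch like this (illustrative; adapt to crawler.py's actual structure, and note the bot identity URL is a placeholder):

```python
import time
from urllib.parse import urlparse

import requests

HEADERS = {"User-Agent": "MySearchBot/1.0 (+https://example.com/bot)"}  # placeholder identity
MIN_INTERVAL = 2.0   # seconds between requests to the same domain
last_hit = {}

def polite_get(url):
    domain = urlparse(url).netloc
    wait = MIN_INTERVAL - (time.time() - last_hit.get(domain, 0.0))
    if wait > 0:
        time.sleep(wait)             # per-domain rate limit
    last_hit[domain] = time.time()
    response = requests.get(url, headers=HEADERS, timeout=10)
    if "text/html" not in response.headers.get("Content-Type", ""):
        return None                  # filter out non-HTML content types
    return response
```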
The HTML template in templates/search.html can be customized to:
- Change the design and branding
- Add filters and advanced search options
- Implement auto-complete functionality
- Add search analytics
- Database: Consider PostgreSQL for better performance with large datasets
- Indexing: Implement Elasticsearch for faster full-text search
- Caching: Add Redis for caching frequent queries
- Load Balancing: Use multiple server instances behind a load balancer
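As a concrete example of the caching idea, query results could be memoized in Redis along these lines (assumes a local Redis server and the redis-py client; the key format and TTL are illustrative):

```python
import json

import redis

cache = redis.Redis(host="localhost", port=6379, db=0)

def cached_search(engine, query, max_results=10, ttl=300):
    key = f"search:{query}:{max_results}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)    # serve frequent queries from cache
    results = engine.search(query, max_results=max_results)
    cache.setex(key, ttl, json.dumps(results))
    return results
```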
- Current implementation loads all documents into memory for TF-IDF
- For large corpora, consider incremental learning or external indexing
- Implement parallel crawling with threading/async
- Add distributed crawling across multiple machines
- Implement smarter URL prioritization
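Threaded fetching with the standard library is one starting point for parallel crawling (a sketch only; per-domain politeness, as in the rate-limiting example above, still applies):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

import requests

def fetch(url):
    return url, requests.get(url, timeout=10).text

def crawl_parallel(urls, workers=8):
    pages = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(fetch, u) for u in urls]
        for future in as_completed(futures):
            try:
                url, html = future.result()
                pages[url] = html
            except requests.RequestException:
                pass  # skip unreachable pages
    return pages
```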
- Scale: Designed for thousands of pages, not millions like Google
- Ranking: Simplified PageRank without link graph analysis
- Language: Currently optimized for English content only
- Real-time: Index updates require full rebuild
- Security: No input sanitization for production use
- Image search capability
- Real-time index updates
- Machine learning-based ranking
- Multi-language support
- Advanced query operators (site:, filetype:, etc.)
- Search analytics and user behavior tracking
- Distributed architecture for scalability
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
This project is open source and available under the MIT License.
- Inspired by Google's search architecture
- Built with Python's excellent ecosystem (Flask, scikit-learn, NLTK, BeautifulSoup)
- Thanks to the open source community for the foundational libraries