A professional-grade Python scraper for extracting emails, phone numbers, and contact information from websites and PDFs. Supports both basic HTML scraping and advanced Selenium-based scraping for JavaScript-heavy websites.
- Dual Scraping Modes:
  - Basic mode (Requests + BeautifulSoup) - Fast and efficient
  - Advanced mode (Selenium) - JavaScript support, dynamic content
- Multi-Format Support:
  - Websites with configurable crawl depth (1-3 levels)
  - PDF documents
- Contact Information Extraction:
  - Email addresses (including hidden formats like [at], [dot])
  - Phone numbers (multiple formats)
- Advanced Data Processing:
  - Automatic duplicate removal
  - Name extraction from email addresses
  - Medical context detection
  - Hidden email format normalization
- SSL certificate error handling
- Automatic duplicate prevention
- Same-domain crawling only
- Configurable page limits
- Comprehensive error handling
- CSV export with columns: Name, Email, Phone, URL, Source
- Beautiful formatted console display
- Duplicate checking across sessions
- Python 3.7+
- pip (Python package manager)
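The hidden email formats mentioned in the feature list (`[at]`, `[dot]`) could be normalized along the lines below. This is a minimal sketch, not the project's actual code: the function names `normalize_hidden_emails` and `extract_emails`, the accepted bracket styles, and the email regex are all assumptions made for illustration.

```python
import re

# Assumed obfuscation styles: "[at]", "(at)", "[dot]", "(dot)", with optional spaces.
_AT = re.compile(r"\s*[\[\(]\s*at\s*[\]\)]\s*", re.IGNORECASE)
_DOT = re.compile(r"\s*[\[\(]\s*dot\s*[\]\)]\s*", re.IGNORECASE)
# Deliberately simple email pattern; real-world addresses can be more varied.
_EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def normalize_hidden_emails(text):
    """Rewrite 'name [at] host [dot] org' style text into 'name@host.org'."""
    text = _AT.sub("@", text)
    return _DOT.sub(".", text)


def extract_emails(text):
    """Return unique, lowercased email addresses in first-seen order."""
    seen = set()
    result = []
    for match in _EMAIL.findall(normalize_hidden_emails(text)):
        email = match.lower()
        if email not in seen:
            seen.add(email)
            result.append(email)
    return result


if __name__ == "__main__":
    print(extract_emails("Write to jane.doe [at] example [dot] org"))
```

Normalizing before matching keeps the email pattern itself simple, at the cost of occasionally rewriting ordinary text that happens to contain "(at)".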
python main.py
╔════════════════════════════════════════════════════════════════════════════╗
║ Echo Scraper - Advanced Edition ║
╚════════════════════════════════════════════════════════════════════════════╝
Select Input Type:
1) Website - Basic Scraping (Requests + BeautifulSoup) - Fast
2) Website - Advanced Scraping (Selenium) - JavaScript Support
3) PDF File - Extract from PDF Document
4) Exit
Enter your choice (1-4): 1
Enter Website URL: example-hospital.com
Enter crawl depth (1-3, default 2): 2
Enter your choice (1-4): 2
Enter Website URL: reactjs-hospital-site.com
Enter crawl depth (1-3, default 2): 2
Select Browser:
1) Chrome (Recommended)
2) Firefox
Select browser (1-2, default 1): 1
Enter your choice (1-4): 3
Enter PDF file path: /path/to/directory.pdf
Echo_scraper/
├── main.py # Main entry point with menu UI
├── scraper.py # Basic scraper (Requests + BeautifulSoup)
├── advanced_scraper.py # Advanced scraper (Selenium)
├── pdf_extractor.py # PDF text extraction
├── utils.py # Utilities (email, phone, name extraction)
├── requirements.txt # Dependencies
├── output.csv # Generated output file
└── README.md # This file
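The tree lists `utils.py` as the home of phone and name extraction. The sketch below shows one way such helpers might look. It is illustrative only: `extract_phones`, `name_from_email`, the 10 to 15 digit rule, and the list of generic mailboxes are assumptions, not the project's real implementation.

```python
import re

# Assumed pattern: optional "+", then digits separated by spaces, dots, dashes or parentheses.
_PHONE = re.compile(r"\+?\(?\d[\d ().-]{7,}\d")

# Hypothetical list of shared mailboxes that do not map to a person's name.
GENERIC_MAILBOXES = {"info", "contact", "admin", "support", "office"}


def extract_phones(text):
    """Return unique phone-like strings that contain 10 to 15 digits."""
    seen = set()
    phones = []
    for match in _PHONE.findall(text):
        digits = re.sub(r"\D", "", match)  # compare numbers by digits only
        if 10 <= len(digits) <= 15 and digits not in seen:
            seen.add(digits)
            phones.append(match.strip())
    return phones


def name_from_email(email):
    """Guess a display name from the local part of an address."""
    local = email.split("@", 1)[0]
    if local.lower() in GENERIC_MAILBOXES:
        return ""
    # Split on common separators and drop digits such as birth years.
    parts = [p for p in re.split(r"[._\-+\d]+", local) if p]
    return " ".join(p.capitalize() for p in parts)


if __name__ == "__main__":
    print(extract_phones("Call (555) 123-4567 or +1 555.123.4567"))
    print(name_from_email("jane.doe@example.org"))
```

Deduplicating on the digit string means the same number written with different separators is reported once.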
- Menu-driven user interface
- Orchestrates different scraping modes
- Handles CSV export and formatting
- Result display and user interaction
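CSV export with duplicate checking across sessions could work roughly as sketched here, using only the standard library. The column names come from this README; the function names `load_seen_keys` and `append_contacts`, and the choice to deduplicate on email (falling back to phone), are assumptions.

```python
import csv
import os

FIELDS = ["Name", "Email", "Phone", "URL", "Source"]


def _key(row):
    """Deduplication key: lowercased email, or the phone if no email is present."""
    email = (row.get("Email") or "").strip().lower()
    return email or (row.get("Phone") or "").strip()


def load_seen_keys(path):
    """Collect keys of rows already stored by earlier sessions."""
    if not os.path.exists(path):
        return set()
    with open(path, newline="", encoding="utf-8") as handle:
        return {_key(row) for row in csv.DictReader(handle) if _key(row)}


def append_contacts(path, contacts):
    """Append only contacts not already in the file; return the number written."""
    seen = load_seen_keys(path)
    is_new_file = not os.path.exists(path) or os.path.getsize(path) == 0
    written = 0
    with open(path, "a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS, extrasaction="ignore")
        if is_new_file:
            writer.writeheader()  # header only once, on first creation
        for contact in contacts:
            key = _key(contact)
            if not key or key in seen:
                continue
            seen.add(key)
            writer.writerow(contact)
            written += 1
    return written
```

Reading the existing file before appending is what carries the duplicate check from one run to the next.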
BasicScraper Class
- Fast HTML scraping with Requests + BeautifulSoup
- Internal link crawling (same domain only)
- Configurable depth and page limits
- Automatic duplicate prevention
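Same-domain crawling with depth and page limits can be expressed as a small breadth-first search. The sketch below is an assumption about how such a crawler might be structured, not the `BasicScraper` code itself: `crawl`, `same_domain`, and the caller-supplied `fetch_links` function are hypothetical names, and depth 1 is taken to mean "the start page only".

```python
from collections import deque
from urllib.parse import urldefrag, urljoin, urlparse


def same_domain(url, root):
    """True when url has the same host as root (case-insensitive)."""
    return urlparse(url).netloc.lower() == urlparse(root).netloc.lower()


def crawl(start_url, fetch_links, max_depth=2, max_pages=50):
    """Breadth-first crawl restricted to start_url's domain.

    fetch_links(url) is supplied by the caller and returns the hrefs
    found on that page (for example via Requests + BeautifulSoup).
    """
    visited = []
    seen = {start_url}  # prevents queuing the same URL twice
    queue = deque([(start_url, 1)])
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= max_depth:
            continue  # do not follow links beyond the configured depth
        for href in fetch_links(url):
            # Resolve relative links and drop "#fragment" parts.
            link = urldefrag(urljoin(url, href)).url
            if link not in seen and same_domain(link, start_url):
                seen.add(link)
                queue.append((link, depth + 1))
    return visited
```

Passing `fetch_links` in as a parameter keeps the traversal logic testable without network access.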
AdvancedScraper Class
- Selenium WebDriver integration
- JavaScript content rendering
- Dynamic content loading
- Browser selection (Chrome/Firefox)
- Headless mode for server environments
- PDF text extraction using pdfplumber
- Handles multi-page PDFs
- Error handling for corrupted PDFs
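Multi-page extraction with pdfplumber, including a fallback for unreadable files, might look like the sketch below. The helper names `join_page_text` and `extract_pdf_text` are hypothetical, and returning an empty string on failure is an assumed policy rather than the project's documented behavior.

```python
def join_page_text(pages):
    """Concatenate text from page-like objects, skipping pages without text."""
    chunks = []
    for page in pages:
        # extract_text() can yield nothing for image-only pages.
        text = page.extract_text() or ""
        if text.strip():
            chunks.append(text)
    return "\n".join(chunks)


def extract_pdf_text(path):
    """Return all text in a PDF, or an empty string if the file cannot be read."""
    import pdfplumber  # third-party; imported lazily so join_page_text has no dependency

    try:
        with pdfplumber.open(path) as pdf:
            return join_page_text(pdf.pages)
    except Exception as exc:  # broad on purpose: corrupted files fail in varied ways
        print("Could not read {}: {}".format(path, exc))
        return ""
```

The returned text can then be passed to the same email and phone extraction used for web pages.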
This project is provided for educational and authorized use only.
For issues or feature requests, review the code documentation or check common troubleshooting solutions above.
Last Updated: March 2026
Version: 2.0 (Advanced Edition)
Status: Production Ready