CyberHunter12-ui/Web_Scraper

Echo Scraper

A professional-grade Python scraper for extracting emails, phone numbers, and contact information from websites and PDFs. Supports both basic HTML scraping and advanced Selenium-based scraping for JavaScript-heavy websites.

Features

🎯 Core Capabilities

  • Dual Scraping Modes:
    • Basic mode (Requests + BeautifulSoup) - Fast and efficient
    • Advanced mode (Selenium) - JavaScript support, dynamic content
  • Multi-Format Support:
    • Websites with configurable crawl depth (1-3 levels)
    • PDF documents
  • Contact Information Extraction:
    • Email addresses (including obfuscated formats such as [at] and [dot])
    • Phone numbers (multiple formats)
  • Advanced Data Processing:
    • Automatic duplicate removal
    • Name extraction from email addresses
    • Medical context detection
    • Hidden email format normalization
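
The email-related capabilities above (obfuscated format normalization, duplicate removal, name extraction) can be sketched roughly as follows. This is a minimal illustration, not the code in utils.py: the function names `normalize_obfuscated`, `extract_emails`, and `name_from_email` and the regular expressions are assumptions made for this example.

```python
import re

# Hypothetical patterns; the project's real patterns may differ.
_AT = re.compile(r"\s*[\[\(]\s*at\s*[\]\)]\s*", re.IGNORECASE)
_DOT = re.compile(r"\s*[\[\(]\s*dot\s*[\]\)]\s*", re.IGNORECASE)
_EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def normalize_obfuscated(text):
    """Rewrite '[at]' / '(at)' as '@' and '[dot]' / '(dot)' as '.'."""
    text = _AT.sub("@", text)
    return _DOT.sub(".", text)


def extract_emails(text):
    """Return unique lowercase email addresses in order of first appearance."""
    seen = set()
    result = []
    for match in _EMAIL.findall(normalize_obfuscated(text)):
        email = match.lower()
        if email not in seen:
            seen.add(email)
            result.append(email)
    return result


def name_from_email(email):
    """Guess a display name from the local part, e.g. 'jane.doe' -> 'Jane Doe'."""
    local = email.split("@", 1)[0]
    # Keep only purely alphabetic pieces; digits and mixed tokens are dropped.
    parts = [p for p in re.split(r"[._\-+]+", local) if p.isalpha()]
    return " ".join(p.capitalize() for p in parts)
```

A name guessed this way is only a heuristic: role addresses such as info@ or contact@ yield a word that is not a person's name, so a real pipeline would likely filter those.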

🔒 Reliability Features

  • SSL certificate error handling
  • Automatic duplicate prevention
  • Same-domain crawling only
  • Configurable page limits
  • Comprehensive error handling

📊 Output

  • CSV export with columns: Name, Email, Phone, URL, Source
  • Formatted console display
  • Duplicate checking across sessions
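
One way the CSV export with cross-session duplicate checking could work is sketched below. The column names come from the list above. The helper name `append_unique` and the choice of (Email, Phone) as the duplicate key are assumptions for illustration; the project may define duplicates differently.

```python
import csv
import os

FIELDS = ["Name", "Email", "Phone", "URL", "Source"]


def _key(row):
    # Assumed duplicate key: case-insensitive email plus phone.
    return ((row.get("Email") or "").lower(), row.get("Phone") or "")


def append_unique(rows, path="output.csv"):
    """Append rows to the CSV, skipping any already present from earlier runs.

    Returns the number of rows actually written.
    """
    file_exists = os.path.exists(path) and os.path.getsize(path) > 0
    seen = set()
    if file_exists:
        # Load keys from previous sessions so duplicates are skipped.
        with open(path, newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f):
                seen.add(_key(row))

    written = 0
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if not file_exists:
            writer.writeheader()
        for row in rows:
            key = _key(row)
            if key in seen:
                continue
            seen.add(key)
            writer.writerow({field: row.get(field, "") for field in FIELDS})
            written += 1
    return written
```

Reading the existing file before appending is what makes the check span sessions; for very large output files a persistent index would be cheaper, but that is beyond this sketch.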

Installation

Prerequisites

  • Python 3.7+
  • pip (Python package manager)

Usage

Quick Start

python main.py

Menu Options

╔════════════════════════════════════════════════════════════════════════════╗
║                      Echo Scraper - Advanced Edition                       ║
╚════════════════════════════════════════════════════════════════════════════╝

Select Input Type:
  1) Website - Basic Scraping (Requests + BeautifulSoup) - Fast
  2) Website - Advanced Scraping (Selenium) - JavaScript Support
  3) PDF File - Extract from PDF Document
  4) Exit

Example Workflows

Option 1: Basic Website Scraping

Enter your choice (1-4): 1
Enter Website URL: example-hospital.com
Enter crawl depth (1-3, default 2): 2
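
The prompts in this workflow accept a bare domain and a depth with a default. A minimal sketch of how such input might be validated is shown here; `normalize_url` and `parse_depth` are hypothetical helpers, and defaulting to https:// for scheme-less input is an assumption rather than documented behavior.

```python
from urllib.parse import urlparse


def normalize_url(raw):
    """Add a scheme to bare domains such as 'example-hospital.com'."""
    raw = raw.strip()
    if not raw:
        raise ValueError("URL must not be empty")
    if "://" not in raw:
        raw = "https://" + raw  # assumed default scheme
    parsed = urlparse(raw)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"Not a valid http(s) URL: {raw!r}")
    return raw


def parse_depth(raw, default=2, lowest=1, highest=3):
    """Parse the crawl depth prompt: empty or non-numeric input gives the default,
    and out-of-range numbers are clamped to the 1-3 range."""
    raw = raw.strip()
    if not raw:
        return default
    try:
        value = int(raw)
    except ValueError:
        return default
    return max(lowest, min(highest, value))
```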

Option 2: Advanced Website Scraping (JavaScript)

Enter your choice (1-4): 2
Enter Website URL: reactjs-hospital-site.com
Enter crawl depth (1-3, default 2): 2
Select Browser:
  1) Chrome (Recommended)
  2) Firefox
Select browser (1-2, default 1): 1

Option 3: PDF Processing

Enter your choice (1-4): 3
Enter PDF file path: /path/to/directory.pdf

Project Structure

Echo_scraper/
├── main.py                 # Main entry point with menu UI
├── scraper.py              # Basic scraper (Requests + BeautifulSoup)
├── advanced_scraper.py     # Advanced scraper (Selenium)
├── pdf_extractor.py        # PDF text extraction
├── utils.py                # Utilities (email, phone, name extraction)
├── requirements.txt        # Dependencies
├── output.csv              # Generated output file
└── README.md               # This file

File Descriptions

main.py

  • Menu-driven user interface
  • Orchestrates different scraping modes
  • Handles CSV export and formatting
  • Result display and user interaction

scraper.py

BasicScraper Class

  • Fast HTML scraping with Requests + BeautifulSoup
  • Internal link crawling (same domain only)
  • Configurable depth and page limits
  • Automatic duplicate prevention
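
The crawling behavior listed above (same-domain links only, a depth limit, a page limit, no repeat visits) can be illustrated with a breadth-first traversal. This is a sketch, not the BasicScraper implementation: `crawl`, `same_domain`, and the injected `fetch_links` callable are hypothetical, and treating depth 1 as "start page only" is an assumption. In practice `fetch_links` would download the page with Requests and collect `href` values with BeautifulSoup.

```python
from collections import deque
from urllib.parse import urldefrag, urljoin, urlparse


def same_domain(url, root):
    """True when both URLs share the same host (case-insensitive)."""
    return urlparse(url).netloc.lower() == urlparse(root).netloc.lower()


def crawl(start_url, fetch_links, max_depth=2, max_pages=50):
    """Breadth-first crawl returning visited URLs in visit order.

    fetch_links(url) must return an iterable of href strings found on the page.
    """
    visited = []
    seen = {start_url}               # prevents queuing the same URL twice
    queue = deque([(start_url, 1)])  # assumed: depth 1 is the start page itself

    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= max_depth:
            continue  # do not follow links beyond the configured depth
        for href in fetch_links(url):
            # Resolve relative links and drop '#fragment' parts.
            link = urldefrag(urljoin(url, href)).url
            if link not in seen and same_domain(link, start_url):
                seen.add(link)
                queue.append((link, depth + 1))
    return visited
```

Passing the link fetcher in as a parameter keeps the traversal logic testable without network access.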

advanced_scraper.py

AdvancedScraper Class

  • Selenium WebDriver integration
  • JavaScript content rendering
  • Dynamic content loading
  • Browser selection (Chrome/Firefox)
  • Headless mode for server environments
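
For reference, headless rendering with a selectable browser might look roughly like the sketch below. It uses Selenium 4's `webdriver.ChromeOptions`, `webdriver.FirefoxOptions`, `driver.get`, and `driver.page_source`; the helper names `pick_browser` and `render_page` are hypothetical and do not reflect the actual AdvancedScraper interface. The `--headless=new` flag assumes a reasonably recent Chrome.

```python
def pick_browser(choice, default="chrome"):
    """Map the menu input ('1' or '2') to a browser name, falling back to the default."""
    return {"1": "chrome", "2": "firefox"}.get(choice.strip(), default)


def render_page(url, browser="chrome"):
    """Load a page in a headless browser and return the rendered HTML."""
    # Imported lazily so the module loads even when Selenium is not installed.
    from selenium import webdriver

    if browser == "firefox":
        options = webdriver.FirefoxOptions()
        options.add_argument("-headless")
        driver = webdriver.Firefox(options=options)
    else:
        options = webdriver.ChromeOptions()
        options.add_argument("--headless=new")  # assumes a recent Chrome build
        driver = webdriver.Chrome(options=options)

    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()  # always release the browser process
```

Content that loads after the initial page render may need an explicit wait (for example Selenium's `WebDriverWait`) before reading `page_source`; that is omitted here for brevity.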

pdf_extractor.py

  • PDF text extraction using pdfplumber
  • Handles multi-page PDFs
  • Error handling for corrupted PDFs
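
A minimal sketch of multi-page extraction with pdfplumber follows, using `pdfplumber.open`, `pdf.pages`, and `page.extract_text()`. The function names are hypothetical, and returning an empty string on failure is an assumed error-handling policy, not necessarily what pdf_extractor.py does.

```python
def join_page_text(page_texts):
    """Join per-page text, skipping pages that yielded None or an empty string."""
    return "\n".join(text for text in page_texts if text)


def extract_pdf_text(path):
    """Return the text of every page in the PDF, or '' if it cannot be read."""
    # Imported lazily so the module loads even when pdfplumber is not installed.
    import pdfplumber

    try:
        with pdfplumber.open(path) as pdf:
            return join_page_text(page.extract_text() for page in pdf.pages)
    except Exception as exc:  # corrupted or unreadable file (assumed policy)
        print(f"Could not read {path}: {exc}")
        return ""
```

The returned text can then be passed to the same email and phone extraction used for web pages.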

License

This project is provided for educational and authorized use only.

Support

For issues or feature requests, review the code documentation.


Last Updated: March 2026
Version: 2.0 (Advanced Edition)
Status: Production Ready

About

A simple, powerful Python script for scraping emails and other contact data from websites.
