crawlee-cloud/awesome-web-scraping

Awesome Web Scraping

A curated list of web scraping tools, libraries, and resources.

Maintained by Crawlee Cloud, a self-hosted web scraping platform.



🤖 AI & LLM Scraping

Tools for extracting data for Large Language Models.

| Tool | Language | Description |
| --- | --- | --- |
| Crawl4AI | Python | Open-source LLM-friendly web crawler and scraper |
| Firecrawl | Multi | Turn websites into LLM-ready markdown |
| ScrapeGraphAI | Python | Python library for graph-based AI scraping |
| Stagehand | TypeScript | AI-powered programmable browser |

🕷️ Scraping Frameworks

Full-featured frameworks for building web scrapers.

| Tool | Language | Description |
| --- | --- | --- |
| Colly | Go | Fast and elegant scraping framework |
| Crawlee | TypeScript | Reliable crawling library with autoscaling, session management, and stealth features |
| Ferret | Go | Declarative web scraping |
| Scrapy | Python | Battle-tested framework for large-scale scraping |

🎭 Browser Automation

Headless browser control for JavaScript-heavy sites.

| Tool | Language | Description |
| --- | --- | --- |
| Browserbase | - | Serverless headless browser platform |
| Cypress | JavaScript | E2E testing with scraping capabilities |
| Playwright | Multi | Cross-browser automation by Microsoft |
| Puppeteer | JavaScript | Headless Chrome/Chromium control |
| rod | Go | High-level Chrome DevTools controller |
| Selenium | Multi | Industry standard browser automation |
| Steel | - | Browser API for AI agents |

🛡️ Anti-Detection

Tools for avoiding bot detection and CAPTCHAs.

| Tool | Description |
| --- | --- |
| Camoufox | Stealthy Firefox automation |
| curl-impersonate | curl with browser TLS fingerprints |
| Rebrowser Patches | Playwright/Puppeteer anti-detection |
| undetected-chromedriver | Selenium patch for anti-detection |

⛏️ Data Extraction

HTML parsing and data extraction libraries.

| Tool | Language | Description |
| --- | --- | --- |
| Beautiful Soup | Python | HTML/XML parsing |
| Cheerio | JavaScript | Fast jQuery-like HTML parsing |
| jsdom | JavaScript | DOM implementation for Node.js |
| lxml | Python | High-performance XML/HTML processing |
| Parsel | Python | XPath/CSS selector extraction |
| Selectolax | Python | Ultra-fast HTML5 parser using Modest engine |
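
To give a feel for what these libraries automate, here is a minimal sketch of link extraction using only Python's standard-library `html.parser`. It is a stand-in, not the API of any tool listed above. The `LinkExtractor` class, the `extract_links` helper, and the sample markup are all hypothetical names made up for illustration.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect (href, text) pairs for every <a> element that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> currently being read, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text while we are inside an <a> with an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html):
    """Parse an HTML string and return the collected (href, text) pairs."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical sample markup, used only to exercise the sketch.
SAMPLE_HTML = """
<ul>
  <li><a href="/docs">Documentation</a></li>
  <li><a href="/blog">Blog <em>posts</em></a></li>
</ul>
"""

links = extract_links(SAMPLE_HTML)
```

The dedicated libraries in the table replace this kind of hand-written state tracking with CSS or XPath selectors and are more tolerant of malformed markup.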

🧩 CAPTCHA Solving

Services and libraries to solve CAPTCHAs.

| Tool | Type | Description |
| --- | --- | --- |
| 2Captcha | Service | Human-powered CAPTCHA solving service |
| Anti-Captcha | Service | Reliable CAPTCHA solving API |
| CapMonster Cloud | Service | AI-powered cloud CAPTCHA solving service |
| CapSolver | Service | AI-powered CAPTCHA solving |
| nocaptchaai | Service | AI solution for reCAPTCHA/hCaptcha |

🌐 Proxy Management

Rotating proxies and IP management.

| Type | Description |
| --- | --- |
| Residential | Real user IPs, higher trust |
| Datacenter | Fast, cheap, easily detected |
| Mobile | 4G/5G IPs, highest trust |
| ISP | Static residential IPs |

Popular providers: Bright Data, Oxylabs, Smartproxy, IPRoyal
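
Rotation itself can be as simple as round-robin over a pool. The sketch below shows one possible approach using only the standard library. The proxy URLs are placeholders on the reserved `.example` domain, and `proxy_rotation` is a hypothetical helper, not part of any provider's SDK. Many HTTP clients accept a scheme-to-URL mapping of this shape, but check your client's documentation before relying on it.

```python
from itertools import cycle

# Placeholder endpoints (assumption): substitute the addresses your provider issues.
PROXIES = [
    "http://proxy-a.example:8000",
    "http://proxy-b.example:8000",
    "http://proxy-c.example:8000",
]


def proxy_rotation(proxies):
    """Yield one proxy mapping per request, cycling through the pool forever."""
    if not proxies:
        raise ValueError("proxy pool is empty")
    for proxy in cycle(proxies):
        # Route both plain and TLS traffic through the same endpoint.
        yield {"http": proxy, "https": proxy}


rotation = proxy_rotation(PROXIES)

# Take four consecutive assignments; the fourth wraps back to the first proxy.
first_four = [next(rotation)["http"] for _ in range(4)]
```

Round-robin spreads requests evenly but ignores proxy health. A production setup would typically also drop or cool down endpoints that start returning blocks or timeouts.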


🧰 Utils & User Agents

Helper libraries for common scraping tasks.

| Tool | Language | Description |
| --- | --- | --- |
| fake-useragent | Python | Random User-Agent generator |
| protego | Python | Pure-Python robots.txt parser |
| robots-parser | JavaScript | robots.txt parser for Node.js |
| user-agents | JavaScript | Comprehensive User-Agent generator |
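
As an illustration of what a robots.txt parser does, the following sketch uses Python's built-in `urllib.robotparser` rather than any of the libraries above. The robots.txt content, the `example-bot` user agent, and the URLs are invented for the example. The rules are parsed from an inline string, so no network access is needed.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download this from the target site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Ask whether our (made-up) user agent may fetch each URL.
can_fetch_public = parser.can_fetch("example-bot", "https://example.com/products")
can_fetch_private = parser.can_fetch("example-bot", "https://example.com/private/data")

# Seconds to wait between requests, or None if the file sets no Crawl-delay.
delay = parser.crawl_delay("example-bot")
```

Checking `can_fetch` before each request and sleeping for `delay` between requests is a simple way to keep a scraper within the limits a site publishes.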

⏰ Scheduling

Job scheduling for recurring scrapes.

| Tool | Language | Description |
| --- | --- | --- |
| APScheduler | Python | Advanced scheduler |
| BullMQ | JavaScript | Redis-based job queue |
| Celery | Python | Distributed task queue |
| node-cron | JavaScript | Cron-like scheduler |

📚 Tutorials

Learning resources for web scraping.


🚀 Self-Hosted Platforms

Run scrapers on your own infrastructure.

| Platform | Description |
| --- | --- |
| Crawlee Cloud | Open-source, self-hosted Actor platform |

🤝 Contributing

Contributions welcome! Please read the contributing guidelines first.


📝 License

CC0
