Awesome Web Scraping

A curated list of web scraping tools, libraries, and resources.

Maintained by Crawlee Cloud — self-hosted web scraping platform.

📖 Contents

AI & LLM Scraping
Scraping Frameworks
Browser Automation
Anti-Detection
Data Extraction
CAPTCHA Solving
Proxy Management
Utils & User Agents
Scheduling
Tutorials
Self-Hosted Platforms
Contributing

🤖 AI & LLM Scraping

Tools that crawl and extract web content in formats suited to Large Language Models.

Tool	Language	Description
Crawl4AI	Python	Open-source LLM-friendly web crawler and scraper
Firecrawl	Multi	Turn websites into LLM-ready markdown
ScrapeGraphAI	Python	Graph-based scraping library driven by LLMs
Stagehand	TypeScript	AI-powered programmable browser

🕷️ Scraping Frameworks

Full-featured frameworks for building web scrapers.

Tool	Language	Description
Colly	Go	Fast and elegant scraping framework
Crawlee	TypeScript	Reliable crawling library with autoscaling, session management, and stealth features
Ferret	Go	Declarative web scraping
Scrapy	Python	Battle-tested framework for large-scale scraping

🎭 Browser Automation

Headless browser control for JavaScript-heavy sites.

Tool	Language	Description
Browserbase	-	Serverless headless browser platform
Cypress	JavaScript	E2E testing with scraping capabilities
Playwright	Multi	Cross-browser automation by Microsoft
Puppeteer	JavaScript	Headless Chrome/Chromium control
rod	Go	High-level Chrome DevTools controller
Selenium	Multi	Industry standard browser automation
Steel	-	Browser API for AI agents

🛡️ Anti-Detection

Tools for avoiding bot detection and CAPTCHAs.

Tool	Description
Camoufox	Stealthy Firefox automation
curl-impersonate	curl with browser TLS fingerprints
Rebrowser Patches	Playwright/Puppeteer anti-detection
undetected-chromedriver	Selenium patch for anti-detection

⛏️ Data Extraction

HTML parsing and data extraction libraries.

Tool	Language	Description
Beautiful Soup	Python	HTML/XML parsing
Cheerio	JavaScript	Fast jQuery-like HTML parsing
jsdom	JavaScript	DOM implementation for Node.js
lxml	Python	High-performance XML/HTML processing
Parsel	Python	XPath/CSS selector extraction
Selectolax	Python	Ultra-fast HTML5 parser using Modest engine
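
As a rough illustration of what the parsing libraries above do, the sketch below pulls link targets out of an HTML fragment. It deliberately uses Python's standard-library `html.parser` as a stand-in, not any of the listed tools, so it runs without extra installs. The `LinkExtractor` class, the `extract_links` helper, and the sample markup are hypothetical names chosen for this example.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect href values from <a> tags (hypothetical helper for illustration)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names before calling this.
        # attrs is a list of (name, value) pairs; value may be None.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Return the href of every anchor in the given HTML string, in order."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


# Sample markup invented for this example.
sample = '<p><a href="/docs">Docs</a> and <a href="https://example.com">Site</a></p>'
print(extract_links(sample))
```

The libraries in the table offer CSS or XPath selectors and more forgiving handling of malformed markup, which is usually preferable for real pages.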

🧩 CAPTCHA Solving

Services and libraries to solve CAPTCHAs.

Tool	Type	Description
2Captcha	Service	Human-powered CAPTCHA solving service
Anti-Captcha	Service	Reliable CAPTCHA solving API
CapMonster Cloud	Service	AI-powered cloud CAPTCHA solving service
CapSolver	Service	AI-powered CAPTCHA solving
nocaptchaai	Service	AI solution for recaptcha/hcaptcha

🌐 Proxy Management

Rotating proxies and IP management.

Type	Description
Residential	Real user IPs, higher trust
Datacenter	Fast, cheap, easily detected
Mobile	4G/5G IPs, highest trust
ISP	Static residential IPs

Popular providers: Bright Data, Oxylabs, Smartproxy, IPRoyal
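
Whichever proxy type is used, a scraper typically cycles through a pool so that consecutive requests leave from different IPs. A minimal round-robin sketch follows, assuming a plain list of proxy URLs. The endpoints and the `proxy_rotator` helper are hypothetical and not tied to any provider listed above.

```python
from itertools import cycle

# Hypothetical proxy endpoints; replace with values from your provider.
PROXIES = [
    "http://proxy-a.example:8000",
    "http://proxy-b.example:8000",
    "http://proxy-c.example:8000",
]


def proxy_rotator(proxies):
    """Return an endless iterator that walks the proxies in round-robin order."""
    if not proxies:
        raise ValueError("at least one proxy is required")
    return cycle(proxies)


rotation = proxy_rotator(PROXIES)
for _ in range(4):
    # The fourth call wraps around to the first proxy again.
    print(next(rotation))
```

Each value from `next(rotation)` would be passed to the HTTP client's proxy setting for the next request. Production setups often add health checks and per-proxy backoff on top of plain rotation.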

🧰 Utils & User Agents

Helper libraries for common scraping tasks.

Tool	Language	Description
fake-useragent	Python	Random User-Agent generator
protego	Python	Pure-Python robots.txt parser
robots-parser	JavaScript	robots.txt parser for Node.js
user-agents	JavaScript	Comprehensive User-Agent generator
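
To show what a robots.txt parser is used for, here is a small sketch that checks whether a URL may be fetched. It uses Python's standard-library `urllib.robotparser` rather than protego or robots-parser, purely so it runs without dependencies. The rules, the `my-scraper` user agent, and the `is_allowed` helper are made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed from memory so no network is needed.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(ROBOTS_TXT, "my-scraper", "https://example.com/public/page"))
print(is_allowed(ROBOTS_TXT, "my-scraper", "https://example.com/private/page"))
```

In a crawler, a check like this would normally run before each request, with the parsed rules cached per host.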

⏰ Scheduling

Job scheduling for recurring scrapes.

Tool	Language	Description
APScheduler	Python	Advanced scheduler
BullMQ	JavaScript	Redis-based job queue
Celery	Python	Distributed task queue
node-cron	JavaScript	Cron-like scheduler

📚 Tutorials

Learning resources for web scraping.

Apify Academy — Free web scraping course
Crawlee Docs — Official Crawlee documentation
ScrapingBee Blog — Practical guides

🚀 Self-Hosted Platforms

Run scrapers on your own infrastructure.

Platform	Description
Crawlee Cloud	Open-source, self-hosted Actor platform

🤝 Contributing

Contributions welcome! Please read the contributing guidelines first.
