A curated list of web scraping tools, libraries, and resources.
Maintained by Crawlee Cloud, a self-hosted web scraping platform.
- AI & LLM Scraping
- Scraping Frameworks
- Browser Automation
- Anti-Detection
- Data Extraction
- CAPTCHA Solving
- Proxy Management
- Utils & User Agents
- Scheduling
- Tutorials
Tools for extracting web data in formats suited to large language models (LLMs).
| Tool | Language | Description |
|---|---|---|
| Crawl4AI | Python | Open-source LLM-friendly web crawler and scraper |
| Firecrawl | Multi | Turn websites into LLM-ready markdown |
| ScrapeGraphAI | Python | Graph-based AI scraping pipelines |
| Stagehand | TypeScript | AI-powered programmable browser |
Full-featured frameworks for building web scrapers.
| Tool | Language | Description |
|---|---|---|
| Colly | Go | Fast and elegant scraping framework |
| Crawlee | TypeScript | Reliable crawling library with autoscaling, session management, and stealth features |
| Ferret | Go | Declarative web scraping |
| Scrapy | Python | Battle-tested framework for large-scale scraping |
Headless browser control for JavaScript-heavy sites.
| Tool | Language | Description |
|---|---|---|
| Browserbase | - | Serverless headless browser platform |
| Cypress | JavaScript | E2E testing with scraping capabilities |
| Playwright | Multi | Cross-browser automation by Microsoft |
| Puppeteer | JavaScript | Headless Chrome/Chromium control |
| rod | Go | High-level Chrome DevTools controller |
| Selenium | Multi | Industry standard browser automation |
| Steel | - | Browser API for AI agents |
Tools for avoiding bot detection and CAPTCHAs.
| Tool | Description |
|---|---|
| Camoufox | Stealthy Firefox automation |
| curl-impersonate | curl with browser TLS fingerprints |
| Rebrowser Patches | Playwright/Puppeteer anti-detection |
| undetected-chromedriver | Selenium patch for anti-detection |
HTML parsing and data extraction libraries.
| Tool | Language | Description |
|---|---|---|
| Beautiful Soup | Python | HTML/XML parsing |
| Cheerio | JavaScript | Fast jQuery-like HTML parsing |
| jsdom | JavaScript | DOM implementation for Node.js |
| lxml | Python | High-performance XML/HTML processing |
| Parsel | Python | XPath/CSS selector extraction |
| Selectolax | Python | Fast HTML5 parser built on the Modest and Lexbor engines |
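
As a rough illustration of what these libraries do, the sketch below pulls link targets and link text out of an HTML fragment using only Python's standard-library `html.parser`. The `LinkExtractor` class, the `extract_links` helper, and the sample markup are hypothetical names made up for this example. The libraries in the table offer CSS/XPath selectors and more forgiving parsing, so treat this as a dependency-free sketch rather than a replacement for them.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect (href, text) pairs for every <a href=...> element."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # attrs is a list of (name, value) tuples
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text while we are inside an <a> that has an href
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html):
    """Return a list of (href, text) tuples found in the given HTML string."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical sample markup, used only for illustration
SAMPLE_HTML = """
<ul>
  <li><a href="/docs">Docs</a></li>
  <li><a href="/blog">The <b>Blog</b></a></li>
</ul>
"""

if __name__ == "__main__":
    for href, text in extract_links(SAMPLE_HTML):
        print(href, text)
```

Nested inline tags such as `<b>` are flattened into the link text. Anchors without an `href` are skipped.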
Services and libraries to solve CAPTCHAs.
| Tool | Type | Description |
|---|---|---|
| 2Captcha | Service | Human-powered CAPTCHA solving service |
| Anti-Captcha | Service | Reliable CAPTCHA solving API |
| CapMonster Cloud | Service | AI-powered cloud CAPTCHA solving service |
| CapSolver | Service | AI-powered CAPTCHA solving |
| nocaptchaai | Service | AI-based solver for reCAPTCHA and hCaptcha |
Rotating proxies and IP management.
| Type | Description |
|---|---|
| Residential | Real user IPs, higher trust |
| Datacenter | Fast, cheap, easily detected |
| Mobile | 4G/5G IPs, highest trust |
| ISP | Static residential IPs |
Popular providers: Bright Data, Oxylabs, Smartproxy, IPRoyal
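
Whichever proxy type is used, rotation usually comes down to cycling through a pool and skipping endpoints that have stopped working. The sketch below shows one simple way to do that with the standard library. The `ProxyRotator` class and the proxy URLs are hypothetical and do not reflect any provider's API. Many providers also offer a single rotating gateway endpoint, which makes client-side rotation unnecessary.

```python
from itertools import cycle


class ProxyRotator:
    """Hand out proxies round-robin, skipping any marked as failed."""

    def __init__(self, proxies):
        self._proxies = list(proxies)
        self._failed = set()
        self._cycle = cycle(self._proxies)

    def mark_failed(self, proxy):
        """Exclude a proxy from future rotation (e.g. after repeated errors)."""
        self._failed.add(proxy)

    def next_proxy(self):
        """Return the next healthy proxy, or raise if none are left."""
        # Inspect each proxy at most once per call so the loop always ends.
        for _ in range(len(self._proxies)):
            proxy = next(self._cycle)
            if proxy not in self._failed:
                return proxy
        raise RuntimeError("no healthy proxies left")


if __name__ == "__main__":
    # Placeholder addresses; substitute real proxy endpoints.
    rotator = ProxyRotator([
        "http://proxy-a.example:8000",
        "http://proxy-b.example:8000",
        "http://proxy-c.example:8000",
    ])
    for _ in range(4):
        print(rotator.next_proxy())
```

The returned URL can then be passed to whatever proxy setting the HTTP client exposes.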
Helper libraries for common scraping tasks.
| Tool | Language | Description |
|---|---|---|
| fake-useragent | Python | Random User-Agent generator |
| protego | Python | Pure-Python robots.txt parser |
| robots-parser | JavaScript | robots.txt parser for Node.js |
| user-agents | JavaScript | Comprehensive User-Agent generator |
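
To show what a robots.txt parser is used for, here is a minimal sketch with Python's standard-library `urllib.robotparser`, not one of the libraries in the table. The rules, the `example-bot` user agent, and the `build_parser` helper are made up for illustration. The rules are parsed from an inline string so the example runs without network access.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used only for illustration
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_parser(robots_txt):
    """Build a RobotFileParser from robots.txt text instead of fetching a URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


if __name__ == "__main__":
    parser = build_parser(ROBOTS_TXT)
    for url in ("https://example.com/public", "https://example.com/private/data"):
        allowed = parser.can_fetch("example-bot", url)
        print(url, "allowed" if allowed else "disallowed")
    print("crawl delay:", parser.crawl_delay("example-bot"))
```

In a real crawler, the same parser can load a live file with `set_url()` followed by `read()`.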
Job scheduling for recurring scrapes.
| Tool | Language | Description |
|---|---|---|
| APScheduler | Python | In-process scheduler with cron, interval, and date triggers |
| BullMQ | JavaScript | Redis-based job queue |
| Celery | Python | Distributed task queue |
| node-cron | JavaScript | Cron-like scheduler |
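
For a sense of the core idea behind these tools, the sketch below runs a job at a fixed interval using only Python's standard-library `sched` module. The `run_every` helper and the `scrape` job are hypothetical names for this example. The tools in the table add features such as persistence, retries, cron expressions, and distributed workers, which this sketch does not attempt.

```python
import sched
import time


def run_every(interval_s, job, runs):
    """Run `job` `runs` times, spaced `interval_s` seconds apart, then return."""
    scheduler = sched.scheduler(time.monotonic, time.sleep)
    start = time.monotonic()
    for i in range(runs):
        # Absolute times keep the schedule from drifting when a job is slow.
        scheduler.enterabs(start + i * interval_s, 1, job)
    scheduler.run()  # blocks until every scheduled run has finished


def scrape():
    # Placeholder job; a real scraper would fetch and store data here.
    print("scraping at", time.strftime("%H:%M:%S"))


if __name__ == "__main__":
    run_every(interval_s=1.0, job=scrape, runs=3)
```

Because `scheduler.run()` blocks, this pattern suits a dedicated worker process. For long-lived or multi-machine setups, a queue-backed tool is usually the better fit.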
Learning resources for web scraping.
- Apify Academy — Free web scraping course
- Crawlee Docs — Official Crawlee documentation
- ScrapingBee Blog — Practical guides
Run scrapers on your own infrastructure.
| Platform | Description |
|---|---|
| Crawlee Cloud | Open-source, self-hosted Actor platform |
Contributions welcome! Please read the contributing guidelines first.
