ntohidi commented Jan 16, 2026

Summary

  • Deep Crawl Crash Recovery with resume_state and on_state_change for long-running crawls
  • Prefetch Mode (prefetch=True) for 5-10x faster URL discovery
  • Critical security fixes for Docker API (RCE and LFI vulnerabilities)
  • CDP improvements for concurrent browser sessions
  • Multiple bug fixes and documentation updates

Breaking Changes

Docker API Security (Action Required)

  • Hooks disabled by default: Set CRAWL4AI_HOOKS_ENABLED=true to re-enable
  • file:// URLs blocked: Use the Python library directly for local file processing

New Features

Crash Recovery for Deep Crawl

strategy = BFSDeepCrawlStrategy(
    max_depth=3,
    resume_state=saved_state,  # Continue from checkpoint
    on_state_change=save_to_redis,  # Called after each URL
)
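
A minimal sketch of the on_state_change callback named above (save_to_redis), assuming redis-py's asyncio client and that the callback receives the strategy's JSON-serializable state dict:

  import json
  import redis.asyncio as redis

  r = redis.Redis()

  async def save_to_redis(state: dict) -> None:
      # Persist the latest checkpoint after each URL so a crashed
      # crawl can be resumed later via resume_state.
      await r.set("deep_crawl:state", json.dumps(state))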

Prefetch Mode

  config = CrawlerRunConfig(prefetch=True)
  result = await crawler.arun(url, config=config)
  # Returns HTML and links only - 5-10x faster

CDP Improvements

  • browser_context_id and target_id for concurrent sessions
  • cdp_cleanup_on_close flag for cloud/server scenarios
  • init_scripts for pre-page-load JavaScript injection
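
A minimal sketch combining the options above (parameter names as listed; the CDP URL, context ID, and target ID are illustrative placeholders):

  browser_config = BrowserConfig(
      cdp_url="ws://localhost:9222/devtools/browser/<id>",  # pre-started browser
      browser_context_id="<existing-context-id>",           # reuse a pre-created CDP context
      target_id="<existing-target-id>",                     # attach to an existing page target
      cdp_cleanup_on_close=True,                            # release the CDP connection on close
      init_scripts=[
          "Object.defineProperty(navigator, 'webdriver', {get: () => undefined})",
      ],
  )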

Other Features

  • Async agenerate_schema method for schema generation
  • Proxy support for HTTP crawler strategy
  • base_url parameter for raw HTML processing
  • PDF and MHTML support for raw: and file:// URLs

Security Fixes

  • CVE Pending (Critical): RCE via hooks parameter - __import__ removed from the hook sandbox
  • CVE Pending (High): LFI via file:// URLs - URL scheme validation added
  • Hooks require explicit opt-in via environment variable

Bug Fixes

  • Fix LLM backoff to be configurable end-to-end
  • Fix ContentRelevanceFilter deserialization in deep crawl
  • Fix ProxyConfig JSON serialization
  • Fix .cache folder permissions in Docker
  • Fix CDP connection handling for WS URLs
  • Fix raw URL parsing truncation at # character
  • Fix URL variable for raw HTML extraction
  • Replace deprecated PyPDF2 with pypdf
  • Pydantic v2 ConfigDict compatibility

Documentation

  • Added crash recovery documentation and examples
  • Added prefetch mode documentation and examples
  • Added v0.8.0 migration guide
  • Added security advisory drafts
  • Updated self-hosting guide with new version

Test Plan

  • Run python docs/releases_review/demo_v0.8.0.py - all tests pass
  • Verify crash recovery with simulated interruption
  • Verify prefetch mode returns HTML/links only
  • Verify hooks are blocked by default
  • Verify file:// URLs are rejected on API endpoints

Files Changed

  • 58 files changed, ~12,000 insertions, ~2,400 deletions

Related Issues

  • Fixes security vulnerabilities reported by Neo (ProjectDiscovery)

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added/updated unit tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

rbushri and others added 30 commits September 1, 2025 23:15
- Prevents full HTML content from being passed as the URL to extraction strategies
- Added unit tests to verify raw HTML and regular URL processing

Fix: Wrong URL variable used for extraction of raw html
  The remove_empty_elements_fast() method was removing whitespace-only
  span elements inside <pre> and <code> tags, causing import statements
  like "import torch" to become "importtorch". Now skips elements inside
  code blocks where whitespace is significant.
Fix: Wrong URL variable used for extraction of raw html
Refactor Pydantic model configuration to use ConfigDict for arbitrary…
- extend LLMConfig with backoff delay/attempt/factor fields and thread them
  through LLMExtractionStrategy, LLMContentFilter, table extraction, and
  Docker API handlers
- expose the backoff parameter knobs on perform_completion_with_backoff/aperform_completion_with_backoff
  and document them in the md_v2 guides
Fix BrowserConfig proxy_config serialization
Make LLM backoff configurable end-to-end
…nce-filter

[Fix]: Docker server does not decode ContentRelevanceFilter
Enable Crawl4AI to connect to pre-created CDP browser contexts, which is
essential for cloud browser services that pre-create isolated contexts.

Changes:
- Add browser_context_id and target_id parameters to BrowserConfig
- Update from_kwargs() and to_dict() methods
- Modify BrowserManager.start() to use existing context when provided
- Add _get_page_by_target_id() helper method
- Update get_page() to handle pre-existing targets
- Add test for browser_context_id functionality

This enables cloud services to:
1. Create isolated CDP contexts before Crawl4AI connects
2. Pass context/target IDs to BrowserConfig
3. Have Crawl4AI reuse existing contexts instead of creating new ones
unclecode and others added 29 commits December 13, 2025 08:29
When True, forces creation of a new browser context instead of reusing
the default context. Essential for concurrent crawls on the same browser
to prevent navigation conflicts.
Uses contexts_by_config cache (same as non-CDP mode) to reuse contexts
for multiple URLs with same config. Still creates new page per crawl
for navigation isolation. Benefits batch/deep crawls.
This adds the ability to inject JavaScript that runs before any page loads,
useful for stealth evasions (canvas/audio fingerprinting, userAgentData).

- Add init_scripts parameter to BrowserConfig (list of JS strings)
- Apply init_scripts in setup_context() via context.add_init_script()
- Update from_kwargs() and to_dict() for serialization
Changes to browser_manager.py:

1. _verify_cdp_ready(): Support multiple URL formats
   - WebSocket URLs (ws://, wss://): Skip HTTP verification, Playwright handles directly
   - HTTP URLs with query params: Properly parse with urlparse to preserve query string
   - Fixes issue where naive f"{cdp_url}/json/version" broke WS URLs and query params

2. close(): Proper cleanup when cdp_cleanup_on_close=True
   - Close all sessions (pages)
   - Close all contexts
   - Call browser.close() to disconnect (doesn't terminate browser, just releases connection)
   - Wait 1 second for CDP connection to fully release
   - Stop Playwright instance to prevent memory leaks

This enables:
- Connecting to specific browsers via WS URL
- Reusing the same browser with multiple sequential connections
- No user wait needed between connections (internal 1s delay handles it)

Added tests/browser/test_cdp_cleanup_reuse.py with comprehensive tests.
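
An illustrative sketch of the URL handling described above (the real logic lives in browser_manager._verify_cdp_ready; the helper name here is hypothetical):

  from urllib.parse import urlparse, urlunparse

  def build_version_url(cdp_url: str) -> str | None:
      parsed = urlparse(cdp_url)
      if parsed.scheme in ("ws", "wss"):
          return None  # WebSocket endpoint: skip HTTP verification, Playwright connects directly
      # Append /json/version to the path while preserving any query string
      path = parsed.path.rstrip("/") + "/json/version"
      return urlunparse(parsed._replace(path=path))
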
Implements the missing method that was being called but never defined.
Now raw: and file:// URLs can generate screenshots by:
1. Loading HTML into a browser page via page.set_content()
2. Taking screenshot using existing take_screenshot() method
3. Cleaning up the page afterward

This enables cached HTML to be rendered with screenshots in crawl4ai-cloud.
- Replace _generate_screenshot_from_html with _generate_media_from_html
- New method handles screenshot, PDF, and MHTML in one browser session
- Update raw: and file:// URL handlers to use new method
- Enables cached HTML to generate all media types
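
A hedged usage sketch of the new behavior, assuming an AsyncWebCrawler instance and that the existing screenshot/pdf flags on CrawlerRunConfig now also apply to raw: input as described:

  config = CrawlerRunConfig(screenshot=True, pdf=True)
  result = await crawler.arun(url="raw:<html><body><h1>Cached page</h1></body></html>", config=config)
  # result.screenshot (base64) and result.pdf are rendered from the cached HTML
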
Add optional resume_state and on_state_change parameters to all deep
crawl strategies (BFS, DFS, Best-First) for cloud deployment crash
recovery.

Features:
- resume_state: Pass saved state to resume from checkpoint
- on_state_change: Async callback fired after each URL for real-time
  state persistence to external storage (Redis, DB, etc.)
- export_state(): Get last captured state manually
- Zero overhead when features are disabled (None defaults)

State includes visited URLs, pending queue/stack, depths, and
pages_crawled count. All state is JSON-serializable.
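
A minimal file-based sketch of the same pattern, assuming the state dict is JSON-serializable as described and is accepted back via resume_state:

  import json, os

  STATE_FILE = "crawl_state.json"

  async def persist_state(state: dict) -> None:
      with open(STATE_FILE, "w") as f:
          json.dump(state, f)

  saved_state = None
  if os.path.exists(STATE_FILE):
      with open(STATE_FILE) as f:
          saved_state = json.load(f)  # resume from the last checkpoint after a crash

  strategy = BFSDeepCrawlStrategy(
      max_depth=3,
      resume_state=saved_state,
      on_state_change=persist_state,
  )
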
The AsyncHTTPCrawlerStrategy.crawl() method used urlparse() to extract
content from raw: URLs. This caused HTML with CSS color codes like #eee
to be truncated because # is treated as a URL fragment delimiter.

Before: raw:body{background:#eee} -> parsed.path = 'body{background:'
After:  raw:body{background:#eee} -> raw_content = 'body{background:#eee}'

Fix: Strip the raw: or raw:// prefix directly instead of using urlparse,
matching how the browser strategy handles it.
When processing raw: HTML (e.g., from cache), the URL parameter is meaningless
for markdown link resolution. This adds a base_url parameter that can be set
explicitly to provide proper URL resolution context.

Changes:
- Add base_url parameter to CrawlerRunConfig.__init__
- Add base_url to CrawlerRunConfig.from_kwargs
- Update aprocess_html to use base_url for markdown generation

Usage:
  config = CrawlerRunConfig(base_url='https://example.com')
  result = await crawler.arun(url='raw:{html}', config=config)
- Add `prefetch` parameter to CrawlerRunConfig
- Add `quick_extract_links()` function for fast link extraction
- Add short-circuit in aprocess_html() for prefetch mode
- Add 42 tests (unit, integration, regression)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add process_in_browser parameter to CrawlerRunConfig
- Route raw:/file:// URLs through _crawl_web() when browser operations needed
- Use page.set_content() instead of goto() for local content
- Fix cookie handling for non-HTTP URLs in browser_manager
- Auto-detect browser requirements: js_code, wait_for, screenshot, etc.
- Maintain fast path for raw:/file:// without browser params

Fixes #310
- Add cache_ttl_hours and validate_sitemap_lastmod params to SeedingConfig
- New JSON cache format with metadata (version, created_at, lastmod, url_count)
- Cache validation by TTL expiry and sitemap lastmod comparison
- Auto-migration from old .jsonl to new .json format
- Fixes bug where incomplete cache was used indefinitely
- Add cache_ttl_hours and validate_sitemap_lastmod to parameter table
- Document smart TTL cache validation with examples
- Add cache-related troubleshooting entries
- Update key features summary
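
A hedged sketch of the new cache controls (cache_ttl_hours and validate_sitemap_lastmod come from the commits above; the source value is illustrative):

  seeding_config = SeedingConfig(
      source="sitemap",               # illustrative; any supported discovery source
      cache_ttl_hours=24,             # discard cached URL lists older than 24 hours
      validate_sitemap_lastmod=True,  # refresh the cache when the sitemap's lastmod is newer
  )
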
Add documentation explaining how to pass multiple HTML samples
to generate_schema() for stable selectors that work across pages
with varying DOM structures.

Includes:
- Problem explanation (fragile nth-child selectors)
- Solution with code example
- Key points for multi-sample queries
- Comparison table of fragile vs stable selectors
Security fixes for vulnerabilities reported by ProjectDiscovery:

1. Remote Code Execution via Hooks (CVE pending)
   - Remove __import__ from allowed_builtins in hook_manager.py
   - Prevents arbitrary module imports (os, subprocess, etc.)
   - Hooks now disabled by default via CRAWL4AI_HOOKS_ENABLED env var

2. Local File Inclusion via file:// URLs (CVE pending)
   - Add URL scheme validation to /execute_js, /screenshot, /pdf, /html
   - Block file://, javascript:, data: and other dangerous schemes
   - Only allow http://, https://, and raw: (where appropriate)

3. Security hardening
   - Add CRAWL4AI_HOOKS_ENABLED=false as default (opt-in for hooks)
   - Add security warning comments in config.yml
   - Add validate_url_scheme() helper for consistent validation

Testing:
   - Add unit tests (test_security_fixes.py) - 16 tests
   - Add integration tests (run_security_tests.py) for live server

Affected endpoints:
   - POST /crawl (hooks disabled by default)
   - POST /crawl/stream (hooks disabled by default)
   - POST /execute_js (URL validation added)
   - POST /screenshot (URL validation added)
   - POST /pdf (URL validation added)
   - POST /html (URL validation added)

Breaking changes:
   - Hooks require CRAWL4AI_HOOKS_ENABLED=true to function
   - file:// URLs no longer work on API endpoints (use library directly)
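
An illustrative sketch in the spirit of the validate_url_scheme() helper mentioned above (the allowlist and signature are assumptions):

  from urllib.parse import urlparse

  ALLOWED_SCHEMES = {"http", "https"}

  def validate_url_scheme(url: str, allow_raw: bool = False) -> None:
      if allow_raw and url.startswith("raw:"):
          return
      scheme = urlparse(url).scheme.lower()
      if scheme not in ALLOWED_SCHEMES:
          raise ValueError(f"Blocked URL scheme: {scheme or '<none>'}")
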
…xes, new features, bug fixes, and documentation updates

Documentation for v0.8.0 release:

- SECURITY.md: Security policy and vulnerability reporting guidelines
- RELEASE_NOTES_v0.8.0.md: Comprehensive release notes
- migration/v0.8.0-upgrade-guide.md: Step-by-step migration guide
- security/GHSA-DRAFT-RCE-LFI.md: GitHub security advisory drafts
- CHANGELOG.md: Updated with v0.8.0 changes

Breaking changes documented:
- Docker API hooks disabled by default (CRAWL4AI_HOOKS_ENABLED)
- file:// URLs blocked on Docker API endpoints

Security fixes credited to Neo (ProjectDiscovery)
- Updated version to 0.8.0
- Added comprehensive demo and release notes
- Updated all documentation
- Extract prompt building to shared _build_schema_prompt() method
- Add agenerate_schema() async version using aperform_completion_with_backoff
- Refactor generate_schema() to use shared prompt builder
- Fixes Gemini/Vertex AI compatibility in async contexts (FastAPI)
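
A hedged usage sketch, assuming agenerate_schema() mirrors generate_schema()'s signature (HTML plus an optional natural-language query and an LLMConfig; the provider string is illustrative):

  schema = await JsonCssExtractionStrategy.agenerate_schema(
      html=sample_html,
      query="Extract product name and price",
      llm_config=LLMConfig(provider="gemini/gemini-1.5-pro"),
  )
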
O-series (o1, o3) and GPT-5 models only support temperature=1.
Setting litellm.drop_params=True auto-drops unsupported parameters
instead of throwing UnsupportedParamsError.

Fixes temperature=0.01 error for these models in LLM extraction.
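
litellm.drop_params is a module-level litellm flag, so the fix amounts to:

  import litellm
  litellm.drop_params = True  # drop unsupported params (e.g. temperature) instead of raising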