crawl4md is a minimal, clean CLI tool that crawls web pages or sitemaps and converts them into structured Markdown files.
The project is intentionally designed to stay simple, deterministic, and easy to extend — without unnecessary complexity or hidden behavior.
- Minimal: only what is needed, nothing more
- Deterministic: same input → same output
- Transparent: no magic, clear processing steps
- Composable: ideal as a building block for pipelines (e.g. RAG)
- Crawl from:
  - sitemap.xml
  - explicit page lists
- Clean Markdown output via exchangeable parser backends
- Deterministic file structure based on URL paths
- YAML-based project configuration
- CLI-first workflow (uv-compatible)
- Clear, readable progress output
There are two ways to use crawl4md.
If you want to use the project directly for batch crawling via crawl.yml, clone the repository:
git clone git@github.com:ixnode/crawl4md.git && cd crawl4md
Then continue with the configuration section below.
If you want to build your own tooling on top of crawl4md, install it as a package:
pip install crawl4md
Or with uv:
uv add crawl4md
For local development inside the repository:
uv sync
The CLI reads a crawl.yml file from the current working directory.
Create it from the example:
cp crawl.yml.example crawl.yml
Minimal example:
projects:
  planes:
    type: pages
    crawl:
      parser: kreuzberg-dev
      parse_type: markdown
    sources:
      - https://de.wikipedia.org/wiki/Boeing_707
      - https://de.wikipedia.org/wiki/Boeing_717
    preprocessing:
      markdown:
        enabled: true
        remove_html_comments: true
        normalize_whitespace: true
  pydantic:
    type: sitemap
    crawl:
      parser: kreuzberg-dev
      parse_type: markdown
    sources:
      - https://pydantic.dev/sitemap.xml
    preprocessing:
      markdown:
        enabled: false
Available project settings:
- type: pages or sitemap
- sources: list of page URLs or sitemap URLs
- profile: optional defaults such as wikipedia for crawl, normalization, and preprocessing (loaded from profiles/*.yml)
- crawl.parser: kreuzberg-dev or crawl4ai
- crawl.parse_type: markdown; markdown-fit is available with crawl4ai
- normalization.*: HTML normalization options (enabled, entities, hidden_elements, urls, references), all default to true
- preprocessing.markdown.enabled: enables Markdown cleanup
- preprocessing.markdown.*: optional cleanup rules such as ensure_h1, remove_html_comments, remove_sections, and normalize_whitespace
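The parser and parse_type combinations listed above can be checked before a crawl starts. The following is a minimal sketch, not crawl4md's own validation code: `SUPPORTED_PARSE_TYPES` and `check_parse_type` are hypothetical names, and the table only restates what this README says about the two backends.

```python
# Parse types per parser backend, as listed in this README.
# Hypothetical helper: crawl4md may validate its configuration differently.
SUPPORTED_PARSE_TYPES = {
    "kreuzberg-dev": {"markdown"},
    "crawl4ai": {"markdown", "markdown-fit"},
}


def check_parse_type(parser: str, parse_type: str) -> None:
    """Raise ValueError for combinations this README does not list."""
    if parser not in SUPPORTED_PARSE_TYPES:
        raise ValueError(f"unknown parser: {parser!r}")
    if parse_type not in SUPPORTED_PARSE_TYPES[parser]:
        allowed = ", ".join(sorted(SUPPORTED_PARSE_TYPES[parser]))
        raise ValueError(
            f"parser {parser!r} does not support parse_type {parse_type!r} "
            f"(allowed: {allowed})"
        )


check_parse_type("crawl4ai", "markdown-fit")  # passes silently
```

Calling `check_parse_type("kreuzberg-dev", "markdown-fit")` raises a `ValueError`, since markdown-fit is only listed for crawl4ai.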
For the full configuration, see crawl.yml.example.
For details about all Markdown preprocessing options, see docs/markdown_preprocessing.md.
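For projects with type: sitemap, each source points to a sitemap file instead of a single page. As a rough illustration of what reading such a file involves, the sketch below extracts page URLs from a standard sitemaps.org `<urlset>` document using only the Python standard library. `extract_sitemap_urls` is a hypothetical helper, it does not handle sitemap index files, and crawl4md's own sitemap handling may work differently.

```python
import xml.etree.ElementTree as ET

# XML namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_sitemap_urls(xml_text: str) -> list[str]:
    """Return all <loc> values of a <urlset> sitemap, in document order."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text and loc.text.strip()
    ]


example = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/docs/intro</loc></url>
</urlset>"""

print(extract_sitemap_urls(example))
```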
After cloning the repository and creating crawl.yml, use:
crawl planes
crawl pydantic
Or with uv inside the project:
uv run crawl planes
uv run crawl pydantic
Run the full validation suite with:
uv run check
For focused checks, grouped test commands, parameters, and examples, see:
crawl4md can also be used as a Python package after installing it with pip install crawl4md or uv add crawl4md.
The public API exports these classes:
- HtmlFetcherKreuzbergDev
- MarkdownConverterKreuzbergDev
- HtmlFetcherCrawl4AI
- MarkdownConverterCrawl4AI
- ParseTypeKreuzbergDev
- ParseTypeCrawl4AI
- MarkdownPreprocessingConfig
- NormalizationConfig
All fetchers provide:
- fetch(url): async URL fetch and Markdown conversion
- fetch_sync(url): sync URL fetch and Markdown conversion
All converters provide:
- convert(html, url=None): async HTML-to-Markdown conversion
- convert_sync(html, url=None): sync HTML-to-Markdown conversion
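Because the async methods are coroutines, several pages can be processed concurrently with plain asyncio. The sketch below shows one way to do that with a concurrency limit. `StubFetcher` and `fetch_all` are hypothetical: the stub only imitates the `fetch(url)` coroutine described above so the example runs without network access, and whether a single crawl4md fetcher instance is safe to share across concurrent calls is an assumption to verify before relying on it.

```python
import asyncio


class StubFetcher:
    """Stand-in exposing the same fetch(url) coroutine shape as described above."""

    async def fetch(self, url: str) -> str:
        await asyncio.sleep(0)  # placeholder for real network I/O
        return f"# {url}\n"


async def fetch_all(fetcher, urls: list[str], limit: int = 2) -> dict[str, str]:
    """Fetch all URLs with at most `limit` requests in flight at once."""
    semaphore = asyncio.Semaphore(limit)

    async def fetch_one(url: str) -> str:
        async with semaphore:
            return await fetcher.fetch(url)

    # gather() returns results in the same order as the input URLs.
    results = await asyncio.gather(*(fetch_one(url) for url in urls))
    return dict(zip(urls, results))


urls = ["https://example.com/a", "https://example.com/b"]
pages = asyncio.run(fetch_all(StubFetcher(), urls))
print(sorted(pages))
```

Keeping `limit` small is a deliberate choice here: it bounds the request rate against a single host.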
Common constructor arguments:
- config: a MarkdownPreprocessingConfig
- normalization: optional NormalizationConfig for HTML normalization (HtmlFetcher* only)
- parse_type: usually "markdown"
- content_selector: optional CSS selector for selecting only part of the HTML before conversion
When using crawl.yml, use projects.<name>.crawl.parser to choose the parser:
- "kreuzberg-dev": recommended default, supports parse_type: markdown
- "crawl4ai": supports parse_type: markdown and parse_type: markdown-fit
In Python, use the concrete class for the parser backend you want. Use ParseTypeCrawl4AI and ParseTypeKreuzbergDev to control valid parse_type values:
- "markdown": raw markdown output
- "markdown-fit": cleaned and reduced markdown output via crawl4ai
Use MarkdownPreprocessingConfig to enable optional cleanup steps.
For the full list of preprocessing options, see docs/markdown_preprocessing.md.
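To give a feel for what cleanup rules such as remove_html_comments and normalize_whitespace do to a document, here is a small standalone sketch. It is an assumption about the general idea, not crawl4md's implementation: the two functions are hypothetical, regex-based, and do not protect fenced code blocks, so the actual rules may behave differently.

```python
import re


def remove_html_comments(markdown: str) -> str:
    """Drop <!-- ... --> comments, including multi-line ones."""
    return re.sub(r"<!--.*?-->", "", markdown, flags=re.DOTALL)


def normalize_whitespace(markdown: str) -> str:
    """Strip trailing spaces and collapse runs of blank lines to a single one."""
    lines = [line.rstrip() for line in markdown.splitlines()]
    text = re.sub(r"\n{3,}", "\n\n", "\n".join(lines))
    # End the document with exactly one newline.
    return text.strip() + "\n"


raw = "# Title   \n\n\n\n<!-- tracking -->Text\n"
print(normalize_whitespace(remove_html_comments(raw)))
```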
Simple example:
from crawl4md import MarkdownPreprocessingConfig
config = MarkdownPreprocessingConfig(
    enabled=True,
    remove_html_comments=True,
    normalize_whitespace=True,
)
Use NormalizationConfig to control HTML normalization before Markdown conversion (for fetchers).
If omitted, HtmlFetcher* uses NormalizationConfig() defaults.
Explicit example:
from crawl4md import HtmlFetcherKreuzbergDev, MarkdownPreprocessingConfig, NormalizationConfig
fetcher = HtmlFetcherKreuzbergDev(
    config=MarkdownPreprocessingConfig(enabled=True),
    normalization=NormalizationConfig(
        enabled=True,
        entities=True,
        hidden_elements=True,
        urls=True,
        references=True,
    ),
    parse_type="markdown",
)
Default example (implicit normalization defaults):
from crawl4md import HtmlFetcherKreuzbergDev, MarkdownPreprocessingConfig
fetcher = HtmlFetcherKreuzbergDev(
    config=MarkdownPreprocessingConfig(enabled=True),
    parse_type="markdown",
)
Use HtmlFetcherKreuzbergDev if you want to fetch a page and directly receive Markdown.
from crawl4md import HtmlFetcherKreuzbergDev, MarkdownPreprocessingConfig
config = MarkdownPreprocessingConfig(enabled=True)
fetcher = HtmlFetcherKreuzbergDev(config=config, parse_type="markdown")
markdown = fetcher.fetch_sync("https://example.com")
print(markdown)
Async version:
import asyncio
from crawl4md import HtmlFetcherKreuzbergDev, MarkdownPreprocessingConfig
config = MarkdownPreprocessingConfig(enabled=True)
fetcher = HtmlFetcherKreuzbergDev(config=config, parse_type="markdown")
markdown = asyncio.run(fetcher.fetch("https://example.com"))
print(markdown)
Use MarkdownConverterKreuzbergDev if you already have HTML and only want the conversion step.
from crawl4md import MarkdownConverterKreuzbergDev, MarkdownPreprocessingConfig
html = "<html><body><h1>Hello</h1><p>World</p></body></html>"
config = MarkdownPreprocessingConfig(enabled=True, ensure_h1=True)
converter = MarkdownConverterKreuzbergDev(config=config, parse_type="markdown")
markdown = converter.convert_sync(html=html, url="https://example.com")
print(markdown)
Async version:
import asyncio
from crawl4md import MarkdownConverterKreuzbergDev, MarkdownPreprocessingConfig
html = "<html><body><h1>Hello</h1><p>World</p></body></html>"
config = MarkdownPreprocessingConfig(enabled=True, ensure_h1=True)
converter = MarkdownConverterKreuzbergDev(config=config, parse_type="markdown")
markdown = asyncio.run(
    converter.convert(html=html, url="https://example.com")
)
print(markdown)
Use content_selector to convert only the matching HTML elements before Markdown conversion.
from crawl4md import MarkdownConverterKreuzbergDev, MarkdownPreprocessingConfig
html = """
<html>
  <body>
    <nav>Navigation</nav>
    <main><h1>Hello</h1><p>World</p></main>
  </body>
</html>
"""
converter = MarkdownConverterKreuzbergDev(
    config=MarkdownPreprocessingConfig(enabled=True),
    parse_type="markdown",
    content_selector="main",
)
markdown = converter.convert_sync(html=html, url="https://example.com")
print(markdown)
The same option is available on HtmlFetcherKreuzbergDev and HtmlFetcherCrawl4AI.
Use HtmlFetcherKreuzbergDev or MarkdownConverterKreuzbergDev when you want the recommended backend explicitly.
Use HtmlFetcherCrawl4AI or MarkdownConverterCrawl4AI when you need crawl4ai, for example parse_type="markdown-fit":
from crawl4md import HtmlFetcherCrawl4AI, MarkdownPreprocessingConfig
fetcher = HtmlFetcherCrawl4AI(
    config=MarkdownPreprocessingConfig(enabled=True),
    parse_type="markdown-fit",
)
markdown = fetcher.fetch_sync("https://example.com")
print(markdown)
Markdown files are stored deterministically based on the URL path:
crawled/<project>/<url-path>.md
Example:
crawled/planes/wiki/Boeing_707.md
Rules:
- Domain is ignored
- URL path is preserved
- / → index.md
- Query parameters are ignored
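The rules above can be expressed as a short function. This is a minimal sketch under stated assumptions, not the code in crawl4md's paths module: `output_path` is a hypothetical name, percent-encoded characters are left untouched, and mapping a deeper path with a trailing slash (such as /docs/) to docs/index.md is a guess that the README does not confirm.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit


def output_path(project: str, url: str, root: str = "crawled") -> str:
    """Map a URL to <root>/<project>/<url-path>.md, ignoring domain and query."""
    path = urlsplit(url).path  # drops scheme, domain, query, and fragment
    # The site root (and, by assumption, any trailing slash) becomes "index".
    if path == "" or path.endswith("/"):
        path += "index"
    relative = path.lstrip("/") + ".md"
    return str(PurePosixPath(root, project, relative))


print(output_path("planes", "https://de.wikipedia.org/wiki/Boeing_707"))
print(output_path("planes", "https://example.com/?page=2"))
```

The first call prints crawled/planes/wiki/Boeing_707.md, matching the example above; the second prints crawled/planes/index.md.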
1/2 Crawl https://de.wikipedia.org/wiki/Boeing_707
- Fetching ... done
- Processing ... done
- Writing crawled/planes/wiki/Boeing_707.md ... done

- RAG data ingestion
- Website snapshotting
- Knowledge base generation
- Offline documentation
src/crawl4md/
├─ core/
│ ├─ cli.py
│ ├─ config.py
│ ├─ language.py
│ ├─ paths.py
│ ├─ profiles.py
│ ├─ sitemap.py
│ └─ writer.py
├─ commands/
│ └─ check.py
├─ convert/
└─ fetch/

- No recursive crawling (by design)
- No hidden caching or transformations
- Focus on clean Markdown output only
This project is licensed under the MIT License. See the LICENSE.md file for details.
- Björn Hempel bjoern@hempel.li - Initial work - https://github.com/bjoern-hempel
crawl4md is designed as a small orchestration layer around exchangeable HTML-to-Markdown backends.
It currently integrates the excellent crawl4ai project and html-to-markdown by kreuzberg-dev. Both libraries solve the conversion problem from different angles; crawl4md keeps the project workflow, preprocessing, path handling, and writing logic independent from the selected parser.
Why use crawl4md around these parser backends:
- project-based batch crawling via crawl.yml
- support for both page lists and sitemap-driven crawls
- deterministic output paths for generated Markdown files
- optional Markdown cleanup rules for better downstream text quality
- a small CLI and Python API focused on URL or HTML to Markdown workflows
- clearer separation between fetching, conversion, preprocessing, and writing
In short: the parser backend can change, while crawl4md keeps the surrounding crawl configuration, deterministic output, and Markdown cleanup workflow stable.
Some websites, especially Wikimedia/Wikipedia, may block direct HTTP requests depending on the Python runtime, TLS fingerprint, request frequency, IP reputation, or server-side bot detection.
Example error:
httpx.HTTPStatusError: Client error '403 Forbidden'
Please respect our robot policy ...
This is not necessarily a crawl4md bug. The same request may work in one Python environment and fail in another.
Known workaround:
uv python install 3.14.0
uv venv --python 3.14.0
uv sync
Then run again:
uv run crawl <project>
If the problem persists:
- reduce request frequency
- avoid repeated crawling of the same Wikimedia pages
- use a proper User-Agent
- respect Wikimedia's robot policy
- retry later
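Reducing request frequency and retrying later can be automated with a small wrapper. The sketch below is one possible approach, not a crawl4md feature: `fetch_with_retry` is a hypothetical helper that accepts any callable taking a URL and returning text, and it retries on every exception for brevity. In practice you would likely narrow the `except` clause to the HTTP error you actually see.

```python
import time
from typing import Callable


def fetch_with_retry(
    fetch: Callable[[str], str],
    url: str,
    attempts: int = 3,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fetch(url), waiting base_delay * 2**n seconds after the n-th failure."""
    for attempt in range(attempts):
        try:
            return fetch(url)
        except Exception:
            # Re-raise after the last attempt so the caller sees the real error.
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2**attempt)
    raise ValueError("attempts must be at least 1")
```

With a fetcher from the examples above, usage would look like `fetch_with_retry(fetcher.fetch_sync, "https://example.com")`. The `sleep` parameter exists only so the delay can be replaced in tests.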