Markdown preprocessing runs after the selected parser has converted HTML into Markdown.
Enable preprocessing with:
preprocessing:
  markdown:
    enabled: true

If enabled is false, no Markdown preprocessing rules are applied, even if individual options are set to true.
preprocessing:
  markdown:
    enabled: false
    ensure_h1: false
    remove_lines: false
    remove_blocks: false
    remove_sections: false
    remove_links: false
    remove_images: false
    remove_html_comments: false
    normalize_tables: false
    normalize_linebreak: false
    normalize_whitespace: false

ensure_h1 adds a missing top-level # heading when the generated Markdown has no H1.
The rule prefers the first HTML <h1>, then falls back to the HTML <title>, then to the URL path.
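The fallback order can be illustrated with a minimal Python sketch. It is not the tool's implementation: the function names `pick_h1_text` and `ensure_h1` are hypothetical, and turning the last URL path segment into a title (underscores to spaces) is an assumption made only for this example.

```python
from urllib.parse import unquote, urlparse


def pick_h1_text(html_h1, html_title, url):
    """Return heading text using the order: HTML <h1>, HTML <title>, URL path."""
    for candidate in (html_h1, html_title):
        if candidate and candidate.strip():
            return candidate.strip()
    # Assumed convention: use the last path segment of the URL as the title.
    path = urlparse(url).path.rstrip("/")
    segment = unquote(path.rsplit("/", 1)[-1])
    return segment.replace("_", " ") or url


def ensure_h1(markdown, html_h1, html_title, url):
    """Prepend a '# ' heading only when the Markdown has no H1 yet."""
    # Simplification: a '# ' line inside a code block would also count as H1.
    has_h1 = any(line.startswith("# ") for line in markdown.splitlines())
    if has_h1:
        return markdown
    return f"# {pick_h1_text(html_h1, html_title, url)}\n\n{markdown}"
```

With no `<h1>` and no `<title>`, the URL `https://de.wikipedia.org/wiki/Boeing_707` yields the heading `# Boeing 707` in this sketch.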
Use the removal option that matches the Markdown level you want to remove:
| Option | Removes | Best for |
| --- | --- | --- |
| remove_lines | Matching text inside individual lines; empty result lines are removed | Short boilerplate text, labels, subtitles, generated line fragments |
| remove_blocks | Whole blocks separated by blank lines | Banners, promo boxes, generated multi-line link or table blocks |
| remove_sections | A heading and everything after it | References, appendices, web links, literature sections |

Prefer the narrowest option that fully covers the unwanted content. Use remove_lines for small text fragments, remove_blocks when a whole paragraph-like block should disappear, and remove_sections only when the rest of the document after a heading should be removed.
remove_lines removes configured text or regular-expression matches from Markdown lines.
The option accepts false, a string, or a list of strings:

remove_lines: false

Keeps all lines unchanged.

remove_lines:
  - "[Aa]us Wikipedia, der freien Enzyklopädie"
  - "[Ff]rom Wikipedia, the free encyclopedia"

Removes German and English Wikipedia subtitle text:

Boeing 707 aus Wikipedia, der freien Enzyklopädie

Becomes:

Boeing 707

If a line is empty after the configured text has been removed, the whole line is removed.
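The line rule can be sketched in a few lines of Python. This is an illustration under assumptions, not the tool's code: `remove_lines` here is a hypothetical function, and trimming the trailing space that a removal leaves behind is assumed from the example output above.

```python
import re


def remove_lines(markdown, patterns):
    """Delete regex matches inside each line; drop lines emptied by a removal."""
    if not patterns:                     # false or empty: keep everything
        return markdown
    if isinstance(patterns, str):        # a single string is treated as one pattern
        patterns = [patterns]
    compiled = [re.compile(p) for p in patterns]

    result = []
    for line in markdown.split("\n"):
        cleaned = line
        for regex in compiled:
            cleaned = regex.sub("", cleaned)
        if cleaned == line:
            result.append(line)          # untouched lines (also blank ones) stay
        elif cleaned.strip():
            result.append(cleaned.rstrip())  # assumed: trim leftover trailing space
        # else: the line only contained removed text, so it is dropped
    return "\n".join(result)
```

Lines that were already blank before the rule ran are kept, so paragraph spacing is not affected by this sketch.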
Multiple patterns can be configured:

remove_lines:
  - "[Aa]us Wikipedia, der freien Enzyklopädie"
  - "[Ff]rom Wikipedia, the free encyclopedia"

remove_blocks removes whole Markdown blocks whose content matches the configured regular expression.
Blocks are separated by blank lines. If a block matches, the whole block is removed.
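A minimal Python sketch of the block rule, assuming a block is any run of lines between blank lines. The function name `remove_blocks` and the list-only signature are choices made for this example, and the sketch also collapses the separators between kept blocks to a single blank line, which the real rule may or may not do.

```python
import re


def remove_blocks(markdown, patterns):
    """Drop every blank-line-separated block that matches one of the patterns."""
    compiled = [re.compile(p) for p in patterns]
    # Split on one or more blank (or whitespace-only) lines.
    blocks = re.split(r"\n[ \t]*\n+", markdown.strip("\n"))
    kept = [
        block for block in blocks
        if not any(regex.search(block) for regex in compiled)
    ]
    return "\n\n".join(kept)
```

Because the match is tested against the whole block, a pattern that appears only in the last line of a multi-line banner still removes the entire banner.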
The option accepts false, a string, or a list of strings:
remove_blocks: false

Keeps all blocks unchanged.

remove_blocks:
  - "Wikipedia:Wiki_Loves_Earth_"
  - "Wikidata:Events/Coordinate_Me_"

Removes generated Wiki Loves Earth or Wikidata banner blocks:

[
| Nimm teil am Wikidata-Wettbewerb |
| --- | ](https://www.wikidata.org/wiki/Wikidata:Events/Coordinate_Me_2026)

# Boeing 707

Becomes:

# Boeing 707

remove_sections removes configured sections and everything after the matching heading.
The option accepts false, a string, or a list of strings:
remove_sections: false

Keeps all sections unchanged.

remove_sections: "Einzelnachweise"

Or:

remove_sections:
  - Einzelnachweise
  - Weblinks
  - Literatur
  - Quellen
  - References
  - External links
  - Bibliography

Heading matching is case-insensitive and supports numbered headings and anchor suffixes.
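The heading matching can be sketched as follows. The exact heading forms the tool accepts are not specified here, so this Python example assumes two forms for illustration: a leading number such as `5.` or `5.1`, and a trailing anchor such as `{#refs}`. The names `normalize_title` and `remove_sections` are hypothetical.

```python
import re

HEADING = re.compile(r"^#{1,6}\s+(?P<title>.+?)\s*$")


def normalize_title(title):
    """Strip an assumed anchor suffix and an assumed leading section number."""
    title = re.sub(r"\s*\{#[^}]*\}\s*$", "", title)      # e.g. "References {#refs}"
    title = re.sub(r"^\d+(?:\.\d+)*\.?\s+", "", title)   # e.g. "5. Einzelnachweise"
    return title.strip().casefold()


def remove_sections(markdown, names):
    """Cut the document at the first heading whose title is in names."""
    if not names:
        return markdown
    if isinstance(names, str):
        names = [names]
    wanted = {name.casefold() for name in names}
    lines = markdown.split("\n")
    for index, line in enumerate(lines):
        match = HEADING.match(line)
        if match and normalize_title(match.group("title")) in wanted:
            # Everything from the matching heading onward is discarded.
            return "\n".join(lines[:index]).rstrip("\n")
    return markdown
```

Since everything after the first match is removed, listing several section names mainly guards against pages where only some of those sections exist.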
remove_links removes Markdown links whose target or text matches the configured regular expression.
The link target is the value inside the parentheses of a Markdown link:

[link text](link-target)

By default, remove_links checks only the link target, not the link text.
The configured value does not have to match at the beginning of the target. It is searched for anywhere inside the checked target or text, as if the pattern were wrapped in .* on both sides.
The option accepts false, a string, or a list of strings:

remove_links: false

Keeps all Markdown links unchanged.

remove_links: "cite_note"

Removes links whose target contains cite_note. This is useful for inline Wikipedia citation links:

Text [[17]](#cite_note-17) [[10]](#cite_note-10)

Becomes:

Text

The leading space before a removed link is removed too.
The configured string is used as a regular expression. This makes it possible to remove other link groups without adding a new preprocessing option.
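Target-based removal can be approximated with one regular expression in Python. This is a sketch under assumptions rather than the tool's implementation: `remove_links_by_target` is a hypothetical name, link titles and parentheses inside targets are not handled, and link text may contain at most one level of nested brackets.

```python
import re

# Optional leading space, then [text](target). The lookbehind skips images.
LINK = re.compile(
    r" ?(?<!!)\[(?:[^\[\]]|\[[^\[\]]*\])*\]\((?P<target>[^\s()]+)\)"
)


def remove_links_by_target(markdown, patterns):
    """Remove links whose target contains a match for any pattern."""
    if not patterns:
        return markdown
    if isinstance(patterns, str):
        patterns = [patterns]
    compiled = [re.compile(p) for p in patterns]

    def replace(match):
        target = match.group("target")
        if any(regex.search(target) for regex in compiled):
            return ""                 # drops the link and its leading space
        return match.group(0)         # non-matching links stay unchanged

    return LINK.sub(replace, markdown)
```

Using `search` instead of `match` is what makes a short value such as `cite_note` hit a target like `#cite_note-17`.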
Pattern prefixes:
remove_links: "#content"
remove_links: "anchor:#content"
remove_links: "text:Zum Inhalt springen"
remove_links: "unwrap:Air India"
remove_links: "unwrap:*"

"#content" and "anchor:#content" both check the link target. "text:Zum Inhalt springen" checks the visible link text.
"unwrap:Air India" checks the visible link text and removes only the Markdown link syntax.
"unwrap:*" unwraps all remaining Markdown links.
Processing behavior:
- Image syntax is handled by remove_images: true. remove_links rules are applied to Markdown links.
- anchor: and text: remove the whole matching link, including visible text.
- unwrap: keeps the visible link text and removes the URL and optional title.
remove_images removes only Markdown image syntax. If an image is wrapped by a Markdown link, the outer link remains and can be processed later by remove_links.
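The prefix handling described above can be sketched in Python as a single rule applied to a Markdown string. The names `parse_rule` and `apply_link_rule` are hypothetical, the regular expression covers only inline links with at most one level of nested brackets or parentheses, and whitespace cleanup around removed links is left out. Treating `*` as a catch-all token instead of a regular expression is an assumption based on the `unwrap:*` description.

```python
import re

LINK = re.compile(
    r"""
    (?<!!)                                    # ignore image syntax
    \[(?P<text>(?:[^\[\]]|\[[^\[\]]*\])*)\]   # visible text
    \(
      (?P<target>(?:[^\s()]|\([^\s()]*\))+)   # link target
      (?:\s+(?:"[^"]*"|'[^']*'|\([^()]*\)))?  # optional title
    \)
    """,
    re.VERBOSE,
)


def parse_rule(rule):
    """Split 'prefix:pattern'; a rule without a prefix checks the link target."""
    for prefix in ("anchor", "text", "unwrap"):
        if rule.startswith(prefix + ":"):
            return prefix, rule[len(prefix) + 1:]
    return "anchor", rule


def apply_link_rule(markdown, rule):
    """Apply one remove_links rule: remove matching links or unwrap them."""
    mode, pattern = parse_rule(rule)

    def replace(match):
        text, target = match.group("text"), match.group("target")
        if mode == "unwrap":
            # Assumed: "*" is a catch-all token, not a regular expression.
            hit = pattern == "*" or re.search(pattern, text)
            return text if hit else match.group(0)
        checked = target if mode == "anchor" else text
        return "" if re.search(pattern, checked) else match.group(0)

    return LINK.sub(replace, markdown)
```

For example, `apply_link_rule('[Air India](https://de.wikipedia.org/wiki/Air_India "Air India")', "unwrap:Air India")` returns `Air India` in this sketch, and a target such as `Airport_(Film)` is kept intact because one level of parentheses is allowed inside the target.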
Rules are applied in this order: complete link removals (anchor: and text:) run first, then unwrap: runs on the remaining links. This means a catch-all unwrap:* can be placed after specific removal rules:
remove_links:
  - "anchor:cite_note"
  - "anchor:#(?:[Bb]ody[Cc]ontent|content|content-start|main|main-content|maincontent)"
  - "anchor:#[Vv]orlage_[Ll]esenswert"
  - "anchor:#[Vv]orlage_[Ee]xzellent"
  - "anchor:&veaction=edit[^)]*section="
  - "anchor:&action=edit[^)]*section="
  - "unwrap:*"

Remove skip-to-content links by target:
remove_links: "anchor:#(?:[Bb]ody[Cc]ontent|content|content-start|main|main-content|maincontent)"

[Zum Inhalt springen](https://de.wikipedia.org/wiki/Boeing_707#bodyContent)
# Boeing 707

Becomes:

# Boeing 707

Remove skip-to-content links by text:

remove_links: "text:Zum Inhalt springen"

[Zum Inhalt springen](#bodyContent) [keep](#bodyContent)

Becomes:

[keep](#bodyContent)

Unwrap links by visible text:
remove_links: "unwrap:Air India"

[Air India](https://de.wikipedia.org/wiki/Air_India "Air India")

Becomes:

Air India

unwrap: accepts regular expressions:

remove_links: "unwrap:^Air India$"

Supported Markdown link forms:

[Text](url)
[Text](url "Title")
[Text](url 'Title')
[Text](url (Title))

All become:

Text

Unwrap multiple links in one line:
remove_links:
  - "unwrap:Boeing"
  - "unwrap:Air India"

[Boeing](https://de.wikipedia.org/wiki/Boeing) und [Air India](https://de.wikipedia.org/wiki/Air_India)

Becomes:

Boeing und Air India

Remove links to a specific anchor prefix:

remove_links: "custom-link"

Text [custom](#custom-link) [keep](#other-link)

Becomes:

Text [keep](#other-link)

Remove links to generated Wikipedia anchors:

remove_links: "wiki_[a-z]+"

Text [one](#wiki_intro) [two](#wiki_history) [keep](#plain)

Becomes:

Text [keep](#plain)

Remove Wikipedia featured/readable article badges:
remove_links:
  - "#[Vv]orlage_[Ll]esenswert"
  - "#[Vv]orlage_[Ee]xzellent"

[](#Vorlage_Lesenswert "Dies ist ein als lesenswert ausgezeichneter Artikel.")
# Boeing 707

Becomes:

# Boeing 707

Remove Wikipedia section edit links:
remove_links:
  - "anchor:&veaction=edit[^)]*section="
  - "anchor:&action=edit[^)]*section="

## Geschichte
[[Bearbeiten](https://de.wikipedia.org/w/index.php?title=Boeing_707&veaction=edit&section=1) | [Quelltext bearbeiten](https://de.wikipedia.org/w/index.php?title=Boeing_707&action=edit&section=1)]
Text

Becomes:

## Geschichte
Text

Remove image links by target:
remove_links: "upload\\.wikimedia\\.org"

Text [](https://upload.wikimedia.org/file.jpg)

Becomes:

Text

Remove multiple link variants in one run:

remove_links:
  - "anchor:cite_note"
  - "anchor:custom-link"
  - "anchor:upload\\.wikimedia\\.org"
  - "anchor:&veaction=edit[^)]*section="
  - "anchor:&action=edit[^)]*section="

Text [[17]](#cite_note-17) [custom](#custom-link) [image](https://upload.wikimedia.org/file.jpg) [keep](#plain)

Becomes:

Text [keep](#plain)

remove_images removes Markdown image syntax while preserving semantic image text.
The option accepts false or true:
remove_images: false

Keeps all Markdown images unchanged.

remove_images: true

For each Markdown image, the replacement text is selected with this priority:
alt > title > remove
Only the image syntax is removed. No prefix such as Image: is added.
If an image is wrapped by a Markdown link, the wrapper link target is kept.
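The alt > title > remove priority can be sketched in Python as follows. This is an illustration only: `remove_images` is a hypothetical function, the file names in the usage note are made up, and the pattern handles only inline images with an optional quoted title.

```python
import re

IMAGE = re.compile(
    r"""
    !\[(?P<alt>[^\[\]]*)\]                  # alt text, possibly empty
    \(
      (?P<src>[^\s()]+)                     # image source
      (?:\s+["'](?P<title>[^"']*)["'])?     # optional quoted title
    \)
    """,
    re.VERBOSE,
)


def remove_images(markdown):
    """Replace each image by its alt text, else its title, else nothing."""
    def replace(match):
        alt = match.group("alt").strip()
        title = (match.group("title") or "").strip()
        return alt or title      # an empty string removes the image entirely

    return IMAGE.sub(replace, markdown)
```

Under these assumptions, `![Boeing 707 Cockpit](cockpit.jpg)` becomes `Boeing 707 Cockpit`, and an image nested inside a link keeps the outer `[...](...)` wrapper because only the inner image syntax is matched.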
An image with the alt text Boeing 707 Cockpit becomes:

Boeing 707 Cockpit

An image with the title Cockpit einer Boeing 707 but no alt text becomes:

Cockpit einer Boeing 707

An image without alt text or title becomes an empty string.
Linked image with alt text:
[](https://de.wikipedia.org/wiki/Datei:image.jpg "Eine Boeing 707 der Air India")

Becomes:

[Eine Boeing 707 der Air India](https://de.wikipedia.org/wiki/Datei:image.jpg "Eine Boeing 707 der Air India")

remove_html_comments removes HTML comments from Markdown output:

Text <!-- hidden --> more text

normalize_tables normalizes Markdown tables.
This removes empty table rows such as:
| | |

It also adjusts rows to the table column count where possible by padding missing cells or trimming extra cells.
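A small Python sketch of both steps, assuming the first row defines the column count and that cells contain no escaped pipe characters. The helper names `split_row` and `normalize_table` are hypothetical, and the exact output spacing of the real rule may differ.

```python
def split_row(line):
    """Split '| a | b |' into ['a', 'b']."""
    return [cell.strip() for cell in line.strip().strip("|").split("|")]


def normalize_table(lines):
    """Drop empty rows and pad or trim every row to the header width."""
    rows = [split_row(line) for line in lines]
    width = len(rows[0])                  # assumed: the header sets the column count
    result = []
    for cells in rows:
        if not any(cells):
            continue                      # empty row such as "| | |"
        cells = (cells + [""] * width)[:width]   # pad short rows, trim long ones
        result.append("| " + " | ".join(cells) + " |")
    return result
```

The function takes the lines of one table and returns the cleaned lines, so detecting where a table starts and ends in a document is left to the caller.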
normalize_linebreak normalizes block-level line breaks.
This rule controls paragraph and block spacing, including:
- collapsing excessive blank lines
- adding spacing around tables and code blocks
- removing blank lines between list items
- splitting adjacent paragraph lines into separate Markdown paragraphs
normalize_whitespace normalizes whitespace inside individual lines.
This rule controls inline spacing, including:
- trimming trailing spaces outside code fences
- inserting missing spaces before Markdown links when text touches the link
- inserting missing spaces before opening parentheses outside Markdown link targets
Markdown link targets are protected so URLs such as Airport_(Film) are not changed.
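Two of the inline rules, trimming trailing spaces outside code fences and inserting a space before a link that touches the preceding word, can be sketched in Python. This is a simplified illustration, not the tool's implementation: `normalize_whitespace` is a hypothetical function, only triple-backtick fences are recognized, and the parenthesis rule is omitted.

```python
import re

FENCE = "`" * 3  # triple-backtick code fence marker

# Zero-width position between a word character and the start of "[text]("
TOUCHING_LINK = re.compile(r"(?<=\w)(?=\[[^\[\]]*\]\()")


def normalize_whitespace(markdown):
    """Trim trailing spaces and separate links from touching text."""
    result = []
    in_fence = False
    for line in markdown.split("\n"):
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence       # entering or leaving a code fence
            result.append(line)
            continue
        if in_fence:
            result.append(line)           # code lines are left untouched
            continue
        line = TOUCHING_LINK.sub(" ", line)
        result.append(line.rstrip(" \t"))
    return "\n".join(result)
```

In this sketch, `siehe[Boeing](https://example.org/Boeing)` followed by trailing spaces becomes `siehe [Boeing](https://example.org/Boeing)`, while lines inside a fence keep their trailing spaces.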
For Wikipedia pages, use the built-in project profile:
projects:
  planes:
    profile: wikipedia
    type: pages
    sources:
      - https://de.wikipedia.org/wiki/Boeing_707
    crawl:
      parser: kreuzberg-dev
      parse_type: markdown
      content_selector: ".mw-parser-output"

The profile provides defaults for crawl, normalization, and preprocessing.markdown. Project-level crawl, normalization, and preprocessing values override profile defaults, so you can adjust individual options without copying the whole profile. Built-in profiles are loaded from profiles/*.yml (for example profiles/wikipedia.yml).
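One way to picture the override behavior is a recursive dictionary merge, sketched here in Python. How the tool actually merges profiles is not documented in this section, so this is an assumption: nested mappings are merged key by key, while lists and scalar values from the project replace the profile value entirely. The function name `merge_config` is hypothetical.

```python
import copy


def merge_config(profile_defaults, project_overrides):
    """Return profile defaults with project-level values layered on top."""
    merged = copy.deepcopy(profile_defaults)
    for key, value in project_overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Nested mapping: merge key by key instead of replacing it.
            merged[key] = merge_config(merged[key], value)
        else:
            # Scalars and lists from the project replace the profile value.
            merged[key] = copy.deepcopy(value)
    return merged
```

Under this assumption, a project that sets only `preprocessing.markdown.remove_images: false` would keep every other default of the profile.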
The wikipedia profile currently applies these Markdown preprocessing defaults:
preprocessing:
  markdown:
    enabled: true
    ensure_h1: true
    remove_lines:
      - "[Aa]us Wikipedia, der freien Enzyklopädie"
      - "[Ff]rom Wikipedia, the free encyclopedia"
    remove_blocks:
      - "Wikipedia:Wiki_Loves_Earth_"
      - "Wikidata:Events/Coordinate_Me_"
    remove_sections:
      - Einzelnachweise
      - Weblinks
      - Literatur
      - Quellen
      - References
      - External links
      - Bibliography
    remove_links:
      - "anchor:cite_note"
      - "anchor:#(?:[Bb]ody[Cc]ontent|content|content-start|main|main-content|maincontent)"
      - "anchor:#[Vv]orlage_[Ll]esenswert"
      - "anchor:#[Vv]orlage_[Ee]xzellent"
      - "anchor:&veaction=edit[^)]*section="
      - "anchor:&action=edit[^)]*section="
    remove_images: true
    remove_html_comments: true
    normalize_tables: true
    normalize_linebreak: true
    normalize_whitespace: true