Raw LaTeX math delimiters mangled by markdown escaping #224

@MaxWolf-01

When a page uses MathJax with raw $/$$ delimiters and is extracted without JS execution, the LaTeX is left as plain text in the HTML. Turndown then escapes markdown-special characters inside the math, breaking it.

Example

https://graphdeeplearning.github.io/post/transformers-are-gnns/

The page loads MathJax 3 via <script src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-chtml.js"> and has LaTeX directly in the HTML text:

<p>$$
h_{i}^{\ell+1} = \text{Attention} \left( Q^{\ell} h_{i}^{\ell} \, K^{\ell} h_{j}^{\ell} \, V^{\ell} h_{j}^{\ell} \right),
$$</p>

Expected:

$$ h_{i}^{\ell+1} = \text{Attention} \left( Q^{\ell} h_{i}^{\ell} \, K^{\ell} h_{j}^{\ell} \, V^{\ell} h_{j}^{\ell} \right), $$

Actual:

$$ h\_{i}^{\\ell+1} = \\text{Attention} \\left( Q^{\\ell} h\_{i}^{\\ell} \\, K^{\\ell} h\_{j}^{\\ell} \\, V^{\\ell} h\_{j}^{\\ell} \\right), $$

Root cause

Defuddle's math pipeline handles MathJax's rendered output (mjx-container, .MathJax, script[type="math/tex"]) but not its unrendered input. When MathJax doesn't execute (static/node extraction), the LaTeX stays as raw text in <p> elements, so no structured math elements exist for the math rules to match. Turndown then treats the math as ordinary prose and escapes \ as \\ and _ as \_, breaking the LaTeX.

This affects any site using MathJax's tex input with $/$$ delimiters when extracted without JS execution (common on Hugo/Jekyll academic blogs, lecture notes, etc.).
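The escaping described above can be reproduced without Turndown. The sketch below is a simplified stand-in, not Turndown's actual escape routine: `escapeLikeMarkdown` is a hypothetical name, and it covers only the two characters relevant to this issue (backslash and underscore).

```typescript
// Hypothetical stand-in for a markdown escaper. It is NOT Turndown's code
// and handles only the two characters that damage the LaTeX in this issue.
function escapeLikeMarkdown(text: string): string {
  return text
    .replace(/\\/g, "\\\\") // every backslash becomes two backslashes
    .replace(/_/g, "\\_");  // every underscore gets a backslash prefix
}

// String.raw keeps the backslashes literal, so the LaTeX reads as written.
const rawLatex = String.raw`h_{i}^{\ell}`;
const escaped = escapeLikeMarkdown(rawLatex);
console.log(escaped); // h\_{i}^{\\ell}
```

Applied to the "Expected" line from the example, the same two replacements produce the "Actual" line character for character, which is consistent with the math being escaped as plain prose.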

Fix direction

Detect raw LaTeX delimiters in text nodes and wrap matched $...$ / $$...$$ spans in <math data-latex="..."> elements before Turndown conversion. This feeds into the existing math → data-latex → markdown pipeline that KaTeX and MathML already use.
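A minimal sketch of the detection step, operating on the text of a single text node. Everything here is an assumption for illustration: `findMathSpans`, `wrapRawLatex`, the `MathSpan` shape, the `display` attribute, and the heuristics for rejecting currency-style dollar signs are hypothetical and are not Defuddle's API. It returns an HTML string only to stay DOM-free; a real fix would more likely create elements directly.

```typescript
// Hypothetical helpers, not part of Defuddle.
interface MathSpan {
  start: number;    // index of the opening delimiter
  end: number;      // index just past the closing delimiter
  latex: string;    // trimmed content between the delimiters
  display: boolean; // true for $$...$$, false for $...$
}

// Index of the next unescaped `delim` at or after `from`, or -1.
function findClose(text: string, from: number, delim: string): number {
  for (let j = from; j < text.length; j++) {
    if (text[j] === "\\") { j++; continue; } // skip the escaped character
    if (text.startsWith(delim, j)) return j;
  }
  return -1;
}

// Assumed heuristic: inline math stays on one line, has no space at either
// edge, and is not followed by a digit. This rejects "$5 and $10".
function isPlausible(body: string, display: boolean, next: string | undefined): boolean {
  if (body.trim() === "") return false;
  if (display) return true;
  const edgeSpace = /^\s|\s$/.test(body);
  const nextIsDigit = next !== undefined && /[0-9]/.test(next);
  return !edgeSpace && !body.includes("\n") && !nextIsDigit;
}

function findMathSpans(text: string): MathSpan[] {
  const spans: MathSpan[] = [];
  let i = 0;
  while (i < text.length) {
    if (text[i] === "\\") { i += 2; continue; } // e.g. a literal \$
    if (text[i] !== "$") { i += 1; continue; }
    const display = text[i + 1] === "$"; // check $$ before $
    const delim = display ? "$$" : "$";
    const bodyStart = i + delim.length;
    const close = findClose(text, bodyStart, delim);
    const body = close === -1 ? "" : text.slice(bodyStart, close);
    if (close !== -1 && isPlausible(body, display, text[close + delim.length])) {
      spans.push({ start: i, end: close + delim.length, latex: body.trim(), display });
      i = close + delim.length;
    } else {
      i = bodyStart; // not math; resume scanning after this delimiter
    }
  }
  return spans;
}

function escapeHtml(s: string): string {
  return s.replace(/&/g, "&amp;").replace(/</g, "&lt;")
          .replace(/>/g, "&gt;").replace(/"/g, "&quot;");
}

// Replace each detected span with a <math data-latex="..."> element.
function wrapRawLatex(text: string): string {
  let out = "";
  let cursor = 0;
  for (const span of findMathSpans(text)) {
    out += escapeHtml(text.slice(cursor, span.start));
    const mode = span.display ? "block" : "inline";
    out += `<math data-latex="${escapeHtml(span.latex)}" display="${mode}"></math>`;
    cursor = span.end;
  }
  return out + escapeHtml(text.slice(cursor));
}

console.log(wrapRawLatex("Let $x_i$ be"));
// Let <math data-latex="x_i" display="inline"></math> be
```

Because the LaTeX ends up in an attribute rather than in text content, the markdown escaper never sees it. The trade-off is false positives: any page that uses bare dollar signs in prose depends on heuristics like the ones assumed above, so restricting the pass to pages that actually load MathJax may be worth considering.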
