feat: Add substack.com custom parser by jocmp · Pull Request #177 · jocmp/mercury-parser

jocmp · 2026-05-02T02:31:19Z

Summary

Adds a custom mercury-parser extractor for substack.com (and any *.substack.com subdomain via the base-domain match in get-extractor.js).

Selectors target the canonical Substack web post structure:

title / dek / lead image / author come from og: and name="author" meta tags
content uses .available-content (the wrapper around .body.markup)
transforms wrap .captioned-image-container figures and unwrap .image-link lightbox links so embedded images survive
cleans Substack's inline subscribe / share / poll widgets
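For orientation, the selector list above could map onto a mercury-parser custom extractor roughly as follows. This is a minimal sketch, not the file from this PR: the object name, the `og:description` choice for the dek, the `figure` transform target, and every class name under `clean` are assumptions for illustration. It also assumes mercury has already normalized `property`/`content` meta attributes to `name`/`value` before selectors run.

```javascript
// Hypothetical sketch of a custom extractor; NOT the PR's actual source.
const SubstackExtractorSketch = {
  domain: 'substack.com',

  // Metadata is read from meta tags as [selector, attribute] pairs.
  title: { selectors: [['meta[name="og:title"]', 'value']] },
  author: { selectors: [['meta[name="author"]', 'value']] },
  // Assumption: the dek comes from og:description.
  dek: { selectors: [['meta[name="og:description"]', 'value']] },
  lead_image_url: { selectors: [['meta[name="og:image"]', 'value']] },

  content: {
    // .available-content wraps .body.markup on a Substack web post.
    selectors: ['.available-content'],

    transforms: {
      // Assumption: turn the captioned image container into a <figure>.
      '.captioned-image-container': 'figure',
      // Unwrap the lightbox link so the <img> inside it is kept.
      '.image-link': $node => {
        $node.replaceWith($node.contents());
      },
    },

    // Assumption: placeholder class names for the subscribe / share / poll
    // widgets; the real selectors are in the PR diff.
    clean: ['.subscription-widget-wrap', '.share-widget', '.poll-embed'],
  },
};
```

A string transform renames the matched element, while a function transform receives the matched node and can restructure it, which is why the lightbox unwrap is written as a function.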

Verified against a real article (https://garymarcus.substack.com/p/dario-amodei-hype-ai-safety-and-the): 7 images and 4 figures are preserved end-to-end through the parse pipeline.

Context: capyreader#2014

Filed against jocmp/capyreader#2014. The reporter is reading Substack newsletters via kill-the-newsletter (KTN) in capyreader, and "extract full content" gives them cropped images and missing headings.

Important caveat: capyreader's Account.fetchFullContent passes article.url to mercury-parser, and KTN entry URLs are kill-the-newsletter.com/feeds/<id>/entries/<id>.html — not Substack URLs. So a substack.com parser will not directly fix that user's KTN setup. It will, however, fix:

direct Substack subscriptions (any *.substack.com feed extracted in capyreader or any other mercury-parser consumer)
the broader Substack ecosystem in general

A follow-up kill-the-newsletter.com parser is likely needed to fully close capyreader#2014, but that would require a real KTN entry fixture (the email body, raw, with no surrounding <html> wrapper). I didn't have one available.
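The routing caveat can be illustrated with a simplified sketch of hostname / base-domain lookup. This is an approximation of the idea, not the code in get-extractor.js: the registry, the function names, and the KTN feed and entry IDs in the sample URL are placeholders.

```javascript
// Simplified stand-in for the extractor registry (hypothetical).
const extractorsByDomain = {
  'substack.com': { domain: 'substack.com' },
};

// Naive base domain: keep the last two labels of the hostname.
function baseDomainOf(hostname) {
  return hostname.split('.').slice(-2).join('.');
}

// Try the exact hostname first, then the base domain; null means
// "no custom extractor", i.e. the generic extractor would be used.
function pickExtractor(url) {
  const { hostname } = new URL(url);
  return (
    extractorsByDomain[hostname] ||
    extractorsByDomain[baseDomainOf(hostname)] ||
    null
  );
}

const substackPost =
  'https://garymarcus.substack.com/p/dario-amodei-hype-ai-safety-and-the';
// Placeholder IDs; real KTN entry URLs follow the same shape.
const ktnEntry =
  'https://kill-the-newsletter.com/feeds/example-feed/entries/example-entry.html';

console.log(pickExtractor(substackPost) !== null); // true: base domain matches
console.log(pickExtractor(ktnEntry) !== null); // false: hostname is not Substack
```

Because the lookup only sees the URL it is given, a Substack email delivered through KTN never reaches the substack.com extractor, whatever its HTML looks like.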

Limitations

Substacks on custom domains (e.g. noahpinion.blog, astralcodexten.com, stratechery.com) won't match this extractor — mercury-parser routes by hostname / base domain, and a custom domain's base differs. They would need their own per-domain parsers (or detection-by-html, but that's a larger change).
date_published is left to mercury's generic detection. Substack only emits the published date inside JSON-LD (<script type="application/ld+json">), which mercury's resource cleaner strips before extractor selectors run, and the on-page date markup uses obfuscated emotion-style class hashes that aren't stable across builds.
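To make the date limitation concrete, here is a hedged sketch of how a published date could be read from JSON-LD if the raw HTML were available before the resource cleaner runs. It is not part of this PR, and no such hook is implied to exist in mercury-parser; the function name is hypothetical, and `datePublished` is assumed to be the field name because it is the standard schema.org property.

```javascript
// Hypothetical helper: scan raw HTML for JSON-LD blocks and return the
// first string-valued datePublished, or null if none is found.
function datePublishedFromJsonLd(rawHtml) {
  const scriptRe =
    /<script[^>]*type=["']application\/ld\+json["'][^>]*>([\s\S]*?)<\/script>/gi;
  let match;
  while ((match = scriptRe.exec(rawHtml)) !== null) {
    try {
      const data = JSON.parse(match[1]);
      // A JSON-LD block may hold one object or an array of objects.
      const items = Array.isArray(data) ? data : [data];
      for (const item of items) {
        if (item && typeof item.datePublished === 'string') {
          return item.datePublished;
        }
      }
    } catch (err) {
      // Malformed JSON-LD: skip this block and keep scanning.
    }
  }
  return null;
}
```

Since extractor selectors only see the cleaned DOM, a helper like this would have to run earlier in the pipeline, which is the larger change this PR avoids.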

Test plan

npx jest src/extractors/custom/substack.com/index.test.js — all 6 tests pass
Smoke-tested against a longer Substack post (Dario Amodei essay) and confirmed 7 images / 4 figures / 45 paragraphs survive parsing
npx eslint src/extractors/custom/substack.com/ clean
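The image / figure / paragraph counts in the smoke test could be reproduced with a small helper like the one below. This is an illustrative sketch, not the check used for this PR: `countTags` is a hypothetical name, and it counts opening tags with a regular expression rather than a real HTML parser, which is adequate for a quick sanity check but not for arbitrary markup.

```javascript
// Hypothetical helper: count opening tags of a given name in an HTML string.
function countTags(html, tagName) {
  // Matches <tag>, <tag attr="...">, and <tag/>; does not match longer
  // tag names that merely share a prefix (e.g. <pre> when counting <p>).
  const re = new RegExp(`<${tagName}(\\s[^>]*)?/?>`, 'gi');
  return (html.match(re) || []).length;
}

// Summarize the elements the smoke test cares about.
function summarizeContent(html) {
  return {
    images: countTags(html, 'img'),
    figures: countTags(html, 'figure'),
    paragraphs: countTags(html, 'p'),
  };
}
```

Applied to the `content` string of a parse result, `summarizeContent` would give the three numbers quoted above for comparison between runs.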

feat: Add substack.com custom parser

6bbc0b8

jocmp wants to merge 1 commit into main from jc/2014/substack-parser