feat: Add substack.com custom parser #177

Open
jocmp wants to merge 1 commit into main from jc/2014/substack-parser

jocmp commented May 2, 2026

Summary

Adds a custom mercury-parser extractor for substack.com (and any *.substack.com subdomain via the base-domain match in get-extractor.js).

Selectors target the canonical Substack web post structure:

  • title / dek / lead image / author come from og: and name="author" meta tags
  • content uses .available-content (the wrapper around .body.markup)
  • transforms wrap .captioned-image-container figures and unwrap .image-link lightbox links so embedded images survive
  • cleans Substack's inline subscribe / share / poll widgets
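The selector list above could translate into an extractor object along these lines. This is a minimal sketch, not the PR's actual diff: the `og:description` source for the dek, the string transform that retags the image container as `figure`, and the three widget class names in `clean` are assumptions made for illustration. The `['meta[name="..."]', 'value']` pairs assume mercury-parser's convention of reading normalized meta tags by attribute.

```javascript
// Sketch of a mercury-parser style extractor object for substack.com.
// Anything marked "assumed" is illustrative and not taken from the PR.
const SubstackComExtractor = {
  domain: 'substack.com',

  title: { selectors: [['meta[name="og:title"]', 'value']] },
  // assumed: the dek is read from og:description
  dek: { selectors: [['meta[name="og:description"]', 'value']] },
  lead_image_url: { selectors: [['meta[name="og:image"]', 'value']] },
  author: { selectors: [['meta[name="author"]', 'value']] },

  content: {
    selectors: ['.available-content'],
    transforms: {
      // assumed: a string transform that retags the container as <figure>
      '.captioned-image-container': 'figure',
      // Replace the lightbox link with its own inner HTML so the <img> stays.
      '.image-link': $node => {
        $node.replaceWith($node.html());
      },
    },
    // assumed (hypothetical) class names for subscribe / share / poll widgets
    clean: ['.subscription-widget-wrap', '.share-dialog', '.poll-embed'],
  },
};
```

Leaving `date_published` out of the object means the generic date detection runs for that field.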

Verified against a real article (https://garymarcus.substack.com/p/dario-amodei-hype-ai-safety-and-the): 7 images and 4 figures are preserved end-to-end through the parse pipeline.

Context: capyreader#2014

Filed against jocmp/capyreader#2014. The reporter is reading Substack newsletters via kill-the-newsletter (KTN) in capyreader, and "extract full content" gives them cropped images and missing headings.

Important caveat: capyreader's Account.fetchFullContent passes article.url to mercury-parser, and KTN entry URLs are kill-the-newsletter.com/feeds/<id>/entries/<id>.html — not Substack URLs. So a substack.com parser will not directly fix that user's KTN setup. It will, however, fix:

  • direct Substack subscriptions (any *.substack.com feed extracted in capyreader or any other mercury-parser consumer)
  • the broader Substack ecosystem
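To make the routing concrete, here is a simplified sketch of hostname / base-domain lookup. The `extractors` registry and the `pickExtractor` name are hypothetical stand-ins for what get-extractor.js does, and the KTN feed and entry ids in the usage note are made up.

```javascript
// Hypothetical registry keyed by domain; the real project keeps its own map.
const extractors = {
  'substack.com': { domain: 'substack.com' },
};

// Try the full hostname first, then the base domain (last two labels).
// Returns null where the real parser would fall back to generic extraction.
function pickExtractor(url) {
  const { hostname } = new URL(url);
  const baseDomain = hostname
    .split('.')
    .slice(-2)
    .join('.');
  return extractors[hostname] || extractors[baseDomain] || null;
}
```

Under these assumptions, `pickExtractor('https://garymarcus.substack.com/p/dario-amodei-hype-ai-safety-and-the')` resolves to the substack.com entry, while a KTN entry URL such as `https://kill-the-newsletter.com/feeds/abc/entries/def.html` or a custom-domain Substack resolves to `null`.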

A follow-up kill-the-newsletter.com parser is likely needed to fully close capyreader#2014, but that would require a real KTN entry fixture (the email body, raw, with no surrounding <html> wrapper). I didn't have one available.

Limitations

  • Substacks on custom domains (e.g. noahpinion.blog, astralcodexten.com, stratechery.com) won't match this extractor — mercury-parser routes by hostname / base domain, and a custom domain's base differs. They would need their own per-domain parsers (or detection-by-html, but that's a larger change).
  • date_published is left to mercury's generic detection. Substack only emits the published date inside JSON-LD (<script type="application/ld+json">), which mercury's resource cleaner strips before extractor selectors run, and the on-page date markup uses obfuscated emotion-style class hashes that aren't stable across builds.
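For reference, the JSON-LD date could in principle be read from the raw HTML before any cleaning happens. The sketch below is not part of this PR and does not use any mercury-parser API; the function name is hypothetical, and it assumes the date sits in a top-level schema.org `datePublished` property.

```javascript
// Scan raw HTML for JSON-LD blocks and return the first datePublished string.
// Assumes a top-level `datePublished` property (schema.org naming).
function datePublishedFromJsonLd(html) {
  const re = /<script[^>]*type=["']application\/ld\+json["'][^>]*>([\s\S]*?)<\/script>/gi;
  let match = re.exec(html);
  while (match !== null) {
    try {
      const data = JSON.parse(match[1]);
      const items = Array.isArray(data) ? data : [data];
      const found = items.find(
        item => item && typeof item.datePublished === 'string'
      );
      if (found) return found.datePublished;
    } catch (err) {
      // Malformed JSON-LD block: ignore it and try the next one.
    }
    match = re.exec(html);
  }
  return null;
}
```

A regex scan keeps the sketch dependency-free; a real implementation would more likely query the parsed DOM.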

Test plan

  • npx jest src/extractors/custom/substack.com/index.test.js — all 6 tests pass
  • Smoke-tested against a longer Substack post (Dario Amodei essay) and confirmed 7 images / 4 figures / 45 paragraphs survive parsing
  • npx eslint src/extractors/custom/substack.com/ clean
