feat(imdb): migrate IMDB scraper to GraphQL and Suggest API #1981

Open
Arny80Hexa wants to merge 12 commits into Komet:master from Arny80Hexa:feat/imdb-graphql-migration

Arny80Hexa (Contributor) commented Mar 24, 2026

Summary

Migrates the entire IMDB scraper from HTML parsing to JSON APIs, fixing the complete scraper breakdown caused by AWS WAF blocking all www.imdb.com HTML requests (#1966).

  • Search: Replaced HTML parser with Suggest API (GET, JSON, no auth)
  • Details: Replaced __NEXT_DATA__ / Reference page parsing with GraphQL API (POST, JSON, no auth)
  • Episodes: Bulk loading via GraphQL (1 query per season instead of 1 request per episode)
  • Localization: Language selector now functional — localized titles (AKAs), country-specific age ratings (e.g. FSK), localized release dates
  • Images: Poster + backdrop support, actor thumbnails with URLs

What's new compared to the old scraper

| Feature | Old | New |
| --- | --- | --- |
| Cast | Top 5 actors | Full cast with character names + photo URLs |
| Episode loading | ~2s per episode (sequential) | 1 query per season (bulk) |
| Language support | English only | Localized titles, FSK ratings, release dates |
| Backdrop images | Not supported | Supported |
| Original title | Not populated | Always populated |
| Trailer | YouTube URL (rare) | IMDB video page URL (browser-only) |

Known limitations

  • Top 250 ranking is not available via GraphQL (STARmeter ≠ Top250)
  • Network (TV) has no dedicated field in IMDB GraphQL — use TMDb via Custom TV Scraper
  • Trailer URL points to IMDB video page (works in browser, not in Kodi)
  • Episode limit of 250 per query (sufficient for most shows; pagination not yet implemented)
  • Both APIs are unofficial/internal, which gives them the same legal status as the old HTML scraper. They are also used by Kodi, Jellyfin, Stremio, and Infuse.

API note

Neither the Suggest API nor the GraphQL API is officially documented. No authentication or API tokens are required. See the comment on #1966 for details.

Test plan

  • Unit tests passing (687 assertions)
  • Integration tests passing
  • ./scripts/quick_checks.sh passing (our files)
  • E2E: Movie search + scrape (Inception, Shawshank Redemption) — EN and DE
  • E2E: TV show search + scrape (Raised by Wolves, Travelers) — EN and DE
  • E2E: Episode bulk loading with actors, directors, writers, thumbnails
  • E2E: Localized titles ("Die Verurteilten", "Travelers: Die Reisenden")
  • E2E: FSK certification (show + episode level)
  • E2E: Backdrop images loading
  • E2E: Actor images in NFO (URLs written correctly)
  • Custom Movie/TV Scraper with IMDB as sub-scraper (planned for dev-build testing)

Closes #1966. Also addresses #1881, #1774, #605, #1497.

Developed with AI assistance (Claude Code / Opus 4.6).

Christoph Arndt and others added 11 commits March 24, 2026 20:46
Add new API methods to ImdbApi for the IMDB GraphQL API and Suggest API
as preparation for migrating away from WAF-blocked HTML endpoints.

New files:
- ImdbGraphQLQueries.h: comprehensive GraphQL query strings for title
  details and episode listings (includes future fields like budget,
  awards, filming locations)

New ImdbApi methods:
- suggestSearch(): GET request to Suggest API for search
- sendGraphQLRequest(): POST request to GraphQL API
- loadTitleViaGraphQL(): load full title details in one request
- loadEpisodesViaGraphQL(): load episode listings

Old HTML-based methods are kept in parallel for gradual migration.

Part of Komet#1966

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Replace HTML-based search with the IMDB Suggest API
(v3.sg.media-imdb.com/suggestion/) which returns JSON directly and
is not affected by the AWS WAF blocking.

Changes:
- ImdbSearchPage: add parseSuggestResponse() for JSON parsing
- ImdbMovieSearchJob: use suggestSearch() + GraphQL for ID lookup
- ImdbTvShowSearchJob: use suggestSearch() + GraphQL for ID lookup
- Filter by qid types: movie/tvMovie/short/video for movies,
  tvSeries/tvMiniSeries for TV shows
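
The qid filter described above could look roughly like the following sketch. The names SearchKind and matchesSearchKind are hypothetical and not taken from this PR; only the qid values come from the list above.

```cpp
#include <algorithm>
#include <array>
#include <string_view>

// Hypothetical enum: which kind of search the caller is running.
enum class SearchKind { Movie, TvShow };

// Returns true if a Suggest API entry with the given qid should be kept
// for the requested search kind. The qid sets mirror the commit message.
inline bool matchesSearchKind(std::string_view qid, SearchKind kind)
{
    static constexpr std::array<std::string_view, 4> movieQids{
        "movie", "tvMovie", "short", "video"};
    static constexpr std::array<std::string_view, 2> tvShowQids{
        "tvSeries", "tvMiniSeries"};

    if (kind == SearchKind::Movie) {
        return std::find(movieQids.begin(), movieQids.end(), qid) != movieQids.end();
    }
    return std::find(tvShowQids.begin(), tvShowQids.end(), qid) != tvShowQids.end();
}
```

Any qid outside both sets (for example an episode entry) is dropped by either search in this sketch.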

Old HTML parsing methods kept as legacy until Phase 6 cleanup.

Part of Komet#1966

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Add new parsing methods to ImdbJsonParser for GraphQL API responses:
- parseFromGraphQL(): full title details (movies + TV shows)
- parseEpisodesFromGraphQL(): bulk episode data
- parseSeasonsFromGraphQL(): season number listing

Key improvements over old HTML parser:
- Full cast with character names (not just top 5)
- Localized title via AKAs (e.g. German title)
- Localized certification (e.g. FSK instead of US rating)
- Localized release date by country
- Trailer as IMDB video page URL (browser-compatible)
- Metacritic score
- Outline heuristic: shortest plot vs first sentence of longest

New ImdbData fields: localizedTitle, localizedCertification,
isOngoing, network.

New ImdbEpisodeData struct for bulk episode parsing with full
metadata (directors, writers, actors, ratings, thumbnails).

Legacy HTML parsing methods kept for Phase 6 cleanup.

Part of Komet#1966

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Replace multi-page HTML loading with a single GraphQL request for all
movie details. This eliminates the download counter pattern and separate
requests for keywords, plot summary, and reference page.

Key changes:
- ImdbMovieScrapeJob: single loadTitleViaGraphQL() call replaces 3-4
  HTML page loads (reference, keywords, plot summary)
- Localization: localized title used as title, original kept as
  originalTitle when a non-English locale is selected
- ImdbMovieConfiguration: extend supportedLanguages from just "en" to
  16 languages ("en" plus de, fr, es, it, pt, ja, ko, zh, ru, nl, pl,
  sv, da, fi, no), which enables the language dropdown in the scraper
  dialog
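
As an illustration of the localization rule in the list above, here is a minimal sketch. Titles and resolveTitles are hypothetical names, and the fallback to the original title when no localized title exists is an assumption, not something this commit states.

```cpp
#include <string>

// Hypothetical result type: the two title fields the scraper fills.
struct Titles {
    std::string title;
    std::string originalTitle;
};

// Uses the localized title as the display title when a non-English
// language is selected; the original title is always kept.
// Assumption: an empty localized title falls back to the original.
inline Titles resolveTitles(const std::string& originalTitle,
                            const std::string& localizedTitle,
                            const std::string& language)
{
    const bool useLocalized = language != "en" && !localizedTitle.empty();
    return Titles{useLocalized ? localizedTitle : originalTitle, originalTitle};
}
```

With language "de", "The Shawshank Redemption" would be stored as originalTitle and "Die Verurteilten" as title.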

Part of Komet#1966

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Replace all TV scraper jobs with GraphQL-based implementations:

- ImdbTvShowScrapeJob: single GraphQL request for all show details
  (replaces reference page + shouldLoad/setIsLoaded/checkIfDone pattern)
- ImdbTvSeasonScrapeJob: bulk episode loading via GraphQL — one request
  for up to 250 episodes replaces sequential per-episode HTML loading
  (previously ~120 requests for a full series, now 1)
- ImdbTvEpisodeScrapeJob: individual episode via GraphQL, with fallback
  to bulk loading + filtering when no episode ID is available
- ImdbTvConfiguration: extend supportedLanguages from NoLocale to 16
  languages, default to "en"

Performance improvement: Breaking Bad (62 episodes) went from ~120
sequential HTTP requests to 1 GraphQL request.

Part of Komet#1966

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Remove all HTML-based scraping code that has been replaced by the
GraphQL and Suggest API implementations:

Deleted files:
- ImdbReferencePage.h/.cpp — HTML reference page parser
- ImdbTvShowParser.h/.cpp — TV show HTML parser
- ImdbTvSeasonParser.h/.cpp — season HTML parser
- ImdbTvEpisodeParser.h/.cpp — episode HTML parser
- testImdbTvEpisodeParser.cpp — unit test for deleted parser

Removed from existing files:
- ImdbApi: PageKind enum, loadTitle(), searchForMovie(),
  searchForShow(), loadSeason(), loadDefaultEpisodesPage(),
  sendGetRequest(), addHeadersToRequest(), and all HTML URL
  construction methods
- ImdbJsonParser: all __NEXT_DATA__ parsing (parseFromReferencePage,
  parseOverviewFromPlotSummaryPage, parseSeasonNumbersFromEpisodesPage,
  parseEpisodeIds, extractJsonFromHtml, followJsonPath, and all
  legacy private methods)
- ImdbSearchPage: parseSearch() HTML method
- ImdbShortEpisodeData struct

Updated CMakeLists.txt for both imdb/ and tv_show/imdb/ targets.

Net change: 1385 lines of dead code removed.

Part of Komet#1966

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Fix three GraphQL schema issues discovered during integration testing:
- Remove principalCredits block (it used "limit" instead of "first" and
  was not read by the parser)
- Fix seasons structure: array of {number} not edges/node wrapper
- Fix episode numbers: displayableSeason.text + episodeNumber.text
  (text strings, not nested int fields)
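
Because season and episode numbers arrive as text, they have to be converted before use. A minimal sketch of a strict conversion, assuming a hypothetical helper parseNumberText that is not part of this PR:

```cpp
#include <charconv>
#include <optional>
#include <string_view>
#include <system_error>

// Converts a text field such as displayableSeason.text or
// episodeNumber.text to an int. Returns std::nullopt when the text is
// empty or contains anything other than a base-10 integer.
inline std::optional<int> parseNumberText(std::string_view text)
{
    int value = 0;
    const char* first = text.data();
    const char* last = first + text.size();
    const auto [ptr, ec] = std::from_chars(first, last, value);
    if (ec != std::errc{} || ptr != last) {
        return std::nullopt;
    }
    return value;
}
```

Returning an optional keeps non-numeric values distinguishable from a real number instead of silently mapping them to 0.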

Add season-filtered episode query (SEASON_EPISODES_FILTERED) for
efficient single-episode loading on shows with 250+ episodes.

Update test assertions:
- Tags test: GraphQL always returns all keywords, loadAllTags flag
  has no effect (removed upper bound check)
- TV search: Suggest API returns original titles, not localized
- TV search: fewer results from Suggest API vs old HTML search

Update all IMDB reference files for new GraphQL data format.

Part of Komet#1966

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Parse images from GraphQL response as backdrops for both movies and TV
shows. Add Backdrop/Fanart to supportedDetails. The old IMDB scraper
never supported backdrops - this is a new capability from the GraphQL API.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

When the locale is "de" (language only, no country), locale.country()
returns an empty string. Derive country code from language code
(de→DE, fr→FR) so AKAs, certificates, and release dates are
correctly filtered by country.
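
The derivation described here can be sketched as follows. effectiveCountry is a hypothetical name, and plain upper-casing is an assumption based on the two examples given (de→DE, fr→FR).

```cpp
#include <cctype>
#include <string>

// Returns the explicit country code when the locale has one; otherwise
// derives it by upper-casing the language code (assumed rule).
inline std::string effectiveCountry(const std::string& language,
                                    const std::string& country)
{
    if (!country.empty()) {
        return country;
    }
    std::string derived = language;
    for (char& c : derived) {
        c = static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
    }
    return derived;
}
```

Upper-casing does not yield a valid ISO 3166 country code for every supported language (for example "ja" would become "JA", while Japan is "JP"), so a lookup table may be needed for those cases. Whether this PR handles them is not stated here.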

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Pass locale to parseEpisodesFromGraphQL so episode certificates can be
filtered by country (e.g. FSK for German locale). Falls back to US
certificate, then to the simple certificate field.
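
The fallback order (locale country, then US, then the simple certificate field) might be expressed like this. Certificate and pickCertification are hypothetical names used only for illustration.

```cpp
#include <optional>
#include <string>
#include <vector>

// Hypothetical shape of one country-specific certificate entry.
struct Certificate {
    std::string country; // e.g. "DE", "US"
    std::string rating;  // e.g. "16", "TV-14"
};

// Picks the certificate for the requested country, falls back to the US
// entry, and finally to the simple certificate field if neither exists.
inline std::optional<std::string> pickCertification(
    const std::vector<Certificate>& certificates,
    const std::string& country,
    const std::optional<std::string>& simpleCertificate)
{
    const auto findFor = [&](const std::string& wanted) -> std::optional<std::string> {
        for (const auto& cert : certificates) {
            if (cert.country == wanted && !cert.rating.empty()) {
                return cert.rating;
            }
        }
        return std::nullopt;
    };

    if (auto rating = findFor(country)) {
        return rating;
    }
    if (auto rating = findFor("US")) {
        return rating;
    }
    return simpleCertificate;
}
```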

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Respect m_loadAllTags setting: limit keywords to 20 when disabled
- Remove unused network field from ImdbData (IMDB GraphQL has no
  dedicated network field; use TMDb via Custom TV Scraper instead)
- Clarify Top250 unavailability comment in parser
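
A minimal sketch of the keyword limit from the first item above. limitKeywords is a hypothetical free function; in the PR the flag is the member m_loadAllTags, and keeping the first 20 entries in response order is an assumption.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Keeps all keywords when loadAllTags is set; otherwise truncates the
// list to the first maxTags entries (20, per the commit message).
inline std::vector<std::string> limitKeywords(std::vector<std::string> keywords,
                                              bool loadAllTags,
                                              std::size_t maxTags = 20)
{
    if (!loadAllTags && keywords.size() > maxTags) {
        keywords.resize(maxTags);
    }
    return keywords;
}
```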

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The IMDB movie scraper settings page only showed the "Load all tags"
checkbox. Add a LanguageCombo dropdown so users can select their
preferred language (e.g. German for localized titles and FSK ratings).
The TV scraper settings already showed the dropdown via the default
layout.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Development

Successfully merging this pull request may close these issues.

IMDB scraper broken: AWS WAF blocks all HTML requests — migration to API endpoints needed