feat(imdb): migrate IMDB scraper to GraphQL and Suggest API #1981
Open
Arny80Hexa wants to merge 12 commits into Komet:master
Add new API methods to ImdbApi for the IMDB GraphQL API and Suggest API as preparation for migrating away from WAF-blocked HTML endpoints.
New files:
- ImdbGraphQLQueries.h: comprehensive GraphQL query strings for title details and episode listings (includes future fields like budget, awards, filming locations)
New ImdbApi methods:
- suggestSearch(): GET request to Suggest API for search
- sendGraphQLRequest(): POST request to GraphQL API
- loadTitleViaGraphQL(): load full title details in one request
- loadEpisodesViaGraphQL(): load episode listings
Old HTML-based methods are kept in parallel for gradual migration.
Part of Komet#1966
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
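As an illustration of the POST request that sendGraphQLRequest() is described as making, here is a minimal sketch of assembling a GraphQL-over-HTTP JSON body (a `query` string plus a `variables` object). It does not reproduce the PR's Qt code: `jsonEscape`, `buildGraphQLBody`, and the variable name `id` are hypothetical, and no endpoint URL or schema field is taken from the PR.

```cpp
#include <string>

// Escape a string so it can be embedded inside a JSON string literal.
std::string jsonEscape(const std::string& in) {
    static const char* hex = "0123456789abcdef";
    std::string out;
    for (unsigned char c : in) {
        switch (c) {
        case '"':  out += "\\\""; break;
        case '\\': out += "\\\\"; break;
        case '\n': out += "\\n";  break;
        case '\r': out += "\\r";  break;
        case '\t': out += "\\t";  break;
        default:
            if (c < 0x20) {
                // Remaining control characters become \u00XX escapes.
                out += "\\u00";
                out += hex[c >> 4];
                out += hex[c & 0x0F];
            } else {
                out += static_cast<char>(c);
            }
        }
    }
    return out;
}

// Hypothetical helper: builds {"query": "...", "variables": {"id": "..."}}.
// The variable name "id" is an assumption, not taken from the PR.
std::string buildGraphQLBody(const std::string& query, const std::string& titleId) {
    return "{\"query\":\"" + jsonEscape(query) +
           "\",\"variables\":{\"id\":\"" + jsonEscape(titleId) + "\"}}";
}
```

The resulting string would be sent as the request body with a `Content-Type: application/json` header. In the actual code base a JSON library such as Qt's QJsonDocument would normally do the serialization instead of manual escaping.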
Replace HTML-based search with the IMDB Suggest API (v3.sg.media-imdb.com/suggestion/) which returns JSON directly and is not affected by the AWS WAF blocking.
Changes:
- ImdbSearchPage: add parseSuggestResponse() for JSON parsing
- ImdbMovieSearchJob: use suggestSearch() + GraphQL for ID lookup
- ImdbTvShowSearchJob: use suggestSearch() + GraphQL for ID lookup
- Filter by qid types: movie/tvMovie/short/video for movies, tvSeries/tvMiniSeries for TV shows
Old HTML parsing methods kept as legacy until Phase 6 cleanup.
Part of Komet#1966
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
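The qid filtering described in this commit can be sketched roughly as follows. The qid values are the ones listed in the commit message. The `SuggestEntry` struct, the `SearchKind` enum, and both function names are hypothetical and only stand in for whatever parseSuggestResponse() actually produces.

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical, simplified view of one Suggest API result.
struct SuggestEntry {
    std::string id;    // e.g. an IMDB title ID
    std::string title;
    std::string qid;   // result type reported by the Suggest API
};

enum class SearchKind { Movie, TvShow };

// True if the qid belongs to the requested kind of search.
bool matchesKind(const std::string& qid, SearchKind kind) {
    static const std::unordered_set<std::string> movieTypes{"movie", "tvMovie", "short", "video"};
    static const std::unordered_set<std::string> tvTypes{"tvSeries", "tvMiniSeries"};
    const auto& allowed = (kind == SearchKind::Movie) ? movieTypes : tvTypes;
    return allowed.count(qid) > 0;
}

// Keep only the entries whose qid matches the search kind, preserving order.
std::vector<SuggestEntry> filterByKind(const std::vector<SuggestEntry>& entries, SearchKind kind) {
    std::vector<SuggestEntry> result;
    for (const auto& entry : entries) {
        if (matchesKind(entry.qid, kind)) {
            result.push_back(entry);
        }
    }
    return result;
}
```

Entries with any other qid (or none) are dropped for both search kinds in this sketch.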
Add new parsing methods to ImdbJsonParser for GraphQL API responses:
- parseFromGraphQL(): full title details (movies + TV shows)
- parseEpisodesFromGraphQL(): bulk episode data
- parseSeasonsFromGraphQL(): season number listing
Key improvements over old HTML parser:
- Full cast with character names (not just top 5)
- Localized title via AKAs (e.g. German title)
- Localized certification (e.g. FSK instead of US rating)
- Localized release date by country
- Trailer as IMDB video page URL (browser-compatible)
- Metacritic score
- Outline heuristic: shortest plot vs first sentence of longest
New ImdbData fields: localizedTitle, localizedCertification, isOngoing, network.
New ImdbEpisodeData struct for bulk episode parsing with full metadata (directors, writers, actors, ratings, thumbnails).
Legacy HTML parsing methods kept for Phase 6 cleanup.
Part of Komet#1966
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace multi-page HTML loading with a single GraphQL request for all movie details. This eliminates the download counter pattern and separate requests for keywords, plot summary, and reference page.
Key changes:
- ImdbMovieScrapeJob: single loadTitleViaGraphQL() call replaces 3-4 HTML page loads (reference, keywords, plot summary)
- Localization: localized title used as title, original kept as originalTitle when a non-English locale is selected
- ImdbMovieConfiguration: extend supportedLanguages from just "en" to 16 languages (en plus de, fr, es, it, pt, ja, ko, zh, ru, nl, pl, sv, da, fi, no), which enables the language dropdown in the scraper dialog
Part of Komet#1966
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace all TV scraper jobs with GraphQL-based implementations:
- ImdbTvShowScrapeJob: single GraphQL request for all show details (replaces reference page + shouldLoad/setIsLoaded/checkIfDone pattern)
- ImdbTvSeasonScrapeJob: bulk episode loading via GraphQL; one request for up to 250 episodes replaces sequential per-episode HTML loading (previously ~120 requests for a full series, now 1)
- ImdbTvEpisodeScrapeJob: individual episode via GraphQL, with fallback to bulk loading + filtering when no episode ID is available
- ImdbTvConfiguration: extend supportedLanguages from NoLocale to 16 languages, default to "en"
Performance improvement: Breaking Bad (62 episodes) went from ~120 sequential HTTP requests to 1 GraphQL request.
Part of Komet#1966
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove all HTML-based scraping code that has been replaced by the GraphQL and Suggest API implementations:
Deleted files:
- ImdbReferencePage.h/.cpp: HTML reference page parser
- ImdbTvShowParser.h/.cpp: TV show HTML parser
- ImdbTvSeasonParser.h/.cpp: season HTML parser
- ImdbTvEpisodeParser.h/.cpp: episode HTML parser
- testImdbTvEpisodeParser.cpp: unit test for deleted parser
Removed from existing files:
- ImdbApi: PageKind enum, loadTitle(), searchForMovie(), searchForShow(), loadSeason(), loadDefaultEpisodesPage(), sendGetRequest(), addHeadersToRequest(), and all HTML URL construction methods
- ImdbJsonParser: all __NEXT_DATA__ parsing (parseFromReferencePage, parseOverviewFromPlotSummaryPage, parseSeasonNumbersFromEpisodesPage, parseEpisodeIds, extractJsonFromHtml, followJsonPath, and all legacy private methods)
- ImdbSearchPage: parseSearch() HTML method
- ImdbShortEpisodeData struct
Updated CMakeLists.txt for both imdb/ and tv_show/imdb/ targets.
Net change: -1385 lines of dead code removed.
Part of Komet#1966
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Fix three GraphQL schema issues discovered during integration testing:
- Remove principalCredits block (used limit not first, unused by parser)
- Fix seasons structure: array of {number} not edges/node wrapper
- Fix episode numbers: displayableSeason.text + episodeNumber.text
(text strings, not nested int fields)
Add season-filtered episode query (SEASON_EPISODES_FILTERED) for
efficient single-episode loading on shows with 250+ episodes.
Update test assertions:
- Tags test: GraphQL always returns all keywords, loadAllTags flag
has no effect (removed upper bound check)
- TV search: Suggest API returns original titles, not localized
- TV search: fewer results from Suggest API vs old HTML search
Update all IMDB reference files for new GraphQL data format.
Part of Komet#1966
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
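Because the commit above notes that season and episode numbers arrive as text strings (displayableSeason.text, episodeNumber.text) rather than integers, the parser has to convert them defensively. A minimal sketch of such a conversion, assuming the text is a plain decimal number; `parseNumberText` is a hypothetical name, and the PR itself presumably relies on Qt string conversion instead.

```cpp
#include <string>

// Convert a decimal text field such as "3" into an int.
// Returns false (and leaves `out` untouched) for empty, non-numeric,
// or overly long input, so callers can skip malformed episodes.
bool parseNumberText(const std::string& text, int& out) {
    // At most 9 digits, so the value always fits into a 32-bit int.
    if (text.empty() || text.size() > 9) {
        return false;
    }
    int value = 0;
    for (char c : text) {
        if (c < '0' || c > '9') {
            return false;
        }
        value = value * 10 + (c - '0');
    }
    out = value;
    return true;
}
```

Returning a success flag instead of a sentinel value keeps season 0 (often used for specials) distinguishable from a parse failure.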
Parse images from GraphQL response as backdrops for both movies and TV shows. Add Backdrop/Fanart to supportedDetails. The old IMDB scraper never supported backdrops - this is a new capability from the GraphQL API. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When the locale is "de" (language only, no country), locale.country() returns an empty string. Derive country code from language code (de→DE, fr→FR) so AKAs, certificates, and release dates are correctly filtered by country. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
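A minimal sketch of the derivation described in this commit, using plain standard C++ rather than the project's locale class. `countryForLocale` is a hypothetical name. The commit only mentions de→DE and fr→FR; the small override table for languages whose usual country code is not simply the uppercased language code (for example ja→JP) is an assumption added here and may not match what the PR does.

```cpp
#include <cctype>
#include <map>
#include <string>

// Returns the explicit country if present; otherwise derives one from the
// language code. The overrides are an assumption of this sketch.
std::string countryForLocale(const std::string& language, const std::string& country) {
    if (!country.empty()) {
        return country;
    }
    static const std::map<std::string, std::string> overrides{
        {"ja", "JP"}, {"ko", "KR"}, {"zh", "CN"}, {"sv", "SE"}, {"da", "DK"}};
    const auto it = overrides.find(language);
    if (it != overrides.end()) {
        return it->second;
    }
    // Default case from the commit message: de -> DE, fr -> FR.
    std::string derived = language;
    for (char& c : derived) {
        c = static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
    }
    return derived;
}
```

An explicit country always wins, so a locale such as de with country AT would still filter by AT.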
Pass locale to parseEpisodesFromGraphQL so episode certificates can be filtered by country (e.g. FSK for German locale). Falls back to US certificate, then to the simple certificate field. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
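The fallback order in this commit (locale country, then US, then the simple certificate field) can be sketched as below. The map-based representation and the name `pickCertificate` are hypothetical; the real parser works on the GraphQL JSON directly.

```cpp
#include <map>
#include <string>

// Choose a certificate: the locale's country first, then "US",
// then the plain certificate field as a last resort.
std::string pickCertificate(const std::map<std::string, std::string>& byCountry,
                            const std::string& country,
                            const std::string& simpleCertificate) {
    auto it = byCountry.find(country);
    if (it != byCountry.end() && !it->second.empty()) {
        return it->second;
    }
    it = byCountry.find("US");
    if (it != byCountry.end() && !it->second.empty()) {
        return it->second;
    }
    return simpleCertificate;
}
```

Empty strings are treated like missing entries, so a country that is listed without a rating still falls through to the next step.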
- Respect m_loadAllTags setting: limit keywords to 20 when disabled
- Remove unused network field from ImdbData (IMDB GraphQL has no dedicated network field; use TMDb via Custom TV Scraper instead)
- Clarify Top250 unavailability comment in parser
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
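The keyword limit from the first bullet above can be sketched as a simple truncation. The limit of 20 comes from the commit message; the function name `limitKeywords` and the vector-of-strings representation are assumptions for illustration.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Return all keywords when loadAllTags is set, otherwise only the first `limit`.
std::vector<std::string> limitKeywords(const std::vector<std::string>& keywords,
                                       bool loadAllTags,
                                       std::size_t limit = 20) {
    if (loadAllTags || keywords.size() <= limit) {
        return keywords;
    }
    return std::vector<std::string>(
        keywords.begin(), keywords.begin() + static_cast<std::ptrdiff_t>(limit));
}
```

Since the GraphQL response is said to always contain all keywords, the truncation happens on the client side after parsing.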
The IMDB movie scraper settings page only showed the "Load all tags" checkbox. Add a LanguageCombo dropdown so users can select their preferred language (e.g. German for localized titles and FSK ratings). The TV scraper settings already showed the dropdown via the default layout. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
Migrates the entire IMDB scraper from HTML parsing to JSON APIs, fixing the complete scraper breakdown caused by AWS WAF blocking all www.imdb.com HTML requests (#1966).
- Replaces __NEXT_DATA__ / Reference page parsing with the GraphQL API (POST, JSON, no auth)
What's new compared to the old scraper
Known limitations
API note
Neither the Suggest API nor the GraphQL API is officially documented. No authentication or API tokens are required. See #1966 comment for details.
Test plan
- ./scripts/quick_checks.sh passing (our files)
Closes #1966. Also addresses #1881, #1774, #605, #1497.
Developed with AI assistance (Claude Code / Opus 4.6).