perf: improve GitHub commit activity fetching #258
Merged
Replace the previous ThreadPoolExecutor-based timeout approach with direct REST calls for repo commit activity. Add status constants (ready/pending/failed), helper functions (_commit_activity_url, _write_commit_activity, _fetch_commit_activity) to handle GitHub 202/204/200 responses, and a two-pass _collect_commit_activity to trigger and then collect pending stats. Update update_github to prefetch commit activity for active repos, include proper headers, and iterate only non-archived repos. Update tests to cover the new fetch/collect behavior and remove the previous timeout-based tests.
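As a rough illustration of the 202/204/200 handling described above, the sketch below maps a statistics response onto the three status values. The function name, the constant values, and the decision to treat 204 as "ready with no data" are assumptions made for this example; the PR only names the statuses and the helpers, not their bodies.

```python
from typing import Any, Optional, Tuple

# Hypothetical constant values; the PR only names the three states.
STATUS_READY = "ready"
STATUS_PENDING = "pending"
STATUS_FAILED = "failed"


def classify_commit_activity_response(
    status_code: int, payload: Optional[Any]
) -> Tuple[str, list]:
    """Map a GitHub statistics response onto a status and the data to store."""
    if status_code == 200 and isinstance(payload, list):
        # Statistics are computed and returned as a list of weekly records.
        return STATUS_READY, payload
    if status_code == 204:
        # No content. Assumption: treat this as "ready" with nothing to store.
        return STATUS_READY, []
    if status_code == 202:
        # GitHub accepted the request and is still computing the statistics.
        return STATUS_PENDING, []
    # Anything else (errors, unexpected payloads) is reported as failed.
    return STATUS_FAILED, []
```

A caller could write the returned list to disk when the status is ready and queue the repository for a second pass when it is pending.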
Codecov Report
✅ All modified and coverable lines are covered by tests.

@@            Coverage Diff            @@
##            master      #258   +/-   ##
=========================================
  Coverage   100.00%   100.00%
=========================================
  Files            6         6
  Lines          862       955    +93
=========================================
+ Hits           862       955    +93
Introduce a per-repository GitHub step runner with timeouts (GITHUB_REPO_STEP_TIMEOUT=90s) to guard against slow or failing API calls. Add _run_github_repo_step, which runs callables in a daemon thread, returns a default on timeout or error, and logs warnings. Extract helper functions _collect_open_pulls and _fetch_open_graph_image_url, and integrate the timeout wrapper into _process_github_repo for languages, pulls, code scanning alerts, star history, and OpenGraph image fetch/download (image download uses a 30s timeout). Update tests: add test_run_github_repo_step_timeout and adjust an existing test to assert warning behavior instead of raising SystemExit.
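The daemon-thread timeout wrapper described in this commit could look roughly like the sketch below. The name run_step_with_timeout, its parameters, and the log messages are hypothetical; only the behavior (run a callable in a daemon thread, return a default on timeout or error, log a warning) comes from the commit message.

```python
import logging
import threading
from typing import Any, Callable

log = logging.getLogger(__name__)


def run_step_with_timeout(
    func: Callable[[], Any],
    *,
    name: str,
    default: Any = None,
    timeout: float = 90.0,
) -> Any:
    """Run func in a daemon thread; return default on timeout or error."""
    result = {}

    def target() -> None:
        # Capture either the return value or the exception for the caller.
        try:
            result["value"] = func()
        except Exception as exc:
            result["error"] = exc

    worker = threading.Thread(target=target, daemon=True)
    worker.start()
    worker.join(timeout)

    if worker.is_alive():
        # The thread is abandoned, not killed; being a daemon, it cannot
        # keep the interpreter alive at exit.
        log.warning("%s timed out after %ss", name, timeout)
        return default
    if "error" in result:
        log.warning("%s failed: %s", name, result["error"])
        return default
    return result["value"]
```

One trade-off of this design is that a timed-out call keeps running in the background until it finishes or the process exits, so it bounds waiting time rather than work done.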
Add retry polling to _collect_commit_activity to handle GitHub's 202/async stats calculation. Introduce COMMIT_ACTIVITY_POLL_ATTEMPTS and COMMIT_ACTIVITY_POLL_INTERVAL (defaults 6 and 15s) and allow callers to override poll_attempts and poll_interval. Between attempts the function logs and writes progress messages, sleeps for the configured interval, and retries only pending repos; when repos remain pending after all attempts it logs a consolidated warning. Update unit tests to pass poll parameters, add a test that verifies repeated polls and sleeps, and ensure existing tests monkeypatch time.sleep where needed.
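A minimal sketch of the retry polling described in this commit, assuming a hypothetical fetch callable that returns a status string for each repository. The injectable sleep parameter is an addition for this example so the loop can be exercised without waiting; the defaults mirror the 6 attempts and 15s interval mentioned above.

```python
import time
from typing import Callable, Iterable, List


def poll_pending(
    repos: Iterable[str],
    fetch: Callable[[str], str],
    *,
    poll_attempts: int = 6,
    poll_interval: float = 15.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Return the repositories that are still pending after all attempts."""
    pending = list(repos)
    for attempt in range(1, poll_attempts + 1):
        # Only repositories that were pending on the previous attempt are retried.
        pending = [repo for repo in pending if fetch(repo) == "pending"]
        if not pending:
            break
        if attempt < poll_attempts:
            # Wait before the next attempt, but not after the last one.
            sleep(poll_interval)
    return pending
```

The caller can log a single consolidated warning for whatever list is returned.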
Switch commit-activity collection to use /stats/participation (weekly totals) instead of /stats/commit_activity to avoid long 202 responses in CI. Add cache helper paths and functions to read/write commitActivity and commitActivityHashes (by default-branch SHA), compute commitActivity-shaped records from participation totals, and only refresh stats when the default-branch SHA changes. Remove polling logic/constants and simplify collection to skip repos with up-to-date cached data. Also add a GH Actions step to restore the generated data cache before collection. Update unit tests to cover the new caching behaviour, participation conversion, and workflow changes.
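The conversion and cache check described in this commit might be sketched as follows. The function names, the latest_week_start argument, and the zero-filled "days" list are assumptions: /stats/participation returns only weekly totals (under "all" and "owner", oldest week first), so week timestamps and per-day counts have to be supplied or left as placeholders.

```python
from typing import Dict, List

SECONDS_PER_WEEK = 7 * 24 * 60 * 60


def participation_to_commit_activity(
    participation: Dict[str, List[int]], latest_week_start: int
) -> List[dict]:
    """Build commitActivity-shaped records from participation weekly totals."""
    totals = participation.get("all", [])
    # Work backwards from the most recent week to find the first week's start.
    first_week = latest_week_start - SECONDS_PER_WEEK * (len(totals) - 1)
    return [
        {
            "week": first_week + SECONDS_PER_WEEK * index,
            "total": total,
            # Placeholder: participation data has no per-day breakdown.
            "days": [0] * 7,
        }
        for index, total in enumerate(totals)
    ]


def needs_refresh(
    cached_hashes: Dict[str, str], repo: str, default_branch_sha: str
) -> bool:
    """Refresh only when the cached SHA differs from the current one."""
    return cached_hashes.get(repo) != default_branch_sha
```

Keying the cache on the default-branch SHA means a repository with no new commits is skipped without any statistics request.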
Consolidate logging in _run_github_repo_step to use single log.warning calls (remove tqdm.write duplicates) and keep returning the default on timeout or error. Revamp _collect_commit_activity to use a two-pass approach: a priming pass to trigger GitHub stats calculation and a second pass to re-check only pending repositories, aggregating pending repo names into a single warning. Update unit tests to match the new logging/behavior (add test_run_github_repo_step_error, adjust timeout test expectations, replace the pending test with deterministic status sequences, and add a test ensuring early return when all repos are ready).
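The two-pass approach described in this commit can be sketched as below, assuming a hypothetical fetch callable that returns "ready", "pending", or "failed" per repository. The function name and warning text are illustrative, and failed repositories are simply not retried in this sketch.

```python
import logging
from typing import Callable, Iterable, List

log = logging.getLogger(__name__)


def collect_two_pass(repos: Iterable[str], fetch: Callable[[str], str]) -> List[str]:
    """Prime every repository, then re-check only the pending ones."""
    # Pass 1: the first request triggers GitHub's stats calculation.
    pending = [repo for repo in repos if fetch(repo) == "pending"]
    if not pending:
        # Early return: everything was ready (or failed) on the first pass.
        return []

    # Pass 2: re-check only the repositories that were still pending.
    still_pending = [repo for repo in pending if fetch(repo) == "pending"]
    if still_pending:
        # One aggregated warning instead of one log line per repository.
        log.warning(
            "Commit activity still pending for: %s", ", ".join(still_pending)
        )
    return still_pending
```

Because the second pass visits only pending repositories, a run in which every repository is ready costs a single request per repository.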
9143603 to 04e9065