moonshot medic: tool for resetting and collecting data on moonshots #1193

Closed
aerickson wants to merge 55 commits into master from 042926-moonshot-medic

@aerickson aerickson commented Apr 29, 2026

Summary

New script to automate hang diagnostics on t-linux64-ms-* Moonshot cartridges.
Intended to run continuously via --auto --confirm --loop-interval N.

  • Fetches bad-host list from fleetroll, resets via iLO (reset_moonshot.py),
    SSHes in to run moonshot_hang_report.py, and appends a fleetroll host-audit
  • Processes all detected hosts per run in sequential batches of 10
  • Persists per-host reset history in state.json; automatically skips hosts
    with 3+ consecutive reset failures for 6h and flags them for human attention
  • Validates fleetroll data freshness (--stale-threshold + --min-fresh-pct 65)
    before each run; skips the iteration (rather than exiting) if data is stale
  • Generates OVERVIEW.md and OVERVIEW.html on startup and after each batch,
    including daily activity counts and a skipped-host attention list
  • 15s countdown before first run; graceful Ctrl-C (first press finishes the
    current host, second press exits immediately); spoken announcements via say
    during work hours
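
The skip rule for repeatedly failing hosts could look roughly like the sketch below. It is an illustration only, not the code in this PR: the function name `should_skip`, the constants, and the `consecutive_failures` / `last_failure` keys are assumed names for a per-host entry in state.json; only the 3-failure threshold and 6h window come from the summary above.

```python
from datetime import datetime, timedelta, timezone

# Assumed names; the values (3 failures, 6 hours) come from the PR summary.
MAX_CONSECUTIVE_FAILURES = 3
SKIP_WINDOW = timedelta(hours=6)


def should_skip(entry: dict, now: datetime) -> bool:
    """Return True if a host's reset history says to leave it alone for now.

    `entry` is a hypothetical per-host record with:
      consecutive_failures: int
      last_failure: ISO 8601 timestamp with a UTC offset
    """
    if entry.get("consecutive_failures", 0) < MAX_CONSECUTIVE_FAILURES:
        return False
    last_failure = datetime.fromisoformat(entry["last_failure"])
    # Skip only while the most recent failure is inside the window;
    # once it ages out, the host becomes eligible for another attempt.
    return now - last_failure < SKIP_WINDOW


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    recent = (now - timedelta(hours=1)).isoformat()
    print(should_skip({"consecutive_failures": 3, "last_failure": recent}, now))
```

Keying the window off the last failure time rather than a stored "skip until" value keeps the state file free of derived fields, at the cost of recomputing the decision on every run.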

Test plan

  • Run with --auto --confirm against a small set of known-bad hosts; verify
    all batches process, OVERVIEW files update, and results land in the right dir
  • Confirm stale fleetroll data warns and retries rather than exiting
  • Confirm Ctrl-C stops cleanly at batch boundary on first press, exits
    immediately on second
  • Open OVERVIEW.html and verify daily counts and host table render correctly
  • Trigger 3 consecutive reset failures on a host and confirm it gets skipped
    and appears in the attention section

- capture() in auto loop now uses check=False to survive transient fleetroll failures
- record_reset_failure no longer double-counts total_resets
- collect_host writes to a .tmp file and renames on success; failed runs leave no .md
- reset-failure hosts now appear in the FAIL: summary line
- worker_fqdn 616-630 band documented with a comment explaining the DC layout exception
- state.json writes are atomic (write-tmp + replace)
- load_state logs a warning instead of silently swallowing parse errors
- _log_fh wrapped in try/finally to ensure it closes on exceptions and sys.exit
- signal handler installed before dep checks so Ctrl-C is handled everywhere
- --loop-interval and --freshness-requirement validated to >= 1
- freshness_requirement falsy-or replaced with is not None check
- all host inputs normalised to FQDNs at resolution time (argv, stdin, auto)
- per-host loop body wrapped in try/except to route unexpected errors to fail_hosts
- last_failed reset at top of each loop iteration for correct exit code in auto mode
- rglob pattern tightened to match only script-generated filenames
- fmt() includes seconds and uses " UTC" suffix instead of stripping tz and appending Z
- voice-hours help string corrected to reflect half-open interval
- docstring HOST indentation aligned; RECENCY_MINUTES referenced by name
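
For reviewers unfamiliar with the write-tmp + replace pattern mentioned for state.json, here is a minimal sketch of the idea, paired with a loader that warns on parse errors. The function bodies and the `.tmp` suffix are assumptions for illustration and are not copied from the script; `save_state` is a hypothetical name (the list above only names `load_state`).

```python
import json
import logging
import os
from pathlib import Path


def save_state(path: Path, state: dict) -> None:
    """Write state as JSON so readers never observe a half-written file."""
    tmp = path.with_name(path.name + ".tmp")  # same directory, so same filesystem
    with open(tmp, "w", encoding="utf-8") as fh:
        json.dump(state, fh, indent=2)
        fh.flush()
        os.fsync(fh.fileno())  # push the bytes to disk before the rename
    os.replace(tmp, path)  # atomic swap of the old file for the new one


def load_state(path: Path) -> dict:
    """Load state, returning an empty dict if the file is missing or corrupt."""
    try:
        return json.loads(path.read_text(encoding="utf-8"))
    except FileNotFoundError:
        return {}
    except json.JSONDecodeError as exc:
        # Log instead of silently swallowing the error.
        logging.warning("could not parse %s: %s", path, exc)
        return {}
```

If the process dies mid-write, the worst case is a stale `.tmp` file next to an intact state.json.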
@aerickson aerickson changed the title from "042926 moonshot medic" to "bin/collect_moonshot_hang_reports.py: initial implementation" Apr 29, 2026
@aerickson aerickson changed the title from "bin/collect_moonshot_hang_reports.py: initial implementation" to "moonshot medic: tool for resetting and collecting data on moonshots" Apr 29, 2026
@aerickson

Code moved to mozilla-platform-ops/relops-infra#46.

@aerickson aerickson closed this May 1, 2026