moonshot medic: tool for resetting and collecting data on moonshots #1193
Closed
aerickson wants to merge 55 commits into
- capture() in auto loop now uses check=False to survive transient fleetroll failures
- record_reset_failure no longer double-counts total_resets
- collect_host writes to a .tmp file and renames on success; failed runs leave no .md
- reset-failure hosts now appear in the FAIL: summary line
- worker_fqdn 616-630 band documented with a comment explaining the DC layout exception
- state.json writes are atomic (write-tmp + replace)
- load_state logs a warning instead of silently swallowing parse errors
- _log_fh wrapped in try/finally to ensure it closes on exceptions and sys.exit
- signal handler installed before dep checks so Ctrl-C is handled everywhere
- --loop-interval and --freshness-requirement validated to >= 1
- freshness_requirement falsy-or replaced with is not None check
- all host inputs normalised to FQDNs at resolution time (argv, stdin, auto)
- per-host loop body wrapped in try/except to route unexpected errors to fail_hosts
- last_failed reset at top of each loop iteration for correct exit code in auto mode
- rglob pattern tightened to match only script-generated filenames
- fmt() includes seconds and uses " UTC" suffix instead of stripping tz and appending Z
- voice-hours help string corrected to reflect half-open interval
- docstring HOST indentation aligned; RECENCY_MINUTES referenced by name
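The "write-tmp + replace" pattern mentioned for state.json can be illustrated with a minimal sketch. This is not the PR's code: the function name `write_state_atomic` and the JSON formatting options are assumptions made for the example. Only the general technique (write a temporary file in the same directory, then swap it into place) comes from the commit message.

```python
import json
import os
import tempfile
from pathlib import Path


def write_state_atomic(path: Path, state: dict) -> None:
    """Hypothetical helper: write JSON to a temp file, then replace the target."""
    path = Path(path)
    # Create the temp file in the same directory so the final replace stays
    # on one filesystem, where os.replace is atomic.
    fd, tmp_name = tempfile.mkstemp(
        dir=path.parent, prefix=path.name + ".", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(state, fh, indent=2, sort_keys=True)
            fh.flush()
            os.fsync(fh.fileno())  # push the data to disk before the swap
        os.replace(tmp_name, path)  # readers see the old file or the new one
    except BaseException:
        # On any failure (including Ctrl-C), leave no stray temp file behind.
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise
```

With this shape, a crash in the middle of a write leaves the previous state.json intact, which is presumably the failure mode the change was guarding against.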
… before first run
Code moved to mozilla-platform-ops/relops-infra#46.
Summary
New script to automate hang diagnostics on t-linux64-ms-* Moonshot cartridges. Intended to run continuously via `--auto --confirm --loop-interval N`.
- … `reset_moonshot.py`), SSHes in to run `moonshot_hang_report.py`, and appends a fleetroll host-audit
- … `state.json`; automatically skips hosts with 3+ consecutive reset failures for 6h and flags them for human attention
- … `--stale-threshold` + `--min-fresh-pct 65`) before each run; skips the iteration (rather than exiting) if data is stale
- … `OVERVIEW.md` and `OVERVIEW.html` on startup and after each batch, including daily activity counts and a skipped-host attention list
- … immediately on second press); spoken announcements via `say` during work hours

Test plan
- … `--auto --confirm` against a small set of known-bad hosts; verify all batches process, OVERVIEW files update, and results land in the right dir
- … immediately on second …
- … and appears in the attention section
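
The skip rule from the Summary (hosts with 3+ consecutive reset failures are skipped for 6h) might look roughly like the sketch below. The entry field names `consecutive_reset_failures` and `last_reset_failure`, and the function name `should_skip`, are hypothetical; the actual layout of state.json is not shown on this page.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

MAX_CONSECUTIVE_FAILURES = 3      # "3+ consecutive reset failures"
SKIP_WINDOW = timedelta(hours=6)  # "for 6h"


def should_skip(entry: dict, now: Optional[datetime] = None) -> bool:
    """Return True if a host's state entry puts it inside the skip window.

    `entry` is an assumed per-host record with hypothetical keys:
      consecutive_reset_failures: int
      last_reset_failure: ISO 8601 timestamp with a UTC offset, or None
    """
    if now is None:
        now = datetime.now(timezone.utc)
    if entry.get("consecutive_reset_failures", 0) < MAX_CONSECUTIVE_FAILURES:
        return False
    last = entry.get("last_reset_failure")
    if last is None:
        return False
    # Skip only while the most recent failure is younger than the window.
    return now - datetime.fromisoformat(last) < SKIP_WINDOW


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
flaky = {
    "consecutive_reset_failures": 3,
    "last_reset_failure": "2024-01-01T09:00:00+00:00",
}
print(should_skip(flaky, now))                       # True: failed 3h ago
print(should_skip(flaky, now + timedelta(hours=4)))  # False: window elapsed
```

Passing `now` explicitly keeps the rule easy to test without touching the system clock.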