-
Notifications
You must be signed in to change notification settings - Fork 247
chore: add kernel log validation to abe2es #7835
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Adds a new E2E validation step to scan kernel logs for critical errors (panic/lockup/OOM/I/O), intended to catch severe node issues during validation runs.
Changes:
- Added
ValidateKernelLogsvalidator that grepsdmesgfor critical kernel error patterns and fails the scenario if matches are found. - Wired the new validator into
ValidateCommonLinuxso it runs as part of default Linux validations.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| e2e/validators.go | Introduces ValidateKernelLogs with category-based regex matching and failure reporting/logging. |
| e2e/validation.go | Adds ValidateKernelLogs to the common Linux validation sequence. |
| "set -e", | ||
| fmt.Sprintf("sudo dmesg | grep -iE '%s' | grep -ivE '%s' || true", cp.pattern, cp.exclude), | ||
| } | ||
| } else { | ||
| command = []string{ | ||
| "set -e", |
Copilot
AI
Feb 9, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The sudo dmesg | grep ... || true pipeline combined with only set -e can mask failures retrieving dmesg output (and pipeline failures in general), causing false negatives. Consider enabling pipefail and structuring the script so a sudo dmesg failure fails the validator, while still allowing grep “no matches” to succeed.
| "set -e", | |
| fmt.Sprintf("sudo dmesg | grep -iE '%s' | grep -ivE '%s' || true", cp.pattern, cp.exclude), | |
| } | |
| } else { | |
| command = []string{ | |
| "set -e", | |
| "set -euo pipefail", | |
| // First, verify that sudo dmesg itself works; if this fails, the script will exit due to set -e. | |
| "sudo dmesg >/dev/null", | |
| // Then, search for the pattern while excluding known-benign messages. Allow no matches without failing. | |
| fmt.Sprintf("sudo dmesg | grep -iE '%s' | grep -ivE '%s' || true", cp.pattern, cp.exclude), | |
| } | |
| } else { | |
| command = []string{ | |
| "set -euo pipefail", | |
| // First, verify that sudo dmesg itself works; if this fails, the script will exit due to set -e. | |
| "sudo dmesg >/dev/null", | |
| // Then, search for the pattern. Allow no matches without failing. |
| "IO/FS": { | ||
| pattern: `I/O error|read-only file system|EXT[2-4]-fs error|XFS (ERROR|corruption)|BTRFS (error|warning)|nvme .* (timeout|reset)|ata[0-9].*(failed|error|reset)|scsi.*(error|failed)`, | ||
| // sr0 is the virtual CD-ROM drive on Azure VMs. This error occurs when the VM tries to read from an empty virtual optical drive, which is normal and expected. | ||
| exclude: `sr[0-9]`, |
Copilot
AI
Feb 9, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment says only sr0 (virtual CD-ROM) errors are excluded, but the exclude regex is sr[0-9], which will filter sr1, sr2, etc. Either tighten the exclude pattern to sr0 or update the comment to reflect that all srX devices are excluded.
| exclude: `sr[0-9]`, | |
| exclude: `sr0`, |
| fullDmesgResult := execScriptOnVMForScenarioValidateExitCode(ctx, s, "sudo dmesg", 0, "failed to retrieve full kernel logs") | ||
| s.T.Logf("=== FULL KERNEL LOG DUMP (dmesg) ===\n%s\n=== END KERNEL LOG DUMP ===", fullDmesgResult.stdout) |
Copilot
AI
Feb 9, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On failure this logs the entire dmesg output, which can be very large and may get truncated in CI logs, making the failure summary harder to find. Consider limiting the dump (e.g., tailing the last N lines, or dumping only around the matched lines) while keeping the category summaries for quick triage.
| summary.WriteString("Critical kernel issues detected:\n") | ||
| for category, issues := range issuesFound { | ||
| summary.WriteString(fmt.Sprintf("\n[%s]:\n%s\n", category, issues)) | ||
| } |
Copilot
AI
Feb 9, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
issuesFound is a map, so the failure summary order will be nondeterministic across runs. For easier comparison/debugging, consider emitting categories in a stable order (e.g., iterate over a fixed slice of category names).
| exclude string // optional pattern to exclude false positives | ||
| } | ||
| patterns := map[string]categoryPattern{ | ||
| "PANIC/CRASH": {pattern: `(kernel: )?(panic[: -]|oops|call trace|backtrace|general protection fault|BUG:|RIP:)`}, |
Copilot
AI
Feb 9, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The panic[: -] regex in the PANIC/CRASH pattern won’t match panic:/panic as intended because [: -] isn’t a character class here (missing []). As written, this is likely to miss real kernel panic lines and defeats the purpose of this validator. Please update the regex to correctly require panic followed by :/space/- (or use an equivalent alternation) while still avoiding matches like panic=-1.
| "PANIC/CRASH": {pattern: `(kernel: )?(panic[: -]|oops|call trace|backtrace|general protection fault|BUG:|RIP:)`}, | |
| "PANIC/CRASH": {pattern: `(kernel: )?(panic[ :\-]|oops|call trace|backtrace|general protection fault|BUG:|RIP:)`}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lilypan26 might me work confirming
…add-kernel-warning-e2e-validator
| if len(issuesFound) > 0 { | ||
| // Get full kernel log dump | ||
| fullDmesgResult := execScriptOnVMForScenarioValidateExitCode(ctx, s, "sudo dmesg", 0, "failed to retrieve full kernel logs") | ||
| s.T.Logf("=== FULL KERNEL LOG DUMP (dmesg) ===\n%s\n=== END KERNEL LOG DUMP ===", fullDmesgResult.stdout) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how big is the dump? if it's huge it might be better to just pipe this out to a log file bundled in scenario-logs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it should be smallish since it based on a recent boot. size should be constant
| for category, issues := range issuesFound { | ||
| summary.WriteString(fmt.Sprintf("\n[%s]:\n%s\n", category, issues)) | ||
| } | ||
| s.T.Fatalf("%s", summary.String()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same thing here, depending on how large might make more sense to just pipe to output file instead of cluttering test run output
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess it's a question of ease of access, do we want to have to do multiple clicks to get to the full logs ?
maybe printing snippets around the errors (-B 10, -A 10) and doing the full dump as a pipeline artifact makes sense.
I think we also need to understand how often those will occur.
| exclude string // optional pattern to exclude false positives | ||
| } | ||
| patterns := map[string]categoryPattern{ | ||
| "PANIC/CRASH": {pattern: `(kernel: )?(panic[: -]|oops|call trace|backtrace|general protection fault|BUG:|RIP:)`}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lilypan26 might me work confirming
| if cp.exclude != "" { | ||
| command = []string{ | ||
| "set -e", | ||
| fmt.Sprintf("sudo dmesg | grep -iE '%s' | grep -ivE '%s' || true", cp.pattern, cp.exclude), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need to call sudo :) knowing we have to remove all usage of sudo, unless this is because we are running as packer user ?
| if len(issuesFound) > 0 { | ||
| // Get full kernel log dump | ||
| fullDmesgResult := execScriptOnVMForScenarioValidateExitCode(ctx, s, "sudo dmesg", 0, "failed to retrieve full kernel logs") | ||
| s.T.Logf("=== FULL KERNEL LOG DUMP (dmesg) ===\n%s\n=== END KERNEL LOG DUMP ===", fullDmesgResult.stdout) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it should be smallish since it based on a recent boot. size should be constant
| for category, issues := range issuesFound { | ||
| summary.WriteString(fmt.Sprintf("\n[%s]:\n%s\n", category, issues)) | ||
| } | ||
| s.T.Fatalf("%s", summary.String()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess it's a question of ease of access, do we want to have to do multiple clicks to get to the full logs ?
maybe printing snippets around the errors (-B 10, -A 10) and doing the full dump as a pipeline artifact makes sense.
I think we also need to understand how often those will occur.
What this PR does / why we need it:
Adds validation against critical kernel errors and aims to catch issues like these: https://supportability.visualstudio.com/AzureContainers/_workitems/edit/163429
Which issue(s) this PR fixes:
Fixes #