Skip to content

Conversation

@lilypan26
Copy link
Contributor

What this PR does / why we need it:
Adds validation against critical kernel errors and aims to catch issues like these: https://supportability.visualstudio.com/AzureContainers/_workitems/edit/163429

Which issue(s) this PR fixes:

Fixes #

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a new E2E validation step to scan kernel logs for critical errors (panic/lockup/OOM/I/O), intended to catch severe node issues during validation runs.

Changes:

  • Added ValidateKernelLogs validator that greps dmesg for critical kernel error patterns and fails the scenario if matches are found.
  • Wired the new validator into ValidateCommonLinux so it runs as part of default Linux validations.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

File Description
e2e/validators.go Introduces ValidateKernelLogs with category-based regex matching and failure reporting/logging.
e2e/validation.go Adds ValidateKernelLogs to the common Linux validation sequence.

Comment on lines +1779 to +1784
"set -e",
fmt.Sprintf("sudo dmesg | grep -iE '%s' | grep -ivE '%s' || true", cp.pattern, cp.exclude),
}
} else {
command = []string{
"set -e",
Copy link

Copilot AI Feb 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The sudo dmesg | grep ... || true pipeline combined with only set -e can mask failures retrieving dmesg output (and pipeline failures in general), causing false negatives. Consider enabling pipefail and structuring the script so a sudo dmesg failure fails the validator, while still allowing grep “no matches” to succeed.

Suggested change
"set -e",
fmt.Sprintf("sudo dmesg | grep -iE '%s' | grep -ivE '%s' || true", cp.pattern, cp.exclude),
}
} else {
command = []string{
"set -e",
"set -euo pipefail",
// First, verify that sudo dmesg itself works; if this fails, the script will exit due to set -e.
"sudo dmesg >/dev/null",
// Then, search for the pattern while excluding known-benign messages. Allow no matches without failing.
fmt.Sprintf("sudo dmesg | grep -iE '%s' | grep -ivE '%s' || true", cp.pattern, cp.exclude),
}
} else {
command = []string{
"set -euo pipefail",
// First, verify that sudo dmesg itself works; if this fails, the script will exit due to set -e.
"sudo dmesg >/dev/null",
// Then, search for the pattern. Allow no matches without failing.

Copilot uses AI. Check for mistakes.
"IO/FS": {
pattern: `I/O error|read-only file system|EXT[2-4]-fs error|XFS (ERROR|corruption)|BTRFS (error|warning)|nvme .* (timeout|reset)|ata[0-9].*(failed|error|reset)|scsi.*(error|failed)`,
// sr0 is the virtual CD-ROM drive on Azure VMs. This error occurs when the VM tries to read from an empty virtual optical drive, which is normal and expected.
exclude: `sr[0-9]`,
Copy link

Copilot AI Feb 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment says only sr0 (virtual CD-ROM) errors are excluded, but the exclude regex is sr[0-9], which will filter sr1, sr2, etc. Either tighten the exclude pattern to sr0 or update the comment to reflect that all srX devices are excluded.

Suggested change
exclude: `sr[0-9]`,
exclude: `sr0`,

Copilot uses AI. Check for mistakes.
Comment on lines +1799 to +1800
fullDmesgResult := execScriptOnVMForScenarioValidateExitCode(ctx, s, "sudo dmesg", 0, "failed to retrieve full kernel logs")
s.T.Logf("=== FULL KERNEL LOG DUMP (dmesg) ===\n%s\n=== END KERNEL LOG DUMP ===", fullDmesgResult.stdout)
Copy link

Copilot AI Feb 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On failure this logs the entire dmesg output, which can be very large and may get truncated in CI logs, making the failure summary harder to find. Consider limiting the dump (e.g., tailing the last N lines, or dumping only around the matched lines) while keeping the category summaries for quick triage.

Copilot uses AI. Check for mistakes.
Comment on lines +1804 to +1807
summary.WriteString("Critical kernel issues detected:\n")
for category, issues := range issuesFound {
summary.WriteString(fmt.Sprintf("\n[%s]:\n%s\n", category, issues))
}
Copy link

Copilot AI Feb 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issuesFound is a map, so the failure summary order will be nondeterministic across runs. For easier comparison/debugging, consider emitting categories in a stable order (e.g., iterate over a fixed slice of category names).

Copilot uses AI. Check for mistakes.
exclude string // optional pattern to exclude false positives
}
patterns := map[string]categoryPattern{
"PANIC/CRASH": {pattern: `(kernel: )?(panic[: -]|oops|call trace|backtrace|general protection fault|BUG:|RIP:)`},
Copy link

Copilot AI Feb 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The panic[: -] regex in the PANIC/CRASH pattern won’t match panic:/panic as intended because [: -] isn’t a character class here (missing []). As written, this is likely to miss real kernel panic lines and defeats the purpose of this validator. Please update the regex to correctly require panic followed by :/space/- (or use an equivalent alternation) while still avoiding matches like panic=-1.

Suggested change
"PANIC/CRASH": {pattern: `(kernel: )?(panic[: -]|oops|call trace|backtrace|general protection fault|BUG:|RIP:)`},
"PANIC/CRASH": {pattern: `(kernel: )?(panic[ :\-]|oops|call trace|backtrace|general protection fault|BUG:|RIP:)`},

Copilot uses AI. Check for mistakes.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lilypan26 might me work confirming

if len(issuesFound) > 0 {
// Get full kernel log dump
fullDmesgResult := execScriptOnVMForScenarioValidateExitCode(ctx, s, "sudo dmesg", 0, "failed to retrieve full kernel logs")
s.T.Logf("=== FULL KERNEL LOG DUMP (dmesg) ===\n%s\n=== END KERNEL LOG DUMP ===", fullDmesgResult.stdout)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how big is the dump? if it's huge it might be better to just pipe this out to a log file bundled in scenario-logs

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it should be smallish since it based on a recent boot. size should be constant

for category, issues := range issuesFound {
summary.WriteString(fmt.Sprintf("\n[%s]:\n%s\n", category, issues))
}
s.T.Fatalf("%s", summary.String())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same thing here, depending on how large might make more sense to just pipe to output file instead of cluttering test run output

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess it's a question of ease of access, do we want to have to do multiple clicks to get to the full logs ?

maybe printing snippets around the errors (-B 10, -A 10) and doing the full dump as a pipeline artifact makes sense.

I think we also need to understand how often those will occur.

exclude string // optional pattern to exclude false positives
}
patterns := map[string]categoryPattern{
"PANIC/CRASH": {pattern: `(kernel: )?(panic[: -]|oops|call trace|backtrace|general protection fault|BUG:|RIP:)`},
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lilypan26 might me work confirming

if cp.exclude != "" {
command = []string{
"set -e",
fmt.Sprintf("sudo dmesg | grep -iE '%s' | grep -ivE '%s' || true", cp.pattern, cp.exclude),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need to call sudo :) knowing we have to remove all usage of sudo, unless this is because we are running as packer user ?

if len(issuesFound) > 0 {
// Get full kernel log dump
fullDmesgResult := execScriptOnVMForScenarioValidateExitCode(ctx, s, "sudo dmesg", 0, "failed to retrieve full kernel logs")
s.T.Logf("=== FULL KERNEL LOG DUMP (dmesg) ===\n%s\n=== END KERNEL LOG DUMP ===", fullDmesgResult.stdout)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it should be smallish since it based on a recent boot. size should be constant

for category, issues := range issuesFound {
summary.WriteString(fmt.Sprintf("\n[%s]:\n%s\n", category, issues))
}
s.T.Fatalf("%s", summary.String())
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess it's a question of ease of access, do we want to have to do multiple clicks to get to the full logs ?

maybe printing snippets around the errors (-B 10, -A 10) and doing the full dump as a pipeline artifact makes sense.

I think we also need to understand how often those will occur.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants