chore: add kernel log validation to abe2es #7835

lilypan26 · 2026-02-09T22:00:03Z

What this PR does / why we need it:
Adds validation against critical kernel errors and aims to catch issues like these: https://supportability.visualstudio.com/AzureContainers/_workitems/edit/163429

Which issue(s) this PR fixes:

Fixes #

Copilot

Pull request overview

Adds a new E2E validation step to scan kernel logs for critical errors (panic/lockup/OOM/I/O), intended to catch severe node issues during validation runs.

Changes:

- Added `ValidateKernelLogs` validator that greps `dmesg` for critical kernel error patterns and fails the scenario if matches are found.
- Wired the new validator into `ValidateCommonLinux` so it runs as part of default Linux validations.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

File	Description
e2e/validators.go	Introduces `ValidateKernelLogs` with category-based regex matching and failure reporting/logging.
e2e/validation.go	Adds `ValidateKernelLogs` to the common Linux validation sequence.

Copilot · 2026-02-09T22:04:57Z

e2e/validators.go

+				"set -e",
+				fmt.Sprintf("sudo dmesg | grep -iE '%s' | grep -ivE '%s' || true", cp.pattern, cp.exclude),
+			}
+		} else {
+			command = []string{
+				"set -e",


The `sudo dmesg | grep ... || true` pipeline combined with only `set -e` can mask failures retrieving `dmesg` output (and pipeline failures in general), causing false negatives. Consider enabling `pipefail` and structuring the script so a `sudo dmesg` failure fails the validator, while still allowing grep “no matches” to succeed.

Suggested change

-				"set -e",
-				fmt.Sprintf("sudo dmesg | grep -iE '%s' | grep -ivE '%s' || true", cp.pattern, cp.exclude),
-			}
-		} else {
-			command = []string{
-				"set -e",
+				"set -euo pipefail",
+				// First, verify that sudo dmesg itself works; if this fails, the script will exit due to set -e.
+				"sudo dmesg >/dev/null",
+				// Then, search for the pattern while excluding known-benign messages. Allow no matches without failing.
+				fmt.Sprintf("sudo dmesg | grep -iE '%s' | grep -ivE '%s' || true", cp.pattern, cp.exclude),
+			}
+		} else {
+			command = []string{
+				"set -euo pipefail",
+				// First, verify that sudo dmesg itself works; if this fails, the script will exit due to set -e.
+				"sudo dmesg >/dev/null",
+				// Then, search for the pattern. Allow no matches without failing.
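
A possible variant of the suggestion above, sketched under assumptions: instead of calling `sudo dmesg` twice, the script could capture the output once into a shell variable, so that a retrieval failure aborts under `set -e`, and apply `|| true` only to the grep stage. `buildKernelLogScript` is a hypothetical helper, not a function from the PR; the sketch assumes the script runs under bash and that the patterns contain no single quotes.

```go
package main

import (
	"fmt"
	"strings"
)

// buildKernelLogScript is a hypothetical helper, not a function from the PR.
// It returns a bash script that fails if dmesg cannot be read but succeeds
// when grep simply finds no matching lines.
func buildKernelLogScript(pattern, exclude string) string {
	filter := fmt.Sprintf("grep -iE '%s'", pattern)
	if exclude != "" {
		filter += fmt.Sprintf(" | grep -ivE '%s'", exclude)
	}
	lines := []string{
		"set -euo pipefail",
		// Under set -e, a failing command substitution in a plain
		// assignment aborts the script, so a dmesg failure is not masked.
		`logs="$(sudo dmesg)"`,
		// The braces limit "|| true" to the grep stage. Note that this
		// also hides grep's own errors (exit status 2), not only "no match".
		fmt.Sprintf(`printf '%%s\n' "$logs" | { %s || true; }`, filter),
	}
	return strings.Join(lines, "\n")
}

func main() {
	fmt.Println(buildKernelLogScript("I/O error|read-only file system", "sr0"))
}
```

The trade-off is that the whole log is held in a shell variable, which should be acceptable if the log is as small as the discussion in this PR expects.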

Copilot · 2026-02-09T22:04:57Z

e2e/validators.go

+		"IO/FS": {
+			pattern: `I/O error|read-only file system|EXT[2-4]-fs error|XFS (ERROR|corruption)|BTRFS (error|warning)|nvme .* (timeout|reset)|ata[0-9].*(failed|error|reset)|scsi.*(error|failed)`,
+			// sr0 is the virtual CD-ROM drive on Azure VMs. This error occurs when the VM tries to read from an empty virtual optical drive, which is normal and expected.
+			exclude: `sr[0-9]`,


The comment says only `sr0` (virtual CD-ROM) errors are excluded, but the exclude regex is `sr[0-9]`, which will filter `sr1`, `sr2`, etc. Either tighten the exclude pattern to `sr0` or update the comment to reflect that all `srX` devices are excluded.

Suggested change

-			exclude: `sr[0-9]`,
+			exclude: `sr0`,
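
The difference between the two exclude patterns can be illustrated with a small sketch. It uses Go's `regexp` package as a stand-in for `grep -ivE` (an assumption; the two engines agree on patterns this simple), `excluded` is a hypothetical helper, and the sample lines are made up for illustration rather than taken from real logs.

```go
package main

import (
	"fmt"
	"regexp"
)

// excluded reports whether a dmesg line would be dropped by the exclude
// pattern, approximating grep -ivE (case-insensitive, unanchored match).
// It is a hypothetical helper, not a function from the PR.
func excluded(exclude, line string) bool {
	return regexp.MustCompile("(?i)" + exclude).MatchString(line)
}

func main() {
	// Illustrative lines only; not copied from a real kernel log.
	lines := []string{
		"blk_update_request: I/O error, dev sr0, sector 0",
		"blk_update_request: I/O error, dev sr1, sector 0",
		"blk_update_request: I/O error, dev sda, sector 2048",
	}
	for _, pattern := range []string{`sr[0-9]`, `sr0`} {
		for _, line := range lines {
			fmt.Printf("exclude=%-8s dropped=%-5t %s\n", pattern, excluded(pattern, line), line)
		}
	}
}
```

With `sr[0-9]` an I/O error on `sr1` is silently dropped as well, which is the behavior the comment above calls out.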

Copilot · 2026-02-09T22:04:58Z

e2e/validators.go

+		fullDmesgResult := execScriptOnVMForScenarioValidateExitCode(ctx, s, "sudo dmesg", 0, "failed to retrieve full kernel logs")
+		s.T.Logf("=== FULL KERNEL LOG DUMP (dmesg) ===\n%s\n=== END KERNEL LOG DUMP ===", fullDmesgResult.stdout)


On failure this logs the entire dmesg output, which can be very large and may get truncated in CI logs, making the failure summary harder to find. Consider limiting the dump (e.g., tailing the last N lines, or dumping only around the matched lines) while keeping the category summaries for quick triage.
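
One way to bound the dump, sketched under assumptions: keep only the last N lines of the captured output before logging it. `tailLines` is a hypothetical helper, not a function from the PR, and the sample log lines are illustrative.

```go
package main

import (
	"fmt"
	"strings"
)

// tailLines returns at most the last n lines of s, ignoring trailing
// newlines. It is a hypothetical helper, not a function from the PR.
func tailLines(s string, n int) string {
	if n <= 0 {
		return ""
	}
	lines := strings.Split(strings.TrimRight(s, "\n"), "\n")
	if len(lines) > n {
		lines = lines[len(lines)-n:]
	}
	return strings.Join(lines, "\n")
}

func main() {
	// Illustrative input only; in the validator this would be the dmesg stdout.
	dmesg := "line 1\nline 2\nline 3\nline 4\n"
	fmt.Println(tailLines(dmesg, 2))
}
```

In the validator this could wrap `fullDmesgResult.stdout` before the `s.T.Logf` call; the value of N is a judgment call that the PR does not settle.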

Copilot · 2026-02-09T22:04:58Z

e2e/validators.go

+		summary.WriteString("Critical kernel issues detected:\n")
+		for category, issues := range issuesFound {
+			summary.WriteString(fmt.Sprintf("\n[%s]:\n%s\n", category, issues))
+		}


`issuesFound` is a map, so the failure summary order will be nondeterministic across runs. For easier comparison/debugging, consider emitting categories in a stable order (e.g., iterate over a fixed slice of category names).
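
A minimal sketch of a stable ordering, assuming `issuesFound` is a `map[string]string` (its type is not shown in the diff): collect the keys, sort them, and iterate over the sorted slice. `formatSummary` is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// formatSummary builds the failure summary with categories in sorted order.
// It is a hypothetical helper; the map type is an assumption.
func formatSummary(issuesFound map[string]string) string {
	categories := make([]string, 0, len(issuesFound))
	for category := range issuesFound {
		categories = append(categories, category)
	}
	sort.Strings(categories) // map iteration order is random, so sort the keys

	var summary strings.Builder
	summary.WriteString("Critical kernel issues detected:\n")
	for _, category := range categories {
		fmt.Fprintf(&summary, "\n[%s]:\n%s\n", category, issuesFound[category])
	}
	return summary.String()
}

func main() {
	// Illustrative contents only.
	fmt.Print(formatSummary(map[string]string{
		"PANIC/CRASH": "example panic line",
		"IO/FS":       "example I/O line",
	}))
}
```

Sorting alphabetically is only one option; a fixed slice of category names, as the comment suggests, would additionally let the most severe category come first.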

Copilot · 2026-02-09T22:04:58Z

e2e/validators.go

+		exclude string // optional pattern to exclude false positives
+	}
+	patterns := map[string]categoryPattern{
+		"PANIC/CRASH":  {pattern: `(kernel: )?(panic[: -]|oops|call trace|backtrace|general protection fault|BUG:|RIP:)`},


The panic[: -] regex in the PANIC/CRASH pattern won’t match panic:/panic as intended because [: -] isn’t a character class here (missing []). As written, this is likely to miss real kernel panic lines and defeats the purpose of this validator. Please update the regex to correctly require panic followed by :/space/- (or use an equivalent alternation) while still avoiding matches like panic=-1.

Suggested change

-		"PANIC/CRASH":  {pattern: `(kernel: )?(panic[: -]|oops|call trace|backtrace|general protection fault|BUG:|RIP:)`},
+		"PANIC/CRASH":  {pattern: `(kernel: )?(panic[ :\-]|oops|call trace|backtrace|general protection fault|BUG:|RIP:)`},

cameronmeissner · 2026-02-10T03:09:45Z

e2e/validators.go

+	if len(issuesFound) > 0 {
+		// Get full kernel log dump
+		fullDmesgResult := execScriptOnVMForScenarioValidateExitCode(ctx, s, "sudo dmesg", 0, "failed to retrieve full kernel logs")
+		s.T.Logf("=== FULL KERNEL LOG DUMP (dmesg) ===\n%s\n=== END KERNEL LOG DUMP ===", fullDmesgResult.stdout)


how big is the dump? if it's huge it might be better to just pipe this out to a log file bundled in scenario-logs

cameronmeissner · 2026-02-10T03:10:24Z

e2e/validators.go

+		for category, issues := range issuesFound {
+			summary.WriteString(fmt.Sprintf("\n[%s]:\n%s\n", category, issues))
+		}
+		s.T.Fatalf("%s", summary.String())


same thing here, depending on how large might make more sense to just pipe to output file instead of cluttering test run output

djsly · 2026-02-10T16:03:32Z

e2e/validators.go

+		exclude string // optional pattern to exclude false positives
+	}
+	patterns := map[string]categoryPattern{
+		"PANIC/CRASH":  {pattern: `(kernel: )?(panic[: -]|oops|call trace|backtrace|general protection fault|BUG:|RIP:)`},


@lilypan26 might be worth confirming
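
A quick way to check this, sketched with Go's `regexp` package rather than `grep -iE` (an assumption: both treat `[: -]` and `[ :\-]` as bracket expressions, but the engines are not identical). The two pattern strings are the original and the suggested one from this thread; `matches` is a hypothetical helper and the sample lines are illustrative.

```go
package main

import (
	"fmt"
	"regexp"
)

const (
	original  = `(kernel: )?(panic[: -]|oops|call trace|backtrace|general protection fault|BUG:|RIP:)`
	suggested = `(kernel: )?(panic[ :\-]|oops|call trace|backtrace|general protection fault|BUG:|RIP:)`
)

// matches approximates grep -iE: case-insensitive, unanchored.
// It is a hypothetical helper, not a function from the PR.
func matches(pattern, line string) bool {
	return regexp.MustCompile("(?i)" + pattern).MatchString(line)
}

func main() {
	// Illustrative lines only; not copied from a real kernel log.
	samples := []string{
		"Kernel panic - not syncing: Fatal exception",
		"panic: illustrative message",
		"Kernel command line: ro panic=-1",
	}
	for _, line := range samples {
		fmt.Printf("original=%-5t suggested=%-5t %s\n",
			matches(original, line), matches(suggested, line), line)
	}
}
```

Under Go's engine both patterns agree on these samples: the two panic lines match and the `panic=-1` command-line entry does not. Running the same lines through `grep -iE` on a test VM would confirm whether that also holds for the validator itself.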

djsly · 2026-02-10T16:04:54Z

e2e/validators.go

+		if cp.exclude != "" {
+			command = []string{
+				"set -e",
+				fmt.Sprintf("sudo dmesg | grep -iE '%s' | grep -ivE '%s' || true", cp.pattern, cp.exclude),


do we need to call sudo :) knowing we have to remove all usage of sudo, unless this is because we are running as packer user ?

djsly · 2026-02-10T16:05:54Z

e2e/validators.go

+	if len(issuesFound) > 0 {
+		// Get full kernel log dump
+		fullDmesgResult := execScriptOnVMForScenarioValidateExitCode(ctx, s, "sudo dmesg", 0, "failed to retrieve full kernel logs")
+		s.T.Logf("=== FULL KERNEL LOG DUMP (dmesg) ===\n%s\n=== END KERNEL LOG DUMP ===", fullDmesgResult.stdout)


it should be smallish since it is based on a recent boot. size should be constant

djsly · 2026-02-10T16:08:40Z

e2e/validators.go

+		for category, issues := range issuesFound {
+			summary.WriteString(fmt.Sprintf("\n[%s]:\n%s\n", category, issues))
+		}
+		s.T.Fatalf("%s", summary.String())


I guess it's a question of ease of access, do we want to have to do multiple clicks to get to the full logs ?

maybe printing snippets around the errors (-B 10, -A 10) and doing the full dump as a pipeline artifact makes sense.

I think we also need to understand how often those will occur.
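
The snippet idea above (`-B 10`, `-A 10`) could also be done on the Go side once the log has been fetched. The following is a sketch under assumptions: `snippetsAround` is a hypothetical helper, `before` and `after` are assumed to be non-negative, and the sample log is illustrative.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// snippetsAround is a hypothetical helper, not a function from the PR.
// It returns every line of log matching pattern together with up to
// `before` preceding and `after` following lines, similar to grep -B/-A.
// Overlapping windows are merged and gaps are marked with "--".
// regexp.MustCompile panics on an invalid pattern, which is acceptable
// here because the patterns are fixed in the source.
func snippetsAround(log, pattern string, before, after int) string {
	re := regexp.MustCompile(pattern)
	lines := strings.Split(strings.TrimRight(log, "\n"), "\n")
	var out []string
	next := 0 // index of the first line not yet emitted
	for i, line := range lines {
		if !re.MatchString(line) {
			continue
		}
		start := i - before
		if start < next {
			start = next // never re-emit lines from the previous window
		}
		end := i + after + 1
		if end > len(lines) {
			end = len(lines)
		}
		if start > next && len(out) > 0 {
			out = append(out, "--") // separator between non-adjacent windows
		}
		out = append(out, lines[start:end]...)
		next = end
	}
	return strings.Join(out, "\n")
}

func main() {
	// Illustrative input only.
	log := "boot\nnet up\nI/O error, dev sda\nretry\nidle\nidle\nmount\nI/O error, dev sdb\ndone\n"
	fmt.Println(snippetsAround(log, `(?i)I/O error`, 1, 1))
}
```

The full log could then go to a file in the scenario logs while only these snippets appear in the test output.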

lilypan26 added 2 commits February 9, 2026 13:55

chore: add kernel log validation to abe2es

303027d

chore: log full kernel dump

bb3c98d

Copilot AI review requested due to automatic review settings February 9, 2026 22:00

lilypan26 requested review from AbelHu, Devinwong, awesomenix, cameronmeissner, djsly, ganeshkumarashok, juan-lee, junjiezhang1997, mxj220, pdamianov-dev, phealy, r2k1, timmy-wright and zachary-bailey as code owners February 9, 2026 22:00

lilypan26 temporarily deployed to test February 9, 2026 22:00 — with GitHub Actions Inactive

Copilot started reviewing on behalf of lilypan26 February 9, 2026 22:00

Copilot AI reviewed Feb 9, 2026

Merge branch 'main' of https://github.com/Azure/AgentBaker into lily/…

282ecdf

…add-kernel-warning-e2e-validator

lilypan26 temporarily deployed to test February 9, 2026 23:28 — with GitHub Actions Inactive

cameronmeissner reviewed Feb 10, 2026

cameronmeissner approved these changes Feb 10, 2026

djsly approved these changes Feb 10, 2026

3 participants