fix(e2e): unblock Windows sysprep when VMAgentDisabler.dll load stalls #8544

Open
r2k1 wants to merge 4 commits into main from akhantimirov/fix-windows-sysprep-vmagentdisabler-flake

r2k1 commented May 20, 2026

What this PR does / why we need it:

Test_Windows2022_VHDCaching_* scenarios have been failing on the first attempt in PR check-in runs: the Sysprep RunCommand on the test VM never returns within the test's vmssCtx, and validation fails with context deadline exceeded.

Repro: on a Win2022 test VM where Sysprep /generalize normally completes in ~10s, renaming C:\Windows\system32\VMAgentDisabler.dll while leaving the SysPrepExternal\Generalize registry entry pointing at it makes sysprep stall past the entire vmssCtx budget. This is the same symptom as the CI failures.

vhdbuilder/packer/windows/sysprep.ps1 has stripped that registry entry since 2020 (PR #429) for the production VHD-bake path. The e2e CreateImage helper was added later (PR #4631) and never inherited the workaround — it invokes Sysprep directly via RunCommand. This PR brings the e2e path to parity.

This PR also migrates RunCommand from v1 (VMSSVM.BeginRunCommand) to v2 (VMSSVMRunCommands.BeginCreateOrUpdate). This is the same migration aks-rp made in PR 15721814 to avoid the v1 extension's Keyset does not exist failure on newer Windows hosts. Two call sites in validators.go are updated to match.

Verification

  • go build ./... clean (e2e module)
  • Test_Windows2022_VHDCaching_LegacyTLSBootstrap passes locally in ~9m36s, sysprep ~1m, on a VM where the prior code hung the full vmssCtx

Which issue(s) this PR fixes:
Fixes #

Windows2022 VHDCaching scenarios have been failing at the Sysprep
/generalize step in PR check-in runs since ~May 9 2026. The Sysprep
RunCommand never completes within the test's vmssCtx budget
(TestTimeoutVMSS - prepareAKSNode time, ~14m), and the validation step
fails with 'context deadline exceeded'.

Root cause: VMAgentDisabler.dll is a Sysprep provider shipped by the
Windows Azure Guest Agent. The agent self-updates from Azure fabric on
every boot, and in Jan 2026 added a WDAC catalog file install feature
(msazure ADO PR 14499782) for the DLL. The feature had bugs (hotfixes
14889344 / 14901019) and rolled out unevenly Feb-May 2026. On hosts
where the catalog install failed, Code Integrity cannot validate the
DLL and LoadLibrary stalls long enough to exhaust our test timeout.
This matches a 2020 incident (ICM 210726081) — the existing
vhdbuilder/packer/windows/sysprep.ps1 already has the same workaround
during VHD bake.

Causal proof: on a healthy Win2022 host where sysprep normally
completes in ~10s, renaming VMAgentDisabler.dll while leaving the
SysPrepExternal\Generalize registry entry intact reproduces the
stall.

Fix (e2e/test_helpers.go):
- New windowsSysprepScript that removes any SysPrepExternal\Generalize
  registry value pointing at VMAgentDisabler.dll before invoking
  Sysprep, then polls ImageState until generalization completes.
- Replaces the inline sysprep invocation in CreateImage; reads
  res.Output / res.Error instead of marshaling JSON.
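
The registry cleanup in the first bullet amounts to a filter over the values under the Generalize key. The actual windowsSysprepScript is PowerShell and is not shown in this thread, so the following is only a Go illustration of that selection step. The function name, the map representation of the key, and the case-insensitive substring match are assumptions made for the sketch.

```go
package main

import (
	"sort"
	"strings"
)

// staleGeneralizeValues returns the names of registry values whose data
// references dllName, compared case-insensitively. values maps a value name
// to its data, as it might be read from a key such as
// SysPrepExternal\Generalize. This helper is hypothetical and only mirrors
// the selection step described above.
func staleGeneralizeValues(values map[string]string, dllName string) []string {
	needle := strings.ToLower(dllName)
	var names []string
	for name, data := range values {
		if strings.Contains(strings.ToLower(data), needle) {
			names = append(names, name)
		}
	}
	// Map iteration order is random; sort so the result is deterministic.
	sort.Strings(names)
	return names
}
```

A caller would delete each returned value name before invoking Sysprep and leave every other provider entry untouched.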

Migrate RunCommand from v1 (VMSSVM.BeginRunCommand) to v2
(VMSSVMRunCommands.BeginCreateOrUpdate). v2 is the supported path
going forward and matches the migration done in aks-rp PR 15721814 to
avoid the 'Keyset does not exist' failure mode of the v1 extension on
newer Windows hosts. Two call sites in validators.go refactored to use
the new wrapper.

Verified: Test_Windows2022_VHDCaching_LegacyTLSBootstrap passes
end-to-end in ~9m36s with sysprep completing in ~1m, versus hanging
for the full vmssCtx budget on broken hosts before this change.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI left a comment

Pull request overview

This PR updates the e2e harness to mitigate a recurring Windows2022 VHDCaching flake where Sysprep /generalize can hang when the SysPrepExternal\Generalize registry points at VMAgentDisabler.dll, and modernizes the test harness to use the VMSS RunCommand v2 API surface for script execution.

Changes:

  • Introduces a VMSS RunCommand v2 wrapper that uses VirtualMachineRunCommand (v2) and fetches the instanceView for stdout/stderr.
  • Adds a Windows sysprep script that removes SysPrepExternal\Generalize entries referencing VMAgentDisabler.dll and polls ImageState until generalize completion.
  • Refactors Linux SSH-related validators to consume stdout/stderr directly from the new RunCommand wrapper instead of marshaling full JSON.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| e2e/test_helpers.go | Adds RunCommand v2 wrapper and a Windows sysprep script with registry cleanup + ImageState polling; updates CreateImage to use it. |
| e2e/validators.go | Refactors validator RunCommand call sites to use the new wrapper and parse stdout/stderr directly. |

Comments suppressed due to low confidence (1)

e2e/test_helpers.go:733

  • CreateImage only checks err from RunCommand; with RunCommand v2 the ARM operation can succeed even when the guest script fails (non-zero exit code / error output). Please fail fast here based on the runcommand instance view result (exit code/execution state), otherwise the test may proceed to capture a non-generalized disk and produce confusing downstream failures.
		if stderr != "" {
			s.T.Logf("Sysprep stderr: %s", stderr)
		}
		require.NoErrorf(s.T, err, "failed to run sysprep on Windows VM for image creation")
	}

Comment thread e2e/test_helpers.go
Comment on lines +621 to +624
// VirtualMachineRunCommand resources persist on the VM until explicitly deleted;
// use a unique name per call so concurrent / repeated calls don't collide.
runCommandName := fmt.Sprintf("e2e-runcmd-%d", time.Now().UnixNano())

Comment thread e2e/test_helpers.go
Comment on lines +651 to +655
	if getResp.Properties == nil || getResp.Properties.InstanceView == nil {
		return armcompute.VirtualMachineRunCommandInstanceView{}, errors.New("RunCommand result missing instance view")
	}
	return *getResp.Properties.InstanceView, nil
}
Previously the poll wrote a line every 10s for up to 10 min (~60 lines).
Log only when ImageState changes — typically 2-3 lines for a normal
sysprep run — to stay well under RunCommand's stdout cap and keep the
test log readable.
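
The change-only logging described here can be sketched in Go as a function that collapses a sequence of polled states into its transitions. The real poll loop lives in the PowerShell sysprep script, which is not shown in this thread; the function name and the slice-based shape below are assumptions for illustration.

```go
package main

// stateChanges returns one entry for the first observed state and one for
// every later state that differs from the previous observation. These are
// the lines a change-only logger would emit. Hypothetical helper: it only
// models the logging decision, not the polling itself.
func stateChanges(observed []string) []string {
	var logged []string
	last := ""
	for i, state := range observed {
		if i == 0 || state != last {
			logged = append(logged, state)
			last = state
		}
	}
	return logged
}
```

With one poll every 10s for 10 minutes, sixty observations that pass through three distinct states produce three log lines instead of sixty.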

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The ARM CreateOrUpdate operation reports success when the RunCommand
extension successfully runs the script, regardless of whether the script
itself succeeded. A non-zero exit, PowerShell throw, or timeout inside
the script only shows up in InstanceView.ExecutionState / ExitCode (per
https://learn.microsoft.com/en-us/azure/virtual-machines/windows/run-command-managed).

Without this check the helper returns nil err on a failed script, and
callers like CreateImage proceed to capture a non-generalized VM — the
exact silent-failure mode our sysprep poll throw was designed to catch.

Return a descriptive error including ExecutionState / ExitCode / stdout
/ stderr so require.NoError fails with actionable info.
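
A minimal sketch of such a check follows. It uses a local stand-in struct rather than the SDK's instance view type so that it compiles on its own. The struct, the function name scriptError, and the choice to treat a missing exit code as 0 are assumptions for illustration, not the helper added in this PR.

```go
package main

import "fmt"

// instanceView is a local stand-in for the run-command instance view fields
// that matter here. It is not the Azure SDK type.
type instanceView struct {
	ExecutionState *string
	ExitCode       *int32
	Output         *string
	Error          *string
}

// deref returns the pointed-to value, or the zero value for a nil pointer.
func deref[T any](p *T) T {
	var zero T
	if p == nil {
		return zero
	}
	return *p
}

// scriptError returns nil only when the script reports the "Succeeded"
// execution state with exit code 0. Any other combination, including a
// missing execution state, yields an error carrying the details needed to
// debug the failure.
func scriptError(v instanceView) error {
	state := deref(v.ExecutionState)
	code := deref(v.ExitCode)
	if state == "Succeeded" && code == 0 {
		return nil
	}
	return fmt.Errorf("RunCommand script failed: executionState=%q exitCode=%d stdout=%q stderr=%q",
		state, code, deref(v.Output), deref(v.Error))
}
```

Returning the error alongside the view, rather than instead of it, lets callers still log stdout and stderr before failing the test.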

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings May 21, 2026 00:40

Copilot AI left a comment

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

Comment thread e2e/test_helpers.go
Comment on lines 614 to 617
 	start := time.Now()
 	defer func() {
-		elapsed := time.Since(start)
-		toolkit.Logf(ctx, "Command %q took %s", command, elapsed)
 		toolkit.Logf(ctx, "Command %q took %s", command, time.Since(start))
 	}()
Comment thread e2e/test_helpers.go
Comment on lines +621 to +655
 	// VirtualMachineRunCommand resources persist on the VM until explicitly deleted;
 	// use a unique name per call so concurrent / repeated calls don't collide.
 	runCommandName := fmt.Sprintf("e2e-runcmd-%d", time.Now().UnixNano())

 	runCmd := armcompute.VirtualMachineRunCommand{
 		Location: to.Ptr(s.Location),
 		Properties: &armcompute.VirtualMachineRunCommandProperties{
 			Source: &armcompute.VirtualMachineRunCommandScriptSource{
 				Script: to.Ptr(command),
 			},
 			AsyncExecution: to.Ptr(false),
 		},
 	}

 	poller, err := config.Azure.VMSSVMRunCommands.BeginCreateOrUpdate(ctx, rg, s.Runtime.VMSSName, instanceID, runCommandName, runCmd, nil)
 	if err != nil {
-		return armcompute.RunCommandResult{}, fmt.Errorf("failed to run command on Windows VM for image creation: %w", err)
 		return armcompute.VirtualMachineRunCommandInstanceView{}, fmt.Errorf("failed to start RunCommand on VMSS VM: %w", err)
 	}
 	if _, err := poller.PollUntilDone(ctx, nil); err != nil {
 		return armcompute.VirtualMachineRunCommandInstanceView{}, fmt.Errorf("failed to wait for RunCommand on VMSS VM: %w", err)
 	}

-	runResp, err := runPoller.PollUntilDone(ctx, nil)
 	// The CreateOrUpdate response doesn't always include the InstanceView; fetch it
 	// explicitly so we get stdout/stderr/exit code.
 	getResp, err := config.Azure.VMSSVMRunCommands.Get(ctx, rg, s.Runtime.VMSSName, instanceID, runCommandName, &armcompute.VirtualMachineScaleSetVMRunCommandsClientGetOptions{
 		Expand: to.Ptr("instanceView"),
 	})
 	if err != nil {
-		return runResp.RunCommandResult, fmt.Errorf("failed to run command on Windows VM for image creation: %w", err)
 		return armcompute.VirtualMachineRunCommandInstanceView{}, fmt.Errorf("failed to get RunCommand instance view: %w", err)
 	}
 	if getResp.Properties == nil || getResp.Properties.InstanceView == nil {
 		return armcompute.VirtualMachineRunCommandInstanceView{}, errors.New("RunCommand result missing instance view")
 	}
 	view := *getResp.Properties.InstanceView
 	return view, runCommandScriptError(view)

2 participants