Conversation

Contributor

@SriHarsha001 SriHarsha001 commented Feb 5, 2026

What this PR does / why we need it:
This PR runs 'aptmarkWALinuxAgent hold' in the foreground to avoid apt lock contention.

The cause is the & at the end of line 59 of cse_main.sh:

aptmarkWALinuxAgent hold &

Because the command runs in the background without any synchronization, it races against later apt operations such as the kubelet installation.

+ timeout 600 apt-get -y -f install /opt/kubelet/downloads/kubelet_1.34.0-ubuntu22.04u5_amd64.deb --allow-downgrades
Reading package lists...+ exitStatus=124    # TIMEOUT!
...
+ timeout 25 apt-mark hold walinuxagent
+ exitStatus=124                         # TIMEOUT! (contention)
+ sleep 5
+ timeout 25 apt-mark hold walinuxagent  # retry
...walinuxagent was already set on hold.
+ exitStatus=0                           # Finally succeeds
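
The contention pattern in the excerpt above can be reproduced in miniature. The following is only an analogy, assuming flock(1) and timeout(1) are available: a temporary lock file stands in for the apt/dpkg locks, and the file names and durations are made up for illustration.

```bash
#!/usr/bin/env bash
# Analogy only: flock(1) on a temp file stands in for the apt/dpkg locks.
workdir=$(mktemp -d)
lock="$workdir/demo.lock"
marker="$workdir/holder.started"

# Background holder, like 'aptmarkWALinuxAgent hold &': takes the lock and keeps it for 4s.
flock "$lock" sh -c 'touch "$1"; sleep 4' _ "$marker" &

# Wait until the background job actually owns the lock.
while [ ! -e "$marker" ]; do sleep 1; done

# A later step with its own timeout, like 'timeout 600 apt-get install ...'.
rc_contended=0
timeout 1 flock "$lock" true || rc_contended=$?

# Foreground equivalent: let the holder finish first, then take the lock.
wait
rc_sequential=0
timeout 1 flock "$lock" true || rc_sequential=$?

echo "contended=$rc_contended sequential=$rc_sequential"
rm -rf "$workdir"
```

While the background holder owns the lock, the second command is expected to hit its timeout (exit status 124, as in the log); once the holder has finished, the same command should acquire the lock immediately.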

Which issue(s) this PR fixes:

Repair Item : Install Kubelet/Kubectl can have extra delay caused by waagent hold command locking dpkg - https://msazure.visualstudio.com/CloudNativeCompute/_workitems/edit/36136247

Failure log -

[image: failure log screenshot]

Bad Case logs -

+ logs_to_events AKS.CSE.configureKubeletAndKubectl.installKubeletKubectlPkgFromPMC 'installKubeletKubectlPkgFromPMC 1.34.0'
+ local task=AKS.CSE.configureKubeletAndKubectl.installKubeletKubectlPkgFromPMC
+ shift
++ date +%s%3N
+ local eventsFileName=1764801993508
++ date '+%F %T.%3N'
+ local 'startTime=2025-12-03 22:46:33.510'
+ installKubeletKubectlPkgFromPMC 1.34.0
+ k8sVersion=1.34.0
+ installPkgWithAptGet kubelet 1.34.0
+ packageName=kubelet
+ packageVersion=1.34.0
+ downloadDir=/opt/kubelet/downloads
+ packagePrefix='kubelet_1.34.0-*'
++ find /opt/kubelet/downloads -maxdepth 1 -name 'kubelet_1.34.0-*' -print -quit
+ debFile=/opt/kubelet/downloads/kubelet_1.34.0-ubuntu22.04u5_amd64.deb
+ '[' -z /opt/kubelet/downloads/kubelet_1.34.0-ubuntu22.04u5_amd64.deb ']'
+ '[' -z /opt/kubelet/downloads/kubelet_1.34.0-ubuntu22.04u5_amd64.deb ']'
+ logs_to_events AKS.CSE.installkubelet.installDebPackageFromFile 'installDebPackageFromFile /opt/kubelet/downloads/kubelet_1.34.0-ubuntu22.04u5_amd64.deb'
+ local task=AKS.CSE.installkubelet.installDebPackageFromFile
+ shift
++ date +%s%3N
+ local eventsFileName=1764801993516
++ date '+%F %T.%3N'
+ local 'startTime=2025-12-03 22:46:33.519'
+ installDebPackageFromFile /opt/kubelet/downloads/kubelet_1.34.0-ubuntu22.04u5_amd64.deb
+ DEB_FILE=/opt/kubelet/downloads/kubelet_1.34.0-ubuntu22.04u5_amd64.deb
+ wait_for_apt_locks
+ fuser /var/lib/dpkg/lock /var/lib/apt/lists/lock /var/cache/apt/archives/lock /var/lib/dpkg/lock-frontend
+ retrycmd_if_failure 10 5 600 apt-get -y -f install /opt/kubelet/downloads/kubelet_1.34.0-ubuntu22.04u5_amd64.deb --allow-downgrades
+ _retrycmd_internal 10 5 600 true apt-get -y -f install /opt/kubelet/downloads/kubelet_1.34.0-ubuntu22.04u5_amd64.deb --allow-downgrades
+ local retries=10
+ shift
+ local waitSleep=5
+ shift
+ local timeoutVal=600
+ shift
+ local shouldLog=true
+ shift
+ cmdToRun=('apt-get' '-y' '-f' 'install' '/opt/kubelet/downloads/kubelet_1.34.0-ubuntu22.04u5_amd64.deb' '--allow-downgrades')
+ local cmdToRun
+ local exitStatus=0
++ seq 1 10
+ for i in $(seq 1 "$retries")
+ timeout 600 apt-get -y -f install /opt/kubelet/downloads/kubelet_1.34.0-ubuntu22.04u5_amd64.deb --allow-downgrades
Reading package lists...+ exitStatus=124
+ '[' 124 -eq 0 ']'
+ check_cse_timeout true
+ shouldLog=true
+ maxDurationSeconds=780
+ '[' -z 1764801991 ']'
++ date +%s
+ elapsedSeconds=25
+ '[' 25 -gt 780 ']'
+ return 0
+ '[' 1 -eq 120 ']'
+ sleep 5
+ for i in $(seq 1 "$retries")
+ timeout 25 apt-mark hold walinuxagent
+ exitStatus=124
+ '[' 124 -eq 0 ']'
+ check_cse_timeout true
+ shouldLog=true
+ maxDurationSeconds=780
+ '[' -z 1764801991 ']'
++ date +%s
+ elapsedSeconds=55
+ '[' 55 -gt 780 ']'
+ return 0
+ '[' 2 -eq 120 ']'
+ sleep 5
+ for i in $(seq 1 "$retries")
+ timeout 25 apt-mark hold walinuxagent

Building dependency tree...walinuxagent was already set on hold.
+ exitStatus=0
+ '[' 0 -eq 0 ']'
+ break
+ '[' true = true ']'
+ '[' 0 -eq 0 ']'
+ echo 'Executed "apt-mark hold walinuxagent" 3 times.'
Executed "apt-mark hold walinuxagent" 3 times.
+ return 0
++ date
++ hostname
+ echo Wed Dec 3 22:47:38 UTC 2025,aks-agentpool0-38673647-vmss000003, endAptmarkWALinuxAgent hold
Wed Dec 3 22:47:38 UTC 2025,aks-agentpool0-38673647-vmss000003, endAptmarkWALinuxAgent hold

The issue is apt lock contention caused by a backgrounded 'aptmarkWALinuxAgent hold &' running concurrently with the kubelet installation.

Timeline from logs

  • 22:46:33.510 - basePrep function starts and runs aptmarkWALinuxAgent hold & in the background
  • 22:46:33.519 - installDebPackageFromFile starts for kubelet. The apt-get install begins, but the backgrounded aptmarkWALinuxAgent hold also tries to acquire the apt locks.
  • Two operations then compete for the apt locks:
      - apt-get -y -f install kubelet... (times out with exit 124)
      - apt-mark hold walinuxagent (times out twice with exit 124 before succeeding on the 3rd try at 22:47:38)
  • 22:48:22 - kubelet install finally completes
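
The retry behaviour visible in the bad-case trace (a fixed number of attempts, a per-attempt timeout, and a sleep between attempts) can be sketched roughly as follows. This is a simplified, hypothetical stand-in and not the actual retrycmd_if_failure implementation: the name retry_sketch is invented, the check_cse_timeout call seen in the trace is left out, and the summary line is printed on failure as well as success.

```bash
# Hypothetical, simplified stand-in for the retry helper seen in the trace.
# Usage: retry_sketch <retries> <sleep_seconds> <timeout_seconds> <command...>
retry_sketch() {
    local retries="$1" wait_sleep="$2" timeout_val="$3"
    shift 3
    local exit_status=0 i
    for i in $(seq 1 "$retries"); do
        exit_status=0
        # timeout(1) returns 124 when the command exceeds its time limit.
        timeout "$timeout_val" "$@" || exit_status=$?
        if [ "$exit_status" -eq 0 ]; then
            break
        fi
        # Do not sleep after the final attempt.
        if [ "$i" -lt "$retries" ]; then
            sleep "$wait_sleep"
        fi
    done
    echo "Executed \"$*\" $i times."
    return "$exit_status"
}
```

Called as `retry_sketch 10 5 600 apt-get -y -f install <deb file> --allow-downgrades`, it would mirror the argument order shown in the trace (retries, sleep, timeout). Each attempt that blocks on a held lock costs up to the full per-attempt timeout, which is where the extra delay in the bad case comes from.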

Good Case logs -

+ logs_to_events AKS.CSE.configureKubeletAndKubectl.installKubeletKubectlPkgFromPMC 'installKubeletKubectlPkgFromPMC 1.34.2'
+ local task=AKS.CSE.configureKubeletAndKubectl.installKubeletKubectlPkgFromPMC
+ shift
++ date +%s%3N
+ local eventsFileName=1770161689371
++ date '+%F %T.%3N'
+ local 'startTime=2026-02-03 23:34:49.372'
+ installKubeletKubectlPkgFromPMC 1.34.2
+ k8sVersion=1.34.2
+ installPkgWithAptGet kubelet 1.34.2
+ packageName=kubelet
+ packageVersion=1.34.2
+ downloadDir=/opt/kubelet/downloads
+ packagePrefix='kubelet_1.34.2-*'
++ find /opt/kubelet/downloads -maxdepth 1 -name 'kubelet_1.34.2-*' -print -quit
+ debFile=/opt/kubelet/downloads/kubelet_1.34.2-ubuntu22.04u1_amd64.deb
+ '[' -z /opt/kubelet/downloads/kubelet_1.34.2-ubuntu22.04u1_amd64.deb ']'
+ '[' -z /opt/kubelet/downloads/kubelet_1.34.2-ubuntu22.04u1_amd64.deb ']'
+ logs_to_events AKS.CSE.installkubelet.installDebPackageFromFile 'installDebPackageFromFile /opt/kubelet/downloads/kubelet_1.34.2-ubuntu22.04u1_amd64.deb'
+ local task=AKS.CSE.installkubelet.installDebPackageFromFile
+ shift
++ date +%s%3N
+ local eventsFileName=1770161689376
++ date '+%F %T.%3N'
+ local 'startTime=2026-02-03 23:34:49.377'
+ installDebPackageFromFile /opt/kubelet/downloads/kubelet_1.34.2-ubuntu22.04u1_amd64.deb
+ DEB_FILE=/opt/kubelet/downloads/kubelet_1.34.2-ubuntu22.04u1_amd64.deb
+ wait_for_apt_locks
+ fuser /var/lib/dpkg/lock /var/lib/apt/lists/lock /var/cache/apt/archives/lock /var/lib/dpkg/lock-frontend
+ retrycmd_if_failure 10 5 600 apt-get -y -f install /opt/kubelet/downloads/kubelet_1.34.2-ubuntu22.04u1_amd64.deb --allow-downgrades
+ _retrycmd_internal 10 5 600 true apt-get -y -f install /opt/kubelet/downloads/kubelet_1.34.2-ubuntu22.04u1_amd64.deb --allow-downgrades
+ local retries=10
+ shift
+ local waitSleep=5
+ shift
+ local timeoutVal=600
+ shift
+ local shouldLog=true
+ shift
+ cmdToRun=('apt-get' '-y' '-f' 'install' '/opt/kubelet/downloads/kubelet_1.34.2-ubuntu22.04u1_amd64.deb' '--allow-downgrades')
+ local cmdToRun
+ local exitStatus=0
++ seq 1 10
+ for i in $(seq 1 "$retries")
+ timeout 600 apt-get -y -f install /opt/kubelet/downloads/kubelet_1.34.2-ubuntu22.04u1_amd64.deb --allow-downgrades
Reading package lists...walinuxagent was already set on hold.
+ exitStatus=0
+ '[' 0 -eq 0 ']'
+ break
+ '[' true = true ']'
+ '[' 0 -eq 0 ']'
+ echo 'Executed "apt-mark hold walinuxagent" 1 times.'
Executed "apt-mark hold walinuxagent" 1 times.
+ return 0
++ date
++ hostname
+ echo Tue Feb 3 23:34:49 UTC 2026,aks-userpool-14407792-vmss000000, endAptmarkWALinuxAgent hold
Tue Feb 3 23:34:49 UTC 2026,aks-userpool-14407792-vmss000000, endAptmarkWALinuxAgent hold

Wiki Page with the write up - https://msazure.visualstudio.com/CloudNativeCompute/_wiki/wikis/PersonalPlayground/974093/CSE-124-25s-delay

Copilot AI review requested due to automatic review settings February 5, 2026 22:34
@SriHarsha001 SriHarsha001 changed the title from "Run aptmarkWALinuxAgent hold operation on foreground" to "fix: Run aptmarkWALinuxAgent hold operation on foreground" Feb 5, 2026
Contributor

Copilot AI left a comment

Pull request overview

This PR changes the execution of aptmarkWALinuxAgent hold from background (asynchronous with &) to foreground (synchronous) execution to avoid apt lock contention with subsequent apt operations like kubelet installation. The change affects the basePrep function in the main CSE script, with corresponding updates to test data files containing base64-encoded versions of the script.

Changes:

  • Modified aptmarkWALinuxAgent hold to run synchronously instead of in background
  • Updated multiple test data files (CustomData) which contain base64-encoded versions of the modified script
  • Added clarifying comment explaining the reason for synchronous execution

Reviewed changes

Copilot reviewed 63 out of 66 changed files in this pull request and generated no comments.

File | Description
parts/linux/cloud-init/artifacts/cse_main.sh | Main code change: removed & operator and added explanatory comment
pkg/agent/testdata/*/CustomData | Updated base64-encoded test data reflecting the script change

@SriHarsha001 SriHarsha001 changed the title from "fix: Run aptmarkWALinuxAgent hold operation on foreground" to "fix: run aptmarkwalinuxAgent hold operation on foreground" Feb 5, 2026
@SriHarsha001 SriHarsha001 marked this pull request as ready for review February 6, 2026 00:14
function basePrep {
-    aptmarkWALinuxAgent hold &
+    # Run synchronously to avoid apt lock contention with later apt operations (eg kubelet install)
+    aptmarkWALinuxAgent hold
Collaborator

Thanks for finding this.
As we discussed offline, this was the original PR from 2023:
https://github.com/Azure/AgentBaker/pull/2701/changes

But it doesn't explain why the command runs in the background (with the &).
Not sure if anyone else knows the history.
The change may increase provisioning time somewhat, but ideally the impact should be minimal.

Contributor

I have the same concern about whether this increases the overall provisioning duration. Can we run some tests to see if removing the & has a substantial impact?

Collaborator

Yeah, same here. I wanted us to understand more about the why. I would prefer to remove it if we don't need it anymore.

Contributor Author

@SriHarsha001 SriHarsha001 Feb 7, 2026

Based on my reading of the code, we need to keep "aptmarkWALinuxAgent unhold" as is, so that existing node images are not impacted.

During the VHD image build, install-dependencies.sh (line 72) calls the installDeps function, which runs "aptmarkWALinuxAgent hold". This hold persists in the VHD.

Later, when nodePrep is called, the hold on walinuxagent is released by "aptmarkWALinuxAgent unhold &".

If it is not released, the consequences are:

  • Unattended upgrades will never update walinuxagent
  • Security patches for walinuxagent won't auto-apply (if they are currently applied that way)
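
Whether the hold is currently in place on a node can be checked from the output of 'apt-mark showhold'. The helper below is a hypothetical sketch, not part of the CSE scripts: the name pkg_in_hold_list is invented, and it assumes the hold list arrives on stdin with one package name per line.

```bash
# Hypothetical helper: succeeds if the named package appears in a hold list on stdin.
# Assumes one package name per line, as printed by 'apt-mark showhold'.
pkg_in_hold_list() {
    # -F fixed string, -x whole-line match, -q quiet (exit status only)
    grep -Fxq -- "$1"
}

# Usage on a Debian/Ubuntu node (not executed here):
#   apt-mark showhold | pkg_in_hold_list walinuxagent && echo "walinuxagent is held"
```

Reading the list from stdin keeps the check independent of apt itself, so it can be exercised with canned input.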

Contributor Author

@SriHarsha001 SriHarsha001 Feb 7, 2026

A walinuxagent hold/unhold operation should take less than 1 second to complete:

root@aks-nodepool1-31884855-vmss000000:/# time apt-mark hold walinuxagent
walinuxagent was already set on hold.
real    0m0.046s
user    0m0.032s
sys     0m0.015s

root@aks-nodepool1-31884855-vmss000000:/# time apt-mark unhold walinuxagent
Canceled hold on walinuxagent.
real    0m0.058s
user    0m0.042s
sys     0m0.016s

root@aks-nodepool1-31884855-vmss000000:/# time apt-mark unhold walinuxagent
walinuxagent was already not on hold.
real    0m0.165s
user    0m0.063s
sys     0m0.099s

As this PR touches the main critical code path (VHD image creation and node provisioning of every AKS node), I want to defer removing 'aptmarkWALinuxAgent hold' for now and instead make the operation run in the foreground by removing the '&'.

So I am proposing 2 changes:

  1. Run 'walinuxagent hold' as a foreground operation.
  2. Reduce the per-attempt timeout for 'walinuxagent hold/unhold' from 25s to 5s, with a 15-minute overall timeout.

Contributor Author

@SriHarsha001 SriHarsha001 Feb 7, 2026

Earlier, the worst case was 120 × (25 + 5) = 3600s (60 minutes), although the wrapper timeout is 15 minutes.

retrycmd_if_failure 120 5 25 apt-mark $1 walinuxagent

In this PR, the worst case is 90 × (5 + 5) = 900s (15 minutes).

retrycmd_if_failure 90 5 5 apt-mark $1 walinuxagent
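
The worst-case figures above follow from retries × (timeout + sleep). A small sketch of that arithmetic, using a hypothetical helper name (worst_case_seconds) and the same argument order as retrycmd_if_failure (retries, sleep, timeout):

```bash
# Hypothetical helper: upper bound on total time if every attempt times out.
# Arguments follow the retrycmd_if_failure order: retries, sleep seconds, timeout seconds.
worst_case_seconds() {
    local retries="$1" wait_sleep="$2" timeout_val="$3"
    echo $(( retries * (timeout_val + wait_sleep) ))
}

worst_case_seconds 120 5 25   # earlier settings: prints 3600
worst_case_seconds 90 5 5     # settings proposed in this comment: prints 900
```

Like the formula in the comment, this counts one sleep per attempt, so it slightly overestimates if the helper skips the sleep after the last attempt.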

Collaborator

If it's supposed to take about 1 second, I vote for something lower. Also, we need to log the event_log time so we know how much time it really adds.
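
One way to see how much time the foreground hold adds is to wrap the call and record the elapsed milliseconds. The sketch below is hypothetical and does not reproduce the existing logs_to_events mechanism: timed_run is an invented name, and it assumes GNU date, whose %3N format (also used in the trace above) yields milliseconds.

```bash
# Hypothetical helper: run a command and report how long it took, in milliseconds.
# Relies on GNU date's %3N (milliseconds), the same format seen in the CSE trace.
timed_run() {
    local label="$1"
    shift
    local start_ms end_ms status=0
    start_ms=$(date +%s%3N)
    "$@" || status=$?
    end_ms=$(date +%s%3N)
    # Report on stderr so the wrapped command's stdout is left untouched.
    echo "$label took $(( end_ms - start_ms )) ms (exit $status)" >&2
    return "$status"
}

# Usage (not executed here):
#   timed_run aptmarkWALinuxAgentHold aptmarkWALinuxAgent hold
```

The wrapper passes the command's exit status through, so it can sit in front of an existing call without changing error handling.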

Collaborator

Please sync with @mxj220: if we are moving toward upgrading the waagent ourselves, I'm not sure why we would care about it being blocked.

Collaborator

@Devinwong Devinwong Feb 9, 2026

Running in the background:

  1. If the hold action is fast (about 0.1s, the happy path seen in the experiment), it should have released the lock by the time kubelet is installed.
  2. In the edge case where the hold action takes very long (more than 25s), it occupies the lock and installing kubelet fails.

Running the hold on the main thread shouldn't change the behavior in case 1 or introduce latency. In case 2, both approaches will fail.
And yes, if we don't need to care about waagent upgrades (need to double check), we can remove all hold/unhold commands to make the logic straightforward.

Contributor Author

@SriHarsha001 SriHarsha001 Feb 9, 2026

Thank you everyone for taking a look.

To be safe and avoid any unintentional walinuxagent upgrades during nodePrep, in the future or on custom VHDs, I am keeping the hold as a synchronous operation in this PR. As this is in the critical path and is my first PR in NodeSIG, I want to be super careful :).

I have also reduced the sleep time and timeout for each apt hold operation, keeping the overall timeout at 15 minutes: 225 * (2 + 2) = 900s (15 minutes).

In PR #7795, the plan is to download and update waLinuxAgent during the VHD build. However, during nodePrep, if a newer waLinuxAgent is available and some code path runs apt upgrade, there is a chance of it getting upgraded.

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 64 out of 67 changed files in this pull request and generated 1 comment.

Copilot AI review requested due to automatic review settings February 9, 2026 19:51
@SriHarsha001 SriHarsha001 changed the title from "fix: run aptmarkwalinuxAgent hold operation on foreground" to "fix: run aptmarkwalinuxagent hold operation on foreground" Feb 9, 2026
Contributor

Copilot AI left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Collaborator

djsly commented Feb 10, 2026

Fix the test data conflict and this is good to go.

Member

@yewmsft yewmsft left a comment

Not the right fix

Collaborator

djsly commented Feb 10, 2026

@yewmsft please unblock PR

Member

@yewmsft yewmsft left a comment

resetting the vote.

@yewmsft yewmsft self-requested a review February 10, 2026 19:22
@SriHarsha001 SriHarsha001 enabled auto-merge (squash) February 10, 2026 22:46
@SriHarsha001 SriHarsha001 merged commit 7302513 into main Feb 10, 2026
25 of 27 checks passed
@SriHarsha001 SriHarsha001 deleted the sharsha/vhdCSEPerf branch February 10, 2026 23:57
6 participants