-
Notifications
You must be signed in to change notification settings - Fork 247
fix: run aptmarkwalinuxagent hold operation on foreground #7797
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR changes the execution of aptmarkWALinuxAgent hold from background (asynchronous with &) to foreground (synchronous) execution to avoid apt lock contention with subsequent apt operations like kubelet installation. The change affects the basePrep function in the main CSE script, with corresponding updates to test data files containing base64-encoded versions of the script.
Changes:
- Modified
aptmarkWALinuxAgent holdto run synchronously instead of in background - Updated multiple test data files (CustomData) which contain base64-encoded versions of the modified script
- Added clarifying comment explaining the reason for synchronous execution
Reviewed changes
Copilot reviewed 63 out of 66 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| parts/linux/cloud-init/artifacts/cse_main.sh | Main code change: removed & operator and added explanatory comment |
| pkg/agent/testdata/*/CustomData | Updated base64-encoded test data reflecting the script change |
| function basePrep { | ||
| aptmarkWALinuxAgent hold & | ||
| # Run synchronously to avoid apt lock contention with later apt operations (eg kubelet install) | ||
| aptmarkWALinuxAgent hold |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for finding this out.
As we talked offline, this was the original PR in 2023.
https://github.com/Azure/AgentBaker/pull/2701/changes
But it doesn't have explanation why it's running in background (with the &).
Not sure if anyone else knows the history.
The impact should be increasing some provisioning but ideally it should have minimal impact
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
have the same concern on whether this increases the overall provisioning duration, can we run some tests to see if removing the & has a substantial impact?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, same here, I wanted us to understand more on the why, I would prefer to remove it if we dont need it anymore
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on my reading of the code - we need to keep "aptmarkWALinuxAgent unhold" as is, so existing node images do not get impacted.
During VHD image build, install-dependencies.sh (line 72) calls installDeps function and in installDeps fn "aptmarkWALinuxAgent hold" takes place. This hold persists in the VHD.
Later when nodePrep is called, the hold on walinuxagent is released by "aptmarkWALinuxAgent unhold &"
If it is not released, then below will be the consequence -
- Unattended upgrades will never update walinuxagent
- Security patches for walinuxagent won't auto-apply (if it is doing it)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
walinuxagent hold/unhold operation should take <1 sec to complete -
root@aks-nodepool1-31884855-vmss000000:/# time apt-mark hold walinuxagent
walinuxagent was already set on hold.
real 0m0.046s
user 0m0.032s
sys 0m0.015s
root@aks-nodepool1-31884855-vmss000000:/# time apt-mark unhold walinuxagent
Canceled hold on walinuxagent.
real 0m0.058s
user 0m0.042s
sys 0m0.016s
root@aks-nodepool1-31884855-vmss000000:/# time apt-mark unhold walinuxagent
walinuxagent was already not on hold.
real 0m0.165s
user 0m0.063s
sys 0m0.099s
As this PR touches the main critical code path (VHD image creation and node provisioning of every AKS node), I want to defer removing 'aptmarkWALinuxAgent hold' for now, instead make the operation foreground by removing '&'.
So I am proposing 2 changes -
- Run 'walinuxagent hold' as a foreground operation.
- Reduce timeout from 25s to 5s 'walinuxagent hold/unhold' with 15 minutes overall timeout.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Earlier, worst case - 120 × (25 + 5) = 3600s (60 minutes) - but wrapper timeout has 15minutes.
retrycmd_if_failure 120 5 25 apt-mark $1 walinuxagent
In this PR, worst case - 90 × (5 + 5) = 900s (15 minutes)
retrycmd_if_failure 90 5 5 apt-mark $1 walinuxagent
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if its supposed to take that 1sec... I vote for something lower, also, we need to log the event_log time so we can know how much time it really adds.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please sync with @mxj220 since if we are moving with us upgrading the waagent ourselves, I"m not sure why we would care about it being blocked
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Running in the background:
- If the hold action is fast (about 0.1s, happy path as seen in the experiment), it should have released the lock when it comes to installing kubelet
- For edge case if the hold action is taking super long (more than 25s), it occupies the lock thus installing kubelet fails.
Making the hold to run on main thread shouldn't change the behavior of 1 and not introducing latency. And for 2, both will fail.
And yes if we don't need to care about waagent upgrade(need to double check), we can remove all hold/unhold commands to make the logic straightforward.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you everyone for taking a look.
To be safe and avoid any unintentional walinuxagent upgrades during nodePrep in the future or custom VHD, I am leaving the hold to be a sync operation in this PR. As this is in the critical path and my first PR in NodeSIG, I want to be super careful :).
I have also reduce the sleep time and timeout for each apt hold operation, keeping overall timeout to be 15 minutes. 225 * (2 + 2) = 900s (15 minutes).
On this PR - #7795 plan is to download and update waLinuxAgent during VHD build, however during nodePrep, if there is a new waLinuxAgent available and if some code path does apt upgrade, there is a chance of it getting upgraded.
6c9f2fc to
9870c00
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 64 out of 67 changed files in this pull request and generated 1 comment.
fdddffa to
b83d0e5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.
|
fix test data conflict and good to go. |
yewmsft
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not the right fix
|
@yewmsft please unblock PR |
yewmsft
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
resetting the vote.
8b4ed3f to
4ff5c2a
Compare
What this PR does / why we need it:
In this PR, change is to run 'aptmarkWALinuxAgent hold' in the foreground to avoid apt lock contention.
The & at the end of line 59 in [cse_main.sh:59]
aptmarkWALinuxAgent hold &
This runs in background without any synchronization, so it races against later apt operations like kubelet installation.
Which issue(s) this PR fixes:
Repair Item : Install Kubelet/Kubectl can have extra delay caused by waagent hold command locking dpkg - https://msazure.visualstudio.com/CloudNativeCompute/_workitems/edit/36136247
Failure log -
Bad Case logs -
The issue is apt lock contention caused by a backgrounded 'aptmarkWALinuxAgent hold &' running concurrently with the kubelet installation.
Timeline from logs
The apt-get install begins but then the backgrounded aptmarkWALinuxAgent hold also tries to acquire apt locks
Two operations compete for apt locks:
apt-get -y -f install kubelet... (times out with exit 124)
apt-mark hold walinuxagent (times out twice with exit 124 before succeeding on 3rd try at 22:47:38)
22:48:22 - kubelet install finally completes
Good Case logs -
Wiki Page with the write up - https://msazure.visualstudio.com/CloudNativeCompute/_wiki/wikis/PersonalPlayground/974093/CSE-124-25s-delay