Reconcile VMs on startup#66
Conversation
VMs spun up from the previous process were left orphaned.
They were either not fully provisioned or were never cleaned up upon job completion.
In addition to that the integration was ignoring the initial resource request message and the runner had to wait for another job to start in order to provision the needed capacity.
This PR introduces the following:
- Reconciliation:
- List existing Orka VMs by scale set name prefix on startup
- Provisioning writes sentinel files (/tmp/orka-runner-setup-complete,
/tmp/orka-runner-run-complete) so the reconciler can probe state.
- For each existing VM, check the GitHub runner and SSH into the VM:
- No GH runner OR setup never completed OR run.sh crashed -> delete.
- run.sh finished but cleanup was missed -> clean up.
- Setup complete and run.sh running -> adopt.
Adoption:
- Track existing running VMs as if they were created by the current process.
Recovery provisioning:
- Start respecting the first message received from GitHub.
b4b03f1 to
e94ade8
Compare
| } | ||
|
|
||
| func (p *RunnerMessageProcessor) startRunner(job jobIdentity) { | ||
| go func() { |
There was a problem hiding this comment.
nit - do we need this nested func? Can we call go startRunner() from the call sites instead? This would make it a bit easier to read (there is already another function definition nested inside here with the defer 😄 we should improve this later)
| # (sentinel files + run.sh process), and either adopts (active), cleans up (run finished), or deletes | ||
| # (no GitHub runner / setup incomplete / run.sh crashed) so the controller recovers from prior process exits. | ||
| # Defaults to true. | ||
| ENABLE_RECONCILIATION=true |
There was a problem hiding this comment.
Do we want this to be enabled by default now? As it changes the behavior, it may cause unexpected issues for existing users if the controller is restarted.
Should we make it opt-in for now?
There was a problem hiding this comment.
I was wondering about htis.
But given the fact the previous behavior introduces stuck jobs and orphaned VM, this seems like a better default
There was a problem hiding this comment.
True. At the least, when we release a new version we should highlight the change in behavior
|
|
||
| if message.MessageId == 0 && len(batchedMessages) == 0 { | ||
| p.logger.Infof("initial message received, provision runners to cover assigned-job gap") | ||
| requiredRunners = requiredRunners - message.Statistics.TotalRunningJobs |
There was a problem hiding this comment.
Is it possible that we double count some VMs?
Imagine if the controller restarts, and there is a runner currently running a job. It will be counted with both TotalRegisteredRunners and TotalRunningJobs, won't it?
There was a problem hiding this comment.
I haven't seen this, but it looks like this line causes us to underprovision, which could lead to issues.
I am currently experimenting without it.
There was a problem hiding this comment.
Here is the initial message if you have 2 runners. Note - they are running, but not registered:)
{"level":"info","ts":"2026-06-03T12:19:44+03:00","logger":"runner-message-processor-172","msg":"Runner Set Statistics - Available: 0, Acquired: 0, Assigned: 2, Running: 2, Registered Runners: 0, Busy: 2, Idle: 0"}
And the first message after that fixes it:
{"level":"info","ts":"2026-06-03T12:19:44+03:00","logger":"runner-message-processor-172","msg":"Runner Set Statistics - Available: 0, Acquired: 0, Assigned: 2, Running: 2, Registered Runners: 2, Busy: 2, Idle: 0"}
I suspect that there is an issue with the initial reporting
There was a problem hiding this comment.
Ha. So there is potentially an issue with under provisioning but it is being masked by something else going on
There was a problem hiding this comment.
@spikeburton you can check my latest commit.
In some cases this is being reportly as it should be. In some not. So I decided to do some math (registered should equal busy and idle)
| } | ||
| r.logger.Infof("reconciliation: found %d existing VM(s) to reconcile", len(vms)) | ||
|
|
||
| for _, vm := range vms { |
There was a problem hiding this comment.
I wonder if we should run r.reconcileVM in parallel (e.g. using a wait group). Especially as there is a 10s. SSH timeout (imagine there are some dead / unreachable VMs). Running all of these in serial is going to take some time to reconcile.
We can always improve this in the future if we need to as well
There was a problem hiding this comment.
Good call.
I am going to look into this in another PR
Description
VMs spun up from the previous process were left orphaned.
They were either not fully provisioned or were never cleaned up upon job completion.
In addition to that the integration was ignoring the initial resource request message and the runner had to wait for another job to start in order to provision the needed capacity.
This PR introduces the following:
Reconciliation:
/tmp/orka-runner-run-complete) so the reconciler can probe state.
Adoption:
Recovery provisioning:
Testing
Make sure to set
MANAGE_RUNNER_SCALE_SETSandENABLE_RECONCILIATIONto trueInitial run
Interrupted run
Initial request
Adopting VMs
Cancelling adopted Job