Reconcile VMs on startup by ispasov · Pull Request #66 · macstadium/orka-github-actions-integration

ispasov · 2026-05-29T08:11:08Z

Description

Reconcile VMs on startup

VMs spun up from the previous process were left orphaned.
They were either not fully provisioned or were never cleaned up upon job completion.

In addition, the integration ignored the initial resource request message, so a runner had to wait for another job to start before the needed capacity was provisioned.

This PR introduces the following:

Reconciliation:
- List existing Orka VMs by scale set name prefix on startup
- Provisioning writes sentinel files (/tmp/orka-runner-setup-complete,
  /tmp/orka-runner-run-complete) so the reconciler can probe state.
- For each existing VM, check the GitHub runner and SSH into the VM:
  - No GH runner OR setup never completed OR run.sh crashed -> delete.
  - run.sh finished but cleanup was missed -> clean up.
  - Setup complete and run.sh running -> adopt.
Adoption:
- Track existing running VMs as if they were created by the current process.
Recovery provisioning:
- Start respecting the first message received from GitHub.
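
The delete / clean up / adopt rules above can be sketched as a small decision function. This is only an illustration, not the code in this PR: the `VMState` struct, its field names, and the order in which the checks run are assumptions. Only the two sentinel paths and the three outcomes come from the description.

```go
package main

import "fmt"

// Action is what the reconciler does with a VM found on startup.
type Action string

const (
	ActionDelete  Action = "delete"
	ActionCleanUp Action = "clean up"
	ActionAdopt   Action = "adopt"
)

// VMState is a hypothetical summary of the GitHub and SSH probes.
type VMState struct {
	HasGitHubRunner bool // a matching runner exists on the GitHub side
	SetupComplete   bool // /tmp/orka-runner-setup-complete exists
	RunComplete     bool // /tmp/orka-runner-run-complete exists
	RunShRunning    bool // a run.sh process is still alive on the VM
}

// decide maps the probed state to one of the three outcomes.
// The check order is an assumption made for this sketch.
func decide(s VMState) Action {
	switch {
	case !s.HasGitHubRunner || !s.SetupComplete:
		return ActionDelete
	case s.RunComplete:
		return ActionCleanUp
	case s.RunShRunning:
		return ActionAdopt
	default:
		// Setup finished, no run-complete sentinel, and no run.sh
		// process: treated here as "run.sh crashed".
		return ActionDelete
	}
}

func main() {
	active := VMState{HasGitHubRunner: true, SetupComplete: true, RunShRunning: true}
	fmt.Println(decide(active))
}
```

Keeping the decision free of SSH and API calls makes each row of the list above easy to cover with a table-driven test.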

Testing

Make sure to set MANAGE_RUNNER_SCALE_SETS and ENABLE_RECONCILIATION to true.

Initial run

Start the integration
Start a job
Ensure it finishes

Interrupted run

Start the integration
Start a job
The moment a VM is provisioned stop the integration
Start it again.
The old VM should be deleted and a new one should be created instead
The job should complete
The VM gets deleted

Initial request

Stop the integration
Start a workflow, wait for it to initialize
Start the integration
It should create as many runners as the workflow requests (note: GitHub sometimes incorrectly reports 0 required runners. If this happens, try a job that requests more runners initially)
The workflow should complete
The VMs should be deleted

Adopting VMs

Start the integration
Create a workflow that runs at least 1 min
Start the workflow
Wait for the VM that is created to start executing the workflow
Restart the integration
The VM should not be deleted, but adopted
The job should complete successfully
The VM should be deleted

Cancelling adopted Job

Start the integration
Create a workflow that runs at least 1 min
Start the workflow
Wait for the VM that is created to start executing the workflow
Restart the integration
The VM should not be deleted, but adopted
Cancel the run
The VM should be deleted


spikeburton · 2026-06-01T20:58:22Z

+}
+
+func (p *RunnerMessageProcessor) startRunner(job jobIdentity) {
+	go func() {


nit - do we need this nested func? Can we call go startRunner() from the call sites instead? This would make it a bit easier to read (there is already another function definition nested inside here with the defer 😄 we should improve this later)
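
A minimal sketch of what this suggestion could look like, assuming a simplified `processor` type and a `done` channel that are not part of the PR: `startRunner` stays synchronous and the call site decides to run it with `go`.

```go
package main

import "fmt"

// jobIdentity is a stand-in for the type used in the diff above.
type jobIdentity struct {
	ID string
}

// processor is a hypothetical, stripped-down message processor.
type processor struct {
	done chan string // reports the IDs of jobs whose runner was started
}

// startRunner does its work synchronously; it no longer wraps
// its body in an anonymous goroutine.
func (p *processor) startRunner(job jobIdentity) {
	// Provisioning work would happen here.
	p.done <- job.ID
}

// startAll launches one goroutine per job from the call site and
// waits until every startRunner call has reported back.
func startAll(p *processor, jobs []jobIdentity) int {
	for _, job := range jobs {
		go p.startRunner(job)
	}
	finished := 0
	for range jobs {
		<-p.done
		finished++
	}
	return finished
}

func main() {
	p := &processor{done: make(chan string)}
	n := startAll(p, []jobIdentity{{ID: "job-a"}, {ID: "job-b"}})
	fmt.Println("runners started:", n)
}
```

With this shape the function can also be called synchronously in tests, which is harder when the goroutine is created inside it.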

spikeburton · 2026-06-01T21:31:05Z

+# (sentinel files + run.sh process), and either adopts (active), cleans up (run finished), or deletes
+# (no GitHub runner / setup incomplete / run.sh crashed) so the controller recovers from prior process exits.
+# Defaults to true.
+ENABLE_RECONCILIATION=true


Do we want this to be enabled by default now? As it changes the behavior, it may cause unexpected issues for existing users if the controller is restarted.

Should we make it opt-in for now?

I was wondering about this.
But given that the previous behavior leads to stuck jobs and orphaned VMs, this seems like a better default.

True. At the least, when we release a new version we should highlight the change in behavior
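
For reference, a default-on flag of this kind could be read roughly as below. This is a sketch only: the helper name and the choice to fall back to true on an unparsable value are assumptions, not the integration's actual config loading.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// reconciliationEnabled reads ENABLE_RECONCILIATION through the given
// lookup function (os.LookupEnv in production). It returns true when
// the variable is unset, empty, or cannot be parsed as a boolean.
func reconciliationEnabled(lookup func(string) (string, bool)) bool {
	raw, ok := lookup("ENABLE_RECONCILIATION")
	if !ok || raw == "" {
		return true
	}
	enabled, err := strconv.ParseBool(raw)
	if err != nil {
		return true
	}
	return enabled
}

func main() {
	fmt.Println("reconciliation enabled:", reconciliationEnabled(os.LookupEnv))
}
```

Existing users who want the previous behavior after an upgrade would then set ENABLE_RECONCILIATION=false explicitly.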

spikeburton · 2026-06-01T21:41:57Z


+	if message.MessageId == 0 && len(batchedMessages) == 0 {
+		p.logger.Infof("initial message received, provision runners to cover assigned-job gap")
+		requiredRunners = requiredRunners - message.Statistics.TotalRunningJobs


Is it possible that we double count some VMs?

Imagine if the controller restarts, and there is a runner currently running a job. It will be counted with both TotalRegisteredRunners and TotalRunningJobs, won't it?

I haven't seen this, but it looks like this line causes us to underprovision, which could lead to issues.
I am currently experimenting without it.

Here is the initial message if you have 2 runners. Note - they are running, but not registered:)

{"level":"info","ts":"2026-06-03T12:19:44+03:00","logger":"runner-message-processor-172","msg":"Runner Set Statistics - Available: 0, Acquired: 0, Assigned: 2, Running: 2, Registered Runners: 0, Busy: 2, Idle: 0"}

And the first message after that fixes it:

{"level":"info","ts":"2026-06-03T12:19:44+03:00","logger":"runner-message-processor-172","msg":"Runner Set Statistics - Available: 0, Acquired: 0, Assigned: 2, Running: 2, Registered Runners: 2, Busy: 2, Idle: 0"}

I suspect that there is an issue with the initial reporting

Ha. So there is potentially an issue with under provisioning but it is being masked by something else going on

@spikeburton you can check my latest commit.
In some cases this is reported as it should be, in others it is not. So I decided to do some math (registered should equal the sum of busy and idle)
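
To make the "registered should equal busy plus idle" check concrete, here is a hedged sketch using the two log lines quoted above. The `Statistics` struct is a local stand-in, and falling back to busy plus idle when the numbers disagree is an assumption for illustration; the actual handling is in the latest commit.

```go
package main

import "fmt"

// Statistics is a local stand-in holding only the counters used here.
type Statistics struct {
	TotalRunningJobs       int
	TotalRegisteredRunners int
	TotalBusyRunners       int
	TotalIdleRunners       int
}

// registeredConsistent reports whether the registered count matches
// the sum of busy and idle runners.
func registeredConsistent(s Statistics) bool {
	return s.TotalRegisteredRunners == s.TotalBusyRunners+s.TotalIdleRunners
}

// effectiveRegistered returns the reported registered count when it is
// consistent, and busy plus idle otherwise (assumed fallback).
func effectiveRegistered(s Statistics) int {
	if registeredConsistent(s) {
		return s.TotalRegisteredRunners
	}
	return s.TotalBusyRunners + s.TotalIdleRunners
}

func main() {
	// Initial message from the log: Running: 2, Registered Runners: 0, Busy: 2, Idle: 0.
	initial := Statistics{TotalRunningJobs: 2, TotalRegisteredRunners: 0, TotalBusyRunners: 2}
	// Following message: Registered Runners: 2.
	next := Statistics{TotalRunningJobs: 2, TotalRegisteredRunners: 2, TotalBusyRunners: 2}

	fmt.Println(registeredConsistent(initial), effectiveRegistered(initial))
	fmt.Println(registeredConsistent(next), effectiveRegistered(next))
}
```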

spikeburton · 2026-06-01T21:53:47Z

+	}
+	r.logger.Infof("reconciliation: found %d existing VM(s) to reconcile", len(vms))
+
+	for _, vm := range vms {


I wonder if we should run r.reconcileVM in parallel (e.g. using a wait group), especially as there is a 10s SSH timeout (imagine there are some dead / unreachable VMs). Running all of these serially is going to take some time to reconcile.

We can always improve this in the future if we need to as well

Good call.
I am going to look into this in another PR
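
One possible shape for that follow-up, sketched with a `sync.WaitGroup`. The `reconcileAll` helper, the string VM names, and the error map are assumptions for illustration; the real r.reconcileVM signature may differ.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errUnreachable simulates a VM whose SSH probe timed out.
var errUnreachable = errors.New("ssh timeout")

// reconcileAll runs reconcileVM for every VM concurrently and returns
// the errors keyed by VM name, so one slow or dead VM does not hold
// up the others.
func reconcileAll(vms []string, reconcileVM func(string) error) map[string]error {
	var (
		wg   sync.WaitGroup
		mu   sync.Mutex
		errs = make(map[string]error)
	)
	for _, vm := range vms {
		wg.Add(1)
		go func(name string) {
			defer wg.Done()
			if err := reconcileVM(name); err != nil {
				mu.Lock()
				errs[name] = err
				mu.Unlock()
			}
		}(vm)
	}
	wg.Wait()
	return errs
}

func main() {
	// Fake reconcile function: one VM is unreachable.
	fake := func(name string) error {
		if name == "vm-dead" {
			return errUnreachable
		}
		return nil
	}
	errs := reconcileAll([]string{"vm-a", "vm-dead", "vm-b"}, fake)
	fmt.Println("failed VMs:", len(errs))
}
```

With many VMs it may be worth capping the number of concurrent SSH sessions, for example with a buffered channel used as a semaphore.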

ispasov requested a review from a team as a code owner May 29, 2026 08:11

ispasov force-pushed the is/vm-reconciliation branch from b4b03f1 to e94ade8 Compare May 29, 2026 08:12

Rich7690 mentioned this pull request May 29, 2026

Resolve issues with orphaned VMs when process restarts with in-flight jobs #64

Closed

spikeburton approved these changes Jun 1, 2026

spikeburton reviewed Jun 1, 2026

ispasov added 4 commits June 3, 2026 12:17

Move go routine to caller

ee748b9

Better reporting during reconcile

c9f098c

Handle initial messages better

1ca2fe7

Use running job instead of busy

69ebcdf

ispasov merged commit 9dd636f into main Jun 4, 2026
2 checks passed

ispasov deleted the is/vm-reconciliation branch June 4, 2026 05:56

ispasov merged 5 commits into main from is/vm-reconciliation
