Skip to content

Reconcile VMs on startup#66

Merged
ispasov merged 5 commits into
mainfrom
is/vm-reconciliation
Jun 4, 2026
Merged

Reconcile VMs on startup#66
ispasov merged 5 commits into
mainfrom
is/vm-reconciliation

Conversation

@ispasov

@ispasov ispasov commented May 29, 2026

Copy link
Copy Markdown
Collaborator

Description

  • Reconcile VMs on startup

VMs spun up from the previous process were left orphaned.
They were either not fully provisioned or were never cleaned up upon job completion.

In addition to that the integration was ignoring the initial resource request message and the runner had to wait for another job to start in order to provision the needed capacity.

This PR introduces the following:

  • Reconciliation:

    • List existing Orka VMs by scale set name prefix on startup
    • Provisioning writes sentinel files (/tmp/orka-runner-setup-complete,
      /tmp/orka-runner-run-complete) so the reconciler can probe state.
    • For each existing VM, check the GitHub runner and SSH into the VM:
      • No GH runner OR setup never completed OR run.sh crashed -> delete.
      • run.sh finished but cleanup was missed -> clean up.
      • Setup complete and run.sh running -> adopt.

    Adoption:

    • Track existing running VMs as if they were created by the current process.

    Recovery provisioning:

    • Start respecting the first message received from GitHub.

Testing

Make sure to set MANAGE_RUNNER_SCALE_SETS and ENABLE_RECONCILIATION to true

Initial run

  1. Start the integration
  2. Start a job
  3. Ensure it finishes

Interrupted run

  1. Start the integration
  2. Start a job
  3. The moment a VM is provisioned stop the integration
  4. Start it again.
  5. The old VM should be deleted and a new one should be created instead
  6. The job should complete
  7. The VM gets deleted

Initial request

  1. Stop the integration
  2. Start a workflow, wait for it to initialize
  3. Start the integration
  4. It should create as many runners as requested from the workflow (note - sometimes GitHub reports incorrectly 0 required runners. If this happens, try with a job that requests more initially)
  5. The workflow should complete
  6. The VMs should be deleted

Adopting VMs

  1. Start the integration
  2. Create a workflow that runs at least 1 min
  3. Start the workflow
  4. Wait for the VM that is created to start executing the workflow
  5. Restart the integration
  6. The VM should not be deleted, but adopted
  7. The job should complete succesfully
  8. The VM should be deleted

Cancelling adopted Job

  1. Start the integration
  2. Create a workflow that runs at least 1 min
  3. Start the workflow
  4. Wait for the VM that is created to start executing the workflow
  5. Restart the integration
  6. The VM should not be deleted, but adopted
  7. Cancel the run
  8. The VM should be deleted

@ispasov ispasov requested a review from a team as a code owner May 29, 2026 08:11
VMs spun up from the previous process were left orphaned.
They were either not fully provisioned or were never cleaned up upon job completion.

In addition to that the integration was ignoring the initial resource request message and the runner had to wait for another job to start in order to provision the needed capacity.

This PR introduces the following:

- Reconciliation:
  - List existing Orka VMs by scale set name prefix on startup
  - Provisioning writes sentinel files (/tmp/orka-runner-setup-complete,
    /tmp/orka-runner-run-complete) so the reconciler can probe state.
  - For each existing VM, check the GitHub runner and SSH into the VM:
    - No GH runner OR setup never completed OR run.sh crashed -> delete.
    - run.sh finished but cleanup was missed -> clean up.
    - Setup complete and run.sh running -> adopt.

  Adoption:
  - Track existing running VMs as if they were created by the current process.

  Recovery provisioning:
  - Start respecting the first message received from GitHub.
Comment thread pkg/github/runners/message-processor.go Outdated
}

func (p *RunnerMessageProcessor) startRunner(job jobIdentity) {
go func() {

@spikeburton spikeburton Jun 1, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit - do we need this nested func? Can we call go startRunner() from the call sites instead? This would make it a bit easier to read (there is already another function definition nested inside here with the defer 😄 we should improve this later)

Comment thread examples/.env
# (sentinel files + run.sh process), and either adopts (active), cleans up (run finished), or deletes
# (no GitHub runner / setup incomplete / run.sh crashed) so the controller recovers from prior process exits.
# Defaults to true.
ENABLE_RECONCILIATION=true

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want this to be enabled by default now? As it changes the behavior, it may cause unexpected issues for existing users if the controller is restarted.

Should we make it opt-in for now?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was wondering about htis.
But given the fact the previous behavior introduces stuck jobs and orphaned VM, this seems like a better default

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True. At the least, when we release a new version we should highlight the change in behavior

Comment thread pkg/github/runners/message-processor.go Outdated

if message.MessageId == 0 && len(batchedMessages) == 0 {
p.logger.Infof("initial message received, provision runners to cover assigned-job gap")
requiredRunners = requiredRunners - message.Statistics.TotalRunningJobs

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible that we double count some VMs?

Imagine if the controller restarts, and there is a runner currently running a job. It will be counted with both TotalRegisteredRunners and TotalRunningJobs, won't it?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't seen this, but it looks like this line causes us to underprovision, which could lead to issues.
I am currently experimenting without it.

@ispasov ispasov Jun 3, 2026

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is the initial message if you have 2 runners. Note - they are running, but not registered:)

{"level":"info","ts":"2026-06-03T12:19:44+03:00","logger":"runner-message-processor-172","msg":"Runner Set Statistics - Available: 0, Acquired: 0, Assigned: 2, Running: 2, Registered Runners: 0, Busy: 2, Idle: 0"}

And the first message after that fixes it:

{"level":"info","ts":"2026-06-03T12:19:44+03:00","logger":"runner-message-processor-172","msg":"Runner Set Statistics - Available: 0, Acquired: 0, Assigned: 2, Running: 2, Registered Runners: 2, Busy: 2, Idle: 0"}

I suspect that there is an issue with the initial reporting

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ha. So there is potentially an issue with under provisioning but it is being masked by something else going on

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@spikeburton you can check my latest commit.
In some cases this is being reportly as it should be. In some not. So I decided to do some math (registered should equal busy and idle)

}
r.logger.Infof("reconciliation: found %d existing VM(s) to reconcile", len(vms))

for _, vm := range vms {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should run r.reconcileVM in parallel (e.g. using a wait group). Especially as there is a 10s. SSH timeout (imagine there are some dead / unreachable VMs). Running all of these in serial is going to take some time to reconcile.

We can always improve this in the future if we need to as well

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call.
I am going to look into this in another PR

@ispasov ispasov merged commit 9dd636f into main Jun 4, 2026
2 checks passed
@ispasov ispasov deleted the is/vm-reconciliation branch June 4, 2026 05:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants