macos_run_puppet: make bootstrap reboot-survivable via LaunchDaemon#1210

Open
rcurranmoz wants to merge 2 commits into master from reboot-survivable-bootstrap
@rcurranmoz
Contributor

Summary

Fixes #1206.

Today, bootstrapping a fresh M4 worker needs an SSH/Bolt session that babysits the run across every reboot: the TCC.db reboot, any MDM-driven OS upgrade, and the post-puppet reboot. We walked through this with macmini-m4-130..149 on 2026-05-12 and spent ~3 hours nursing it.

This PR has run-puppet.sh register a self-removing LaunchDaemon (com.mozilla.ronin-puppet-bootstrap) on first invocation. The LaunchDaemon re-runs the script on every boot until:

  1. Puppet apply succeeds cleanly (the existing while/run_puppet loop already gates this); and
  2. Puppet's regular at-boot mechanism (com.mozilla.atboot_puppet) is installed, indicating the host is fully managed.

Once both hold, the script writes /var/tmp/semaphore/run-buildbot, then unloads and removes its own LaunchDaemon. The host is then in the pool, and future boots are handled by the regular puppet at-boot mechanism with no overlap.
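
A minimal sketch of that exit gate, for illustration only (it is not the code in this PR). It assumes the at-boot LaunchDaemon lives at /Library/LaunchDaemons/com.mozilla.atboot_puppet.plist (the description gives only its label) and that the caller has already seen a clean puppet apply. Paths are overridable so the logic can be exercised off-host, and removing the plist before asking launchd to drop the job is this sketch's own ordering choice.

```shell
#!/bin/sh
# Illustrative sketch. Defaults follow the paths named in the PR description;
# ATBOOT_PLIST is an assumed location for the com.mozilla.atboot_puppet job.
BOOTSTRAP_LABEL="com.mozilla.ronin-puppet-bootstrap"
BOOTSTRAP_PLIST="${BOOTSTRAP_PLIST:-/Library/LaunchDaemons/${BOOTSTRAP_LABEL}.plist}"
ATBOOT_PLIST="${ATBOOT_PLIST:-/Library/LaunchDaemons/com.mozilla.atboot_puppet.plist}"
SEMAPHORE_DIR="${SEMAPHORE_DIR:-/var/tmp/semaphore}"

finalize_bootstrap() {
    # Condition 1 (clean puppet apply) is assumed to hold when this is called.
    mkdir -p "$SEMAPHORE_DIR"
    touch "$SEMAPHORE_DIR/run-buildbot"

    # Condition 2: keep the bootstrap job unless puppet's own at-boot job
    # exists, so the host is never left without a puppet trigger.
    if [ ! -f "$ATBOOT_PLIST" ]; then
        echo "atboot_puppet not installed; keeping bootstrap LaunchDaemon" >&2
        return 0
    fi

    if [ -f "$BOOTSTRAP_PLIST" ]; then
        # Delete the plist first so it is gone even if launchd stops this
        # script while dropping the job.
        rm -f "$BOOTSTRAP_PLIST"
        if command -v launchctl >/dev/null 2>&1; then
            launchctl remove "$BOOTSTRAP_LABEL" >/dev/null 2>&1 || true
        fi
    fi
    return 0
}
```

Calling `finalize_bootstrap` on a host without the at-boot plist writes the semaphore but leaves the bootstrap job in place, so the next boot retries.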

Implementation

Two helpers in modules/macos_run_puppet/files/run-puppet.sh:

  • install_bootstrap_launchd: copies the script to /usr/local/sbin/ronin-puppet-bootstrap.sh and writes /Library/LaunchDaemons/com.mozilla.ronin-puppet-bootstrap.plist. No-op if puppet's at-boot LaunchDaemon is already in place.
  • finalize_bootstrap: writes /var/tmp/semaphore/run-buildbot, then (only if puppet's at-boot LaunchDaemon exists) unloads and removes the bootstrap LaunchDaemon.
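
For orientation, here is a rough sketch of what the install helper could look like. The plist keys (`Label`, `ProgramArguments`, `RunAtLoad`) are standard launchd keys, but the exact plist this PR writes is not shown in the description, so treat the contents, the assumed at-boot plist path, and the overridable path variables as illustrative assumptions.

```shell
#!/bin/sh
# Illustrative sketch, not the PR's code. Defaults follow the paths in the
# description; ATBOOT_PLIST is an assumed location for com.mozilla.atboot_puppet.
BOOTSTRAP_LABEL="com.mozilla.ronin-puppet-bootstrap"
BOOTSTRAP_SCRIPT="${BOOTSTRAP_SCRIPT:-/usr/local/sbin/ronin-puppet-bootstrap.sh}"
BOOTSTRAP_PLIST="${BOOTSTRAP_PLIST:-/Library/LaunchDaemons/${BOOTSTRAP_LABEL}.plist}"
ATBOOT_PLIST="${ATBOOT_PLIST:-/Library/LaunchDaemons/com.mozilla.atboot_puppet.plist}"

# Usage: install_bootstrap_launchd /path/to/run-puppet.sh
install_bootstrap_launchd() {
    # No-op on hosts that puppet already manages at boot.
    if [ -f "$ATBOOT_PLIST" ]; then
        return 0
    fi

    mkdir -p "$(dirname "$BOOTSTRAP_SCRIPT")" "$(dirname "$BOOTSTRAP_PLIST")"

    # Copy only when the installed copy is missing or differs; this also
    # covers re-runs where the script being executed is the installed copy.
    cmp -s "$1" "$BOOTSTRAP_SCRIPT" 2>/dev/null || cp "$1" "$BOOTSTRAP_SCRIPT"
    chmod 755 "$BOOTSTRAP_SCRIPT"

    # RunAtLoad makes launchd start the job when it loads the plist at boot.
    cat > "$BOOTSTRAP_PLIST" <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>Label</key>
    <string>${BOOTSTRAP_LABEL}</string>
    <key>ProgramArguments</key>
    <array>
        <string>${BOOTSTRAP_SCRIPT}</string>
    </array>
    <key>RunAtLoad</key>
    <true/>
</dict>
</plist>
EOF
    # The sketch assumes it runs as root, so the plist ends up root-owned.
    chmod 644 "$BOOTSTRAP_PLIST"
}
```

Because the helper rewrites the same two files on every call, running it again on a later boot is harmless.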

install_bootstrap_launchd is called after the existing role-file / puppet-binary preconditions, so a host without /etc/puppet_role won't install the LaunchDaemon and start a reboot loop. finalize_bootstrap is called at the existing exit 0 after the puppet retry loop breaks.
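
The call order described above can be sketched as follows. The three helpers are stand-ins that only print their names, `bootstrap_flow` and `RETRY_DELAY` are hypothetical names, and /opt/puppetlabs/bin/puppet is an assumed location for the puppet binary; only /etc/puppet_role comes from the description.

```shell
#!/bin/sh
# Illustrative ordering sketch with stub helpers; not the PR's code.
ROLE_FILE="${ROLE_FILE:-/etc/puppet_role}"
PUPPET_BIN="${PUPPET_BIN:-/opt/puppetlabs/bin/puppet}"   # assumed path

# Stand-ins so the ordering can be run anywhere.
install_bootstrap_launchd() { echo "install_bootstrap_launchd"; }
run_puppet() { echo "run_puppet"; }
finalize_bootstrap() { echo "finalize_bootstrap"; }

bootstrap_flow() {
    # Preconditions come first: a host with no role must not register a
    # job that would re-run on every boot with nothing to apply.
    if [ ! -s "$ROLE_FILE" ]; then
        echo "missing role file: $ROLE_FILE" >&2
        return 1
    fi
    if [ ! -x "$PUPPET_BIN" ]; then
        echo "missing puppet binary: $PUPPET_BIN" >&2
        return 1
    fi

    install_bootstrap_launchd

    # Stand-in for the existing retry loop: repeat until a clean apply.
    while ! run_puppet; do
        sleep "${RETRY_DELAY:-60}"
    done

    finalize_bootstrap
}
```

With the role file absent, `bootstrap_flow` returns before the install step, which is the property the paragraph above relies on.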

Operational impact

Before: orchestrator must SSH-and-wait across N reboots.
After: a single MDM script-job (or bolt run) kicks run-puppet.sh and walks away.

Test plan

  • Fresh M4 host: invoke run-puppet.sh once; observe that it hits the TCC.db reboot trigger, comes back, applies cleanly, writes run-buildbot, and removes its own LaunchDaemon
  • Fresh M4 host where MDM drives a macOS upgrade between first puppet apply and second: confirm the LaunchDaemon survives the upgrade reboot and resumes
  • Already-bootstrapped host (atboot_puppet in place): invoke the script — confirm it does NOT install the bootstrap LaunchDaemon
  • Force a puppet apply failure (rm a required file): confirm the bootstrap LaunchDaemon stays in place and retries on next boot
  • Kitchen mac suites: confirm no regression (running_in_test_kitchen fact path is unchanged)
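
When working through these cases by hand, a small status helper along the following lines might make the expected end state easier to check. `bootstrap_state` and its state names are hypothetical, and the at-boot plist path is an assumption (the description gives only the label).

```shell
#!/bin/sh
# Hypothetical helper for manual verification; not part of the PR.
BOOTSTRAP_PLIST="${BOOTSTRAP_PLIST:-/Library/LaunchDaemons/com.mozilla.ronin-puppet-bootstrap.plist}"
ATBOOT_PLIST="${ATBOOT_PLIST:-/Library/LaunchDaemons/com.mozilla.atboot_puppet.plist}"   # assumed path
SEMAPHORE_DIR="${SEMAPHORE_DIR:-/var/tmp/semaphore}"

bootstrap_state() {
    if [ -f "$BOOTSTRAP_PLIST" ]; then
        # Bootstrap job still registered: the host will retry on next boot.
        echo "bootstrapping"
    elif [ -f "$ATBOOT_PLIST" ] && [ -f "$SEMAPHORE_DIR/run-buildbot" ]; then
        # Bootstrap job gone, at-boot job present, semaphore written.
        echo "finalized"
    elif [ -f "$ATBOOT_PLIST" ]; then
        echo "managed-no-semaphore"
    else
        echo "not-started"
    fi
}
```

Under these assumptions, the first two cases should end in `finalized`, and the forced-failure case should stay at `bootstrapping` across reboots.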

Related

Pairs naturally with #1208 (TCC.db cltbld-session gate) — together they'd eliminate the babysitter pattern entirely.

🤖 Generated with Claude Code

Today, bootstrapping a fresh M4 worker requires an external SSH session
that babysits the run across at least two reboots (TCC.db detection +
final post-puppet reboot), and a third if MDM drives an OS upgrade
mid-bootstrap.

This commit adds a self-registering LaunchDaemon
(com.mozilla.ronin-puppet-bootstrap) so run-puppet.sh fires on every
boot until two conditions are met:

  1. Puppet apply has succeeded cleanly (existing `while/run_puppet`
     loop already gates this)
  2. Puppet's regular at-boot mechanism (com.mozilla.atboot_puppet) is
     installed

When both are satisfied, the script writes /var/tmp/semaphore/run-buildbot
so generic-worker can start, then unloads and removes its own
LaunchDaemon. Future boots are handled by the regular puppet at-boot
mechanism with no overlap.

Two helpers added:

- `install_bootstrap_launchd`: copies the script to
  /usr/local/sbin/ronin-puppet-bootstrap.sh and writes
  /Library/LaunchDaemons/com.mozilla.ronin-puppet-bootstrap.plist. No-op
  if puppet's at-boot LaunchDaemon is already present.

- `finalize_bootstrap`: writes the run-buildbot semaphore and removes
  the bootstrap LaunchDaemon (only after confirming puppet's at-boot
  mechanism is in place, so we don't leave the host with no puppet
  trigger).

`install_bootstrap_launchd` is called once role/puppet/facter
preconditions are confirmed; `finalize_bootstrap` is called at the
existing `exit 0` after the puppet retry loop has broken.

Result: an MDM script-job (or one-shot `bolt run`) can kick the
bootstrap and walk away. The host finishes provisioning across however
many reboots are needed.

Fixes #1206

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>