From 7421d92b99df426cff6a730fcc5d86eb7220a8a9 Mon Sep 17 00:00:00 2001 From: Shubham Malik Date: Thu, 18 Jun 2026 10:25:48 +0530 Subject: [PATCH 1/2] docs: add macOS launchd troubleshooting guide --- docs/launchd-troubleshooting.md | 141 ++++++++++++++++++++++++++++++++ 1 file changed, 141 insertions(+) create mode 100644 docs/launchd-troubleshooting.md diff --git a/docs/launchd-troubleshooting.md b/docs/launchd-troubleshooting.md new file mode 100644 index 0000000..b5b45a0 --- /dev/null +++ b/docs/launchd-troubleshooting.md @@ -0,0 +1,141 @@ +# launchd Troubleshooting (macOS) + +Command reference for the Dev Machine Guard launchd job (label +`com.stepsecurity.agent`). + +**How periodic runs work:** the MDM loader installs a launchd plist with a +`StartInterval` (default 4h). Each tick launchd re-runs the **loader script**, +which auto-updates the binary, then runs `send-telemetry`. `RunAtLoad` is +`false`, so loading the plist (login / boot / install) never triggers a scan — +only the interval does. The one-off initial scan runs explicitly at install +time. To force an out-of-cycle run, `kickstart` it (or run the loader by hand). + +## Variants + +Almost always a per-user **LaunchAgent** running as the console user — that's +what the MDM loader installs. A root **LaunchDaemon** exists only if someone ran +`sudo install` directly; check for it, but don't expect it. + +| | Per-user **LaunchAgent** (expected) | Root **LaunchDaemon** (rare) | +| ------- | ----------------------------------------------------- | ----------------------------------------------------- | +| Plist | `~/Library/LaunchAgents/com.stepsecurity.agent.plist` | `/Library/LaunchDaemons/com.stepsecurity.agent.plist` | +| Domain | `gui/$(id -u)` | `system` | +| Runs as | console user | root | +| Logs | `~/.stepsecurity/agent.log`, `agent.error.log` | `/var/log/stepsecurity/agent.log`, `agent.error.log` | +| `sudo` | no | yes (use the `system` domain) | + +Loader-managed (MDM, auto-updates) vs binary-managed (manual `install`) — tell +them apart by what the plist runs: + +```bash +plutil -p "$PLIST" | grep -A4 ProgramArguments +# /bin/bash …/stepsecurity-loader.sh send-telemetry -> loader-managed (auto-updates each tick) +# …/stepsecurity-dev-machine-guard send-telemetry -> binary-managed (no auto-update) +``` + +## Setup + +```bash +LABEL=com.stepsecurity.agent + +# Expected: per-user LaunchAgent +DOMAIN="gui/$(id -u)" +PLIST="$HOME/Library/LaunchAgents/$LABEL.plist" +LOGDIR="$HOME/.stepsecurity" + +# Check whether a root LaunchDaemon is also present (rare). If it is, redo with +# sudo and: DOMAIN=system PLIST=/Library/LaunchDaemons/$LABEL.plist LOGDIR=/var/log/stepsecurity +ls -la "$HOME/Library/LaunchAgents/$LABEL.plist" 2>&1 +ls -la "/Library/LaunchDaemons/$LABEL.plist" 2>&1 +``` + +## Status + +```bash +launchctl list | grep stepsec # loaded? PID + last exit +launchctl list "$LABEL" # one-job summary +launchctl print "$DOMAIN/$LABEL" # full state, schedule, last exit +launchctl print-disabled "$DOMAIN" | grep stepsec # disabled override? (loads but never runs) +launchctl enable "$DOMAIN/$LABEL" # clear a disable override +``` + +## Inspect plist + +```bash +plutil -p "$PLIST" # readable dump +plutil -lint "$PLIST" # validate XML +plutil -p "$PLIST" | grep -A4 ProgramArguments # loader script vs binary (see Variants) +/usr/libexec/PlistBuddy -c "Print :StartInterval" "$PLIST" # seconds (14400 = 4h) +/usr/libexec/PlistBuddy -c "Print :EnvironmentVariables" "$PLIST" # baked HOME / STEPSECURITY_HOME +``` + +## Config & version + +```bash +cat "$HOME/.stepsecurity/config.json" # effective config (contains api_key) +cat "$HOME/.stepsecurity/.current_version" # version the loader last installed +"$HOME/.stepsecurity/bin/stepsecurity-dev-machine-guard" --version # running binary version +ls -la "$HOME/.stepsecurity" "$HOME/.stepsecurity/bin" # owner should be the console user, not root +``` + +## Logs + +```bash +tail -n 100 "$LOGDIR/agent.log" # scheduled-run stdout +tail -n 100 "$LOGDIR/agent.error.log" # scheduled-run stderr (rotates to .prev at 5 MiB) +tail -f "$LOGDIR"/agent.log "$LOGDIR"/agent.error.log # watch live +tail -n 50 "$HOME/.stepsecurity/ai-agent-hook-errors.jsonl" # AI-agent hook errors +stat -f '%Sm' "$LOGDIR/agent.log" # last scheduled-run time +log show --predicate 'process == "launchd"' --last 2h | grep -i stepsec # launchd's own view +``` + +## Force a run + +```bash +launchctl kickstart -k "$DOMAIN/$LABEL" # run now (-k restarts if in-flight) +/bin/bash "$HOME/.stepsecurity/bin/stepsecurity-loader.sh" send-telemetry # loader by hand (update + scan) +``` + +## Reload (after editing the plist) + +```bash +launchctl bootout "$DOMAIN/$LABEL" 2>/dev/null +launchctl bootstrap "$DOMAIN" "$PLIST" +launchctl print "$DOMAIN/$LABEL" | head -20 +``` + +`config.json` changes need no reload — they're read at run time; just `kickstart`. +(The loader logs `launchctl load`/`unload`; the modern verbs above work regardless.) + +## Uninstall + +```bash +/bin/bash "$HOME/.stepsecurity/bin/stepsecurity-loader.sh" uninstall # loader-managed (MDM) +"$HOME/.stepsecurity/bin/stepsecurity-dev-machine-guard" uninstall # binary-managed + +# Manual fallback: +launchctl bootout "$DOMAIN/$LABEL" 2>/dev/null || launchctl unload "$PLIST" 2>/dev/null +rm -f "$PLIST" + +# Verify +launchctl list | grep stepsec # expect no output +ls -la "$PLIST" 2>&1 # expect not found +rm -rf "$HOME/.stepsecurity" # wipe local state (optional) +``` + +## Reinstall + +```bash +/bin/bash "$HOME/.stepsecurity/bin/stepsecurity-loader.sh" install # or re-push loader via MDM +launchctl print "$DOMAIN/$LABEL" | grep -iE 'state|last exit' +launchctl kickstart -k "$DOMAIN/$LABEL" && tail -n 20 "$LOGDIR/agent.log" +``` + +## Gotchas + +- **config.json is rewritten every tick.** The loader's `write_config()` keeps only a fixed set (customer_id, api_endpoint, api_key, scan_frequency_hours + optional install_dir / max_execution_duration / scan toggles); any other hand-edited or profile-pushed field (e.g. `include_tcc_protected`) is wiped within one interval. Make it stick by editing the loader heredoc before deploy. +- **Runs only in a live GUI session.** No console user (login window, headless, SSH) → not loaded, won't fire; the loader's initial run errors `no_user`, and `launchctl … gui/` over SSH can return `Bootstrap failed: 5`. +- **TCC prompts are real.** It runs in the user's GUI session, so scanning Documents/Downloads/etc. pops permission dialogs; skipped by default. Grant Full Disk Access (PPPC profile), then set `include_tcc_protected`. +- **A wedged run blocks every tick.** The binary's lock file makes overlapping runs exit; a hung run holds the lock until the loader SIGKILLs processes older than `MAX_PROCESS_AGE_HOURS` on a later tick. Self-heals, but loses up to that window. +- **`StartInterval` quirks.** Missed fires during sleep coalesce into one run on wake; the timer also restarts on each load/login, so short sessions on a long interval can starve it. +- **`Bootstrap failed: 5`** most often means already loaded — `bootout` first, then `bootstrap`. From e21c0ef4f5bc306975d5143159645297cf5f4ea8 Mon Sep 17 00:00:00 2001 From: Shubham Malik Date: Thu, 18 Jun 2026 13:50:33 +0530 Subject: [PATCH 2/2] docs: enhance launchd troubleshooting guide with detailed scheduling information --- docs/launchd-troubleshooting.md | 39 +++++++++++++++++++++++++++++++-- 1 file changed, 37 insertions(+), 2 deletions(-) diff --git a/docs/launchd-troubleshooting.md b/docs/launchd-troubleshooting.md index b5b45a0..4ff6f20 100644 --- a/docs/launchd-troubleshooting.md +++ b/docs/launchd-troubleshooting.md @@ -10,11 +10,46 @@ which auto-updates the binary, then runs `send-telemetry`. `RunAtLoad` is only the interval does. The one-off initial scan runs explicitly at install time. To force an out-of-cycle run, `kickstart` it (or run the loader by hand). +## Scheduling: `RunAtLoad` and `StartInterval` + +`RunAtLoad` controls one thing — whether launchd runs the job **once, +immediately, the moment the plist loads**: + +- `` — runs as soon as the job loads. "Load" means boot (LaunchDaemon) / + login (LaunchAgent), **and** every manual `launchctl bootstrap` / `load`. So a + LaunchAgent would re-run on every login and every reload. +- `` (our setting, and the default) — does **not** run at load. The job + sits idle until another trigger fires it: here that's `StartInterval`, or a + manual `launchctl kickstart`. + +```xml +StartInterval +14400 +RunAtLoad + +``` + +The cadence therefore comes entirely from `StartInterval`. `RunAtLoad=false` +avoids a redundant scan on every login/reboot (and a fleet-wide boot-time +stampede); the installer instead runs one explicit `send-telemetry` at install, +then lets the interval pace the rest. + +**Consequence:** after a `bootstrap` / `load` / reload, **nothing runs on its +own** — use `kickstart` (see Force a run) to trigger a scan immediately. +(`RunAtLoad` is a one-shot-at-load trigger, unrelated to `KeepAlive`, which +continuously restarts a long-running daemon — a short-lived scan uses neither.) + ## Variants Almost always a per-user **LaunchAgent** running as the console user — that's -what the MDM loader installs. A root **LaunchDaemon** exists only if someone ran -`sudo install` directly; check for it, but don't expect it. +what the loader installs. The loader (and every version-specific loader script) +**never** creates a root daemon: even when MDM runs it as root it resolves the +console user and installs a per-user LaunchAgent, and aborts (`no_user`) if no +one is logged in rather than falling back to root. A root **LaunchDaemon** under +`/Library/LaunchDaemons/` only appears from a **legacy (≤1.8.x) agent script** +installed as root (pre-loader), or a manual `sudo install` (the Go +binary's installer has a root path the loader never invokes). Check for one — to +clean up a leftover — but current tooling won't create it. | | Per-user **LaunchAgent** (expected) | Root **LaunchDaemon** (rare) | | ------- | ----------------------------------------------------- | ----------------------------------------------------- |