Expose generic-worker status + pool ID via snmpd extend (Linux) #1216

Merged
markcor merged 3 commits into master from RELOPS-snmp-gw-pool-id-linux on May 20, 2026
@markcor
Contributor

@markcor markcor commented May 15, 2026

Summary

Adds two snmpd extend-backed scripts so marlin can poll Linux workers over SNMP for:

  • gw_status — whether generic-worker is running (pgrep -f /generic-worker).
  • worker_pool_id — the worker's pool ID (workerType) read from /etc/generic-worker.config.
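
For illustration, a gw_status check along these lines could look like the sketch below. It is not the script from this PR: the function name check_gw, the CRITICAL and UNKNOWN wording, and the Nagios-style return codes (0/2/3) are assumptions. Only the pgrep -f /generic-worker probe and the OK line come from this conversation.

```shell
# Sketch of a gw_status probe (hypothetical; not the PR's snmp_check_gw.sh).
check_gw() {
  local rc=0
  # pgrep exits 0 on a match, 1 when nothing matches, and >1 on errors.
  pgrep -f /generic-worker >/dev/null 2>&1 || rc=$?
  case "$rc" in
    0) echo "OK - generic-worker running"; return 0 ;;
    # Assumed wording and codes below; the source only shows the OK line.
    1) echo "CRITICAL - generic-worker not running"; return 2 ;;
    *) echo "UNKNOWN - pgrep failed"; return 3 ;;
  esac
}
```

As a standalone script the body would end with `check_gw; exit $?`. Because -f matches against the full command line, any process whose arguments contain /generic-worker also counts as a match.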

Both pieces are consumed by the existing marlin Status/Metrics dashboards (RelSRE → FXCI Hardware Workers → Linux). Windows already exposes equivalents via NSCP/NRPE (PR mozilla-it/marlin#16); this PR brings Linux to parity.

Files

  • modules/linux_snmpd/files/snmp_check_gw.sh (new)
  • modules/linux_snmpd/files/snmp_worker_pool_id.sh (new)
  • modules/linux_snmpd/manifests/init.pp — deploys the two scripts to /usr/local/bin/
  • modules/linux_snmpd/templates/snmpd.conf.erb — adds two extend lines
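
One way to confirm the rendered template after a puppet run is to grep for both entries. The sketch below assumes the net-snmp `extend NAME PROG` directive form and the Debian/Ubuntu default path /etc/snmp/snmpd.conf. The helper names has_extend and check_extends are hypothetical, and the exact lines in snmpd.conf.erb may differ.

```shell
# Expected entries (assumed form: extend NAME PROG):
#   extend gw_status      /usr/local/bin/snmp_check_gw.sh
#   extend worker_pool_id /usr/local/bin/snmp_worker_pool_id.sh

# Succeeds if the config has an extend line for the given name and program.
has_extend() {
  local conf="$1" name="$2" prog="$3"
  grep -Eq "^[[:space:]]*extend[[:space:]]+${name}[[:space:]]+${prog}([[:space:]]|\$)" "$conf"
}

# Reports each missing entry; returns 0 only when both are present.
check_extends() {
  local conf="${1:-/etc/snmp/snmpd.conf}"   # assumed default location
  local missing=0
  has_extend "$conf" gw_status /usr/local/bin/snmp_check_gw.sh \
    || { echo "missing extend: gw_status"; missing=1; }
  has_extend "$conf" worker_pool_id /usr/local/bin/snmp_worker_pool_id.sh \
    || { echo "missing extend: worker_pool_id"; missing=1; }
  return "$missing"
}
```

Running `check_extends` with no argument checks the assumed default path; pass a different file to check a rendered copy elsewhere.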

Companion PRs

  • marlin: mozilla-it/marlin#17 — adds the Icinga service definitions, the wrapper script that snmpgets the worker_pool_id and writes to InfluxDB's host_pool measurement, and the Mac counterpart in marlin.
  • Mac (separate ronin PR, coming next): new `macos_snmpd` module using `packages::macos_package_from_s3` to install net-snmp from the standard S3 bucket.

Test plan

  • After merge + a puppet run on a Linux moonshot host: /usr/local/bin/snmp_check_gw.sh and /usr/local/bin/snmp_worker_pool_id.sh exist and exit 0
  • From marlin1: snmpget -v2c -c <community> -O qv <host> 'NET-SNMP-EXTEND-MIB::nsExtendOutputFull."gw_status"' returns OK - generic-worker running
  • Same query for worker_pool_id returns OK - worker_pool_id=<pool>
  • In IcingaWeb2 (once mozilla-it/marlin#17 is also merged) services Linux Generic Worker + Linux Worker Pool ID appear OK
  • Flux query against marlin-icinga2 returns host_pool records for t-linux64-ms-* hostnames
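
The two SNMP checks in the test plan could be wrapped roughly as follows. This is a sketch, not the marlin wrapper: extend_oid and poll_extend are hypothetical helper names, and the community string is a placeholder supplied by the caller. The snmpget flags and the OID form are the ones quoted above.

```shell
# Build the NET-SNMP-EXTEND-MIB OID for a named extend entry.
extend_oid() {
  printf 'NET-SNMP-EXTEND-MIB::nsExtendOutputFull."%s"' "$1"
}

# Query one extend entry on a host; prints whatever snmpget returns.
poll_extend() {
  local host="$1" community="$2" name="$3"
  snmpget -v2c -c "$community" -O qv "$host" "$(extend_oid "$name")"
}
```

For example, `poll_extend t-linux64-ms-005 "$COMMUNITY" gw_status` followed by the same call with `worker_pool_id`, where COMMUNITY is an assumed environment variable holding the community string.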

Adds two scripts that marlin/Icinga can poll over SNMP to learn:
- whether generic-worker is running on the Linux worker
- the worker's pool ID (workerType) from /etc/generic-worker.config

Linux (linux_snmpd module):
- New scripts in modules/linux_snmpd/files/:
  snmp_check_gw.sh, snmp_worker_pool_id.sh
- manifests/init.pp deploys them to /usr/local/bin/
- snmpd.conf.erb adds two `extend` lines so marlin can poll them via
  NET-SNMP-EXTEND-MIB::nsExtendOutputFull."gw_status" and
  NET-SNMP-EXTEND-MIB::nsExtendOutputFull."worker_pool_id"

This complements marlin PR mozilla-it/marlin#17, which adds the SNMP-side
service definitions and the wrapper script that writes the polled pool ID
to the shared host_pool InfluxDB measurement. The downstream consumer is
the Grafana Status dashboards under RelSRE > FXCI Hardware Workers > Linux.

The Mac equivalent (macos_snmpd) is split into a separate PR so it can
include the net-snmp package install pattern (packages::macos_package_from_s3).
@aerickson
Member

This branch has been deployed to ms005 (18.04) and ms020 (24.04) via override.

Puppet converges fine. Will monitor and ensure they start appearing on marlin.

@aerickson
Member

/usr/local/bin/snmp_worker_pool_id.sh is looking at the wrong path. I think it should check out the worker-runner config (workerPoolID), because the g-w config is generated just-in-time.

[aerickson@t-linux64-ms-005 snmp]$ /usr/local/bin/snmp_worker_pool_id.sh
UNKNOWN - /etc/generic-worker.config not found
[aerickson@t-linux64-ms-005 snmp]$ cat /etc/start-worker.yml
cacheOverRestarts: true
getSecrets: false
provider:
    providerType: standalone
    rootURL: "https://firefox-ci-tc.services.mozilla.com"
    clientID: "project/releng/generic-worker/datacenter-gecko-t-linux"
    accessToken: "REDACTED"
    workerPoolID: "releng-hardware/gecko-t-linux-talos-1804"
    workerGroup: "mdc1"
    workerID: "t-linux64-ms-005"
worker:
    implementation: generic-worker
    path: /usr/local/bin/generic-worker
    configPath: /home/cltbld/generic-worker.config
workerConfig:
    cachesDir:                        "/home/cltbld/caches"
    certificate:                      ""
    checkForNewDeploymentEverySecs:   0
    cleanUpTaskDirs:                  true
    ed25519SigningKeyLocation:        "/home/cltbld/generic-worker.ed25519.signing.key"
    idleTimeoutSecs:                  345600
    livelogExecutable:                "/usr/local/bin/livelog"
    numberOfTasksToRun:               1
    provisionerId:                    "releng-hardware"
    publicIP:                         ""
    requiredDiskSpaceMegabytes:       10240
    sentryProject:                    "generic-worker"
    shutdownMachineOnIdle:            false
    shutdownMachineOnInternalError:   false
    taskclusterProxyExecutable:       "/usr/local/bin/taskcluster-proxy"
    taskclusterProxyPort:             8080
    tasksDir:                         "/home/cltbld/tasks"
    workerType:                       "gecko-t-linux-talos-1804"
    wstAudience:                      "firefoxcitc"
    wstServerURL:                     "https://firefoxci-websocktunnel.services.mozilla.com/"
[aerickson@t-linux64-ms-005 snmp]$

Per @aerickson's review on #1216: /etc/generic-worker.config is generated
just-in-time per task, so it's not reliably present when snmpd polls the
extend script. The worker-runner config at /etc/start-worker.yml is
always present and contains workerPoolID under the provider block.

The value is typically formatted as "<provisionerId>/<workerType>" (e.g.
"releng-hardware/gecko-t-linux-talos-1804"); strip the provisionerId
prefix so the pool name alone is exposed, matching the
Windows/host_pool semantics where pool == workerType.

Verified against the t-linux64-ms-005 config sample in the review.
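
The lookup and prefix strip described in this commit message can be sketched as below. This is an approximation under assumptions, not the committed script: the function name worker_pool_id_status, the return code 3 for UNKNOWN, and the "not set" message are hypothetical. The config path, the workerPoolID key, and the OK output format come from this thread.

```shell
# Sketch: read workerPoolID from the worker-runner config and expose
# only the part after the "<provisionerId>/" prefix.
worker_pool_id_status() {
  local config="${1:-/etc/start-worker.yml}"
  local raw pool
  if [ ! -r "$config" ]; then
    echo "UNKNOWN - $config not found"
    return 3   # assumed Nagios-style UNKNOWN code
  fi
  # Take the first workerPoolID line; this is a line match, not a YAML parse.
  raw=$(awk '
    /^[ \t]*workerPoolID[ \t]*:/ {
      sub(/^[^:]*:[ \t]*/, "")   # drop the key and the colon
      sub(/[ \t]+$/, "")         # drop trailing whitespace
      gsub(/"/, "")              # drop the double quotes
      print
      exit
    }' "$config")
  if [ -z "$raw" ]; then
    echo "UNKNOWN - workerPoolID not set in $config"   # assumed wording
    return 3
  fi
  pool="${raw##*/}"   # strip everything up to the last slash, if any
  echo "OK - worker_pool_id=$pool"
}
```

A value without a slash passes through unchanged, which covers configs where workerPoolID is not in the "<provisionerId>/<workerType>" form.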
markcor added a commit that referenced this pull request May 19, 2026
Mirrors the fix in the Linux PR (#1216)
after @aerickson found that /etc/generic-worker.config is generated
just-in-time per task and isn't reliably present at snmpd poll time.

Mac's worker_runner module writes /opt/worker/worker-runner-config.yaml
(see modules/worker_runner/manifests/init.pp: $data_dir defaults to
/opt/worker and $worker_runner_conf = "${data_dir}/worker-runner-config.yaml").
Read workerPoolID from that file instead, and strip the "<provisionerId>/"
prefix to expose just the pool name (matches Windows/host_pool semantics).
@markcor
Contributor Author

markcor commented May 19, 2026

Good catch — pushed the fix in 2c75a3d3. Switched to reading /etc/start-worker.yml and pulling workerPoolID out of the provider block. The value comes back as <provisionerId>/<workerType> (e.g. releng-hardware/gecko-t-linux-talos-1804), so the script also strips the provisionerId prefix so the exposed pool string matches what Windows/host_pool already use (pool == workerType).

Same fix applied to the Mac PR (#1217) — that one points at /opt/worker/worker-runner-config.yaml (where the worker_runner module writes the equivalent file on Mac).

If you can confirm /usr/local/bin/snmp_worker_pool_id.sh returns OK - worker_pool_id=gecko-t-linux-talos-1804 on ms005 after the next puppet run, I think we're good.

@aerickson
Member

ms005 looks good.

root@t-linux64-ms-005:~# /usr/local/bin/snmp_check_gw.sh
OK - generic-worker running
root@t-linux64-ms-005:~# /usr/local/bin/snmp_worker_pool_id.sh
OK - worker_pool_id=gecko-t-linux-talos-1804
root@t-linux64-ms-005:~#

Will check out 020 and marlin.

@aerickson
Member

020 is good.

@markcor markcor requested a review from aerickson May 20, 2026 17:17
@markcor markcor merged commit f31351e into master May 20, 2026
15 checks passed