Expose generic-worker status + pool ID via snmpd extend (Linux) #1216
Adds two scripts that marlin/Icinga can poll over SNMP to learn:
- whether generic-worker is running on the Linux worker
- the worker's pool ID (workerType) from /etc/generic-worker.config

Linux (linux_snmpd module):
- New scripts in modules/linux_snmpd/files/: snmp_check_gw.sh, snmp_worker_pool_id.sh
- manifests/init.pp deploys them to /usr/local/bin/
- snmpd.conf.erb adds two `extend` lines so marlin can poll them via NET-SNMP-EXTEND-MIB::nsExtendOutputFull."gw_status" and NET-SNMP-EXTEND-MIB::nsExtendOutputFull."worker_pool_id"

This complements marlin PR mozilla-it/marlin#17, which adds the SNMP-side service definitions and the wrapper script that writes the polled pool ID to the shared host_pool InfluxDB measurement. The downstream consumer is the Grafana Status dashboards under RelSRE > FXCI Hardware Workers > Linux.

The Mac equivalent (macos_snmpd) is split into a separate PR so it can include the net-snmp package install pattern (packages::macos_package_from_s3).
This branch has been deployed to ms005 (18.04) and ms020 (24.04) via override. Puppet converges fine. Will monitor and ensure they start appearing on marlin.
/usr/local/bin/snmp_worker_pool_id.sh is looking at the wrong path. I think it should check out the worker-runner config (
Per @aerickson's review on #1216: /etc/generic-worker.config is generated just-in-time per task, so it's not reliably present when snmpd polls the extend script. The worker-runner config at /etc/start-worker.yml is always present and contains workerPoolID under the provider block. The value is typically formatted as "<provisionerId>/<workerType>" (e.g. "releng-hardware/gecko-t-linux-talos-1804"); strip the provisionerId prefix so the pool name alone is exposed, matching the Windows/host_pool semantics where pool == workerType. Verified against the t-linux64-ms-005 config sample in the review.
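A minimal sketch of the extraction this commit message describes, not the script that was actually committed. It assumes the `workerPoolID:` key sits on a single line of /etc/start-worker.yml with an optionally quoted value; the function names `extract_pool_id` and `pool_id_status`, the `UNKNOWN` message, and the return code 3 are hypothetical (the last follows the usual Nagios plugin convention). Only the `OK - worker_pool_id=<pool>` output format comes from the test plan.

```shell
#!/bin/sh
# Sketch only: function names and the UNKNOWN branch are assumptions,
# not taken from the committed snmp_worker_pool_id.sh.

# Print the pool name found in the YAML file given as $1 (empty if none).
extract_pool_id() {
    # Take the first "workerPoolID:" line and drop the key.
    raw=$(sed -n 's/^[[:space:]]*workerPoolID:[[:space:]]*//p' "$1" 2>/dev/null | head -n 1)
    # Remove optional quotes and trailing whitespace around the value.
    raw=$(printf '%s' "$raw" | tr -d "\"'" | sed 's/[[:space:]]*$//')
    # Strip the "<provisionerId>/" prefix; a value without "/" is kept whole.
    printf '%s\n' "${raw##*/}"
}

# Print a one-line status for snmpd and return a Nagios-style code.
pool_id_status() {
    pool=$(extract_pool_id "$1")
    if [ -n "$pool" ]; then
        echo "OK - worker_pool_id=$pool"
        return 0
    fi
    echo "UNKNOWN - workerPoolID not found in $1"
    return 3
}
```

Saved as a script, the last line would be a call such as `pool_id_status /etc/start-worker.yml`, so that the function's return code becomes the script's exit code.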
Mirrors the fix in the Linux PR (#1216) after @aerickson found that /etc/generic-worker.config is generated just-in-time per task and isn't reliably present at snmpd poll time. Mac's worker_runner module writes /opt/worker/worker-runner-config.yaml (see modules/worker_runner/manifests/init.pp: $data_dir defaults to /opt/worker and $worker_runner_conf = "${data_dir}/worker-runner-config.yaml"). Read workerPoolID from that file instead, and strip the "<provisionerId>/" prefix to expose just the pool name (matches Windows/host_pool semantics).
Good catch, pushed the fix in … Same fix applied to the Mac PR (#1217); that one points at … If you can confirm …
ms005 looks good. Will check out 020 and marlin. |
020 is good. |
Summary
Adds two snmpd `extend`-backed scripts so marlin can poll Linux workers over SNMP for:
- `gw_status`: whether `generic-worker` is running (`pgrep -f /generic-worker`).
- `worker_pool_id`: the worker's pool ID (workerType) read from `/etc/generic-worker.config`.

Both pieces are consumed by the existing marlin Status/Metrics dashboards (RelSRE → FXCI Hardware Workers → Linux). Windows already exposes equivalents via NSCP/NRPE (PR mozilla-it/marlin#16); this PR brings Linux to parity.
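For illustration, the `gw_status` check could look roughly like the sketch below. The `pgrep -f /generic-worker` probe and the `OK - generic-worker running` message come from this description; the function name `gw_status`, the optional pattern argument, the `CRITICAL` message, and return code 2 are assumptions and may differ from the committed snmp_check_gw.sh.

```shell
#!/bin/sh
# Sketch only: the CRITICAL branch and the pattern argument are assumptions.

# Report whether a process matching the pattern is running.
# $1 is optional and defaults to the pattern named in the PR description.
gw_status() {
    pattern="${1:-/generic-worker}"
    # pgrep -f matches against the full command line, not just the process name.
    if pgrep -f "$pattern" >/dev/null 2>&1; then
        echo "OK - generic-worker running"
        return 0
    fi
    echo "CRITICAL - generic-worker not running"
    return 2
}
```

A script built this way would end with a bare `gw_status` call, and snmpd would pick it up through a line of the form `extend gw_status /usr/local/bin/snmp_check_gw.sh` in snmpd.conf (the exact lines in the template are not shown in this PR text).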
Files
- `modules/linux_snmpd/files/snmp_check_gw.sh` (new)
- `modules/linux_snmpd/files/snmp_worker_pool_id.sh` (new)
- `modules/linux_snmpd/manifests/init.pp`: deploys the two scripts to `/usr/local/bin/`
- `modules/linux_snmpd/templates/snmpd.conf.erb`: adds two `extend` lines

Companion PRs
Test plan
- `/usr/local/bin/snmp_check_gw.sh` and `/usr/local/bin/snmp_worker_pool_id.sh` exist and exit 0
- `snmpget -v2c -c <community> -O qv <host> 'NET-SNMP-EXTEND-MIB::nsExtendOutputFull."gw_status"'` returns `OK - generic-worker running`
- `worker_pool_id` returns `OK - worker_pool_id=<pool>`
- `Linux Generic Worker` + `Linux Worker Pool ID` appear OK
- `marlin-icinga2` returns `host_pool` records for `t-linux64-ms-*` hostnames
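The snmpget steps above can be wrapped in a small helper so both extends are polled the same way. This is a sketch under assumptions: the function names `extend_oid` and `poll_extend` are hypothetical, and only the snmpget flags and the OID form are taken from the test plan.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical.

# Build the NET-SNMP-EXTEND-MIB OID for an extend name, with the
# double quotes around the name that the OID index needs.
extend_oid() {
    printf 'NET-SNMP-EXTEND-MIB::nsExtendOutputFull."%s"\n' "$1"
}

# Poll one extend. $1 = host, $2 = SNMP community, $3 = extend name.
poll_extend() {
    snmpget -v2c -c "$2" -O qv "$1" "$(extend_oid "$3")"
}
```

With `<host>` and `<community>` filled in, `poll_extend <host> <community> gw_status` and `poll_extend <host> <community> worker_pool_id` cover the two SNMP checks in the list.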