Pass structured cluster resource data and taints upstream via ping path (aligned with interLink#516)#135
…ndler
When the /status endpoint is called with an empty pod list (the interlink-api ping path), return a JSON-encoded NodeResources value instead of plain-text sinfo -s output. This gives the interlink core and virtual kubelet structured data to update the virtual node's advertised capacity so it reflects the actual SLURM cluster occupancy.
Implementation details:
- Add NodeResources type (cpu_total_cores, cpu_used_cores, memory_total_bytes, memory_used_bytes, max_pods) to types.go
- Add slurmNodeList / slurmJSONNode minimal schema for sinfo --json
- getClusterResources() tries sinfo --json first (SLURM >= 20.11), falls back to sinfo --noheader --format=%c|%m|%e text parsing
- Extract parseClusterResourcesFromJSON / parseClusterResourcesFromText as package-level functions for unit testability
- Add resources_test.go covering text parsing, JSON parsing, and NodeResources JSON serialisation
Agent-Logs-Url: https://github.com/interlink-hq/interlink-slurm-plugin/sessions/e103d80e-3adf-46d9-b112-88cc6125ba95
Co-authored-by: dciangot <4144326+dciangot@users.noreply.github.com>
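The text fallback described in this commit can be sketched roughly as follows. This is a minimal illustration, not the plugin's actual parseClusterResourcesFromText: the helper name parseSinfoText is hypothetical, and it assumes one `cpus|memory|free` line per node, that sinfo's megabytes are MiB, and that a non-numeric free-memory value counts as zero free. CPU usage stays at zero because the text output does not expose per-node CPU allocation.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"strconv"
	"strings"
)

// NodeResources carries the fields named in the commit message.
type NodeResources struct {
	CPUTotalCores    int64 `json:"cpu_total_cores"`
	CPUUsedCores     int64 `json:"cpu_used_cores"`
	MemoryTotalBytes int64 `json:"memory_total_bytes"`
	MemoryUsedBytes  int64 `json:"memory_used_bytes"`
	MaxPods          int64 `json:"max_pods"`
}

// bytesPerMB assumes sinfo megabytes are MiB (an assumption of this sketch).
const bytesPerMB = int64(1024 * 1024)

// parseSinfoText is a hypothetical helper that sums "cpus|memMB|freeMB" lines.
func parseSinfoText(out string) (NodeResources, error) {
	var res NodeResources
	var totalMB, freeMB int64
	sc := bufio.NewScanner(strings.NewReader(out))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" {
			continue
		}
		fields := strings.Split(line, "|")
		if len(fields) != 3 {
			return NodeResources{}, fmt.Errorf("unexpected sinfo line %q", line)
		}
		cpus, err := strconv.ParseInt(strings.TrimSpace(fields[0]), 10, 64)
		if err != nil {
			return NodeResources{}, fmt.Errorf("cpu count in %q: %w", line, err)
		}
		mem, err := strconv.ParseInt(strings.TrimSpace(fields[1]), 10, 64)
		if err != nil {
			return NodeResources{}, fmt.Errorf("memory in %q: %w", line, err)
		}
		// Free memory may be non-numeric; this sketch then counts it as 0 free.
		free, err := strconv.ParseInt(strings.TrimSpace(fields[2]), 10, 64)
		if err != nil {
			free = 0
		}
		res.CPUTotalCores += cpus
		totalMB += mem
		freeMB += free
	}
	if err := sc.Err(); err != nil {
		return NodeResources{}, err
	}
	usedMB := totalMB - freeMB
	if usedMB < 0 {
		usedMB = 0 // guard against free memory exceeding the reported total
	}
	res.MemoryTotalBytes = totalMB * bytesPerMB
	res.MemoryUsedBytes = usedMB * bytesPerMB
	return res, nil
}

func main() {
	res, err := parseSinfoText("8|32000|16000\n16|64000|N/A\n")
	if err != nil {
		panic(err)
	}
	body, _ := json.Marshal(res)
	fmt.Println(string(body))
}
```

Keeping the parser a pure function over a string, as the commit does with its package-level helpers, lets it be tested without a SLURM installation.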
Align these changes to work with interlink-hq/interLink#516.
…e schema
Replace the custom NodeResources type with PingResponse / ResourcesResponse /
AcceleratorResponse types that mirror the schema introduced in interLink#516.
The VK's nodeUpdate() in that PR tries to JSON-unmarshal the ping response body
as PingResponse and, if Resources is non-nil, calls updateNodeResources() to
update node.Status.Capacity and node.Status.Allocatable. By returning the same
structure from the /status ping path, the SLURM plugin feeds live cluster
resource data directly into the virtual node's advertised capacity.
Key changes:
- types.go: PingResponse {status, resources}, ResourcesResponse {cpu, memory,
pods, accelerators} and AcceleratorResponse {resourceType, available} all with
Kubernetes-quantity string values; TODO comment to replace with upstream types
once interLink#516 is merged
- Status.go: getClusterResources() now returns PingResponse; JSON path reports
available resources (total - allocated) as Kubernetes quantity strings; text
fallback reports total CPUs + free memory; clampToZero() helper extracted to
avoid duplicate negative-guard logic
- resources_test.go: tests updated to assert PingResponse.Resources.CPU /
.Memory using Kubernetes quantity strings ("24", "96000Mi", etc.)
Agent-Logs-Url: https://github.com/interlink-hq/interlink-slurm-plugin/sessions/4badfd6e-a4b2-4ee2-b9e2-96b592f06b37
Co-authored-by: dciangot <4144326+dciangot@users.noreply.github.com>
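As a rough sketch of the schema and the "total - allocated" computation described in this commit: the JSON field names below come from the commit message, while the omitempty tags, the nodeUsage type, the availableResources helper, and the choice to clamp per node are assumptions made for illustration. The plugin's real slurmJSONNode schema is not reproduced here.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
)

// AcceleratorResponse, ResourcesResponse and PingResponse follow the field
// names listed in the commit message; all values are Kubernetes quantity strings.
type AcceleratorResponse struct {
	ResourceType string `json:"resourceType"`
	Available    string `json:"available"`
}

type ResourcesResponse struct {
	CPU          string                `json:"cpu,omitempty"`
	Memory       string                `json:"memory,omitempty"`
	Pods         string                `json:"pods,omitempty"`
	Accelerators []AcceleratorResponse `json:"accelerators,omitempty"`
}

type PingResponse struct {
	Status    string             `json:"status"`
	Resources *ResourcesResponse `json:"resources,omitempty"`
}

// nodeUsage is a hypothetical, already-decoded view of one SLURM node.
type nodeUsage struct {
	CPUs, AllocCPUs         int64
	MemoryMB, AllocMemoryMB int64
}

// clampToZero keeps an over-allocated node from producing a negative value.
func clampToZero(v int64) int64 {
	if v < 0 {
		return 0
	}
	return v
}

// availableResources sums (total - allocated) across nodes, clamping per node.
func availableResources(nodes []nodeUsage) PingResponse {
	var cpu, memMB int64
	for _, n := range nodes {
		cpu += clampToZero(n.CPUs - n.AllocCPUs)
		memMB += clampToZero(n.MemoryMB - n.AllocMemoryMB)
	}
	return PingResponse{
		Status: "ok",
		Resources: &ResourcesResponse{
			CPU:    strconv.FormatInt(cpu, 10),
			Memory: fmt.Sprintf("%dMi", memMB),
		},
	}
}

// encode marshals a response and panics on the (unexpected) failure case.
func encode(resp PingResponse) string {
	b, err := json.Marshal(resp)
	if err != nil {
		panic(err)
	}
	return string(b)
}

func main() {
	nodes := []nodeUsage{
		{CPUs: 16, AllocCPUs: 4, MemoryMB: 64000, AllocMemoryMB: 16000},
		{CPUs: 16, AllocCPUs: 20, MemoryMB: 64000, AllocMemoryMB: 0}, // over-allocated CPUs
	}
	fmt.Println(encode(availableResources(nodes)))
}
```

Making Resources a pointer lets the consumer distinguish "no resource data" from "zero resources", which matches the non-nil check the commit message attributes to the VK.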
Done in d69348e. The ping-path response now returns a …
@copilot also include taints, passed from the config file through the API: interlink-hq/interLink#516 (comment)
Add TaintConfig type (dual yaml+json struct tags) to types.go and a Taints
field to both PingResponse and SlurmConfig. When SlurmConfig.Taints is
non-empty, getClusterResources() copies the slice into PingResponse.Taints so
that the VK (interLink#516) can apply the operator-configured taints to the
virtual node on every heartbeat.
When no taints are configured the field remains nil and is omitted from the
JSON output (backward compatible). An explicit empty slice in the config
results in an empty JSON array, allowing operators to clear all plugin-managed
taints dynamically.
Example SlurmConfig.yaml addition:
Taints:
  - Key: vendor.io/maintenance
    Value: "true"
    Effect: NoSchedule
Corresponding ping response:
{"status":"ok","resources":{...},"taints":[{"key":"vendor.io/maintenance","value":"true","effect":"NoSchedule"}]}
Agent-Logs-Url: https://github.com/interlink-hq/interlink-slurm-plugin/sessions/fe353d6f-2d98-4198-acd4-da0b1ef9f457
Co-authored-by: dciangot <4144326+dciangot@users.noreply.github.com>
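The nil-versus-empty distinction described above needs some care in Go, because a plain `[]T` field tagged `omitempty` drops both a nil and an empty slice. The sketch below shows one way to get the described behavior with a pointer to a slice. The field type and the withConfiguredTaints helper are assumptions of this sketch, not necessarily how the PR implements it, and the Resources field is left out for brevity.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// TaintConfig carries both yaml and json tags, using the key names
// shown in the example config and the example ping response.
type TaintConfig struct {
	Key    string `yaml:"Key" json:"key"`
	Value  string `yaml:"Value" json:"value,omitempty"`
	Effect string `yaml:"Effect" json:"effect"`
}

// PingResponse is trimmed to the fields relevant here. Taints is a pointer
// so that "not configured" (nil) and "explicitly empty" can be told apart.
type PingResponse struct {
	Status string         `json:"status"`
	Taints *[]TaintConfig `json:"taints,omitempty"`
}

// withConfiguredTaints is a hypothetical helper: it leaves the response
// untouched when no taints are configured, and otherwise copies the slice.
func withConfiguredTaints(resp PingResponse, configured []TaintConfig) PingResponse {
	if configured == nil {
		return resp
	}
	copied := make([]TaintConfig, len(configured)) // non-nil even when empty
	copy(copied, configured)
	resp.Taints = &copied
	return resp
}

// encode marshals a response and panics on the (unexpected) failure case.
func encode(resp PingResponse) string {
	b, err := json.Marshal(resp)
	if err != nil {
		panic(err)
	}
	return string(b)
}

func main() {
	base := PingResponse{Status: "ok"}
	fmt.Println(encode(withConfiguredTaints(base, nil)))
	fmt.Println(encode(withConfiguredTaints(base, []TaintConfig{})))
	fmt.Println(encode(withConfiguredTaints(base, []TaintConfig{
		{Key: "vendor.io/maintenance", Value: "true", Effect: "NoSchedule"},
	})))
}
```

With this shape, an unconfigured plugin emits no taints key at all, while an explicit empty list emits `"taints":[]`, which a consumer can read as "clear the plugin-managed taints".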
Done in db90940. Operators can now configure taints in the plugin config file:
Taints:
  - Key: vendor.io/maintenance
    Value: "true"
    Effect: NoSchedule
  - Key: vendor.io/drain
    Effect: NoExecute
The ping response will include them as:
{
  "status": "ok",
  "resources": { "cpu": "24", "memory": "96000Mi" },
  "taints": [
    { "key": "vendor.io/maintenance", "value": "true", "effect": "NoSchedule" },
    { "key": "vendor.io/drain", "effect": "NoExecute" }
  ]
}
The VK's …
// allocation data. When the JSON output is unavailable or cannot be parsed it falls
// back to `sinfo --noheader --format=…` text parsing, which can report totals and
// free memory but cannot determine per-node CPU allocation.
func (h *SidecarHandler) getClusterResources() (PingResponse, error) {
@copilot make the cluster resource-gathering script configurable. Imagine I want to pass my own logic from the plugin's config file.
Done in 819db1f. Add ResourceScriptPath to SlurmConfig.yaml to plug in your own resource-gathering logic:
ResourceScriptPath: /etc/interlink/get_resources.sh
The script must write a PingResponse-shaped JSON to stdout:
{"status":"ok","resources":{"cpu":"128","memory":"512Gi","pods":"1000"}}
When configured, the script runs instead of sinfo. A non-zero exit code, empty output, or unparseable JSON logs a warning and falls back to the built-in sinfo chain. Static Taints from the config always override any taints returned by the script.
…e gathering
Add ResourceScriptPath string to SlurmConfig (yaml:"ResourceScriptPath"). When
set, getClusterResources() executes the script instead of running sinfo directly.
The script must write a PingResponse-shaped JSON to stdout; a non-zero exit code,
empty output, or unparseable JSON is treated as a transient error and the sinfo
chain is tried as a fallback.
Static SlurmConfig.Taints always override any taints returned by the custom script.
Example SlurmConfig.yaml:
ResourceScriptPath: /etc/interlink/get_resources.sh
Example script output:
{"status":"ok","resources":{"cpu":"128","memory":"512Gi","pods":"1000"}}
Agent-Logs-Url: https://github.com/interlink-hq/interlink-slurm-plugin/sessions/f41c6039-99d2-499c-a1cf-edaa18783361
Co-authored-by: dciangot <4144326+dciangot@users.noreply.github.com>
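The script-then-fallback flow in this commit could look roughly like the sketch below. The helper names (parseScriptOutput, runScript, resolveResources) and the builtin callback standing in for the sinfo chain are hypothetical. Treating a missing status field as an error is an assumption based on the test cases listed in the PR description, and the real implementation may also apply a timeout, which this sketch omits.

```go
package main

import (
	"bytes"
	"encoding/json"
	"errors"
	"fmt"
	"log"
	"os/exec"
)

// ResourcesResponse and PingResponse are trimmed to the fields that appear
// in the example script output.
type ResourcesResponse struct {
	CPU    string `json:"cpu,omitempty"`
	Memory string `json:"memory,omitempty"`
	Pods   string `json:"pods,omitempty"`
}

type PingResponse struct {
	Status    string             `json:"status"`
	Resources *ResourcesResponse `json:"resources,omitempty"`
}

// parseScriptOutput validates the script's stdout: it must be non-empty,
// valid JSON, and carry a status value.
func parseScriptOutput(out []byte) (PingResponse, error) {
	out = bytes.TrimSpace(out)
	if len(out) == 0 {
		return PingResponse{}, errors.New("script produced no output")
	}
	var resp PingResponse
	if err := json.Unmarshal(out, &resp); err != nil {
		return PingResponse{}, fmt.Errorf("script output is not valid JSON: %w", err)
	}
	if resp.Status == "" {
		return PingResponse{}, errors.New("script output has no status field")
	}
	return resp, nil
}

// runScript executes the script; Output returns an error on a non-zero exit.
func runScript(path string) (PingResponse, error) {
	out, err := exec.Command(path).Output()
	if err != nil {
		return PingResponse{}, fmt.Errorf("running %s: %w", path, err)
	}
	return parseScriptOutput(out)
}

// resolveResources tries the custom script first (when configured) and falls
// back to builtin, which stands in for the sinfo chain in this sketch.
func resolveResources(scriptPath string, builtin func() (PingResponse, error)) (PingResponse, error) {
	if scriptPath != "" {
		resp, err := runScript(scriptPath)
		if err == nil {
			return resp, nil
		}
		log.Printf("warning: custom resource script failed, using built-in path: %v", err)
	}
	return builtin()
}

func main() {
	builtin := func() (PingResponse, error) {
		return PingResponse{Status: "ok", Resources: &ResourcesResponse{CPU: "24", Memory: "96000Mi"}}, nil
	}
	// An empty path means no script is configured, so builtin is used.
	resp, err := resolveResources("", builtin)
	if err != nil {
		panic(err)
	}
	body, _ := json.Marshal(resp)
	fmt.Println(string(body))
}
```

Separating output validation from process execution keeps the failure cases (empty output, invalid JSON, missing status) testable without spawning a process.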
The interlink core needs real-time SLURM cluster resource occupancy to update the virtual node's advertised capacity. Previously the /status ping path returned a plain-text sinfo -s dump with no machine-readable structure.
Changes
- pkg/slurm/types.go: new PingResponse, ResourcesResponse, AcceleratorResponse, and TaintConfig types mirroring the schema from interlink-hq/interLink#516; minimal slurmNodeList / slurmJSONNode types for sinfo --json parsing. A TODO marks the local types for replacement with upstream commonIL.* imports once interLink#516 is merged. SlurmConfig gains two new fields: Taints []TaintConfig for declaring static node taints and ResourceScriptPath string for plugging in a custom resource-gathering script.
- pkg/slurm/Status.go: when /status is called with an empty pod list (the interlink-api ping path), returns Content-Type: application/json with a PingResponse body. Resource data is fetched via a three-tier resolution:
  1. Custom script (ResourceScriptPath): when set, the script is executed and its JSON stdout is parsed as a PingResponse; on failure the sinfo chain is tried as a fallback.
  2. sinfo --json (SLURM ≥ 20.11): accurate per-node alloc_cpus and alloc_memory; reports available resources (total − allocated) as Kubernetes quantity strings.
  3. sinfo --noheader --format=%c|%m|%e: reports total installed CPUs and the sum of free memory as Kubernetes quantity strings (fallback when JSON is unavailable).
  Static SlurmConfig.Taints are always merged into the response (overriding any taints returned by the custom script). When no taints are configured the field is omitted (backward compatible).
- pkg/slurm/resources_test.go: parsing logic extracted into package-level parseClusterResourcesFromJSON / parseClusterResourcesFromText functions; tests cover the PingResponse format with Kubernetes quantity strings, taint serialization, and the custom script path (happy path, non-zero exit, invalid JSON, missing status field).
Example response on ping
{
  "status": "ok",
  "resources": { "cpu": "24", "memory": "96000Mi" },
  "taints": [
    { "key": "vendor.io/maintenance", "value": "true", "effect": "NoSchedule" }
  ]
}
Example SlurmConfig.yaml configuration
Example custom script output
{"status":"ok","resources":{"cpu":"128","memory":"512Gi","pods":"1000"}}
The VK's nodeUpdate() (interLink#516) parses this body as PingResponse and:
- if resources is non-nil, calls updateNodeResources() to update both node.Status.Capacity and node.Status.Allocatable
- if taints is non-nil, calls updateNodeTaints() to replace the node's non-system taints with the provided list
Both updates happen on every 30-second heartbeat, keeping the virtual node's advertised capacity and taint set in sync with the SLURM cluster state. Omitted fields retain their currently configured values.