Pass structured cluster resource data and taints upstream via ping path (aligned with interLink#516) #135

Draft
Copilot wants to merge 5 commits into main from copilot/pass-resource-update-upstream

Conversation

Contributor

Copilot AI commented Apr 3, 2026

The interlink core needs real-time SLURM cluster resource occupancy to update the virtual node's advertised capacity. Previously the /status ping path returned a plain-text sinfo -s dump with no machine-readable structure.

Changes

  • pkg/slurm/types.go — new PingResponse, ResourcesResponse, AcceleratorResponse, and TaintConfig types mirroring the schema from interlink-hq/interLink#516; minimal slurmNodeList/slurmJSONNode types for sinfo --json parsing. A TODO marks the local types for replacement with upstream commonIL.* imports once interLink#516 is merged. SlurmConfig gains two new fields: Taints []TaintConfig for declaring static node taints and ResourceScriptPath string for plugging in a custom resource-gathering script.

  • pkg/slurm/Status.go — when /status is called with an empty pod list (the interlink-api ping path), returns Content-Type: application/json with a PingResponse body. Resource data is fetched via a three-tier resolution:

    1. Custom script (ResourceScriptPath) — when set, the script is executed and its JSON stdout is parsed as a PingResponse; on failure the sinfo chain is tried as a fallback.
    2. sinfo --json (SLURM ≥ 20.11) — accurate per-node alloc_cpus and alloc_memory; reports available resources (total − allocated) as Kubernetes quantity strings.
    3. sinfo --noheader --format=%c|%m|%e — reports total installed CPUs and sum of free memory as Kubernetes quantity strings (fallback when JSON unavailable).

    Static SlurmConfig.Taints are always merged into the response (overriding any taints returned by the custom script). When no taints are configured the field is omitted (backward compatible).

  • pkg/slurm/resources_test.go — parsing logic extracted into package-level parseClusterResourcesFromJSON / parseClusterResourcesFromText functions; tests cover the PingResponse format with Kubernetes quantity strings, taint serialization, and the custom script path (happy path, non-zero exit, invalid JSON, missing status field).
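The three-tier resolution above can be read as a chain of sources tried in order, where the first success wins. The following is a minimal sketch of that idea only, not the PR's code: `resourceSource`, `resolveResources`, `errUnavailable`, and the trimmed-down `PingResponse` are hypothetical names introduced for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// PingResponse is a trimmed-down stand-in for the type described in the PR.
type PingResponse struct {
	Status string
	CPU    string
	Memory string
}

// resourceSource is a hypothetical name for one tier of the chain.
type resourceSource struct {
	name  string
	fetch func() (PingResponse, error)
}

// errUnavailable is a placeholder error used by the stub sources below.
var errUnavailable = errors.New("source unavailable")

// resolveResources tries each source in order and returns the first success,
// together with the name of the source that produced it.
func resolveResources(sources []resourceSource) (PingResponse, string, error) {
	if len(sources) == 0 {
		return PingResponse{}, "", errors.New("no resource sources configured")
	}
	var errs []error
	for _, s := range sources {
		resp, err := s.fetch()
		if err == nil {
			return resp, s.name, nil
		}
		errs = append(errs, fmt.Errorf("%s: %w", s.name, err))
	}
	return PingResponse{}, "", errors.Join(errs...)
}

func main() {
	sources := []resourceSource{
		{"custom script", func() (PingResponse, error) { return PingResponse{}, errUnavailable }},
		{"sinfo --json", func() (PingResponse, error) {
			return PingResponse{Status: "ok", CPU: "24", Memory: "96000Mi"}, nil
		}},
		{"sinfo text", func() (PingResponse, error) { return PingResponse{Status: "ok"}, nil }},
	}
	resp, used, err := resolveResources(sources)
	fmt.Println(resp, used, err)
}
```

Collecting the per-tier errors with `errors.Join` (Go 1.20+) keeps the reason for each failed tier visible if the whole chain fails.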

Example response on ping

{
  "status": "ok",
  "resources": {
    "cpu": "24",
    "memory": "96000Mi"
  },
  "taints": [
    { "key": "vendor.io/maintenance", "value": "true", "effect": "NoSchedule" }
  ]
}
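For readers who want to see how a response of this shape could be produced, here is a hedged sketch of Go types that serialize to the JSON above. The type names come from the PR description, but the field names, struct tags, and the `encodePing` helper are assumptions (the accelerators field is left out for brevity); the actual definitions live in pkg/slurm/types.go and may differ.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// TaintConfig carries both yaml and json tags, as the PR describes.
// The exact tag strings here are assumed.
type TaintConfig struct {
	Key    string `json:"key" yaml:"Key"`
	Value  string `json:"value,omitempty" yaml:"Value"`
	Effect string `json:"effect" yaml:"Effect"`
}

// ResourcesResponse holds Kubernetes quantity strings such as "24" or "96000Mi".
type ResourcesResponse struct {
	CPU    string `json:"cpu,omitempty"`
	Memory string `json:"memory,omitempty"`
	Pods   string `json:"pods,omitempty"`
}

// PingResponse is the body returned on the ping path.
type PingResponse struct {
	Status    string             `json:"status"`
	Resources *ResourcesResponse `json:"resources,omitempty"`
	Taints    []TaintConfig      `json:"taints,omitempty"`
}

// encodePing is a hypothetical helper that renders a PingResponse as JSON.
func encodePing(p PingResponse) (string, error) {
	b, err := json.Marshal(p)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	body, err := encodePing(PingResponse{
		Status:    "ok",
		Resources: &ResourcesResponse{CPU: "24", Memory: "96000Mi"},
		Taints: []TaintConfig{
			{Key: "vendor.io/maintenance", Value: "true", Effect: "NoSchedule"},
		},
	})
	fmt.Println(body, err)
}
```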

Example SlurmConfig.yaml configuration

# Optional: plug in a custom resource-gathering script
ResourceScriptPath: /etc/interlink/get_resources.sh

# Optional: static taints to advertise on the virtual node
Taints:
  - Key: vendor.io/maintenance
    Value: "true"
    Effect: NoSchedule
  - Key: vendor.io/drain
    Effect: NoExecute

Example custom script output

{"status":"ok","resources":{"cpu":"128","memory":"512Gi","pods":"1000"}}
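The checks applied to script output (empty output, invalid JSON, missing status field) might look roughly like the sketch below. `parseScriptOutput` is a hypothetical name and the struct tags are assumed; this illustrates the validation steps listed in the PR rather than reproducing its code.

```go
package main

import (
	"bytes"
	"encoding/json"
	"errors"
	"fmt"
)

// Minimal stand-ins for the PR's types; tags are assumptions.
type ResourcesResponse struct {
	CPU    string `json:"cpu,omitempty"`
	Memory string `json:"memory,omitempty"`
	Pods   string `json:"pods,omitempty"`
}

type PingResponse struct {
	Status    string             `json:"status"`
	Resources *ResourcesResponse `json:"resources,omitempty"`
}

// parseScriptOutput validates the stdout of a custom resource script.
// Any error here would let the caller fall back to the sinfo chain.
func parseScriptOutput(stdout []byte) (PingResponse, error) {
	trimmed := bytes.TrimSpace(stdout)
	if len(trimmed) == 0 {
		return PingResponse{}, errors.New("script produced no output")
	}
	var resp PingResponse
	if err := json.Unmarshal(trimmed, &resp); err != nil {
		return PingResponse{}, fmt.Errorf("script output is not valid JSON: %w", err)
	}
	if resp.Status == "" {
		return PingResponse{}, errors.New(`script output is missing the "status" field`)
	}
	return resp, nil
}

func main() {
	out := []byte(`{"status":"ok","resources":{"cpu":"128","memory":"512Gi","pods":"1000"}}`)
	resp, err := parseScriptOutput(out)
	if err != nil {
		fmt.Println("fallback to sinfo:", err)
		return
	}
	fmt.Println(resp.Status, resp.Resources.CPU, resp.Resources.Memory, resp.Resources.Pods)
}
```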

The VK's nodeUpdate() (interLink#516) parses this body as PingResponse and:

  • when resources is non-nil, calls updateNodeResources() to update both node.Status.Capacity and node.Status.Allocatable
  • when taints is non-nil, calls updateNodeTaints() to replace the node's non-system taints with the provided list

Both updates happen on every 30-second heartbeat, keeping the virtual node's advertised capacity and taint set in sync with the SLURM cluster state. Omitted fields retain their currently configured values.
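To make the "replace the node's non-system taints" behaviour concrete, here is a rough sketch. It is not taken from interLink#516: the `Taint` type, the `replaceNonSystemTaints` helper, and in particular the rule that a "system" taint is one whose key starts with `node.kubernetes.io/` are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// Taint is a simplified stand-in for a Kubernetes node taint.
type Taint struct {
	Key, Value, Effect string
}

// systemPrefix is an ASSUMPTION about what counts as a system taint;
// the real VK code may use a different rule.
const systemPrefix = "node.kubernetes.io/"

// replaceNonSystemTaints keeps the system taints already on the node and
// replaces everything else with the list provided by the plugin.
func replaceNonSystemTaints(current, provided []Taint) []Taint {
	out := make([]Taint, 0, len(current)+len(provided))
	for _, t := range current {
		if strings.HasPrefix(t.Key, systemPrefix) {
			out = append(out, t)
		}
	}
	return append(out, provided...)
}

func main() {
	current := []Taint{
		{Key: "node.kubernetes.io/unreachable", Effect: "NoExecute"},
		{Key: "vendor.io/old", Effect: "NoSchedule"},
	}
	provided := []Taint{
		{Key: "vendor.io/maintenance", Value: "true", Effect: "NoSchedule"},
	}
	fmt.Println(replaceNonSystemTaints(current, provided))
}
```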

…ndler

When the /status endpoint is called with an empty pod list (the
interlink-api ping path), return a JSON-encoded NodeResources value
instead of plain-text sinfo -s output.  This gives the interlink core
and virtual kubelet structured data to update the virtual node's
advertised capacity so it reflects the actual SLURM cluster occupancy.

Implementation details:
- Add NodeResources type (cpu_total_cores, cpu_used_cores,
  memory_total_bytes, memory_used_bytes, max_pods) to types.go
- Add slurmNodeList / slurmJSONNode minimal schema for sinfo --json
- getClusterResources() tries sinfo --json first (SLURM >= 20.11),
  falls back to sinfo --noheader --format=%c|%m|%e text parsing
- Extract parseClusterResourcesFromJSON / parseClusterResourcesFromText
  as package-level functions for unit testability
- Add resources_test.go covering text parsing, JSON parsing, and
  NodeResources JSON serialisation
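As an illustration of the text fallback, the sketch below sums the `%c|%m|%e` columns (CPUs, memory, free memory) into quantity strings. It assumes one output line per node, memory values in MiB, and that an unparsable free-memory value such as `N/A` counts as zero; `sumSinfoText` is a hypothetical name, not the PR's parseClusterResourcesFromText.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// sumSinfoText parses lines of the form "cpus|memory|free_memory" and returns
// total CPUs and summed free memory as Kubernetes quantity strings.
func sumSinfoText(out string) (string, string, error) {
	var totalCPUs, freeMiB int64
	for _, line := range strings.Split(out, "\n") {
		line = strings.TrimSpace(line)
		if line == "" {
			continue
		}
		fields := strings.Split(line, "|")
		if len(fields) != 3 {
			return "", "", fmt.Errorf("unexpected sinfo line %q", line)
		}
		cpus, err := strconv.ParseInt(strings.TrimSpace(fields[0]), 10, 64)
		if err != nil {
			return "", "", fmt.Errorf("bad CPU count in %q: %w", line, err)
		}
		totalCPUs += cpus
		// Free memory may be reported as N/A; treat that as zero (assumption).
		if free, err := strconv.ParseInt(strings.TrimSpace(fields[2]), 10, 64); err == nil && free > 0 {
			freeMiB += free
		}
	}
	return strconv.FormatInt(totalCPUs, 10), fmt.Sprintf("%dMi", freeMiB), nil
}

func main() {
	cpu, mem, err := sumSinfoText("16|64000|32000\n8|32000|N/A\n")
	fmt.Println(cpu, mem, err)
}
```

Note that this path can only report installed CPUs, since the text format carries no per-node CPU allocation.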

Agent-Logs-Url: https://github.com/interlink-hq/interlink-slurm-plugin/sessions/e103d80e-3adf-46d9-b112-88cc6125ba95

Co-authored-by: dciangot <4144326+dciangot@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Add resource update functionality to virtual node" to "Pass structured cluster resource data upstream via ping path" on Apr 3, 2026
Copilot AI requested a review from dciangot April 3, 2026 06:03
@dciangot
Member

dciangot commented Apr 3, 2026

Align these changes to work with interlink-hq/interLink#516

@dciangot
Member

dciangot commented Apr 3, 2026

Align these changes to work with interlink-hq/interLink#516

@copilot

…e schema

Replace the custom NodeResources type with PingResponse / ResourcesResponse /
AcceleratorResponse types that mirror the schema introduced in interLink#516.

The VK's nodeUpdate() in that PR tries to JSON-unmarshal the ping response body
as PingResponse and, if Resources is non-nil, calls updateNodeResources() to
update node.Status.Capacity and node.Status.Allocatable.  By returning the same
structure from the /status ping path, the SLURM plugin feeds live cluster
resource data directly into the virtual node's advertised capacity.

Key changes:
- types.go: PingResponse {status, resources}, ResourcesResponse {cpu, memory,
  pods, accelerators} and AcceleratorResponse {resourceType, available} all with
  Kubernetes-quantity string values; TODO comment to replace with upstream types
  once interLink#516 is merged
- Status.go: getClusterResources() now returns PingResponse; JSON path reports
  available resources (total - allocated) as Kubernetes quantity strings; text
  fallback reports total CPUs + free memory; clampToZero() helper extracted to
  avoid duplicate negative-guard logic
- resources_test.go: tests updated to assert PingResponse.Resources.CPU /
  .Memory using Kubernetes quantity strings ("24", "96000Mi", etc.)
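The "available = total - allocated" computation on the JSON path, including the negative guard, could be sketched as follows. The type names and `alloc_cpus` / `alloc_memory` appear in the PR; the top-level `nodes` key, the `cpus` and `real_memory` field names, the per-node clamping, and the helper name `availableFromJSON` are assumptions, and newer Slurm releases may lay out this JSON differently.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"strconv"
)

// Minimal schema for the parts of `sinfo --json` used here (field names assumed).
type slurmJSONNode struct {
	CPUs        int64 `json:"cpus"`
	AllocCPUs   int64 `json:"alloc_cpus"`
	RealMemory  int64 `json:"real_memory"`  // MiB
	AllocMemory int64 `json:"alloc_memory"` // MiB
}

type slurmNodeList struct {
	Nodes []slurmJSONNode `json:"nodes"`
}

// clampToZero guards against allocated values exceeding the totals.
func clampToZero(v int64) int64 {
	if v < 0 {
		return 0
	}
	return v
}

// availableFromJSON sums per-node free CPUs and memory and formats them
// as Kubernetes quantity strings.
func availableFromJSON(data []byte) (string, string, error) {
	var list slurmNodeList
	if err := json.Unmarshal(data, &list); err != nil {
		return "", "", fmt.Errorf("parsing sinfo JSON: %w", err)
	}
	if len(list.Nodes) == 0 {
		return "", "", errors.New("sinfo JSON contains no nodes")
	}
	var freeCPUs, freeMiB int64
	for _, n := range list.Nodes {
		freeCPUs += clampToZero(n.CPUs - n.AllocCPUs)
		freeMiB += clampToZero(n.RealMemory - n.AllocMemory)
	}
	return strconv.FormatInt(freeCPUs, 10), fmt.Sprintf("%dMi", freeMiB), nil
}

func main() {
	data := []byte(`{"nodes":[
		{"cpus":16,"alloc_cpus":4,"real_memory":64000,"alloc_memory":16000},
		{"cpus":16,"alloc_cpus":4,"real_memory":64000,"alloc_memory":16000}]}`)
	cpu, mem, err := availableFromJSON(data)
	fmt.Println(cpu, mem, err)
}
```

With two such nodes the sketch yields "24" CPUs and "96000Mi" of memory, the same shape of values used in the tests mentioned above.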

Agent-Logs-Url: https://github.com/interlink-hq/interlink-slurm-plugin/sessions/4badfd6e-a4b2-4ee2-b9e2-96b592f06b37

Co-authored-by: dciangot <4144326+dciangot@users.noreply.github.com>
Contributor Author

Copilot AI commented Apr 3, 2026

Align these changes to work with interlink-hq/interLink#516

@copilot

Done in d69348e. The ping-path response now returns a PingResponse that mirrors the schema from interLink#516 exactly — {"status":"ok","resources":{"cpu":"24","memory":"96000Mi"}} — using Kubernetes quantity strings so the VK's updateNodeResources() can call resource.ParseQuantity() directly. Local type definitions (PingResponse, ResourcesResponse, AcceleratorResponse) carry a TODO to be replaced with the upstream commonIL.* imports once interLink#516 is merged and the dependency is bumped.

Warning

Firewall rules blocked me from connecting to one or more addresses (expand for details)

I tried to connect to the following addresses, but was blocked by firewall rules:

  • https://api.github.com/repos/interlink-hq/interLink/pulls/516
    • Triggering command: /usr/bin/curl curl -s -H Accept: application/vnd.github.v3+json REDACTED (http block)
  • https://api.github.com/repos/interlink-hq/interLink/pulls/516/files
    • Triggering command: /usr/bin/curl curl -s REDACTED (http block)

Copilot AI changed the title from "Pass structured cluster resource data upstream via ping path" to "Pass structured cluster resource data upstream via ping path (aligned with interLink#516)" on Apr 3, 2026
@dciangot
Member

dciangot commented Apr 5, 2026

@copilot also include taints from the config file, passed upstream via the API; see interlink-hq/interLink#516 (comment)

Add TaintConfig type (dual yaml+json struct tags) to types.go and a Taints
field to both PingResponse and SlurmConfig.  When SlurmConfig.Taints is
non-empty, getClusterResources() copies the slice into PingResponse.Taints so
that the VK (interLink#516) can apply the operator-configured taints to the
virtual node on every heartbeat.

When no taints are configured the field remains nil and is omitted from the
JSON output (backward compatible).  An explicit empty slice in the config
results in an empty JSON array, allowing operators to clear all plugin-managed
taints dynamically.

Example SlurmConfig.yaml addition:
  Taints:
    - Key: vendor.io/maintenance
      Value: "true"
      Effect: NoSchedule

Corresponding ping response:
  {"status":"ok","resources":{...},"taints":[{"key":"vendor.io/maintenance","value":"true","effect":"NoSchedule"}]}

Agent-Logs-Url: https://github.com/interlink-hq/interlink-slurm-plugin/sessions/fe353d6f-2d98-4198-acd4-da0b1ef9f457

Co-authored-by: dciangot <4144326+dciangot@users.noreply.github.com>
Contributor Author

Copilot AI commented Apr 5, 2026

@copilot also include taints from the config file, passed upstream via the API; see interlink-hq/interLink#516 (comment)

Done in db90940. Operators can now configure taints in SlurmConfig.yaml:

Taints:
  - Key: vendor.io/maintenance
    Value: "true"
    Effect: NoSchedule
  - Key: vendor.io/drain
    Effect: NoExecute

The ping response will include them as:

{
  "status": "ok",
  "resources": { "cpu": "24", "memory": "96000Mi" },
  "taints": [
    { "key": "vendor.io/maintenance", "value": "true", "effect": "NoSchedule" },
    { "key": "vendor.io/drain", "effect": "NoExecute" }
  ]
}

The VK's updateNodeTaints() (interLink#516) will replace the node's non-system taints with this list on every heartbeat. When no taints are configured the field is omitted entirely (backward compatible).

Comment thread pkg/slurm/Status.go
// allocation data. When the JSON output is unavailable or cannot be parsed it falls
// back to `sinfo --noheader --format=…` text parsing, which can report totals and
// free memory but cannot determine per-node CPU allocation.
func (h *SidecarHandler) getClusterResources() (PingResponse, error) {
Member

@copilot make the cluster resource-gathering script configurable. Imagine I want to supply my own logic through the plugin's config file.

Contributor Author

Done in 819db1f. Add ResourceScriptPath to SlurmConfig.yaml to plug in your own resource-gathering logic:

ResourceScriptPath: /etc/interlink/get_resources.sh

The script must write a PingResponse-shaped JSON to stdout:

{"status":"ok","resources":{"cpu":"128","memory":"512Gi","pods":"1000"}}

When configured, the script runs instead of sinfo. A non-zero exit code, empty output, or unparseable JSON logs a warning and falls back to the built-in sinfo chain. Static Taints from the config always override any taints returned by the script.
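Running the configured script and detecting the failure cases listed above might be sketched like this. The `runResourceScript` helper and the timeout are assumptions (the PR text does not say whether a timeout is applied); the path is the example value from the config snippet.

```go
package main

import (
	"bytes"
	"context"
	"errors"
	"fmt"
	"os/exec"
	"time"
)

// runResourceScript executes the script at path and returns its trimmed stdout.
// A start failure, non-zero exit, timeout, or empty output is reported as an
// error so the caller can fall back to the built-in sinfo chain.
func runResourceScript(path string, timeout time.Duration) ([]byte, error) {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	var stdout, stderr bytes.Buffer
	cmd := exec.CommandContext(ctx, path)
	cmd.Stdout = &stdout
	cmd.Stderr = &stderr
	if err := cmd.Run(); err != nil {
		return nil, fmt.Errorf("resource script %s failed: %w (stderr: %q)", path, err, stderr.String())
	}
	out := bytes.TrimSpace(stdout.Bytes())
	if len(out) == 0 {
		return nil, errors.New("resource script produced no output")
	}
	return out, nil
}

func main() {
	out, err := runResourceScript("/etc/interlink/get_resources.sh", 10*time.Second)
	if err != nil {
		fmt.Println("warning, falling back to sinfo:", err)
		return
	}
	fmt.Printf("%s\n", out)
}
```

The returned bytes would then be parsed as a PingResponse; a parse failure would take the same fallback route.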

Copilot AI changed the title from "Pass structured cluster resource data upstream via ping path (aligned with interLink#516)" to "Pass structured cluster resource data and taints upstream via ping path (aligned with interLink#516)" on Apr 5, 2026
Copilot AI requested a review from dciangot April 5, 2026 07:01
…e gathering

Add ResourceScriptPath string to SlurmConfig (yaml:"ResourceScriptPath").  When
set, getClusterResources() executes the script instead of running sinfo directly.
The script must write a PingResponse-shaped JSON to stdout; a non-zero exit code,
empty output, or unparseable JSON is treated as a transient error and the sinfo
chain is tried as a fallback.

Static SlurmConfig.Taints always override any taints returned by the custom script.

Example SlurmConfig.yaml:
  ResourceScriptPath: /etc/interlink/get_resources.sh

Example script output:
  {"status":"ok","resources":{"cpu":"128","memory":"512Gi","pods":"1000"}}

Agent-Logs-Url: https://github.com/interlink-hq/interlink-slurm-plugin/sessions/f41c6039-99d2-499c-a1cf-edaa18783361

Co-authored-by: dciangot <4144326+dciangot@users.noreply.github.com>
Development

Successfully merging this pull request may close these issues.

Pass resource update upstream to the virtual node

2 participants