feat: dynamically update node resource capacity and taints from plugin ping responses #516
Agent-Logs-Url: https://github.com/interlink-hq/interLink/sessions/80d70d6e-9c74-48bb-aecf-d2a8e34692f5
Co-authored-by: dciangot <4144326+dciangot@users.noreply.github.com>
@copilot is the new API backward compatible with the previous spec?
Yes, fully backward compatible. The VK-side change (lines 658–662) only attempts … The interLink API's …
…e schema
Replace the custom NodeResources type with PingResponse / ResourcesResponse /
AcceleratorResponse types that mirror the schema introduced in interLink#516.
The VK's nodeUpdate() in that PR tries to JSON-unmarshal the ping response body
as PingResponse and, if Resources is non-nil, calls updateNodeResources() to
update node.Status.Capacity and node.Status.Allocatable. By returning the same
structure from the /status ping path, the SLURM plugin feeds live cluster
resource data directly into the virtual node's advertised capacity.
Key changes:
- types.go: PingResponse {status, resources}, ResourcesResponse {cpu, memory,
  pods, accelerators} and AcceleratorResponse {resourceType, available}, all with
  Kubernetes-quantity string values; TODO comment to replace with upstream types
  once interLink#516 is merged
- Status.go: getClusterResources() now returns PingResponse; JSON path reports
  available resources (total - allocated) as Kubernetes quantity strings; text
  fallback reports total CPUs + free memory; clampToZero() helper extracted to
  avoid duplicate negative-guard logic
- resources_test.go: tests updated to assert PingResponse.Resources.CPU /
  .Memory using Kubernetes quantity strings ("24", "96000Mi", etc.)
Agent-Logs-Url: https://github.com/interlink-hq/interlink-slurm-plugin/sessions/4badfd6e-a4b2-4ee2-b9e2-96b592f06b37
Co-authored-by: dciangot <4144326+dciangot@users.noreply.github.com>
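The "available = total - allocated" reporting described in the commit message can be sketched roughly as follows. This is a minimal illustration, not the plugin's code: the commit names clampToZero() but not its signature, so the signature here is an assumption, and availableCPU and availableMemoryMi are hypothetical helpers that assume memory figures are already expressed in MiB.

```go
package main

import (
	"fmt"
	"strconv"
)

// clampToZero returns v, or 0 when v is negative. The helper name comes from
// the commit message; this signature is an assumption.
func clampToZero(v int64) int64 {
	if v < 0 {
		return 0
	}
	return v
}

// availableCPU (hypothetical) reports total minus allocated CPUs as a
// plain-integer Kubernetes quantity string such as "24".
func availableCPU(total, allocated int64) string {
	return strconv.FormatInt(clampToZero(total-allocated), 10)
}

// availableMemoryMi (hypothetical) reports total minus allocated memory,
// both given in MiB, as a quantity string such as "96000Mi".
func availableMemoryMi(totalMi, allocatedMi int64) string {
	return strconv.FormatInt(clampToZero(totalMi-allocatedMi), 10) + "Mi"
}

func main() {
	fmt.Println(availableCPU(32, 8))              // 24
	fmt.Println(availableMemoryMi(128000, 32000)) // 96000Mi
	fmt.Println(availableCPU(4, 6))               // 0 (over-allocation is clamped)
}
```

Clamping in one place keeps a transiently over-allocated cluster from producing a negative quantity string, which the virtual kubelet would otherwise have to reject.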
@copilot also make it possible to pass taints in the response to the ping call
Agent-Logs-Url: https://github.com/interlink-hq/interLink/sessions/ab8b0dc8-951f-4aaa-85b8-26f7028530f4
Co-authored-by: dciangot <4144326+dciangot@users.noreply.github.com>
Done in c714361. Plugins can now include a taints field in the ping response:
{
  "status": "ok",
  "resources": { "cpu": "128", "memory": "512Gi" },
  "taints": [
    { "key": "vendor.io/maintenance", "value": "true", "effect": "NoSchedule" }
  ]
}
When …
Node resource capacity (CPU, memory, pods, GPUs, FPGAs) and taints were fixed at initialization and never updated based on plugin state. This adds support for plugins to report live resource availability and node taints via the existing /pinglink response body.
Changes
New types (pkg/interlink/types.go)
- PingResponse: structured ping response envelope with optional resources and taints fields
- ResourcesResponse: resource capacities with lowercase JSON keys (cpu, memory, pods, accelerators)
- AcceleratorResponse: per-accelerator entry (resourceType + available)
- TaintResponse: per-taint entry (key, value, effect)
Dynamic update logic (pkg/virtualkubelet/virtualkubelet.go)
- updateNodeResources() method validates quantities via resource.ParseQuantity and updates both node.Status.Capacity and node.Status.Allocatable; invalid values emit a warning and leave the field unchanged
- updateNodeTaints() method replaces all non-system taints with the plugin-supplied list; the built-in virtual-node.interlink/no-schedule taint is always preserved; unknown effects default to NoSchedule with a warning
- nodeUpdate() now attempts to unmarshal successful ping responses as PingResponse; if a resources or taints field is present, it applies the respective update before calling onNodeChangeCallback
Backward compatibility
Non-JSON or resource/taint-less responses (existing plugin behavior) are silently ignored — no behavior change for current plugins.
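The taint handling described under "Dynamic update logic" could look roughly like the sketch below. It is an illustration under stated assumptions, not the PR's updateNodeTaints(): Taint is a local stand-in for the Kubernetes core/v1 taint type, mergeTaints is a hypothetical name, and "non-system" is interpreted here as "anything other than the built-in taint".

```go
package main

import "fmt"

// Taint is a local stand-in for the Kubernetes core/v1 Taint type, used so
// the sketch needs no external modules.
type Taint struct {
	Key    string
	Value  string
	Effect string
}

// builtinTaintKey is the built-in taint named in the PR description.
const builtinTaintKey = "virtual-node.interlink/no-schedule"

// mergeTaints (hypothetical) keeps the built-in taint from current, drops
// every other existing taint, and appends the plugin-supplied list. Unknown
// effects fall back to NoSchedule with a warning.
func mergeTaints(current, fromPlugin []Taint) []Taint {
	merged := []Taint{}
	for _, t := range current {
		if t.Key == builtinTaintKey {
			merged = append(merged, t)
		}
	}
	for _, t := range fromPlugin {
		if t.Key == builtinTaintKey {
			continue // assumption: plugins cannot override the built-in taint
		}
		switch t.Effect {
		case "NoSchedule", "PreferNoSchedule", "NoExecute":
			// valid Kubernetes taint effect, keep as-is
		default:
			fmt.Printf("warning: unknown effect %q on taint %q, using NoSchedule\n", t.Effect, t.Key)
			t.Effect = "NoSchedule"
		}
		merged = append(merged, t)
	}
	return merged
}

func main() {
	current := []Taint{
		{Key: builtinTaintKey, Value: "true", Effect: "NoSchedule"},
		{Key: "vendor.io/old", Value: "true", Effect: "NoExecute"},
	}
	plugin := []Taint{{Key: "vendor.io/maintenance", Value: "true", Effect: "NoSchedule"}}
	for _, t := range mergeTaints(current, plugin) {
		fmt.Printf("%s=%s:%s\n", t.Key, t.Value, t.Effect)
	}
}
```

Passing an empty plugin list leaves only the built-in taint, which matches the "empty array clears plugin-managed taints" behavior stated in the description.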
Example plugin response
{
  "status": "ok",
  "resources": {
    "cpu": "128",
    "memory": "512Gi",
    "pods": "1000",
    "accelerators": [
      { "resourceType": "nvidia.com/gpu", "available": "8" }
    ]
  },
  "taints": [
    { "key": "vendor.io/maintenance", "value": "true", "effect": "NoSchedule" }
  ]
}
Omitted fields retain their current configured values, so partial updates are supported. When taints is present as an empty array ([]), all plugin-managed taints are cleared.
Original prompt
Overview
Currently, the Virtual Kubelet node's resource capacity (CPU, Memory, GPU, etc.) is set only once during initialization in NewProviderConfig() and never updated dynamically based on plugin feedback.
Problem Statement
The InterLink system periodically pings the plugin to check its status (every 30 seconds in the nodeUpdate() function), but only updates the node's connectivity condition. There is no mechanism to dynamically update node resource capacity based on information returned by the plugin.
Proposed Solution
Add the capability to dynamically update node resources based on plugin responses:
- Extend the ping response format to optionally include resource information (CPU, Memory, Pod count, GPUs, FPGAs, etc.)
- Parse plugin responses to extract resource data when available
- Update node resources dynamically by modifying node.Status.Capacity and node.Status.Allocatable when resource information is received
- Maintain backward compatibility: if the plugin doesn't return resource data, continue using the statically configured values
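The parse step and its backward-compatible fallback can be sketched as below. The struct and field names follow the JSON keys given in this PR, but the exact Go definitions are assumptions, and parsePing is a hypothetical helper rather than the function added to virtualkubelet.go.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// AcceleratorResponse mirrors one entry of the "accelerators" array.
type AcceleratorResponse struct {
	ResourceType string `json:"resourceType"`
	Available    string `json:"available"`
}

// ResourcesResponse mirrors the "resources" object; empty fields mean
// "not reported".
type ResourcesResponse struct {
	CPU          string                `json:"cpu,omitempty"`
	Memory       string                `json:"memory,omitempty"`
	Pods         string                `json:"pods,omitempty"`
	Accelerators []AcceleratorResponse `json:"accelerators,omitempty"`
}

// PingResponse is the optional structured ping body. Resources is a pointer
// so an absent field can be told apart from an empty object.
type PingResponse struct {
	Status    string             `json:"status"`
	Resources *ResourcesResponse `json:"resources,omitempty"`
}

// parsePing (hypothetical) returns the reported resources, or nil when the
// body is not JSON or carries no "resources" field. A nil result means the
// caller keeps the statically configured values.
func parsePing(body []byte) *ResourcesResponse {
	var pr PingResponse
	if err := json.Unmarshal(body, &pr); err != nil {
		return nil // legacy plain-text reply: ignore silently
	}
	return pr.Resources
}

func main() {
	bodies := []string{
		"ok",
		`{"status":"ok"}`,
		`{"status":"ok","resources":{"cpu":"100","memory":"256Gi"}}`,
	}
	for _, body := range bodies {
		if r := parsePing([]byte(body)); r != nil {
			fmt.Printf("update: cpu=%s memory=%s\n", r.CPU, r.Memory)
		} else {
			fmt.Println("no resource data, keeping configured values")
		}
	}
}
```

Treating a failed unmarshal as "no update" rather than as an error is what keeps existing plugins working unchanged.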
Implementation Details
Changes Required:
- Plugin Response Structure: define an optional resource update format in the plugin response that includes:
{
  "status": "ok",
  "resources": {
    "cpu": "100",
    "memory": "256Gi",
    "pods": "1000",
    "accelerators": [
      { "resourceType": "nvidia.com/gpu", "available": "8" }
    ]
  }
}
- Update the nodeUpdate() function to call the updateNodeResources() function with the extracted data and update p.node.Status.Capacity and p.node.Status.Allocatable
- Add helper function updateNodeResources() to safely update node resource spec based on parsed response
- Add validation to ensure resource values are valid Kubernetes quantities
- Add logging for resource updates to help with debugging
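The validate-then-update step listed above might be sketched like this. To stay dependency-free, the sketch replaces resource.ParseQuantity with a deliberately simplified regular expression that accepts only a subset of Kubernetes quantity syntax, and it models the capacity and allocatable lists as plain string maps; applyUpdate and quantityRe are hypothetical names.

```go
package main

import (
	"fmt"
	"log"
	"regexp"
)

// quantityRe is a simplified stand-in for resource.ParseQuantity (an
// assumption made to avoid external modules). It accepts unsigned numbers
// with an optional common suffix, which is only a subset of real quantities.
var quantityRe = regexp.MustCompile(`^[0-9]+(\.[0-9]+)?(m|k|M|G|T|P|E|Ki|Mi|Gi|Ti|Pi|Ei)?$`)

// applyUpdate (hypothetical) writes each valid value from update into both
// capacity and allocatable. Empty values are skipped and invalid values are
// logged and ignored, so the previous entry stays in place.
func applyUpdate(capacity, allocatable, update map[string]string) {
	for name, value := range update {
		if value == "" {
			continue // field omitted by the plugin: partial update
		}
		if !quantityRe.MatchString(value) {
			log.Printf("warning: ignoring invalid quantity %q for %s", value, name)
			continue
		}
		log.Printf("updating %s: %q -> %q", name, capacity[name], value)
		capacity[name] = value
		allocatable[name] = value
	}
}

func main() {
	capacity := map[string]string{"cpu": "10", "memory": "64Gi", "pods": "100"}
	allocatable := map[string]string{"cpu": "10", "memory": "64Gi", "pods": "100"}
	applyUpdate(capacity, allocatable, map[string]string{
		"cpu":            "100",
		"memory":         "lots", // invalid: memory keeps its old value
		"nvidia.com/gpu": "8",
	})
	fmt.Println(capacity["cpu"], capacity["memory"], capacity["pods"], capacity["nvidia.com/gpu"])
}
```

Validating each field independently means one malformed value does not block the remaining updates in the same response.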
Backward Compatibility:
Benefits
The following is the prior conversation context from the user's chat exploration (may be truncated):
User: is there a point where the vk can update the node resource spec based on interlinkplugin reply?
Assistant: I'll search the repository to understand how the virtual kubelet (vk) handles updates to the node resource spec based on InterLink plugin replies.
Assistant: Based on the code search results, I can see there is indeed a mechanism for the Virtual Kubelet (VK) to update the node resource spec based on InterLink plugin replies. Let me show you the specific points:
Node Update Mechanism
The primary point where the VK updates the node resource spec based on plugin replies is in the nodeUpdate() function in pkg/virtualkubelet/virtualkubelet.go:
Assistant: Yes, there is a specific point where the VK (Virtual Kubelet) can update the node resource spec based on the InterLink plugin reply. Let me explain the key mechanism:
Node Update Point
The main mechanism for updating node resources based on plugin responses is in the nodeUpdate() function (lines 599-660), which runs continuously and performs a periodic health check by pinging the InterLink plugin.
How It Works
This pull request was created from Copilot chat.