feat: FournosCluster controller with GPU discovery, sentinel locking, and Kueue integration#74
feat: FournosCluster controller with GPU discovery, sentinel locking, and Kueue integration#74MML-coder wants to merge 8 commits into
Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (2)
✅ Files skipped from review due to trivial changes (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughAdds a FournosCluster operator and CRD for kubeconfig-managed cluster discovery, GPU inventory, Kueue quota wiring, sentinel lock jobs and lockOnly mode; includes operator wiring, manifests/RBAC, packaging, and unit tests. ChangesFournosCluster Operator & Lock-Only Job Mode
Estimated code review effort 🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Suggested labels
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Tip 💬 Introducing Slack Agent: The best way for teams to turn conversations into code.Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.
Built for teams:
One agent for your entire SDLC. Right inside Slack. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🧹 Nitpick comments (1)
fournos/handlers/psapcluster.py (1)
52-64: 💤 Low valueConsider exposing a public method on
ctx.registryinstead of accessing_k8sdirectly.Accessing the private
_k8sattribute couples this handler to the internal implementation ofClusterRegistry. Consider adding a public method likectx.registry.secret_exists(name, namespace)or using a shared CoreV1Api instance.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@fournos/handlers/psapcluster.py` around lines 52 - 64, Update _check_kubeconfig to stop accessing the private ctx.registry._k8s and call a new public API on the registry (e.g., ctx.registry.secret_exists(secret_name, namespace)) that encapsulates the read_namespaced_secret logic; add the corresponding method on ClusterRegistry (secret_exists(name, namespace)) which returns a boolean or raises a controlled exception for non-404 errors, then have _check_kubeconfig map True->"Valid", False->"Missing" and handle other errors as "Invalid" while logging the exception via logger.warning as before.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@fournos/core/gpu_discovery.py`:
- Around line 125-139: The timeout on the Kubernetes call is ineffective because
setting api_client.configuration.request_timeout doesn't apply to individual
calls; update the code that builds the client and calls list_node so that you
remove the assignment to api_client.configuration.request_timeout and instead
call target_core.list_node(...) passing
_request_timeout=(settings.gpu_discovery_timeout_sec,
settings.gpu_discovery_timeout_sec); keep the existing try/except that wraps the
call and raise GPUDiscoveryError on exception, and ensure api_client.close()
remains in the finally block.
In `@fournos/core/psapcluster.py`:
- Around line 181-185: The stopPolicy patch handler currently catches
ApiException but ApiException is not imported; update the exception handling in
that handler to catch client.exceptions.ApiException instead (i.e., change the
except block that reads "except ApiException as exc:" to use
client.exceptions.ApiException) so it matches other handlers in this module and
avoids a NameError; ensure the logger.warning and raise behavior remain
unchanged and reference cq_name as before.
In `@README.md`:
- Around line 307-312: The fenced code block showing the table lacks a language
identifier (MD040); update the backticks surrounding the block (the triple-fence
that begins the block containing "NAME OWNER GPUS KUBECONFIG LOCKED AGE") to
include a language tag such as "text" (e.g., change ``` to ```text) so
markdownlint no longer flags the code fence.
---
Nitpick comments:
In `@fournos/handlers/psapcluster.py`:
- Around line 52-64: Update _check_kubeconfig to stop accessing the private
ctx.registry._k8s and call a new public API on the registry (e.g.,
ctx.registry.secret_exists(secret_name, namespace)) that encapsulates the
read_namespaced_secret logic; add the corresponding method on ClusterRegistry
(secret_exists(name, namespace)) which returns a boolean or raises a controlled
exception for non-404 errors, then have _check_kubeconfig map True->"Valid",
False->"Missing" and handle other errors as "Invalid" while logging the
exception via logger.warning as before.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: e083bc62-0476-4762-9cde-25f919d35cc5
📒 Files selected for processing (15)
README.mddev/mock-psapcluster.yamlfournos/core/constants.pyfournos/core/gpu_discovery.pyfournos/core/psapcluster.pyfournos/handlers/__init__.pyfournos/handlers/psapcluster.pyfournos/operator.pyfournos/settings.pyfournos/state.pymanifests/psapcluster-crd.yamlmanifests/rbac/role_fournos.yamltests/unit/test_gpu_discovery.pytests/unit/test_psapcluster_handlers.pytests/unit/test_psapcluster_manager.py
There was a problem hiding this comment.
Actionable comments posted: 7
🧹 Nitpick comments (1)
fournos/operator.py (1)
169-177: ⚡ Quick winRun discovery once before the first interval sleep.
Current loop delays initial discovery until the first timer elapses. Doing an immediate first pass reduces startup lag for newly added kubeconfig secrets.
Proposed fix
def _discovery_loop(): interval = settings.cluster_discovery_interval_sec while True: - time.sleep(interval) try: handlers.scan_clusters() except Exception: logger.exception("Cluster discovery scan failed") + time.sleep(interval)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@fournos/operator.py` around lines 169 - 177, The discovery loop in _discovery_loop currently waits for settings.cluster_discovery_interval_sec before the first scan; change the loop to perform an immediate scan on startup by calling handlers.scan_clusters() (inside the existing try/except that logs via logger.exception) before sleeping, i.e. move the sleep to occur after the scan so the sequence is: try: handlers.scan_clusters() except Exception: logger.exception(...) then time.sleep(settings.cluster_discovery_interval_sec), and repeat.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@fournos/core/discovery.py`:
- Around line 47-63: The loop that processes each secret should not abort the
entire discovery pass on a single failure; wrap the per-cluster work (after
cluster_name resolution) in a try/except so exceptions from
extract_cluster_name, self._kueue.create_flavor,
self._kueue.add_flavor_to_cluster_queue, or self._create_psapcluster are caught,
logged with contextual info (cluster_name and secret_name) and then continue to
the next secret; ensure you only append to discovered on success so failed
clusters are not marked discovered and rethrow nothing so the outer discovery
loop proceeds.
In `@fournos/core/kueue.py`:
- Around line 215-271: add_flavor_to_cluster_queue (and similarly
update_flavor_quotas) currently does a read-modify-write on the global
ClusterQueue and can lose concurrent changes; modify these functions to perform
optimistic retry-on-conflict by reading and including the object's current
metadata.resourceVersion when patching and retrying on ApiException conflict
(409) a few times, or alternatively switch to a narrower patch/Server-Side Apply
that targets only the specific flavor element instead of replacing
spec.resourceGroups; locate add_flavor_to_cluster_queue and update_flavor_quotas
in kueue.py and implement a loop that GETs the ClusterQueue, computes the
change, issues a PATCH/SSApply with resourceVersion or field-specific patch, and
retries on conflict until success or max attempts.
In `@fournos/handlers/lifecycle.py`:
- Around line 52-63: Reject combinations where lockOnly is true but exclusive is
false instead of silently overriding: in the validation blocks that read
exclusive = spec["exclusive"] and lock_only = spec.get("lockOnly", False) (and
the duplicate block around lines with similar logic), add a check that if
lock_only is True and exclusive is False then set patch.status["phase"] =
Phase.FAILED and patch.status["message"] to a clear message like "lockOnly: true
requires 'exclusive' to be true" and return; reference the same symbols
(exclusive, lock_only, spec) and ensure the same check is applied in both places
(the initial validation and the later similar block that currently enforces
exclusivity in _create_lock_workload()) so user input is rejected rather than
silently overridden.
In `@fournos/handlers/psapcluster.py`:
- Around line 216-220: reconcile_psapcluster currently calls
_reconcile_ttl_expiry(...) then continues to call _reconcile_lock_job(...) which
can immediately recreate a lock because patch was mutated but spec/status still
reflect the old locked state; modify _reconcile_ttl_expiry to return a boolean
(e.g., released=True if it called _release_lock()) and update
reconcile_psapcluster to check that return and stop the reconcile pass (early
return) when TTL expiry released the lock so _reconcile_lock_job is not invoked
with stale spec/status; apply the same pattern to the other reconcile sequences
mentioned (the blocks around lines 238-260 and 354-383) so each
_reconcile_ttl_expiry caller aborts further reconcile steps when it reports a
release.
- Around line 303-310: The code currently always appends a new condition into
patch.status.setdefault("conditions", []) (see the append call with
_make_condition and COND_GPU_DISCOVERED), causing duplicate COND_GPU_DISCOVERED
entries over time; change this to "upsert" by finding and removing any existing
condition whose "type" equals COND_GPU_DISCOVERED from
patch.status["conditions"] before adding the new _make_condition(...) entry.
Apply the same replacement in both the failure path (the shown append using
str(exc)) and the corresponding success path where COND_GPU_DISCOVERED is
appended, using the symbols patch.status, COND_GPU_DISCOVERED, and
_make_condition to locate and modify the code.
- Around line 342-347: The guard `and new_resources` prevents clearing Kueue
quotas when discovery reports zero GPUs; change the logic so that
`ctx.kueue.update_flavor_quotas(name, new_resources)` is called whenever
`new_resources != prev_resources` (even if `new_resources` is empty) so stale
GPU quotas are removed — update the block around the variables `prev_gpus`,
`new_resources`, `prev_resources` and the call to
`ctx.kueue.update_flavor_quotas(name, new_resources)` to remove the `and
new_resources` condition and ensure empty `new_resources` clears quotas.
In `@README.md`:
- Around line 323-325: The README's PSAPCluster onboarding is inconsistent about
manual quota setup: update the wording around the step that says "Add a
ResourceFlavor and quota for the cluster in `config/kueue-cluster-config.yaml`"
(and the repeated text at the later block) to clearly state whether this manual
step is required, optional fallback, or legacy; explicitly mention that the
controller (referenced as "controller updates ClusterQueue quotas automatically"
in the "Onboarding a new cluster" section) will normally create/update
ClusterQueue quotas for onboarded clusters and keep the manual
`config/kueue-cluster-config.yaml` step as an optional/manual fallback for edge
cases or legacy workflows, and add a short cross-reference to the "Onboarding a
new cluster" section and a one-line example/criteria when the manual step is
needed.
---
Nitpick comments:
In `@fournos/operator.py`:
- Around line 169-177: The discovery loop in _discovery_loop currently waits for
settings.cluster_discovery_interval_sec before the first scan; change the loop
to perform an immediate scan on startup by calling handlers.scan_clusters()
(inside the existing try/except that logs via logger.exception) before sleeping,
i.e. move the sleep to occur after the scan so the sequence is: try:
handlers.scan_clusters() except Exception: logger.exception(...) then
time.sleep(settings.cluster_discovery_interval_sec), and repeat.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 9c74b8b6-0fa5-4d35-a8eb-3736ecf1f197
📒 Files selected for processing (19)
README.mdconfig/forge/samples/job-full.yamlfournos/core/clusters.pyfournos/core/constants.pyfournos/core/discovery.pyfournos/core/kueue.pyfournos/handlers/__init__.pyfournos/handlers/discovery.pyfournos/handlers/execution.pyfournos/handlers/lifecycle.pyfournos/handlers/psapcluster.pyfournos/operator.pyfournos/settings.pyfournos/state.pymanifests/crd.yamlmanifests/psapcluster-crd.yamlmanifests/rbac/role_fournos.yamltests/unit/test_cluster_discovery.pytests/unit/test_psapcluster_handlers.py
✅ Files skipped from review due to trivial changes (3)
- config/forge/samples/job-full.yaml
- fournos/handlers/init.py
- manifests/psapcluster-crd.yaml
🚧 Files skipped from review as they are similar to previous changes (4)
- fournos/state.py
- fournos/core/constants.py
- manifests/rbac/role_fournos.yaml
- tests/unit/test_psapcluster_handlers.py
| for secret_name in secrets: | ||
| cluster_name = extract_cluster_name(secret_name) | ||
| if cluster_name is None: | ||
| continue | ||
|
|
||
| if cluster_name in existing_clusters: | ||
| continue | ||
|
|
||
| logger.info("Discovered new cluster %s from secret %s", cluster_name, secret_name) | ||
|
|
||
| if cluster_name not in existing_flavors: | ||
| self._kueue.create_flavor(cluster_name) | ||
|
|
||
| self._kueue.add_flavor_to_cluster_queue(cluster_name) | ||
| self._create_psapcluster(cluster_name, secret_name) | ||
| discovered.append(cluster_name) | ||
|
|
There was a problem hiding this comment.
A single onboarding failure aborts the whole discovery pass.
If one cluster fails during flavor/CQ/CR creation, the loop exits and later valid clusters are skipped for that cycle. Isolate errors per cluster and continue.
Proposed fix
discovered: list[str] = []
for secret_name in secrets:
cluster_name = extract_cluster_name(secret_name)
if cluster_name is None:
continue
if cluster_name in existing_clusters:
continue
logger.info("Discovered new cluster %s from secret %s", cluster_name, secret_name)
- if cluster_name not in existing_flavors:
- self._kueue.create_flavor(cluster_name)
-
- self._kueue.add_flavor_to_cluster_queue(cluster_name)
- self._create_psapcluster(cluster_name, secret_name)
- discovered.append(cluster_name)
+ try:
+ if cluster_name not in existing_flavors:
+ self._kueue.create_flavor(cluster_name)
+ existing_flavors.add(cluster_name)
+
+ self._kueue.add_flavor_to_cluster_queue(cluster_name)
+ self._create_psapcluster(cluster_name, secret_name)
+ discovered.append(cluster_name)
+ except Exception:
+ logger.exception(
+ "Failed onboarding discovered cluster %s from secret %s",
+ cluster_name,
+ secret_name,
+ )
+ continue🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@fournos/core/discovery.py` around lines 47 - 63, The loop that processes each
secret should not abort the entire discovery pass on a single failure; wrap the
per-cluster work (after cluster_name resolution) in a try/except so exceptions
from extract_cluster_name, self._kueue.create_flavor,
self._kueue.add_flavor_to_cluster_queue, or self._create_psapcluster are caught,
logged with contextual info (cluster_name and secret_name) and then continue to
the next secret; ensure you only append to discovered on success so failed
clusters are not marked discovered and rethrow nothing so the outer discovery
loop proceeds.
There was a problem hiding this comment.
looks like a valid concern
There was a problem hiding this comment.
@kpouget, glad we're aligned! The fix is straightforward — wrapping the per-cluster onboarding block in a try/except ensures that a transient API error (e.g., a flaky Kueue patch or a race on PSAPCluster creation) for one cluster doesn't silently drop all subsequent clusters from that scan cycle.
The proposed fix from the review comment is ready to apply as-is:
discovered: list[str] = []
for secret_name in secrets:
cluster_name = extract_cluster_name(secret_name)
if cluster_name is None:
continue
if cluster_name in existing_clusters:
continue
logger.info("Discovered new cluster %s from secret %s", cluster_name, secret_name)
try:
if cluster_name not in existing_flavors:
self._kueue.create_flavor(cluster_name)
existing_flavors.add(cluster_name)
self._kueue.add_flavor_to_cluster_queue(cluster_name)
self._create_psapcluster(cluster_name, secret_name)
discovered.append(cluster_name)
except Exception:
logger.exception(
"Failed onboarding discovered cluster %s from secret %s",
cluster_name,
secret_name,
)
continueNote also the existing_flavors.add(cluster_name) after a successful create_flavor call — this keeps the in-memory set consistent so that if the same flavor name somehow appears again later in the same scan, it won't attempt a redundant creation.
Would you like me to plan these code changes for you?
There was a problem hiding this comment.
Good suggestion to incorporate related to transient api error.
There was a problem hiding this comment.
Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!
| exclusive = spec["exclusive"] | ||
| lock_only = spec.get("lockOnly", False) | ||
|
|
||
| if lock_only and not cluster: | ||
| patch.status["phase"] = Phase.FAILED | ||
| patch.status["message"] = "lockOnly: true requires 'cluster' to be set" | ||
| return | ||
|
|
||
| if not lock_only and not spec.get("forge"): | ||
| patch.status["phase"] = Phase.FAILED | ||
| patch.status["message"] = "spec.forge is required for non-lockOnly jobs" | ||
| return |
There was a problem hiding this comment.
Fail fast on lockOnly: true + exclusive: false instead of silently overriding.
Current flow accepts that combo but _create_lock_workload() enforces exclusive anyway, so user intent is ignored. Rejecting this input makes behavior explicit and debuggable.
Proposed fix
def on_create(spec, name, namespace, status, patch, body):
@@
cluster = spec.get("cluster")
exclusive = spec["exclusive"]
lock_only = spec.get("lockOnly", False)
+ if lock_only and not exclusive:
+ patch.status["phase"] = Phase.FAILED
+ patch.status["message"] = "lockOnly jobs require 'exclusive: true'"
+ return
+
if lock_only and not cluster:
patch.status["phase"] = Phase.FAILED
patch.status["message"] = "lockOnly: true requires 'cluster' to be set"
returnAlso applies to: 95-106
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@fournos/handlers/lifecycle.py` around lines 52 - 63, Reject combinations
where lockOnly is true but exclusive is false instead of silently overriding: in
the validation blocks that read exclusive = spec["exclusive"] and lock_only =
spec.get("lockOnly", False) (and the duplicate block around lines with similar
logic), add a check that if lock_only is True and exclusive is False then set
patch.status["phase"] = Phase.FAILED and patch.status["message"] to a clear
message like "lockOnly: true requires 'exclusive' to be true" and return;
reference the same symbols (exclusive, lock_only, spec) and ensure the same
check is applied in both places (the initial validation and the later similar
block that currently enforces exclusivity in _create_lock_workload()) so user
input is rejected rather than silently overridden.
| def reconcile_psapcluster(spec, name, namespace, status, patch, body): | ||
| _reconcile_kubeconfig(spec, name, status, patch) | ||
| _reconcile_ttl_expiry(spec, name, status, patch) | ||
| _reconcile_gpu_discovery(spec, name, status, patch) | ||
| _reconcile_lock_job(spec, name, status, patch) |
There was a problem hiding this comment.
Stop the reconcile pass after TTL expiry releases the lock.
_reconcile_ttl_expiry() only updates patch, but the same pass still calls _reconcile_lock_job() with the old spec/status. After _release_lock() deletes the sentinel, the self-healing branch still sees status.locked == True and the old spec.owner, so it can recreate the lock immediately and make TTL expiry ineffective.
Suggested fix
def reconcile_psapcluster(spec, name, namespace, status, patch, body):
_reconcile_kubeconfig(spec, name, status, patch)
- _reconcile_ttl_expiry(spec, name, status, patch)
+ if _reconcile_ttl_expiry(spec, name, status, patch):
+ return
_reconcile_gpu_discovery(spec, name, status, patch)
_reconcile_lock_job(spec, name, status, patch)
-
def _reconcile_ttl_expiry(spec, name, status, patch):
if not status.get("locked"):
- return
+ return False
...
if datetime.now(timezone.utc) >= expires_at:
prev_owner = spec.get("owner", "unknown")
logger.info(
"PSAPCluster %s: ownership expired (was owned by %s)", name, prev_owner
)
patch.spec["owner"] = ""
_release_lock(name, patch)
+ return True
+ return FalseAlso applies to: 238-260, 354-383
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@fournos/handlers/psapcluster.py` around lines 216 - 220,
reconcile_psapcluster currently calls _reconcile_ttl_expiry(...) then continues
to call _reconcile_lock_job(...) which can immediately recreate a lock because
patch was mutated but spec/status still reflect the old locked state; modify
_reconcile_ttl_expiry to return a boolean (e.g., released=True if it called
_release_lock()) and update reconcile_psapcluster to check that return and stop
the reconcile pass (early return) when TTL expiry released the lock so
_reconcile_lock_job is not invoked with stale spec/status; apply the same
pattern to the other reconcile sequences mentioned (the blocks around lines
238-260 and 354-383) so each _reconcile_ttl_expiry caller aborts further
reconcile steps when it reports a release.
| patch.status.setdefault("conditions", []).append( | ||
| _make_condition( | ||
| COND_GPU_DISCOVERED, | ||
| "False", | ||
| "DiscoveryFailed", | ||
| str(exc), | ||
| ) | ||
| ) |
There was a problem hiding this comment.
Upsert conditions instead of appending duplicates on every timer run.
Both the success and failure paths append a fresh COND_GPU_DISCOVERED entry every reconciliation. On a long-lived cluster this grows status.conditions without bound and makes consumers pick between duplicate condition records. Replace the existing condition by type instead of always appending.
Also applies to: 338-340
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@fournos/handlers/psapcluster.py` around lines 303 - 310, The code currently
always appends a new condition into patch.status.setdefault("conditions", [])
(see the append call with _make_condition and COND_GPU_DISCOVERED), causing
duplicate COND_GPU_DISCOVERED entries over time; change this to "upsert" by
finding and removing any existing condition whose "type" equals
COND_GPU_DISCOVERED from patch.status["conditions"] before adding the new
_make_condition(...) entry. Apply the same replacement in both the failure path
(the shown append using str(exc)) and the corresponding success path where
COND_GPU_DISCOVERED is appended, using the symbols patch.status,
COND_GPU_DISCOVERED, and _make_condition to locate and modify the code.
| prev_gpus = hardware.get("gpus", []) | ||
| new_resources = [(g.short_name, g.count) for g in result.gpus] | ||
| prev_resources = [(g.get("shortName"), g.get("count")) for g in prev_gpus] | ||
| if new_resources != prev_resources and new_resources: | ||
| try: | ||
| ctx.kueue.update_flavor_quotas(name, new_resources) |
There was a problem hiding this comment.
Clear Kueue quotas when discovery drops to zero GPUs.
The and new_resources guard skips update_flavor_quotas() when a cluster previously had GPUs but discovery now returns none. That leaves stale GPU quota in Kueue and can schedule workloads onto capacity that no longer exists.
Suggested fix
- if new_resources != prev_resources and new_resources:
+ if new_resources != prev_resources:
try:
ctx.kueue.update_flavor_quotas(name, new_resources)
except Exception as exc:
logger.warning(
"PSAPCluster %s: failed to update flavor quotas: %s", name, exc📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| prev_gpus = hardware.get("gpus", []) | |
| new_resources = [(g.short_name, g.count) for g in result.gpus] | |
| prev_resources = [(g.get("shortName"), g.get("count")) for g in prev_gpus] | |
| if new_resources != prev_resources and new_resources: | |
| try: | |
| ctx.kueue.update_flavor_quotas(name, new_resources) | |
| prev_gpus = hardware.get("gpus", []) | |
| new_resources = [(g.short_name, g.count) for g in result.gpus] | |
| prev_resources = [(g.get("shortName"), g.get("count")) for g in prev_gpus] | |
| if new_resources != prev_resources: | |
| try: | |
| ctx.kueue.update_flavor_quotas(name, new_resources) |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@fournos/handlers/psapcluster.py` around lines 342 - 347, The guard `and
new_resources` prevents clearing Kueue quotas when discovery reports zero GPUs;
change the logic so that `ctx.kueue.update_flavor_quotas(name, new_resources)`
is called whenever `new_resources != prev_resources` (even if `new_resources` is
empty) so stale GPU quotas are removed — update the block around the variables
`prev_gpus`, `new_resources`, `prev_resources` and the call to
`ctx.kueue.update_flavor_quotas(name, new_resources)` to remove the `and
new_resources` condition and ensure empty `new_resources` clears quotas.
| if cluster_name in existing_clusters: | ||
| continue |
There was a problem hiding this comment.
we need to make sure that the objects (flavor, etc) are there, even if cluster_name in existing_clusters
they need to
- exist
- have the right content
(I think they need to be created every time)
| GPU_MODEL_MAP: dict[str, str] = { | ||
| "NVIDIA-A100-SXM4-80GB": "a100", | ||
| "NVIDIA-A100-SXM4-40GB": "a100", | ||
| "NVIDIA-A100-PCIE-80GB": "a100", | ||
| "NVIDIA-A100-PCIE-40GB": "a100", | ||
| "NVIDIA-H100-SXM5-80GB": "h100", | ||
| "NVIDIA-H100-PCIE-80GB": "h100", | ||
| "NVIDIA-H200-SXM-141GB": "h200", | ||
| "AMD-Instinct-MI300X": "mi300x", | ||
| } |
There was a problem hiding this comment.
the can be in a user-configurable config map
(can be done later. If we don't do it in the PR, need to fill a ticket so that we don't forget it)
kpouget
left a comment
There was a problem hiding this comment.
@MML-coder , could you turn this into a controller fully independent from FOURNOS ?
you create a secondary top directory, and only important the relevant FOURNOS bits (constants, I assume)
we'll also need a name for the controller,
and maybe use another name than "PSAP"
Maybe FournosCluster, or something alike if we find another nice name
| apiVersion: apiextensions.k8s.io/v1 | ||
| kind: CustomResourceDefinition |
There was a problem hiding this comment.
we'll rediscuss it, but I think we'll want to have
spec:
resources:
h200: 8
with spec.resources either user provided or dynamically looked up if spec.resources: {}
There was a problem hiding this comment.
Are you thinking of user could provide how many GPus they want to use by specifying spec.resources: 2 or 8 ? and the job can use available GPUs for the job?
There was a problem hiding this comment.
there are two things I have in mind:
- we want the controller to do this only once (hence the
specfield. Like a dynamic default value, not a truly reconciliated field) - in the future, we want to allow the customization of the resources list with SW properties (TASK: refine the future behavior of the exclusive/cluster/hardware flags #70)
…ocking Introduces a PSAPCluster custom resource and kopf controller that unifies cluster definition, discovers GPU hardware dynamically, manages per-cluster Kueue ClusterQueues, and enables ownership locking via stopPolicy. - PSAPCluster CRD (fournos.dev/v1) with spec fields: kubeconfigSecret, owner, ttl, evict, gpuDiscoveryInterval - GPU discovery client that connects to target clusters via kubeconfig secrets, lists nodes, and aggregates GPU counts by vendor/model - Per-cluster ClusterQueues in a shared cohort with self-healing - Ownership locking with Hold/HoldAndDrain stopPolicy and TTL auto-expiry - Timer-based reconciliation: kubeconfig validation, TTL expiry, GPU discovery with backoff, CQ/RF self-healing - 82 unit tests covering all modules Closes #37, closes #51
Replace per-cluster ClusterQueue stopPolicy locking with sentinel FournosJob approach. A lockOnly sentinel holds all 100 cluster-slots via Kueue, blocking competing workloads without requiring per-cluster CQ infrastructure. Add auto-discovery daemon that scans kubeconfig secrets every 60s and auto-creates ResourceFlavor, ClusterQueue flavor entry, and PSAPCluster CR for each new cluster found. Key changes: - Add lockOnly field to FournosJob CRD, skip Forge for lockOnly jobs - Rewrite PSAPCluster handlers for sentinel create/delete lifecycle - Remove PSAPClusterManager (fournos/core/psapcluster.py) - Add create_flavor, add_flavor_to_cluster_queue, update_flavor_quotas to KueueClient - Add ClusterDiscovery with extract_cluster_name, list_kubeconfig_secrets - Wire discovery daemon thread in operator startup - Fix RBAC: add create/delete for fournosjobs, create for psapclusters - Add end-to-end locking test walkthrough to README - 28 new discovery unit tests, rewritten PSAPCluster handler tests
…ter service Move the cluster management logic (GPU discovery, sentinel locking, auto-discovery, Kueue integration) out of the fournos operator into a standalone kopf-based controller in fournos-cluster/. The new controller watches kubeconfig secrets in psap-secrets and manages FournosCluster CRs in psap-automation. It runs as its own Deployment with dedicated RBAC and container image. Changes to the fournos operator: - Remove PSAPCluster handlers, discovery daemon, GPU discovery, and related settings/state - Fix lifecycle validation to check spec.executionEngine instead of spec.forge - Remove psapclusters RBAC from role_fournos.yaml Tested on psap-automation cluster: - Auto-discovery from labeled secrets - GPU discovery (8x NVIDIA on athena-fire) - Sentinel locking (blocks competing workloads via Kueue) - Lock release (sentinel deleted, blocked workloads admitted)
efa8139 to
89627f7
Compare
There was a problem hiding this comment.
Actionable comments posted: 6
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
fournos/core/constants.py (1)
30-31:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winFix formatter drift causing CI failure.
Line 30introduces spacing thatruff format --checkwants to rewrite, and lint is currently failing on this PR. Please runruff format(or normalize these blank lines manually) before merge.Suggested minimal fix
class Shutdown(StrEnum): STOP = "Stop" TERMINATE = "Terminate" - -🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@fournos/core/constants.py` around lines 30 - 31, Remove the unexpected extra blank line(s) introduced in fournos/core/constants.py (around the top of the module where module-level constants are defined) so the file matches ruff's formatting expectations; run "ruff format" (or manually normalize the blank lines to a single blank line between top-level sections) and re-run "ruff format --check" to ensure the formatter drift is fixed before merging.fournos/handlers/lifecycle.py (1)
1-272:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winCI is failing on
ruff format --check.The lint job reports formatting changes needed for this file. Run
ruff format(orruff format --write) and commit before merge, otherwise CI will keep blocking the PR.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@fournos/handlers/lifecycle.py` around lines 1 - 272, The file fails ruff formatting checks; run the auto-formatter and commit the updated file. Specifically, open the lifecycle handlers containing functions on_create, _create_lock_workload, _find_exclusive_locker, _pending_status, and reconcile_pending, run `ruff format` (or `ruff format --write`) to apply the required whitespace/line-break/style fixes, verify no remaining ruff format --check errors, and commit the formatted changes so CI stops failing.
♻️ Duplicate comments (3)
fournos-cluster/fournos_cluster/core/gpu_discovery.py (1)
125-139:⚠️ Potential issue | 🔴 Critical | ⚡ Quick winTimeout is still configured on
api_client.configurationand has no effect at call time.The Kubernetes Python client does not honor
Configuration.request_timeout— that attribute is not consulted byApiClient.call_api()/RESTClient.request(). The timeout must be passed per-call as_request_timeout. As written,list_node()will block indefinitely against an unreachable target cluster, which directly defeats the intent ofgpu_discovery_timeout_secand can stall the reconciler thread.A previous review raised this same issue on an earlier commit; the implementation was not updated.
Does kubernetes-client/python honor Configuration.request_timeout for individual API calls, or must _request_timeout be passed per call?🛠️ Suggested fix
api_client = k8s_config.new_client_from_config_dict(kubeconfig_dict) - api_client.configuration.request_timeout = ( - settings.gpu_discovery_timeout_sec, - settings.gpu_discovery_timeout_sec, - ) + timeout = ( + settings.gpu_discovery_timeout_sec, + settings.gpu_discovery_timeout_sec, + ) try: target_core = client.CoreV1Api(api_client) - nodes = target_core.list_node() + nodes = target_core.list_node(_request_timeout=timeout) except Exception as exc: raise GPUDiscoveryError( f"Failed to connect to cluster {cluster_name!r}: {exc}" ) from exc finally: api_client.close()🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@fournos-cluster/fournos_cluster/core/gpu_discovery.py` around lines 125 - 139, The code configures api_client.configuration.request_timeout which the Kubernetes Python client ignores; change the call that fetches nodes (target_core.list_node) to pass the timeout per-call via the internal _request_timeout parameter (e.g., _request_timeout=(settings.gpu_discovery_timeout_sec, settings.gpu_discovery_timeout_sec)) so the call will time out; keep the existing try/except raising GPUDiscoveryError on failure and the api_client.close() in the finally block.README.md (1)
323-345:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winStep 2 still contradicts step 3 / "controller automatically …".
The PSAPCluster onboarding section still tells users to manually add a ResourceFlavor and quota in step 2, while the bullet a few lines below states the controller "Discovers GPUs … and updates the global
fournos-queueClusterQueue quotas for this cluster's ResourceFlavor". The previous review on this section is still open — please clarify whether step 2 is a manual fallback or legacy flow.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@README.md` around lines 323 - 345, The documentation currently contradicts itself about who creates ResourceFlavors and quotas; update the PSAPCluster onboarding section to clearly state that the controller will auto-discover GPUs and create/update the cluster's ResourceFlavor and update the global `fournos-queue` ClusterQueue quotas when a PSAPCluster is applied, and make step 2 (the instruction to manually add a ResourceFlavor and quota in `config/kueue-cluster-config.yaml`) an explicit "manual fallback" for environments where the controller is disabled/unavailable; mention `PSAPCluster`, `ResourceFlavor`, `fournos-queue`, and `config/kueue-cluster-config.yaml` by name so readers know when to rely on automation vs perform the manual step.fournos/handlers/lifecycle.py (1)
51-89:⚠️ Potential issue | 🟡 Minor | ⚡ Quick win
lockOnly: true+exclusive: falseis still silently overridden.
exclusiveis read from the spec and checked at line 84, but forlock_only: truethe path branches into_create_lock_workload(line 88) whereexclusive=Trueis hardcoded (line 104) — so a user who explicitly setsexclusive: falsealong withlockOnly: truegets a different behavior than they declared, with no signal back.The previous review proposed rejecting this combination explicitly; that recommendation still applies to the current code.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@fournos/handlers/lifecycle.py` around lines 51 - 89, Add an explicit validation that rejects the unsupported combination of lock_only and exclusive=False: when spec["lockOnly"] (lock_only) is true and spec.get("exclusive") is present and False, set patch.status["phase"]=Phase.FAILED with a clear message (e.g. "lockOnly: true cannot be combined with exclusive: false") and return before calling _create_lock_workload; this prevents silently overriding the user's exclusive setting and keeps the existing LABEL_EXCLUSIVE_CLUSTER/_create_lock_workload behavior intact.
🧹 Nitpick comments (4)
fournos/handlers/lifecycle.py (1)
108-125: 💤 Low valueDocument the deliberate 409 fall-through.
When the create call returns 409 (workload already exists), control silently falls through to the
Phase.PENDINGblock. That's the right behavior for idempotency, but it isn't obvious from reading the code. A one-line comment would prevent a future refactor from "fixing" it into a return.♻️ Suggested clarification
except k8s_client.exceptions.ApiException as exc: - if exc.status != 409: + if exc.status != 409: patch.status["phase"] = Phase.FAILED patch.status["message"] = f"Failed to create Workload: {exc.reason}" logger.error("Job %s: Workload creation failed: %s", name, exc.reason) return + # 409: workload already exists (e.g. handler retry); proceed to PENDING.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@fournos/handlers/lifecycle.py` around lines 108 - 125, Add a one-line comment inside the except k8s_client.exceptions.ApiException as exc block explaining that a 409 status is intentionally allowed to fall through (workload already exists) for idempotent behavior; make it adjacent to the conditional that checks exc.status != 409 and mention that when exc.status == 409 we intentionally proceed to set patch.status to Phase.PENDING and set the COND_WORKLOAD_ADMITTED condition via set_condition so future maintainers don’t change the fall-through to a return.fournos-cluster/fournos_cluster/core/gpu_discovery.py (1)
116-123: 💤 Low valueSilently picking
contexts[0]can mask a malformed kubeconfig.A kubeconfig with no
current-contextand multiple contexts is unusual; defaulting to the first one without logging makes it hard to diagnose why GPU discovery hits the "wrong" cluster. Consider logging a warning so this fallback is visible in operator logs.♻️ Suggested change
if "current-context" not in kubeconfig_dict: contexts = kubeconfig_dict.get("contexts", []) if contexts: + if len(contexts) > 1: + logger.warning( + "Kubeconfig for %s has no current-context; " + "defaulting to first of %d contexts (%s)", + cluster_name, len(contexts), contexts[0].get("name"), + ) kubeconfig_dict["current-context"] = contexts[0]["name"] else: raise GPUDiscoveryError( f"Kubeconfig for {cluster_name!r} has no contexts" )🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@fournos-cluster/fournos_cluster/core/gpu_discovery.py` around lines 116 - 123, When you detect a missing "current-context" and choose to fallback to contexts[0], emit a warning log that includes cluster_name, the list of available contexts and the chosen context so the fallback is visible in operator logs; update the block that inspects kubeconfig_dict and contexts (the code that currently assigns kubeconfig_dict["current-context"] = contexts[0]["name"]) to call the module logger (or existing logger used for GPU discovery) to warn before assigning, while leaving the GPUDiscoveryError raise path intact for the empty-contexts case.fournos-cluster/fournos_cluster/handlers/cluster.py (1)
57-67: 💤 Low valueNarrow the exception in
_check_kubeconfig.Catching bare
Exceptionand only conditionally inspecting.statusmasks programming errors (e.g.,AttributeError,KeyErrorfrom a malformedspec) asInvalid. Catchkubernetes.client.exceptions.ApiExceptionfor the404 → Missing/ non-404 →Invalidmapping and let unexpected exceptions propagate to Kopf for retry/visibility.♻️ Suggested change
- try: - core.read_namespaced_secret(secret_name, settings.secrets_namespace) - return "Valid" - except Exception as exc: - if hasattr(exc, "status") and exc.status == 404: - return "Missing" - logger.warning("Error checking kubeconfig secret %s: %s", secret_name, exc) - return "Invalid" + try: + core.read_namespaced_secret(secret_name, settings.secrets_namespace) + return "Valid" + except client.exceptions.ApiException as exc: + if exc.status == 404: + return "Missing" + logger.warning("Error checking kubeconfig secret %s: %s", secret_name, exc) + return "Invalid"🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@fournos-cluster/fournos_cluster/handlers/cluster.py` around lines 57 - 67, The function _check_kubeconfig currently catches all Exception which masks programming errors; change it to only catch kubernetes.client.exceptions.ApiException when calling core.read_namespaced_secret(secret_name, settings.secrets_namespace), check exc.status == 404 to return "Missing" and return "Invalid" for other ApiException statuses, and remove the broad except so that unexpected errors (e.g., KeyError on spec or AttributeError) propagate to Kopf; reference symbols: _check_kubeconfig, spec, secret_name, core.read_namespaced_secret, settings.secrets_namespace, logger.warning, and import/use kubernetes.client.exceptions.ApiException.fournos-cluster/fournos_cluster/operator.py (1)
25-26: 💤 Low value
logging.basicConfig()withoutforce=Truewill not apply the configured level or format.By the time
@kopf.on.startup()runs, Kopf has already attached handlers to the root logger. CallingbasicConfig()withoutforce=Truereturns silently without applying changes. Useforce=Trueto override existing configuration, or calllogging.getLogger().setLevel(log_level)directly. For custom formatting, consider usingkopf run --log-format ...CLI options or add a handler with your formatter in the startup hook.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@fournos-cluster/fournos_cluster/operator.py` around lines 25 - 26, The current call to logging.basicConfig(level=log_level, format=...) won't override Kopf's existing handlers; update the startup logging setup (where logging.basicConfig is called and settings.log_level is used) to either call logging.basicConfig(..., force=True) or explicitly set the root logger level and add a handler with your formatter (e.g., logging.getLogger().setLevel(log_level) plus create and attach a StreamHandler with the formatter) so the configured level/format are actually applied.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@fournos-cluster/deploy/deployment.yaml`:
- Around line 38-39: Replace the personal mutable image reference and Always
pull policy: update the image field value that currently reads
"quay.io/memehta/fournos-cluster:latest" to an organization-owned repository and
a pinned identifier (for example "quay.io/openshift-psap/fournos-cluster:vX.Y.Z"
or a digest "quay.io/openshift-psap/fournos-cluster@sha256:...") and keep or
change imagePullPolicy only as needed for your release workflow (e.g., remove
imagePullPolicy: Always or set to IfNotPresent for pinned tags); ensure you
modify the deployment.yaml image entry and any related CI/CD references that
push tags so pods no longer rely on a personal namespace or the mutable "latest"
tag.
- Around line 62-69: Add container-level readOnlyRootFilesystem: true to the
existing container.securityContext (or if the Kopf process needs writable space,
mount an emptyDir volume at a writable path such as /tmp and limit writes to
that mount) and add a minimal pod-level securityContext with runAsNonRoot: true
(and optionally fsGroup if needed) to the Pod spec to satisfy PSA restricted
checks; update the container spec referenced by securityContext,
allowPrivilegeEscalation, capabilities, runAsNonRoot and seccompProfile and add
a top-level pod securityContext entry in the same Deployment manifest.
In `@fournos-cluster/fournos_cluster/constants.py`:
- Around line 14-15: The CRD uses a singular plural name which violates
Kubernetes conventions; update the CRD's spec.names.plural value from
"fournoscluster" to "fournosclusters", change the constant
FOURNOS_CLUSTER_PLURAL in constants.py to "fournosclusters", and update any RBAC
resource entries that reference "fournoscluster", "fournoscluster/status", or
"fournoscluster/finalizers" to use the plural form "fournosclusters",
"fournosclusters/status", and "fournosclusters/finalizers" so the CRD,
constants, and RBAC remain consistent and CustomObjectsApi calls continue to
work.
In `@fournos-cluster/fournos_cluster/core/kueue.py`:
- Around line 110-186: update_flavor_quotas currently adds new GPU resource
names to rg["coveredResources"] and only rebuilds target_flavor["resources"],
leaving sibling flavors missing entries and violating Kueue's invariant; before
patching, iterate over rg["flavors"] (use the existing flavors list and identify
target_flavor) and for each sibling flavor build a resources list aligned with
the final coveredResources ordering (use the same sorted(covered) used for
rg["coveredResources"]) adding each resource with its existing nominalQuota if
present or nominalQuota: 0 otherwise (preserve existing CLUSTER_SLOT_RESOURCE
quota from each flavor when present, or use slot_quota fallback for target if
needed), then assign those lists back into each flavor so every flavor has
entries for every covered resource before calling patch_cluster_custom_object.
In `@README.md`:
- Around line 547-552: The README table lists incorrect env var names;
pydantic-settings in fourn o s_cluster/settings.py sets model_config =
{"env_prefix": "FOURNOS_CLUSTER_"} so environment variables are derived by
prefixing field names—update the table to use the actual env vars
FOURNOS_CLUSTER_RECONCILE_INTERVAL_SEC,
FOURNOS_CLUSTER_GPU_DISCOVERY_DEFAULT_INTERVAL_SEC, and
FOURNOS_CLUSTER_GPU_DISCOVERY_TIMEOUT_SEC (matching the Settings field names),
and remove the fourth row (no env var controls cluster discovery interval);
reference model_config/env_prefix and the Settings field names to locate the
authoritative names.
- Around line 504-507: Update the PSAPCluster troubleshooting instructions so
logs are checked from the cluster-specific operator deployment: replace
references to the `fournos` deployment with `fournos-cluster` and update the
example command to `oc logs deployment/fournos-cluster -n psap-automation | grep
<cluster-name>`; ensure the PSAPCluster section text and the three-step
checklist mention `fournos-cluster` when directing users to check kubeconfig/GPU
discovery and cluster-related errors.
---
Outside diff comments:
In `@fournos/core/constants.py`:
- Around line 30-31: Remove the unexpected extra blank line(s) introduced in
fournos/core/constants.py (around the top of the module where module-level
constants are defined) so the file matches ruff's formatting expectations; run
"ruff format" (or manually normalize the blank lines to a single blank line
between top-level sections) and re-run "ruff format --check" to ensure the
formatter drift is fixed before merging.
In `@fournos/handlers/lifecycle.py`:
- Around line 1-272: The file fails ruff formatting checks; run the
auto-formatter and commit the updated file. Specifically, open the lifecycle
handlers containing functions on_create, _create_lock_workload,
_find_exclusive_locker, _pending_status, and reconcile_pending, run `ruff
format` (or `ruff format --write`) to apply the required
whitespace/line-break/style fixes, verify no remaining ruff format --check
errors, and commit the formatted changes so CI stops failing.
---
Duplicate comments:
In `@fournos-cluster/fournos_cluster/core/gpu_discovery.py`:
- Around line 125-139: The code configures
api_client.configuration.request_timeout which the Kubernetes Python client
ignores; change the call that fetches nodes (target_core.list_node) to pass the
timeout per-call via the internal _request_timeout parameter (e.g.,
_request_timeout=(settings.gpu_discovery_timeout_sec,
settings.gpu_discovery_timeout_sec)) so the call will time out; keep the
existing try/except raising GPUDiscoveryError on failure and the
api_client.close() in the finally block.
In `@fournos/handlers/lifecycle.py`:
- Around line 51-89: Add an explicit validation that rejects the unsupported
combination of lock_only and exclusive=False: when spec["lockOnly"] (lock_only)
is true and spec.get("exclusive") is present and False, set
patch.status["phase"]=Phase.FAILED with a clear message (e.g. "lockOnly: true
cannot be combined with exclusive: false") and return before calling
_create_lock_workload; this prevents silently overriding the user's exclusive
setting and keeps the existing LABEL_EXCLUSIVE_CLUSTER/_create_lock_workload
behavior intact.
In `@README.md`:
- Around line 323-345: The documentation currently contradicts itself about who
creates ResourceFlavors and quotas; update the PSAPCluster onboarding section to
clearly state that the controller will auto-discover GPUs and create/update the
cluster's ResourceFlavor and update the global `fournos-queue` ClusterQueue
quotas when a PSAPCluster is applied, and make step 2 (the instruction to
manually add a ResourceFlavor and quota in `config/kueue-cluster-config.yaml`)
an explicit "manual fallback" for environments where the controller is
disabled/unavailable; mention `PSAPCluster`, `ResourceFlavor`, `fournos-queue`,
and `config/kueue-cluster-config.yaml` by name so readers know when to rely on
automation vs perform the manual step.
---
Nitpick comments:
In `@fournos-cluster/fournos_cluster/core/gpu_discovery.py`:
- Around line 116-123: When you detect a missing "current-context" and choose to
fallback to contexts[0], emit a warning log that includes cluster_name, the list
of available contexts and the chosen context so the fallback is visible in
operator logs; update the block that inspects kubeconfig_dict and contexts (the
code that currently assigns kubeconfig_dict["current-context"] =
contexts[0]["name"]) to call the module logger (or existing logger used for GPU
discovery) to warn before assigning, while leaving the GPUDiscoveryError raise
path intact for the empty-contexts case.
In `@fournos-cluster/fournos_cluster/handlers/cluster.py`:
- Around line 57-67: The function _check_kubeconfig currently catches all
Exception which masks programming errors; change it to only catch
kubernetes.client.exceptions.ApiException when calling
core.read_namespaced_secret(secret_name, settings.secrets_namespace), check
exc.status == 404 to return "Missing" and return "Invalid" for other
ApiException statuses, and remove the broad except so that unexpected errors
(e.g., KeyError on spec or AttributeError) propagate to Kopf; reference symbols:
_check_kubeconfig, spec, secret_name, core.read_namespaced_secret,
settings.secrets_namespace, logger.warning, and import/use
kubernetes.client.exceptions.ApiException.
In `@fournos-cluster/fournos_cluster/operator.py`:
- Around line 25-26: The current call to logging.basicConfig(level=log_level,
format=...) won't override Kopf's existing handlers; update the startup logging
setup (where logging.basicConfig is called and settings.log_level is used) to
either call logging.basicConfig(..., force=True) or explicitly set the root
logger level and add a handler with your formatter (e.g.,
logging.getLogger().setLevel(log_level) plus create and attach a StreamHandler
with the formatter) so the configured level/format are actually applied.
In `@fournos/handlers/lifecycle.py`:
- Around line 108-125: Add a one-line comment inside the except
k8s_client.exceptions.ApiException as exc block explaining that a 409 status is
intentionally allowed to fall through (workload already exists) for idempotent
behavior; make it adjacent to the conditional that checks exc.status != 409 and
mention that when exc.status == 409 we intentionally proceed to set patch.status
to Phase.PENDING and set the COND_WORKLOAD_ADMITTED condition via set_condition
so future maintainers don’t change the fall-through to a return.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 97ec2708-e0f5-4e8e-a6a5-1efb8a2c20f9
📒 Files selected for processing (35)
README.mdconfig/forge/samples/job-full.yamldev/mock-psapcluster.yamlfournos-cluster/Containerfilefournos-cluster/README.mdfournos-cluster/deploy/deployment.yamlfournos-cluster/fournos_cluster/__init__.pyfournos-cluster/fournos_cluster/__main__.pyfournos-cluster/fournos_cluster/constants.pyfournos-cluster/fournos_cluster/core/__init__.pyfournos-cluster/fournos_cluster/core/gpu_discovery.pyfournos-cluster/fournos_cluster/core/kueue.pyfournos-cluster/fournos_cluster/handlers/__init__.pyfournos-cluster/fournos_cluster/handlers/cluster.pyfournos-cluster/fournos_cluster/handlers/secret.pyfournos-cluster/fournos_cluster/operator.pyfournos-cluster/fournos_cluster/settings.pyfournos-cluster/fournos_cluster/state.pyfournos-cluster/manifests/crd.yamlfournos-cluster/manifests/rbac/role.yamlfournos-cluster/pyproject.tomlfournos-cluster/tests/__init__.pyfournos-cluster/tests/unit/__init__.pyfournos-cluster/tests/unit/test_cluster_handlers.pyfournos-cluster/tests/unit/test_gpu_discovery.pyfournos-cluster/tests/unit/test_kueue.pyfournos-cluster/tests/unit/test_secret_handlers.pyfournos/core/clusters.pyfournos/core/constants.pyfournos/core/kueue.pyfournos/handlers/execution.pyfournos/handlers/lifecycle.pyfournos/settings.pymanifests/crd.yamlmanifests/rbac/role_fournos.yaml
✅ Files skipped from review due to trivial changes (5)
- fournos-cluster/fournos_cluster/handlers/init.py
- fournos-cluster/pyproject.toml
- config/forge/samples/job-full.yaml
- fournos-cluster/README.md
- dev/mock-psapcluster.yaml
🚧 Files skipped from review as they are similar to previous changes (4)
- fournos/handlers/execution.py
- fournos/core/clusters.py
- manifests/crd.yaml
- fournos/core/kueue.py
| image: quay.io/memehta/fournos-cluster:latest | ||
| imagePullPolicy: Always |
There was a problem hiding this comment.
Production deployment must not reference a personal image namespace with a mutable latest tag.
quay.io/memehta/fournos-cluster:latest couples the deployment to a personal account: if access is revoked or the account is removed, all pod restarts will fail. Combined with imagePullPolicy: Always, every restart silently pulls whatever image was last pushed, making rollbacks and audit trails unreliable.
Move the image to an org-level registry path (e.g., quay.io/openshift-psap/fournos-cluster) and pin to a specific tag or digest.
🔧 Suggested change
- image: quay.io/memehta/fournos-cluster:latest
- imagePullPolicy: Always
+ image: quay.io/openshift-psap/fournos-cluster:<tag-or-digest>
+ imagePullPolicy: IfNotPresent🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@fournos-cluster/deploy/deployment.yaml` around lines 38 - 39, Replace the
personal mutable image reference and Always pull policy: update the image field
value that currently reads "quay.io/memehta/fournos-cluster:latest" to an
organization-owned repository and a pinned identifier (for example
"quay.io/openshift-psap/fournos-cluster:vX.Y.Z" or a digest
"quay.io/openshift-psap/fournos-cluster@sha256:...") and keep or change
imagePullPolicy only as needed for your release workflow (e.g., remove
imagePullPolicy: Always or set to IfNotPresent for pinned tags); ensure you
modify the deployment.yaml image entry and any related CI/CD references that
push tags so pods no longer rely on a personal namespace or the mutable "latest"
tag.
| securityContext: | ||
| allowPrivilegeEscalation: false | ||
| capabilities: | ||
| drop: | ||
| - ALL | ||
| runAsNonRoot: true | ||
| seccompProfile: | ||
| type: RuntimeDefault |
There was a problem hiding this comment.
Two Trivy-flagged security hardening gaps in the container security context.
- KSV-0014 —
readOnlyRootFilesystem: trueis missing. If the Kopf process doesn't write to the filesystem at runtime, this is a straightforward addition; otherwise pair it with anemptyDirvolume mounted at any writable path the process needs (e.g.,/tmp). - KSV-0118 — No pod-level
securityContextis set. WhilerunAsNonRoot: trueexists at the container level, some admission controllers (PSArestrictedprofile) also require it at the pod level. Adding a minimal pod-level context closes this gap.
🛡️ Proposed fix
spec:
serviceAccountName: fournos-cluster
+ securityContext:
+ runAsNonRoot: true
+ seccompProfile:
+ type: RuntimeDefault
containers:
- name: fournos-cluster
...
securityContext:
allowPrivilegeEscalation: false
capabilities:
drop:
- ALL
runAsNonRoot: true
+ readOnlyRootFilesystem: true
seccompProfile:
type: RuntimeDefault🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@fournos-cluster/deploy/deployment.yaml` around lines 62 - 69, Add
container-level readOnlyRootFilesystem: true to the existing
container.securityContext (or if the Kopf process needs writable space, mount an
emptyDir volume at a writable path such as /tmp and limit writes to that mount)
and add a minimal pod-level securityContext with runAsNonRoot: true (and
optionally fsGroup if needed) to the Pod spec to satisfy PSA restricted checks;
update the container spec referenced by securityContext,
allowPrivilegeEscalation, capabilities, runAsNonRoot and seccompProfile and add
a top-level pod securityContext entry in the same Deployment manifest.
| FOURNOSJOB_PLURAL = "fournosjobs" | ||
| FOURNOS_CLUSTER_PLURAL = "fournoscluster" |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# 1. Find the CRD manifest and inspect its names (plural/singular/kind).
fd -t f -e yaml -e yml | xargs rg -lP '^kind:\s*CustomResourceDefinition' 2>/dev/null
fd -t f -e yaml -e yml | xargs rg -nP '^kind:\s*CustomResourceDefinition' 2>/dev/null \
| cut -d: -f1 | sort -u \
| xargs -I{} sh -c 'echo "=== {} ==="; rg -nP "names:|plural:|singular:|kind:|listKind:" "{}"'
# 2. Confirm every callsite of the constant.
rg -nP --type=py -C2 '\bFOURNOS_CLUSTER_PLURAL\b'
# 3. Cross-check RBAC and any other YAML referencing the resource name.
rg -nP -C1 '\bfournoscluster(s)?\b' -g '*.yaml' -g '*.yml'Repository: openshift-psap/fournos
Length of output: 4848
CRD manifest has plural: fournoscluster (singular) — the constant correctly mirrors this but both violate Kubernetes naming conventions.
The constant FOURNOS_CLUSTER_PLURAL = "fournoscluster" is not actually wrong; it correctly reflects what the CRD manifest defines at fournos-cluster/manifests/crd.yaml line 10 (plural: fournoscluster). However, the CRD itself has the bug — Kubernetes conventions require spec.names.plural to be the plural form. Compare FournosJob, which correctly uses plural: fournosjobs (plural).
Since the constant, CRD, and RBAC are all internally consistent with the singular form, the system currently functions without runtime errors. However, the naming violates Kubernetes conventions and must be fixed coherently:
- Update CRD
spec.names.pluralfromfournosclustertofournosclusters - Update constant to
FOURNOS_CLUSTER_PLURAL = "fournosclusters" - Update RBAC
resourcesfrom["fournoscluster", "fournoscluster/status", "fournoscluster/finalizers"]to usefournosclusters
Changing only the constant will break CustomObjectsApi calls and CRD matching.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@fournos-cluster/fournos_cluster/constants.py` around lines 14 - 15, The CRD
uses a singular plural name which violates Kubernetes conventions; update the
CRD's spec.names.plural value from "fournoscluster" to "fournosclusters", change
the constant FOURNOS_CLUSTER_PLURAL in constants.py to "fournosclusters", and
update any RBAC resource entries that reference "fournoscluster",
"fournoscluster/status", or "fournoscluster/finalizers" to use the plural form
"fournosclusters", "fournosclusters/status", and "fournosclusters/finalizers" so
the CRD, constants, and RBAC remain consistent and CustomObjectsApi calls
continue to work.
| def update_flavor_quotas( | ||
| self, | ||
| flavor_name: str, | ||
| gpu_resources: list[tuple[str, int]], | ||
| ) -> dict | None: | ||
| cq_name = settings.kueue_cluster_queue_name | ||
| try: | ||
| cq = self._k8s.get_cluster_custom_object( | ||
| group=KUEUE_GROUP, | ||
| version=KUEUE_VERSION, | ||
| plural=KUEUE_CLUSTER_QUEUE_PLURAL, | ||
| name=cq_name, | ||
| ) | ||
| except client.exceptions.ApiException as exc: | ||
| if exc.status == 404: | ||
| logger.warning("ClusterQueue %s not found, cannot update quotas", cq_name) | ||
| return None | ||
| raise | ||
|
|
||
| resource_groups = cq.get("spec", {}).get("resourceGroups", []) | ||
| if not resource_groups: | ||
| logger.warning("ClusterQueue %s has no resourceGroups", cq_name) | ||
| return None | ||
|
|
||
| rg = resource_groups[0] | ||
| covered = set(rg.get("coveredResources", [])) | ||
| flavors = rg.get("flavors", []) | ||
|
|
||
| target_flavor = None | ||
| for f in flavors: | ||
| if f["name"] == flavor_name: | ||
| target_flavor = f | ||
| break | ||
|
|
||
| if target_flavor is None: | ||
| logger.warning( | ||
| "Flavor %s not found in ClusterQueue %s, cannot update quotas", | ||
| flavor_name, | ||
| cq_name, | ||
| ) | ||
| return None | ||
|
|
||
| new_resources: list[dict] = [] | ||
| for short_name, count in gpu_resources: | ||
| resource_name = self._gpu_resource_name(short_name) | ||
| covered.add(resource_name) | ||
| new_resources.append({"name": resource_name, "nominalQuota": count}) | ||
|
|
||
| existing_slot = next( | ||
| ( | ||
| r | ||
| for r in target_flavor.get("resources", []) | ||
| if r["name"] == CLUSTER_SLOT_RESOURCE | ||
| ), | ||
| None, | ||
| ) | ||
| slot_quota = existing_slot["nominalQuota"] if existing_slot else MAX_CLUSTER_SLOTS | ||
| new_resources.append({"name": CLUSTER_SLOT_RESOURCE, "nominalQuota": slot_quota}) | ||
| covered.add(CLUSTER_SLOT_RESOURCE) | ||
|
|
||
| rg["coveredResources"] = sorted(covered) | ||
| target_flavor["resources"] = new_resources | ||
|
|
||
| result = self._k8s.patch_cluster_custom_object( | ||
| group=KUEUE_GROUP, | ||
| version=KUEUE_VERSION, | ||
| plural=KUEUE_CLUSTER_QUEUE_PLURAL, | ||
| name=cq_name, | ||
| body={"spec": {"resourceGroups": resource_groups}}, | ||
| ) | ||
| logger.info( | ||
| "Updated ClusterQueue %s flavor %s quotas: %s", | ||
| cq_name, | ||
| flavor_name, | ||
| gpu_resources, | ||
| ) | ||
| return result |
There was a problem hiding this comment.
🧩 Analysis chain
🌐 Web query:
Kueue v1beta2 ClusterQueue resourceGroup require each flavor declare all coveredResources
💡 Result:
Yes. In Kueue v1beta2, within each ClusterQueue.spec.resourceGroups[], every flavor in spec.resourceGroups[].flavors must declare quotas for all resources listed in that resource group’s coveredResources (and in the same order). This is required by the API’s ResourceGroup design and validations: a resource group defines coveredResources and the flavors that “provide quotas for these resources”, and “Each flavor MUST list all the resources listed for this group in the same order as the.resources field” [1]. The Cluster Queue docs and examples show flavors declaring resources matching coveredResources within the same resourceGroup [2][3]. Note: by default, administrators must specify all resources required by Pods in the ClusterQueue’s resourceGroups; Kueue also supports excluding some resource types via cluster-level configuration (resource prefixes) [4]. Practical implication: when you set coveredResources: ["cpu","memory","foo.com/gpu"], each flavor under that resourceGroup must have resources entries for cpu, memory, and foo.com/gpu (not a subset) [2][3].
Citations:
- 1: https://kueue.sigs.k8s.io/docs/reference/kueue.v1beta1/
- 2: https://kueue.sigs.k8s.io/docs/concepts/cluster_queue/
- 3: https://kueue.sigs.k8s.io/docs/concepts/cluster_queue
- 4: https://github.com/kubernetes-sigs/kueue/blob/main/site/content/en/docs/tasks/manage/administer_cluster_quotas.md
update_flavor_quotas violates ClusterQueue resourceGroup invariant for sibling flavors.
When a flavor discovers a new GPU type, lines 152–170 add the new resource to coveredResources and rebuild only the target flavor's resources. Every other flavor in the same resourceGroups[0] is left without an entry for the newly-added resource. Kueue requires each flavor in a group to declare quotas for all coveredResources in the same order. The PATCH will fail or mark the ClusterQueue inactive the first time cluster-B discovers a GPU type that cluster-A doesn't have.
Backfill nominalQuota: 0 entries on the sibling flavors before patching:
♻️ Sketch
rg["coveredResources"] = sorted(covered)
target_flavor["resources"] = new_resources
+
+ # Ensure every sibling flavor declares all coveredResources, in order.
+ ordered = rg["coveredResources"]
+ for f in flavors:
+ existing = {r["name"]: r for r in f.get("resources", [])}
+ f["resources"] = [
+ existing.get(rname, {"name": rname, "nominalQuota": 0})
+ for rname in ordered
+ ]🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@fournos-cluster/fournos_cluster/core/kueue.py` around lines 110 - 186,
update_flavor_quotas currently adds new GPU resource names to
rg["coveredResources"] and only rebuilds target_flavor["resources"], leaving
sibling flavors missing entries and violating Kueue's invariant; before
patching, iterate over rg["flavors"] (use the existing flavors list and identify
target_flavor) and for each sibling flavor build a resources list aligned with
the final coveredResources ordering (use the same sorted(covered) used for
rg["coveredResources"]) adding each resource with its existing nominalQuota if
present or nominalQuota: 0 otherwise (preserve existing CLUSTER_SLOT_RESOURCE
quota from each flavor when present, or use slot_quota fallback for target if
needed), then assign those lists back into each flavor so every flavor has
entries for every covered resource before calling patch_cluster_custom_object.
| next cycle. Check: | ||
| - Is the target cluster up? Try `oc --kubeconfig=<path> cluster-info` | ||
| - Is the kubeconfig secret valid? `oc get secret <name> -n psap-secrets -o yaml` | ||
| - Check operator logs: `oc logs deployment/fournos -n psap-automation | grep <cluster-name>` |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Confirm the actual deployment name shipped in this PR.
fd -t f deployment.yaml fournos-cluster | xargs rg -nP 'kind:\s*Deployment|name:'Repository: openshift-psap/fournos
Length of output: 394
🏁 Script executed:
# Check the actual README.md content at the specified lines
head -n 507 README.md | tail -n 10Repository: openshift-psap/fournos
Length of output: 621
🏁 Script executed:
# Search for all deployment configurations to understand the landscape
fd -t f deployment.yaml | xargs cat | rg -A 5 'kind: Deployment' | head -50Repository: openshift-psap/fournos
Length of output: 245
🏁 Script executed:
# Search the codebase for references to both "fournos" and "fournos-cluster" deployments
rg "deployment/fournos" --type mdRepository: openshift-psap/fournos
Length of output: 169
🏁 Script executed:
# Get the full fournos-cluster deployment configuration to understand its purpose
fd -t f deployment.yaml | xargs cat | rg -A 30 'kind: Deployment' | rg -A 30 'name: fournos-cluster'Repository: openshift-psap/fournos
Length of output: 1027
🏁 Script executed:
# Search for comments or documentation about what fournos-cluster does
rg "fournos-cluster|fournos" --type yaml -B 2 -A 5 | head -100Repository: openshift-psap/fournos
Length of output: 5410
🏁 Script executed:
# Check if there are environment variables or configuration that explains the deployment roles
rg "FOURNOS_CLUSTER" --type yaml -B 3 -A 3Repository: openshift-psap/fournos
Length of output: 702
🏁 Script executed:
# Check the fournos main deployment configuration to understand its purpose
fd -t f deployment.yaml | xargs cat | rg -A 30 'kind: Deployment' | rg -A 30 'name: fournos[^-]'Repository: openshift-psap/fournos
Length of output: 48
🏁 Script executed:
# Look for any documentation or code that explains the architectural separation
rg "cluster.*deployment|fournos-cluster.*controller" --type md -B 2 -A 2Repository: openshift-psap/fournos
Length of output: 48
🏁 Script executed:
# Check the main source code or controller logic to see which handles PSAPCluster
fd -t go -type f | xargs rg "PSAPCluster|cluster-info" -B 2 -A 2 | head -50Repository: openshift-psap/fournos
Length of output: 2444
🏁 Script executed:
# Search for any documentation or comments explaining the purpose of each deployment
rg "fournos deployment|operator logs" --type md -B 3 -A 3Repository: openshift-psap/fournos
Length of output: 1440
🏁 Script executed:
# Check if there are any references to which deployment handles which functionality
rg "controller.*fournos|fournos.*controller" --type md -B 2 -A 2Repository: openshift-psap/fournos
Length of output: 48
Update troubleshooting guide to reference the correct deployment for PSAPCluster logs.
The PSAPCluster troubleshooting section currently directs users to check logs in the fournos deployment, but cluster-related issues (kubeconfig/GPU discovery) are handled by the fournos-cluster deployment. Users following this advice will not find relevant error messages.
📝 Fix
- Check operator logs: `oc logs deployment/fournos -n psap-automation | grep <cluster-name>`
+ Check operator logs: `oc logs deployment/fournos-cluster -n psap-automation | grep <cluster-name>`🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@README.md` around lines 504 - 507, Update the PSAPCluster troubleshooting
instructions so logs are checked from the cluster-specific operator deployment:
replace references to the `fournos` deployment with `fournos-cluster` and update
the example command to `oc logs deployment/fournos-cluster -n psap-automation |
grep <cluster-name>`; ensure the PSAPCluster section text and the three-step
checklist mention `fournos-cluster` when directing users to check kubeconfig/GPU
discovery and cluster-related errors.
| |----------|---------|-------------| | ||
| | `FOURNOS_PSAPCLUSTER_TIMER_INTERVAL_SEC` | `30` | Reconciliation timer interval | | ||
| | `FOURNOS_GPU_DISCOVERY_DEFAULT_INTERVAL_SEC` | `300` | Default GPU discovery interval | | ||
| | `FOURNOS_GPU_DISCOVERY_TIMEOUT_SEC` | `10` | Timeout for connecting to target clusters | | ||
| | `FOURNOS_CLUSTER_DISCOVERY_INTERVAL_SEC` | `60` | Interval for auto-discovery scan of kubeconfig secrets | | ||
|
|
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Confirm the env var names by listing the settings fields and prefix.
rg -nP --type=py -C2 'env_prefix|class Settings\(BaseSettings\)' \
fournos-cluster/fournos_cluster/settings.py
# Look for any cluster_discovery_interval_sec setting elsewhere.
rg -nP --type=py 'cluster_discovery_interval' || echo "not found"Repository: openshift-psap/fournos
Length of output: 266
🏁 Script executed:
cat -n fournos-cluster/fournos_cluster/settings.pyRepository: openshift-psap/fournos
Length of output: 1408
Env-var names in this table do not match the actual settings.
fournos-cluster/fournos_cluster/settings.py uses model_config = {"env_prefix": "FOURNOS_CLUSTER_"}, so pydantic-settings derives env var names by concatenating the prefix with the upper-cased field name:
| README documents | Actual env var |
|---|---|
FOURNOS_PSAPCLUSTER_TIMER_INTERVAL_SEC |
FOURNOS_CLUSTER_RECONCILE_INTERVAL_SEC |
FOURNOS_GPU_DISCOVERY_DEFAULT_INTERVAL_SEC |
FOURNOS_CLUSTER_GPU_DISCOVERY_DEFAULT_INTERVAL_SEC |
FOURNOS_GPU_DISCOVERY_TIMEOUT_SEC |
FOURNOS_CLUSTER_GPU_DISCOVERY_TIMEOUT_SEC |
FOURNOS_CLUSTER_DISCOVERY_INTERVAL_SEC |
Not defined in settings |
Users following the README will export incorrect env vars and the controller will silently use defaults instead. Update the table to match the actual Settings fields.
Suggested fix
-| `FOURNOS_PSAPCLUSTER_TIMER_INTERVAL_SEC` | `30` | Reconciliation timer interval |
-| `FOURNOS_GPU_DISCOVERY_DEFAULT_INTERVAL_SEC` | `300` | Default GPU discovery interval |
-| `FOURNOS_GPU_DISCOVERY_TIMEOUT_SEC` | `10` | Timeout for connecting to target clusters |
-| `FOURNOS_CLUSTER_DISCOVERY_INTERVAL_SEC` | `60` | Interval for auto-discovery scan of kubeconfig secrets |
+| `FOURNOS_CLUSTER_RECONCILE_INTERVAL_SEC` | `30` | Reconciliation timer interval |
+| `FOURNOS_CLUSTER_GPU_DISCOVERY_DEFAULT_INTERVAL_SEC` | `300` | Default GPU discovery interval |
+| `FOURNOS_CLUSTER_GPU_DISCOVERY_TIMEOUT_SEC` | `10` | Timeout for connecting to target clusters |(Remove the fourth row—no env var controls cluster discovery interval)
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| |----------|---------|-------------| | |
| | `FOURNOS_PSAPCLUSTER_TIMER_INTERVAL_SEC` | `30` | Reconciliation timer interval | | |
| | `FOURNOS_GPU_DISCOVERY_DEFAULT_INTERVAL_SEC` | `300` | Default GPU discovery interval | | |
| | `FOURNOS_GPU_DISCOVERY_TIMEOUT_SEC` | `10` | Timeout for connecting to target clusters | | |
| | `FOURNOS_CLUSTER_DISCOVERY_INTERVAL_SEC` | `60` | Interval for auto-discovery scan of kubeconfig secrets | | |
| |----------|---------|-------------| | |
| | `FOURNOS_CLUSTER_RECONCILE_INTERVAL_SEC` | `30` | Reconciliation timer interval | | |
| | `FOURNOS_CLUSTER_GPU_DISCOVERY_DEFAULT_INTERVAL_SEC` | `300` | Default GPU discovery interval | | |
| | `FOURNOS_CLUSTER_GPU_DISCOVERY_TIMEOUT_SEC` | `10` | Timeout for connecting to target clusters | | |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@README.md` around lines 547 - 552, The README table lists incorrect env var
names; pydantic-settings in fourn o s_cluster/settings.py sets model_config =
{"env_prefix": "FOURNOS_CLUSTER_"} so environment variables are derived by
prefixing field names—update the table to use the actual env vars
FOURNOS_CLUSTER_RECONCILE_INTERVAL_SEC,
FOURNOS_CLUSTER_GPU_DISCOVERY_DEFAULT_INTERVAL_SEC, and
FOURNOS_CLUSTER_GPU_DISCOVERY_TIMEOUT_SEC (matching the Settings field names),
and remove the fourth row (no env var controls cluster discovery interval);
reference model_config/env_prefix and the Settings field names to locate the
authoritative names.
string_data is never populated when reading secrets from the K8s API; it only exists for creating secrets. The branch was unreachable.
| result = self._k8s.create_cluster_custom_object( | ||
| group=KUEUE_GROUP, | ||
| version=KUEUE_VERSION, | ||
| plural=KUEUE_RESOURCE_FLAVOR_PLURAL, | ||
| body=body, | ||
| ) | ||
| logger.info("Created ResourceFlavor %s", flavor_name) | ||
| return result | ||
| except client.exceptions.ApiException as exc: | ||
| if exc.status == 409: | ||
| logger.debug("ResourceFlavor %s already exists", flavor_name) | ||
| return None |
There was a problem hiding this comment.
shouldn't we update the resource when the status is 409?
| if exc.status == 404: | ||
| logger.warning("ClusterQueue %s not found", cq_name) | ||
| return None | ||
| raise |
There was a problem hiding this comment.
I'm not sure when this is supposed to happen 🤔
shouldn't we create the ClusterQueue if it doesn't exist?
The Python K8s client ignores configuration.request_timeout — the timeout must be passed directly to the API call via _request_timeout. Without this, GPU discovery hangs indefinitely on unreachable clusters. Also removes the stale test_stringdata_kubeconfig test that covered the previously removed string_data branch.
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@fournos-cluster/tests/unit/test_gpu_discovery.py`:
- Around line 310-325: The test test_sets_request_timeout is asserting that
mock_api_client.configuration.request_timeout is not None, but discover_gpus
sets the timeout by passing _request_timeout directly to CoreV1Api.list_node, so
update the test to assert that CoreV1Api.return_value.list_node was called with
a timeout equal to the expected _request_timeout; specifically, in
test_sets_request_timeout arrange a known timeout value, call
discovery.discover_gpus(..., _request_timeout=EXPECTED), and replace the
existing assertion with one that checks
mock_k8s_client.CoreV1Api.return_value.list_node.assert_called_with(...,
timeout=EXPECTED) (or inspect call_args to verify the timeout argument) so the
test verifies the actual behavior of discover_gpus and will catch regressions.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: bcbcfc42-80c2-4044-951a-7fec02709b29
📒 Files selected for processing (2)
fournos-cluster/fournos_cluster/core/gpu_discovery.pyfournos-cluster/tests/unit/test_gpu_discovery.py
| def test_sets_request_timeout( | ||
| self, mock_k8s_client: MagicMock, mock_k8s_config: MagicMock | ||
| ) -> None: | ||
| discovery, mock_core = self._make_client() | ||
| self._mock_kubeconfig_secret(mock_core) | ||
|
|
||
| nodes = MagicMock() | ||
| nodes.items = [] | ||
| mock_api_client = MagicMock() | ||
| mock_k8s_config.new_client_from_config_dict.return_value = mock_api_client | ||
| mock_k8s_client.CoreV1Api.return_value.list_node.return_value = nodes | ||
|
|
||
| discovery.discover_gpus("cluster-1", "kubeconfig-cluster-1", "psap-secrets") | ||
|
|
||
| assert mock_api_client.configuration.request_timeout is not None | ||
|
|
There was a problem hiding this comment.
Test doesn't validate the timeout fix.
The test asserts mock_api_client.configuration.request_timeout is not None, but the implementation (lines 123-128 in gpu_discovery.py) never sets api_client.configuration.request_timeout. The actual fix passes _request_timeout directly to the list_node() call. This test provides false confidence and won't catch regressions if the _request_timeout parameter is accidentally removed.
✅ Proposed fix to validate the actual timeout behavior
`@patch`("fournos_cluster.core.gpu_discovery.k8s_config")
`@patch`("fournos_cluster.core.gpu_discovery.client")
def test_sets_request_timeout(
self, mock_k8s_client: MagicMock, mock_k8s_config: MagicMock
) -> None:
+ from fournos_cluster.settings import settings
discovery, mock_core = self._make_client()
self._mock_kubeconfig_secret(mock_core)
nodes = MagicMock()
nodes.items = []
mock_api_client = MagicMock()
mock_k8s_config.new_client_from_config_dict.return_value = mock_api_client
mock_k8s_client.CoreV1Api.return_value.list_node.return_value = nodes
discovery.discover_gpus("cluster-1", "kubeconfig-cluster-1", "psap-secrets")
- assert mock_api_client.configuration.request_timeout is not None
+ expected_timeout = (settings.gpu_discovery_timeout_sec, settings.gpu_discovery_timeout_sec)
+ mock_k8s_client.CoreV1Api.return_value.list_node.assert_called_once_with(_request_timeout=expected_timeout)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| def test_sets_request_timeout( | |
| self, mock_k8s_client: MagicMock, mock_k8s_config: MagicMock | |
| ) -> None: | |
| discovery, mock_core = self._make_client() | |
| self._mock_kubeconfig_secret(mock_core) | |
| nodes = MagicMock() | |
| nodes.items = [] | |
| mock_api_client = MagicMock() | |
| mock_k8s_config.new_client_from_config_dict.return_value = mock_api_client | |
| mock_k8s_client.CoreV1Api.return_value.list_node.return_value = nodes | |
| discovery.discover_gpus("cluster-1", "kubeconfig-cluster-1", "psap-secrets") | |
| assert mock_api_client.configuration.request_timeout is not None | |
| def test_sets_request_timeout( | |
| self, mock_k8s_client: MagicMock, mock_k8s_config: MagicMock | |
| ) -> None: | |
| from fournos_cluster.settings import settings | |
| discovery, mock_core = self._make_client() | |
| self._mock_kubeconfig_secret(mock_core) | |
| nodes = MagicMock() | |
| nodes.items = [] | |
| mock_api_client = MagicMock() | |
| mock_k8s_config.new_client_from_config_dict.return_value = mock_api_client | |
| mock_k8s_client.CoreV1Api.return_value.list_node.return_value = nodes | |
| discovery.discover_gpus("cluster-1", "kubeconfig-cluster-1", "psap-secrets") | |
| expected_timeout = (settings.gpu_discovery_timeout_sec, settings.gpu_discovery_timeout_sec) | |
| mock_k8s_client.CoreV1Api.return_value.list_node.assert_called_once_with(_request_timeout=expected_timeout) |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@fournos-cluster/tests/unit/test_gpu_discovery.py` around lines 310 - 325, The
test test_sets_request_timeout is asserting that
mock_api_client.configuration.request_timeout is not None, but discover_gpus
sets the timeout by passing _request_timeout directly to CoreV1Api.list_node, so
update the test to assert that CoreV1Api.return_value.list_node was called with
a timeout equal to the expected _request_timeout; specifically, in
test_sets_request_timeout arrange a known timeout value, call
discovery.discover_gpus(..., _request_timeout=EXPECTED), and replace the
existing assertion with one that checks
mock_k8s_client.CoreV1Api.return_value.list_node.assert_called_with(...,
timeout=EXPECTED) (or inspect call_args to verify the timeout argument) so the
test verifies the actual behavior of discover_gpus and will catch regressions.
|
PR needs rebase. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
Summary
Extracts the cluster management logic from the fournos operator into an independent
fournos-cluster/controller that runs as its own Deployment.fournos.dev/v1) — single source of truth for cluster statefournos.dev/cluster-kubeconfig=true, auto-creates FournosCluster CRsspec.ownercreates alockOnly: true, exclusive: trueFournosJob that holds all 100 cluster-slots via Kueuespec.executionEnginevalidationArchitecture
Files changed
fournos-cluster/fournos/handlers/lifecycle.pyspec.executionEngineinstead ofspec.forgefournos/manifests/rbac/role_fournos.yamlconfig/forge/samples/job-full.yamlUsage
Test plan
Summary by CodeRabbit
New Features
Documentation
Tests
Configuration
Security / RBAC