Skip to content

feat: FournosCluster controller with GPU discovery, sentinel locking, and Kueue integration#74

Open
MML-coder wants to merge 8 commits into
mainfrom
feat/psapcluster-controller
Open

feat: FournosCluster controller with GPU discovery, sentinel locking, and Kueue integration#74
MML-coder wants to merge 8 commits into
mainfrom
feat/psapcluster-controller

Conversation

@MML-coder
Copy link
Copy Markdown
Collaborator

@MML-coder MML-coder commented Apr 29, 2026

Summary

Extracts the cluster management logic from the fournos operator into an independent
fournos-cluster/ controller that runs as its own Deployment.

  • FournosCluster CRD (fournos.dev/v1) — single source of truth for cluster state
  • Event-driven auto-discovery — watches kubeconfig secrets labeled fournos.dev/cluster-kubeconfig=true, auto-creates FournosCluster CRs
  • GPU discovery — connects to target clusters, aggregates GPU counts by vendor/model with exponential backoff
  • Sentinel lockingspec.owner creates a lockOnly: true, exclusive: true FournosJob that holds all 100 cluster-slots via Kueue
  • Self-healing — reconcile timer validates kubeconfig, retries GPU discovery, recreates missing sentinels
  • Fournos operator cleanup — removed PSAPCluster handlers, discovery daemon, GPU discovery code; fixed spec.executionEngine validation

Architecture

Kubeconfig Secret (psap-secrets)
  └─ Event-driven watch (label: fournos.dev/cluster-kubeconfig=true)
       ├─ ResourceFlavor (nodeLabel: fournos.dev/cluster=<name>)
       ├─ ClusterQueue flavor entry (cluster-slot: 100)
       └─ FournosCluster CR
            ├─ GPU Discovery (probes target cluster nodes)
            │    └─ Updates GPU quotas in ClusterQueue
            └─ Locking (spec.owner)
                 └─ Sentinel FournosJob (lockOnly + exclusive)
                      └─ Kueue Workload (holds 100 cluster-slots)

Files changed

Area Changes
fournos-cluster/ New independent controller (kopf, CRD, RBAC, Containerfile, 87 unit tests)
fournos/handlers/lifecycle.py Fixed validation: spec.executionEngine instead of spec.forge
fournos/ Removed PSAPCluster handlers, discovery daemon, GPU discovery, related settings
manifests/rbac/role_fournos.yaml Removed psapclusters RBAC rules
config/forge/samples/job-full.yaml Added namespace for convenience

Usage

# View clusters
oc get fournoscluster -n psap-automation

# Lock a cluster
oc patch fournoscluster athena-fire -n psap-automation --type merge \
  -p '{"spec":{"owner":"mehul"}}'

# Lock with TTL
oc patch fournoscluster athena-fire -n psap-automation --type merge \
  -p '{"spec":{"owner":"mehul","ttl":"4h"}}'

# Unlock
oc patch fournoscluster athena-fire -n psap-automation --type merge \
  -p '{"spec":{"owner":""}}'

Test plan

  • 87 unit tests pass in fournos-cluster/
  • Deployed to psap-automation cluster
  • Auto-discovery: labeled secret → FournosCluster CR created
  • GPU discovery: 8x NVIDIA on athena-fire
  • Locking: sentinel FournosJob created and admitted (holds 100 cluster-slots)
  • Blocking: new workloads get "insufficient unused quota" while locked
  • Lock release: clearing owner deletes sentinel, blocked workloads auto-admitted
  • Self-healing: reconciler recreates missing sentinels
  • Fournos operator: executionEngine validation fix verified

Summary by CodeRabbit

  • New Features

    • PSAPCluster CRD for cluster state, GPU inventory, TTL/ownership-based exclusive locking
    • Cluster operator to auto-onboard clusters, discover GPUs, manage lock sentinels, and update Kueue quotas
    • Lock-only job mode to reserve cluster quota without running pipelines
    • Kueue flavor/ClusterQueue management for per-cluster slot & GPU quotas
    • Deployment manifests and container image for the cluster operator
  • Documentation

    • Expanded README with onboarding, examples, config, troubleshooting, and E2E workflows
  • Tests

    • Extensive unit tests covering discovery, locking, Kueue integration, and handlers
  • Configuration

    • New runtime settings for queue name, secret patterns, and discovery timing
  • Security / RBAC

    • Updated RBAC to support lock sentinel, secrets access, and Kueue resources

@openshift-ci
Copy link
Copy Markdown

openshift-ci Bot commented Apr 29, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign ashishkamra for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 29, 2026
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 29, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 798c5d17-bda4-472a-9ee2-c34f7c8f36a4

📥 Commits

Reviewing files that changed from the base of the PR and between e8f04a9 and 6cc860c.

📒 Files selected for processing (2)
  • config/forge/samples/job-full.yaml
  • fournos-cluster/tests/unit/test_cluster_handlers.py
✅ Files skipped from review due to trivial changes (1)
  • config/forge/samples/job-full.yaml
🚧 Files skipped from review as they are similar to previous changes (1)
  • fournos-cluster/tests/unit/test_cluster_handlers.py

📝 Walkthrough

Walkthrough

Adds a FournosCluster operator and CRD for kubeconfig-managed cluster discovery, GPU inventory, Kueue quota wiring, sentinel lock jobs and lockOnly mode; includes operator wiring, manifests/RBAC, packaging, and unit tests.

Changes

FournosCluster Operator & Lock-Only Job Mode

Layer / File(s) Summary
Documentation & examples
README.md, dev/mock-psapcluster.yaml
Adds PSAPCluster docs, troubleshooting, end-to-end test workflow, and two sample PSAPCluster resources.
Operator packaging & deployment
fournos-cluster/Containerfile, fournos-cluster/deploy/deployment.yaml, fournos-cluster/pyproject.toml, fournos-cluster/README.md
Containerfile, Deployment, RBAC binding, package metadata, and operator README added.
Settings & constants
fournos-cluster/fournos_cluster/settings.py, fournos-cluster/fournos_cluster/constants.py, fournos/settings.py
New FournosCluster settings and constants; adds kueue_cluster_queue_name to main settings.
Secret parsing & auto-discovery
fournos/core/clusters.py, fournos-cluster/fournos_cluster/handlers/secret.py
Builds regex from pattern to extract cluster name, lists matching secrets, and auto-creates FournosCluster CR when a labeled kubeconfig Secret appears (propagates owner annotation).
CRD & API schema
fournos-cluster/manifests/crd.yaml, manifests/crd.yaml
Adds FournosCluster CRD and extends FournosJob CRD with spec.lockOnly.
GPU discovery client
fournos-cluster/fournos_cluster/core/gpu_discovery.py
Reads kubeconfig Secret, connects to target cluster, enumerates node GPU allocatables and labels, normalizes models, aggregates counts, and returns immutable DiscoveryResult.
Kueue integration
fournos-cluster/fournos_cluster/core/kueue.py, fournos/core/kueue.py
Adds KueueClient methods to create ResourceFlavor, add flavor to ClusterQueue (cluster-slot + GPU resources), update flavor quotas, and list flavors; handles 409/404 cases.
FournosCluster handlers & reconciliation
fournos-cluster/fournos_cluster/handlers/cluster.py
Create/owner/hardware handlers; sentinel FournosJob create/delete; reconcile steps for kubeconfig status, TTL expiry, GPU discovery/backoff, and sentinel self-healing.
Lock-only job support
manifests/crd.yaml, fournos/handlers/lifecycle.py, fournos/handlers/execution.py
Adds spec.lockOnly to FournosJob; lifecycle creates exclusive Kueue Workload for lock-only jobs and execution bypasses PipelineRun creation while reporting lock-held message.
RBAC, deployment & packaging
fournos-cluster/manifests/rbac/role.yaml, manifests/rbac/role_fournos.yaml
ClusterRole for fourn os-cluster, Role updates for fournos (create/delete fournosjobs), Kueue/localqueue rules, and secret patch permissions.
Samples & small edits
dev/mock-psapcluster.yaml, config/forge/samples/job-full.yaml
Adds example PSAPCluster resources and updates sample job target cluster to athena-fire.
Unit tests
fournos-cluster/tests/unit/*
Extensive tests for cluster handlers, secret handler, GPU discovery, and Kueue client covering logic, error cases, and reconciliation order.
Core helpers
fournos/core/clusters.py, fournos/core/kueue.py
Adds secret listing and cluster-name extraction helpers; extends KueueClient in core to manage ClusterQueue flavors.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested labels

lgtm

Poem

"🐰 I found a kubeconfig beneath a tree,
I counted GPUs — A100s and H100s for me,
A sentinel job guards every precious slot,
TTL and reconcile keep the lock from being lost,
Operator rabbit hops off, happy and free."

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 4.28% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title clearly and specifically describes the main change: a new FournosCluster controller with GPU discovery, sentinel locking, and Kueue integration, which aligns with the extensive changeset adding the fournos-cluster/ controller module.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/psapcluster-controller

Tip

💬 Introducing Slack Agent: The best way for teams to turn conversations into code.

Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.

  • Generate code and open pull requests
  • Plan features and break down work
  • Investigate incidents and troubleshoot customer tickets together
  • Automate recurring tasks and respond to alerts with triggers
  • Summarize progress and report instantly

Built for teams:

  • Shared memory across your entire org—no repeating context
  • Per-thread sandboxes to safely plan and execute work
  • Governance built-in—scoped access, auditability, and budget controls

One agent for your entire SDLC. Right inside Slack.

👉 Get started


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (1)
fournos/handlers/psapcluster.py (1)

52-64: 💤 Low value

Consider exposing a public method on ctx.registry instead of accessing _k8s directly.

Accessing the private _k8s attribute couples this handler to the internal implementation of ClusterRegistry. Consider adding a public method like ctx.registry.secret_exists(name, namespace) or using a shared CoreV1Api instance.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@fournos/handlers/psapcluster.py` around lines 52 - 64, Update
_check_kubeconfig to stop accessing the private ctx.registry._k8s and call a new
public API on the registry (e.g., ctx.registry.secret_exists(secret_name,
namespace)) that encapsulates the read_namespaced_secret logic; add the
corresponding method on ClusterRegistry (secret_exists(name, namespace)) which
returns a boolean or raises a controlled exception for non-404 errors, then have
_check_kubeconfig map True->"Valid", False->"Missing" and handle other errors as
"Invalid" while logging the exception via logger.warning as before.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@fournos/core/gpu_discovery.py`:
- Around line 125-139: The timeout on the Kubernetes call is ineffective because
setting api_client.configuration.request_timeout doesn't apply to individual
calls; update the code that builds the client and calls list_node so that you
remove the assignment to api_client.configuration.request_timeout and instead
call target_core.list_node(...) passing
_request_timeout=(settings.gpu_discovery_timeout_sec,
settings.gpu_discovery_timeout_sec); keep the existing try/except that wraps the
call and raise GPUDiscoveryError on exception, and ensure api_client.close()
remains in the finally block.

In `@fournos/core/psapcluster.py`:
- Around line 181-185: The stopPolicy patch handler currently catches
ApiException but ApiException is not imported; update the exception handling in
that handler to catch client.exceptions.ApiException instead (i.e., change the
except block that reads "except ApiException as exc:" to use
client.exceptions.ApiException) so it matches other handlers in this module and
avoids a NameError; ensure the logger.warning and raise behavior remain
unchanged and reference cq_name as before.

In `@README.md`:
- Around line 307-312: The fenced code block showing the table lacks a language
identifier (MD040); update the backticks surrounding the block (the triple-fence
that begins the block containing "NAME OWNER GPUS KUBECONFIG LOCKED AGE") to
include a language tag such as "text" (e.g., change ``` to ```text) so
markdownlint no longer flags the code fence.

---

Nitpick comments:
In `@fournos/handlers/psapcluster.py`:
- Around line 52-64: Update _check_kubeconfig to stop accessing the private
ctx.registry._k8s and call a new public API on the registry (e.g.,
ctx.registry.secret_exists(secret_name, namespace)) that encapsulates the
read_namespaced_secret logic; add the corresponding method on ClusterRegistry
(secret_exists(name, namespace)) which returns a boolean or raises a controlled
exception for non-404 errors, then have _check_kubeconfig map True->"Valid",
False->"Missing" and handle other errors as "Invalid" while logging the
exception via logger.warning as before.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e083bc62-0476-4762-9cde-25f919d35cc5

📥 Commits

Reviewing files that changed from the base of the PR and between 83811e2 and 2634aef.

📒 Files selected for processing (15)
  • README.md
  • dev/mock-psapcluster.yaml
  • fournos/core/constants.py
  • fournos/core/gpu_discovery.py
  • fournos/core/psapcluster.py
  • fournos/handlers/__init__.py
  • fournos/handlers/psapcluster.py
  • fournos/operator.py
  • fournos/settings.py
  • fournos/state.py
  • manifests/psapcluster-crd.yaml
  • manifests/rbac/role_fournos.yaml
  • tests/unit/test_gpu_discovery.py
  • tests/unit/test_psapcluster_handlers.py
  • tests/unit/test_psapcluster_manager.py

Comment thread fournos-cluster/fournos_cluster/core/gpu_discovery.py
Comment thread fournos/core/psapcluster.py Outdated
Comment thread README.md
@MML-coder MML-coder changed the title feat: PSAPCluster CRD controller with GPU discovery and cluster locking feat: PSAPCluster controller with sentinel locking, auto-discovery, and GPU discovery May 1, 2026
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 7

🧹 Nitpick comments (1)
fournos/operator.py (1)

169-177: ⚡ Quick win

Run discovery once before the first interval sleep.

Current loop delays initial discovery until the first timer elapses. Doing an immediate first pass reduces startup lag for newly added kubeconfig secrets.

Proposed fix
 def _discovery_loop():
     interval = settings.cluster_discovery_interval_sec
     while True:
-        time.sleep(interval)
         try:
             handlers.scan_clusters()
         except Exception:
             logger.exception("Cluster discovery scan failed")
+        time.sleep(interval)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@fournos/operator.py` around lines 169 - 177, The discovery loop in
_discovery_loop currently waits for settings.cluster_discovery_interval_sec
before the first scan; change the loop to perform an immediate scan on startup
by calling handlers.scan_clusters() (inside the existing try/except that logs
via logger.exception) before sleeping, i.e. move the sleep to occur after the
scan so the sequence is: try: handlers.scan_clusters() except Exception:
logger.exception(...) then time.sleep(settings.cluster_discovery_interval_sec),
and repeat.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@fournos/core/discovery.py`:
- Around line 47-63: The loop that processes each secret should not abort the
entire discovery pass on a single failure; wrap the per-cluster work (after
cluster_name resolution) in a try/except so exceptions from
extract_cluster_name, self._kueue.create_flavor,
self._kueue.add_flavor_to_cluster_queue, or self._create_psapcluster are caught,
logged with contextual info (cluster_name and secret_name) and then continue to
the next secret; ensure you only append to discovered on success so failed
clusters are not marked discovered and rethrow nothing so the outer discovery
loop proceeds.

In `@fournos/core/kueue.py`:
- Around line 215-271: add_flavor_to_cluster_queue (and similarly
update_flavor_quotas) currently does a read-modify-write on the global
ClusterQueue and can lose concurrent changes; modify these functions to perform
optimistic retry-on-conflict by reading and including the object's current
metadata.resourceVersion when patching and retrying on ApiException conflict
(409) a few times, or alternatively switch to a narrower patch/Server-Side Apply
that targets only the specific flavor element instead of replacing
spec.resourceGroups; locate add_flavor_to_cluster_queue and update_flavor_quotas
in kueue.py and implement a loop that GETs the ClusterQueue, computes the
change, issues a PATCH/SSApply with resourceVersion or field-specific patch, and
retries on conflict until success or max attempts.

In `@fournos/handlers/lifecycle.py`:
- Around line 52-63: Reject combinations where lockOnly is true but exclusive is
false instead of silently overriding: in the validation blocks that read
exclusive = spec["exclusive"] and lock_only = spec.get("lockOnly", False) (and
the duplicate block around lines with similar logic), add a check that if
lock_only is True and exclusive is False then set patch.status["phase"] =
Phase.FAILED and patch.status["message"] to a clear message like "lockOnly: true
requires 'exclusive' to be true" and return; reference the same symbols
(exclusive, lock_only, spec) and ensure the same check is applied in both places
(the initial validation and the later similar block that currently enforces
exclusivity in _create_lock_workload()) so user input is rejected rather than
silently overridden.

In `@fournos/handlers/psapcluster.py`:
- Around line 216-220: reconcile_psapcluster currently calls
_reconcile_ttl_expiry(...) then continues to call _reconcile_lock_job(...) which
can immediately recreate a lock because patch was mutated but spec/status still
reflect the old locked state; modify _reconcile_ttl_expiry to return a boolean
(e.g., released=True if it called _release_lock()) and update
reconcile_psapcluster to check that return and stop the reconcile pass (early
return) when TTL expiry released the lock so _reconcile_lock_job is not invoked
with stale spec/status; apply the same pattern to the other reconcile sequences
mentioned (the blocks around lines 238-260 and 354-383) so each
_reconcile_ttl_expiry caller aborts further reconcile steps when it reports a
release.
- Around line 303-310: The code currently always appends a new condition into
patch.status.setdefault("conditions", []) (see the append call with
_make_condition and COND_GPU_DISCOVERED), causing duplicate COND_GPU_DISCOVERED
entries over time; change this to "upsert" by finding and removing any existing
condition whose "type" equals COND_GPU_DISCOVERED from
patch.status["conditions"] before adding the new _make_condition(...) entry.
Apply the same replacement in both the failure path (the shown append using
str(exc)) and the corresponding success path where COND_GPU_DISCOVERED is
appended, using the symbols patch.status, COND_GPU_DISCOVERED, and
_make_condition to locate and modify the code.
- Around line 342-347: The guard `and new_resources` prevents clearing Kueue
quotas when discovery reports zero GPUs; change the logic so that
`ctx.kueue.update_flavor_quotas(name, new_resources)` is called whenever
`new_resources != prev_resources` (even if `new_resources` is empty) so stale
GPU quotas are removed — update the block around the variables `prev_gpus`,
`new_resources`, `prev_resources` and the call to
`ctx.kueue.update_flavor_quotas(name, new_resources)` to remove the `and
new_resources` condition and ensure empty `new_resources` clears quotas.

In `@README.md`:
- Around line 323-325: The README's PSAPCluster onboarding is inconsistent about
manual quota setup: update the wording around the step that says "Add a
ResourceFlavor and quota for the cluster in `config/kueue-cluster-config.yaml`"
(and the repeated text at the later block) to clearly state whether this manual
step is required, optional fallback, or legacy; explicitly mention that the
controller (referenced as "controller updates ClusterQueue quotas automatically"
in the "Onboarding a new cluster" section) will normally create/update
ClusterQueue quotas for onboarded clusters and keep the manual
`config/kueue-cluster-config.yaml` step as an optional/manual fallback for edge
cases or legacy workflows, and add a short cross-reference to the "Onboarding a
new cluster" section and a one-line example/criteria when the manual step is
needed.

---

Nitpick comments:
In `@fournos/operator.py`:
- Around line 169-177: The discovery loop in _discovery_loop currently waits for
settings.cluster_discovery_interval_sec before the first scan; change the loop
to perform an immediate scan on startup by calling handlers.scan_clusters()
(inside the existing try/except that logs via logger.exception) before sleeping,
i.e. move the sleep to occur after the scan so the sequence is: try:
handlers.scan_clusters() except Exception: logger.exception(...) then
time.sleep(settings.cluster_discovery_interval_sec), and repeat.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 9c74b8b6-0fa5-4d35-a8eb-3736ecf1f197

📥 Commits

Reviewing files that changed from the base of the PR and between 2634aef and efa8139.

📒 Files selected for processing (19)
  • README.md
  • config/forge/samples/job-full.yaml
  • fournos/core/clusters.py
  • fournos/core/constants.py
  • fournos/core/discovery.py
  • fournos/core/kueue.py
  • fournos/handlers/__init__.py
  • fournos/handlers/discovery.py
  • fournos/handlers/execution.py
  • fournos/handlers/lifecycle.py
  • fournos/handlers/psapcluster.py
  • fournos/operator.py
  • fournos/settings.py
  • fournos/state.py
  • manifests/crd.yaml
  • manifests/psapcluster-crd.yaml
  • manifests/rbac/role_fournos.yaml
  • tests/unit/test_cluster_discovery.py
  • tests/unit/test_psapcluster_handlers.py
✅ Files skipped from review due to trivial changes (3)
  • config/forge/samples/job-full.yaml
  • fournos/handlers/init.py
  • manifests/psapcluster-crd.yaml
🚧 Files skipped from review as they are similar to previous changes (4)
  • fournos/state.py
  • fournos/core/constants.py
  • manifests/rbac/role_fournos.yaml
  • tests/unit/test_psapcluster_handlers.py

Comment thread fournos/core/discovery.py Outdated
Comment on lines +47 to +63
for secret_name in secrets:
cluster_name = extract_cluster_name(secret_name)
if cluster_name is None:
continue

if cluster_name in existing_clusters:
continue

logger.info("Discovered new cluster %s from secret %s", cluster_name, secret_name)

if cluster_name not in existing_flavors:
self._kueue.create_flavor(cluster_name)

self._kueue.add_flavor_to_cluster_queue(cluster_name)
self._create_psapcluster(cluster_name, secret_name)
discovered.append(cluster_name)

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot May 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

A single onboarding failure aborts the whole discovery pass.

If one cluster fails during flavor/CQ/CR creation, the loop exits and later valid clusters are skipped for that cycle. Isolate errors per cluster and continue.

Proposed fix
         discovered: list[str] = []
         for secret_name in secrets:
             cluster_name = extract_cluster_name(secret_name)
             if cluster_name is None:
                 continue

             if cluster_name in existing_clusters:
                 continue

             logger.info("Discovered new cluster %s from secret %s", cluster_name, secret_name)

-            if cluster_name not in existing_flavors:
-                self._kueue.create_flavor(cluster_name)
-
-            self._kueue.add_flavor_to_cluster_queue(cluster_name)
-            self._create_psapcluster(cluster_name, secret_name)
-            discovered.append(cluster_name)
+            try:
+                if cluster_name not in existing_flavors:
+                    self._kueue.create_flavor(cluster_name)
+                    existing_flavors.add(cluster_name)
+
+                self._kueue.add_flavor_to_cluster_queue(cluster_name)
+                self._create_psapcluster(cluster_name, secret_name)
+                discovered.append(cluster_name)
+            except Exception:
+                logger.exception(
+                    "Failed onboarding discovered cluster %s from secret %s",
+                    cluster_name,
+                    secret_name,
+                )
+                continue
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@fournos/core/discovery.py` around lines 47 - 63, The loop that processes each
secret should not abort the entire discovery pass on a single failure; wrap the
per-cluster work (after cluster_name resolution) in a try/except so exceptions
from extract_cluster_name, self._kueue.create_flavor,
self._kueue.add_flavor_to_cluster_queue, or self._create_psapcluster are caught,
logged with contextual info (cluster_name and secret_name) and then continue to
the next secret; ensure you only append to discovered on success so failed
clusters are not marked discovered and rethrow nothing so the outer discovery
loop proceeds.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like a valid concern

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kpouget, glad we're aligned! The fix is straightforward — wrapping the per-cluster onboarding block in a try/except ensures that a transient API error (e.g., a flaky Kueue patch or a race on PSAPCluster creation) for one cluster doesn't silently drop all subsequent clusters from that scan cycle.

The proposed fix from the review comment is ready to apply as-is:

discovered: list[str] = []
for secret_name in secrets:
    cluster_name = extract_cluster_name(secret_name)
    if cluster_name is None:
        continue

    if cluster_name in existing_clusters:
        continue

    logger.info("Discovered new cluster %s from secret %s", cluster_name, secret_name)

    try:
        if cluster_name not in existing_flavors:
            self._kueue.create_flavor(cluster_name)
            existing_flavors.add(cluster_name)

        self._kueue.add_flavor_to_cluster_queue(cluster_name)
        self._create_psapcluster(cluster_name, secret_name)
        discovered.append(cluster_name)
    except Exception:
        logger.exception(
            "Failed onboarding discovered cluster %s from secret %s",
            cluster_name,
            secret_name,
        )
        continue

Note also the existing_flavors.add(cluster_name) after a successful create_flavor call — this keeps the in-memory set consistent so that if the same flavor name somehow appears again later in the same scan, it won't attempt a redundant creation.

Would you like me to plan these code changes for you?

Copy link
Copy Markdown
Collaborator Author

@MML-coder MML-coder May 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good suggestion to incorporate related to transient api error.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!

Comment thread fournos/core/kueue.py
Comment on lines 52 to +63
exclusive = spec["exclusive"]
lock_only = spec.get("lockOnly", False)

if lock_only and not cluster:
patch.status["phase"] = Phase.FAILED
patch.status["message"] = "lockOnly: true requires 'cluster' to be set"
return

if not lock_only and not spec.get("forge"):
patch.status["phase"] = Phase.FAILED
patch.status["message"] = "spec.forge is required for non-lockOnly jobs"
return
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Fail fast on lockOnly: true + exclusive: false instead of silently overriding.

Current flow accepts that combo but _create_lock_workload() enforces exclusive anyway, so user intent is ignored. Rejecting this input makes behavior explicit and debuggable.

Proposed fix
 def on_create(spec, name, namespace, status, patch, body):
@@
     cluster = spec.get("cluster")
     exclusive = spec["exclusive"]
     lock_only = spec.get("lockOnly", False)

+    if lock_only and not exclusive:
+        patch.status["phase"] = Phase.FAILED
+        patch.status["message"] = "lockOnly jobs require 'exclusive: true'"
+        return
+
     if lock_only and not cluster:
         patch.status["phase"] = Phase.FAILED
         patch.status["message"] = "lockOnly: true requires 'cluster' to be set"
         return

Also applies to: 95-106

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@fournos/handlers/lifecycle.py` around lines 52 - 63, Reject combinations
where lockOnly is true but exclusive is false instead of silently overriding: in
the validation blocks that read exclusive = spec["exclusive"] and lock_only =
spec.get("lockOnly", False) (and the duplicate block around lines with similar
logic), add a check that if lock_only is True and exclusive is False then set
patch.status["phase"] = Phase.FAILED and patch.status["message"] to a clear
message like "lockOnly: true requires 'exclusive' to be true" and return;
reference the same symbols (exclusive, lock_only, spec) and ensure the same
check is applied in both places (the initial validation and the later similar
block that currently enforces exclusivity in _create_lock_workload()) so user
input is rejected rather than silently overridden.

Comment thread fournos/handlers/psapcluster.py Outdated
Comment on lines +216 to +220
def reconcile_psapcluster(spec, name, namespace, status, patch, body):
_reconcile_kubeconfig(spec, name, status, patch)
_reconcile_ttl_expiry(spec, name, status, patch)
_reconcile_gpu_discovery(spec, name, status, patch)
_reconcile_lock_job(spec, name, status, patch)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Stop the reconcile pass after TTL expiry releases the lock.

_reconcile_ttl_expiry() only updates patch, but the same pass still calls _reconcile_lock_job() with the old spec/status. After _release_lock() deletes the sentinel, the self-healing branch still sees status.locked == True and the old spec.owner, so it can recreate the lock immediately and make TTL expiry ineffective.

Suggested fix
 def reconcile_psapcluster(spec, name, namespace, status, patch, body):
     _reconcile_kubeconfig(spec, name, status, patch)
-    _reconcile_ttl_expiry(spec, name, status, patch)
+    if _reconcile_ttl_expiry(spec, name, status, patch):
+        return
     _reconcile_gpu_discovery(spec, name, status, patch)
     _reconcile_lock_job(spec, name, status, patch)
 
-
 def _reconcile_ttl_expiry(spec, name, status, patch):
     if not status.get("locked"):
-        return
+        return False
 ...
     if datetime.now(timezone.utc) >= expires_at:
         prev_owner = spec.get("owner", "unknown")
         logger.info(
             "PSAPCluster %s: ownership expired (was owned by %s)", name, prev_owner
         )
         patch.spec["owner"] = ""
         _release_lock(name, patch)
+        return True
+    return False

Also applies to: 238-260, 354-383

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@fournos/handlers/psapcluster.py` around lines 216 - 220,
reconcile_psapcluster currently calls _reconcile_ttl_expiry(...) then continues
to call _reconcile_lock_job(...) which can immediately recreate a lock because
patch was mutated but spec/status still reflect the old locked state; modify
_reconcile_ttl_expiry to return a boolean (e.g., released=True if it called
_release_lock()) and update reconcile_psapcluster to check that return and stop
the reconcile pass (early return) when TTL expiry released the lock so
_reconcile_lock_job is not invoked with stale spec/status; apply the same
pattern to the other reconcile sequences mentioned (the blocks around lines
238-260 and 354-383) so each _reconcile_ttl_expiry caller aborts further
reconcile steps when it reports a release.

Comment on lines +303 to +310
patch.status.setdefault("conditions", []).append(
_make_condition(
COND_GPU_DISCOVERED,
"False",
"DiscoveryFailed",
str(exc),
)
)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Upsert conditions instead of appending duplicates on every timer run.

Both the success and failure paths append a fresh COND_GPU_DISCOVERED entry every reconciliation. On a long-lived cluster this grows status.conditions without bound and makes consumers pick between duplicate condition records. Replace the existing condition by type instead of always appending.

Also applies to: 338-340

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@fournos/handlers/psapcluster.py` around lines 303 - 310, The code currently
always appends a new condition into patch.status.setdefault("conditions", [])
(see the append call with _make_condition and COND_GPU_DISCOVERED), causing
duplicate COND_GPU_DISCOVERED entries over time; change this to "upsert" by
finding and removing any existing condition whose "type" equals
COND_GPU_DISCOVERED from patch.status["conditions"] before adding the new
_make_condition(...) entry. Apply the same replacement in both the failure path
(the shown append using str(exc)) and the corresponding success path where
COND_GPU_DISCOVERED is appended, using the symbols patch.status,
COND_GPU_DISCOVERED, and _make_condition to locate and modify the code.

Comment thread fournos/handlers/psapcluster.py Outdated
Comment on lines +342 to +347
prev_gpus = hardware.get("gpus", [])
new_resources = [(g.short_name, g.count) for g in result.gpus]
prev_resources = [(g.get("shortName"), g.get("count")) for g in prev_gpus]
if new_resources != prev_resources and new_resources:
try:
ctx.kueue.update_flavor_quotas(name, new_resources)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Clear Kueue quotas when discovery drops to zero GPUs.

The and new_resources guard skips update_flavor_quotas() when a cluster previously had GPUs but discovery now returns none. That leaves stale GPU quota in Kueue and can schedule workloads onto capacity that no longer exists.

Suggested fix
-    if new_resources != prev_resources and new_resources:
+    if new_resources != prev_resources:
         try:
             ctx.kueue.update_flavor_quotas(name, new_resources)
         except Exception as exc:
             logger.warning(
                 "PSAPCluster %s: failed to update flavor quotas: %s", name, exc
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
prev_gpus = hardware.get("gpus", [])
new_resources = [(g.short_name, g.count) for g in result.gpus]
prev_resources = [(g.get("shortName"), g.get("count")) for g in prev_gpus]
if new_resources != prev_resources and new_resources:
try:
ctx.kueue.update_flavor_quotas(name, new_resources)
prev_gpus = hardware.get("gpus", [])
new_resources = [(g.short_name, g.count) for g in result.gpus]
prev_resources = [(g.get("shortName"), g.get("count")) for g in prev_gpus]
if new_resources != prev_resources:
try:
ctx.kueue.update_flavor_quotas(name, new_resources)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@fournos/handlers/psapcluster.py` around lines 342 - 347, The guard `and
new_resources` prevents clearing Kueue quotas when discovery reports zero GPUs;
change the logic so that `ctx.kueue.update_flavor_quotas(name, new_resources)`
is called whenever `new_resources != prev_resources` (even if `new_resources` is
empty) so stale GPU quotas are removed — update the block around the variables
`prev_gpus`, `new_resources`, `prev_resources` and the call to
`ctx.kueue.update_flavor_quotas(name, new_resources)` to remove the `and
new_resources` condition and ensure empty `new_resources` clears quotas.

Comment thread README.md
Comment thread fournos/core/discovery.py Outdated
Comment on lines +52 to +53
if cluster_name in existing_clusters:
continue
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we need to make sure that the objects (flavor, etc) are there, even if cluster_name in existing_clusters
they need to

  • exist
  • have the right content
    (I think they need to be created every time)

Comment on lines +16 to +25
GPU_MODEL_MAP: dict[str, str] = {
"NVIDIA-A100-SXM4-80GB": "a100",
"NVIDIA-A100-SXM4-40GB": "a100",
"NVIDIA-A100-PCIE-80GB": "a100",
"NVIDIA-A100-PCIE-40GB": "a100",
"NVIDIA-H100-SXM5-80GB": "h100",
"NVIDIA-H100-PCIE-80GB": "h100",
"NVIDIA-H200-SXM-141GB": "h200",
"AMD-Instinct-MI300X": "mi300x",
}
Copy link
Copy Markdown
Collaborator

@kpouget kpouget May 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the can be in a user-configurable config map
(can be done later. If we don't do it in the PR, need to fill a ticket so that we don't forget it)

Comment thread fournos-cluster/fournos_cluster/core/gpu_discovery.py Outdated
Copy link
Copy Markdown
Collaborator

@kpouget kpouget left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MML-coder , could you turn this into a controller fully independent from FOURNOS ?

you create a secondary top directory, and only important the relevant FOURNOS bits (constants, I assume)

we'll also need a name for the controller,
and maybe use another name than "PSAP"
Maybe FournosCluster, or something alike if we find another nice name

Comment on lines +1 to +2
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we'll rediscuss it, but I think we'll want to have

spec:
   resources: 
      h200: 8

with spec.resources either user provided or dynamically looked up if spec.resources: {}

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you thinking of user could provide how many GPus they want to use by specifying spec.resources: 2 or 8 ? and the job can use available GPUs for the job?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there are two things I have in mind:

  1. we want the controller to do this only once (hence the spec field. Like a dynamic default value, not a truly reconciliated field)
  2. in the future, we want to allow the customization of the resources list with SW properties (TASK: refine the future behavior of the exclusive/cluster/hardware flags #70)

Comment thread fournos/core/discovery.py Outdated
MML-coder added 5 commits May 7, 2026 12:07
…ocking

Introduces a PSAPCluster custom resource and kopf controller that unifies
cluster definition, discovers GPU hardware dynamically, manages per-cluster
Kueue ClusterQueues, and enables ownership locking via stopPolicy.

- PSAPCluster CRD (fournos.dev/v1) with spec fields: kubeconfigSecret,
  owner, ttl, evict, gpuDiscoveryInterval
- GPU discovery client that connects to target clusters via kubeconfig
  secrets, lists nodes, and aggregates GPU counts by vendor/model
- Per-cluster ClusterQueues in a shared cohort with self-healing
- Ownership locking with Hold/HoldAndDrain stopPolicy and TTL auto-expiry
- Timer-based reconciliation: kubeconfig validation, TTL expiry, GPU
  discovery with backoff, CQ/RF self-healing
- 82 unit tests covering all modules

Closes #37, closes #51
Replace per-cluster ClusterQueue stopPolicy locking with sentinel
FournosJob approach. A lockOnly sentinel holds all 100 cluster-slots
via Kueue, blocking competing workloads without requiring per-cluster
CQ infrastructure.

Add auto-discovery daemon that scans kubeconfig secrets every 60s and
auto-creates ResourceFlavor, ClusterQueue flavor entry, and PSAPCluster
CR for each new cluster found.

Key changes:
- Add lockOnly field to FournosJob CRD, skip Forge for lockOnly jobs
- Rewrite PSAPCluster handlers for sentinel create/delete lifecycle
- Remove PSAPClusterManager (fournos/core/psapcluster.py)
- Add create_flavor, add_flavor_to_cluster_queue, update_flavor_quotas
  to KueueClient
- Add ClusterDiscovery with extract_cluster_name, list_kubeconfig_secrets
- Wire discovery daemon thread in operator startup
- Fix RBAC: add create/delete for fournosjobs, create for psapclusters
- Add end-to-end locking test walkthrough to README
- 28 new discovery unit tests, rewritten PSAPCluster handler tests
…ter service

Move the cluster management logic (GPU discovery, sentinel locking,
auto-discovery, Kueue integration) out of the fournos operator into a
standalone kopf-based controller in fournos-cluster/.

The new controller watches kubeconfig secrets in psap-secrets and manages
FournosCluster CRs in psap-automation. It runs as its own Deployment with
dedicated RBAC and container image.

Changes to the fournos operator:
- Remove PSAPCluster handlers, discovery daemon, GPU discovery, and related
  settings/state
- Fix lifecycle validation to check spec.executionEngine instead of spec.forge
- Remove psapclusters RBAC from role_fournos.yaml

Tested on psap-automation cluster:
- Auto-discovery from labeled secrets
- GPU discovery (8x NVIDIA on athena-fire)
- Sentinel locking (blocks competing workloads via Kueue)
- Lock release (sentinel deleted, blocked workloads admitted)
@MML-coder MML-coder force-pushed the feat/psapcluster-controller branch from efa8139 to 89627f7 Compare May 8, 2026 22:24
@MML-coder MML-coder changed the title feat: PSAPCluster controller with sentinel locking, auto-discovery, and GPU discovery feat: extract FournosCluster controller into independent fournos-cluster service May 8, 2026
@MML-coder MML-coder changed the title feat: extract FournosCluster controller into independent fournos-cluster service feat: FournosCluster controller with GPU discovery, sentinel locking, and Kueue integration May 8, 2026
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
fournos/core/constants.py (1)

30-31: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Fix formatter drift causing CI failure.

Line 30 introduces spacing that ruff format --check wants to rewrite, and lint is currently failing on this PR. Please run ruff format (or normalize these blank lines manually) before merge.

Suggested minimal fix
 class Shutdown(StrEnum):
     STOP = "Stop"
     TERMINATE = "Terminate"
-    
-
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@fournos/core/constants.py` around lines 30 - 31, Remove the unexpected extra
blank line(s) introduced in fournos/core/constants.py (around the top of the
module where module-level constants are defined) so the file matches ruff's
formatting expectations; run "ruff format" (or manually normalize the blank
lines to a single blank line between top-level sections) and re-run "ruff format
--check" to ensure the formatter drift is fixed before merging.
fournos/handlers/lifecycle.py (1)

1-272: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

CI is failing on ruff format --check.

The lint job reports formatting changes needed for this file. Run ruff format (or ruff format --write) and commit before merge, otherwise CI will keep blocking the PR.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@fournos/handlers/lifecycle.py` around lines 1 - 272, The file fails ruff
formatting checks; run the auto-formatter and commit the updated file.
Specifically, open the lifecycle handlers containing functions on_create,
_create_lock_workload, _find_exclusive_locker, _pending_status, and
reconcile_pending, run `ruff format` (or `ruff format --write`) to apply the
required whitespace/line-break/style fixes, verify no remaining ruff format
--check errors, and commit the formatted changes so CI stops failing.
♻️ Duplicate comments (3)
fournos-cluster/fournos_cluster/core/gpu_discovery.py (1)

125-139: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Timeout is still configured on api_client.configuration and has no effect at call time.

The Kubernetes Python client does not honor Configuration.request_timeout — that attribute is not consulted by ApiClient.call_api() / RESTClient.request(). The timeout must be passed per-call as _request_timeout. As written, list_node() will block indefinitely against an unreachable target cluster, which directly defeats the intent of gpu_discovery_timeout_sec and can stall the reconciler thread.

A previous review raised this same issue on an earlier commit; the implementation was not updated.

Does kubernetes-client/python honor Configuration.request_timeout for individual API calls, or must _request_timeout be passed per call?
🛠️ Suggested fix
     api_client = k8s_config.new_client_from_config_dict(kubeconfig_dict)
-    api_client.configuration.request_timeout = (
-        settings.gpu_discovery_timeout_sec,
-        settings.gpu_discovery_timeout_sec,
-    )
+    timeout = (
+        settings.gpu_discovery_timeout_sec,
+        settings.gpu_discovery_timeout_sec,
+    )

     try:
         target_core = client.CoreV1Api(api_client)
-        nodes = target_core.list_node()
+        nodes = target_core.list_node(_request_timeout=timeout)
     except Exception as exc:
         raise GPUDiscoveryError(
             f"Failed to connect to cluster {cluster_name!r}: {exc}"
         ) from exc
     finally:
         api_client.close()
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@fournos-cluster/fournos_cluster/core/gpu_discovery.py` around lines 125 -
139, The code configures api_client.configuration.request_timeout which the
Kubernetes Python client ignores; change the call that fetches nodes
(target_core.list_node) to pass the timeout per-call via the internal
_request_timeout parameter (e.g.,
_request_timeout=(settings.gpu_discovery_timeout_sec,
settings.gpu_discovery_timeout_sec)) so the call will time out; keep the
existing try/except raising GPUDiscoveryError on failure and the
api_client.close() in the finally block.
README.md (1)

323-345: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Step 2 still contradicts step 3 / "controller automatically …".

The PSAPCluster onboarding section still tells users to manually add a ResourceFlavor and quota in step 2, while the bullet a few lines below states the controller "Discovers GPUs … and updates the global fournos-queue ClusterQueue quotas for this cluster's ResourceFlavor". The previous review on this section is still open — please clarify whether step 2 is a manual fallback or legacy flow.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@README.md` around lines 323 - 345, The documentation currently contradicts
itself about who creates ResourceFlavors and quotas; update the PSAPCluster
onboarding section to clearly state that the controller will auto-discover GPUs
and create/update the cluster's ResourceFlavor and update the global
`fournos-queue` ClusterQueue quotas when a PSAPCluster is applied, and make step
2 (the instruction to manually add a ResourceFlavor and quota in
`config/kueue-cluster-config.yaml`) an explicit "manual fallback" for
environments where the controller is disabled/unavailable; mention
`PSAPCluster`, `ResourceFlavor`, `fournos-queue`, and
`config/kueue-cluster-config.yaml` by name so readers know when to rely on
automation vs perform the manual step.
fournos/handlers/lifecycle.py (1)

51-89: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

lockOnly: true + exclusive: false is still silently overridden.

exclusive is read from the spec and checked at line 84, but for lock_only: true the path branches into _create_lock_workload (line 88) where exclusive=True is hardcoded (line 104) — so a user who explicitly sets exclusive: false along with lockOnly: true gets a different behavior than they declared, with no signal back.

The previous review proposed rejecting this combination explicitly; that recommendation still applies to the current code.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@fournos/handlers/lifecycle.py` around lines 51 - 89, Add an explicit
validation that rejects the unsupported combination of lock_only and
exclusive=False: when spec["lockOnly"] (lock_only) is true and
spec.get("exclusive") is present and False, set
patch.status["phase"]=Phase.FAILED with a clear message (e.g. "lockOnly: true
cannot be combined with exclusive: false") and return before calling
_create_lock_workload; this prevents silently overriding the user's exclusive
setting and keeps the existing LABEL_EXCLUSIVE_CLUSTER/_create_lock_workload
behavior intact.
🧹 Nitpick comments (4)
fournos/handlers/lifecycle.py (1)

108-125: 💤 Low value

Document the deliberate 409 fall-through.

When the create call returns 409 (workload already exists), control silently falls through to the Phase.PENDING block. That's the right behavior for idempotency, but it isn't obvious from reading the code. A one-line comment would prevent a future refactor from "fixing" it into a return.

♻️ Suggested clarification
     except k8s_client.exceptions.ApiException as exc:
-        if exc.status != 409:
+        if exc.status != 409:
             patch.status["phase"] = Phase.FAILED
             patch.status["message"] = f"Failed to create Workload: {exc.reason}"
             logger.error("Job %s: Workload creation failed: %s", name, exc.reason)
             return
+        # 409: workload already exists (e.g. handler retry); proceed to PENDING.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@fournos/handlers/lifecycle.py` around lines 108 - 125, Add a one-line comment
inside the except k8s_client.exceptions.ApiException as exc block explaining
that a 409 status is intentionally allowed to fall through (workload already
exists) for idempotent behavior; make it adjacent to the conditional that checks
exc.status != 409 and mention that when exc.status == 409 we intentionally
proceed to set patch.status to Phase.PENDING and set the COND_WORKLOAD_ADMITTED
condition via set_condition so future maintainers don’t change the fall-through
to a return.
fournos-cluster/fournos_cluster/core/gpu_discovery.py (1)

116-123: 💤 Low value

Silently picking contexts[0] can mask a malformed kubeconfig.

A kubeconfig with no current-context and multiple contexts is unusual; defaulting to the first one without logging makes it hard to diagnose why GPU discovery hits the "wrong" cluster. Consider logging a warning so this fallback is visible in operator logs.

♻️ Suggested change
         if "current-context" not in kubeconfig_dict:
             contexts = kubeconfig_dict.get("contexts", [])
             if contexts:
+                if len(contexts) > 1:
+                    logger.warning(
+                        "Kubeconfig for %s has no current-context; "
+                        "defaulting to first of %d contexts (%s)",
+                        cluster_name, len(contexts), contexts[0].get("name"),
+                    )
                 kubeconfig_dict["current-context"] = contexts[0]["name"]
             else:
                 raise GPUDiscoveryError(
                     f"Kubeconfig for {cluster_name!r} has no contexts"
                 )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@fournos-cluster/fournos_cluster/core/gpu_discovery.py` around lines 116 -
123, When you detect a missing "current-context" and choose to fallback to
contexts[0], emit a warning log that includes cluster_name, the list of
available contexts and the chosen context so the fallback is visible in operator
logs; update the block that inspects kubeconfig_dict and contexts (the code that
currently assigns kubeconfig_dict["current-context"] = contexts[0]["name"]) to
call the module logger (or existing logger used for GPU discovery) to warn
before assigning, while leaving the GPUDiscoveryError raise path intact for the
empty-contexts case.
fournos-cluster/fournos_cluster/handlers/cluster.py (1)

57-67: 💤 Low value

Narrow the exception in _check_kubeconfig.

Catching bare Exception and only conditionally inspecting .status masks programming errors (e.g., AttributeError, KeyError from a malformed spec) as Invalid. Catch kubernetes.client.exceptions.ApiException for the 404 → Missing / non-404 → Invalid mapping and let unexpected exceptions propagate to Kopf for retry/visibility.

♻️ Suggested change
-    try:
-        core.read_namespaced_secret(secret_name, settings.secrets_namespace)
-        return "Valid"
-    except Exception as exc:
-        if hasattr(exc, "status") and exc.status == 404:
-            return "Missing"
-        logger.warning("Error checking kubeconfig secret %s: %s", secret_name, exc)
-        return "Invalid"
+    try:
+        core.read_namespaced_secret(secret_name, settings.secrets_namespace)
+        return "Valid"
+    except client.exceptions.ApiException as exc:
+        if exc.status == 404:
+            return "Missing"
+        logger.warning("Error checking kubeconfig secret %s: %s", secret_name, exc)
+        return "Invalid"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@fournos-cluster/fournos_cluster/handlers/cluster.py` around lines 57 - 67,
The function _check_kubeconfig currently catches all Exception which masks
programming errors; change it to only catch
kubernetes.client.exceptions.ApiException when calling
core.read_namespaced_secret(secret_name, settings.secrets_namespace), check
exc.status == 404 to return "Missing" and return "Invalid" for other
ApiException statuses, and remove the broad except so that unexpected errors
(e.g., KeyError on spec or AttributeError) propagate to Kopf; reference symbols:
_check_kubeconfig, spec, secret_name, core.read_namespaced_secret,
settings.secrets_namespace, logger.warning, and import/use
kubernetes.client.exceptions.ApiException.
fournos-cluster/fournos_cluster/operator.py (1)

25-26: 💤 Low value

logging.basicConfig() without force=True will not apply the configured level or format.

By the time @kopf.on.startup() runs, Kopf has already attached handlers to the root logger. Calling basicConfig() without force=True returns silently without applying changes. Use force=True to override existing configuration, or call logging.getLogger().setLevel(log_level) directly. For custom formatting, consider using kopf run --log-format ... CLI options or add a handler with your formatter in the startup hook.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@fournos-cluster/fournos_cluster/operator.py` around lines 25 - 26, The
current call to logging.basicConfig(level=log_level, format=...) won't override
Kopf's existing handlers; update the startup logging setup (where
logging.basicConfig is called and settings.log_level is used) to either call
logging.basicConfig(..., force=True) or explicitly set the root logger level and
add a handler with your formatter (e.g., logging.getLogger().setLevel(log_level)
plus create and attach a StreamHandler with the formatter) so the configured
level/format are actually applied.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@fournos-cluster/deploy/deployment.yaml`:
- Around line 38-39: Replace the personal mutable image reference and Always
pull policy: update the image field value that currently reads
"quay.io/memehta/fournos-cluster:latest" to an organization-owned repository and
a pinned identifier (for example "quay.io/openshift-psap/fournos-cluster:vX.Y.Z"
or a digest "quay.io/openshift-psap/fournos-cluster@sha256:...") and keep or
change imagePullPolicy only as needed for your release workflow (e.g., remove
imagePullPolicy: Always or set to IfNotPresent for pinned tags); ensure you
modify the deployment.yaml image entry and any related CI/CD references that
push tags so pods no longer rely on a personal namespace or the mutable "latest"
tag.
- Around line 62-69: Add container-level readOnlyRootFilesystem: true to the
existing container.securityContext (or if the Kopf process needs writable space,
mount an emptyDir volume at a writable path such as /tmp and limit writes to
that mount) and add a minimal pod-level securityContext with runAsNonRoot: true
(and optionally fsGroup if needed) to the Pod spec to satisfy PSA restricted
checks; update the container spec referenced by securityContext,
allowPrivilegeEscalation, capabilities, runAsNonRoot and seccompProfile and add
a top-level pod securityContext entry in the same Deployment manifest.

In `@fournos-cluster/fournos_cluster/constants.py`:
- Around line 14-15: The CRD uses a singular plural name which violates
Kubernetes conventions; update the CRD's spec.names.plural value from
"fournoscluster" to "fournosclusters", change the constant
FOURNOS_CLUSTER_PLURAL in constants.py to "fournosclusters", and update any RBAC
resource entries that reference "fournoscluster", "fournoscluster/status", or
"fournoscluster/finalizers" to use the plural form "fournosclusters",
"fournosclusters/status", and "fournosclusters/finalizers" so the CRD,
constants, and RBAC remain consistent and CustomObjectsApi calls continue to
work.

In `@fournos-cluster/fournos_cluster/core/kueue.py`:
- Around line 110-186: update_flavor_quotas currently adds new GPU resource
names to rg["coveredResources"] and only rebuilds target_flavor["resources"],
leaving sibling flavors missing entries and violating Kueue's invariant; before
patching, iterate over rg["flavors"] (use the existing flavors list and identify
target_flavor) and for each sibling flavor build a resources list aligned with
the final coveredResources ordering (use the same sorted(covered) used for
rg["coveredResources"]) adding each resource with its existing nominalQuota if
present or nominalQuota: 0 otherwise (preserve existing CLUSTER_SLOT_RESOURCE
quota from each flavor when present, or use slot_quota fallback for target if
needed), then assign those lists back into each flavor so every flavor has
entries for every covered resource before calling patch_cluster_custom_object.

In `@README.md`:
- Around line 547-552: The README table lists incorrect env var names;
pydantic-settings in fourn o s_cluster/settings.py sets model_config =
{"env_prefix": "FOURNOS_CLUSTER_"} so environment variables are derived by
prefixing field names—update the table to use the actual env vars
FOURNOS_CLUSTER_RECONCILE_INTERVAL_SEC,
FOURNOS_CLUSTER_GPU_DISCOVERY_DEFAULT_INTERVAL_SEC, and
FOURNOS_CLUSTER_GPU_DISCOVERY_TIMEOUT_SEC (matching the Settings field names),
and remove the fourth row (no env var controls cluster discovery interval);
reference model_config/env_prefix and the Settings field names to locate the
authoritative names.
- Around line 504-507: Update the PSAPCluster troubleshooting instructions so
logs are checked from the cluster-specific operator deployment: replace
references to the `fournos` deployment with `fournos-cluster` and update the
example command to `oc logs deployment/fournos-cluster -n psap-automation | grep
<cluster-name>`; ensure the PSAPCluster section text and the three-step
checklist mention `fournos-cluster` when directing users to check kubeconfig/GPU
discovery and cluster-related errors.

---

Outside diff comments:
In `@fournos/core/constants.py`:
- Around line 30-31: Remove the unexpected extra blank line(s) introduced in
fournos/core/constants.py (around the top of the module where module-level
constants are defined) so the file matches ruff's formatting expectations; run
"ruff format" (or manually normalize the blank lines to a single blank line
between top-level sections) and re-run "ruff format --check" to ensure the
formatter drift is fixed before merging.

In `@fournos/handlers/lifecycle.py`:
- Around line 1-272: The file fails ruff formatting checks; run the
auto-formatter and commit the updated file. Specifically, open the lifecycle
handlers containing functions on_create, _create_lock_workload,
_find_exclusive_locker, _pending_status, and reconcile_pending, run `ruff
format` (or `ruff format --write`) to apply the required
whitespace/line-break/style fixes, verify no remaining ruff format --check
errors, and commit the formatted changes so CI stops failing.

---

Duplicate comments:
In `@fournos-cluster/fournos_cluster/core/gpu_discovery.py`:
- Around line 125-139: The code configures
api_client.configuration.request_timeout which the Kubernetes Python client
ignores; change the call that fetches nodes (target_core.list_node) to pass the
timeout per-call via the internal _request_timeout parameter (e.g.,
_request_timeout=(settings.gpu_discovery_timeout_sec,
settings.gpu_discovery_timeout_sec)) so the call will time out; keep the
existing try/except raising GPUDiscoveryError on failure and the
api_client.close() in the finally block.

In `@fournos/handlers/lifecycle.py`:
- Around line 51-89: Add an explicit validation that rejects the unsupported
combination of lock_only and exclusive=False: when spec["lockOnly"] (lock_only)
is true and spec.get("exclusive") is present and False, set
patch.status["phase"]=Phase.FAILED with a clear message (e.g. "lockOnly: true
cannot be combined with exclusive: false") and return before calling
_create_lock_workload; this prevents silently overriding the user's exclusive
setting and keeps the existing LABEL_EXCLUSIVE_CLUSTER/_create_lock_workload
behavior intact.

In `@README.md`:
- Around line 323-345: The documentation currently contradicts itself about who
creates ResourceFlavors and quotas; update the PSAPCluster onboarding section to
clearly state that the controller will auto-discover GPUs and create/update the
cluster's ResourceFlavor and update the global `fournos-queue` ClusterQueue
quotas when a PSAPCluster is applied, and make step 2 (the instruction to
manually add a ResourceFlavor and quota in `config/kueue-cluster-config.yaml`)
an explicit "manual fallback" for environments where the controller is
disabled/unavailable; mention `PSAPCluster`, `ResourceFlavor`, `fournos-queue`,
and `config/kueue-cluster-config.yaml` by name so readers know when to rely on
automation vs perform the manual step.

---

Nitpick comments:
In `@fournos-cluster/fournos_cluster/core/gpu_discovery.py`:
- Around line 116-123: When you detect a missing "current-context" and choose to
fallback to contexts[0], emit a warning log that includes cluster_name, the list
of available contexts and the chosen context so the fallback is visible in
operator logs; update the block that inspects kubeconfig_dict and contexts (the
code that currently assigns kubeconfig_dict["current-context"] =
contexts[0]["name"]) to call the module logger (or existing logger used for GPU
discovery) to warn before assigning, while leaving the GPUDiscoveryError raise
path intact for the empty-contexts case.

In `@fournos-cluster/fournos_cluster/handlers/cluster.py`:
- Around line 57-67: The function _check_kubeconfig currently catches all
Exception which masks programming errors; change it to only catch
kubernetes.client.exceptions.ApiException when calling
core.read_namespaced_secret(secret_name, settings.secrets_namespace), check
exc.status == 404 to return "Missing" and return "Invalid" for other
ApiException statuses, and remove the broad except so that unexpected errors
(e.g., KeyError on spec or AttributeError) propagate to Kopf; reference symbols:
_check_kubeconfig, spec, secret_name, core.read_namespaced_secret,
settings.secrets_namespace, logger.warning, and import/use
kubernetes.client.exceptions.ApiException.

In `@fournos-cluster/fournos_cluster/operator.py`:
- Around line 25-26: The current call to logging.basicConfig(level=log_level,
format=...) won't override Kopf's existing handlers; update the startup logging
setup (where logging.basicConfig is called and settings.log_level is used) to
either call logging.basicConfig(..., force=True) or explicitly set the root
logger level and add a handler with your formatter (e.g.,
logging.getLogger().setLevel(log_level) plus create and attach a StreamHandler
with the formatter) so the configured level/format are actually applied.

In `@fournos/handlers/lifecycle.py`:
- Around line 108-125: Add a one-line comment inside the except
k8s_client.exceptions.ApiException as exc block explaining that a 409 status is
intentionally allowed to fall through (workload already exists) for idempotent
behavior; make it adjacent to the conditional that checks exc.status != 409 and
mention that when exc.status == 409 we intentionally proceed to set patch.status
to Phase.PENDING and set the COND_WORKLOAD_ADMITTED condition via set_condition
so future maintainers don’t change the fall-through to a return.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 97ec2708-e0f5-4e8e-a6a5-1efb8a2c20f9

📥 Commits

Reviewing files that changed from the base of the PR and between efa8139 and 89627f7.

📒 Files selected for processing (35)
  • README.md
  • config/forge/samples/job-full.yaml
  • dev/mock-psapcluster.yaml
  • fournos-cluster/Containerfile
  • fournos-cluster/README.md
  • fournos-cluster/deploy/deployment.yaml
  • fournos-cluster/fournos_cluster/__init__.py
  • fournos-cluster/fournos_cluster/__main__.py
  • fournos-cluster/fournos_cluster/constants.py
  • fournos-cluster/fournos_cluster/core/__init__.py
  • fournos-cluster/fournos_cluster/core/gpu_discovery.py
  • fournos-cluster/fournos_cluster/core/kueue.py
  • fournos-cluster/fournos_cluster/handlers/__init__.py
  • fournos-cluster/fournos_cluster/handlers/cluster.py
  • fournos-cluster/fournos_cluster/handlers/secret.py
  • fournos-cluster/fournos_cluster/operator.py
  • fournos-cluster/fournos_cluster/settings.py
  • fournos-cluster/fournos_cluster/state.py
  • fournos-cluster/manifests/crd.yaml
  • fournos-cluster/manifests/rbac/role.yaml
  • fournos-cluster/pyproject.toml
  • fournos-cluster/tests/__init__.py
  • fournos-cluster/tests/unit/__init__.py
  • fournos-cluster/tests/unit/test_cluster_handlers.py
  • fournos-cluster/tests/unit/test_gpu_discovery.py
  • fournos-cluster/tests/unit/test_kueue.py
  • fournos-cluster/tests/unit/test_secret_handlers.py
  • fournos/core/clusters.py
  • fournos/core/constants.py
  • fournos/core/kueue.py
  • fournos/handlers/execution.py
  • fournos/handlers/lifecycle.py
  • fournos/settings.py
  • manifests/crd.yaml
  • manifests/rbac/role_fournos.yaml
✅ Files skipped from review due to trivial changes (5)
  • fournos-cluster/fournos_cluster/handlers/init.py
  • fournos-cluster/pyproject.toml
  • config/forge/samples/job-full.yaml
  • fournos-cluster/README.md
  • dev/mock-psapcluster.yaml
🚧 Files skipped from review as they are similar to previous changes (4)
  • fournos/handlers/execution.py
  • fournos/core/clusters.py
  • manifests/crd.yaml
  • fournos/core/kueue.py

Comment on lines +38 to +39
image: quay.io/memehta/fournos-cluster:latest
imagePullPolicy: Always
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Production deployment must not reference a personal image namespace with a mutable latest tag.

quay.io/memehta/fournos-cluster:latest couples the deployment to a personal account: if access is revoked or the account is removed, all pod restarts will fail. Combined with imagePullPolicy: Always, every restart silently pulls whatever image was last pushed, making rollbacks and audit trails unreliable.

Move the image to an org-level registry path (e.g., quay.io/openshift-psap/fournos-cluster) and pin to a specific tag or digest.

🔧 Suggested change
-          image: quay.io/memehta/fournos-cluster:latest
-          imagePullPolicy: Always
+          image: quay.io/openshift-psap/fournos-cluster:<tag-or-digest>
+          imagePullPolicy: IfNotPresent
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@fournos-cluster/deploy/deployment.yaml` around lines 38 - 39, Replace the
personal mutable image reference and Always pull policy: update the image field
value that currently reads "quay.io/memehta/fournos-cluster:latest" to an
organization-owned repository and a pinned identifier (for example
"quay.io/openshift-psap/fournos-cluster:vX.Y.Z" or a digest
"quay.io/openshift-psap/fournos-cluster@sha256:...") and keep or change
imagePullPolicy only as needed for your release workflow (e.g., remove
imagePullPolicy: Always or set to IfNotPresent for pinned tags); ensure you
modify the deployment.yaml image entry and any related CI/CD references that
push tags so pods no longer rely on a personal namespace or the mutable "latest"
tag.

Comment on lines +62 to +69
securityContext:
allowPrivilegeEscalation: false
capabilities:
drop:
- ALL
runAsNonRoot: true
seccompProfile:
type: RuntimeDefault
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Two Trivy-flagged security hardening gaps in the container security context.

  1. KSV-0014readOnlyRootFilesystem: true is missing. If the Kopf process doesn't write to the filesystem at runtime, this is a straightforward addition; otherwise pair it with an emptyDir volume mounted at any writable path the process needs (e.g., /tmp).
  2. KSV-0118 — No pod-level securityContext is set. While runAsNonRoot: true exists at the container level, some admission controllers (PSA restricted profile) also require it at the pod level. Adding a minimal pod-level context closes this gap.
🛡️ Proposed fix
     spec:
       serviceAccountName: fournos-cluster
+      securityContext:
+        runAsNonRoot: true
+        seccompProfile:
+          type: RuntimeDefault
       containers:
         - name: fournos-cluster
           ...
           securityContext:
             allowPrivilegeEscalation: false
             capabilities:
               drop:
                 - ALL
             runAsNonRoot: true
+            readOnlyRootFilesystem: true
             seccompProfile:
               type: RuntimeDefault
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@fournos-cluster/deploy/deployment.yaml` around lines 62 - 69, Add
container-level readOnlyRootFilesystem: true to the existing
container.securityContext (or if the Kopf process needs writable space, mount an
emptyDir volume at a writable path such as /tmp and limit writes to that mount)
and add a minimal pod-level securityContext with runAsNonRoot: true (and
optionally fsGroup if needed) to the Pod spec to satisfy PSA restricted checks;
update the container spec referenced by securityContext,
allowPrivilegeEscalation, capabilities, runAsNonRoot and seccompProfile and add
a top-level pod securityContext entry in the same Deployment manifest.

Comment on lines +14 to +15
FOURNOSJOB_PLURAL = "fournosjobs"
FOURNOS_CLUSTER_PLURAL = "fournoscluster"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# 1. Find the CRD manifest and inspect its names (plural/singular/kind).
fd -t f -e yaml -e yml | xargs rg -lP '^kind:\s*CustomResourceDefinition' 2>/dev/null
fd -t f -e yaml -e yml | xargs rg -nP '^kind:\s*CustomResourceDefinition' 2>/dev/null \
  | cut -d: -f1 | sort -u \
  | xargs -I{} sh -c 'echo "=== {} ==="; rg -nP "names:|plural:|singular:|kind:|listKind:" "{}"'

# 2. Confirm every callsite of the constant.
rg -nP --type=py -C2 '\bFOURNOS_CLUSTER_PLURAL\b'

# 3. Cross-check RBAC and any other YAML referencing the resource name.
rg -nP -C1 '\bfournoscluster(s)?\b' -g '*.yaml' -g '*.yml'

Repository: openshift-psap/fournos

Length of output: 4848


CRD manifest has plural: fournoscluster (singular) — the constant correctly mirrors this but both violate Kubernetes naming conventions.

The constant FOURNOS_CLUSTER_PLURAL = "fournoscluster" is not actually wrong; it correctly reflects what the CRD manifest defines at fournos-cluster/manifests/crd.yaml line 10 (plural: fournoscluster). However, the CRD itself has the bug — Kubernetes conventions require spec.names.plural to be the plural form. Compare FournosJob, which correctly uses plural: fournosjobs (plural).

Since the constant, CRD, and RBAC are all internally consistent with the singular form, the system currently functions without runtime errors. However, the naming violates Kubernetes conventions and must be fixed coherently:

  1. Update CRD spec.names.plural from fournoscluster to fournosclusters
  2. Update constant to FOURNOS_CLUSTER_PLURAL = "fournosclusters"
  3. Update RBAC resources from ["fournoscluster", "fournoscluster/status", "fournoscluster/finalizers"] to use fournosclusters

Changing only the constant will break CustomObjectsApi calls and CRD matching.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@fournos-cluster/fournos_cluster/constants.py` around lines 14 - 15, The CRD
uses a singular plural name which violates Kubernetes conventions; update the
CRD's spec.names.plural value from "fournoscluster" to "fournosclusters", change
the constant FOURNOS_CLUSTER_PLURAL in constants.py to "fournosclusters", and
update any RBAC resource entries that reference "fournoscluster",
"fournoscluster/status", or "fournoscluster/finalizers" to use the plural form
"fournosclusters", "fournosclusters/status", and "fournosclusters/finalizers" so
the CRD, constants, and RBAC remain consistent and CustomObjectsApi calls
continue to work.

Comment on lines +110 to +186
def update_flavor_quotas(
self,
flavor_name: str,
gpu_resources: list[tuple[str, int]],
) -> dict | None:
cq_name = settings.kueue_cluster_queue_name
try:
cq = self._k8s.get_cluster_custom_object(
group=KUEUE_GROUP,
version=KUEUE_VERSION,
plural=KUEUE_CLUSTER_QUEUE_PLURAL,
name=cq_name,
)
except client.exceptions.ApiException as exc:
if exc.status == 404:
logger.warning("ClusterQueue %s not found, cannot update quotas", cq_name)
return None
raise

resource_groups = cq.get("spec", {}).get("resourceGroups", [])
if not resource_groups:
logger.warning("ClusterQueue %s has no resourceGroups", cq_name)
return None

rg = resource_groups[0]
covered = set(rg.get("coveredResources", []))
flavors = rg.get("flavors", [])

target_flavor = None
for f in flavors:
if f["name"] == flavor_name:
target_flavor = f
break

if target_flavor is None:
logger.warning(
"Flavor %s not found in ClusterQueue %s, cannot update quotas",
flavor_name,
cq_name,
)
return None

new_resources: list[dict] = []
for short_name, count in gpu_resources:
resource_name = self._gpu_resource_name(short_name)
covered.add(resource_name)
new_resources.append({"name": resource_name, "nominalQuota": count})

existing_slot = next(
(
r
for r in target_flavor.get("resources", [])
if r["name"] == CLUSTER_SLOT_RESOURCE
),
None,
)
slot_quota = existing_slot["nominalQuota"] if existing_slot else MAX_CLUSTER_SLOTS
new_resources.append({"name": CLUSTER_SLOT_RESOURCE, "nominalQuota": slot_quota})
covered.add(CLUSTER_SLOT_RESOURCE)

rg["coveredResources"] = sorted(covered)
target_flavor["resources"] = new_resources

result = self._k8s.patch_cluster_custom_object(
group=KUEUE_GROUP,
version=KUEUE_VERSION,
plural=KUEUE_CLUSTER_QUEUE_PLURAL,
name=cq_name,
body={"spec": {"resourceGroups": resource_groups}},
)
logger.info(
"Updated ClusterQueue %s flavor %s quotas: %s",
cq_name,
flavor_name,
gpu_resources,
)
return result
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🌐 Web query:

Kueue v1beta2 ClusterQueue resourceGroup require each flavor declare all coveredResources

💡 Result:

Yes. In Kueue v1beta2, within each ClusterQueue.spec.resourceGroups[], every flavor in spec.resourceGroups[].flavors must declare quotas for all resources listed in that resource group’s coveredResources (and in the same order). This is required by the API’s ResourceGroup design and validations: a resource group defines coveredResources and the flavors that “provide quotas for these resources”, and “Each flavor MUST list all the resources listed for this group in the same order as the.resources field” [1]. The Cluster Queue docs and examples show flavors declaring resources matching coveredResources within the same resourceGroup [2][3]. Note: by default, administrators must specify all resources required by Pods in the ClusterQueue’s resourceGroups; Kueue also supports excluding some resource types via cluster-level configuration (resource prefixes) [4]. Practical implication: when you set coveredResources: ["cpu","memory","foo.com/gpu"], each flavor under that resourceGroup must have resources entries for cpu, memory, and foo.com/gpu (not a subset) [2][3].

Citations:


update_flavor_quotas violates ClusterQueue resourceGroup invariant for sibling flavors.

When a flavor discovers a new GPU type, lines 152–170 add the new resource to coveredResources and rebuild only the target flavor's resources. Every other flavor in the same resourceGroups[0] is left without an entry for the newly-added resource. Kueue requires each flavor in a group to declare quotas for all coveredResources in the same order. The PATCH will fail or mark the ClusterQueue inactive the first time cluster-B discovers a GPU type that cluster-A doesn't have.

Backfill nominalQuota: 0 entries on the sibling flavors before patching:

♻️ Sketch
     rg["coveredResources"] = sorted(covered)
     target_flavor["resources"] = new_resources
+
+    # Ensure every sibling flavor declares all coveredResources, in order.
+    ordered = rg["coveredResources"]
+    for f in flavors:
+        existing = {r["name"]: r for r in f.get("resources", [])}
+        f["resources"] = [
+            existing.get(rname, {"name": rname, "nominalQuota": 0})
+            for rname in ordered
+        ]
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@fournos-cluster/fournos_cluster/core/kueue.py` around lines 110 - 186,
update_flavor_quotas currently adds new GPU resource names to
rg["coveredResources"] and only rebuilds target_flavor["resources"], leaving
sibling flavors missing entries and violating Kueue's invariant; before
patching, iterate over rg["flavors"] (use the existing flavors list and identify
target_flavor) and for each sibling flavor build a resources list aligned with
the final coveredResources ordering (use the same sorted(covered) used for
rg["coveredResources"]) adding each resource with its existing nominalQuota if
present or nominalQuota: 0 otherwise (preserve existing CLUSTER_SLOT_RESOURCE
quota from each flavor when present, or use slot_quota fallback for target if
needed), then assign those lists back into each flavor so every flavor has
entries for every covered resource before calling patch_cluster_custom_object.

Comment thread README.md
Comment on lines +504 to +507
next cycle. Check:
- Is the target cluster up? Try `oc --kubeconfig=<path> cluster-info`
- Is the kubeconfig secret valid? `oc get secret <name> -n psap-secrets -o yaml`
- Check operator logs: `oc logs deployment/fournos -n psap-automation | grep <cluster-name>`
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Confirm the actual deployment name shipped in this PR.
fd -t f deployment.yaml fournos-cluster | xargs rg -nP 'kind:\s*Deployment|name:'

Repository: openshift-psap/fournos

Length of output: 394


🏁 Script executed:

# Check the actual README.md content at the specified lines
head -n 507 README.md | tail -n 10

Repository: openshift-psap/fournos

Length of output: 621


🏁 Script executed:

# Search for all deployment configurations to understand the landscape
fd -t f deployment.yaml | xargs cat | rg -A 5 'kind: Deployment' | head -50

Repository: openshift-psap/fournos

Length of output: 245


🏁 Script executed:

# Search the codebase for references to both "fournos" and "fournos-cluster" deployments
rg "deployment/fournos" --type md

Repository: openshift-psap/fournos

Length of output: 169


🏁 Script executed:

# Get the full fournos-cluster deployment configuration to understand its purpose
fd -t f deployment.yaml | xargs cat | rg -A 30 'kind: Deployment' | rg -A 30 'name: fournos-cluster'

Repository: openshift-psap/fournos

Length of output: 1027


🏁 Script executed:

# Search for comments or documentation about what fournos-cluster does
rg "fournos-cluster|fournos" --type yaml -B 2 -A 5 | head -100

Repository: openshift-psap/fournos

Length of output: 5410


🏁 Script executed:

# Check if there are environment variables or configuration that explains the deployment roles
rg "FOURNOS_CLUSTER" --type yaml -B 3 -A 3

Repository: openshift-psap/fournos

Length of output: 702


🏁 Script executed:

# Check the fournos main deployment configuration to understand its purpose
fd -t f deployment.yaml | xargs cat | rg -A 30 'kind: Deployment' | rg -A 30 'name: fournos[^-]'

Repository: openshift-psap/fournos

Length of output: 48


🏁 Script executed:

# Look for any documentation or code that explains the architectural separation
rg "cluster.*deployment|fournos-cluster.*controller" --type md -B 2 -A 2

Repository: openshift-psap/fournos

Length of output: 48


🏁 Script executed:

# Check the main source code or controller logic to see which handles PSAPCluster
fd -t go -type f | xargs rg "PSAPCluster|cluster-info" -B 2 -A 2 | head -50

Repository: openshift-psap/fournos

Length of output: 2444


🏁 Script executed:

# Search for any documentation or comments explaining the purpose of each deployment
rg "fournos deployment|operator logs" --type md -B 3 -A 3

Repository: openshift-psap/fournos

Length of output: 1440


🏁 Script executed:

# Check if there are any references to which deployment handles which functionality
rg "controller.*fournos|fournos.*controller" --type md -B 2 -A 2

Repository: openshift-psap/fournos

Length of output: 48


Update troubleshooting guide to reference the correct deployment for PSAPCluster logs.

The PSAPCluster troubleshooting section currently directs users to check logs in the fournos deployment, but cluster-related issues (kubeconfig/GPU discovery) are handled by the fournos-cluster deployment. Users following this advice will not find relevant error messages.

📝 Fix
- Check operator logs: `oc logs deployment/fournos -n psap-automation | grep <cluster-name>`
+ Check operator logs: `oc logs deployment/fournos-cluster -n psap-automation | grep <cluster-name>`
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@README.md` around lines 504 - 507, Update the PSAPCluster troubleshooting
instructions so logs are checked from the cluster-specific operator deployment:
replace references to the `fournos` deployment with `fournos-cluster` and update
the example command to `oc logs deployment/fournos-cluster -n psap-automation |
grep <cluster-name>`; ensure the PSAPCluster section text and the three-step
checklist mention `fournos-cluster` when directing users to check kubeconfig/GPU
discovery and cluster-related errors.

Comment thread README.md
Comment on lines +547 to +552
|----------|---------|-------------|
| `FOURNOS_PSAPCLUSTER_TIMER_INTERVAL_SEC` | `30` | Reconciliation timer interval |
| `FOURNOS_GPU_DISCOVERY_DEFAULT_INTERVAL_SEC` | `300` | Default GPU discovery interval |
| `FOURNOS_GPU_DISCOVERY_TIMEOUT_SEC` | `10` | Timeout for connecting to target clusters |
| `FOURNOS_CLUSTER_DISCOVERY_INTERVAL_SEC` | `60` | Interval for auto-discovery scan of kubeconfig secrets |

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Confirm the env var names by listing the settings fields and prefix.
rg -nP --type=py -C2 'env_prefix|class Settings\(BaseSettings\)' \
  fournos-cluster/fournos_cluster/settings.py
# Look for any cluster_discovery_interval_sec setting elsewhere.
rg -nP --type=py 'cluster_discovery_interval' || echo "not found"

Repository: openshift-psap/fournos

Length of output: 266


🏁 Script executed:

cat -n fournos-cluster/fournos_cluster/settings.py

Repository: openshift-psap/fournos

Length of output: 1408


Env-var names in this table do not match the actual settings.

fournos-cluster/fournos_cluster/settings.py uses model_config = {"env_prefix": "FOURNOS_CLUSTER_"}, so pydantic-settings derives env var names by concatenating the prefix with the upper-cased field name:

README documents Actual env var
FOURNOS_PSAPCLUSTER_TIMER_INTERVAL_SEC FOURNOS_CLUSTER_RECONCILE_INTERVAL_SEC
FOURNOS_GPU_DISCOVERY_DEFAULT_INTERVAL_SEC FOURNOS_CLUSTER_GPU_DISCOVERY_DEFAULT_INTERVAL_SEC
FOURNOS_GPU_DISCOVERY_TIMEOUT_SEC FOURNOS_CLUSTER_GPU_DISCOVERY_TIMEOUT_SEC
FOURNOS_CLUSTER_DISCOVERY_INTERVAL_SEC Not defined in settings

Users following the README will export incorrect env vars and the controller will silently use defaults instead. Update the table to match the actual Settings fields.

Suggested fix
-| `FOURNOS_PSAPCLUSTER_TIMER_INTERVAL_SEC` | `30` | Reconciliation timer interval |
-| `FOURNOS_GPU_DISCOVERY_DEFAULT_INTERVAL_SEC` | `300` | Default GPU discovery interval |
-| `FOURNOS_GPU_DISCOVERY_TIMEOUT_SEC` | `10` | Timeout for connecting to target clusters |
-| `FOURNOS_CLUSTER_DISCOVERY_INTERVAL_SEC` | `60` | Interval for auto-discovery scan of kubeconfig secrets |
+| `FOURNOS_CLUSTER_RECONCILE_INTERVAL_SEC` | `30` | Reconciliation timer interval |
+| `FOURNOS_CLUSTER_GPU_DISCOVERY_DEFAULT_INTERVAL_SEC` | `300` | Default GPU discovery interval |
+| `FOURNOS_CLUSTER_GPU_DISCOVERY_TIMEOUT_SEC` | `10` | Timeout for connecting to target clusters |

(Remove the fourth row—no env var controls cluster discovery interval)

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
|----------|---------|-------------|
| `FOURNOS_PSAPCLUSTER_TIMER_INTERVAL_SEC` | `30` | Reconciliation timer interval |
| `FOURNOS_GPU_DISCOVERY_DEFAULT_INTERVAL_SEC` | `300` | Default GPU discovery interval |
| `FOURNOS_GPU_DISCOVERY_TIMEOUT_SEC` | `10` | Timeout for connecting to target clusters |
| `FOURNOS_CLUSTER_DISCOVERY_INTERVAL_SEC` | `60` | Interval for auto-discovery scan of kubeconfig secrets |
|----------|---------|-------------|
| `FOURNOS_CLUSTER_RECONCILE_INTERVAL_SEC` | `30` | Reconciliation timer interval |
| `FOURNOS_CLUSTER_GPU_DISCOVERY_DEFAULT_INTERVAL_SEC` | `300` | Default GPU discovery interval |
| `FOURNOS_CLUSTER_GPU_DISCOVERY_TIMEOUT_SEC` | `10` | Timeout for connecting to target clusters |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@README.md` around lines 547 - 552, The README table lists incorrect env var
names; pydantic-settings in fourn o s_cluster/settings.py sets model_config =
{"env_prefix": "FOURNOS_CLUSTER_"} so environment variables are derived by
prefixing field names—update the table to use the actual env vars
FOURNOS_CLUSTER_RECONCILE_INTERVAL_SEC,
FOURNOS_CLUSTER_GPU_DISCOVERY_DEFAULT_INTERVAL_SEC, and
FOURNOS_CLUSTER_GPU_DISCOVERY_TIMEOUT_SEC (matching the Settings field names),
and remove the fourth row (no env var controls cluster discovery interval);
reference model_config/env_prefix and the Settings field names to locate the
authoritative names.

string_data is never populated when reading secrets from the K8s API;
it only exists for creating secrets. The branch was unreachable.
@openshift-ci openshift-ci Bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 9, 2026
Comment thread config/forge/samples/job-full.yaml Outdated
Comment on lines +44 to +55
result = self._k8s.create_cluster_custom_object(
group=KUEUE_GROUP,
version=KUEUE_VERSION,
plural=KUEUE_RESOURCE_FLAVOR_PLURAL,
body=body,
)
logger.info("Created ResourceFlavor %s", flavor_name)
return result
except client.exceptions.ApiException as exc:
if exc.status == 409:
logger.debug("ResourceFlavor %s already exists", flavor_name)
return None
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't we update the resource when the status is 409?

if exc.status == 404:
logger.warning("ClusterQueue %s not found", cq_name)
return None
raise
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure when this is supposed to happen 🤔
shouldn't we create the ClusterQueue if it doesn't exist?

The Python K8s client ignores configuration.request_timeout — the
timeout must be passed directly to the API call via _request_timeout.
Without this, GPU discovery hangs indefinitely on unreachable clusters.

Also removes the stale test_stringdata_kubeconfig test that covered the
previously removed string_data branch.
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@fournos-cluster/tests/unit/test_gpu_discovery.py`:
- Around line 310-325: The test test_sets_request_timeout is asserting that
mock_api_client.configuration.request_timeout is not None, but discover_gpus
sets the timeout by passing _request_timeout directly to CoreV1Api.list_node, so
update the test to assert that CoreV1Api.return_value.list_node was called with
a timeout equal to the expected _request_timeout; specifically, in
test_sets_request_timeout arrange a known timeout value, call
discovery.discover_gpus(..., _request_timeout=EXPECTED), and replace the
existing assertion with one that checks
mock_k8s_client.CoreV1Api.return_value.list_node.assert_called_with(...,
timeout=EXPECTED) (or inspect call_args to verify the timeout argument) so the
test verifies the actual behavior of discover_gpus and will catch regressions.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: bcbcfc42-80c2-4044-951a-7fec02709b29

📥 Commits

Reviewing files that changed from the base of the PR and between 89627f7 and e8f04a9.

📒 Files selected for processing (2)
  • fournos-cluster/fournos_cluster/core/gpu_discovery.py
  • fournos-cluster/tests/unit/test_gpu_discovery.py

Comment on lines +310 to +325
def test_sets_request_timeout(
self, mock_k8s_client: MagicMock, mock_k8s_config: MagicMock
) -> None:
discovery, mock_core = self._make_client()
self._mock_kubeconfig_secret(mock_core)

nodes = MagicMock()
nodes.items = []
mock_api_client = MagicMock()
mock_k8s_config.new_client_from_config_dict.return_value = mock_api_client
mock_k8s_client.CoreV1Api.return_value.list_node.return_value = nodes

discovery.discover_gpus("cluster-1", "kubeconfig-cluster-1", "psap-secrets")

assert mock_api_client.configuration.request_timeout is not None

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Test doesn't validate the timeout fix.

The test asserts mock_api_client.configuration.request_timeout is not None, but the implementation (lines 123-128 in gpu_discovery.py) never sets api_client.configuration.request_timeout. The actual fix passes _request_timeout directly to the list_node() call. This test provides false confidence and won't catch regressions if the _request_timeout parameter is accidentally removed.

✅ Proposed fix to validate the actual timeout behavior
     `@patch`("fournos_cluster.core.gpu_discovery.k8s_config")
     `@patch`("fournos_cluster.core.gpu_discovery.client")
     def test_sets_request_timeout(
         self, mock_k8s_client: MagicMock, mock_k8s_config: MagicMock
     ) -> None:
+        from fournos_cluster.settings import settings
         discovery, mock_core = self._make_client()
         self._mock_kubeconfig_secret(mock_core)
 
         nodes = MagicMock()
         nodes.items = []
         mock_api_client = MagicMock()
         mock_k8s_config.new_client_from_config_dict.return_value = mock_api_client
         mock_k8s_client.CoreV1Api.return_value.list_node.return_value = nodes
 
         discovery.discover_gpus("cluster-1", "kubeconfig-cluster-1", "psap-secrets")
 
-        assert mock_api_client.configuration.request_timeout is not None
+        expected_timeout = (settings.gpu_discovery_timeout_sec, settings.gpu_discovery_timeout_sec)
+        mock_k8s_client.CoreV1Api.return_value.list_node.assert_called_once_with(_request_timeout=expected_timeout)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def test_sets_request_timeout(
self, mock_k8s_client: MagicMock, mock_k8s_config: MagicMock
) -> None:
discovery, mock_core = self._make_client()
self._mock_kubeconfig_secret(mock_core)
nodes = MagicMock()
nodes.items = []
mock_api_client = MagicMock()
mock_k8s_config.new_client_from_config_dict.return_value = mock_api_client
mock_k8s_client.CoreV1Api.return_value.list_node.return_value = nodes
discovery.discover_gpus("cluster-1", "kubeconfig-cluster-1", "psap-secrets")
assert mock_api_client.configuration.request_timeout is not None
def test_sets_request_timeout(
self, mock_k8s_client: MagicMock, mock_k8s_config: MagicMock
) -> None:
from fournos_cluster.settings import settings
discovery, mock_core = self._make_client()
self._mock_kubeconfig_secret(mock_core)
nodes = MagicMock()
nodes.items = []
mock_api_client = MagicMock()
mock_k8s_config.new_client_from_config_dict.return_value = mock_api_client
mock_k8s_client.CoreV1Api.return_value.list_node.return_value = nodes
discovery.discover_gpus("cluster-1", "kubeconfig-cluster-1", "psap-secrets")
expected_timeout = (settings.gpu_discovery_timeout_sec, settings.gpu_discovery_timeout_sec)
mock_k8s_client.CoreV1Api.return_value.list_node.assert_called_once_with(_request_timeout=expected_timeout)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@fournos-cluster/tests/unit/test_gpu_discovery.py` around lines 310 - 325, The
test test_sets_request_timeout is asserting that
mock_api_client.configuration.request_timeout is not None, but discover_gpus
sets the timeout by passing _request_timeout directly to CoreV1Api.list_node, so
update the test to assert that CoreV1Api.return_value.list_node was called with
a timeout equal to the expected _request_timeout; specifically, in
test_sets_request_timeout arrange a known timeout value, call
discovery.discover_gpus(..., _request_timeout=EXPECTED), and replace the
existing assertion with one that checks
mock_k8s_client.CoreV1Api.return_value.list_node.assert_called_with(...,
timeout=EXPECTED) (or inspect call_args to verify the timeout argument) so the
test verifies the actual behavior of discover_gpus and will catch regressions.

@openshift-ci openshift-ci Bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 14, 2026
@openshift-ci
Copy link
Copy Markdown

openshift-ci Bot commented May 14, 2026

PR needs rebase.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants