Skip to content

Distributed ux polish#7

Open
DhirenMhatre wants to merge 8 commits into
masterfrom
distributed-ux-polish
Open

Distributed ux polish#7
DhirenMhatre wants to merge 8 commits into
masterfrom
distributed-ux-polish

Conversation

@DhirenMhatre

Copy link
Copy Markdown

Description

This PR fixes #

Notes for Reviewers

Signed commits

  • Yes, I signed my commits.

mudler added 8 commits April 19, 2026 08:03
Before this change `DistributedBackendManager.CheckUpgrades` delegated to the
local manager, which read backends from the frontend filesystem. In
distributed deployments the frontend has no backends installed locally —
they live on workers — so the upgrade-detection loop never ran and the UI
silently never surfaced upgrades even when the gallery advertised newer
versions or digests.

Worker-side: NATS backend.list reply now carries Version, URI and Digest
for each installed backend (read from metadata.json).

Frontend-side: DistributedBackendManager.ListBackends aggregates per-node
refs (name, status, version, digest) instead of deduping, and CheckUpgrades
feeds that aggregation into gallery.CheckUpgradesAgainst — a new entrypoint
factored out of CheckBackendUpgrades so both paths share the same core
logic.

Cluster drift policy: when per-node version/digest tuples disagree, the
backend is flagged upgradeable regardless of whether any single node
matches the gallery, and UpgradeInfo.NodeDrift enumerates the outliers so
operators can see *why* it is out of sync. The next upgrade-all realigns
the cluster.

Tests cover: drift detection, unanimous-match (no upgrade), and the
empty-installed-version path that the old distributed code silently
missed.
The System page (Manage.jsx) only showed updates as a tiny inline arrow,
so operators routinely missed them. Port the Backend Gallery's upgrade UX
so System speaks the same visual language:

- Yellow banner at the top of the Backends tab when upgrades are pending,
  with an "Upgrade all" button (serial fan-out, matches the gallery) and a
  "Updates only" filter toggle.
- Warning pill (↑ N) next to the tab label so the count is glanceable even
  when the banner is scrolled out of view.
- Per-row labeled "Upgrade to vX.Y" button (replaces the icon-only button
  that silently flipped semantics between Reinstall and Upgrade), plus an
  "Update available" badge in the new Version column.
- New columns: Version (with upgrade + drift chips), Nodes (per-node
  attribution badges for distributed mode, degrading to a compact
  "on N nodes · M offline" chip above three nodes), Installed (relative
  time).
- System backends render a "Protected" chip instead of a bare "—" so rows
  still align and the reason is obvious.
- Delete uses the softer btn-danger-ghost so rows don't scream red; the
  ConfirmDialog still owns the "are you sure".

The upgrade checker also needed the same per-worker fix as the previous
commit: NewUpgradeChecker now takes a BackendManager getter so its
periodic runs call the distributed CheckUpgrades (which asks workers)
instead of the empty frontend filesystem. Without this the /api/backends/
upgrades endpoint stayed empty in distributed mode even with the protocol
change in place.

New CSS primitives — .upgrade-banner, .tab-pill, .badge-row, .cell-stack,
.cell-mono, .cell-muted, .row-actions, .btn-danger-ghost — all live in
App.css so other pages can adopt them without duplicating styles.
The Nodes page was the biggest visual liability in distributed mode.
Rework the main dashboard surfaces in place without changing behavior:

StatCards: uniform height (96px min), left accent bar colored by the
metric's semantic (success/warning/error/primary), icon lives in a
36x36 soft-tinted chip top-right, value is left-aligned and large.
Grid auto-fills so the row doesn't collapse on narrow viewports. This
replaces the previous thin-bordered boxes with inconsistent heights.

Table rows: expandable rows now show a chevron cue on the left (rotates
on expand) so users know rows open. Status cell became a dedicated chip
with an LED-style halo dot instead of a bare bullet. Action buttons gained
labels — "Approve", "Resume", "Drain" — so the icons aren't doing all
the semantic work; the destructive remove action uses the softer
btn-danger-ghost variant so rows don't scream red, with the ConfirmDialog
still owning the real "are you sure". Applied cell-mono/cell-muted
utility classes so label chips and addresses share one spacing/font
grammar instead of re-declaring inline styles everywhere.

Expanded drawer: empty states for Loaded Models and Installed Backends
now render as a proper drawer-empty card (dashed border, icon, one-line
hint) instead of a plain muted string that read like broken formatting.

Tabs: three inline-styled buttons became the shared .tab class so they
inherit focus ring, hover state, and the rest of the design system —
matches the System page.

"Add more workers" toggle turned into a .nodes-add-worker dashed-border
button labelled "Register a new worker" (action voice) instead of a
chevron + muted link that operators kept mistaking for broken text.

New shared CSS primitives carry over to other pages:
.stat-grid + .stat-card, .row-chevron, .node-status, .drawer-empty,
.nodes-add-worker.
Two connected problems handled together:

1) Backend delete/install/upgrade used to silently skip non-healthy nodes,
   so a delete during an outage left a zombie on the offline node once it
   returned. The fan-out now records intent in a new pending_backend_ops
   table before attempting the NATS round-trip. Currently-healthy nodes
   get an immediate attempt; everyone else is queued. Unique index on
   (node_id, backend, op) means reissuing the same operation refreshes
   next_retry_at instead of stacking duplicates.

2) Loaded-model state could drift from reality: a worker OOM'd, got
   killed, or restarted a backend process would leave a node_models row
   claiming the model was still loaded, feeding ghost entries into the
   /api/nodes/models listing and the router's scheduling decisions.

The existing ReplicaReconciler gains two new passes that run under a
fresh KeyStateReconciler advisory lock (non-blocking, so one wedged
frontend doesn't freeze the cluster):

  - drainPendingBackendOps: retries queued ops whose next_retry_at has
    passed on currently-healthy nodes. Success deletes the row; failure
    bumps attempts and pushes next_retry_at out with exponential backoff
    (30s → 15m cap). ErrNoResponders also marks the node unhealthy.

  - probeLoadedModels: gRPC-HealthChecks addresses the DB thinks are
    loaded but hasn't seen touched in the last probeStaleAfter (2m).
    Unreachable addresses are removed from the registry. A pluggable
    ModelProber lets tests substitute a fake without standing up gRPC.

DistributedBackendManager exposes DeleteBackendDetailed so the HTTP
handler can surface per-node outcomes ("2 succeeded, 1 queued") to the
UI in a follow-up commit; the existing DeleteBackend still returns
error-only for callers that don't care about node breakdown.

Multi-frontend safety: the state pass uses advisorylock.TryWithLockCtx
on a new key so N frontends coordinate — the same pattern the health
monitor and replica reconciler already rely on. Single-node mode runs
both passes inline (adapter is nil, state drain is a no-op).

Tests cover the upsert semantics, backoff math, the probe removing an
unreachable model but keeping a reachable one, and filtering by
probeStaleAfter.
When a frontend restarted in distributed mode, models that workers had
already loaded weren't visible until the operator clicked into each node
manually — the /api/models/capabilities endpoint only knew about
configs on the frontend's filesystem, not the registry-backed truth.

/api/models/capabilities now joins in ListAllLoadedModels() when the
registry is active, returning loaded_on[] with node id/name/state/status
for each model. Models that live in the registry but lack a local config
(the actual ghosts, not recovered from the frontend's file cache) still
surface with source="registry-only" so operators can see and persist
them; without that emission they'd be invisible to this frontend.

Manage → Models replaces the old Running/Idle pill with a distribution
cell that lists the first three nodes the model is loaded on as chips
colored by state (green loaded, blue loading, amber anything else). On
wider clusters the remaining count collapses into a +N chip with a
title-attribute breakdown. Disabled / single-node behavior unchanged.

Adopted models get an extra "Adopted" ghost-icon chip with hover copy
explaining what it means and how to make it permanent.

Distributed mode also enables a 10s auto-refresh and a "Last synced Xs
ago" indicator next to the Update button so ghost rows drop off within
one reconcile tick after their owning process dies. Non-distributed
mode is untouched — no polling, no cell-stack, same old Running/Idle.
Large clusters were going to break the Manage → Backends Nodes column:
the old inline logic rendered every node as a badge and would shred the
layout at >10 workers, plus the Manage → Models distribution cell had
copy-pasted its own slightly-different version.

NodeDistributionChip handles any cluster size with two render modes:
  - small (≤3 nodes): inline chips of node names, colored by health.
  - large: a single "on N nodes · M offline · K drift" summary chip;
    clicking opens a Popover with a per-node table (name, status,
    version, digest for backends; name, status, state for models).

Drift counting mirrors the backend's summarizeNodeDrift so the UI
number matches UpgradeInfo.NodeDrift. Digests are truncated to the
docker-style 12-char form with the full value preserved in the title.

Popover is a new general-purpose primitive: fixed positioning anchored
to the trigger, flips above when there's no room below, closes on
outside-click or Escape, returns focus to the trigger. Uses .card as
its surface so theming is inherited. Also useful for a future
labels-editor popup and the user menu.

Manage.jsx drops its duplicated inline Nodes-column + loaded_on cell
and uses the shared chip with context="backends" / "models"
respectively. Delete code removes ~40 lines of ad-hoc logic.
The Backends gallery had a nice search + chip + toggle strip; the System
page had nothing, so the two surfaces felt like different apps. Lift the
pattern into a reusable FilterBar and wire both System tabs through it.

New component core/http/react-ui/src/components/FilterBar.jsx renders a
search input, a role="tablist" chip row (aria-selected for a11y), and
optional toggles / right slot. Chips support an optional `count` which
the System page uses to show "User 3", "Updates 1" etc.

System Models tab: search by id or backend; chips for
All/Running/Idle/Disabled/Pinned plus a conditional Distributed chip in
distributed mode. "Last synced" + Update button live in the right slot.

System Backends tab: search by name/alias/meta-backend-for; chips for
All/User/System/Meta plus conditional Updates / Offline-nodes chips
when relevant. The old ad-hoc "Updates only" toggle from the upgrade
banner folded into the Updates chip — one source of truth for that
filter. Offline chip only appears in distributed mode when at least
one backend has an unhealthy node, so the chip row stays quiet on
healthy clusters.

Filter state persists in URL query params (mq/mf/bq/bf) so deep links
and tab switches keep the operator's filter context instead of
resetting every time.

Also adds an "Adopted" distribution path: when a model in
/api/models/capabilities carries source="registry-only" (discovered on
a worker but not configured locally), the Models tab shows a ghost chip
labelled "Adopted" with hover copy explaining how to persist it — this
is what closes the loop on the ghost-model story end-to-end.
pending_backend_ops rows targeting agent-type workers looped forever:
the reconciler fan-out hit a NATS subject the worker doesn't subscribe
to, returned ErrNoResponders, we marked the node unhealthy, and the
health monitor flipped it back to healthy on the next heartbeat. Next
tick, same row, same failure.

Three related fixes:

1. enqueueAndDrainBackendOp skips nodes whose NodeType != backend.
   Agent workers handle agent NATS subjects, not backend.install /
   delete / list, so enqueueing for them guarantees an infinite retry
   loop. Silent skip is correct — they aren't consumers of these ops.

2. Reconciler drain mirrors enqueueAndDrainBackendOp's behavior on
   nats.ErrNoResponders: mark the node unhealthy before recording the
   failure, so subsequent ListDuePendingBackendOps (filters by
   status=healthy) stops picking the row until the node actually
   recovers. Matches the synchronous fan-out path.

3. Dead-letter cap at maxPendingBackendOpAttempts (10). After ~1h of
   exponential backoff the row is a poison message; further retries
   just thrash NATS. Row is deleted and logged at ERROR so it stays
   visible without staying infinite.

Plus a one-shot startup cleanup in NewNodeRegistry: drop queue rows
that target agent-type nodes, non-existent nodes, or carry an empty
backend name. Guarded by the same schema-migration advisory lock so
only one instance performs it. The guards above prevent new rows of
this shape; this closes the migration gap for existing ones.

Tests: the prune migration (valid row stays, agent + empty-name rows
drop) on top of existing upsert / backoff coverage.
@codity-dm

codity-dm Bot commented May 20, 2026

Copy link
Copy Markdown

Policy Check Failed

✗ 3/3 policy checks failed:

• Need 2 more approval(s) (0/2) — comment LGTM or approve via review
• Missing ticket reference (expected: JIRA-, ENG-, #*)
• 20 code file(s) changed but no test files added


To merge this PR:

  1. Address the failed checks listed above
  2. Ensure branch protection requires the codity/policy-check status

Configure policies in your dashboard

@codity-dm

codity-dm Bot commented May 20, 2026

Copy link
Copy Markdown

PR Summary

What Changed

  • Fixed distributed-mode upgrade detection by routing checks through the active backend manager instead of empty local filesystem.
  • Added durable PendingBackendOp queue for backend lifecycle operations with automatic retry and exponential backoff.
  • Built new React UI components for cluster visibility: node distribution chips, drift detection badges, and persistent URL-based filtering.

Key Changes by Area

Distributed Backend Management

  • core/services/nodes/managers_distributed.go: Replaced fire-and-forget operations with enqueueAndDrainBackendOp using persistent queue.
  • core/services/nodes/reconciler.go: Added reconcileState with model health probing and stale probe detection.

Upgrade Detection

  • core/gallery/upgrade.go: New CheckUpgradesAgainst function using aggregated SystemBackends from workers.
  • core/application/upgrade_checker.go: Added backendManagerFn lazy getter for proper distributed routing.

React UI

  • core/http/react-ui/src/components/NodeDistributionChip.jsx: Visualizes model/backend distribution with inline chips or popover detail.
  • core/http/react-ui/src/pages/Manage.jsx: Added drift warnings, upgrade banners, URL-persisted filters, and 10s auto-refresh.

Data Model

  • core/gallery/backends.go: Extended SystemBackend with Nodes []NodeBackendRef for per-node attribution.
  • core/services/nodes/registry.go: New PendingBackendOp table with composite unique index and retry management.

Files Changed

File Changes Summary
core/application/distributed.go Added Adapter, RegistrationToken, ProbeStaleAfter to ReplicaReconcilerOptions
core/application/startup.go Passes lazy getter to NewUpgradeChecker for distributed mode
core/application/upgrade_checker.go Added backendManagerFn lazy getter for proper backend routing
core/cli/worker.go Worker reports Version, URI, Digest in lifecycle events
core/gallery/backends.go Added Nodes []NodeBackendRef and NodeDriftInfo types
core/gallery/upgrade.go New CheckUpgradesAgainst and summarizeNodeDrift functions
core/gallery/upgrade_test.go Added drift detection test suite
core/http/react-ui/src/App.css Table utilities, stat cards, popover system, status indicators
core/http/react-ui/src/components/FilterBar.jsx Shared search/filter/toggle control strip
core/http/react-ui/src/components/NodeDistributionChip.jsx Node distribution visualization with drift detection
core/http/react-ui/src/components/Popover.jsx Accessible anchored popover with focus management
core/http/react-ui/src/pages/Manage.jsx Distributed mode badges, filters, auto-refresh, upgrade banners
core/http/react-ui/src/pages/Nodes.jsx Stat cards, loaded models table, improved accessibility
core/http/routes/ui_api.go Added LoadedOn field and Source="registry-only" for ghost models
core/services/advisorylock/keys.go Added KeyStateReconciler constant
core/services/galleryop/service.go Extended SystemBackend metadata fields
core/services/messaging/subjects.go New messaging subjects for backend operations
core/services/nodes/managers_distributed.go enqueueAndDrainBackendOp with durable intent queue
core/services/nodes/reconciler.go reconcileState, drainPendingBackendOps, model health probing
core/services/nodes/reconciler_test.go fakeProber helper and probeLoadedModels tests
core/services/nodes/registry.go PendingBackendOp CRUD, exponential backoff, startup pruning

Review Focus Areas

  • Queue durability: Verify PendingBackendOp retry logic handles node restarts without duplicate operations.
  • Drift detection accuracy: Check summarizeNodeDrift handles mixed-version clusters correctly.
  • Reconciler lock splitting: Confirm non-blocking state lock prevents frontend stampede while allowing scaling operations.

Architecture

Design Decisions

  • Lazy getter pattern in upgrade_checker.go defers backend manager resolution until check time, avoiding circular dependencies during startup.
  • Composite unique index (node_id, backend, op) prevents duplicate intents but allows same operation on different nodes.
  • Split advisory locks: replica scaling blocks, state reconciliation tries non-blocking to prioritize user-facing operations over cleanup.

Scalability & Extensibility

  • Popover component uses fixed positioning with scroll handling for large node lists.
  • NodeDistributionChip switches from inline to summary mode at 3 nodes, tunable threshold for future density needs.

Risks

  • Intentional: Exponential backoff caps at 15 minutes, operations may lag behind actual cluster state during recovery.
  • Unintentional: probeStaleAfter default (2m) may mask genuinely slow model loads on resource-constrained nodes.

Comment on lines +642 to +645
if (showOnlyUpgradable && backendsFilter !== 'upgradable') {
// One-shot reconciliation — the old state becomes the new chip.
setBackendsFilter('upgradable')
setShowOnlyUpgradable(false)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Functional High

Calling state setters during render here can trigger React re-render loops, so move this reconciliation into a useEffect or derive the filter without mutating state in render.

Suggested fix
          useEffect(() => {
            if (showOnlyUpgradable && backendsFilter !== 'upgradable') {
              setBackendsFilter('upgradable')
              setShowOnlyUpgradable(false)
            }
          }, [showOnlyUpgradable, backendsFilter])
Prompt for AI assistance

Copy the prompt below and paste it into ChatGPT, Claude, or any LLM:

You are an expert javascript developer with deep knowledge of security, performance, and best practices.

### Context

File: core/http/react-ui/src/pages/Manage.jsx
Lines: 642-645
Issue Type: functional-high
Severity: high

Issue Description:
Calling state setters during render here can trigger React re-render loops, so move this reconciliation into a useEffect or derive the filter without mutating state in render.

Current Code:
          if (showOnlyUpgradable && backendsFilter !== 'upgradable') {
            // One-shot reconciliation — the old state becomes the new chip.
            setBackendsFilter('upgradable')
            setShowOnlyUpgradable(false)
          }

---

### Instructions

1. Fix the issue described above
2. Maintain the exact indentation and code style from the original
3. Follow javascript best practices and language-specific idioms
4. Ensure the fix addresses the root cause, not just the symptoms
5. Add brief inline comments explaining the fix if needed

### Constraints

- Do not change functionality beyond fixing the identified issue
- Preserve existing variable names and function signatures unless they are part of the problem
- Ensure the fix is production-ready

---


Like Dislike Create Issue Jira

Comment on lines +28 to +33
func (g grpcModelProber) IsAlive(ctx context.Context, address string) bool {
client := grpcclient.NewClientWithToken(address, false, nil, false, g.token)
probeCtx, cancel := context.WithTimeout(ctx, 1*time.Second)
defer cancel()
ok, _ := client.HealthCheck(probeCtx)
return ok

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Functional High

The liveness probe trusts the persisted model address without validation, so a compromised or stale node_models row can trigger server-side requests to arbitrary internal endpoints; restrict probes to the owning node's registered host or validate the address before dialing.

Suggested fix
func (g grpcModelProber) IsAlive(ctx context.Context, address string) bool {
	if !isAllowedProbeAddress(address) {
		return false
	}
	client := grpcclient.NewClientWithToken(address, false, nil, false, g.token)
	probeCtx, cancel := context.WithTimeout(ctx, 1*time.Second)
	defer cancel()
	ok, _ := client.HealthCheck(probeCtx)
	return ok
}
Prompt for AI assistance

Copy the prompt below and paste it into ChatGPT, Claude, or any LLM:

You are an expert go developer with deep knowledge of security, performance, and best practices.

### Context

File: core/services/nodes/reconciler.go
Lines: 28-33
Issue Type: functional-high
Severity: high

Issue Description:
The liveness probe trusts the persisted model address without validation, so a compromised or stale node_models row can trigger server-side requests to arbitrary internal endpoints; restrict probes to the owning node's registered host or validate the address before dialing.

Current Code:
func (g grpcModelProber) IsAlive(ctx context.Context, address string) bool {
	client := grpcclient.NewClientWithToken(address, false, nil, false, g.token)
	probeCtx, cancel := context.WithTimeout(ctx, 1*time.Second)
	defer cancel()
	ok, _ := client.HealthCheck(probeCtx)
	return ok
}

---

### Instructions

1. Fix the issue described above
2. Maintain the exact indentation and code style from the original
3. Follow go best practices and language-specific idioms
4. Ensure the fix addresses the root cause, not just the symptoms
5. Add brief inline comments explaining the fix if needed

### Constraints

- Do not change functionality beyond fixing the identified issue
- Preserve existing variable names and function signatures unless they are part of the problem
- Ensure the fix is production-ready

---


Like Dislike Create Issue Jira

Comment on lines +54 to +60
<button
key={f.key}
role="tab"
aria-selected={activeFilter === f.key}
className={`filter-btn ${activeFilter === f.key ? 'active' : ''}`}
onClick={() => onFilterChange(f.key)}
>

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Robustness Medium

The tab buttons are missing a type="button" attribute, so embedding this component inside a form will submit the form on filter clicks; set an explicit button type.

Suggested fix
                <button
                  key={f.key}
                  type="button"
                  role="tab"
                  aria-selected={activeFilter === f.key}
                  className={`filter-btn ${activeFilter === f.key ? 'active' : ''}`}
                  onClick={() => onFilterChange(f.key)}
                >
Prompt for AI assistance

Copy the prompt below and paste it into ChatGPT, Claude, or any LLM:

You are an expert javascript developer with deep knowledge of security, performance, and best practices.

### Context

File: core/http/react-ui/src/components/FilterBar.jsx
Lines: 54-60
Issue Type: robustness-medium
Severity: medium

Issue Description:
The tab buttons are missing a `type="button"` attribute, so embedding this component inside a form will submit the form on filter clicks; set an explicit button type.

Current Code:
                <button
                  key={f.key}
                  role="tab"
                  aria-selected={activeFilter === f.key}
                  className={`filter-btn ${activeFilter === f.key ? 'active' : ''}`}
                  onClick={() => onFilterChange(f.key)}
                >

---

### Instructions

1. Fix the issue described above
2. Maintain the exact indentation and code style from the original
3. Follow javascript best practices and language-specific idioms
4. Ensure the fix addresses the root cause, not just the symptoms
5. Add brief inline comments explaining the fix if needed

### Constraints

- Do not change functionality beyond fixing the identified issue
- Preserve existing variable names and function signatures unless they are part of the problem
- Ensure the fix is production-ready

---


Like Dislike Create Issue Jira

ariaLabel={context === 'models' ? 'Model distribution' : 'Backend distribution'}
>
<div className="popover__header">
<strong>Installed on {total} node{total === 1 ? '' : 's'}</strong>

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Functional Medium

The popover header always says "Installed on" even for models where the component is documenting loaded state, so use a context-specific label.

Suggested change
<strong>Installed on {total} node{total === 1 ? '' : 's'}</strong>
<strong>{context === 'models' ? 'Loaded on ' : 'Installed on '}{total} node{total === 1 ? '' : 's'}</strong>
Prompt for AI assistance

Copy the prompt below and paste it into ChatGPT, Claude, or any LLM:

You are an expert javascript developer with deep knowledge of security, performance, and best practices.

### Context

File: core/http/react-ui/src/components/NodeDistributionChip.jsx
Lines: 97-97
Issue Type: functional-medium
Severity: medium

Issue Description:
The popover header always says "Installed on" even for models where the component is documenting loaded state, so use a context-specific label.

Current Code:
          <strong>Installed on {total} node{total === 1 ? '' : 's'}</strong>

---

### Instructions

1. Fix the issue described above
2. Maintain the exact indentation and code style from the original
3. Follow javascript best practices and language-specific idioms
4. Ensure the fix addresses the root cause, not just the symptoms
5. Add brief inline comments explaining the fix if needed

### Constraints

- Do not change functionality beyond fixing the identified issue
- Preserve existing variable names and function signatures unless they are part of the problem
- Ensure the fix is production-ready

---


Like Dislike Create Issue Jira

Comment on lines +197 to +199
default:
xlog.Warn("Reconciler: unknown pending op", "op", op.Op, "id", op.ID)
continue

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Security Medium

Unknown pending op types are skipped forever without updating attempts or deleting the row, so a malformed queued operation can persist indefinitely; dead-letter or mark failure for unsupported op values.

Suggested fix
		default:
			xlog.Warn("Reconciler: unknown pending op", "op", op.Op, "id", op.ID)
			_ = rc.registry.RecordPendingBackendOpFailure(ctx, op.ID, fmt.Sprintf("unsupported op: %s", op.Op))
			if op.Attempts+1 >= maxPendingBackendOpAttempts {
				_ = rc.registry.DeletePendingBackendOp(ctx, op.ID)
			}
			continue
Prompt for AI assistance

Copy the prompt below and paste it into ChatGPT, Claude, or any LLM:

You are an expert go developer with deep knowledge of security, performance, and best practices.

### Context

File: core/services/nodes/reconciler.go
Lines: 197-199
Issue Type: security-medium
Severity: medium

Issue Description:
Unknown pending op types are skipped forever without updating attempts or deleting the row, so a malformed queued operation can persist indefinitely; dead-letter or mark failure for unsupported op values.

Current Code:
		default:
			xlog.Warn("Reconciler: unknown pending op", "op", op.Op, "id", op.ID)
			continue

---

### Instructions

1. Fix the issue described above
2. Maintain the exact indentation and code style from the original
3. Follow go best practices and language-specific idioms
4. Ensure the fix addresses the root cause, not just the symptoms
5. Add brief inline comments explaining the fix if needed

### Constraints

- Do not change functionality beyond fixing the identified issue
- Preserve existing variable names and function signatures unless they are part of the problem
- Ensure the fix is production-ready

---


Like Dislike Create Issue Jira

@codity-dm

codity-dm Bot commented May 20, 2026

Copy link
Copy Markdown

Security Scan Summary

Metric Value
Vulnerabilities Critical: 0
Overall Risk Clean
Files Scanned 21

No critical security issues detected

Scan completed in 42.6s

Security scan powered by Codity.ai

@codity-dm

codity-dm Bot commented May 20, 2026

Copy link
Copy Markdown

Dependency vulnerability scanning

Metric Value
Vulnerabilities Found 6
Scanner npm audit
View vulnerability details (6 items)

1. brace-expansion various

CVE: GHSA-f886-m6hf-6m8v
Severity: MODERATE
Fixed in: True

brace-expansion: Zero-step sequence causes process hang and memory exhaustion


2. express-rate-limit various

CVE: N/A
Severity: MODERATE
Fixed in: True

Vulnerability in express-rate-limit


3. fast-uri various

CVE: GHSA-q3j6-qgpj-74h6, GHSA-v39h-62p7-jpjc
Severity: HIGH
Fixed in: True

fast-uri vulnerable to path traversal via percent-encoded dot segments


4. hono various

CVE: GHSA-qp7p-654g-cw7p, GHSA-hm8q-7f3q-5f36, GHSA-p77w-8qqv-26rm, GHSA-9vqf-7f2p-gf9v, GHSA-69xw-7hcm-h432
Severity: MODERATE
Fixed in: True

Hono has CSS Declaration Injection via Style Object Values in JSX SSR


5. ip-address various

CVE: GHSA-v2v4-37r5-5v8g
Severity: MODERATE
Fixed in: True

ip-address has XSS in Address6 HTML-emitting methods


6. postcss various

CVE: GHSA-qx2v-qp2m-jg93
Severity: MODERATE
Fixed in: True

PostCSS has XSS via Unescaped </style> in its CSS Stringify Output

Powered by Codity.ai · Docs

@codity-dm

codity-dm Bot commented May 20, 2026

Copy link
Copy Markdown

Dependency vulnerability scanning

Metric Value
Vulnerabilities Found 4
Scanner pip-audit
View vulnerability details (4 items)

1. pip 24.0

CVE: GHSA-4xh5-x5gv-qwph
Fixed in: 25.3

When extracting a tar archive pip may not check symbolic links point into the extraction directory if the tarfile module doesn't implement PEP 706. Note that upgrading pip to a "fixed" version for thi


2. pip 24.0

CVE: GHSA-6vgw-5pg2-w6jp
Fixed in: 26.0

When pip is installing and extracting a maliciously crafted wheel archive, files may be extracted outside the installation directory. The path traversal is limited to prefixes of the installation dire


3. pip 24.0

CVE: GHSA-58qw-9mgm-455v
Fixed in:

pip handles concatenated tar and ZIP files as ZIP files regardless of filename or whether a file is both a tar and ZIP file. This behavior could result in confusing installation behavior, such as inst


4. pip 24.0

CVE: GHSA-jp4c-xjxw-mgf9
Fixed in: 26.1

pip prior to version 26.1 would run self-update check functionality after installing wheel files which required importing well-known Python modules names. These module imports were intentionally defer

Powered by Codity.ai · Docs

@greptile-apps

greptile-apps Bot commented May 20, 2026

Copy link
Copy Markdown

Greptile Summary

This PR polishes the distributed mode UX across the backend: it fixes the long-standing bug where upgrade detection always read the empty frontend filesystem instead of aggregating from workers, introduces a durable pending_backend_ops queue so backend lifecycle ops survive worker downtime, and adds cluster-drift detection to surface nodes running divergent backend versions. The frontend receives matching work — per-node distribution chips, upgrade banners, richer filter bars, and auto-refresh in distributed mode.

  • Backend (Go): CheckUpgradesAgainst replaces the single-node-only CheckBackendUpgrades for distributed callers; enqueueAndDrainBackendOp replaces fire-and-forget fan-outs with durable intent rows retried by the new reconcileState pass; gRPC health probes prune ghost model rows automatically.
  • Frontend (React): New NodeDistributionChip, Popover, and FilterBar components; Manage.jsx gains per-tab search/filter with URL persistence, an upgrade banner, and a "Upgrade all" action; Nodes.jsx moves from inline styles to design-system CSS classes.

Confidence Score: 3/5

The Go backend changes are solid and well-tested; the frontend Upgrade all action unconditionally shows a success notice even when every upgrade call fails, which will mislead operators on a broken cluster.

The core distributed infrastructure work is carefully implemented with good test coverage. The frontend misinforms the operator when all backends fail to upgrade — a user who clicks Upgrade all and sees only error toasts followed immediately by a success banner has no reliable signal about the true cluster state.

core/http/react-ui/src/pages/Manage.jsx — handleUpgradeAll success-toast logic needs fixing before this ships to operators.

Important Files Changed

Filename Overview
core/http/react-ui/src/pages/Manage.jsx Significant UI overhaul adding per-tab filtering, URL-synced filter state, auto-refresh in distributed mode, upgrade banner, and NodeDistributionChip; has a bug where the success toast fires unconditionally even when all individual upgrades fail, plus dead showOnlyUpgradable state.
core/services/nodes/managers_distributed.go Major refactor extracting enqueueAndDrainBackendOp with durable intent recording and per-node result reporting; has a narrow race between the direct fan-out and the reconciler both applying the same op on a healthy node.
core/application/upgrade_checker.go Adds lazy backendManagerFn indirection so the checker routes through the distributed backend manager in cluster mode; reduces initial delay from 30 s to 10 s which may be too short for workers to fully register.
core/services/nodes/registry.go Adds PendingBackendOp GORM model with unique composite index, full CRUD, exponential-backoff failure recording, and startup pruning of stale agent-node rows.
core/services/nodes/reconciler.go Adds reconcileState with pending-op drain and gRPC-health-based ghost-model pruning passes; uses separate advisory lock from replica scaling, overridable ModelProber for testing.
core/gallery/upgrade.go Splits CheckBackendUpgrades into a single-node wrapper plus CheckUpgradesAgainst accepting pre-aggregated SystemBackends; adds cluster-drift detection via summarizeNodeDrift majority-vote logic.
core/http/routes/ui_api.go Extends the models list API to populate loaded_on distribution data from the node registry and emits registry-only ghost entries for models with no local config file.

Sequence Diagram

sequenceDiagram
    participant UI as React UI
    participant API as ui_api.go
    participant DBM as DistributedBackendManager
    participant Reg as NodeRegistry (DB)
    participant Rec as ReplicaReconciler
    participant W as Worker Node (NATS)

    UI->>API: POST /api/backends/upgrade/:name
    API->>DBM: UpgradeBackend(ctx, name)
    DBM->>Reg: UpsertPendingBackendOp(nodeID, name, upgrade)
    alt node is healthy
        DBM->>W: InstallBackend (NATS RPC)
        W-->>DBM: reply.Success
        DBM->>Reg: deletePendingRow(nodeID, name, upgrade)
    else node is offline/draining
        DBM-->>API: "status=queued"
        Note over Rec,W: Reconciler picks up later
    end

    loop Every 30s
        Rec->>Reg: ListDuePendingBackendOps()
        Reg-->>Rec: due ops for healthy nodes
        Rec->>W: InstallBackend / DeleteBackend (NATS)
        alt success
            Rec->>Reg: DeletePendingBackendOp(id)
        else failure
            Rec->>Reg: RecordPendingBackendOpFailure(id, err)
            Note over Rec: exponential backoff, dead-letter at 10 attempts
        end
        Rec->>Reg: query stale loaded model rows
        Rec->>W: gRPC HealthCheck (probe)
        alt unreachable
            Rec->>Reg: RemoveNodeModel(nodeID, modelName)
        end
    end

    UI->>API: GET /api/system/models
    API->>Reg: ListAllLoadedModels + List nodes
    API-->>UI: modelCapability with LoadedOn and Source
Loading

Reviews (1): Last reviewed commit: "fix(distributed): stop queue loops on ag..." | Re-trigger Greptile

Comment on lines +274 to +281
for (const name of names) {
try {
await backendsApi.upgrade(name)
} catch (err) {
addToast(`Upgrade failed for ${name}: ${err.message}`, 'error')
}
}
addToast(`Upgrade started for ${names.length} backend${names.length === 1 ? '' : 's'}`, 'info')

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Success toast always fires regardless of upgrade outcome. Every call to backendsApi.upgrade is individually caught, so the outer try block never throws. The "Upgrade started for N backends" toast fires even when every individual upgrade failed — an operator who sees N error toasts followed by a success notice will be misled into thinking work is in progress when none of it is.

Suggested change
for (const name of names) {
try {
await backendsApi.upgrade(name)
} catch (err) {
addToast(`Upgrade failed for ${name}: ${err.message}`, 'error')
}
}
addToast(`Upgrade started for ${names.length} backend${names.length === 1 ? '' : 's'}`, 'info')
let succeeded = 0
for (const name of names) {
try {
await backendsApi.upgrade(name)
succeeded++
} catch (err) {
addToast(`Upgrade failed for ${name}: ${err.message}`, 'error')
}
}
if (succeeded > 0) {
addToast(`Upgrade started for ${succeeded} backend${succeeded === 1 ? '' : 's'}`, 'info')
}

Comment on lines +640 to +646
// Legacy "showOnlyUpgradable" toggle is now the 'upgradable' chip —
// keep backward-compat by mapping it onto the new filter.
if (showOnlyUpgradable && backendsFilter !== 'upgradable') {
// One-shot reconciliation — the old state becomes the new chip.
setBackendsFilter('upgradable')
setShowOnlyUpgradable(false)
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Dead state and unreachable migration block. showOnlyUpgradable is initialised to false and setShowOnlyUpgradable(true) is never called anywhere in the component, so this compatibility shim can never run. The state variable and this block can be removed entirely.

Suggested change
// Legacy "showOnlyUpgradable" toggle is now the 'upgradable' chip —
// keep backward-compat by mapping it onto the new filter.
if (showOnlyUpgradable && backendsFilter !== 'upgradable') {
// One-shot reconciliation — the old state becomes the new chip.
setBackendsFilter('upgradable')
setShowOnlyUpgradable(false)
}
// (showOnlyUpgradable legacy migration removed — state never set)

Comment on lines +116 to +144
if err := d.registry.UpsertPendingBackendOp(ctx, node.ID, backend, op, galleriesJSON); err != nil {
xlog.Warn("Failed to enqueue backend op", "op", op, "node", node.Name, "backend", backend, "error", err)
result.Nodes = append(result.Nodes, NodeOpStatus{
NodeID: node.ID, NodeName: node.Name, Status: "error",
Error: fmt.Sprintf("enqueue failed: %v", err),
})
continue
}

if node.Status != StatusHealthy {
// Intent is recorded; reconciler will retry when the node recovers.
result.Nodes = append(result.Nodes, NodeOpStatus{
NodeID: node.ID, NodeName: node.Name, Status: "queued",
Error: fmt.Sprintf("node %s, will retry when healthy", node.Status),
})
continue
}
if _, delErr := d.adapter.DeleteBackend(node.ID, name); delErr != nil {
if errors.Is(delErr, nats.ErrNoResponders) {
// Node's NATS subscription is gone — likely restarted with a new ID.
// Mark it unhealthy so future fan-outs skip it.
xlog.Warn("No NATS responders for node, marking unhealthy", "node", node.Name, "nodeID", node.ID)
d.registry.MarkUnhealthy(context.Background(), node.ID)
continue

applyErr := apply(node)
if applyErr == nil {
// Find the row we just upserted and delete it; cheap but requires
// a lookup since UpsertPendingBackendOp doesn't return the ID.
if err := d.deletePendingRow(ctx, node.ID, backend, op); err != nil {
xlog.Debug("Failed to clear pending backend op after success", "error", err)
}
xlog.Warn("Failed to propagate backend deletion to worker", "node", node.Name, "backend", name, "error", delErr)
errs = append(errs, fmt.Errorf("node %s: %w", node.Name, delErr))
result.Nodes = append(result.Nodes, NodeOpStatus{
NodeID: node.ID, NodeName: node.Name, Status: "success",
})
continue

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Reconciler can race the initial fan-out on healthy nodes. UpsertPendingBackendOp sets NextRetryAt = time.Now(), so the row is immediately eligible for ListDuePendingBackendOps. If a reconciler tick fires in the ≤30 s window between the upsert and the successful apply/deletePendingRow call, both the direct fan-out path and drainPendingBackendOps will call the same NATS subject for the same node concurrently. For delete the second call may see a "not found" error that triggers unnecessary retries. Setting NextRetryAt ahead (e.g., time.Now().Add(interval + 5*time.Second)) would close the window.

// Initial delay: don't slow down startup. Short enough that operators
// don't stare at an empty upgrade banner for long; long enough that
// workers have registered and reported their installed backends.
initialDelay := 10 * time.Second

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 10 s initial delay is likely too short for a distributed cluster. Worker registration, NATS subscription setup, and the initial ListBackends fan-out typically take longer than 10 s on any real cluster. Running CheckUpgrades before workers are visible means the upgrade banner silently shows nothing until the next scheduled check (6 h later). The old 30 s value was closer to the right order of magnitude.

Suggested change
initialDelay := 10 * time.Second
initialDelay := 30 * time.Second

@DhirenMhatre

Copy link
Copy Markdown
Author

@codity review

@codity-dm

codity-dm Bot commented May 20, 2026

Copy link
Copy Markdown

Policy Check Failed

✗ 3/3 policy checks failed:

• Need 2 more approval(s) (0/2) — comment LGTM or approve via review
• Missing ticket reference (expected: JIRA-, ENG-, #*)
• 20 code file(s) changed but no test files added


To merge this PR:

  1. Address the failed checks listed above
  2. Ensure branch protection requires the codity/policy-check status

Configure policies in your dashboard

@codity-dm

codity-dm Bot commented May 20, 2026

Copy link
Copy Markdown

PR Summary

What Changed

  • Added durable backend lifecycle operations with a pending-ops queue that retries delete/install/upgrade when nodes recover, preventing "zombie" backends.
  • Implemented distributed upgrade detection and cluster drift monitoring that flags version/digest mismatches across nodes.
  • Polished the React UI with new components for node distribution visibility, auto-refresh, and persistent filters.

Key Changes by Area

Backend Lifecycle

  • managers_distributed.go: Replaced fire-and-forget operations with enqueueAndDrainBackendOp() that persists intent and retries on node recovery.
  • reconciler.go: Added drain loop for pending ops with exponential backoff and ghost model cleanup via gRPC health probes.

Upgrade & Drift Detection

  • upgrade_checker.go: Added lazy backendManagerFn to route checks through active backend manager (local or distributed).
  • upgrade.go: New CheckUpgradesAgainst() function detects cluster drift when nodes disagree on version/digest.

UI Components

  • NodeDistributionChip.jsx: Shows model/backend distribution across nodes; inline for ≤3 nodes, popover summary for larger clusters with drift warnings.
  • FilterBar.jsx: Extracted search + filter + toggle controls for reuse across System tabs.
  • Popover.jsx: Minimal accessible popover with auto-positioning and focus management.

API & Registry

  • ui_api.go: Models endpoint now joins node registry to populate loaded_on field and surfaces registry-only ghost models.
  • registry.go: New PendingBackendOp table with queue management methods and startup cleanup migration.

Files Changed

File Changes Summary
core/application/distributed.go ReplicaReconciler receives Adapter, RegistrationToken, ProbeStaleAfter: 2m
core/application/startup.go Startup cleanup migration for malformed pending ops rows
core/application/upgrade_checker.go Lazy backendManagerFn for routing upgrade checks through active manager
core/cli/worker.go Worker reports Version, URI, Digest in backend lifecycle events
core/gallery/backends.go Added Nodes []NodeBackendRef and NodeDriftInfo for per-node attribution
core/gallery/upgrade.go New CheckUpgradesAgainst() with cluster drift detection
core/gallery/upgrade_test.go Tests for upgrade and drift detection logic
core/http/react-ui/src/App.css New CSS utilities: badges, stat cards, popover system, filter layout, status indicators
core/http/react-ui/src/components/FilterBar.jsx New shared search + filter + toggle control strip
core/http/react-ui/src/components/NodeDistributionChip.jsx Smart cluster distribution display with inline/popover modes
core/http/react-ui/src/components/Popover.jsx Minimal accessible popover with auto-positioning
core/http/react-ui/src/pages/Manage.jsx Auto-refresh, URL-persisted filters, upgrade banner, drift warnings, NodeDistributionChip integration
core/http/react-ui/src/pages/Nodes.jsx StatCard CSS classes, expandable rows, cluster stat cards
core/http/routes/ui_api.go Models endpoint joins node registry for loaded_on, surfaces registry-only models
core/services/advisorylock/keys.go New KeyStateReconciler (107) for reconciler coordination
core/services/galleryop/service.go Integration with pending ops queue
core/services/messaging/subjects.go Extended NodeBackendInfo with Version, URI, Digest
core/services/nodes/managers_distributed.go enqueueAndDrainBackendOp() with durable queue, DeleteBackendDetailed() for UI visibility
core/services/nodes/reconciler.go Pending ops drain loop, ghost model cleanup, advisory lock separation
core/services/nodes/reconciler_test.go Tests for reconciler drain and cleanup logic
core/services/nodes/registry.go PendingBackendOp table, queue management methods, startup cleanup

Review Focus Areas

  • Dead letter handling: Operations abandoned after 10 attempts in reconciler.go:171-242. Confirm this is acceptable vs. alerting.
  • Advisory lock contention: KeyStateReconciler vs replica-reconciler separation prevents blocking, but verify no deadlock scenarios.
  • Drift detection accuracy: CheckUpgradesAgainst() flags version/digest mismatches. Review if partial cluster consensus should trigger alerts.

Architecture

Design Decisions

  • Chose pending-ops table over in-memory queue for durability across restarts. Tradeoff: database writes on every operation.
  • Separated replica scaling and state reconciliation locks intentionally. Slow reconciler passes should not block urgent scaling decisions.
  • Lazy backendManagerFn in upgrade checker defers local vs. distributed decision to call time, avoiding startup ordering complexity.

Scalability & Extensibility

  • NodeDistributionChip scales to large clusters via popover summary. Out of scope: virtualization for 100+ node clusters.
  • Exponential backoff (30s-15min) caps retry storm risk. Dead lettering after 10 attempts prevents infinite retry on permanent failures.

Risks

  • Intentional: Ghost model cleanup via gRPC probes may remove temporarily unreachable models during network partitions. Acceptable tradeoff for stale data removal.
  • Unintentional: UpsertPendingBackendOp conflict resolution resets retry time on re-issue. Review if this allows retry starvation via rapid re-issues.

Merge Status

NOT MERGEABLE — PR Score 23/100, below threshold (50)

  • [H1] 1 CRITICAL security vulnerability found (SEC-001)
  • [H4] PR quality score (23) is below merge floor (50)
  • [H5] 3 HIGH-severity inline review findings need resolution (threshold: 3)
  • [H6] Code quality raw score (15) is below merge floor (40)

Comment on lines +279 to +283
if err := rc.registry.RemoveNodeModel(ctx, m.NodeID, m.ModelName); err != nil {
xlog.Warn("Reconciler: failed to remove unreachable model", "node", m.NodeID, "model", m.ModelName, "error", err)
continue
}
xlog.Warn("Reconciler: model unreachable, removed from registry",

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Robustness High

The code removes the registry row before unloading the worker, so unload failures leave an unmanaged loaded backend; unload first or restore the row on failure.

Suggested fix
		if err := rc.unloader.UnloadModelOnNode(m.NodeID, m.ModelName); err != nil {
			xlog.Warn("Reconciler: failed to unload unreachable model", "node", m.NodeID, "model", m.ModelName, "error", err)
			continue
		}
		if err := rc.registry.RemoveNodeModel(ctx, m.NodeID, m.ModelName); err != nil {
			xlog.Warn("Reconciler: failed to remove unreachable model", "node", m.NodeID, "model", m.ModelName, "error", err)
			continue
		}
		xlog.Warn("Reconciler: model unreachable, removed from registry",
			"node", m.NodeID, "model", m.ModelName, "address", m.Address)
Prompt for AI assistance

Copy the prompt below and paste it into ChatGPT, Claude, or any LLM:

You are an expert go developer with deep knowledge of security, performance, and best practices.

### Context

File: core/services/nodes/reconciler.go
Lines: 279-283
Issue Type: robustness-high
Severity: high

Issue Description:
The code removes the registry row before unloading the worker, so unload failures leave an unmanaged loaded backend; unload first or restore the row on failure.

Current Code:
		if err := rc.registry.RemoveNodeModel(ctx, m.NodeID, m.ModelName); err != nil {
			xlog.Warn("Reconciler: failed to remove unreachable model", "node", m.NodeID, "model", m.ModelName, "error", err)
			continue
		}
		xlog.Warn("Reconciler: model unreachable, removed from registry",
			"node", m.NodeID, "model", m.ModelName, "address", m.Address)

---

### Instructions

1. Fix the issue described above
2. Maintain the exact indentation and code style from the original
3. Follow go best practices and language-specific idioms
4. Ensure the fix addresses the root cause, not just the symptoms
5. Add brief inline comments explaining the fix if needed

### Constraints

- Do not change functionality beyond fixing the identified issue
- Preserve existing variable names and function signatures unless they are part of the problem
- Ensure the fix is production-ready

---


Like Dislike Create Issue Jira

Comment on lines +200 to +205
ctx := context.Background()
_, err := d.enqueueAndDrainBackendOp(ctx, OpBackendDelete, name, nil, func(node BackendNode) error {
_, err := d.adapter.DeleteBackend(node.ID, name)
return err
})
return err

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Security Medium

DeleteBackend uses context.Background instead of a request-scoped context, so cancellation and deadlines from the caller are ignored; pass a caller-provided context through this path.

Suggested fix
	func (d *DistributedBackendManager) DeleteBackend(ctx context.Context, name string) error {
		if err := d.local.DeleteBackend(name); err != nil {
			if !errors.Is(err, gallery.ErrBackendNotFound) {
				return err
			}
			xlog.Debug("Backend not found locally, will attempt deletion on workers", "backend", name)
		}

		_, err := d.enqueueAndDrainBackendOp(ctx, OpBackendDelete, name, nil, func(node BackendNode) error {
			_, err := d.adapter.DeleteBackend(node.ID, name)
			return err
		})
		return err
	}
Prompt for AI assistance

Copy the prompt below and paste it into ChatGPT, Claude, or any LLM:

You are an expert go developer with deep knowledge of security, performance, and best practices.

### Context

File: core/services/nodes/managers_distributed.go
Lines: 200-205
Issue Type: security-medium
Severity: medium

Issue Description:
DeleteBackend uses context.Background instead of a request-scoped context, so cancellation and deadlines from the caller are ignored; pass a caller-provided context through this path.

Current Code:
	ctx := context.Background()
	_, err := d.enqueueAndDrainBackendOp(ctx, OpBackendDelete, name, nil, func(node BackendNode) error {
		_, err := d.adapter.DeleteBackend(node.ID, name)
		return err
	})
	return err

---

### Instructions

1. Fix the issue described above
2. Maintain the exact indentation and code style from the original
3. Follow go best practices and language-specific idioms
4. Ensure the fix addresses the root cause, not just the symptoms
5. Add brief inline comments explaining the fix if needed

### Constraints

- Do not change functionality beyond fixing the identified issue
- Preserve existing variable names and function signatures unless they are part of the problem
- Ensure the fix is production-ready

---


Like Dislike Create Issue Jira

@codity-dm

codity-dm Bot commented May 20, 2026

Copy link
Copy Markdown

Security Scan Summary

Metric Value
Vulnerabilities Critical: 0
Overall Risk Clean
Files Scanned 21

No critical security issues detected

Scan completed in 46.8s

Security scan powered by Codity.ai

@codity-dm

codity-dm Bot commented May 20, 2026

Copy link
Copy Markdown

Dependency vulnerability scanning

Metric Value
Vulnerabilities Found 6
Scanner npm audit
View vulnerability details (6 items)

1. brace-expansion various

CVE: GHSA-f886-m6hf-6m8v
Severity: MODERATE
Fixed in: True

brace-expansion: Zero-step sequence causes process hang and memory exhaustion


2. express-rate-limit various

CVE: N/A
Severity: MODERATE
Fixed in: True

Vulnerability in express-rate-limit


3. fast-uri various

CVE: GHSA-q3j6-qgpj-74h6, GHSA-v39h-62p7-jpjc
Severity: HIGH
Fixed in: True

fast-uri vulnerable to path traversal via percent-encoded dot segments


4. hono various

CVE: GHSA-qp7p-654g-cw7p, GHSA-hm8q-7f3q-5f36, GHSA-p77w-8qqv-26rm, GHSA-9vqf-7f2p-gf9v, GHSA-69xw-7hcm-h432
Severity: MODERATE
Fixed in: True

Hono has CSS Declaration Injection via Style Object Values in JSX SSR


5. ip-address various

CVE: GHSA-v2v4-37r5-5v8g
Severity: MODERATE
Fixed in: True

ip-address has XSS in Address6 HTML-emitting methods


6. postcss various

CVE: GHSA-qx2v-qp2m-jg93
Severity: MODERATE
Fixed in: True

PostCSS has XSS via Unescaped </style> in its CSS Stringify Output

Powered by Codity.ai · Docs

@codity-dm

codity-dm Bot commented May 20, 2026

Copy link
Copy Markdown

Dependency vulnerability scanning

Metric Value
Vulnerabilities Found 4
Scanner pip-audit
View vulnerability details (4 items)

1. pip 24.0

CVE: GHSA-4xh5-x5gv-qwph
Fixed in: 25.3

When extracting a tar archive pip may not check symbolic links point into the extraction directory if the tarfile module doesn't implement PEP 706. Note that upgrading pip to a "fixed" version for thi


2. pip 24.0

CVE: GHSA-6vgw-5pg2-w6jp
Fixed in: 26.0

When pip is installing and extracting a maliciously crafted wheel archive, files may be extracted outside the installation directory. The path traversal is limited to prefixes of the installation dire


3. pip 24.0

CVE: GHSA-58qw-9mgm-455v
Fixed in:

pip handles concatenated tar and ZIP files as ZIP files regardless of filename or whether a file is both a tar and ZIP file. This behavior could result in confusing installation behavior, such as inst


4. pip 24.0

CVE: GHSA-jp4c-xjxw-mgf9
Fixed in: 26.1

pip prior to version 26.1 would run self-update check functionality after installing wheel files which required importing well-known Python modules names. These module imports were intentionally defer

Powered by Codity.ai · Docs

@codity-dm

codity-dm Bot commented May 20, 2026

Copy link
Copy Markdown

License Compliance Scan

Metric Value
Packages Scanned 778
High Risk (Strong Copyleft) 0
Medium Risk (Weak Copyleft) 6
Low Risk (Permissive) 646
Unknown License 126

Weak copyleft licenses found - verify compatibility

Some packages have unknown licenses - manual review required

Medium Risk Licenses - 6 packages

(MPL-2.0 OR Apache-2.0) (1 packages):

  • dompurify 3.4.0

MPL-2.0 (5 packages):

  • github.com/philippgille/chromem-go 0.7.0
  • github.com/libp2p/go-yamux/v5 5.0.1
  • github.com/hashicorp/golang-lru 1.0.2
  • github.com/hashicorp/golang-lru/v2 2.0.7
  • github.com/shoenig/go-m1cpu 0.1.6
Unknown Licenses - 126 packages
  • github.com/nats-io/nkeys 0.4.15
  • github.com/moby/moby/client 0.4.0
  • github.com/moby/moby/api 1.54.1
  • github.com/nats-io/nuid 1.0.1
  • github.com/wk8/go-ordered-map/v2 2.1.8
  • github.com/ProtonMail/go-crypto 1.1.6
  • github.com/PuerkitoBio/goquery 1.10.3
  • github.com/RoaringBitmap/roaring/v2 2.4.5
  • github.com/andybalholm/cascadia 1.3.3
  • github.com/antchfx/htmlquery 1.3.4
  • github.com/bits-and-blooms/bitset 1.24.0
  • github.com/antchfx/xmlquery 1.4.4
  • github.com/antchfx/xpath 1.3.4
  • github.com/blevesearch/bleve/v2 2.5.7
  • github.com/blevesearch/bleve_index_api 1.2.11
  • github.com/blevesearch/geo 0.2.4
  • github.com/blevesearch/gtreap 0.1.1
  • github.com/blevesearch/go-porterstemmer 1.0.3
  • github.com/blevesearch/mmap-go 1.0.4
  • github.com/blevesearch/scorch_segment_api/v2 2.3.13

...and 106 more

Powered by Codity.ai · Docs

@codity-dm

codity-dm Bot commented May 20, 2026

Copy link
Copy Markdown

Code Quality Report — test-org-codity/LocalAI · PR #7

Scanned: 2026-05-20 12:14 UTC | Score: 15/100 | Provider: github

Executive Summary

Severity Count
Critical 0
High 0
Medium 5
Low 211
Top Findings

[CQ-LLM-002] core/application/distributed.go:242 (Documentation · MEDIUM)

Issue: Missing detailed documentation for the new parameters in the ReplicaReconcilerOptions struct.
Suggestion: Add JSDoc comments for the new fields in ReplicaReconcilerOptions.

// Create ReplicaReconciler for auto-scaling model replicas. Adapter + ...

[CQ-LLM-001] core/application/distributed.go:243 (Complexity · MEDIUM)

Issue: The function initDistributed has increased cyclomatic complexity due to additional parameters and logic.
Suggestion: Consider breaking down the function into smaller, more manageable functions.

func initDistributed(cfg *config.ApplicationConfig, authDB *gorm.DB) (*Distributed, error) { ... }

[CQ-LLM-005] core/application/upgrade_checker.go:40 (Testability · MEDIUM)

Issue: The use of a global state (backendManagerFn) can make unit testing difficult.
Suggestion: Consider using dependency injection to pass the backend manager to the UpgradeChecker.

backendManagerFn func() galleryop.BackendManager

[CQ-LLM-003] core/application/upgrade_checker.go:146 (Error_Handling · MEDIUM)

Issue: Potentially swallowing errors when checking upgrades without handling the case where both upgrades and error are nil.
Suggestion: Add error handling to ensure that if both upgrades and err are nil, it is logged or handled appropriately.

if upgrades == nil && err == nil { ... }

[CQ-LLM-004] core/application/upgrade_checker.go:158 (Performance · MEDIUM)

Issue: The runCheck function may lead to performance issues due to the nested calls to backendManagerFn and CheckBackendUpgrades.
Suggestion: Consider optimizing the logic to avoid unnecessary calls and improve performance.

if uc.backendManagerFn != nil { if bm := uc.backendManagerFn(); bm != nil { ... } }

[CQ-008] core/application/distributed.go:256 (Maintainability · LOW)

Issue: Magic number 30 in code
Suggestion: Extract to a named constant

Interval:          30 * time.Second,

[CQ-009] core/application/upgrade_checker.go:53 (Style · LOW)

Issue: Line exceeds 120 characters (163 chars)
Suggestion: Break long lines into multiple lines for readability

func NewUpgradeChecker(appConfig *config.ApplicationConfig, ml *model.ModelLoader, db *gorm.DB, backendManagerFn func() ...

[CQ-008] core/application/upgrade_checker.go:81 (Maintainability · LOW)

Issue: Magic number 10 in code
Suggestion: Extract to a named constant

initialDelay := 10 * time.Second

[CQ-LLM-006] core/gallery/upgrade.go:25 (Maintainability · LOW)

Issue: The NodeDrift field in UpgradeInfo lacks a clear explanation of its purpose.
Suggestion: Add a comment explaining the purpose of NodeDrift in the UpgradeInfo struct.

// NodeDrift lists nodes whose installed version or digest differs from ...

[CQ-009] core/gallery/upgrade.go:59 (Style · LOW)

Issue: Line exceeds 120 characters (175 chars)
Suggestion: Break long lines into multiple lines for readability

func CheckUpgradesAgainst(ctx context.Context, galleries []config.Gallery, systemState *system.SystemState, installedBac...

Per-File Breakdown

File Critical High Medium Low Total
core/application/distributed.go 0 0 2 1 3
core/application/upgrade_checker.go 0 0 3 2 5
core/gallery/upgrade.go 0 0 0 2 2
core/http/react-ui/src/App.css 0 0 0 14 14
core/http/react-ui/src/components/FilterBar.jsx 0 0 0 1 1
core/http/react-ui/src/components/NodeDistributionChip.jsx 0 0 0 11 11
core/http/react-ui/src/components/Popover.jsx 0 0 0 2 2
core/http/react-ui/src/pages/Manage.jsx 0 0 0 122 122
core/http/react-ui/src/pages/Nodes.jsx 0 0 0 36 36
core/services/advisorylock/keys.go 0 0 0 1 1
core/services/nodes/managers_distributed.go 0 0 0 1 1
core/services/nodes/reconciler.go 0 0 0 2 2
core/services/nodes/reconciler_test.go 0 0 0 9 9
core/services/nodes/registry.go 0 0 0 7 7

Recommendations

  • Run automated tests after applying fixes to verify no regressions.

@DhirenMhatre

Copy link
Copy Markdown
Author

@codity review

@codity-dm

codity-dm Bot commented May 20, 2026

Copy link
Copy Markdown

Policy Check Failed

✗ 3/3 policy checks failed:

• Need 2 more approval(s) (0/2) — comment LGTM or approve via review
• Missing ticket reference (expected: JIRA-, ENG-, #*)
• 20 code file(s) changed but no test files added


To merge this PR:

  1. Address the failed checks listed above
  2. Ensure branch protection requires the codity/policy-check status

Configure policies in your dashboard

@codity-dm

codity-dm Bot commented May 20, 2026

Copy link
Copy Markdown

PR Summary

What Changed

  • Fixed distributed upgrade detection by routing checks through the active backend manager instead of empty local filesystem.
  • Added durable backend lifecycle operations with a pending-ops queue that retries when nodes recover.
  • Built cluster drift visibility into the UI with node distribution chips, version mismatch warnings, and per-backend upgrade controls.

Key Changes by Area

Distributed Backend Management

  • PendingBackendOp table with unique constraints prevents duplicate install/delete/upgrade intents across nodes.
  • Reconciler drains queue with exponential backoff (30s to 15m cap) and dead-letters after 10 attempts.
  • DeleteBackendDetailed returns per-node status for UI visibility.

Cluster Upgrade Detection

  • SystemBackend.Nodes tracks per-node version, digest, and URI for drift detection.
  • CheckUpgradesAgainst aggregates worker state instead of reading frontend filesystem.
  • Worker lifecycle events now report version metadata.

State Reconciliation

  • ModelProber interface with gRPC health checks removes orphaned model processes.
  • probeLoadedModels clears ghost entries for stale addresses (older than 2m).
  • Split advisory locks: replica scaling vs. state reconciliation use separate keys.

UI/UX

  • NodeDistributionChip shows inline chips (≤3 nodes) or summary popover with drift detection.
  • URL-persisted filters survive tab switches; 10s auto-refresh in distributed mode.
  • "Adopted" badge for registry-only models discovered on workers.

Files Changed

File Changes Summary
core/application/distributed.go Startup wiring for distributed mode
core/application/startup.go Initialization sequence updates
core/application/upgrade_checker.go:47-62 backendManagerFn lazy getter for routing upgrade checks
core/cli/worker.go:738-743 Worker reports version, URI, digest in lifecycle events
core/gallery/backends.go:394-414 SystemBackend.Nodes field and NodeBackendRef struct
core/gallery/upgrade.go:23-129 NodeDrift field, CheckUpgradesAgainst function, summarizeNodeDrift
core/gallery/upgrade_test.go Tests for cluster upgrade detection
core/http/react-ui/src/App.css:1529-1540 Badge utilities, popover animations, status indicators, filter layouts
core/http/react-ui/src/components/FilterBar.jsx Reusable search + chip filter component
core/http/react-ui/src/components/NodeDistributionChip.jsx Node distribution display with drift detection
core/http/react-ui/src/components/Popover.jsx Accessible auto-positioning popover
core/http/react-ui/src/pages/Manage.jsx:32-821 Drift badges, URL-persisted filters, auto-refresh, backend upgrade controls
core/http/react-ui/src/pages/Nodes.jsx Stat cards, tab classes, worker registration improvements
core/http/routes/ui_api.go loaded_on array for model distribution; registry-only model surfacing
core/services/advisorylock/keys.go KeyStateReconciler constant
core/services/galleryop/service.go BackendManager() accessor
core/services/messaging/subjects.go NodeBackendInfo version/digest/URI fields
core/services/nodes/managers_distributed.go:97-167 Pending-ops queue integration, agent node skipping
core/services/nodes/reconciler.go:171-242 drainPendingBackendOps and probeLoadedModels reconciliation passes
core/services/nodes/reconciler_test.go Tests for state reconciliation and model probing
core/services/nodes/registry.go:107-128 PendingBackendOp table, queue operations, startup cleanup

Review Focus Areas

  • Dead-letter behavior after 10 failed attempts: verify this matches operational expectations for backend operations.
  • Advisory lock splitting: confirm KeyStateReconciler non-blocking behavior doesn't starve replica scaling under load.
  • gRPC health check timeout in ModelProber: check it handles slow-but-healthy model processes correctly.

Architecture

Design Decisions

  • Pending-ops queue over synchronous fan-out: intentional tradeoff for durability when nodes are transiently offline. Accepts complexity of retry bookkeeping.
  • Lazy backendManagerFn getter: minimal fix for upgrade checker without restructuring initialization order.
  • Separate advisory locks: state reconciliation can run without blocking replica scaling, but may race; non-blocking TryWithLockCtx skips rather than waits.

Scalability & Extensibility

  • NodeDistributionChip collapses to summary at >3 nodes: keeps DOM weight bounded for large clusters.
  • Exponential backoff caps at 15 minutes: prevents thundering herd on recovering nodes.

Risks

  • Intentional: Dead-lettering after 10 attempts leaves backends in indeterminate state; requires manual intervention or external monitoring.
  • Unintentional: probeLoadedModels uses 2m staleness threshold without configurability; may be too aggressive for slow-starting models.

Merge Status

NOT MERGEABLE — PR Score 8/100, below threshold (50)

  • [H1] 1 CRITICAL security vulnerability found (SEC-001)
  • [H4] PR quality score (8) is below merge floor (50)
  • [H5] 5 HIGH-severity inline review findings need resolution (threshold: 3)
  • [H6] Code quality raw score (14) is below merge floor (40)

Comment on lines +642 to +645
if (showOnlyUpgradable && backendsFilter !== 'upgradable') {
// One-shot reconciliation — the old state becomes the new chip.
setBackendsFilter('upgradable')
setShowOnlyUpgradable(false)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Robustness High

Calling state setters during render causes render-loop risk and should be moved into a useEffect that watches showOnlyUpgradable and backendsFilter.

Suggested fix
          useEffect(() => {
            if (showOnlyUpgradable && backendsFilter !== 'upgradable') {
              setBackendsFilter('upgradable')
              setShowOnlyUpgradable(false)
            }
          }, [showOnlyUpgradable, backendsFilter])
Prompt for AI assistance

Copy the prompt below and paste it into ChatGPT, Claude, or any LLM:

You are an expert javascript developer with deep knowledge of security, performance, and best practices.

### Context

File: core/http/react-ui/src/pages/Manage.jsx
Lines: 642-645
Issue Type: robustness-high
Severity: high

Issue Description:
Calling state setters during render causes render-loop risk and should be moved into a useEffect that watches showOnlyUpgradable and backendsFilter.

Current Code:
          if (showOnlyUpgradable && backendsFilter !== 'upgradable') {
            // One-shot reconciliation — the old state becomes the new chip.
            setBackendsFilter('upgradable')
            setShowOnlyUpgradable(false)
          }

---

### Instructions

1. Fix the issue described above
2. Maintain the exact indentation and code style from the original
3. Follow javascript best practices and language-specific idioms
4. Ensure the fix addresses the root cause, not just the symptoms
5. Add brief inline comments explaining the fix if needed

### Constraints

- Do not change functionality beyond fixing the identified issue
- Preserve existing variable names and function signatures unless they are part of the problem
- Ensure the fix is production-ready

---


Like Dislike Create Issue Jira

Comment on lines +1061 to +1069
return r.db.WithContext(ctx).Transaction(func(tx *gorm.DB) error {
var row PendingBackendOp
if err := tx.First(&row, id).Error; err != nil {
return err
}
row.Attempts++
row.LastError = errMsg
row.NextRetryAt = time.Now().Add(backoffForAttempt(row.Attempts))
return tx.Save(&row).Error

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Robustness High

The row update in RecordPendingBackendOpFailure is vulnerable to lost updates under concurrent failures; use an atomic UPDATE that increments attempts in SQL.

Suggested fix
	return r.db.WithContext(ctx).Transaction(func(tx *gorm.DB) error {
		var row PendingBackendOp
		if err := tx.Clauses(clause.Locking{Strength: "UPDATE"}).First(&row, id).Error; err != nil {
			return err
		}
		attempts := row.Attempts + 1
		return tx.Model(&PendingBackendOp{}).
			Where("id = ?", id).
			Updates(map[string]any{
				"attempts":      gorm.Expr("attempts + 1"),
				"last_error":    errMsg,
				"next_retry_at": time.Now().Add(backoffForAttempt(attempts)),
			}).Error
	})
Prompt for AI assistance

Copy the prompt below and paste it into ChatGPT, Claude, or any LLM:

You are an expert go developer with deep knowledge of security, performance, and best practices.

### Context

File: core/services/nodes/registry.go
Lines: 1061-1069
Issue Type: robustness-high
Severity: high

Issue Description:
The row update in RecordPendingBackendOpFailure is vulnerable to lost updates under concurrent failures; use an atomic UPDATE that increments attempts in SQL.

Current Code:
	return r.db.WithContext(ctx).Transaction(func(tx *gorm.DB) error {
		var row PendingBackendOp
		if err := tx.First(&row, id).Error; err != nil {
			return err
		}
		row.Attempts++
		row.LastError = errMsg
		row.NextRetryAt = time.Now().Add(backoffForAttempt(row.Attempts))
		return tx.Save(&row).Error
	})

---

### Instructions

1. Fix the issue described above
2. Maintain the exact indentation and code style from the original
3. Follow go best practices and language-specific idioms
4. Ensure the fix addresses the root cause, not just the symptoms
5. Add brief inline comments explaining the fix if needed

### Constraints

- Do not change functionality beyond fixing the identified issue
- Preserve existing variable names and function signatures unless they are part of the problem
- Ensure the fix is production-ready

---


Like Dislike Create Issue Jira

Comment thread core/gallery/backends.go
NodeStatus string `json:"node_status"` // healthy | unhealthy | offline | draining | pending
Version string `json:"version,omitempty"`
Digest string `json:"digest,omitempty"`
URI string `json:"uri,omitempty"`

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Security Medium

Exposing raw backend URI values in the JSON response can leak internal filesystem paths or registry locations, so omit this field from the API model or redact it before serialization.

Suggested change
URI string `json:"uri,omitempty"`
URI string `json:"-"`
Prompt for AI assistance

Copy the prompt below and paste it into ChatGPT, Claude, or any LLM:

You are an expert go developer with deep knowledge of security, performance, and best practices.

### Context

File: core/gallery/backends.go
Lines: 412-412
Issue Type: security-medium
Severity: medium

Issue Description:
Exposing raw backend URI values in the JSON response can leak internal filesystem paths or registry locations, so omit this field from the API model or redact it before serialization.

Current Code:
	URI         string `json:"uri,omitempty"`

---

### Instructions

1. Fix the issue described above
2. Maintain the exact indentation and code style from the original
3. Follow go best practices and language-specific idioms
4. Ensure the fix addresses the root cause, not just the symptoms
5. Add brief inline comments explaining the fix if needed

### Constraints

- Do not change functionality beyond fixing the identified issue
- Preserve existing variable names and function signatures unless they are part of the problem
- Ensure the fix is production-ready

---


Like Dislike Create Issue Jira

Comment thread core/gallery/upgrade.go
Comment on lines +26 to +39
// NodeDrift lists nodes whose installed version or digest differs from
// the cluster majority. Non-empty means the cluster has diverged and an
// upgrade will realign it. Empty in single-node mode.
NodeDrift []NodeDriftInfo `json:"node_drift,omitempty"`
}

// CheckBackendUpgrades compares installed backends against gallery entries
// and returns a map of backend names to UpgradeInfo for those that have
// newer versions or different OCI digests available.
// NodeDriftInfo describes one node that disagrees with the cluster majority
// on which version/digest of a backend is installed.
type NodeDriftInfo struct {
NodeID string `json:"node_id"`
NodeName string `json:"node_name"`
Version string `json:"version,omitempty"`
Digest string `json:"digest,omitempty"`
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Security Medium

NodeDrift exposes internal node IDs, names, versions, and digests to any caller of this API, so redact these fields or gate them behind an authorization check before serializing.

Suggested fix
	// NodeDrift lists nodes whose installed version or digest differs from
	// the cluster majority. Populate this only for privileged callers.
	NodeDrift []NodeDriftInfo `json:"node_drift,omitempty"`
}

// NodeDriftInfo describes one node that disagrees with the cluster majority
// on which version/digest of a backend is installed.
type NodeDriftInfo struct {
	NodeName string `json:"node_name,omitempty"`
}
Prompt for AI assistance

Copy the prompt below and paste it into ChatGPT, Claude, or any LLM:

You are an expert go developer with deep knowledge of security, performance, and best practices.

### Context

File: core/gallery/upgrade.go
Lines: 26-39
Issue Type: security-medium
Severity: medium

Issue Description:
NodeDrift exposes internal node IDs, names, versions, and digests to any caller of this API, so redact these fields or gate them behind an authorization check before serializing.

Current Code:
	// NodeDrift lists nodes whose installed version or digest differs from
	// the cluster majority. Non-empty means the cluster has diverged and an
	// upgrade will realign it. Empty in single-node mode.
	NodeDrift []NodeDriftInfo `json:"node_drift,omitempty"`
}

// NodeDriftInfo describes one node that disagrees with the cluster majority
// on which version/digest of a backend is installed.
type NodeDriftInfo struct {
	NodeID   string `json:"node_id"`
	NodeName string `json:"node_name"`
	Version  string `json:"version,omitempty"`
	Digest   string `json:"digest,omitempty"`
}

---

### Instructions

1. Fix the issue described above
2. Maintain the exact indentation and code style from the original
3. Follow go best practices and language-specific idioms
4. Ensure the fix addresses the root cause, not just the symptoms
5. Add brief inline comments explaining the fix if needed

### Constraints

- Do not change functionality beyond fixing the identified issue
- Preserve existing variable names and function signatures unless they are part of the problem
- Ensure the fix is production-ready

---


Like Dislike Create Issue Jira

Comment on lines +54 to +60
<button
key={f.key}
role="tab"
aria-selected={activeFilter === f.key}
className={`filter-btn ${activeFilter === f.key ? 'active' : ''}`}
onClick={() => onFilterChange(f.key)}
>

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Functional Medium

The tab buttons are missing type="button", so using this component inside a form will submit the form when a filter is clicked; set an explicit button type.

Suggested fix
                <button
                  type="button"
                  key={f.key}
                  role="tab"
                  aria-selected={activeFilter === f.key}
                  className={`filter-btn ${activeFilter === f.key ? 'active' : ''}`}
                  onClick={() => onFilterChange(f.key)}
                >
Prompt for AI assistance

Copy the prompt below and paste it into ChatGPT, Claude, or any LLM:

You are an expert javascript developer with deep knowledge of security, performance, and best practices.

### Context

File: core/http/react-ui/src/components/FilterBar.jsx
Lines: 54-60
Issue Type: functional-medium
Severity: medium

Issue Description:
The tab buttons are missing `type="button"`, so using this component inside a form will submit the form when a filter is clicked; set an explicit button type.

Current Code:
                <button
                  key={f.key}
                  role="tab"
                  aria-selected={activeFilter === f.key}
                  className={`filter-btn ${activeFilter === f.key ? 'active' : ''}`}
                  onClick={() => onFilterChange(f.key)}
                >

---

### Instructions

1. Fix the issue described above
2. Maintain the exact indentation and code style from the original
3. Follow javascript best practices and language-specific idioms
4. Ensure the fix addresses the root cause, not just the symptoms
5. Add brief inline comments explaining the fix if needed

### Constraints

- Do not change functionality beyond fixing the identified issue
- Preserve existing variable names and function signatures unless they are part of the problem
- Ensure the fix is production-ready

---


Like Dislike Create Issue Jira

Comment on lines +114 to +115
{list.map(n => (
<tr key={n.node_id ?? n.NodeID ?? getName(n)}>

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Functional Medium

Rows use a fallback key of getName(n), which can collide for duplicate or empty node names and cause React to reuse the wrong row state, so use a guaranteed-unique fallback such as the map index when no ID is present.

Also reported at: core/http/react-ui/src/components/NodeDistributionChip.jsx L46–L55

Suggested fix
              {list.map((n, idx) => (
                <tr key={n.node_id ?? n.NodeID ?? `${getName(n)}-${idx}`}>
Prompt for AI assistance

Copy the prompt below and paste it into ChatGPT, Claude, or any LLM:

You are an expert javascript developer with deep knowledge of security, performance, and best practices.

### Context

File: core/http/react-ui/src/components/NodeDistributionChip.jsx
Lines: 114-115
Issue Type: functional-medium
Severity: medium

Issue Description:
Rows use a fallback key of getName(n), which can collide for duplicate or empty node names and cause React to reuse the wrong row state, so use a guaranteed-unique fallback such as the map index when no ID is present.

_Also reported at: `core/http/react-ui/src/components/NodeDistributionChip.jsx` L46–L55_

Current Code:
              {list.map(n => (
                <tr key={n.node_id ?? n.NodeID ?? getName(n)}>

---

### Instructions

1. Fix the issue described above
2. Maintain the exact indentation and code style from the original
3. Follow javascript best practices and language-specific idioms
4. Ensure the fix addresses the root cause, not just the symptoms
5. Add brief inline comments explaining the fix if needed

### Constraints

- Do not change functionality beyond fixing the identified issue
- Preserve existing variable names and function signatures unless they are part of the problem
- Ensure the fix is production-ready

---


Like Dislike Create Issue Jira


for _, node := range allNodes {
if node.Status != StatusHealthy {
if node.Status == StatusPending || node.Status == StatusOffline || node.Status == StatusDraining {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Functional Medium

ListBackends now queries every non-pending, non-offline, non-draining node, including non-backend workers that do not subscribe to backend subjects; skip non-backend node types here the same way enqueueAndDrainBackendOp does.

Suggested fix
		if node.Status == StatusPending || node.Status == StatusOffline || node.Status == StatusDraining {
			continue
		}
		if node.NodeType != "" && node.NodeType != NodeTypeBackend {
			continue
		}
Prompt for AI assistance

Copy the prompt below and paste it into ChatGPT, Claude, or any LLM:

You are an expert go developer with deep knowledge of security, performance, and best practices.

### Context

File: core/services/nodes/managers_distributed.go
Lines: 237-237
Issue Type: functional-medium
Severity: medium

Issue Description:
ListBackends now queries every non-pending, non-offline, non-draining node, including non-backend workers that do not subscribe to backend subjects; skip non-backend node types here the same way enqueueAndDrainBackendOp does.

Current Code:
		if node.Status == StatusPending || node.Status == StatusOffline || node.Status == StatusDraining {

---

### Instructions

1. Fix the issue described above
2. Maintain the exact indentation and code style from the original
3. Follow go best practices and language-specific idioms
4. Ensure the fix addresses the root cause, not just the symptoms
5. Add brief inline comments explaining the fix if needed

### Constraints

- Do not change functionality beyond fixing the identified issue
- Preserve existing variable names and function signatures unless they are part of the problem
- Ensure the fix is production-ready

---


Like Dislike Create Issue Jira

@codity-dm

codity-dm Bot commented May 20, 2026

Copy link
Copy Markdown

Security Scan Summary

Metric Value
Vulnerabilities Critical: 0
Overall Risk Clean
Files Scanned 21

No critical security issues detected

Scan completed in 43.5s

Security scan powered by Codity.ai

@codity-dm

codity-dm Bot commented May 20, 2026

Copy link
Copy Markdown

Policy Check Failed

✗ 3/3 policy checks failed:

• Need 2 more approval(s) (0/2) — comment LGTM or approve via review
• Missing ticket reference (expected: JIRA-, ENG-, #*)
• 20 code file(s) changed but no test files added


To merge this PR:

  1. Address the failed checks listed above
  2. Ensure branch protection requires the codity/policy-check status

Configure policies in your dashboard

@codity-dm

codity-dm Bot commented May 20, 2026

Copy link
Copy Markdown

PR Summary

What Changed

  • Fixed distributed-mode backend upgrade detection by routing checks through the active backend manager instead of empty local filesystem
  • Added durable intent queue for backend lifecycle operations (install/delete/upgrade) with automatic retry and dead-lettering
  • Built cluster drift visibility into the React UI with node distribution chips, version mismatch detection, and per-node status tracking

Key Changes by Area

Distributed Backend Management

  • PendingBackendOp table persists operations across restarts with exponential backoff retry (core/services/nodes/registry.go:107-130)
  • enqueueAndDrainBackendOp applies immediately on healthy nodes, queues for offline/unhealthy ones (core/services/nodes/managers_distributed.go:97-167)
  • drainPendingBackendOps background process retries due operations and marks nodes unhealthy on ErrNoResponders (core/services/nodes/reconciler.go:171-242)

Upgrade Detection

  • CheckUpgradesAgainst() accepts pre-aggregated SystemBackends for cluster-wide digest comparison (core/gallery/upgrade.go)
  • NodeDriftInfo and NodeBackendRef types track per-node version/digest tuples for drift detection
  • Workers now report Version, URI, Digest in backend metadata (core/cli/worker.go)

UI/UX

  • NodeDistributionChip shows inline chips (≤3 nodes) or summary popover with health/drift/offline counts (core/http/react-ui/src/components/NodeDistributionChip.jsx)
  • URL-persisted filters (mq, mf, bq, bf) preserve state across tab switches (core/http/react-ui/src/pages/Manage.jsx)
  • Auto-refresh models every 10s in distributed mode with "last synced" indicator

State Reconciliation

  • Split reconcileOnce into advisory-locked passes: replica scaling and state reconciliation with try-lock semantics (core/services/nodes/reconciler.go)
  • ModelProber interface with gRPC health checks detects orphaned models; probes only stale records (>2min) (core/services/nodes/reconciler.go)

Files Changed

File Changes Summary
core/application/distributed.go ReplicaReconciler receives Adapter and RegistrationToken for state reconciliation and health probe auth
core/application/startup.go Startup cleanup prunes malformed pending ops targeting agents or non-existent nodes
core/application/upgrade_checker.go Added backendManagerFn lazy getter; reduced initial check delay from 30s to 10s
core/cli/worker.go Worker reports Version, URI, Digest in backend metadata for drift detection
core/gallery/backends.go Added Nodes []NodeBackendRef to SystemBackend for per-node attribution
core/gallery/upgrade.go Added CheckUpgradesAgainst() with cluster drift detection; new NodeDriftInfo/NodeBackendRef types
core/gallery/upgrade_test.go 3 test cases for drift detection, unanimous cluster, empty-version path
core/http/react-ui/src/App.css Added utilities: .badge-accent, .cell-stack, .node-status, .stat-grid, .filter-bar-group, popover animations
core/http/react-ui/src/components/FilterBar.jsx Shared search/filter/toggle control strip with keyboard navigation and extensible slots
core/http/react-ui/src/components/NodeDistributionChip.jsx Reusable chip with inline/summary modes, health/drift/offline display, popover details
core/http/react-ui/src/components/Popover.jsx Accessible popover with auto-positioning, outside-click/Escape close, focus management
core/http/react-ui/src/pages/Manage.jsx URL-persisted filters, auto-refresh, NodeDistributionChip, upgrade banner, drift indicators, "Adopted" badge
core/http/react-ui/src/pages/Nodes.jsx StatCard with CSS variables, collapsible rows, "Register a new worker" button styling
core/http/routes/ui_api.go /api/models returns loaded_on array; emits registry-only entries for ghost models
core/services/advisorylock/keys.go Added KeyStateReconciler lock key for split reconciliation passes
core/services/galleryop/service.go Exposed BackendManager() getter for distributed upgrade checking
core/services/messaging/subjects.go Added Version, URI, Digest to NodeBackendInfo for cluster-wide metadata propagation
core/services/nodes/managers_distributed.go enqueueAndDrainBackendOp with intent queue; skips agent nodes; health-aware fan-out
core/services/nodes/reconciler.go Split reconciliation passes; drainPendingBackendOps; ModelProber with stale-record probing
core/services/nodes/reconciler_test.go fakeProber and tests for model probing behavior; health monitor test updates
core/services/nodes/registry.go PendingBackendOp table; upsert/list/record/count methods; startup pruning migration

Review Focus Areas

  • Retry logic in RecordPendingBackendOpFailure: Verify exponential backoff caps at 15m and max 10 attempts before dead-lettering
  • Ghost model detection: Confirm registry-only entries in /api/models don't break existing UI assumptions about model sources
  • Advisory lock splitting: Check that KeyStateReconciler try-lock prevents deadlock when replica scaling holds replica-reconciler lock

Architecture

Design Decisions

  • Intent queue over synchronous fan-out: Operations must survive node restarts and retry automatically. Tradeoff: increased database writes and complexity vs. reliability in partitioned clusters.
  • Split reconciliation locks: Prevents long-running state reconciliation from blocking urgent replica scaling. KeyStateReconciler uses try-lock to avoid queue buildup.
  • Stale-record probing only: Model health checks run only on records >2min old to avoid hammering workers with gRPC calls.

Scalability & Extensibility

  • NodeDistributionChip context prop (models vs backends) allows reuse across tabs with different data shapes
  • FilterBar extensible right-side slots support future tab-specific controls without component forks
  • Out of scope: Global operation ordering (install A before B); current queue is per-backend independent

Risks

  • Intentional: Dead-lettered operations after 10 attempts require manual intervention. Acceptable for v1; monitoring hook not implemented.
  • Unintentional: ErrNoResponders marking nodes unhealthy may over-mark during transient NATS hiccups. Review if health flapping occurs in production.
  • Intentional: Agent node filtering skips backend ops entirely. Agents must never subscribe to backend subjects by design.

Merge Status

NOT MERGEABLE — PR Score 8/100, below threshold (50)

  • [H1] 1 CRITICAL security vulnerability found (SEC-001)
  • [H4] PR quality score (8) is below merge floor (50)
  • [H5] 10 HIGH-severity inline review findings need resolution (threshold: 3)
  • [H6] Code quality raw score (14) is below merge floor (40)

Comment thread core/gallery/upgrade.go
Comment on lines +59 to +60
func CheckUpgradesAgainst(ctx context.Context, galleries []config.Gallery, systemState *system.SystemState, installedBackends SystemBackends) (map[string]UpgradeInfo, error) {
galleryBackends, err := AvailableBackends(galleries, systemState)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Functional High

The new ctx parameter is ignored, so upgrade checks can block indefinitely on gallery resolution or remote digest fetches; thread the context into the called functions or remove it until cancellation is supported.

Suggested fix
func CheckUpgradesAgainst(ctx context.Context, galleries []config.Gallery, systemState *system.SystemState, installedBackends SystemBackends) (map[string]UpgradeInfo, error) {
	_ = ctx
	galleryBackends, err := AvailableBackends(galleries, systemState)
Prompt for AI assistance

Copy the prompt below and paste it into ChatGPT, Claude, or any LLM:

You are an expert go developer with deep knowledge of security, performance, and best practices.

### Context

File: core/gallery/upgrade.go
Lines: 59-60
Issue Type: functional-high
Severity: high

Issue Description:
The new ctx parameter is ignored, so upgrade checks can block indefinitely on gallery resolution or remote digest fetches; thread the context into the called functions or remove it until cancellation is supported.

Current Code:
func CheckUpgradesAgainst(ctx context.Context, galleries []config.Gallery, systemState *system.SystemState, installedBackends SystemBackends) (map[string]UpgradeInfo, error) {
	galleryBackends, err := AvailableBackends(galleries, systemState)

---

### Instructions

1. Fix the issue described above
2. Maintain the exact indentation and code style from the original
3. Follow go best practices and language-specific idioms
4. Ensure the fix addresses the root cause, not just the symptoms
5. Add brief inline comments explaining the fix if needed

### Constraints

- Do not change functionality beyond fixing the identified issue
- Preserve existing variable names and function signatures unless they are part of the problem
- Ensure the fix is production-ready

---


Like Dislike Create Issue Jira

Comment on lines +97 to +162
func (d *DistributedBackendManager) enqueueAndDrainBackendOp(ctx context.Context, op, backend string, galleriesJSON []byte, apply func(node BackendNode) error) (BackendOpResult, error) {
allNodes, err := d.registry.List(ctx)
if err != nil {
return BackendOpResult{}, err
}
var errs []error

result := BackendOpResult{Nodes: make([]NodeOpStatus, 0, len(allNodes))}
for _, node := range allNodes {
// Pending nodes haven't been approved yet — no intent to apply.
if node.Status == StatusPending {
continue
}
// Backend lifecycle ops only make sense on backend-type workers.
// Agent workers don't subscribe to backend.install/delete/list, so
// enqueueing for them guarantees a forever-retrying row that the
// reconciler can never drain. Silently skip — they aren't consumers.
if node.NodeType != "" && node.NodeType != NodeTypeBackend {
continue
}
if err := d.registry.UpsertPendingBackendOp(ctx, node.ID, backend, op, galleriesJSON); err != nil {
xlog.Warn("Failed to enqueue backend op", "op", op, "node", node.Name, "backend", backend, "error", err)
result.Nodes = append(result.Nodes, NodeOpStatus{
NodeID: node.ID, NodeName: node.Name, Status: "error",
Error: fmt.Sprintf("enqueue failed: %v", err),
})
continue
}

if node.Status != StatusHealthy {
// Intent is recorded; reconciler will retry when the node recovers.
result.Nodes = append(result.Nodes, NodeOpStatus{
NodeID: node.ID, NodeName: node.Name, Status: "queued",
Error: fmt.Sprintf("node %s, will retry when healthy", node.Status),
})
continue
}
if _, delErr := d.adapter.DeleteBackend(node.ID, name); delErr != nil {
if errors.Is(delErr, nats.ErrNoResponders) {
// Node's NATS subscription is gone — likely restarted with a new ID.
// Mark it unhealthy so future fan-outs skip it.
xlog.Warn("No NATS responders for node, marking unhealthy", "node", node.Name, "nodeID", node.ID)
d.registry.MarkUnhealthy(context.Background(), node.ID)
continue

applyErr := apply(node)
if applyErr == nil {
// Find the row we just upserted and delete it; cheap but requires
// a lookup since UpsertPendingBackendOp doesn't return the ID.
if err := d.deletePendingRow(ctx, node.ID, backend, op); err != nil {
xlog.Debug("Failed to clear pending backend op after success", "error", err)
}
xlog.Warn("Failed to propagate backend deletion to worker", "node", node.Name, "backend", name, "error", delErr)
errs = append(errs, fmt.Errorf("node %s: %w", node.Name, delErr))
result.Nodes = append(result.Nodes, NodeOpStatus{
NodeID: node.ID, NodeName: node.Name, Status: "success",
})
continue
}

// Record failure for backoff. If it's an ErrNoResponders, the node's
// gone AWOL — mark unhealthy so the router stops picking it too.
errMsg := applyErr.Error()
if errors.Is(applyErr, nats.ErrNoResponders) {
xlog.Warn("No NATS responders for node, marking unhealthy", "node", node.Name, "nodeID", node.ID)
d.registry.MarkUnhealthy(ctx, node.ID)
}
if id, err := d.findPendingRow(ctx, node.ID, backend, op); err == nil {
_ = d.registry.RecordPendingBackendOpFailure(ctx, id, errMsg)
}
result.Nodes = append(result.Nodes, NodeOpStatus{
NodeID: node.ID, NodeName: node.Name, Status: "error", Error: errMsg,
})
}
return result, nil
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Functional High

DeleteBackend now always returns nil when some or all nodes fail because enqueueAndDrainBackendOp records per-node errors but never surfaces an aggregated error; return a joined error when any node status is "error".

Suggested fix
func (d *DistributedBackendManager) enqueueAndDrainBackendOp(ctx context.Context, op, backend string, galleriesJSON []byte, apply func(node BackendNode) error) (BackendOpResult, error) {
	allNodes, err := d.registry.List(ctx)
	if err != nil {
		return BackendOpResult{}, err
	}

	result := BackendOpResult{Nodes: make([]NodeOpStatus, 0, len(allNodes))}
	var errs []error
	for _, node := range allNodes {
		if node.Status == StatusPending {
			continue
		}
		if node.NodeType != "" && node.NodeType != NodeTypeBackend {
			continue
		}
		if err := d.registry.UpsertPendingBackendOp(ctx, node.ID, backend, op, galleriesJSON); err != nil {
			xlog.Warn("Failed to enqueue backend op", "op", op, "node", node.Name, "backend", backend, "error", err)
			result.Nodes = append(result.Nodes, NodeOpStatus{
				NodeID: node.ID, NodeName: node.Name, Status: "error",
				Error: fmt.Sprintf("enqueue failed: %v", err),
			})
			errs = append(errs, fmt.Errorf("node %s: enqueue failed: %w", node.Name, err))
			continue
		}

		if node.Status != StatusHealthy {
			result.Nodes = append(result.Nodes, NodeOpStatus{
				NodeID: node.ID, NodeName: node.Name, Status: "queued",
				Error: fmt.Sprintf("node %s, will retry when healthy", node.Status),
			})
			continue
		}

		applyErr := apply(node)
		if applyErr == nil {
			if err := d.deletePendingRow(ctx, node.ID, backend, op); err != nil {
				xlog.Debug("Failed to clear pending backend op after success", "error", err)
			}
			result.Nodes = append(result.Nodes, NodeOpStatus{
				NodeID: node.ID, NodeName: node.Name, Status: "success",
			})
			continue
		}

		errMsg := applyErr.Error()
		if errors.Is(applyErr, nats.ErrNoResponders) {
			xlog.Warn("No NATS responders for node, marking unhealthy", "node", node.Name, "nodeID", node.ID)
			d.registry.MarkUnhealthy(ctx, node.ID)
		}
		if id, err := d.findPendingRow(ctx, node.ID, backend, op); err == nil {
			_ = d.registry.RecordPendingBackendOpFailure(ctx, id, errMsg)
		}
		result.Nodes = append(result.Nodes, NodeOpStatus{
			NodeID: node.ID, NodeName: node.Name, Status: "error", Error: errMsg,
		})
		errs = append(errs, fmt.Errorf("node %s: %w", node.Name, applyErr))
	}
	return result, errors.Join(errs...)
}
Prompt for AI assistance

Copy the prompt below and paste it into ChatGPT, Claude, or any LLM:

You are an expert go developer with deep knowledge of security, performance, and best practices.

### Context

File: core/services/nodes/managers_distributed.go
Lines: 97-162
Issue Type: functional-high
Severity: high

Issue Description:
DeleteBackend now always returns nil when some or all nodes fail because enqueueAndDrainBackendOp records per-node errors but never surfaces an aggregated error; return a joined error when any node status is "error".

Current Code:
func (d *DistributedBackendManager) enqueueAndDrainBackendOp(ctx context.Context, op, backend string, galleriesJSON []byte, apply func(node BackendNode) error) (BackendOpResult, error) {
	allNodes, err := d.registry.List(ctx)
	if err != nil {
		return BackendOpResult{}, err
	}

	result := BackendOpResult{Nodes: make([]NodeOpStatus, 0, len(allNodes))}
	for _, node := range allNodes {
		// Pending nodes haven't been approved yet — no intent to apply.
		if node.Status == StatusPending {
			continue
		}
		// Backend lifecycle ops only make sense on backend-type workers.
		// Agent workers don't subscribe to backend.install/delete/list, so
		// enqueueing for them guarantees a forever-retrying row that the
		// reconciler can never drain. Silently skip — they aren't consumers.
		if node.NodeType != "" && node.NodeType != NodeTypeBackend {
			continue
		}
		if err := d.registry.UpsertPendingBackendOp(ctx, node.ID, backend, op, galleriesJSON); err != nil {
			xlog.Warn("Failed to enqueue backend op", "op", op, "node", node.Name, "backend", backend, "error", err)
			result.Nodes = append(result.Nodes, NodeOpStatus{
				NodeID: node.ID, NodeName: node.Name, Status: "error",
				Error: fmt.Sprintf("enqueue failed: %v", err),
			})
			continue
		}

		if node.Status != StatusHealthy {
			// Intent is recorded; reconciler will retry when the node recovers.
			result.Nodes = append(result.Nodes, NodeOpStatus{
				NodeID: node.ID, NodeName: node.Name, Status: "queued",
				Error: fmt.Sprintf("node %s, will retry when healthy", node.Status),
			})
			continue
		}

		applyErr := apply(node)
		if applyErr == nil {
			// Find the row we just upserted and delete it; cheap but requires
			// a lookup since UpsertPendingBackendOp doesn't return the ID.
			if err := d.deletePendingRow(ctx, node.ID, backend, op); err != nil {
				xlog.Debug("Failed to clear pending backend op after success", "error", err)
			}
			result.Nodes = append(result.Nodes, NodeOpStatus{
				NodeID: node.ID, NodeName: node.Name, Status: "success",
			})
			continue
		}

		// Record failure for backoff. If it's an ErrNoResponders, the node's
		// gone AWOL — mark unhealthy so the router stops picking it too.
		errMsg := applyErr.Error()
		if errors.Is(applyErr, nats.ErrNoResponders) {
			xlog.Warn("No NATS responders for node, marking unhealthy", "node", node.Name, "nodeID", node.ID)
			d.registry.MarkUnhealthy(ctx, node.ID)
		}
		if id, err := d.findPendingRow(ctx, node.ID, backend, op); err == nil {
			_ = d.registry.RecordPendingBackendOpFailure(ctx, id, errMsg)
		}
		result.Nodes = append(result.Nodes, NodeOpStatus{
			NodeID: node.ID, NodeName: node.Name, Status: "error", Error: errMsg,
		})
	}
	return result, nil
}

---

### Instructions

1. Fix the issue described above
2. Maintain the exact indentation and code style from the original
3. Follow go best practices and language-specific idioms
4. Ensure the fix addresses the root cause, not just the symptoms
5. Add brief inline comments explaining the fix if needed

### Constraints

- Do not change functionality beyond fixing the identified issue
- Preserve existing variable names and function signatures unless they are part of the problem
- Ensure the fix is production-ready

---


Like Dislike Create Issue Jira

Comment on lines +147 to +150
_, _ = advisorylock.TryWithLockCtx(ctx, rc.db, advisorylock.KeyStateReconciler, func() error {
rc.reconcileState(ctx)
return nil
})

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Robustness High

Ignoring the state-reconciler advisory lock errors hides database or lock failures; log or handle the returned errors so reconciliation does not silently stop running.

Suggested fix
		if acquired, err := advisorylock.TryWithLockCtx(ctx, rc.db, advisorylock.KeyStateReconciler, func() error {
			rc.reconcileState(ctx)
			return nil
		}); err != nil {
			xlog.Warn("Reconciler: state reconciliation lock failed", "error", err)
		} else if !acquired {
			xlog.Debug("Reconciler: state reconciliation skipped because lock is held")
		}
Prompt for AI assistance

Copy the prompt below and paste it into ChatGPT, Claude, or any LLM:

You are an expert go developer with deep knowledge of security, performance, and best practices.

### Context

File: core/services/nodes/reconciler.go
Lines: 147-150
Issue Type: robustness-high
Severity: high

Issue Description:
Ignoring the state-reconciler advisory lock errors hides database or lock failures; log or handle the returned errors so reconciliation does not silently stop running.

Current Code:
		_, _ = advisorylock.TryWithLockCtx(ctx, rc.db, advisorylock.KeyStateReconciler, func() error {
			rc.reconcileState(ctx)
			return nil
		})

---

### Instructions

1. Fix the issue described above
2. Maintain the exact indentation and code style from the original
3. Follow go best practices and language-specific idioms
4. Ensure the fix addresses the root cause, not just the symptoms
5. Add brief inline comments explaining the fix if needed

### Constraints

- Do not change functionality beyond fixing the identified issue
- Preserve existing variable names and function signatures unless they are part of the problem
- Ensure the fix is production-ready

---


Like Dislike Create Issue Jira

Comment on lines +28 to +33
func (g grpcModelProber) IsAlive(ctx context.Context, address string) bool {
client := grpcclient.NewClientWithToken(address, false, nil, false, g.token)
probeCtx, cancel := context.WithTimeout(ctx, 1*time.Second)
defer cancel()
ok, _ := client.HealthCheck(probeCtx)
return ok

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Robustness High

The prober treats every health-check error as a dead backend and deletes the model row; return and inspect the probe error so transient auth or network failures do not orphan healthy models.

Suggested fix
func (g grpcModelProber) IsAlive(ctx context.Context, address string) (bool, error) {
	client := grpcclient.NewClientWithToken(address, false, nil, false, g.token)
	probeCtx, cancel := context.WithTimeout(ctx, 1*time.Second)
	defer cancel()
	return client.HealthCheck(probeCtx)
}
Prompt for AI assistance

Copy the prompt below and paste it into ChatGPT, Claude, or any LLM:

You are an expert go developer with deep knowledge of security, performance, and best practices.

### Context

File: core/services/nodes/reconciler.go
Lines: 28-33
Issue Type: robustness-high
Severity: high

Issue Description:
The prober treats every health-check error as a dead backend and deletes the model row; return and inspect the probe error so transient auth or network failures do not orphan healthy models.

Current Code:
func (g grpcModelProber) IsAlive(ctx context.Context, address string) bool {
	client := grpcclient.NewClientWithToken(address, false, nil, false, g.token)
	probeCtx, cancel := context.WithTimeout(ctx, 1*time.Second)
	defer cancel()
	ok, _ := client.HealthCheck(probeCtx)
	return ok
}

---

### Instructions

1. Fix the issue described above
2. Maintain the exact indentation and code style from the original
3. Follow go best practices and language-specific idioms
4. Ensure the fix addresses the root cause, not just the symptoms
5. Add brief inline comments explaining the fix if needed

### Constraints

- Do not change functionality beyond fixing the identified issue
- Preserve existing variable names and function signatures unless they are part of the problem
- Ensure the fix is production-ready

---


Like Dislike Create Issue Jira

Comment on lines +273 to +283
if rc.prober.IsAlive(ctx, m.Address) {
// Bump updated_at so we don't probe this row again immediately.
_ = rc.registry.db.WithContext(ctx).Model(&NodeModel{}).
Where("id = ?", m.ID).Update("updated_at", time.Now()).Error
continue
}
if err := rc.registry.RemoveNodeModel(ctx, m.NodeID, m.ModelName); err != nil {
xlog.Warn("Reconciler: failed to remove unreachable model", "node", m.NodeID, "model", m.ModelName, "error", err)
continue
}
xlog.Warn("Reconciler: model unreachable, removed from registry",

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Robustness High

Deleting the registry row on any failed probe can remove healthy models during temporary transport or auth failures; only remove after a definitive unhealthy result and log probe errors separately.

Suggested fix
		alive, err := rc.prober.IsAlive(ctx, m.Address)
		if err != nil {
			xlog.Warn("Reconciler: probe failed", "node", m.NodeID, "model", m.ModelName, "address", m.Address, "error", err)
			continue
		}
		if alive {
			// Bump updated_at so we don't probe this row again immediately.
			_ = rc.registry.db.WithContext(ctx).Model(&NodeModel{}).
				Where("id = ?", m.ID).Update("updated_at", time.Now()).Error
			continue
		}
		if err := rc.registry.RemoveNodeModel(ctx, m.NodeID, m.ModelName); err != nil {
			xlog.Warn("Reconciler: failed to remove unreachable model", "node", m.NodeID, "model", m.ModelName, "error", err)
			continue
		}
		xlog.Warn("Reconciler: model unreachable, removed from registry",
			"node", m.NodeID, "model", m.ModelName, "address", m.Address)
Prompt for AI assistance

Copy the prompt below and paste it into ChatGPT, Claude, or any LLM:

You are an expert go developer with deep knowledge of security, performance, and best practices.

### Context

File: core/services/nodes/reconciler.go
Lines: 273-283
Issue Type: robustness-high
Severity: high

Issue Description:
Deleting the registry row on any failed probe can remove healthy models during temporary transport or auth failures; only remove after a definitive unhealthy result and log probe errors separately.

Current Code:
		if rc.prober.IsAlive(ctx, m.Address) {
			// Bump updated_at so we don't probe this row again immediately.
			_ = rc.registry.db.WithContext(ctx).Model(&NodeModel{}).
				Where("id = ?", m.ID).Update("updated_at", time.Now()).Error
			continue
		}
		if err := rc.registry.RemoveNodeModel(ctx, m.NodeID, m.ModelName); err != nil {
			xlog.Warn("Reconciler: failed to remove unreachable model", "node", m.NodeID, "model", m.ModelName, "error", err)
			continue
		}
		xlog.Warn("Reconciler: model unreachable, removed from registry",
			"node", m.NodeID, "model", m.ModelName, "address", m.Address)

---

### Instructions

1. Fix the issue described above
2. Maintain the exact indentation and code style from the original
3. Follow go best practices and language-specific idioms
4. Ensure the fix addresses the root cause, not just the symptoms
5. Add brief inline comments explaining the fix if needed

### Constraints

- Do not change functionality beyond fixing the identified issue
- Preserve existing variable names and function signatures unless they are part of the problem
- Ensure the fix is production-ready

---


Like Dislike Create Issue Jira

Comment thread core/gallery/backends.go
NodeStatus string `json:"node_status"` // healthy | unhealthy | offline | draining | pending
Version string `json:"version,omitempty"`
Digest string `json:"digest,omitempty"`
URI string `json:"uri,omitempty"`

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Security Medium

URI is added to per-node API output and metadata without any redaction, so registry credentials or internal endpoints embedded in backend URIs should be stripped before serialization.

Suggested fix
	// Store only a sanitized URI here to avoid exposing credentials or internal-only endpoints.
	URI         string `json:"uri,omitempty"`
Prompt for AI assistance

Copy the prompt below and paste it into ChatGPT, Claude, or any LLM:

You are an expert go developer with deep knowledge of security, performance, and best practices.

### Context

File: core/gallery/backends.go
Lines: 412-412
Issue Type: security-medium
Severity: medium

Issue Description:
URI is added to per-node API output and metadata without any redaction, so registry credentials or internal endpoints embedded in backend URIs should be stripped before serialization.

Current Code:
	URI         string `json:"uri,omitempty"`

---

### Instructions

1. Fix the issue described above
2. Maintain the exact indentation and code style from the original
3. Follow go best practices and language-specific idioms
4. Ensure the fix addresses the root cause, not just the symptoms
5. Add brief inline comments explaining the fix if needed

### Constraints

- Do not change functionality beyond fixing the identified issue
- Preserve existing variable names and function signatures unless they are part of the problem
- Ensure the fix is production-ready

---


Like Dislike Create Issue Jira

<tr key={n.node_id ?? n.NodeID ?? getName(n)}>
<td className="cell-mono">{getName(n)}</td>
<td>
<span className={`badge ${getStatus(n) === 'healthy' ? 'badge-success' : 'badge-warning'}`}>

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Functional Medium

Rows with status 'draining' are rendered as warning instead of info, so map draining to its own badge variant here as in the inline chip view.

Suggested fix
                    <span className={`badge ${getStatus(n) === 'healthy'
                      ? 'badge-success'
                      : getStatus(n) === 'draining'
                        ? 'badge-info'
                        : 'badge-warning'}`}>
Prompt for AI assistance

Copy the prompt below and paste it into ChatGPT, Claude, or any LLM:

You are an expert javascript developer with deep knowledge of security, performance, and best practices.

### Context

File: core/http/react-ui/src/components/NodeDistributionChip.jsx
Lines: 118-118
Issue Type: functional-medium
Severity: medium

Issue Description:
Rows with status 'draining' are rendered as warning instead of info, so map draining to its own badge variant here as in the inline chip view.

Current Code:
                    <span className={`badge ${getStatus(n) === 'healthy' ? 'badge-success' : 'badge-warning'}`}>

---

### Instructions

1. Fix the issue described above
2. Maintain the exact indentation and code style from the original
3. Follow javascript best practices and language-specific idioms
4. Ensure the fix addresses the root cause, not just the symptoms
5. Add brief inline comments explaining the fix if needed

### Constraints

- Do not change functionality beyond fixing the identified issue
- Preserve existing variable names and function signatures unless they are part of the problem
- Ensure the fix is production-ready

---


Like Dislike Create Issue Jira

Comment on lines +1060 to +1069
func (r *NodeRegistry) RecordPendingBackendOpFailure(ctx context.Context, id uint, errMsg string) error {
return r.db.WithContext(ctx).Transaction(func(tx *gorm.DB) error {
var row PendingBackendOp
if err := tx.First(&row, id).Error; err != nil {
return err
}
row.Attempts++
row.LastError = errMsg
row.NextRetryAt = time.Now().Add(backoffForAttempt(row.Attempts))
return tx.Save(&row).Error

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Security Medium

LastError stores arbitrary backend error text without any length limit, so a malicious or noisy peer can persist unbounded log-sized payloads; truncate or cap errMsg before saving.

Suggested fix
func (r *NodeRegistry) RecordPendingBackendOpFailure(ctx context.Context, id uint, errMsg string) error {
	const maxErrLen = 4096
	if len(errMsg) > maxErrLen {
		errMsg = errMsg[:maxErrLen]
	}
	return r.db.WithContext(ctx).Transaction(func(tx *gorm.DB) error {
		var row PendingBackendOp
		if err := tx.First(&row, id).Error; err != nil {
			return err
		}
		row.Attempts++
		row.LastError = errMsg
		row.NextRetryAt = time.Now().Add(backoffForAttempt(row.Attempts))
		return tx.Save(&row).Error
	})
}
Prompt for AI assistance

Copy the prompt below and paste it into ChatGPT, Claude, or any LLM:

You are an expert go developer with deep knowledge of security, performance, and best practices.

### Context

File: core/services/nodes/registry.go
Lines: 1060-1069
Issue Type: security-medium
Severity: medium

Issue Description:
LastError stores arbitrary backend error text without any length limit, so a malicious or noisy peer can persist unbounded log-sized payloads; truncate or cap errMsg before saving.

Current Code:
func (r *NodeRegistry) RecordPendingBackendOpFailure(ctx context.Context, id uint, errMsg string) error {
	return r.db.WithContext(ctx).Transaction(func(tx *gorm.DB) error {
		var row PendingBackendOp
		if err := tx.First(&row, id).Error; err != nil {
			return err
		}
		row.Attempts++
		row.LastError = errMsg
		row.NextRetryAt = time.Now().Add(backoffForAttempt(row.Attempts))
		return tx.Save(&row).Error
	})
}

---

### Instructions

1. Fix the issue described above
2. Maintain the exact indentation and code style from the original
3. Follow go best practices and language-specific idioms
4. Ensure the fix addresses the root cause, not just the symptoms
5. Add brief inline comments explaining the fix if needed

### Constraints

- Do not change functionality beyond fixing the identified issue
- Preserve existing variable names and function signatures unless they are part of the problem
- Ensure the fix is production-ready

---


Like Dislike Create Issue Jira

@codity-dm

codity-dm Bot commented May 20, 2026

Copy link
Copy Markdown

Security Scan Summary

Metric Value
Vulnerabilities Critical: 0
Overall Risk Clean
Files Scanned 21

No critical security issues detected

Scan completed in 47.3s

Security scan powered by Codity.ai

@codity-dm

codity-dm Bot commented May 20, 2026

Copy link
Copy Markdown

Dependency vulnerability scanning

Metric Value
Vulnerabilities Found 6
Scanner npm audit
View vulnerability details (6 items)

1. brace-expansion various

CVE: GHSA-f886-m6hf-6m8v
Severity: MODERATE
Fixed in: True

brace-expansion: Zero-step sequence causes process hang and memory exhaustion


2. express-rate-limit various

CVE: N/A
Severity: MODERATE
Fixed in: True

Vulnerability in express-rate-limit


3. fast-uri various

CVE: GHSA-q3j6-qgpj-74h6, GHSA-v39h-62p7-jpjc
Severity: HIGH
Fixed in: True

fast-uri vulnerable to path traversal via percent-encoded dot segments


4. hono various

CVE: GHSA-qp7p-654g-cw7p, GHSA-hm8q-7f3q-5f36, GHSA-p77w-8qqv-26rm, GHSA-9vqf-7f2p-gf9v, GHSA-69xw-7hcm-h432
Severity: MODERATE
Fixed in: True

Hono has CSS Declaration Injection via Style Object Values in JSX SSR


5. ip-address various

CVE: GHSA-v2v4-37r5-5v8g
Severity: MODERATE
Fixed in: True

ip-address has XSS in Address6 HTML-emitting methods


6. postcss various

CVE: GHSA-qx2v-qp2m-jg93
Severity: MODERATE
Fixed in: True

PostCSS has XSS via Unescaped </style> in its CSS Stringify Output

Powered by Codity.ai · Docs

@codity-dm

codity-dm Bot commented May 20, 2026

Copy link
Copy Markdown

Dependency vulnerability scanning

Metric Value
Vulnerabilities Found 4
Scanner pip-audit
View vulnerability details (4 items)

1. pip 24.0

CVE: GHSA-4xh5-x5gv-qwph
Fixed in: 25.3

When extracting a tar archive pip may not check symbolic links point into the extraction directory if the tarfile module doesn't implement PEP 706. Note that upgrading pip to a "fixed" version for thi


2. pip 24.0

CVE: GHSA-6vgw-5pg2-w6jp
Fixed in: 26.0

When pip is installing and extracting a maliciously crafted wheel archive, files may be extracted outside the installation directory. The path traversal is limited to prefixes of the installation dire


3. pip 24.0

CVE: GHSA-58qw-9mgm-455v
Fixed in:

pip handles concatenated tar and ZIP files as ZIP files regardless of filename or whether a file is both a tar and ZIP file. This behavior could result in confusing installation behavior, such as inst


4. pip 24.0

CVE: GHSA-jp4c-xjxw-mgf9
Fixed in: 26.1

pip prior to version 26.1 would run self-update check functionality after installing wheel files which required importing well-known Python modules names. These module imports were intentionally defer

Powered by Codity.ai · Docs

@codity-dm

codity-dm Bot commented May 20, 2026

Copy link
Copy Markdown

License Compliance Scan

Metric Value
Packages Scanned 778
High Risk (Strong Copyleft) 0
Medium Risk (Weak Copyleft) 6
Low Risk (Permissive) 677
Unknown License 95

Weak copyleft licenses found - verify compatibility

Some packages have unknown licenses - manual review required

Medium Risk Licenses - 6 packages

(MPL-2.0 OR Apache-2.0) (1 packages):

  • dompurify 3.4.0

MPL-2.0 (5 packages):

  • github.com/philippgille/chromem-go 0.7.0
  • github.com/libp2p/go-yamux/v5 5.0.1
  • github.com/hashicorp/golang-lru 1.0.2
  • github.com/hashicorp/golang-lru/v2 2.0.7
  • github.com/shoenig/go-m1cpu 0.1.6
Unknown Licenses - 95 packages
  • github.com/cloudflare/circl 1.6.1
  • github.com/bwmarrin/discordgo 0.29.0
  • github.com/dslipak/pdf 0.0.2
  • github.com/cyphar/filepath-securejoin 0.5.1
  • github.com/eritikass/githubmarkdownconvertergo 0.1.10
  • github.com/go-git/go-git/v5 5.16.4
  • github.com/go-git/go-billy/v5 5.6.2
  • github.com/go-git/gcfg 1.5.1-0.20230307220236-3a3c6141e376
  • github.com/gocolly/colly 1.2.0
  • github.com/golang/protobuf 1.5.4
  • github.com/google/go-querystring 1.1.0
  • github.com/google/go-github/v69 69.2.0
  • github.com/kennygrant/sanitize 1.2.4
  • github.com/mschoch/smat 0.2.0
  • github.com/mudler/skillserver 0.0.6
  • github.com/olekukonko/tablewriter 0.0.5
  • github.com/pjbgf/sha1cd 0.3.2
  • github.com/segmentio/asm 1.1.3
  • github.com/saintfish/chardet 0.0.0-20230101081208-5e3ef4b5456d
  • github.com/sergi/go-diff 1.4.0

...and 75 more

Powered by Codity.ai · Docs

@codity-dm

codity-dm Bot commented May 20, 2026

Copy link
Copy Markdown

Code Quality Report — test-org-codity/LocalAI · PR #7

Scanned: 2026-05-20 17:49 UTC | Score: 14/100 | Provider: github

Executive Summary

Severity Count
Critical 0
High 1
Medium 4
Low 211
Top Findings

[CQ-LLM-003] core/application/upgrade_checker.go:146 (Error_Handling · HIGH)

Issue: Potentially swallowing errors when backendManagerFn is nil.
Suggestion: Ensure that errors are logged or handled appropriately when backendManagerFn is nil.

if uc.backendManagerFn != nil {

[CQ-LLM-001] core/application/distributed.go:243 (Complexity · MEDIUM)

Issue: The function initDistributed has increased cyclomatic complexity due to additional parameters and logic.
Suggestion: Consider breaking down the function into smaller, more manageable functions.

func initDistributed(cfg *config.ApplicationConfig, authDB *gorm.DB) (*Distribut

[CQ-LLM-002] core/application/distributed.go:243 (Documentation · MEDIUM)

Issue: Missing detailed documentation for the new parameters in the ReplicaReconciler.
Suggestion: Add JSDoc comments explaining the purpose of each new parameter.

// Create ReplicaReconciler for auto-scaling model replicas. Adapter +

[CQ-LLM-004] core/application/upgrade_checker.go:146 (Performance · MEDIUM)

Issue: The runCheck function may lead to performance issues due to unnecessary checks if backendManagerFn is nil.
Suggestion: Optimize the logic to avoid unnecessary calls when backendManagerFn is not set.

if uc.backendManagerFn != nil {

[CQ-LLM-006] core/gallery/upgrade.go:24 (Documentation · MEDIUM)

Issue: Missing documentation for the new NodeDrift field in UpgradeInfo.
Suggestion: Add a comment explaining the purpose of NodeDrift.

// NodeDrift lists nodes whose installed version or digest differs from

[CQ-008] core/application/distributed.go:256 (Maintainability · LOW)

Issue: Magic number 30 in code
Suggestion: Extract to a named constant

Interval:          30 * time.Second,

[CQ-009] core/application/upgrade_checker.go:53 (Style · LOW)

Issue: Line exceeds 120 characters (163 chars)
Suggestion: Break long lines into multiple lines for readability

func NewUpgradeChecker(appConfig *config.ApplicationConfig, ml *model.ModelLoader, db *gorm.DB, backendManagerFn func() ...

[CQ-008] core/application/upgrade_checker.go:81 (Maintainability · LOW)

Issue: Magic number 10 in code
Suggestion: Extract to a named constant

initialDelay := 10 * time.Second

[CQ-LLM-005] core/gallery/backends.go:395 (Maintainability · LOW)

Issue: The use of magic strings for NodeStatus can lead to maintainability issues.
Suggestion: Consider using constants for the different node statuses.

NodeStatus  string `json:"node_status"` // healthy | unhealthy | offline | draining | pending

[CQ-009] core/gallery/upgrade.go:59 (Style · LOW)

Issue: Line exceeds 120 characters (175 chars)
Suggestion: Break long lines into multiple lines for readability

func CheckUpgradesAgainst(ctx context.Context, galleries []config.Gallery, systemState *system.SystemState, installedBac...

Per-File Breakdown

File Critical High Medium Low Total
core/application/distributed.go 0 0 2 1 3
core/application/upgrade_checker.go 0 1 1 2 4
core/gallery/backends.go 0 0 0 1 1
core/gallery/upgrade.go 0 0 1 1 2
core/http/react-ui/src/App.css 0 0 0 14 14
core/http/react-ui/src/components/FilterBar.jsx 0 0 0 1 1
core/http/react-ui/src/components/NodeDistributionChip.jsx 0 0 0 11 11
core/http/react-ui/src/components/Popover.jsx 0 0 0 2 2
core/http/react-ui/src/pages/Manage.jsx 0 0 0 122 122
core/http/react-ui/src/pages/Nodes.jsx 0 0 0 36 36
core/services/advisorylock/keys.go 0 0 0 1 1
core/services/nodes/managers_distributed.go 0 0 0 1 1
core/services/nodes/reconciler.go 0 0 0 2 2
core/services/nodes/reconciler_test.go 0 0 0 9 9
core/services/nodes/registry.go 0 0 0 7 7

Recommendations

  1. Resolve High severity issues, especially error handling gaps and performance bottlenecks.
  • Run automated tests after applying fixes to verify no regressions.

@DhirenMhatre

Copy link
Copy Markdown
Author

@codity review

@codity-dm

codity-dm Bot commented May 24, 2026

Copy link
Copy Markdown

Policy Check Failed

✗ 3/3 policy checks failed:

• Need 2 more approval(s) (0/2) — comment LGTM or approve via review
• Missing ticket reference (expected: JIRA-, ENG-, #*)
• 20 code file(s) changed but no test files added


To merge this PR:

  1. Address the failed checks listed above
  2. Ensure branch protection requires the codity/policy-check status

Configure policies in your dashboard

@codity-dm

codity-dm Bot commented May 24, 2026

Copy link
Copy Markdown

PR Summary

What Changed

  • Fixed distributed-mode upgrade detection by routing checks through the active backend manager instead of empty local filesystem.
  • Added durable pending_backend_ops queue for backend lifecycle operations (install/upgrade/delete) with retry logic and dead-lettering.
  • Built cluster drift visibility into the UI: per-node version comparison, ghost model cleanup, and node distribution chips.

Key Changes by Area

Backend Lifecycle

  • Replaced fire-and-forget operations with persistent queue (pending_backend_ops table) and exponential backoff retry.
  • Added CheckUpgradesAgainst() for cluster-wide upgrade detection with NodeDrift outlier reporting.

Distributed Coordination

  • ReplicaReconciler now health-probes stale models and drains pending backend ops under advisory lock KeyStateReconciler.

Frontend

  • Added NodeDistributionChip component showing model/backend distribution across nodes with drift indicators.
  • Added URL-persisted filters, auto-refresh, and upgrade banners in Manage/Nodes pages.

Files Changed

File Changes Summary
core/application/distributed.go ReplicaReconciler receives auth config for worker health probes
core/application/startup.go Startup pruning of malformed pending backend ops
core/application/upgrade_checker.go Added backendManagerFn lazy getter for distributed routing
core/cli/worker.go Worker reports Version, URI, Digest in lifecycle events
core/gallery/backends.go Extended SystemBackend with Nodes []NodeBackendRef for attribution
core/gallery/upgrade.go Added CheckUpgradesAgainst() with drift detection
core/gallery/upgrade_test.go Tests for drift detection, alignment, empty versions
core/http/react-ui/src/App.css Utility classes for badges, tables, popovers, status indicators
core/http/react-ui/src/components/FilterBar.jsx Reusable search + filter + toggle control strip
core/http/react-ui/src/components/NodeDistributionChip.jsx Smart distribution display with popover for large clusters
core/http/react-ui/src/components/Popover.jsx Accessible popover with auto-positioning and focus management
core/http/react-ui/src/pages/Manage.jsx Drift badges, filters, auto-refresh, upgrade actions
core/http/react-ui/src/pages/Nodes.jsx Stat cards, backend distribution column, worker registration UX
core/http/routes/ui_api.go Added LoadedOn and Source=registry-only for ghost models
core/services/advisorylock/keys.go Added KeyStateReconciler advisory lock
core/services/galleryop/service.go Exposed BackendManager() getter for upgrade checker
core/services/messaging/subjects.go Extended NodeBackendInfo with version/digest/URI
core/services/nodes/managers_distributed.go enqueueAndDrainBackendOp with durable queue integration
core/services/nodes/reconciler.go probeLoadedModels and drainPendingBackendOps with stale probing
core/services/nodes/reconciler_test.go Tests for ghost detection and health probe behavior
core/services/nodes/registry.go PendingBackendOp model with upsert/list/retry/dead-letter methods

Review Focus Areas

  • Dead-letter logic in RecordPendingBackendOpFailure: verify 10-attempt cap and 15m backoff ceiling.
  • Advisory lock splitting: confirm KeyStateReconciler prevents stampede without deadlocking replica scaling.
  • probeStaleAfter default (2m): check this balances ghost cleanup vs. probe overhead for your cluster size.

Architecture

Design Decisions

  • Queue-based operations trade immediate feedback for durability. UI shows "pending on N nodes" to compensate.
  • Separate advisory locks for state reconciliation vs. replica scaling: intentional to allow concurrent scaling during queue drainage.
  • Lazy backendManagerFn in upgrade checker: avoids circular dependency between gallery service and distributed manager initialization.

Scalability & Extensibility

  • NodeDistributionChip collapses at >3 nodes: threshold is hardcoded, may need tuning for dense clusters.
  • Out of scope: cross-region backend replication, pluggable drift policies.

Risks

  • Intentional: Ghost model probing adds gRPC load; mitigated by 2m staleness threshold but no rate limiting between reconciler runs.
  • Unintentional: UpsertPendingBackendOp composite index on (node_id, backend, op) may hotspot on popular backends during mass upgrades.

Merge Status

NOT MERGEABLE — PR Score 9/100, below threshold (50)

  • [H1] 1 CRITICAL security vulnerability found (SEC-001)
  • [H4] PR quality score (9) is below merge floor (50)
  • [H5] 10 HIGH-severity inline review findings need resolution (threshold: 3)
  • [H6] Code quality raw score (15) is below merge floor (40)

Comment on lines +64 to +71
useEffect(() => {
if (!open && anchor?.current) {
// requestAnimationFrame so the close is painted before focus jumps;
// otherwise screen readers announce the trigger mid-transition.
const raf = requestAnimationFrame(() => anchor.current?.focus?.())
return () => cancelAnimationFrame(raf)
}
}, [open, anchor])

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Functional High

The focus-restoration useEffect fires on initial mount, not only when the popover closes. When the component first mounts open is false and anchor.current is already set (React attaches refs in the commit phase, before passive effects run). The condition !open && anchor?.current is therefore true on the very first render, so requestAnimationFrame(() => anchor.current?.focus?.()) is scheduled immediately, stealing keyboard focus from whatever the user was doing. Every NodeDistributionChip rendered in summary mode (4+ nodes) triggers this on page load, causing the last chip's trigger button to unexpectedly receive focus.

Suggested fix
  const popoverWasOpenRef = useRef(false)
  useEffect(() => {
    const wasOpen = popoverWasOpenRef.current
    popoverWasOpenRef.current = open
    if (!open && wasOpen && anchor?.current) {
      const raf = requestAnimationFrame(() => anchor.current?.focus?.())
      return () => cancelAnimationFrame(raf)
    }
  }, [open, anchor])
Prompt for AI assistance

Copy the prompt below and paste it into ChatGPT, Claude, or any LLM:

You are an expert javascript developer with deep knowledge of security, performance, and best practices.

### Context

File: core/http/react-ui/src/components/Popover.jsx
Lines: 64-71
Issue Type: functional-high
Severity: high

Issue Description:
The focus-restoration `useEffect` fires on **initial mount**, not only when the popover closes. When the component first mounts `open` is `false` and `anchor.current` is already set (React attaches refs in the commit phase, before passive effects run). The condition `!open && anchor?.current` is therefore `true` on the very first render, so `requestAnimationFrame(() => anchor.current?.focus?.())` is scheduled immediately, stealing keyboard focus from whatever the user was doing. Every `NodeDistributionChip` rendered in summary mode (4+ nodes) triggers this on page load, causing the last chip's trigger button to unexpectedly receive focus.

Current Code:
  useEffect(() => {
    if (!open && anchor?.current) {
      // requestAnimationFrame so the close is painted before focus jumps;
      // otherwise screen readers announce the trigger mid-transition.
      const raf = requestAnimationFrame(() => anchor.current?.focus?.())
      return () => cancelAnimationFrame(raf)
    }
  }, [open, anchor])

---

### Instructions

1. Fix the issue described above
2. Maintain the exact indentation and code style from the original
3. Follow javascript best practices and language-specific idioms
4. Ensure the fix addresses the root cause, not just the symptoms
5. Add brief inline comments explaining the fix if needed

### Constraints

- Do not change functionality beyond fixing the identified issue
- Preserve existing variable names and function signatures unless they are part of the problem
- Ensure the fix is production-ready

---


Like Dislike Create Issue

role="tab"
aria-selected={activeFilter === f.key}
className={`filter-btn ${activeFilter === f.key ? 'active' : ''}`}
onClick={() => onFilterChange(f.key)}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Robustness Medium

onFilterChange is invoked unconditionally inside the onClick handler whenever hasFilters is true, but the prop has no default value and is not declared as required. If a caller passes a non-empty filters array while omitting onFilterChange (a realistic mistake when wiring up a read-only filter strip), clicking any chip throws TypeError: onFilterChange is not a function. The optional-chaining call operator ?.() is the idiomatic zero-cost guard here.

Suggested change
onClick={() => onFilterChange(f.key)}
onClick={() => onFilterChange?.(f.key)}
Prompt for AI assistance

Copy the prompt below and paste it into ChatGPT, Claude, or any LLM:

You are an expert javascript developer with deep knowledge of security, performance, and best practices.

### Context

File: core/http/react-ui/src/components/FilterBar.jsx
Lines: 59-59
Issue Type: robustness-medium
Severity: medium

Issue Description:
`onFilterChange` is invoked unconditionally inside the `onClick` handler whenever `hasFilters` is true, but the prop has no default value and is not declared as required. If a caller passes a non-empty `filters` array while omitting `onFilterChange` (a realistic mistake when wiring up a read-only filter strip), clicking any chip throws `TypeError: onFilterChange is not a function`. The optional-chaining call operator `?.()` is the idiomatic zero-cost guard here.

Current Code:
                  onClick={() => onFilterChange(f.key)}

---

### Instructions

1. Fix the issue described above
2. Maintain the exact indentation and code style from the original
3. Follow javascript best practices and language-specific idioms
4. Ensure the fix addresses the root cause, not just the symptoms
5. Add brief inline comments explaining the fix if needed

### Constraints

- Do not change functionality beyond fixing the identified issue
- Preserve existing variable names and function signatures unless they are part of the problem
- Ensure the fix is production-ready

---


Like Dislike Create Issue

Comment on lines +269 to +278
setUpgradingAll(true)
try {
// Serial upgrade — matches the gallery's Upgrade All behavior.
// Each backend upgrade is itself a cluster-wide fan-out, so parallel
// calls would multiply load on every worker.
for (const name of names) {
try {
await backendsApi.upgrade(name)
} catch (err) {
addToast(`Upgrade failed for ${name}: ${err.message}`, 'error')

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Functional Medium

handleUpgradeAll shows the success info-toast unconditionally after the loop, even when every individual upgrade threw an error. An operator who clicks "Upgrade all" and gets N error toasts followed by "Upgrade started for N backends" receives conflicting feedback and has no reliable way to know whether any upgrade was actually initiated. Additionally err.message is undefined for non-Error rejections (e.g. a plain string throw) so the per-upgrade error toasts can display "Upgrade failed for foo: undefined".

Suggested fix
      let succeeded = 0
      for (const name of names) {
        try {
          await backendsApi.upgrade(name)
          succeeded++
        } catch (err) {
          addToast(`Upgrade failed for ${name}: ${err.message ?? err}`, 'error')
        }
      }
      if (succeeded > 0) {
        addToast(`Upgrade started for ${succeeded} backend${succeeded === 1 ? '' : 's'}`, 'info')
      }
Prompt for AI assistance

Copy the prompt below and paste it into ChatGPT, Claude, or any LLM:

You are an expert javascript developer with deep knowledge of security, performance, and best practices.

### Context

File: core/http/react-ui/src/pages/Manage.jsx
Lines: 269-278
Issue Type: functional-medium
Severity: medium

Issue Description:
`handleUpgradeAll` shows the success info-toast unconditionally after the loop, even when every individual upgrade threw an error. An operator who clicks "Upgrade all" and gets N error toasts followed by "Upgrade started for N backends" receives conflicting feedback and has no reliable way to know whether any upgrade was actually initiated. Additionally `err.message` is `undefined` for non-`Error` rejections (e.g. a plain string `throw`) so the per-upgrade error toasts can display "Upgrade failed for foo: undefined".

Current Code:
      for (const name of names) {
        try {
          await backendsApi.upgrade(name)
        } catch (err) {
          addToast(`Upgrade failed for ${name}: ${err.message}`, 'error')
        }
      }
      addToast(`Upgrade started for ${names.length} backend${names.length === 1 ? '' : 's'}`, 'info')

---

### Instructions

1. Fix the issue described above
2. Maintain the exact indentation and code style from the original
3. Follow javascript best practices and language-specific idioms
4. Ensure the fix addresses the root cause, not just the symptoms
5. Add brief inline comments explaining the fix if needed

### Constraints

- Do not change functionality beyond fixing the identified issue
- Preserve existing variable names and function signatures unless they are part of the problem
- Ensure the fix is production-ready

---


Like Dislike Create Issue

@codity-dm

codity-dm Bot commented May 24, 2026

Copy link
Copy Markdown

Security Scan Summary

Metric Value
Vulnerabilities Critical: 0
Overall Risk Clean
Files Scanned 21

No critical security issues detected

Scan completed in 40.6s

Security scan powered by Codity.ai

@codity-dm

codity-dm Bot commented May 24, 2026

Copy link
Copy Markdown

Dependency vulnerability scanning

Metric Value
Vulnerabilities Found 7
Scanner npm audit
View vulnerability details (7 items)

1. brace-expansion various

CVE: GHSA-f886-m6hf-6m8v
Severity: MODERATE
Fixed in: True

brace-expansion: Zero-step sequence causes process hang and memory exhaustion


2. express-rate-limit various

CVE: N/A
Severity: MODERATE
Fixed in: True

Vulnerability in express-rate-limit


3. fast-uri various

CVE: GHSA-q3j6-qgpj-74h6, GHSA-v39h-62p7-jpjc
Severity: HIGH
Fixed in: True

fast-uri vulnerable to path traversal via percent-encoded dot segments


4. hono various

CVE: GHSA-qp7p-654g-cw7p, GHSA-hm8q-7f3q-5f36, GHSA-p77w-8qqv-26rm, GHSA-9vqf-7f2p-gf9v, GHSA-69xw-7hcm-h432
Severity: MODERATE
Fixed in: True

Hono has CSS Declaration Injection via Style Object Values in JSX SSR


5. ip-address various

CVE: GHSA-v2v4-37r5-5v8g
Severity: MODERATE
Fixed in: True

ip-address has XSS in Address6 HTML-emitting methods


6. postcss various

CVE: GHSA-qx2v-qp2m-jg93
Severity: MODERATE
Fixed in: True

PostCSS has XSS via Unescaped </style> in its CSS Stringify Output


7. qs various

CVE: GHSA-q8mj-m7cp-5q26
Severity: MODERATE
Fixed in: True

qs has a remotely triggerable DoS: qs.stringify crashes with TypeError on null/undefined entries in comma-format arrays when encodeValuesOnly is set

Powered by Codity.ai · Docs

@codity-dm

codity-dm Bot commented May 24, 2026

Copy link
Copy Markdown

Dependency vulnerability scanning

Metric Value
Vulnerabilities Found 4
Scanner pip-audit
View vulnerability details (4 items)

1. pip 24.0

CVE: GHSA-4xh5-x5gv-qwph
Fixed in: 25.3

When extracting a tar archive pip may not check symbolic links point into the extraction directory if the tarfile module doesn't implement PEP 706. Note that upgrading pip to a "fixed" version for thi


2. pip 24.0

CVE: GHSA-6vgw-5pg2-w6jp
Fixed in: 26.0

When pip is installing and extracting a maliciously crafted wheel archive, files may be extracted outside the installation directory. The path traversal is limited to prefixes of the installation dire


3. pip 24.0

CVE: GHSA-58qw-9mgm-455v
Fixed in: 26.1

pip handles concatenated tar and ZIP files as ZIP files regardless of filename or whether a file is both a tar and ZIP file. This behavior could result in confusing installation behavior, such as inst


4. pip 24.0

CVE: GHSA-jp4c-xjxw-mgf9
Fixed in: 26.1

pip prior to version 26.1 would run self-update check functionality after installing wheel files which required importing well-known Python modules names. These module imports were intentionally defer

Powered by Codity.ai · Docs

@codity-dm

codity-dm Bot commented May 24, 2026

Copy link
Copy Markdown

License Compliance Scan

Metric Value
Packages Scanned 778
High Risk (Strong Copyleft) 0
Medium Risk (Weak Copyleft) 6
Low Risk (Permissive) 646
Unknown License 126

Weak copyleft licenses found - verify compatibility

Some packages have unknown licenses - manual review required

Medium Risk Licenses - 6 packages

(MPL-2.0 OR Apache-2.0) (1 packages):

  • dompurify 3.4.0

MPL-2.0 (5 packages):

  • github.com/philippgille/chromem-go 0.7.0
  • github.com/libp2p/go-yamux/v5 5.0.1
  • github.com/hashicorp/golang-lru 1.0.2
  • github.com/hashicorp/golang-lru/v2 2.0.7
  • github.com/shoenig/go-m1cpu 0.1.6
Unknown Licenses - 126 packages
  • github.com/moby/moby/api 1.54.1
  • github.com/moby/moby/client 0.4.0
  • github.com/nats-io/nkeys 0.4.15
  • github.com/nats-io/nuid 1.0.1
  • github.com/wk8/go-ordered-map/v2 2.1.8
  • github.com/ProtonMail/go-crypto 1.1.6
  • github.com/PuerkitoBio/goquery 1.10.3
  • github.com/RoaringBitmap/roaring/v2 2.4.5
  • github.com/andybalholm/cascadia 1.3.3
  • github.com/antchfx/htmlquery 1.3.4
  • github.com/antchfx/xmlquery 1.4.4
  • github.com/antchfx/xpath 1.3.4
  • github.com/bits-and-blooms/bitset 1.24.0
  • github.com/blevesearch/bleve/v2 2.5.7
  • github.com/blevesearch/bleve_index_api 1.2.11
  • github.com/blevesearch/geo 0.2.4
  • github.com/blevesearch/go-porterstemmer 1.0.3
  • github.com/blevesearch/gtreap 0.1.1
  • github.com/blevesearch/mmap-go 1.0.4
  • github.com/blevesearch/scorch_segment_api/v2 2.3.13

...and 106 more

Powered by Codity.ai · Docs

@codity-dm

codity-dm Bot commented May 24, 2026

Copy link
Copy Markdown

Code Quality Report — test-org-codity/LocalAI · PR #7

Scanned: 2026-05-24 20:20 UTC | Score: 15/100 | Provider: github

Executive Summary

Severity Count
Critical 0
High 1
Medium 3
Low 211
Top Findings

[CQ-LLM-003] core/gallery/upgrade.go:25 (Error_Handling · HIGH)

Issue: The CheckBackendUpgrades function does not handle the case where ListSystemBackends returns a nil result.
Suggestion: Add error handling to ensure that the function gracefully handles nil results from ListSystemBackends.

installed, err := ListSystemBackends(systemState)

[CQ-LLM-001] core/application/distributed.go:243 (Complexity · MEDIUM)

Issue: The function initDistributed has increased cyclomatic complexity due to the addition of multiple parameters and logic.
Suggestion: Consider breaking down the function into smaller, more manageable functions to reduce complexity.

func initDistributed(cfg *config.ApplicationConfig, authDB *gorm.DB) (*Distribut

[CQ-LLM-004] core/application/upgrade_checker.go:41 (Documentation · MEDIUM)

Issue: Missing documentation for the backendManagerFn parameter in NewUpgradeChecker.
Suggestion: Add a docstring to explain the purpose of the backendManagerFn parameter.

func NewUpgradeChecker(appConfig *config.ApplicationConfig, ml *model.ModelLoader, db *gorm.DB, backendManagerFn func() galleryop.BackendManager) *UpgradeChecker {

[CQ-LLM-002] core/application/upgrade_checker.go:66 (Duplication · MEDIUM)

Issue: The logic for checking upgrades is duplicated in the runCheck method, which could lead to maintenance issues.
Suggestion: Refactor the upgrade checking logic into a separate function to adhere to the DRY principle.

if upgrades == nil && err == nil { upgrades, err = gallery.CheckBackendUpgrades(ctx, uc.galleries, uc.systemState) }

[CQ-008] core/application/distributed.go:256 (Maintainability · LOW)

Issue: Magic number 30 in code
Suggestion: Extract to a named constant

Interval:          30 * time.Second,

[CQ-009] core/application/upgrade_checker.go:53 (Style · LOW)

Issue: Line exceeds 120 characters (163 chars)
Suggestion: Break long lines into multiple lines for readability

func NewUpgradeChecker(appConfig *config.ApplicationConfig, ml *model.ModelLoader, db *gorm.DB, backendManagerFn func() ...

[CQ-008] core/application/upgrade_checker.go:81 (Maintainability · LOW)

Issue: Magic number 10 in code
Suggestion: Extract to a named constant

initialDelay := 10 * time.Second

[CQ-LLM-005] core/gallery/backends.go:396 (Maintainability · LOW)

Issue: The Nodes field in SystemBackend is not well documented, making it harder to maintain.
Suggestion: Add a comment explaining the purpose and usage of the Nodes field.

Nodes []NodeBackendRef `json:"nodes,omitempty"`

[CQ-009] core/gallery/upgrade.go:59 (Style · LOW)

Issue: Line exceeds 120 characters (175 chars)
Suggestion: Break long lines into multiple lines for readability

func CheckUpgradesAgainst(ctx context.Context, galleries []config.Gallery, systemState *system.SystemState, installedBac...

[CQ-008] core/http/react-ui/src/App.css:1569 (Maintainability · LOW)

Issue: Magic number 400 in code
Suggestion: Extract to a named constant

font-weight: 400;

Per-File Breakdown

File Critical High Medium Low Total
core/application/distributed.go 0 0 1 1 2
core/application/upgrade_checker.go 0 0 2 2 4
core/gallery/backends.go 0 0 0 1 1
core/gallery/upgrade.go 0 1 0 1 2
core/http/react-ui/src/App.css 0 0 0 14 14
core/http/react-ui/src/components/FilterBar.jsx 0 0 0 1 1
core/http/react-ui/src/components/NodeDistributionChip.jsx 0 0 0 11 11
core/http/react-ui/src/components/Popover.jsx 0 0 0 2 2
core/http/react-ui/src/pages/Manage.jsx 0 0 0 122 122
core/http/react-ui/src/pages/Nodes.jsx 0 0 0 36 36
core/services/advisorylock/keys.go 0 0 0 1 1
core/services/nodes/managers_distributed.go 0 0 0 1 1
core/services/nodes/reconciler.go 0 0 0 2 2
core/services/nodes/reconciler_test.go 0 0 0 9 9
core/services/nodes/registry.go 0 0 0 7 7

Recommendations

  1. Resolve High severity issues, especially error handling gaps and performance bottlenecks.
  • Run automated tests after applying fixes to verify no regressions.

@DhirenMhatre

Copy link
Copy Markdown
Author

@codity review

1 similar comment
@DhirenMhatre

Copy link
Copy Markdown
Author

@codity review

@codity-dm

codity-dm Bot commented May 24, 2026

Copy link
Copy Markdown

Policy Check Failed

✗ 3/3 policy checks failed:

• Need 2 more approval(s) (0/2) — comment LGTM or approve via review
• Missing ticket reference (expected: JIRA-, ENG-, #*)
• 20 code file(s) changed but no test files added


To merge this PR:

  1. Address the failed checks listed above
  2. Ensure branch protection requires the codity/policy-check status

Configure policies in your dashboard

@codity-dm

codity-dm Bot commented May 24, 2026

Copy link
Copy Markdown

PR Summary

What Changed

  • Fixed distributed upgrade detection to aggregate backend state from all workers instead of checking the frontend's empty filesystem.
  • Added durable queue for backend lifecycle operations (install/upgrade/delete) with retry logic and dead-lettering after 10 attempts.
  • Built new React components for cluster visibility: node distribution chips, popovers, and filter bars with URL-persisted state.

Key Changes by Area

Distributed Backend Management

  • UpgradeChecker now routes through DistributedBackendManager to aggregate SystemBackends from all workers (core/application/upgrade_checker.go:47-52).
  • Added CheckUpgradesAgainst() entry point accepting pre-aggregated state instead of local filesystem reads (core/gallery/upgrade.go:43-68).
  • Implemented summarizeNodeDrift() to detect version/digest mismatches across nodes and flag upgrades when nodes disagree (core/gallery/upgrade.go:159-198).

Backend Operation Queue

  • New PendingBackendOp table with composite key (node_id, backend, op) and exponential backoff (core/services/nodes/registry.go:107-130).
  • enqueueAndDrainBackendOp records intent before NATS calls, queues operations for offline nodes, and retries healthy nodes with backoff (core/services/nodes/managers_distributed.go:97-167).

State Reconciliation

  • Split ReplicaReconciler into separate advisory-locked phases for replica scaling vs. state reconciliation (core/services/nodes/reconciler.go:113-130).
  • Added drainPendingBackendOps() with retry/dead-letter logic (reconciler.go:171-242).
  • Added probeLoadedModels() to health-check stale models via gRPC and remove unreachable entries (reconciler.go:242-301).

Cluster-Aware API

  • Extended /api/models with LoadedOn field showing worker node distribution (core/http/routes/ui_api.go).
  • Added Source: "registry-only" for ghost models discovered by reconciler but missing local config.

React UI

  • NodeDistributionChip: inline chips for small clusters, summary+popover for larger ones, with drift detection (core/http/react-ui/src/components/NodeDistributionChip.jsx).
  • FilterBar: shared search/filter/toggle control with URL persistence (core/http/react-ui/src/components/FilterBar.jsx).
  • Manage.jsx: 10s auto-refresh in distributed mode, backend upgrade banner, filter counts, "Adopted" badge for registry-only models.
  • Added CSS utilities for table cells, stat cards, and node status indicators (core/http/react-ui/src/App.css:1529-1560).

Files Changed

File Changes Summary
core/application/distributed.go Added Adapter, RegistrationToken, ProbeStaleAfter to ReplicaReconcilerOptions
core/application/startup.go Wired backendManagerFn for distributed upgrade checks
core/application/upgrade_checker.go Fixed to use lazy backendManagerFn routing through DistributedBackendManager
core/cli/worker.go Extended worker lifecycle events to report Version, URI, Digest metadata
core/gallery/backends.go Added NodeDriftInfo and NodeDrift fields to UpgradeInfo and SystemBackend
core/gallery/upgrade.go Added CheckUpgradesAgainst() and summarizeNodeDrift() algorithm
core/gallery/upgrade_test.go Added tests for drift detection, cluster agreement, empty-version scenarios
core/http/react-ui/src/App.css Added badge, cell-stack, stat-grid, filter-bar, popover, node-status CSS classes
core/http/react-ui/src/components/FilterBar.jsx New shared search/filter/toggle component
core/http/react-ui/src/components/NodeDistributionChip.jsx New component for node distribution visualization with drift detection
core/http/react-ui/src/components/Popover.jsx New accessible popover with auto-positioning and focus management
core/http/react-ui/src/pages/Manage.jsx Integrated distribution chips, URL-persisted filters, auto-refresh, upgrade banner
core/http/react-ui/src/pages/Nodes.jsx Added stat cards, CSS class refactoring, improved worker registration UX
core/http/routes/ui_api.go Added LoadedOn field and registry-only source for ghost models
core/services/advisorylock/keys.go Added KeyStateReconciler advisory lock key
core/services/galleryop/service.go Exposed BackendManager() getter for upgrade checker integration
core/services/messaging/subjects.go Added Version, URI, Digest to NodeBackendInfo for upgrade detection
core/services/nodes/managers_distributed.go Added enqueueAndDrainBackendOp with pending-ops queue pattern
core/services/nodes/reconciler.go Split reconciliation phases, added drain and probe logic with advisory locks
core/services/nodes/reconciler_test.go Added fakeProber, tests for state reconciliation and model probing
core/services/nodes/registry.go Added PendingBackendOp table, queue management methods, startup cleanup

Review Focus Areas

  • Dead-letter behavior: Operations abandoned after 10 attempts in drainPendingBackendOps - confirm this matches operational expectations.
  • Advisory lock scope: Verify KeyStateReconciler separation from replica scaling prevents blocking without creating lock ordering issues.
  • Ghost model probing: grpcModelProber health checks run against model gRPC addresses - ensure timeout handling aligns with ProbeStaleAfter config.

Architecture

  • Design Decisions: The pending-ops queue trades immediate feedback for durability. Operations on unhealthy nodes queue silently rather than fail fast. This is intentional for distributed deployments where nodes may be temporarily unavailable. The split advisory locks (replica vs. state) accept that state reconciliation may lag behind replica scaling to prevent head-of-line blocking.

  • Scalability & Extensibility: CheckUpgradesAgainst accepts pre-aggregated SystemBackends, allowing future aggregation strategies (caching, incremental updates) without changing the drift detection algorithm. The ModelProber interface permits non-gRPC health checks if needed later.

  • Risks:

    • Intentional: Dead-lettered operations require manual intervention. Exponential backoff caps at 15 minutes, which may delay recovery for long partitions.
    • Unintentional: probeLoadedModels bumps updated_at on successful health checks, which may mask actual staleness if the prober returns false positives.

@codity-dm

codity-dm Bot commented May 24, 2026

Copy link
Copy Markdown

Security Scan Summary

Metric Value
Vulnerabilities Critical: 0
Overall Risk Clean
Files Scanned 21

No critical security issues detected

Scan completed in 39.4s

Security scan powered by Codity.ai

@DhirenMhatre

Copy link
Copy Markdown
Author

@codity review

@codity-ai

codity-ai Bot commented May 25, 2026

Copy link
Copy Markdown

PR Summary

What Changed

  • Fixed distributed upgrade detection so frontends query workers instead of checking empty local state. Added cluster drift detection when nodes disagree on backend versions.
  • Added durable queue for backend lifecycle operations (install/upgrade/delete) with retry and dead-letter handling instead of skipping offline nodes.
  • Built new React components for cluster visibility: node distribution chips, popovers, and filter bars with URL-persisted state.

Key Changes by Area

Distributed Backend Management

  • core/services/nodes/managers_distributed.go: Replaced fire-and-forget operations with enqueueAndDrainBackendOp queue pattern. Non-healthy nodes get "queued" status with reconciler retry.
  • core/services/nodes/reconciler.go: Added probeLoadedModels() for ghost entry cleanup and drainPendingBackendOps() for retry processing with exponential backoff.

Upgrade Detection

  • core/gallery/upgrade.go: New CheckUpgradesAgainst() compares gallery against aggregated worker state. Detects NodeDrift when cluster nodes disagree.
  • core/application/upgrade_checker.go: Added backendManagerFn lazy getter to route through active backend manager.

Cluster Data Model

  • core/gallery/backends.go: Extended SystemBackend with Nodes []NodeBackendRef for per-node version/digest tracking.
  • core/services/nodes/registry.go: Added PendingBackendOp table with upsert semantics, exponential backoff, and startup pruning.

UI Components

  • NodeDistributionChip.jsx: Renders inline chips (≤3 nodes) or summary popover showing health, drift, and offline counts.
  • Popover.jsx: Accessible anchored surface with auto-positioning and focus management.
  • FilterBar.jsx: Reusable search + filter chip + toggle strip.

Management UI

  • Manage.jsx: URL-persisted filters (mq, mf, bq, bf), 10s auto-refresh in distributed mode, upgrade badges, NodeDistributionChip integration.
  • Nodes.jsx: Backend version/digest display with drift detection in expandable rows, "Register a new worker" flow.

Files Changed

File Changes Summary
core/application/distributed.go Added Adapter, RegistrationToken, ProbeStaleAfter to ReplicaReconcilerOptions
core/application/startup.go (Inferred: reconciler initialization with new options)
core/application/upgrade_checker.go Added backendManagerFn lazy getter for distributed routing
core/cli/worker.go Extended backend reporting with Version, URI, Digest
core/gallery/backends.go Added Nodes []NodeBackendRef to SystemBackend for drift tracking
core/gallery/upgrade.go New CheckUpgradesAgainst() with cluster aggregation and drift detection
core/gallery/upgrade_test.go Test suite for drift detection, unanimous clusters, edge cases
core/http/react-ui/src/App.css Table utilities, stat cards, popovers, status indicators, tab badges
core/http/react-ui/src/components/FilterBar.jsx New reusable search/filter/toggle control strip
core/http/react-ui/src/components/NodeDistributionChip.jsx New component for node distribution display with popover
core/http/react-ui/src/components/Popover.jsx New accessible anchored popover with focus management
core/http/react-ui/src/pages/Manage.jsx URL filters, auto-refresh, upgrade badges, NodeDistributionChip integration
core/http/react-ui/src/pages/Nodes.jsx Version/digest display, drift detection, worker registration polish
core/http/routes/ui_api.go /api/models joins node registry for LoadedOn, registry-only ghost detection
core/services/advisorylock/keys.go Added KeyStateReconciler advisory lock key
core/services/galleryop/service.go Exposed BackendManager() getter for upgrade checker
core/services/messaging/subjects.go Extended NodeBackendInfo with Version, URI, Digest
core/services/nodes/managers_distributed.go enqueueAndDrainBackendOp queue pattern, DeleteBackendDetailed variant
core/services/nodes/reconciler.go probeLoadedModels(), drainPendingBackendOps(), split advisory locks
core/services/nodes/reconciler_test.go fakeProber helper, tests for model probing and pending ops
core/services/nodes/registry.go PendingBackendOp table, upsert/queue/retry methods, startup pruning

Review Focus Areas

  • Pending ops dead-letter logic: Verify 10-attempt cap and backoff progression in drainPendingBackendOps.
  • Drift detection accuracy: Check CheckUpgradesAgainst handles edge cases where all nodes report empty versions.
  • Reconciler lock splitting: Confirm replica-reconciler vs KeyStateReconciler locks prevent deadlock during state reconciliation.

Architecture

Design Decisions

  • Queue pattern for backend ops accepts temporary inconsistency (nodes may lag) in exchange for durability. Intentionally skips agent nodes and unapproved nodes to prevent stuck operations.
  • Split advisory locks allow state reconciliation to run opportunistically (non-blocking try-lock) without blocking replica scaling operations.
  • Lazy backendManagerFn in upgrade checker defers distributed vs local decision to call time, avoiding circular initialization.

Scalability & Extensibility

  • PendingBackendOp upsert semantics prevent queue explosion under reissue storms. Composite index on (node_id, backend, op) ensures single entry per target.
  • NodeDistributionChip scales display: inline for ≤3 nodes, popover summary for larger clusters. Out of scope: server-side pagination for very large clusters (>100 nodes).

Risks

  • Intentional: Startup pruning of malformed queue rows may drop legitimate ops if node registry state is inconsistent at boot. Acceptable given crash-recovery scenario.
  • Unintentional: ErrNoResponders marks node unhealthy immediately. May over-trigger if NATS transiently unavailable. Should review thresholding.

Merge Status

NOT MERGEABLE — PR Score 8/100, below threshold (50)

  • [H1] 1 CRITICAL security vulnerability found (SEC-001)
  • [H4] PR quality score (8) is below merge floor (50)
  • [H5] 8 HIGH-severity inline review findings need resolution (threshold: 3)
  • [H6] Code quality raw score (14) is below merge floor (40)

@codity-ai

codity-ai Bot commented May 25, 2026

Copy link
Copy Markdown

Workflow Diagrams

Automatically generated sequence diagrams showing the workflows in this PR

1. ## Analysis

This PR introduces a sophisticated distribut...

Complex complexity • Components: UpgradeChecker, DistributedBackendManager, CheckUpgradesAgainst

sequenceDiagram
    title Distributed Backend Upgrade Detection with Drift Awareness

    participant UC as UpgradeChecker
    participant AL as AdvisoryLock
    participant DB as PostgreSQL
    participant DBM as DistributedBackendManager
    participant Worker as WorkerNode
    participant Gallery as GalleryService
    participant UI as ReactUI

    Note over UC: Periodic check triggered every 6 hours

    UC->>AL: Acquire lock for upgrade check
    alt Lock acquired this is the leader instance
        AL->>DB: SELECT pg_try_advisory_lock(key)
        DB-->>AL: true
        
        UC->>DBM: CheckUpgrades()
        Note over DBM: Aggregates from all workers instead of empty frontend filesystem
        
        DBM->>Worker: ListBackends() via NATS
        Worker-->>DBM: NodeBackendRef with version and digest
        
        DBM->>DBM: Aggregate into SystemBackends with per-node metadata
        
        DBM->>Gallery: CheckUpgradesAgainst(galleries, systemState, installedBackends)
        
        Gallery->>Gallery: summarizeNodeDrift(nodes)
        Note over Gallery: Detects version or digest divergence across cluster
        
        Gallery->>Gallery: Compare gallery version vs majority cluster version
        
        alt Cluster drift detected
            Gallery-->>DBM: UpgradeInfo with NodeDrift populated
        else All nodes agree
            Gallery-->>DBM: UpgradeInfo with empty NodeDrift
        end
        
        DBM-->>UC: map of available upgrades
        
        UC->>UC: Cache results in lastUpgrades
        
        alt Auto-upgrade enabled
            UC->>Gallery: UpgradeBackend()
            Gallery->>DBM: UpgradeBackend(name)
            DBM->>Worker: InstallBackend via NATS
            Worker-->>DBM: success
        end
        
    else Lock not acquired another instance is leader
        AL->>DB: SELECT pg_try_advisory_lock(key)
        DB-->>AL: false
        Note over UC: Skip periodic check this instance is not leader
    end

    UC->>UI: GetAvailableUpgrades() API call
    UI-->>UC: Cached upgrade results including NodeDrift

    Note over UI: Displays upgrade banner with cluster drift indicators
Loading

Note: Diagrams show detected patterns only. Complex workflows may require manual review.

@codity-ai

codity-ai Bot commented May 25, 2026

Copy link
Copy Markdown

Security Scan Summary

Metric Value
Vulnerabilities Critical: 0
Overall Risk Clean
Files Scanned 21

No critical security issues detected

Scan completed in 58.0s

Security scan powered by Codity.ai

@codity-ai

codity-ai Bot commented May 25, 2026

Copy link
Copy Markdown

Dependency vulnerability scanning

Metric Value
Vulnerabilities Found 7
Scanner npm audit
View vulnerability details (7 items)

1. brace-expansion various

CVE: GHSA-f886-m6hf-6m8v
Severity: MODERATE
Fixed in: True

brace-expansion: Zero-step sequence causes process hang and memory exhaustion


2. express-rate-limit various

CVE: N/A
Severity: MODERATE
Fixed in: True

Vulnerability in express-rate-limit


3. fast-uri various

CVE: GHSA-q3j6-qgpj-74h6, GHSA-v39h-62p7-jpjc
Severity: HIGH
Fixed in: True

fast-uri vulnerable to path traversal via percent-encoded dot segments


4. hono various

CVE: GHSA-qp7p-654g-cw7p, GHSA-hm8q-7f3q-5f36, GHSA-p77w-8qqv-26rm, GHSA-9vqf-7f2p-gf9v, GHSA-69xw-7hcm-h432
Severity: MODERATE
Fixed in: True

Hono has CSS Declaration Injection via Style Object Values in JSX SSR


5. ip-address various

CVE: GHSA-v2v4-37r5-5v8g
Severity: MODERATE
Fixed in: True

ip-address has XSS in Address6 HTML-emitting methods


6. postcss various

CVE: GHSA-qx2v-qp2m-jg93
Severity: MODERATE
Fixed in: True

PostCSS has XSS via Unescaped </style> in its CSS Stringify Output


7. qs various

CVE: GHSA-q8mj-m7cp-5q26
Severity: MODERATE
Fixed in: True

qs has a remotely triggerable DoS: qs.stringify crashes with TypeError on null/undefined entries in comma-format arrays when encodeValuesOnly is set

Powered by Codity.ai · Docs

@codity-ai

codity-ai Bot commented May 25, 2026

Copy link
Copy Markdown

Dependency vulnerability scanning

Metric Value
Vulnerabilities Found 4
Scanner pip-audit
View vulnerability details (4 items)

1. pip 24.0

CVE: GHSA-4xh5-x5gv-qwph
Fixed in: 25.3

When extracting a tar archive pip may not check symbolic links point into the extraction directory if the tarfile module doesn't implement PEP 706. Note that upgrading pip to a "fixed" version for thi


2. pip 24.0

CVE: GHSA-6vgw-5pg2-w6jp
Fixed in: 26.0

When pip is installing and extracting a maliciously crafted wheel archive, files may be extracted outside the installation directory. The path traversal is limited to prefixes of the installation dire


3. pip 24.0

CVE: GHSA-58qw-9mgm-455v
Fixed in: 26.1

pip handles concatenated tar and ZIP files as ZIP files regardless of filename or whether a file is both a tar and ZIP file. This behavior could result in confusing installation behavior, such as inst


4. pip 24.0

CVE: GHSA-jp4c-xjxw-mgf9
Fixed in: 26.1

pip prior to version 26.1 would run self-update check functionality after installing wheel files which required importing well-known Python modules names. These module imports were intentionally defer

Powered by Codity.ai · Docs

@codity-ai

codity-ai Bot commented May 25, 2026

Copy link
Copy Markdown

License Compliance Scan

Metric Value
Packages Scanned 778
High Risk (Strong Copyleft) 0
Medium Risk (Weak Copyleft) 6
Low Risk (Permissive) 618
Unknown License 154

Weak copyleft licenses found - verify compatibility

Some packages have unknown licenses - manual review required

Medium Risk Licenses - 6 packages

(MPL-2.0 OR Apache-2.0) (1 packages):

  • dompurify 3.4.0

MPL-2.0 (5 packages):

  • github.com/philippgille/chromem-go 0.7.0
  • github.com/libp2p/go-yamux/v5 5.0.1
  • github.com/hashicorp/golang-lru/v2 2.0.7
  • github.com/hashicorp/golang-lru 1.0.2
  • github.com/shoenig/go-m1cpu 0.1.6
Unknown Licenses - 154 packages
  • github.com/emirpasic/gods 1.18.1
  • github.com/cyphar/filepath-securejoin 0.5.1
  • github.com/dslipak/pdf 0.0.2
  • github.com/go-git/go-billy/v5 5.6.2
  • github.com/go-git/gcfg 1.5.1-0.20230307220236-3a3c6141e376
  • github.com/go-git/go-git/v5 5.16.4
  • github.com/eritikass/githubmarkdownconvertergo 0.1.10
  • github.com/gocolly/colly 1.2.0
  • github.com/golang/protobuf 1.5.4
  • github.com/google/go-github/v69 69.2.0
  • github.com/google/go-querystring 1.1.0
  • github.com/kennygrant/sanitize 1.2.4
  • github.com/mschoch/smat 0.2.0
  • github.com/mudler/skillserver 0.0.6
  • github.com/olekukonko/tablewriter 0.0.5
  • github.com/pjbgf/sha1cd 0.3.2
  • github.com/saintfish/chardet 0.0.0-20230101081208-5e3ef4b5456d
  • github.com/segmentio/asm 1.1.3
  • github.com/sergi/go-diff 1.4.0
  • go.mau.fi/util 0.3.0

...and 134 more

Powered by Codity.ai · Docs

@codity-ai

codity-ai Bot commented May 25, 2026

Copy link
Copy Markdown

Code Quality Report — test-org-codity/LocalAI · PR #7

Scanned: 2026-05-25 11:19 UTC | Score: 14/100 | Provider: github

Executive Summary

Severity Count
Critical 0
High 1
Medium 4
Low 211
Top Findings

[CQ-LLM-003] core/application/upgrade_checker.go:146 (Error_Handling · HIGH)

Issue: Potentially swallowing errors when backendManagerFn is nil, leading to unhandled cases.
Suggestion: Add error handling to ensure that if backendManagerFn is nil, an appropriate error is logged or handled.

if uc.backendManagerFn != nil {

[CQ-LLM-002] core/application/distributed.go:242 (Documentation · MEDIUM)

Issue: Missing detailed documentation for the new parameters in the ReplicaReconciler.
Suggestion: Add JSDoc comments for the new parameters to clarify their purpose.

// Create ReplicaReconciler for auto-scaling model replicas. Adapter +

[CQ-LLM-001] core/application/distributed.go:243 (Complexity · MEDIUM)

Issue: The function initDistributed has increased cyclomatic complexity due to the addition of multiple parameters and logic.
Suggestion: Consider breaking down the function into smaller functions to reduce complexity.

func initDistributed(cfg *config.ApplicationConfig, authDB *gorm.DB) (*Distribut

[CQ-LLM-006] core/application/startup.go:236 (Testability · MEDIUM)

Issue: The use of a hard-coded function for backend manager may hinder testability.
Suggestion: Consider using dependency injection for the backend manager to improve testability.

bmFn := func() galleryop.BackendManager { return application.GalleryService().BackendManager() }

[CQ-LLM-004] core/application/upgrade_checker.go:158 (Performance · MEDIUM)

Issue: The runCheck function may lead to performance issues due to the nested calls to CheckUpgrades.
Suggestion: Consider optimizing the logic to avoid unnecessary calls to CheckUpgrades.

if upgrades == nil && err == nil {

[CQ-008] core/application/distributed.go:256 (Maintainability · LOW)

Issue: Magic number 30 in code
Suggestion: Extract to a named constant

Interval:          30 * time.Second,

[CQ-009] core/application/upgrade_checker.go:53 (Style · LOW)

Issue: Line exceeds 120 characters (163 chars)
Suggestion: Break long lines into multiple lines for readability

func NewUpgradeChecker(appConfig *config.ApplicationConfig, ml *model.ModelLoader, db *gorm.DB, backendManagerFn func() ...

[CQ-008] core/application/upgrade_checker.go:81 (Maintainability · LOW)

Issue: Magic number 10 in code
Suggestion: Extract to a named constant

initialDelay := 10 * time.Second

[CQ-LLM-005] core/gallery/upgrade.go:25 (Maintainability · LOW)

Issue: The comment regarding NodeDrift could be clearer and more concise.
Suggestion: Refine the comment to improve clarity and maintainability.

// NodeDrift lists nodes whose installed version or digest differs from

[CQ-009] core/gallery/upgrade.go:59 (Style · LOW)

Issue: Line exceeds 120 characters (175 chars)
Suggestion: Break long lines into multiple lines for readability

func CheckUpgradesAgainst(ctx context.Context, galleries []config.Gallery, systemState *system.SystemState, installedBac...

Per-File Breakdown

File Critical High Medium Low Total
core/application/distributed.go 0 0 2 1 3
core/application/startup.go 0 0 1 0 1
core/application/upgrade_checker.go 0 1 1 2 4
core/gallery/upgrade.go 0 0 0 2 2
core/http/react-ui/src/App.css 0 0 0 14 14
core/http/react-ui/src/components/FilterBar.jsx 0 0 0 1 1
core/http/react-ui/src/components/NodeDistributionChip.jsx 0 0 0 11 11
core/http/react-ui/src/components/Popover.jsx 0 0 0 2 2
core/http/react-ui/src/pages/Manage.jsx 0 0 0 122 122
core/http/react-ui/src/pages/Nodes.jsx 0 0 0 36 36
core/services/advisorylock/keys.go 0 0 0 1 1
core/services/nodes/managers_distributed.go 0 0 0 1 1
core/services/nodes/reconciler.go 0 0 0 2 2
core/services/nodes/reconciler_test.go 0 0 0 9 9
core/services/nodes/registry.go 0 0 0 7 7

Recommendations

  1. Resolve High severity issues, especially error handling gaps and performance bottlenecks.
  • Run automated tests after applying fixes to verify no regressions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants