Skip to content

OLS-3024: Add compliance audit logging to agentic operator#158

Open
vimalk78 wants to merge 2 commits into
openshift:mainfrom
vimalk78:ols-3024-audit-logging
Open

OLS-3024: Add compliance audit logging to agentic operator#158
vimalk78 wants to merge 2 commits into
openshift:mainfrom
vimalk78:ols-3024-audit-logging

Conversation

@vimalk78

@vimalk78 vimalk78 commented Jun 18, 2026

Copy link
Copy Markdown
Contributor

Summary

Add compliance audit logging to the agentic operator (OLS-3271 + OLS-3272). Structured JSON audit events are emitted to stdout at every proposal phase
transition. OTEL trace spans provide distributed tracing with W3C traceparent propagation to the sandbox.

Audit logging

  • Dedicated AuditLogger (independent zap, immune to operator log level settings)
  • spec.audit.logging controls stdout JSON (default: true)
  • spec.audit.otel.endpoint controls OTEL export (independent of logging)
  • 8 audit events: proposal.received, analysis.completed, approval.received, execution.completed, verification.completed, verification.retry,
    escalation.completed, proposal.terminal
  • CR serialization (.spec + .status + select metadata) on all result events
  • Events emitted on both success and failure paths

OTEL tracing

  • Span hierarchy: proposal.received (root) → analyze, human_approval, execute, verify, escalate, terminal (children)
  • Root span ID stored in annotation for cross-reconcile parenting
  • human_approval span captures human decision wait time (manual approval only)
  • Auto-approved stages emit audit event but no span
  • traceparent header on /v1/agent/run HTTP calls
  • Service name: agentic-operator

CRD changes

  • spec.audit block on AgenticOLSConfig with logging (bool) and otel.endpoint (string)

Spec updates (second commit)

  • Root span renamed to proposal.received
  • CR serialization includes .status for result CRs
  • proposal.terminal added to child spans
  • spec.audit.enabledspec.audit.logging, logging and OTEL independent

Test plan

  • 12 integration tests (full reconciler lifecycle with audit capture)
  • 22 unit tests (helpers, JSON format, config defaults, logging/OTEL resolution)
  • make test passes
  • make api-lint passes
  • Verified on cluster: audit JSON in operator stdout
  • Verified in Jaeger: span hierarchy with Logs containing CR payloads
  • Verified manual approval: proposal.human_approval span with duration
  • Verified auto approval: audit event emitted, no span

🤖 Generated with Claude Code

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jun 18, 2026
@openshift-ci-robot

openshift-ci-robot commented Jun 18, 2026

Copy link
Copy Markdown

@vimalk78: This pull request references OLS-3024 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the epic to target the "5.0.0" version, but no target version was set.

Details

In response to this:

Summary

Add compliance audit logging to the agentic operator (OLS-3271 + OLS-3272). Structured JSON audit events are emitted to stdout at every proposal phase
transition. OTEL trace spans provide distributed tracing with W3C traceparent propagation to the sandbox.

Audit logging

  • Dedicated AuditLogger (independent zap, immune to operator log level settings)
  • spec.audit.logging controls stdout JSON (default: true)
  • spec.audit.otel.endpoint controls OTEL export (independent of logging)
  • 8 audit events: proposal.received, analysis.completed, approval.received, execution.completed, verification.completed, verification.retry,
    escalation.completed, proposal.terminal
  • CR serialization (.spec + .status + select metadata) on all result events
  • Events emitted on both success and failure paths

OTEL tracing

  • Span hierarchy: proposal.received (root) → analyze, human_approval, execute, verify, escalate, terminal (children)
  • Root span ID stored in annotation for cross-reconcile parenting
  • human_approval span captures human decision wait time (manual approval only)
  • Auto-approved stages emit audit event but no span
  • traceparent header on /v1/agent/run HTTP calls
  • Service name: agentic-operator

CRD changes

  • spec.audit block on AgenticOLSConfig with logging (bool) and otel.endpoint (string)

Spec updates (second commit)

  • Root span renamed to proposal.received
  • CR serialization includes .status for result CRs
  • proposal.terminal added to child spans
  • spec.audit.enabledspec.audit.logging, logging and OTEL independent

Test plan

  • 12 integration tests (full reconciler lifecycle with audit capture)
  • 22 unit tests (helpers, JSON format, config defaults, logging/OTEL resolution)
  • make test passes
  • make api-lint passes
  • Verified on cluster: audit JSON in operator stdout
  • Verified in Jaeger: span hierarchy with Logs containing CR payloads
  • Verified manual approval: proposal.human_approval span with duration
  • Verified auto approval: audit event emitted, no span

🤖 Generated with Claude Code

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci Bot requested a review from blublinsky June 18, 2026 09:32
@openshift-ci

openshift-ci Bot commented Jun 18, 2026

Copy link
Copy Markdown

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign harche for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot requested a review from xrajesh June 18, 2026 09:32
@coderabbitai

coderabbitai Bot commented Jun 18, 2026

Copy link
Copy Markdown
📝 Walkthrough

Summary by CodeRabbit

  • New Features
    • Added structured audit event logging and OpenTelemetry tracing with independent controls for JSON event emission vs trace exporting.
    • Proposal lifecycle audit coverage expanded with detailed terminal-phase reporting and improved trace correlation across restarts.
    • Automatic trace context propagation is applied across proposal workflow steps.
  • Configuration Updates
    • Added spec.audit settings to the CRD, including logging and OTEL endpoint with TLS “insecure” handling.
  • Tests
    • Added integration and unit coverage for audit event/trace ID formatting and end-to-end lifecycle emissions.

Walkthrough

Adds structured JSON audit event emission to stdout and OpenTelemetry distributed tracing across the full Proposal lifecycle. Introduces independent spec.audit.logging and spec.audit.otel.endpoint configuration controls. New AuditLogger writes compliance events; OTEL spans instrument each workflow step from proposal.received root through proposal.terminal completion. All handlers emit typed audit events tied to trace context. Wiring flows from main() startup through controller/setup.go into ProposalReconciler, with HTTP trace injection for outbound agent calls.

Changes

Audit Logging and OTEL Tracing for Proposal Lifecycle

Layer / File(s) Summary
Audit config spec and API types
.ai/spec/what/audit-logging.md, api/v1alpha1/agenticolsconfig_types.go, config/crd/bases/agentic.openshift.io_agenticolsconfigs.yaml
Spec renames root span to proposal.received, adds proposal.terminal child span, and splits audit controls into independent spec.audit.logging (default enabled) and spec.audit.otel.endpoint (OTLP when set, else no-op). API adds AuditOTELConfig and AuditConfig with nil-safe LoggingEnabled(), OTELEndpoint(), and OTELInsecure() methods. CRD reflects the schema with validation constraints.
AuditLogger and trace utilities
controller/proposal/audit.go
Implements AuditLogger (zap JSON to stdout), eight audit event name constants, trace-ID derivation from Proposal UID (hex, hyphens stripped), CR metadata serialization, emitAuditEvent (structured log + OTEL span event dual emission), typed terminal/retry payloads, W3C traceparent HTTP header carrier, lifecycleParentContext for span parenting, and common span attributes.
Result CR snapshots for audit payloads
controller/proposal/results.go
createAnalysisResult, createExecutionResult, createVerificationResult, and createEscalationResult return fully populated result CR snapshots (deep-copied, including status) to enable callers to serialize spec/status for audit payloads.
ProposalReconciler root span and lifecycle tracing
controller/proposal/reconciler.go, controller/proposal/approval.go
ProposalReconciler gains Audit field, wraps reconcile context with lifecycleParentContext, creates proposal.received root span on finalizer addition (storing span ID as annotation), emits AuditProposalReceived, and creates proposal.terminal span for terminal phases. isAutoApproved() helper evaluates ApprovalModeAutomatic per stage.
Per-step span instrumentation and audit events
controller/proposal/handlers.go, controller/proposal/helpers.go
All handlers (handleAnalysis, handleRevision, handleExecution, handleVerification, handleSuspension, handleEscalation, denyProposal) wrap step logic in scoped OTEL spans with agent/model attributes and emit corresponding audit events. handleExecution conditionally creates proposal.human_approval span for manual approvals or emits AuditApprovalReceived with mode=Automatic for auto-approvals. All terminal transitions emit AuditProposalTerminal within proposal.terminal span. failStep emits per-condition completion events and failed-phase tracking.
Trace context propagation on outbound HTTP
controller/proposal/client.go
AgentHTTPClient.Run injects OTEL trace context into outgoing HTTP headers via propagation.HeaderCarrier before agent POST, enabling end-to-end trace correlation.
Startup initialization and wiring
cmd/main.go, controller/setup.go, go.mod
main() loads AgenticOLSConfig "cluster", conditionally constructs AuditLogger (deferred Sync()), initializes OTEL TracerProvider (OTLP gRPC or no-op) with deferred shutdown, and passes logger through controller.Options.Audit into ProposalReconciler. go.mod adds OTel v1.44.0 SDK/OTLP/gRPC dependencies; promotes go-logr and zap to direct requirements.
Unit and integration tests
controller/proposal/audit_test.go, controller/proposal/audit_integration_test.go
audit_test.go covers trace-ID utilities, CR serialization, emitAuditEvent nil-safety/payload handling, config accessors, and HTTP carrier behavior. audit_integration_test.go validates 12 lifecycle scenarios: proposal received, analysis/execution/verification completion, analysis/execution failure, auto-approval, denied proposals, terminal-once semantics, trace ID format, and JSON field presence.
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 37.25% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately summarizes the main change: adding compliance audit logging to the agentic operator, referencing the JIRA issue OLS-3024.
Description check ✅ Passed The description comprehensively covers the audit logging and OTEL tracing implementation with specific details about configuration, events, spans, and testing.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 11

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
controller/proposal/handlers.go (3)

690-711: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Move the escalated terminal event after the status patch.

proposal.terminal is emitted before Escalated=True is set and persisted. If statusPatch fails, audit/trace says the proposal is terminal while the API object remains non-terminal. Emit the terminal span/event only after the successful status patch, matching the other terminal paths.

Proposed fix
 	}
 	emitAuditEvent(ctx, r.Audit, AuditEscalationCompleted, traceIDFromProposal(proposal), serializeCRWithStatus(escCR, escCR.Spec, escCR.Status))
-	termCtx, termSpan := tracer.Start(parentCtx, "proposal.terminal")
-	termSpan.SetAttributes(attribute.String("phase", string(agenticv1alpha1.ProposalPhaseEscalated)))
-	emitAuditEvent(termCtx, r.Audit, AuditProposalTerminal, traceIDFromProposal(proposal), terminalPayload{Phase: string(agenticv1alpha1.ProposalPhaseEscalated)})
-	termSpan.End()
 	proposal.Status.Steps.Escalation.Results = append(proposal.Status.Steps.Escalation.Results, agenticv1alpha1.StepResultRef{Name: crName, Outcome: agenticv1alpha1.ActionOutcomeFromBool(escalationResult.Success)})
@@
 	if err := r.statusPatch(ctx, proposal, base); err != nil {
 		return ctrl.Result{}, fmt.Errorf("%s: %w", ErrUpdateToEscalated, err)
 	}
+	termCtx, termSpan := tracer.Start(parentCtx, "proposal.terminal")
+	termSpan.SetAttributes(attribute.String("phase", string(agenticv1alpha1.ProposalPhaseEscalated)))
+	emitAuditEvent(termCtx, r.Audit, AuditProposalTerminal, traceIDFromProposal(proposal), terminalPayload{Phase: string(agenticv1alpha1.ProposalPhaseEscalated)})
+	termSpan.End()
 
 	log.Info("escalation complete", LogKeySummary, escalationResult.Summary)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@controller/proposal/handlers.go` around lines 690 - 711, The terminal span
and audit event are being emitted before the status patch is executed and
persisted. This creates a race condition where if the statusPatch call fails,
the audit log will indicate the proposal is terminal while the actual API object
remains non-terminal. Move the tracer.Start("proposal.terminal"),
termSpan.SetAttributes, emitAuditEvent with AuditProposalTerminal, and
termSpan.End() calls to occur after the statusPatch succeeds. This ensures the
terminal event is only recorded once the Escalated condition is actually
persisted to the API object, matching the pattern used in other terminal paths.

469-492: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Emit verification.completed for failed verification results too.

A VerificationResult CR is created before the pass/fail branch, but AuditVerificationCompleted is emitted only on the success path. Retry and exhausted-retry failures miss the completed event for the created result CR. Emit it immediately after result creation succeeds, then keep AuditVerificationRetry as the additional retry-specific event.

Proposed fix
 	verifyCRName, verifyCR, verifyCRErr := r.createVerificationResult(ctx, proposal, verifyResult, proposal.Status.Steps.Verification.Sandbox, startTime, &completedAt, "")
 	if verifyCRErr != nil {
 		return r.failStep(ctx, proposal, agenticv1alpha1.ProposalConditionVerified, fmt.Errorf("%s: %w", ErrCreateVerificationResult, verifyCRErr))
 	}
+	emitAuditEvent(ctx, r.Audit, AuditVerificationCompleted, traceIDFromProposal(proposal), serializeCRWithStatus(verifyCR, verifyCR.Spec, verifyCR.Status))
 	proposal.Status.Steps.Verification.Results = append(proposal.Status.Steps.Verification.Results, agenticv1alpha1.StepResultRef{Name: verifyCRName, Outcome: agenticv1alpha1.ActionOutcomeFromBool(verifyResult.Success)})
@@
-	emitAuditEvent(ctx, r.Audit, AuditVerificationCompleted, traceIDFromProposal(proposal), serializeCRWithStatus(verifyCR, verifyCR.Spec, verifyCR.Status))
 	termCtx, termSpan := tracer.Start(parentCtx, "proposal.terminal")

Also applies to: 542-542

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@controller/proposal/handlers.go` around lines 469 - 492, The
AuditVerificationCompleted event is currently only emitted on the success path,
but a VerificationResult CR is created before the pass/fail check. Add an
emitAuditEvent call for AuditVerificationCompleted immediately after the
successful createVerificationResult call (after the verifyCRErr nil check in the
handler), passing the created verifyCR and verifyCRName. This ensures the
completion event is emitted for all verification results regardless of pass/fail
outcome, and the existing AuditVerificationRetry event will remain as an
additional retry-specific event in the failure path.

90-111: ⚠️ Potential issue | 🟠 Major

Decouple OTEL tracing from stdout audit logging.

Tracing and audit logging are currently coupled despite being independent controls. When spec.audit.logging=false, the AuditLogger is nil, which gates both:

  • Lifecycle trace parent context setup (reconciler.go, line 62: if r.Audit != nil)
  • Span event emission (audit.go, line 132: if auditLog == nil { return })

With an OTEL endpoint configured and stdout logging disabled, all span events and lifecycle context are lost. OTEL tracing should proceed independently. Move tracing setup outside the r.Audit != nil gate and allow emitAuditEvent to emit span events even when the audit logger is nil.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@controller/proposal/handlers.go` around lines 90 - 111, The OTEL tracing and
audit logging are currently coupled through conditional gates, causing all
tracing to be disabled when audit logging is disabled. In reconciler.go around
line 62, the tracing setup context is gated behind an `if r.Audit != nil` check
- move the trace parent context initialization outside this gate so tracing
proceeds independently of audit logger availability. Additionally, in audit.go
around line 132 in the emitAuditEvent function, the early return when `auditLog
== nil` prevents span events from being emitted - modify the function to
continue processing and emitting span events to the OTEL tracer even when the
audit logger is nil, only skipping the audit logging portion when the logger is
unavailable.
🧹 Nitpick comments (5)
controller/proposal/audit_integration_test.go (5)

410-413: ⚡ Quick win

Check type assertion success before accessing the result.

The type assertion ignores the ok return value, which could cause silent test failures if the payload structure differs from expectations.

🔍 Proposed fix
-	payload, _ := terminal["payload"].(map[string]interface{})
+	payload, ok := terminal["payload"].(map[string]interface{})
+	if !ok {
+		t.Fatal("payload should be an object")
+	}
 	if payload["phase"] != "Denied" {
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@controller/proposal/audit_integration_test.go` around lines 410 - 413, The
type assertion for payload in the test at line 410 ignores the ok return value,
which can cause silent test failures if the payload structure is incorrect.
Modify the type assertion of terminal["payload"].(map[string]interface{}) to
capture both the payload value and the ok boolean, then add a check to verify
the assertion succeeded before attempting to access payload["phase"], using
t.Errorf or t.Fatalf to report a failure if the type assertion fails.

318-324: ⚡ Quick win

Check type assertion success before accessing the result.

The type assertion ignores the ok return value, which could cause silent test failures if the payload structure differs from expectations.

🔍 Proposed fix
-	payload, _ := execCompleted["payload"].(map[string]interface{})
+	payload, ok := execCompleted["payload"].(map[string]interface{})
+	if !ok {
+		t.Fatal("payload should be an object")
+	}
 	if _, hasMeta := payload["metadata"]; !hasMeta {
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@controller/proposal/audit_integration_test.go` around lines 318 - 324, The
type assertion on execCompleted["payload"] ignores the ok return value, which
means if the assertion fails, payload will be nil and the subsequent checks on
payload["metadata"] and payload["spec"] will not work as intended. Capture both
the value and the ok boolean from the type assertion, then add a check to ensure
the assertion succeeded before proceeding to access payload fields; if the
assertion fails, the test should report an error indicating that the payload
type assertion failed.

368-371: ⚡ Quick win

Check type assertion success before accessing the result.

The type assertion ignores the ok return value, which could cause silent test failures if the payload structure differs from expectations.

🔍 Proposed fix
-	payload, _ := approval["payload"].(map[string]interface{})
+	payload, ok := approval["payload"].(map[string]interface{})
+	if !ok {
+		t.Fatal("payload should be an object")
+	}
 	if payload["mode"] != "Automatic" {
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@controller/proposal/audit_integration_test.go` around lines 368 - 371, The
type assertion that extracts payload from the approval map ignores the ok return
value, which can cause silent test failures if the payload structure differs
from expectations. Modify the type assertion to capture both the result and the
boolean success indicator, then add a conditional check before accessing
payload["mode"]. If the assertion fails, use t.Errorf or t.Fatalf to report that
the payload was not of the expected map[string]interface{} type, ensuring the
test explicitly fails rather than proceeding with potentially nil or incorrect
values.

465-468: ⚡ Quick win

Check type assertion success before accessing the result.

The type assertion ignores the ok return value, which could cause silent test failures if the payload structure differs from expectations.

🔍 Proposed fix
-	payload, _ := approval["payload"].(map[string]interface{})
+	payload, ok := approval["payload"].(map[string]interface{})
+	if !ok {
+		t.Fatal("payload should be an object")
+	}
 	if payload["stage"] != "Execution" {
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@controller/proposal/audit_integration_test.go` around lines 465 - 468, The
type assertion for approval["payload"].(map[string]interface{}) is ignoring the
ok return value, which means the test could silently fail if the payload is not
a map. Modify the line where payload is assigned to capture both the value and
the ok boolean return value, then add a check to verify the assertion succeeded
(ok is true) before proceeding to access payload fields. If the assertion fails,
use t.Fatalf or t.Errorf to fail the test with an appropriate error message
indicating that the payload type assertion failed.

170-176: ⚡ Quick win

Check type assertion success before accessing the result.

The type assertions ignore the ok return value, which could cause silent test failures if the payload structure differs from expectations. This is inconsistent with TestAuditProposalReceivedOnFirstReconcile (lines 101-108) which properly validates type assertions.

🔍 Proposed fix to check assertion success
-	payload, _ := completed["payload"].(map[string]interface{})
+	payload, ok := completed["payload"].(map[string]interface{})
+	if !ok {
+		t.Fatal("payload should be an object")
+	}
 	if _, hasMeta := payload["metadata"]; !hasMeta {
 		t.Error("payload should contain metadata (CR serialization)")
 	}

Apply the same pattern at line 182:

-	termPayload, _ := terminal["payload"].(map[string]interface{})
+	termPayload, ok := terminal["payload"].(map[string]interface{})
+	if !ok {
+		t.Fatal("terminal payload should be an object")
+	}
 	if termPayload["phase"] != "Failed" {

Also applies to: 182-185

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@controller/proposal/audit_integration_test.go` around lines 170 - 176, The
type assertion for the payload variable and subsequent assertions for metadata
and spec are ignoring their boolean return values (ok flag), which can cause
silent test failures if the payload structure differs from expectations. Fix
this by checking the ok return value from the payload type assertion and
properly validating that the assertion succeeded before accessing the result.
Additionally, validate the ok return values from the metadata and spec
assertions in the same block, following the same pattern used in
TestAuditProposalReceivedOnFirstReconcile, and add appropriate error handling if
any assertion fails.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.ai/spec/what/audit-logging.md:
- Line 24: The sentence in the audit-logging.md file currently states that all
audit events must be emitted from the reconciliation loop, but this contradicts
the definition of audit.approval.received which is emitted from the webhook
flow. Revise the statement to explicitly scope the reconciliation loop
requirement to only reconcile-emitted audit events, and either exclude
webhook-emitted events like audit.approval.received or specify that this rule
applies only to events originating from the reconciliation loop where the
Proposal object is in scope.

In `@cmd/main.go`:
- Around line 180-183: The resource.Merge function call is discarding the error
return value using the blank identifier, which silently ignores merge failures
such as schema URL conflicts. Instead of discarding the error, capture it as a
named return value in the resource.Merge call. Then add error handling logic
that checks if the error is not nil, logs the merge failure with appropriate
details, and applies a fallback strategy similar to what is already implemented
for exporter creation errors in the preceding code section (lines 169-177). This
ensures telemetry resource failures are observable and handled gracefully.
- Around line 169-172: The OTLP exporter configuration in the
otlptracegrpc.New() call includes WithInsecure() which disables TLS and exposes
sensitive trace data in plaintext, and the error returned by resource.Merge() at
line 180 is being discarded without handling. Remove the WithInsecure() option
from the otlptracegrpc.New() call to enforce TLS encryption for secure
transport, and add proper error checking and handling for the resource.Merge()
call so that any error it returns is captured and handled appropriately instead
of being silently discarded.

In `@controller/proposal/audit_test.go`:
- Around line 367-368: The comment block describing test cases contains an
incorrect field name reference. In the comment lines that describe the CR
configuration scenarios, replace the field name `spec.audit.enabled` with the
correct field name `spec.audit.logging` to match the actual API field being used
in the test cases throughout the file (referenced at lines 385, 390, 395, and
400). This ensures the documentation comment accurately reflects the actual
field name being tested.
- Around line 167-169: The json.Marshal and json.Unmarshal calls are ignoring
their returned errors by using blank identifiers, which can cause silent
failures and panics. Replace the blank identifiers with proper error variables
for both the json.Marshal call that assigns to jsonBytes and the json.Unmarshal
call that assigns to parsed, then add appropriate error checks (such as t.Fatal
or t.Errorf) to handle any errors that occur. Follow the same pattern used in
the TestSerializeCR function which correctly validates these errors.

In `@controller/proposal/audit.go`:
- Line 41: The comment referencing spec.audit.enabled is outdated and does not
reflect the current configuration structure. Update the comment to remove the
reference to spec.audit.enabled and instead document the actual split controls
that now govern audit behavior: spec.audit.logging and spec.audit.otel.endpoint.
Ensure the comment accurately describes how these separate controls manage audit
functionality instead of the previously described single on/off mechanism.
- Around line 132-156: The emitAuditEvent function returns early when auditLog
is nil, which prevents the span.AddEvent tracing code from executing. This
breaks the independence between audit logging and tracing. Restructure the code
to only skip the auditLog.zap.Info call when auditLog is nil, while allowing the
span.AddEvent logic to execute regardless. Move the tracing code (the
span.IsRecording check and span.AddEvent call) outside or after the early return
condition so it always runs when a valid span and context are available.
- Line 79: The Sync() method of the AuditLogger type in
controller/proposal/audit.go currently discards the error from a.zap.Sync()
using a blank identifier, which violates Go error handling guidelines. Change
the method signature to return an error instead of void by adding error as the
return type, then return the error from a.zap.Sync() directly. After making this
change, update the two call sites where Sync() is invoked: one in cmd/main.go at
line 119 and another in controller/proposal/audit_test.go at line 275, to
properly handle and check the returned error.

In `@controller/proposal/handlers.go`:
- Around line 291-321: The approval event emission block in the handlers (both
in the manual approval path with tracer.Start and emitAuditEvent calls, and in
the auto approval path with emitAuditEvent) executes on every retry attempt,
causing duplicate audit events. To fix this, add a guard condition before
emitting the approval events to ensure they only emit on the first execution
attempt. Check if RetryCount is zero or use a status condition marker (similar
to how ProposalConditionAnalyzed is checked) to gate the emission of both the
proposal.human_approval span and the AuditApprovalReceived event, so that
retries skip this entire approval event emission block.

In `@controller/proposal/helpers.go`:
- Around line 130-137: When the statusPatch call in the error handling block
fails, the code currently logs the error but continues to emit the terminal
audit event and returns nil. Instead, return the statusErr immediately when the
status patch fails, so the terminal event emission (including tracer.Start for
proposal.terminal and emitAuditEvent with AuditProposalTerminal) only occurs
after the failed condition is successfully persisted in the API object. This
ensures the controller will retry if the status patch operation fails.

In `@controller/proposal/reconciler.go`:
- Around line 62-64: The condition `if r.Audit != nil` incorrectly gates the
`lifecycleParentContext` call, but trace parenting and audit logging are
separate concerns. When audit logging is disabled but OpenTelemetry is
configured, r.Audit will be nil and lifecycleParentContext will be skipped,
breaking trace parenting. Remove the `if r.Audit != nil` guard and ensure
lifecycleParentContext is called unconditionally for the proposal context, or
add a separate check that verifies whether tracing is enabled instead of relying
on the audit logger's presence.

---

Outside diff comments:
In `@controller/proposal/handlers.go`:
- Around line 690-711: The terminal span and audit event are being emitted
before the status patch is executed and persisted. This creates a race condition
where if the statusPatch call fails, the audit log will indicate the proposal is
terminal while the actual API object remains non-terminal. Move the
tracer.Start("proposal.terminal"), termSpan.SetAttributes, emitAuditEvent with
AuditProposalTerminal, and termSpan.End() calls to occur after the statusPatch
succeeds. This ensures the terminal event is only recorded once the Escalated
condition is actually persisted to the API object, matching the pattern used in
other terminal paths.
- Around line 469-492: The AuditVerificationCompleted event is currently only
emitted on the success path, but a VerificationResult CR is created before the
pass/fail check. Add an emitAuditEvent call for AuditVerificationCompleted
immediately after the successful createVerificationResult call (after the
verifyCRErr nil check in the handler), passing the created verifyCR and
verifyCRName. This ensures the completion event is emitted for all verification
results regardless of pass/fail outcome, and the existing AuditVerificationRetry
event will remain as an additional retry-specific event in the failure path.
- Around line 90-111: The OTEL tracing and audit logging are currently coupled
through conditional gates, causing all tracing to be disabled when audit logging
is disabled. In reconciler.go around line 62, the tracing setup context is gated
behind an `if r.Audit != nil` check - move the trace parent context
initialization outside this gate so tracing proceeds independently of audit
logger availability. Additionally, in audit.go around line 132 in the
emitAuditEvent function, the early return when `auditLog == nil` prevents span
events from being emitted - modify the function to continue processing and
emitting span events to the OTEL tracer even when the audit logger is nil, only
skipping the audit logging portion when the logger is unavailable.

---

Nitpick comments:
In `@controller/proposal/audit_integration_test.go`:
- Around line 410-413: The type assertion for payload in the test at line 410
ignores the ok return value, which can cause silent test failures if the payload
structure is incorrect. Modify the type assertion of
terminal["payload"].(map[string]interface{}) to capture both the payload value
and the ok boolean, then add a check to verify the assertion succeeded before
attempting to access payload["phase"], using t.Errorf or t.Fatalf to report a
failure if the type assertion fails.
- Around line 318-324: The type assertion on execCompleted["payload"] ignores
the ok return value, which means if the assertion fails, payload will be nil and
the subsequent checks on payload["metadata"] and payload["spec"] will not work
as intended. Capture both the value and the ok boolean from the type assertion,
then add a check to ensure the assertion succeeded before proceeding to access
payload fields; if the assertion fails, the test should report an error
indicating that the payload type assertion failed.
- Around line 368-371: The type assertion that extracts payload from the
approval map ignores the ok return value, which can cause silent test failures
if the payload structure differs from expectations. Modify the type assertion to
capture both the result and the boolean success indicator, then add a
conditional check before accessing payload["mode"]. If the assertion fails, use
t.Errorf or t.Fatalf to report that the payload was not of the expected
map[string]interface{} type, ensuring the test explicitly fails rather than
proceeding with potentially nil or incorrect values.
- Around line 465-468: The type assertion for
approval["payload"].(map[string]interface{}) is ignoring the ok return value,
which means the test could silently fail if the payload is not a map. Modify the
line where payload is assigned to capture both the value and the ok boolean
return value, then add a check to verify the assertion succeeded (ok is true)
before proceeding to access payload fields. If the assertion fails, use t.Fatalf
or t.Errorf to fail the test with an appropriate error message indicating that
the payload type assertion failed.
- Around line 170-176: The type assertion for the payload variable and
subsequent assertions for metadata and spec are ignoring their boolean return
values (ok flag), which can cause silent test failures if the payload structure
differs from expectations. Fix this by checking the ok return value from the
payload type assertion and properly validating that the assertion succeeded
before accessing the result. Additionally, validate the ok return values from
the metadata and spec assertions in the same block, following the same pattern
used in TestAuditProposalReceivedOnFirstReconcile, and add appropriate error
handling if any assertion fails.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: dc3a03ca-b7e1-4772-b37c-e3701373310d

📥 Commits

Reviewing files that changed from the base of the PR and between 310f514 and 1033471.

⛔ Files ignored due to path filters (1)
  • go.sum is excluded by !**/*.sum
📒 Files selected for processing (15)
  • .ai/spec/what/audit-logging.md
  • api/v1alpha1/agenticolsconfig_types.go
  • cmd/main.go
  • config/crd/bases/agentic.openshift.io_agenticolsconfigs.yaml
  • controller/proposal/approval.go
  • controller/proposal/audit.go
  • controller/proposal/audit_integration_test.go
  • controller/proposal/audit_test.go
  • controller/proposal/client.go
  • controller/proposal/handlers.go
  • controller/proposal/helpers.go
  • controller/proposal/reconciler.go
  • controller/proposal/results.go
  • controller/setup.go
  • go.mod

Comment thread .ai/spec/what/audit-logging.md Outdated
Comment thread cmd/main.go Outdated
Comment thread cmd/main.go Outdated
Comment thread controller/proposal/audit_test.go Outdated
Comment thread controller/proposal/audit_test.go Outdated
Comment thread controller/proposal/audit.go Outdated

// Sync flushes the audit logger.
func (a *AuditLogger) Sync() {
_ = a.zap.Sync()

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== AuditLogger.Sync definition =="
rg -n --type go 'func \(a \*AuditLogger\) Sync\(' controller/proposal/audit.go -C 3

echo
echo "== Call sites of Sync() (to ensure propagated handling is feasible) =="
rg -n --type go '\.Sync\(\)' -C 2

echo
echo "== Other blank-identifier error discards in Go files (sanity check) =="
rg -n --type go '^\s*_\s*=\s*.*\(' -C 1

Repository: openshift/lightspeed-agentic-operator

Length of output: 4381


Change the Sync() method signature to return an error and handle it at call sites.

The Sync() method at line 79 discards the error from a.zap.Sync() using the blank identifier. Per the coding guidelines, Go files must never ignore error returns. Ignoring audit flush errors on shutdown could hide audit log loss or failures.

Change the method signature to func (a *AuditLogger) Sync() error and propagate the error to the two call sites in cmd/main.go (line 119) and controller/proposal/audit_test.go (line 275).

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@controller/proposal/audit.go` at line 79, The Sync() method of the
AuditLogger type in controller/proposal/audit.go currently discards the error
from a.zap.Sync() using a blank identifier, which violates Go error handling
guidelines. Change the method signature to return an error instead of void by
adding error as the return type, then return the error from a.zap.Sync()
directly. After making this change, update the two call sites where Sync() is
invoked: one in cmd/main.go at line 119 and another in
controller/proposal/audit_test.go at line 275, to properly handle and check the
returned error.

Source: Coding guidelines

Comment thread controller/proposal/audit.go Outdated
Comment thread controller/proposal/handlers.go Outdated
Comment thread controller/proposal/helpers.go
Comment thread controller/proposal/reconciler.go Outdated
vimalk78 and others added 2 commits June 18, 2026 16:17
Structured JSON audit events emitted to stdout at every proposal phase
transition. OTEL trace spans provide distributed tracing with W3C
traceparent propagation to the sandbox.

Audit logging:
- Dedicated AuditLogger (independent zap, immune to operator log levels)
- spec.audit.logging controls stdout JSON (default: true)
- spec.audit.otel.endpoint controls OTEL export (independent of logging)
- spec.audit.otel.insecure controls TLS for OTLP connection (default: true)
- 8 audit events covering full proposal lifecycle
- CR serialization (.spec + .status + select metadata) on all result events

OTEL tracing:
- Span hierarchy: proposal.received (root) -> analyze, human_approval,
  execute, verify, escalate, terminal (children)
- Root span ID stored in annotation for cross-reconcile parenting
- human_approval span captures human decision wait time (manual only)
- traceparent header on /v1/agent/run calls for sandbox propagation

Mutating webhook:
- Injects approver identity (uid, username, timestamp) from AdmissionReview
- Emits audit.approval.received with user identity and trace_id
- Fail-closed policy

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Vimal Kumar <vimal78@gmail.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Vimal Kumar <vimal78@gmail.com>
@vimalk78 vimalk78 force-pushed the ols-3024-audit-logging branch from 1033471 to 1b60d9e Compare June 18, 2026 10:48
@openshift-ci

openshift-ci Bot commented Jun 18, 2026

Copy link
Copy Markdown

@vimalk78: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@cmd/main.go`:
- Around line 121-124: The defer function containing auditLogger.Sync() is
currently suppressing all errors with the blank identifier, which masks real
flush failures. Instead of discarding all errors indiscriminately, capture the
error returned from auditLogger.Sync() and explicitly check if it is the known
benign "invalid argument" (EINVAL) error that occurs on many platforms when
syncing stdout. Only suppress this specific error while handling or logging any
other unexpected errors that might indicate real problems during flush
operations.
- Around line 157-159: The os.Exit(1) call in the mgr.Start error handling block
bypasses deferred cleanup functions like tp.Shutdown and auditLogger.Sync,
causing telemetry data to be lost. Refactor the main function by extracting the
manager start logic into a helper function (e.g., run() that returns an int
error code) so that deferred telemetry cleanup executes before os.Exit, or
alternatively explicitly call tp.Shutdown() and auditLogger.Sync() before
os.Exit(1) in the error path after mgr.Start fails.
- Around line 111-115: The directClient.Get() call silently falls back on all
errors, allowing RBAC and API failures to disable OTEL tracing or audit logging
without any indication. Replace context.Background() with a context that has a
timeout deadline, then explicitly check the error returned from
directClient.Get() to only silently continue for NotFound errors. For other
error types (such as permission denied or connection failures), log the error
appropriately and either fail closed or apply conservative defaults to ensure
security configurations are not accidentally disabled by API failures.
- Around line 129-130: The tp.Shutdown() call at line 130 and the OTEL exporter
creation at line 180 both use context.Background() without timeouts, which can
cause indefinite hangs if the OTEL collector is unreachable. Replace
context.Background() with context.WithTimeout() at both locations to add
appropriate timeout durations (e.g., 10-30 seconds), ensuring that exporter
creation and shutdown operations fail fast rather than blocking indefinitely
during operator startup or graceful termination.

In `@controller/proposal/audit_integration_test.go`:
- Line 129: The reconcileOnce function calls throughout the test file are
ignoring error return values, which violates Go error handling guidelines. For
each call to reconcileOnce (at lines 129, 160-161, 202-206, 247-255, 307-310,
358-360, 385-387, 402, 427, 454-457, and 482), capture the returned error and
add an assertion to verify it is nil using a pattern like err :=
reconcileOnce(...); if err != nil { t.Fatal(err) }. This ensures that test setup
failures are caught immediately rather than allowing broken state to propagate
to later assertions.
- Around line 42-54: The auditEventsFromBuffer function silently ignores JSON
unmarshal errors when parsing audit event lines, causing malformed audit output
to be skipped without any indication of failure. This masks test failures and
hides bugs in audit emission. Modify the function to properly surface unmarshal
errors instead of continuing silently when err != nil in the json.Unmarshal
call. Either return an error from auditEventsFromBuffer or use testing.T to fail
the test when unmarshal fails, depending on how this function is called in
tests.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 39f44ac1-3b51-4cac-b6e3-1f39356acffd

📥 Commits

Reviewing files that changed from the base of the PR and between 1033471 and 1b60d9e.

⛔ Files ignored due to path filters (2)
  • config/crd/bases/agentic.openshift.io_agenticolsconfigs.yaml is excluded by !config/crd/bases/**
  • go.sum is excluded by !**/*.sum, !go.sum
📒 Files selected for processing (14)
  • .ai/spec/what/audit-logging.md
  • api/v1alpha1/agenticolsconfig_types.go
  • cmd/main.go
  • controller/proposal/approval.go
  • controller/proposal/audit.go
  • controller/proposal/audit_integration_test.go
  • controller/proposal/audit_test.go
  • controller/proposal/client.go
  • controller/proposal/handlers.go
  • controller/proposal/helpers.go
  • controller/proposal/reconciler.go
  • controller/proposal/results.go
  • controller/setup.go
  • go.mod
🔗 Linked repositories identified

CodeRabbit considers these linked repositories for cross-repo context during reviews:

  • openshift/lightspeed-agentic-sandbox (manual)
🚧 Files skipped from review as they are similar to previous changes (12)
  • controller/proposal/helpers.go
  • controller/proposal/client.go
  • controller/setup.go
  • .ai/spec/what/audit-logging.md
  • controller/proposal/approval.go
  • controller/proposal/reconciler.go
  • controller/proposal/results.go
  • go.mod
  • api/v1alpha1/agenticolsconfig_types.go
  • controller/proposal/audit_test.go
  • controller/proposal/audit.go
  • controller/proposal/handlers.go

Comment thread cmd/main.go
Comment on lines +111 to +115
if err := directClient.Get(context.Background(), client.ObjectKey{Name: "cluster"}, &agenticConfig); err == nil {
auditLogging = proposal.ResolveLoggingEnabled(agenticConfig.Spec.Audit)
otelEndpoint = proposal.ResolveOTELEndpoint(agenticConfig.Spec.Audit)
otelInsecure = proposal.ResolveOTELInsecure(agenticConfig.Spec.Audit)
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Handle audit config read failures explicitly.

Line 111 silently falls back for every Get error, so RBAC/API failures can disable configured OTEL tracing or apply the wrong audit logging mode without any signal. Default only for NotFound; log/fail closed for other errors, and bound the API call with a timeout.

Suggested fix
+	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
+	defer cancel()
 	var agenticConfig agenticv1alpha1.AgenticOLSConfig
-	if err := directClient.Get(context.Background(), client.ObjectKey{Name: "cluster"}, &agenticConfig); err == nil {
+	if err := directClient.Get(ctx, client.ObjectKey{Name: "cluster"}, &agenticConfig); err != nil {
+		if !apierrors.IsNotFound(err) {
+			log.Error(err, "unable to read AgenticOLSConfig audit config")
+			os.Exit(1)
+		}
+	} else {
 		auditLogging = proposal.ResolveLoggingEnabled(agenticConfig.Spec.Audit)
 		otelEndpoint = proposal.ResolveOTELEndpoint(agenticConfig.Spec.Audit)
 		otelInsecure = proposal.ResolveOTELInsecure(agenticConfig.Spec.Audit)
 	}

As per coding guidelines, Go security requires “Never ignore error returns” and “context.Context for cancellation and timeouts”.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cmd/main.go` around lines 111 - 115, The directClient.Get() call silently
falls back on all errors, allowing RBAC and API failures to disable OTEL tracing
or audit logging without any indication. Replace context.Background() with a
context that has a timeout deadline, then explicitly check the error returned
from directClient.Get() to only silently continue for NotFound errors. For other
error types (such as permission denied or connection failures), log the error
appropriately and either fail closed or apply conservative defaults to ensure
security configurations are not accidentally disabled by API failures.

Source: Coding guidelines

Comment thread cmd/main.go
Comment on lines +121 to +124
defer func() {
// Sync stdout returns "invalid argument" on many platforms — benign.
_ = auditLogger.Sync()
}()

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Don’t suppress every audit logger sync error.

Line 123 hides real flush failures along with the known stdout EINVAL case. Handle or explicitly filter the benign error instead of discarding all errors.

Suggested fix
 	defer func() {
 		// Sync stdout returns "invalid argument" on many platforms — benign.
-		_ = auditLogger.Sync()
+		if err := auditLogger.Sync(); err != nil && !errors.Is(err, syscall.EINVAL) {
+			log.Error(err, "failed to sync audit logger")
+		}
 	}()

As per coding guidelines, Go security requires “Never ignore error returns”.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
defer func() {
// Sync stdout returns "invalid argument" on many platforms — benign.
_ = auditLogger.Sync()
}()
defer func() {
// Sync stdout returns "invalid argument" on many platforms — benign.
if err := auditLogger.Sync(); err != nil && !errors.Is(err, syscall.EINVAL) {
log.Error(err, "failed to sync audit logger")
}
}()
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cmd/main.go` around lines 121 - 124, The defer function containing
auditLogger.Sync() is currently suppressing all errors with the blank
identifier, which masks real flush failures. Instead of discarding all errors
indiscriminately, capture the error returned from auditLogger.Sync() and
explicitly check if it is the known benign "invalid argument" (EINVAL) error
that occurs on many platforms when syncing stdout. Only suppress this specific
error while handling or logging any other unexpected errors that might indicate
real problems during flush operations.

Source: Coding guidelines

Comment thread cmd/main.go
Comment on lines +129 to +130
defer func() {
if err := tp.Shutdown(context.Background()); err != nil {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Description: Inspect OTEL init/shutdown call sites and any documented shutdown grace period.
rg -n -C3 'initTracerProvider|Shutdown\(context\.Background\(\)|otlptracegrpc\.New|terminationGracePeriodSeconds' cmd/main.go config/ .ai/spec/ README.md

Repository: openshift/lightspeed-agentic-operator

Length of output: 1158


🏁 Script executed:

#!/bin/bash
# Read the full initTracerProvider function to see ctx definition and usage
sed -n '163,203p' cmd/main.go

Repository: openshift/lightspeed-agentic-operator

Length of output: 1325


🏁 Script executed:

#!/bin/bash
# Check for timeout patterns in main.go and deployment config
echo "=== Timeout patterns in main.go ==="
grep -n "WithTimeout\|time.Second\|context.WithDeadline" cmd/main.go

echo -e "\n=== Deployment terminationGracePeriodSeconds ==="
find config/ -name "*.yaml" -o -name "*.yml" | xargs grep -l "terminationGracePeriodSeconds" 2>/dev/null || echo "Not found in config/"

echo -e "\n=== Manager deployment configuration ==="
cat config/manager/manager.yaml 2>/dev/null | head -80

Repository: openshift/lightspeed-agentic-operator

Length of output: 2181


🏁 Script executed:

#!/bin/bash
# Check go.mod for OTEL dependencies
echo "=== OTEL dependencies in go.mod ==="
grep -i "otel\|trace" go.mod

echo -e "\n=== Check for OTEL SDK timeout behavior in imports ==="
grep -n "otlptracegrpc\|sdktrace\|context" cmd/main.go | head -20

Repository: openshift/lightspeed-agentic-operator

Length of output: 1319


Add timeout contexts for OTEL exporter creation and shutdown.

Exporter creation at line 180 and shutdown at line 130 use context.Background() without timeouts. If the OTEL collector is unreachable or slow, these operations can hang indefinitely and delay operator startup or graceful termination. Go security guidelines require "context.Context for cancellation and timeouts."

Suggested fix
 	defer func() {
-		if err := tp.Shutdown(context.Background()); err != nil {
+		shutdownCtx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
+		defer cancel()
+		if err := tp.Shutdown(shutdownCtx); err != nil {
 			log.Error(err, "failed to shut down tracer provider")
 		}
 	}()
-	ctx := context.Background()
+	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
+	defer cancel()
 	opts := []otlptracegrpc.Option{
 		otlptracegrpc.WithEndpoint(endpoint),
 	}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cmd/main.go` around lines 129 - 130, The tp.Shutdown() call at line 130 and
the OTEL exporter creation at line 180 both use context.Background() without
timeouts, which can cause indefinite hangs if the OTEL collector is unreachable.
Replace context.Background() with context.WithTimeout() at both locations to add
appropriate timeout durations (e.g., 10-30 seconds), ensuring that exporter
creation and shutdown operations fail fast rather than blocking indefinitely
during operator startup or graceful termination.

Source: Coding guidelines

Comment thread cmd/main.go
Comment on lines 157 to 159
if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
log.Error(err, "problem running manager")
os.Exit(1)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Flush telemetry before exiting on manager errors.

os.Exit(1) skips the deferred tp.Shutdown and auditLogger.Sync above, so batched spans can be lost on the failure path. Explicitly run cleanup before exiting, or move main logic into a run() int helper so defers execute before os.Exit.

Suggested pattern
-	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
-		log.Error(err, "problem running manager")
-		os.Exit(1)
-	}
+	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
+		log.Error(err, "problem running manager")
+		shutdownCtx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
+		if shutdownErr := tp.Shutdown(shutdownCtx); shutdownErr != nil {
+			log.Error(shutdownErr, "failed to shut down tracer provider")
+		}
+		cancel()
+		if auditLogger != nil {
+			if syncErr := auditLogger.Sync(); syncErr != nil {
+				log.Error(syncErr, "failed to sync audit logger")
+			}
+		}
+		os.Exit(1)
+	}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cmd/main.go` around lines 157 - 159, The os.Exit(1) call in the mgr.Start
error handling block bypasses deferred cleanup functions like tp.Shutdown and
auditLogger.Sync, causing telemetry data to be lost. Refactor the main function
by extracting the manager start logic into a helper function (e.g., run() that
returns an int error code) so that deferred telemetry cleanup executes before
os.Exit, or alternatively explicitly call tp.Shutdown() and auditLogger.Sync()
before os.Exit(1) in the error path after mgr.Start fails.

Comment on lines +42 to +54
func auditEventsFromBuffer(buf *bytes.Buffer) []map[string]interface{} {
var events []map[string]interface{}
for _, line := range strings.Split(buf.String(), "\n") {
if line == "" {
continue
}
var parsed map[string]interface{}
if err := json.Unmarshal([]byte(line), &parsed); err == nil {
events = append(events, parsed)
}
}
return events
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Silently ignoring unmarshal errors masks test failures.

Line 49 discards JSON parse errors. If audit output is malformed, tests will pass with an empty events slice instead of failing. This violates the Go guideline to never ignore error returns and hides bugs in audit emission.

🛡️ Proposed fix to surface unmarshal errors
 func auditEventsFromBuffer(buf *bytes.Buffer) []map[string]interface{} {
 	var events []map[string]interface{}
 	for _, line := range strings.Split(buf.String(), "\n") {
 		if line == "" {
 			continue
 		}
 		var parsed map[string]interface{}
-		if err := json.Unmarshal([]byte(line), &parsed); err == nil {
+		if err := json.Unmarshal([]byte(line), &parsed); err != nil {
+			panic(fmt.Sprintf("invalid audit JSON: %v\nLine: %s", err, line))
+		} else {
 			events = append(events, parsed)
 		}
 	}
 	return events
 }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
func auditEventsFromBuffer(buf *bytes.Buffer) []map[string]interface{} {
var events []map[string]interface{}
for _, line := range strings.Split(buf.String(), "\n") {
if line == "" {
continue
}
var parsed map[string]interface{}
if err := json.Unmarshal([]byte(line), &parsed); err == nil {
events = append(events, parsed)
}
}
return events
}
func auditEventsFromBuffer(buf *bytes.Buffer) []map[string]interface{} {
var events []map[string]interface{}
for _, line := range strings.Split(buf.String(), "\n") {
if line == "" {
continue
}
var parsed map[string]interface{}
if err := json.Unmarshal([]byte(line), &parsed); err != nil {
panic(fmt.Sprintf("invalid audit JSON: %v\nLine: %s", err, line))
} else {
events = append(events, parsed)
}
}
return events
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@controller/proposal/audit_integration_test.go` around lines 42 - 54, The
auditEventsFromBuffer function silently ignores JSON unmarshal errors when
parsing audit event lines, causing malformed audit output to be skipped without
any indication of failure. This masks test failures and hides bugs in audit
emission. Modify the function to properly surface unmarshal errors instead of
continuing silently when err != nil in the json.Unmarshal call. Either return an
error from auditEventsFromBuffer or use testing.T to fail the test when
unmarshal fails, depending on how this function is called in tests.

Source: Coding guidelines

)

// First reconcile: adds finalizer + emits proposal.received
reconcileOnce(r, "fix-crash")

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Check errors from all reconcileOnce calls.

Multiple tests ignore reconcileOnce errors (lines 129, 160-161, 202-206, 247-255, 307-310, 358-360, 385-387, 402, 427, 454-457, 482). This violates the Go guideline to never ignore error returns. Even when setting up state for later assertions, reconcile failures indicate broken test setup and should fail fast.

🔍 Example fix for TestAuditAnalysisCompleted
 	// First reconcile: adds finalizer + emits proposal.received
-	reconcileOnce(r, "fix-crash")
+	if _, err := reconcileOnce(r, "fix-crash"); err != nil {
+		t.Fatalf("first reconcile error: %v", err)
+	}
 	// Second reconcile: runs analysis
 	_, err := reconcileOnce(r, "fix-crash")

Apply the same pattern to all ignored reconcileOnce calls.

Also applies to: 160-161, 202-206, 247-255, 307-310, 358-360, 385-387, 402-402, 427-427, 454-457, 482-482

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@controller/proposal/audit_integration_test.go` at line 129, The reconcileOnce
function calls throughout the test file are ignoring error return values, which
violates Go error handling guidelines. For each call to reconcileOnce (at lines
129, 160-161, 202-206, 247-255, 307-310, 358-360, 385-387, 402, 427, 454-457,
and 482), capture the returned error and add an assertion to verify it is nil
using a pattern like err := reconcileOnce(...); if err != nil { t.Fatal(err) }.
This ensures that test setup failures are caught immediately rather than
allowing broken state to propagate to later assertions.

Source: Coding guidelines

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants