OLS-3024: Add compliance audit logging to agentic operator#158
OLS-3024: Add compliance audit logging to agentic operator#158vimalk78 wants to merge 2 commits into
Conversation
|
@vimalk78: This pull request references OLS-3024 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the epic to target the "5.0.0" version, but no target version was set. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
📝 WalkthroughSummary by CodeRabbit
WalkthroughAdds structured JSON audit event emission to stdout and OpenTelemetry distributed tracing across the full Proposal lifecycle. Introduces independent ChangesAudit Logging and OTEL Tracing for Proposal Lifecycle
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 11
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
controller/proposal/handlers.go (3)
690-711:⚠️ Potential issue | 🟠 Major | ⚡ Quick winMove the escalated terminal event after the status patch.
proposal.terminalis emitted beforeEscalated=Trueis set and persisted. IfstatusPatchfails, audit/trace says the proposal is terminal while the API object remains non-terminal. Emit the terminal span/event only after the successful status patch, matching the other terminal paths.Proposed fix
} emitAuditEvent(ctx, r.Audit, AuditEscalationCompleted, traceIDFromProposal(proposal), serializeCRWithStatus(escCR, escCR.Spec, escCR.Status)) - termCtx, termSpan := tracer.Start(parentCtx, "proposal.terminal") - termSpan.SetAttributes(attribute.String("phase", string(agenticv1alpha1.ProposalPhaseEscalated))) - emitAuditEvent(termCtx, r.Audit, AuditProposalTerminal, traceIDFromProposal(proposal), terminalPayload{Phase: string(agenticv1alpha1.ProposalPhaseEscalated)}) - termSpan.End() proposal.Status.Steps.Escalation.Results = append(proposal.Status.Steps.Escalation.Results, agenticv1alpha1.StepResultRef{Name: crName, Outcome: agenticv1alpha1.ActionOutcomeFromBool(escalationResult.Success)}) @@ if err := r.statusPatch(ctx, proposal, base); err != nil { return ctrl.Result{}, fmt.Errorf("%s: %w", ErrUpdateToEscalated, err) } + termCtx, termSpan := tracer.Start(parentCtx, "proposal.terminal") + termSpan.SetAttributes(attribute.String("phase", string(agenticv1alpha1.ProposalPhaseEscalated))) + emitAuditEvent(termCtx, r.Audit, AuditProposalTerminal, traceIDFromProposal(proposal), terminalPayload{Phase: string(agenticv1alpha1.ProposalPhaseEscalated)}) + termSpan.End() log.Info("escalation complete", LogKeySummary, escalationResult.Summary)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@controller/proposal/handlers.go` around lines 690 - 711, The terminal span and audit event are being emitted before the status patch is executed and persisted. This creates a race condition where if the statusPatch call fails, the audit log will indicate the proposal is terminal while the actual API object remains non-terminal. Move the tracer.Start("proposal.terminal"), termSpan.SetAttributes, emitAuditEvent with AuditProposalTerminal, and termSpan.End() calls to occur after the statusPatch succeeds. This ensures the terminal event is only recorded once the Escalated condition is actually persisted to the API object, matching the pattern used in other terminal paths.
469-492:⚠️ Potential issue | 🟠 Major | ⚡ Quick winEmit
verification.completedfor failed verification results too.A
VerificationResultCR is created before the pass/fail branch, butAuditVerificationCompletedis emitted only on the success path. Retry and exhausted-retry failures miss the completed event for the created result CR. Emit it immediately after result creation succeeds, then keepAuditVerificationRetryas the additional retry-specific event.Proposed fix
verifyCRName, verifyCR, verifyCRErr := r.createVerificationResult(ctx, proposal, verifyResult, proposal.Status.Steps.Verification.Sandbox, startTime, &completedAt, "") if verifyCRErr != nil { return r.failStep(ctx, proposal, agenticv1alpha1.ProposalConditionVerified, fmt.Errorf("%s: %w", ErrCreateVerificationResult, verifyCRErr)) } + emitAuditEvent(ctx, r.Audit, AuditVerificationCompleted, traceIDFromProposal(proposal), serializeCRWithStatus(verifyCR, verifyCR.Spec, verifyCR.Status)) proposal.Status.Steps.Verification.Results = append(proposal.Status.Steps.Verification.Results, agenticv1alpha1.StepResultRef{Name: verifyCRName, Outcome: agenticv1alpha1.ActionOutcomeFromBool(verifyResult.Success)}) @@ - emitAuditEvent(ctx, r.Audit, AuditVerificationCompleted, traceIDFromProposal(proposal), serializeCRWithStatus(verifyCR, verifyCR.Spec, verifyCR.Status)) termCtx, termSpan := tracer.Start(parentCtx, "proposal.terminal")Also applies to: 542-542
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@controller/proposal/handlers.go` around lines 469 - 492, The AuditVerificationCompleted event is currently only emitted on the success path, but a VerificationResult CR is created before the pass/fail check. Add an emitAuditEvent call for AuditVerificationCompleted immediately after the successful createVerificationResult call (after the verifyCRErr nil check in the handler), passing the created verifyCR and verifyCRName. This ensures the completion event is emitted for all verification results regardless of pass/fail outcome, and the existing AuditVerificationRetry event will remain as an additional retry-specific event in the failure path.
90-111:⚠️ Potential issue | 🟠 MajorDecouple OTEL tracing from stdout audit logging.
Tracing and audit logging are currently coupled despite being independent controls. When
spec.audit.logging=false, theAuditLoggeris nil, which gates both:
- Lifecycle trace parent context setup (reconciler.go, line 62:
if r.Audit != nil)- Span event emission (audit.go, line 132:
if auditLog == nil { return })With an OTEL endpoint configured and stdout logging disabled, all span events and lifecycle context are lost. OTEL tracing should proceed independently. Move tracing setup outside the
r.Audit != nilgate and allowemitAuditEventto emit span events even when the audit logger is nil.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@controller/proposal/handlers.go` around lines 90 - 111, The OTEL tracing and audit logging are currently coupled through conditional gates, causing all tracing to be disabled when audit logging is disabled. In reconciler.go around line 62, the tracing setup context is gated behind an `if r.Audit != nil` check - move the trace parent context initialization outside this gate so tracing proceeds independently of audit logger availability. Additionally, in audit.go around line 132 in the emitAuditEvent function, the early return when `auditLog == nil` prevents span events from being emitted - modify the function to continue processing and emitting span events to the OTEL tracer even when the audit logger is nil, only skipping the audit logging portion when the logger is unavailable.
🧹 Nitpick comments (5)
controller/proposal/audit_integration_test.go (5)
410-413: ⚡ Quick winCheck type assertion success before accessing the result.
The type assertion ignores the
okreturn value, which could cause silent test failures if the payload structure differs from expectations.🔍 Proposed fix
- payload, _ := terminal["payload"].(map[string]interface{}) + payload, ok := terminal["payload"].(map[string]interface{}) + if !ok { + t.Fatal("payload should be an object") + } if payload["phase"] != "Denied" {🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@controller/proposal/audit_integration_test.go` around lines 410 - 413, The type assertion for payload in the test at line 410 ignores the ok return value, which can cause silent test failures if the payload structure is incorrect. Modify the type assertion of terminal["payload"].(map[string]interface{}) to capture both the payload value and the ok boolean, then add a check to verify the assertion succeeded before attempting to access payload["phase"], using t.Errorf or t.Fatalf to report a failure if the type assertion fails.
318-324: ⚡ Quick winCheck type assertion success before accessing the result.
The type assertion ignores the
okreturn value, which could cause silent test failures if the payload structure differs from expectations.🔍 Proposed fix
- payload, _ := execCompleted["payload"].(map[string]interface{}) + payload, ok := execCompleted["payload"].(map[string]interface{}) + if !ok { + t.Fatal("payload should be an object") + } if _, hasMeta := payload["metadata"]; !hasMeta {🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@controller/proposal/audit_integration_test.go` around lines 318 - 324, The type assertion on execCompleted["payload"] ignores the ok return value, which means if the assertion fails, payload will be nil and the subsequent checks on payload["metadata"] and payload["spec"] will not work as intended. Capture both the value and the ok boolean from the type assertion, then add a check to ensure the assertion succeeded before proceeding to access payload fields; if the assertion fails, the test should report an error indicating that the payload type assertion failed.
368-371: ⚡ Quick winCheck type assertion success before accessing the result.
The type assertion ignores the
okreturn value, which could cause silent test failures if the payload structure differs from expectations.🔍 Proposed fix
- payload, _ := approval["payload"].(map[string]interface{}) + payload, ok := approval["payload"].(map[string]interface{}) + if !ok { + t.Fatal("payload should be an object") + } if payload["mode"] != "Automatic" {🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@controller/proposal/audit_integration_test.go` around lines 368 - 371, The type assertion that extracts payload from the approval map ignores the ok return value, which can cause silent test failures if the payload structure differs from expectations. Modify the type assertion to capture both the result and the boolean success indicator, then add a conditional check before accessing payload["mode"]. If the assertion fails, use t.Errorf or t.Fatalf to report that the payload was not of the expected map[string]interface{} type, ensuring the test explicitly fails rather than proceeding with potentially nil or incorrect values.
465-468: ⚡ Quick winCheck type assertion success before accessing the result.
The type assertion ignores the
okreturn value, which could cause silent test failures if the payload structure differs from expectations.🔍 Proposed fix
- payload, _ := approval["payload"].(map[string]interface{}) + payload, ok := approval["payload"].(map[string]interface{}) + if !ok { + t.Fatal("payload should be an object") + } if payload["stage"] != "Execution" {🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@controller/proposal/audit_integration_test.go` around lines 465 - 468, The type assertion for approval["payload"].(map[string]interface{}) is ignoring the ok return value, which means the test could silently fail if the payload is not a map. Modify the line where payload is assigned to capture both the value and the ok boolean return value, then add a check to verify the assertion succeeded (ok is true) before proceeding to access payload fields. If the assertion fails, use t.Fatalf or t.Errorf to fail the test with an appropriate error message indicating that the payload type assertion failed.
170-176: ⚡ Quick winCheck type assertion success before accessing the result.
The type assertions ignore the
okreturn value, which could cause silent test failures if the payload structure differs from expectations. This is inconsistent with TestAuditProposalReceivedOnFirstReconcile (lines 101-108) which properly validates type assertions.🔍 Proposed fix to check assertion success
- payload, _ := completed["payload"].(map[string]interface{}) + payload, ok := completed["payload"].(map[string]interface{}) + if !ok { + t.Fatal("payload should be an object") + } if _, hasMeta := payload["metadata"]; !hasMeta { t.Error("payload should contain metadata (CR serialization)") }Apply the same pattern at line 182:
- termPayload, _ := terminal["payload"].(map[string]interface{}) + termPayload, ok := terminal["payload"].(map[string]interface{}) + if !ok { + t.Fatal("terminal payload should be an object") + } if termPayload["phase"] != "Failed" {Also applies to: 182-185
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@controller/proposal/audit_integration_test.go` around lines 170 - 176, The type assertion for the payload variable and subsequent assertions for metadata and spec are ignoring their boolean return values (ok flag), which can cause silent test failures if the payload structure differs from expectations. Fix this by checking the ok return value from the payload type assertion and properly validating that the assertion succeeded before accessing the result. Additionally, validate the ok return values from the metadata and spec assertions in the same block, following the same pattern used in TestAuditProposalReceivedOnFirstReconcile, and add appropriate error handling if any assertion fails.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In @.ai/spec/what/audit-logging.md:
- Line 24: The sentence in the audit-logging.md file currently states that all
audit events must be emitted from the reconciliation loop, but this contradicts
the definition of audit.approval.received which is emitted from the webhook
flow. Revise the statement to explicitly scope the reconciliation loop
requirement to only reconcile-emitted audit events, and either exclude
webhook-emitted events like audit.approval.received or specify that this rule
applies only to events originating from the reconciliation loop where the
Proposal object is in scope.
In `@cmd/main.go`:
- Around line 180-183: The resource.Merge function call is discarding the error
return value using the blank identifier, which silently ignores merge failures
such as schema URL conflicts. Instead of discarding the error, capture it as a
named return value in the resource.Merge call. Then add error handling logic
that checks if the error is not nil, logs the merge failure with appropriate
details, and applies a fallback strategy similar to what is already implemented
for exporter creation errors in the preceding code section (lines 169-177). This
ensures telemetry resource failures are observable and handled gracefully.
- Around line 169-172: The OTLP exporter configuration in the
otlptracegrpc.New() call includes WithInsecure() which disables TLS and exposes
sensitive trace data in plaintext, and the error returned by resource.Merge() at
line 180 is being discarded without handling. Remove the WithInsecure() option
from the otlptracegrpc.New() call to enforce TLS encryption for secure
transport, and add proper error checking and handling for the resource.Merge()
call so that any error it returns is captured and handled appropriately instead
of being silently discarded.
In `@controller/proposal/audit_test.go`:
- Around line 367-368: The comment block describing test cases contains an
incorrect field name reference. In the comment lines that describe the CR
configuration scenarios, replace the field name `spec.audit.enabled` with the
correct field name `spec.audit.logging` to match the actual API field being used
in the test cases throughout the file (referenced at lines 385, 390, 395, and
400). This ensures the documentation comment accurately reflects the actual
field name being tested.
- Around line 167-169: The json.Marshal and json.Unmarshal calls are ignoring
their returned errors by using blank identifiers, which can cause silent
failures and panics. Replace the blank identifiers with proper error variables
for both the json.Marshal call that assigns to jsonBytes and the json.Unmarshal
call that assigns to parsed, then add appropriate error checks (such as t.Fatal
or t.Errorf) to handle any errors that occur. Follow the same pattern used in
the TestSerializeCR function which correctly validates these errors.
In `@controller/proposal/audit.go`:
- Line 41: The comment referencing spec.audit.enabled is outdated and does not
reflect the current configuration structure. Update the comment to remove the
reference to spec.audit.enabled and instead document the actual split controls
that now govern audit behavior: spec.audit.logging and spec.audit.otel.endpoint.
Ensure the comment accurately describes how these separate controls manage audit
functionality instead of the previously described single on/off mechanism.
- Around line 132-156: The emitAuditEvent function returns early when auditLog
is nil, which prevents the span.AddEvent tracing code from executing. This
breaks the independence between audit logging and tracing. Restructure the code
to only skip the auditLog.zap.Info call when auditLog is nil, while allowing the
span.AddEvent logic to execute regardless. Move the tracing code (the
span.IsRecording check and span.AddEvent call) outside or after the early return
condition so it always runs when a valid span and context are available.
- Line 79: The Sync() method of the AuditLogger type in
controller/proposal/audit.go currently discards the error from a.zap.Sync()
using a blank identifier, which violates Go error handling guidelines. Change
the method signature to return an error instead of void by adding error as the
return type, then return the error from a.zap.Sync() directly. After making this
change, update the two call sites where Sync() is invoked: one in cmd/main.go at
line 119 and another in controller/proposal/audit_test.go at line 275, to
properly handle and check the returned error.
In `@controller/proposal/handlers.go`:
- Around line 291-321: The approval event emission block in the handlers (both
in the manual approval path with tracer.Start and emitAuditEvent calls, and in
the auto approval path with emitAuditEvent) executes on every retry attempt,
causing duplicate audit events. To fix this, add a guard condition before
emitting the approval events to ensure they only emit on the first execution
attempt. Check if RetryCount is zero or use a status condition marker (similar
to how ProposalConditionAnalyzed is checked) to gate the emission of both the
proposal.human_approval span and the AuditApprovalReceived event, so that
retries skip this entire approval event emission block.
In `@controller/proposal/helpers.go`:
- Around line 130-137: When the statusPatch call in the error handling block
fails, the code currently logs the error but continues to emit the terminal
audit event and returns nil. Instead, return the statusErr immediately when the
status patch fails, so the terminal event emission (including tracer.Start for
proposal.terminal and emitAuditEvent with AuditProposalTerminal) only occurs
after the failed condition is successfully persisted in the API object. This
ensures the controller will retry if the status patch operation fails.
In `@controller/proposal/reconciler.go`:
- Around line 62-64: The condition `if r.Audit != nil` incorrectly gates the
`lifecycleParentContext` call, but trace parenting and audit logging are
separate concerns. When audit logging is disabled but OpenTelemetry is
configured, r.Audit will be nil and lifecycleParentContext will be skipped,
breaking trace parenting. Remove the `if r.Audit != nil` guard and ensure
lifecycleParentContext is called unconditionally for the proposal context, or
add a separate check that verifies whether tracing is enabled instead of relying
on the audit logger's presence.
---
Outside diff comments:
In `@controller/proposal/handlers.go`:
- Around line 690-711: The terminal span and audit event are being emitted
before the status patch is executed and persisted. This creates a race condition
where if the statusPatch call fails, the audit log will indicate the proposal is
terminal while the actual API object remains non-terminal. Move the
tracer.Start("proposal.terminal"), termSpan.SetAttributes, emitAuditEvent with
AuditProposalTerminal, and termSpan.End() calls to occur after the statusPatch
succeeds. This ensures the terminal event is only recorded once the Escalated
condition is actually persisted to the API object, matching the pattern used in
other terminal paths.
- Around line 469-492: The AuditVerificationCompleted event is currently only
emitted on the success path, but a VerificationResult CR is created before the
pass/fail check. Add an emitAuditEvent call for AuditVerificationCompleted
immediately after the successful createVerificationResult call (after the
verifyCRErr nil check in the handler), passing the created verifyCR and
verifyCRName. This ensures the completion event is emitted for all verification
results regardless of pass/fail outcome, and the existing AuditVerificationRetry
event will remain as an additional retry-specific event in the failure path.
- Around line 90-111: The OTEL tracing and audit logging are currently coupled
through conditional gates, causing all tracing to be disabled when audit logging
is disabled. In reconciler.go around line 62, the tracing setup context is gated
behind an `if r.Audit != nil` check - move the trace parent context
initialization outside this gate so tracing proceeds independently of audit
logger availability. Additionally, in audit.go around line 132 in the
emitAuditEvent function, the early return when `auditLog == nil` prevents span
events from being emitted - modify the function to continue processing and
emitting span events to the OTEL tracer even when the audit logger is nil, only
skipping the audit logging portion when the logger is unavailable.
---
Nitpick comments:
In `@controller/proposal/audit_integration_test.go`:
- Around line 410-413: The type assertion for payload in the test at line 410
ignores the ok return value, which can cause silent test failures if the payload
structure is incorrect. Modify the type assertion of
terminal["payload"].(map[string]interface{}) to capture both the payload value
and the ok boolean, then add a check to verify the assertion succeeded before
attempting to access payload["phase"], using t.Errorf or t.Fatalf to report a
failure if the type assertion fails.
- Around line 318-324: The type assertion on execCompleted["payload"] ignores
the ok return value, which means if the assertion fails, payload will be nil and
the subsequent checks on payload["metadata"] and payload["spec"] will not work
as intended. Capture both the value and the ok boolean from the type assertion,
then add a check to ensure the assertion succeeded before proceeding to access
payload fields; if the assertion fails, the test should report an error
indicating that the payload type assertion failed.
- Around line 368-371: The type assertion that extracts payload from the
approval map ignores the ok return value, which can cause silent test failures
if the payload structure differs from expectations. Modify the type assertion to
capture both the result and the boolean success indicator, then add a
conditional check before accessing payload["mode"]. If the assertion fails, use
t.Errorf or t.Fatalf to report that the payload was not of the expected
map[string]interface{} type, ensuring the test explicitly fails rather than
proceeding with potentially nil or incorrect values.
- Around line 465-468: The type assertion for
approval["payload"].(map[string]interface{}) is ignoring the ok return value,
which means the test could silently fail if the payload is not a map. Modify the
line where payload is assigned to capture both the value and the ok boolean
return value, then add a check to verify the assertion succeeded (ok is true)
before proceeding to access payload fields. If the assertion fails, use t.Fatalf
or t.Errorf to fail the test with an appropriate error message indicating that
the payload type assertion failed.
- Around line 170-176: The type assertion for the payload variable and
subsequent assertions for metadata and spec are ignoring their boolean return
values (ok flag), which can cause silent test failures if the payload structure
differs from expectations. Fix this by checking the ok return value from the
payload type assertion and properly validating that the assertion succeeded
before accessing the result. Additionally, validate the ok return values from
the metadata and spec assertions in the same block, following the same pattern
used in TestAuditProposalReceivedOnFirstReconcile, and add appropriate error
handling if any assertion fails.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: dc3a03ca-b7e1-4772-b37c-e3701373310d
⛔ Files ignored due to path filters (1)
go.sumis excluded by!**/*.sum
📒 Files selected for processing (15)
.ai/spec/what/audit-logging.mdapi/v1alpha1/agenticolsconfig_types.gocmd/main.goconfig/crd/bases/agentic.openshift.io_agenticolsconfigs.yamlcontroller/proposal/approval.gocontroller/proposal/audit.gocontroller/proposal/audit_integration_test.gocontroller/proposal/audit_test.gocontroller/proposal/client.gocontroller/proposal/handlers.gocontroller/proposal/helpers.gocontroller/proposal/reconciler.gocontroller/proposal/results.gocontroller/setup.gogo.mod
|
|
||
| // Sync flushes the audit logger. | ||
| func (a *AuditLogger) Sync() { | ||
| _ = a.zap.Sync() |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "== AuditLogger.Sync definition =="
rg -n --type go 'func \(a \*AuditLogger\) Sync\(' controller/proposal/audit.go -C 3
echo
echo "== Call sites of Sync() (to ensure propagated handling is feasible) =="
rg -n --type go '\.Sync\(\)' -C 2
echo
echo "== Other blank-identifier error discards in Go files (sanity check) =="
rg -n --type go '^\s*_\s*=\s*.*\(' -C 1Repository: openshift/lightspeed-agentic-operator
Length of output: 4381
Change the Sync() method signature to return an error and handle it at call sites.
The Sync() method at line 79 discards the error from a.zap.Sync() using the blank identifier. Per the coding guidelines, Go files must never ignore error returns. Ignoring audit flush errors on shutdown could hide audit log loss or failures.
Change the method signature to func (a *AuditLogger) Sync() error and propagate the error to the two call sites in cmd/main.go (line 119) and controller/proposal/audit_test.go (line 275).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@controller/proposal/audit.go` at line 79, The Sync() method of the
AuditLogger type in controller/proposal/audit.go currently discards the error
from a.zap.Sync() using a blank identifier, which violates Go error handling
guidelines. Change the method signature to return an error instead of void by
adding error as the return type, then return the error from a.zap.Sync()
directly. After making this change, update the two call sites where Sync() is
invoked: one in cmd/main.go at line 119 and another in
controller/proposal/audit_test.go at line 275, to properly handle and check the
returned error.
Source: Coding guidelines
Structured JSON audit events emitted to stdout at every proposal phase transition. OTEL trace spans provide distributed tracing with W3C traceparent propagation to the sandbox. Audit logging: - Dedicated AuditLogger (independent zap, immune to operator log levels) - spec.audit.logging controls stdout JSON (default: true) - spec.audit.otel.endpoint controls OTEL export (independent of logging) - spec.audit.otel.insecure controls TLS for OTLP connection (default: true) - 8 audit events covering full proposal lifecycle - CR serialization (.spec + .status + select metadata) on all result events OTEL tracing: - Span hierarchy: proposal.received (root) -> analyze, human_approval, execute, verify, escalate, terminal (children) - Root span ID stored in annotation for cross-reconcile parenting - human_approval span captures human decision wait time (manual only) - traceparent header on /v1/agent/run calls for sandbox propagation Mutating webhook: - Injects approver identity (uid, username, timestamp) from AdmissionReview - Emits audit.approval.received with user identity and trace_id - Fail-closed policy Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Vimal Kumar <vimal78@gmail.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Vimal Kumar <vimal78@gmail.com>
1033471 to
1b60d9e
Compare
|
@vimalk78: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
There was a problem hiding this comment.
Actionable comments posted: 6
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@cmd/main.go`:
- Around line 121-124: The defer function containing auditLogger.Sync() is
currently suppressing all errors with the blank identifier, which masks real
flush failures. Instead of discarding all errors indiscriminately, capture the
error returned from auditLogger.Sync() and explicitly check if it is the known
benign "invalid argument" (EINVAL) error that occurs on many platforms when
syncing stdout. Only suppress this specific error while handling or logging any
other unexpected errors that might indicate real problems during flush
operations.
- Around line 157-159: The os.Exit(1) call in the mgr.Start error handling block
bypasses deferred cleanup functions like tp.Shutdown and auditLogger.Sync,
causing telemetry data to be lost. Refactor the main function by extracting the
manager start logic into a helper function (e.g., run() that returns an int
error code) so that deferred telemetry cleanup executes before os.Exit, or
alternatively explicitly call tp.Shutdown() and auditLogger.Sync() before
os.Exit(1) in the error path after mgr.Start fails.
- Around line 111-115: The directClient.Get() call silently falls back on all
errors, allowing RBAC and API failures to disable OTEL tracing or audit logging
without any indication. Replace context.Background() with a context that has a
timeout deadline, then explicitly check the error returned from
directClient.Get() to only silently continue for NotFound errors. For other
error types (such as permission denied or connection failures), log the error
appropriately and either fail closed or apply conservative defaults to ensure
security configurations are not accidentally disabled by API failures.
- Around line 129-130: The tp.Shutdown() call at line 130 and the OTEL exporter
creation at line 180 both use context.Background() without timeouts, which can
cause indefinite hangs if the OTEL collector is unreachable. Replace
context.Background() with context.WithTimeout() at both locations to add
appropriate timeout durations (e.g., 10-30 seconds), ensuring that exporter
creation and shutdown operations fail fast rather than blocking indefinitely
during operator startup or graceful termination.
In `@controller/proposal/audit_integration_test.go`:
- Line 129: The reconcileOnce function calls throughout the test file are
ignoring error return values, which violates Go error handling guidelines. For
each call to reconcileOnce (at lines 129, 160-161, 202-206, 247-255, 307-310,
358-360, 385-387, 402, 427, 454-457, and 482), capture the returned error and
add an assertion to verify it is nil using a pattern like err :=
reconcileOnce(...); if err != nil { t.Fatal(err) }. This ensures that test setup
failures are caught immediately rather than allowing broken state to propagate
to later assertions.
- Around line 42-54: The auditEventsFromBuffer function silently ignores JSON
unmarshal errors when parsing audit event lines, causing malformed audit output
to be skipped without any indication of failure. This masks test failures and
hides bugs in audit emission. Modify the function to properly surface unmarshal
errors instead of continuing silently when err != nil in the json.Unmarshal
call. Either return an error from auditEventsFromBuffer or use testing.T to fail
the test when unmarshal fails, depending on how this function is called in
tests.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 39f44ac1-3b51-4cac-b6e3-1f39356acffd
⛔ Files ignored due to path filters (2)
config/crd/bases/agentic.openshift.io_agenticolsconfigs.yamlis excluded by!config/crd/bases/**go.sumis excluded by!**/*.sum,!go.sum
📒 Files selected for processing (14)
.ai/spec/what/audit-logging.mdapi/v1alpha1/agenticolsconfig_types.gocmd/main.gocontroller/proposal/approval.gocontroller/proposal/audit.gocontroller/proposal/audit_integration_test.gocontroller/proposal/audit_test.gocontroller/proposal/client.gocontroller/proposal/handlers.gocontroller/proposal/helpers.gocontroller/proposal/reconciler.gocontroller/proposal/results.gocontroller/setup.gogo.mod
🔗 Linked repositories identified
CodeRabbit considers these linked repositories for cross-repo context during reviews:
openshift/lightspeed-agentic-sandbox(manual)
🚧 Files skipped from review as they are similar to previous changes (12)
- controller/proposal/helpers.go
- controller/proposal/client.go
- controller/setup.go
- .ai/spec/what/audit-logging.md
- controller/proposal/approval.go
- controller/proposal/reconciler.go
- controller/proposal/results.go
- go.mod
- api/v1alpha1/agenticolsconfig_types.go
- controller/proposal/audit_test.go
- controller/proposal/audit.go
- controller/proposal/handlers.go
| if err := directClient.Get(context.Background(), client.ObjectKey{Name: "cluster"}, &agenticConfig); err == nil { | ||
| auditLogging = proposal.ResolveLoggingEnabled(agenticConfig.Spec.Audit) | ||
| otelEndpoint = proposal.ResolveOTELEndpoint(agenticConfig.Spec.Audit) | ||
| otelInsecure = proposal.ResolveOTELInsecure(agenticConfig.Spec.Audit) | ||
| } |
There was a problem hiding this comment.
Handle audit config read failures explicitly.
Line 111 silently falls back for every Get error, so RBAC/API failures can disable configured OTEL tracing or apply the wrong audit logging mode without any signal. Default only for NotFound; log/fail closed for other errors, and bound the API call with a timeout.
Suggested fix
+ ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
+ defer cancel()
var agenticConfig agenticv1alpha1.AgenticOLSConfig
- if err := directClient.Get(context.Background(), client.ObjectKey{Name: "cluster"}, &agenticConfig); err == nil {
+ if err := directClient.Get(ctx, client.ObjectKey{Name: "cluster"}, &agenticConfig); err != nil {
+ if !apierrors.IsNotFound(err) {
+ log.Error(err, "unable to read AgenticOLSConfig audit config")
+ os.Exit(1)
+ }
+ } else {
auditLogging = proposal.ResolveLoggingEnabled(agenticConfig.Spec.Audit)
otelEndpoint = proposal.ResolveOTELEndpoint(agenticConfig.Spec.Audit)
otelInsecure = proposal.ResolveOTELInsecure(agenticConfig.Spec.Audit)
}As per coding guidelines, Go security requires “Never ignore error returns” and “context.Context for cancellation and timeouts”.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@cmd/main.go` around lines 111 - 115, The directClient.Get() call silently
falls back on all errors, allowing RBAC and API failures to disable OTEL tracing
or audit logging without any indication. Replace context.Background() with a
context that has a timeout deadline, then explicitly check the error returned
from directClient.Get() to only silently continue for NotFound errors. For other
error types (such as permission denied or connection failures), log the error
appropriately and either fail closed or apply conservative defaults to ensure
security configurations are not accidentally disabled by API failures.
Source: Coding guidelines
| defer func() { | ||
| // Sync stdout returns "invalid argument" on many platforms — benign. | ||
| _ = auditLogger.Sync() | ||
| }() |
There was a problem hiding this comment.
Don’t suppress every audit logger sync error.
Line 123 hides real flush failures along with the known stdout EINVAL case. Handle or explicitly filter the benign error instead of discarding all errors.
Suggested fix
defer func() {
// Sync stdout returns "invalid argument" on many platforms — benign.
- _ = auditLogger.Sync()
+ if err := auditLogger.Sync(); err != nil && !errors.Is(err, syscall.EINVAL) {
+ log.Error(err, "failed to sync audit logger")
+ }
}()As per coding guidelines, Go security requires “Never ignore error returns”.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| defer func() { | |
| // Sync stdout returns "invalid argument" on many platforms — benign. | |
| _ = auditLogger.Sync() | |
| }() | |
| defer func() { | |
| // Sync stdout returns "invalid argument" on many platforms — benign. | |
| if err := auditLogger.Sync(); err != nil && !errors.Is(err, syscall.EINVAL) { | |
| log.Error(err, "failed to sync audit logger") | |
| } | |
| }() |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@cmd/main.go` around lines 121 - 124, The defer function containing
auditLogger.Sync() is currently suppressing all errors with the blank
identifier, which masks real flush failures. Instead of discarding all errors
indiscriminately, capture the error returned from auditLogger.Sync() and
explicitly check if it is the known benign "invalid argument" (EINVAL) error
that occurs on many platforms when syncing stdout. Only suppress this specific
error while handling or logging any other unexpected errors that might indicate
real problems during flush operations.
Source: Coding guidelines
| defer func() { | ||
| if err := tp.Shutdown(context.Background()); err != nil { |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Description: Inspect OTEL init/shutdown call sites and any documented shutdown grace period.
rg -n -C3 'initTracerProvider|Shutdown\(context\.Background\(\)|otlptracegrpc\.New|terminationGracePeriodSeconds' cmd/main.go config/ .ai/spec/ README.mdRepository: openshift/lightspeed-agentic-operator
Length of output: 1158
🏁 Script executed:
#!/bin/bash
# Read the full initTracerProvider function to see ctx definition and usage
sed -n '163,203p' cmd/main.goRepository: openshift/lightspeed-agentic-operator
Length of output: 1325
🏁 Script executed:
#!/bin/bash
# Check for timeout patterns in main.go and deployment config
echo "=== Timeout patterns in main.go ==="
grep -n "WithTimeout\|time.Second\|context.WithDeadline" cmd/main.go
echo -e "\n=== Deployment terminationGracePeriodSeconds ==="
find config/ -name "*.yaml" -o -name "*.yml" | xargs grep -l "terminationGracePeriodSeconds" 2>/dev/null || echo "Not found in config/"
echo -e "\n=== Manager deployment configuration ==="
cat config/manager/manager.yaml 2>/dev/null | head -80Repository: openshift/lightspeed-agentic-operator
Length of output: 2181
🏁 Script executed:
#!/bin/bash
# Check go.mod for OTEL dependencies
echo "=== OTEL dependencies in go.mod ==="
grep -i "otel\|trace" go.mod
echo -e "\n=== Check for OTEL SDK timeout behavior in imports ==="
grep -n "otlptracegrpc\|sdktrace\|context" cmd/main.go | head -20Repository: openshift/lightspeed-agentic-operator
Length of output: 1319
Add timeout contexts for OTEL exporter creation and shutdown.
Exporter creation at line 180 and shutdown at line 130 use context.Background() without timeouts. If the OTEL collector is unreachable or slow, these operations can hang indefinitely and delay operator startup or graceful termination. Go security guidelines require "context.Context for cancellation and timeouts."
Suggested fix
defer func() {
- if err := tp.Shutdown(context.Background()); err != nil {
+ shutdownCtx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
+ defer cancel()
+ if err := tp.Shutdown(shutdownCtx); err != nil {
log.Error(err, "failed to shut down tracer provider")
}
}()- ctx := context.Background()
+ ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
+ defer cancel()
opts := []otlptracegrpc.Option{
otlptracegrpc.WithEndpoint(endpoint),
}🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@cmd/main.go` around lines 129 - 130, The tp.Shutdown() call at line 130 and
the OTEL exporter creation at line 180 both use context.Background() without
timeouts, which can cause indefinite hangs if the OTEL collector is unreachable.
Replace context.Background() with context.WithTimeout() at both locations to add
appropriate timeout durations (e.g., 10-30 seconds), ensuring that exporter
creation and shutdown operations fail fast rather than blocking indefinitely
during operator startup or graceful termination.
Source: Coding guidelines
| if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil { | ||
| log.Error(err, "problem running manager") | ||
| os.Exit(1) |
There was a problem hiding this comment.
Flush telemetry before exiting on manager errors.
os.Exit(1) skips the deferred tp.Shutdown and auditLogger.Sync above, so batched spans can be lost on the failure path. Explicitly run cleanup before exiting, or move main logic into a run() int helper so defers execute before os.Exit.
Suggested pattern
- if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
- log.Error(err, "problem running manager")
- os.Exit(1)
- }
+ if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
+ log.Error(err, "problem running manager")
+ shutdownCtx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
+ if shutdownErr := tp.Shutdown(shutdownCtx); shutdownErr != nil {
+ log.Error(shutdownErr, "failed to shut down tracer provider")
+ }
+ cancel()
+ if auditLogger != nil {
+ if syncErr := auditLogger.Sync(); syncErr != nil {
+ log.Error(syncErr, "failed to sync audit logger")
+ }
+ }
+ os.Exit(1)
+ }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@cmd/main.go` around lines 157 - 159, The os.Exit(1) call in the mgr.Start
error handling block bypasses deferred cleanup functions like tp.Shutdown and
auditLogger.Sync, causing telemetry data to be lost. Refactor the main function
by extracting the manager start logic into a helper function (e.g., run() that
returns an int error code) so that deferred telemetry cleanup executes before
os.Exit, or alternatively explicitly call tp.Shutdown() and auditLogger.Sync()
before os.Exit(1) in the error path after mgr.Start fails.
| func auditEventsFromBuffer(buf *bytes.Buffer) []map[string]interface{} { | ||
| var events []map[string]interface{} | ||
| for _, line := range strings.Split(buf.String(), "\n") { | ||
| if line == "" { | ||
| continue | ||
| } | ||
| var parsed map[string]interface{} | ||
| if err := json.Unmarshal([]byte(line), &parsed); err == nil { | ||
| events = append(events, parsed) | ||
| } | ||
| } | ||
| return events | ||
| } |
There was a problem hiding this comment.
Silently ignoring unmarshal errors masks test failures.
Line 49 discards JSON parse errors. If audit output is malformed, tests will pass with an empty events slice instead of failing. This violates the Go guideline to never ignore error returns and hides bugs in audit emission.
🛡️ Proposed fix to surface unmarshal errors
func auditEventsFromBuffer(buf *bytes.Buffer) []map[string]interface{} {
var events []map[string]interface{}
for _, line := range strings.Split(buf.String(), "\n") {
if line == "" {
continue
}
var parsed map[string]interface{}
- if err := json.Unmarshal([]byte(line), &parsed); err == nil {
+ if err := json.Unmarshal([]byte(line), &parsed); err != nil {
+ panic(fmt.Sprintf("invalid audit JSON: %v\nLine: %s", err, line))
+ } else {
events = append(events, parsed)
}
}
return events
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| func auditEventsFromBuffer(buf *bytes.Buffer) []map[string]interface{} { | |
| var events []map[string]interface{} | |
| for _, line := range strings.Split(buf.String(), "\n") { | |
| if line == "" { | |
| continue | |
| } | |
| var parsed map[string]interface{} | |
| if err := json.Unmarshal([]byte(line), &parsed); err == nil { | |
| events = append(events, parsed) | |
| } | |
| } | |
| return events | |
| } | |
| func auditEventsFromBuffer(buf *bytes.Buffer) []map[string]interface{} { | |
| var events []map[string]interface{} | |
| for _, line := range strings.Split(buf.String(), "\n") { | |
| if line == "" { | |
| continue | |
| } | |
| var parsed map[string]interface{} | |
| if err := json.Unmarshal([]byte(line), &parsed); err != nil { | |
| panic(fmt.Sprintf("invalid audit JSON: %v\nLine: %s", err, line)) | |
| } else { | |
| events = append(events, parsed) | |
| } | |
| } | |
| return events | |
| } |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@controller/proposal/audit_integration_test.go` around lines 42 - 54, The
auditEventsFromBuffer function silently ignores JSON unmarshal errors when
parsing audit event lines, causing malformed audit output to be skipped without
any indication of failure. This masks test failures and hides bugs in audit
emission. Modify the function to properly surface unmarshal errors instead of
continuing silently when err != nil in the json.Unmarshal call. Either return an
error from auditEventsFromBuffer or use testing.T to fail the test when
unmarshal fails, depending on how this function is called in tests.
Source: Coding guidelines
| ) | ||
|
|
||
| // First reconcile: adds finalizer + emits proposal.received | ||
| reconcileOnce(r, "fix-crash") |
There was a problem hiding this comment.
Check errors from all reconcileOnce calls.
Multiple tests ignore reconcileOnce errors (lines 129, 160-161, 202-206, 247-255, 307-310, 358-360, 385-387, 402, 427, 454-457, 482). This violates the Go guideline to never ignore error returns. Even when setting up state for later assertions, reconcile failures indicate broken test setup and should fail fast.
🔍 Example fix for TestAuditAnalysisCompleted
// First reconcile: adds finalizer + emits proposal.received
- reconcileOnce(r, "fix-crash")
+ if _, err := reconcileOnce(r, "fix-crash"); err != nil {
+ t.Fatalf("first reconcile error: %v", err)
+ }
// Second reconcile: runs analysis
_, err := reconcileOnce(r, "fix-crash")Apply the same pattern to all ignored reconcileOnce calls.
Also applies to: 160-161, 202-206, 247-255, 307-310, 358-360, 385-387, 402-402, 427-427, 454-457, 482-482
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@controller/proposal/audit_integration_test.go` at line 129, The reconcileOnce
function calls throughout the test file are ignoring error return values, which
violates Go error handling guidelines. For each call to reconcileOnce (at lines
129, 160-161, 202-206, 247-255, 307-310, 358-360, 385-387, 402, 427, 454-457,
and 482), capture the returned error and add an assertion to verify it is nil
using a pattern like err := reconcileOnce(...); if err != nil { t.Fatal(err) }.
This ensures that test setup failures are caught immediately rather than
allowing broken state to propagate to later assertions.
Source: Coding guidelines
Summary
Add compliance audit logging to the agentic operator (OLS-3271 + OLS-3272). Structured JSON audit events are emitted to stdout at every proposal phase
transition. OTEL trace spans provide distributed tracing with W3C traceparent propagation to the sandbox.
Audit logging
AuditLogger(independent zap, immune to operator log level settings)spec.audit.loggingcontrols stdout JSON (default: true)spec.audit.otel.endpointcontrols OTEL export (independent of logging)proposal.received,analysis.completed,approval.received,execution.completed,verification.completed,verification.retry,escalation.completed,proposal.terminal.spec+.status+ select metadata) on all result eventsOTEL tracing
proposal.received(root) →analyze,human_approval,execute,verify,escalate,terminal(children)human_approvalspan captures human decision wait time (manual approval only)traceparentheader on/v1/agent/runHTTP callsagentic-operatorCRD changes
spec.auditblock onAgenticOLSConfigwithlogging(bool) andotel.endpoint(string)Spec updates (second commit)
proposal.received.statusfor result CRsproposal.terminaladded to child spansspec.audit.enabled→spec.audit.logging, logging and OTEL independentTest plan
make testpassesmake api-lintpassesproposal.human_approvalspan with duration🤖 Generated with Claude Code