Skip to content

Hyperfleet-542 : Add OpenTelemetry instrumentation to Sentinel#51

Open
tirthct wants to merge 6 commits intoopenshift-hyperfleet:mainfrom
tirthct:hyperfleet-542
Open

Hyperfleet-542 : Add OpenTelemetry instrumentation to Sentinel#51
tirthct wants to merge 6 commits intoopenshift-hyperfleet:mainfrom
tirthct:hyperfleet-542

Conversation

@tirthct
Copy link
Contributor

@tirthct tirthct commented Feb 25, 2026

Summary
Implements comprehensive OpenTelemetry tracing instrumentation in Sentinel to enable distributed tracing across the HyperFleet orchestration pipeline.

Span hierarchy:
sentinel.poll (root)

  • HTTP GET (API fetch)
    • sentinel.evaluate (resource 1)
      • hyperfleet-clusters publish
    • sentinel.evaluate (resource 2)
      • hyperfleet-clusters publish

Resolves
HYPERFLEET-542

Changed

  • Added pkg/telemetry/otel.go with configurable OpenTelemetry setup
  • Implemented required spans
  • Implemented W3C trace context propagation
  • Added environment variable support for OTEL configuration

Tested

  • Complete parent-child span relationships : tested
  • trace_id/span_id appear in structured logs : tested
  • required spans are created : tested
  • traceparent extension format : tested

Summary by CodeRabbit

  • New Features

    • Opt-in OpenTelemetry tracing with startup toggle, graceful init/shutdown, per-poll/per-resource/publish spans, CloudEvent traceparent propagation, and trace IDs surfaced in logs.
  • Configuration

    • Added messaging_system examples and MessagingSystem setting with env override; logger config gains an OTel enable toggle.
  • Tests

    • Expanded unit, integration, and HTTP instrumentation tests validating spans, propagation, exporters, and end-to-end span hierarchies.
  • Chores

    • Updated OpenTelemetry/OTLP dependencies and added HTTP client instrumentation.

@openshift-ci
Copy link

openshift-ci bot commented Feb 25, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign yingzhanredhat for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coderabbitai
Copy link

coderabbitai bot commented Feb 25, 2026

Walkthrough

Adds OpenTelemetry observability across the codebase: new pkg/telemetry with InitTraceProvider, Shutdown, StartSpan, and SetTraceContext supporting OTLP or stdout exporters and sampler configuration; cmd/sentinel reads TRACING_ENABLED and OTEL_SERVICE_NAME to initialize/shutdown tracing; HTTP clients are wrapped with otelhttp transport; sentinel polling now creates sentinel.poll, sentinel.evaluate, and publish spans, records attributes/errors, and can inject traceparent into CloudEvents. Adds LogConfig.OTel and SentinelConfig.MessagingSystem with example config updates. Extensive unit and integration tests for telemetry were added and go.mod updated with OTEL dependencies.

Sequence Diagram(s)

sequenceDiagram
    participant App as Application
    participant TP as TracerProvider
    participant Exporter as SpanExporter
    participant HTTP as HTTPClient (otelhttp)
    participant Sentinel as Sentinel
    participant Evaluator as Evaluator
    participant Publisher as Publisher
    participant Broker as Broker

    App->>TP: InitTraceProvider(serviceName)
    TP->>Exporter: configure exporter & sampler
    TP-->>App: return TracerProvider

    loop Poll cycle
        Sentinel->>Sentinel: Start "sentinel.poll" span
        Sentinel->>Evaluator: iterate resources
        Evaluator->>Evaluator: Start "sentinel.evaluate" span (attrs)
        Evaluator->>Publisher: Start child "publish" span
        Publisher->>HTTP: Publish(event, publishCtx) (context injected)
        HTTP->>Broker: HTTP publish / broker API
        Broker-->>HTTP: response
        HTTP-->>Publisher: result
        Publisher-->>Evaluator: publish result
        Evaluator->>Evaluator: End "sentinel.evaluate" span
        Sentinel->>Sentinel: End "sentinel.poll" span
    end

    App->>TP: Shutdown(ctx)
    TP->>Exporter: Flush & shutdown
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 46.88% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed Title clearly and concisely describes the main change: adding OpenTelemetry instrumentation to the Sentinel component, which aligns with the comprehensive span hierarchy and tracing infrastructure changes throughout the codebase.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
📝 Coding Plan
  • Generate coding plan for human review comments

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 11

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
internal/sentinel/sentinel.go (1)

113-187: ⚠️ Potential issue | 🔴 Critical

Span leak: evalSpan.End() and publishSpan.End() are skipped on continue paths.

When event.SetData fails (Line 141) or Publish fails (Line 164), the continue statements skip evalSpan.End() at Line 187 and publishSpan.End() at Line 167 respectively. Leaked spans will accumulate in memory and produce incomplete traces.

The safest fix is to end spans via defer or explicit End() calls before each continue.

Proposed fix
 		// span: sentinel.evaluate
 		evalCtx, evalSpan := telemetry.StartSpan(ctx, "sentinel.evaluate",
 			attribute.String("hyperfleet.resource_type", resource.Kind),
 			attribute.String("hyperfleet.resource_id", resource.ID),
 		)
 
 		decision := s.decisionEngine.Evaluate(resource, now)
 
 		if decision.ShouldPublish {
 			pending++
 
 			// Add decision reason to context for structured logging
 			eventCtx := logger.WithDecisionReason(evalCtx, decision.Reason)
 
 			// Create CloudEvent
 			event := cloudevents.NewEvent()
 			event.SetSpecVersion(cloudevents.VersionV1)
 			event.SetType(fmt.Sprintf("com.redhat.hyperfleet.%s.reconcile", resource.Kind))
 			event.SetSource("hyperfleet-sentinel")
 			event.SetID(uuid.New().String())
 			if err := event.SetData(cloudevents.ApplicationJSON, map[string]interface{}{
 				"kind":       resource.Kind,
 				"id":         resource.ID,
 				"generation": resource.Generation,
 				"href":       resource.Href,
 				"reason":     decision.Reason,
 			}); err != nil {
 				s.logger.Errorf(eventCtx, "Failed to set event data resource_id=%s error=%v", resource.ID, err)
+				evalSpan.End()
 				continue
 			}
 
 			// span: publish (child of sentinel.evaluate)
 			publishCtx, publishSpan := telemetry.StartSpan(eventCtx, fmt.Sprintf("%s publish", topic),
 				attribute.String("hyperfleet.resource_type", resource.Kind),
 				attribute.String("hyperfleet.resource_id", resource.ID),
 				attribute.String("hyperfleet.decision_reason", decision.Reason),
 				attribute.String("messaging.system", "gcp_pubsub"),
 				attribute.String("messaging.operation.type", "publish"),
 				attribute.String("messaging.destination.name", topic),
 				attribute.String("messaging.message.id", event.ID()),
 			)
 
 			if publishSpan.SpanContext().IsValid() {
 				telemetry.SetTraceContext(&event, publishSpan)
 			}
 
 			// Publish to broker using configured topic
 			if err := s.publisher.Publish(publishCtx, topic, &event); err != nil {
 				// Record broker error
 				metrics.UpdateBrokerErrorsMetric(resourceType, resourceSelector, "publish_error")
 				s.logger.Errorf(eventCtx, "Failed to publish event resource_id=%s error=%v", resource.ID, err)
+				publishSpan.End()
+				evalSpan.End()
 				continue
 			}
 
 			publishSpan.End()

Alternatively, a cleaner approach would be to use a closure or restructure to use defer-like patterns within the loop body.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/sentinel/sentinel.go` around lines 113 - 187, The spans started by
telemetry.StartSpan (evalSpan and publishSpan) are leaked on early continue
paths (failures in event.SetData and s.publisher.Publish); fix by ensuring each
span is ended on all control paths—either call defer evalSpan.End() immediately
after creating evalSpan and defer publishSpan.End() right after creating
publishSpan (so they run even on continue), or explicitly call publishSpan.End()
and evalSpan.End() just before every continue in the error branches (references:
telemetry.StartSpan, evalSpan, publishSpan, event.SetData, s.publisher.Publish).
🧹 Nitpick comments (4)
internal/sentinel/sentinel_test.go (1)

444-448: Global tracer provider is set but never restored after test.

otel.SetTracerProvider(tp) modifies the global state. While tests in the same package run sequentially by default, this could affect subsequent tests that rely on the global provider state (or lack thereof). Consider saving and restoring the previous provider.

Suggested pattern
 	otel.SetTracerProvider(tp)
+	// Optionally restore previous provider on cleanup
+	// prevTP := otel.GetTracerProvider()
+	// defer otel.SetTracerProvider(prevTP)

This is optional since tests currently run sequentially and the deferred tp.Shutdown will clean up the provider.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/sentinel/sentinel_test.go` around lines 444 - 448, The test sets a
global provider via trace.NewTracerProvider(...) and otel.SetTracerProvider(tp)
but doesn't restore the previous global provider; capture the current provider
with otel.GetTracerProvider() before calling otel.SetTracerProvider(tp), and
defer otel.SetTracerProvider(previousProvider) so global state is restored; keep
the existing deferred tp.Shutdown() to shut down the new provider.
internal/client/client_test.go (1)

796-797: Ignored w.Write error return value.

Line 797 ignores the error from w.Write(...), unlike the rest of the test file which consistently handles write errors.

Suggested fix
-		w.Write([]byte(`{"error": "internal server error"}`))
+		if _, err := w.Write([]byte(`{"error": "internal server error"}`)); err != nil {
+			t.Logf("Error writing response: %v", err)
+		}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/client/client_test.go` around lines 796 - 797, The test HTTP handler
ignores the error returned from w.Write(...) at the end of the handler; update
the handler to check the write error and fail the test if it occurs.
Specifically, replace the bare w.Write(...) call following
w.WriteHeader(http.StatusInternalServerError) with a checked write (capture the
return error from w.Write and call t.Fatalf or t.Errorf with the error), using
the same testing.T variable used elsewhere in the file so failures are reported
consistently.
test/integration/integration_test.go (1)

459-464: InitTraceProvider with stdout exporter will write span output to stdout during test runs.

Since no OTEL_EXPORTER_OTLP_ENDPOINT env var is set in the test, InitTraceProvider falls back to the stdout exporter, which will print formatted span data to stdout during test execution. This is noisy but functional. If this is intentional for debugging, it's fine; otherwise consider using an in-memory exporter as done in TestIntegration_EndToEndSpanHierarchy.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/integration/integration_test.go` around lines 459 - 464, The test
currently calls telemetry.InitTraceProvider(ctx, "sentinel", "test", 1.0) which
falls back to the stdout exporter and prints spans to stdout; change the test to
use the in-memory exporter instead of stdout (or set
OTEL_EXPORTER_OTLP_ENDPOINT) by invoking the InitTraceProvider variant or option
the project uses for in-memory tracing (match the pattern used in
TestIntegration_EndToEndSpanHierarchy), and keep the deferred
telemetry.Shutdown(tp) call; update the call site in integration_test.go to use
that in-memory configuration so spans are not emitted to stdout during test
runs.
pkg/logger/logger.go (1)

46-50: Consider validating SamplingRate range.

SamplingRate is a float64 with no validation at the config level. While InitTraceProvider in pkg/telemetry/otel.go handles edge cases (≥1.0 → AlwaysSample, ≤0.0 → NeverSample), a negative or >1.0 value could be confusing to operators. Consider documenting the valid range or adding a validation method on OTelConfig.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/logger/logger.go` around lines 46 - 50, Add explicit validation for
OTelConfig.SamplingRate by implementing a validation method (e.g., func (c
*OTelConfig) Validate() error) that ensures SamplingRate is within the inclusive
range 0.0–1.0 and returns a clear error if out of range; reference the
OTelConfig struct and call this Validate method where configuration is loaded or
before InitTraceProvider is invoked so invalid values error early (include a
helpful message indicating the valid range).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cmd/sentinel/main.go`:
- Around line 257-262: The telemetry shutdown is using the same shutdownCtx as
metricsServer.Shutdown and the log call uses the already-cancelled ctx; create a
separate context with its own timeout for telemetry (e.g., teleCtx via
context.WithTimeout) and call telemetry.Shutdown(teleCtx, tp) so it has its own
budget independent of metricsServer.Shutdown, ensure you cancel the teleCtx
after use, and change the error log invocation (log.Extra(...).Error) to use the
telemetry context (teleCtx) or the appropriate non-cancelled shutdown context so
the log is not emitted with a canceled ctx; reference telemetry.Shutdown,
metricsServer.Shutdown, shutdownCtx, tp, and log.Extra to locate and update the
code.
- Around line 153-169: The OTel settings never get set because initLogging uses
DefaultConfig() which leaves LogConfig.OTel.Enabled false; update initLogging to
populate logCfg.OTel.Enabled and logCfg.OTel.SamplingRate from configuration
sources (either the loaded SentinelConfig/YAML or environment variables) so the
check in main.go that inspects logCfg.OTel.Enabled can actually enable tracing;
specifically, within initLogging (and where config.LoadConfig/SentinelConfig is
processed) read an OTEL_ENABLED (boolean) env var or map the YAML fields into
LogConfig.OTel (including SamplingRate) and assign them to logCfg before main.go
calls telemetry.InitTraceProvider (referencing symbols: initLogging, LogConfig,
DefaultConfig, logCfg.OTel.Enabled, logCfg.OTel.SamplingRate,
config.LoadConfig).

In `@internal/sentinel/sentinel.go`:
- Around line 149-150: The attribute.String("messaging.system", "gcp_pubsub") is
hardcoded but should come from configuration; change the code that builds the
span/attributes (the attribute.String calls around
messaging.operation.type/publish in sentinel.go) to read the messaging system
from a configuration value or environment variable (e.g., cfg.MessagingSystem or
os.Getenv("MESSAGING_SYSTEM")) and use that value when constructing
attribute.String("messaging.system", <configuredValue>), with a sensible default
(e.g., "rabbitmq") if unset.

In `@pkg/telemetry/otel_test.go`:
- Around line 308-326: TestSetTraceContext_InvalidSpan currently ends a real
span which doesn't make SpanContext invalid; replace creation of the span with a
no-op/non-recording span (e.g., obtain span via
trace.SpanFromContext(context.Background())) so span.SpanContext().IsValid() is
false and the !IsValid() branch in SetTraceContext is exercised, then call
SetTraceContext(&event, span) and assert that no "traceparent" extension was
added to event.Extensions(); update imports accordingly to reference
go.opentelemetry.io/otel/trace if needed.
- Around line 42-77: TestInitTraceProvider_OTLPExporter currently points
OTEL_EXPORTER_OTLP_ENDPOINT at a non-existent collector and then calls Shutdown
which blocks/fails; fix by making the test use a local httptest server that
responds 200 to POST /v1/traces and set OTEL_EXPORTER_OTLP_ENDPOINT to that
server's URL (server.URL + "/v1/traces"), so InitTraceProvider, tracer use, and
the deferred Shutdown call complete reliably; update the test to start
httptest.NewServer before setting the env var and close the server at the end,
referencing TestInitTraceProvider_OTLPExporter, InitTraceProvider, and Shutdown
in your changes.
- Around line 65-70: Deferred cleanup closures calling Shutdown currently use
t.Fatal which is unsafe inside defers; change those deferred anonymous functions
(the ones invoking Shutdown(ctx, tp) and similar cleanup calls) to call t.Errorf
with the error when Shutdown returns non-nil instead of t.Fatal/t.Fatalf, so the
test reports the cleanup failure without invoking Fatal after the test body has
returned. Locate the three other deferred closures in this file that call
Shutdown or similar cleanup (same pattern as the closure around Shutdown at
lines noted in the review) and replace their t.Fatal/t.Fatalf calls with
t.Errorf while keeping the rest of the defer logic unchanged.

In `@pkg/telemetry/otel.go`:
- Line 117: Fix the typo in the doc comment for StartSpan: change "conext" to
"context" in the comment that reads "// StartSpan starts a span and enriches
conext with trace/span IDs for logging" so it correctly reads "// StartSpan
starts a span and enriches context with trace/span IDs for logging"; locate the
comment above the StartSpan function in otel.go and update the word only.
- Around line 22-29: The listed package-level string literals
(SAMPLER_ALWAYS_ON, SAMPLER_ALWAYS_OFF, SAMPLER_TRACE_ID_RATIO,
ENV_OTEL_TRACES_SAMPLER, ENV_OTEL_TRACES_SAMPLER_ARG,
ENV_OTEL_EXPORTER_OTLP_ENDPOINT) should be declared as constants instead of
variables; replace the var (...) block in pkg/telemetry/otel.go with a const
(...) block so these identifiers are immutable and cannot be reassigned
elsewhere in the package.
- Around line 139-145: The traceparent being set in SetTraceContext incorrectly
hardcodes the trace-flags byte to "01"; update SetTraceContext to read the
actual flags from span.SpanContext().TraceFlags() and format that byte as two
hex characters when building traceParent (keep the existing TraceID and SpanID
usage and the IsValid() guard). Use the span context's TraceFlags value
(converted to a uint8 and formatted with "%02x") instead of the fixed "-01" so
the traceparent accurately reflects whether the span was sampled.
- Around line 41-46: The code incorrectly passes a full URL from
ENV_OTEL_EXPORTER_OTLP_ENDPOINT into otlptracehttp.WithEndpoint (which expects
host:port), causing mis-parsing; fix by either removing the manual env var
handling and simply calling otlptracehttp.New(ctx) so the SDK reads
OTEL_EXPORTER_OTLP_ENDPOINT itself, or if you must pass it explicitly, replace
otlptracehttp.WithEndpoint(otlpEndpoint) with
otlptracehttp.WithEndpointURL(parsedURL) (use the URL form) and ensure
otlpEndpoint is a properly parsed *url.URL; update error logging in the
otlptracehttp.New/creation branch (log.Errorf) accordingly and remove the
redundant env read of ENV_OTEL_EXPORTER_OTLP_ENDPOINT.

In `@test/integration/integration_test.go`:
- Around line 816-817: The loop using pollSpans[:2] can panic if fewer than 2
spans were captured; update the iteration to safely bound the slice by computing
n := len(pollSpans) and if n > 2 set n = 2, then iterate over pollSpans[:n] (or
otherwise iterate up to min(2, len(pollSpans))). Apply this change around the
existing pollSpans variable and the for _, pollSpan := range pollSpans[:2] loop
so the test never slices beyond the available elements.

---

Outside diff comments:
In `@internal/sentinel/sentinel.go`:
- Around line 113-187: The spans started by telemetry.StartSpan (evalSpan and
publishSpan) are leaked on early continue paths (failures in event.SetData and
s.publisher.Publish); fix by ensuring each span is ended on all control
paths—either call defer evalSpan.End() immediately after creating evalSpan and
defer publishSpan.End() right after creating publishSpan (so they run even on
continue), or explicitly call publishSpan.End() and evalSpan.End() just before
every continue in the error branches (references: telemetry.StartSpan, evalSpan,
publishSpan, event.SetData, s.publisher.Publish).

---

Nitpick comments:
In `@internal/client/client_test.go`:
- Around line 796-797: The test HTTP handler ignores the error returned from
w.Write(...) at the end of the handler; update the handler to check the write
error and fail the test if it occurs. Specifically, replace the bare
w.Write(...) call following w.WriteHeader(http.StatusInternalServerError) with a
checked write (capture the return error from w.Write and call t.Fatalf or
t.Errorf with the error), using the same testing.T variable used elsewhere in
the file so failures are reported consistently.

In `@internal/sentinel/sentinel_test.go`:
- Around line 444-448: The test sets a global provider via
trace.NewTracerProvider(...) and otel.SetTracerProvider(tp) but doesn't restore
the previous global provider; capture the current provider with
otel.GetTracerProvider() before calling otel.SetTracerProvider(tp), and defer
otel.SetTracerProvider(previousProvider) so global state is restored; keep the
existing deferred tp.Shutdown() to shut down the new provider.

In `@pkg/logger/logger.go`:
- Around line 46-50: Add explicit validation for OTelConfig.SamplingRate by
implementing a validation method (e.g., func (c *OTelConfig) Validate() error)
that ensures SamplingRate is within the inclusive range 0.0–1.0 and returns a
clear error if out of range; reference the OTelConfig struct and call this
Validate method where configuration is loaded or before InitTraceProvider is
invoked so invalid values error early (include a helpful message indicating the
valid range).

In `@test/integration/integration_test.go`:
- Around line 459-464: The test currently calls telemetry.InitTraceProvider(ctx,
"sentinel", "test", 1.0) which falls back to the stdout exporter and prints
spans to stdout; change the test to use the in-memory exporter instead of stdout
(or set OTEL_EXPORTER_OTLP_ENDPOINT) by invoking the InitTraceProvider variant
or option the project uses for in-memory tracing (match the pattern used in
TestIntegration_EndToEndSpanHierarchy), and keep the deferred
telemetry.Shutdown(tp) call; update the call site in integration_test.go to use
that in-memory configuration so spans are not emitted to stdout during test
runs.

ℹ️ Review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 09a1a58 and dae968d.

⛔ Files ignored due to path filters (1)
  • go.sum is excluded by !**/*.sum
📒 Files selected for processing (10)
  • cmd/sentinel/main.go
  • go.mod
  • internal/client/client.go
  • internal/client/client_test.go
  • internal/sentinel/sentinel.go
  • internal/sentinel/sentinel_test.go
  • pkg/logger/logger.go
  • pkg/telemetry/otel.go
  • pkg/telemetry/otel_test.go
  • test/integration/integration_test.go

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (2)
cmd/sentinel/main.go (2)

160-176: ⚠️ Potential issue | 🔴 Critical

logCfg.OTel.Enabled is always false — OTel initialization is unreachable dead code.

initLogging constructs LogConfig via logger.DefaultConfig() (which sets OTel.Enabled: false) and only updates Level, Format, and Output. Neither OTEL_ENABLED nor OTEL_SAMPLING_RATE are ever read and applied to logCfg.OTel. As a result, the branch at Line 166 is never taken and tracing is never activated.

This was flagged in a previous review; the fix (reading OTEL_ENABLED/OTEL_SAMPLING_RATE from env vars in initLogging) has not been applied yet.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/sentinel/main.go` around lines 160 - 176, The OTel branch is never
reached because initLogging uses logger.DefaultConfig() which leaves
logCfg.OTel.Enabled false and never reads OTEL_ENABLED/OTEL_SAMPLING_RATE;
update initLogging to read the OTEL_ENABLED and OTEL_SAMPLING_RATE environment
variables and apply them to the returned LogConfig (i.e., set
logCfg.OTel.Enabled based on OTEL_ENABLED parsing and set
logCfg.OTel.SamplingRate from OTEL_SAMPLING_RATE parsing), ensuring any parsing
errors are handled sensibly so the downstream check of logCfg.OTel.Enabled in
main.go will correctly enable telemetry; reference initLogging,
logger.DefaultConfig(), LogConfig and logCfg.OTel.Enabled/SamplingRate in your
changes.

290-295: ⚠️ Potential issue | 🟠 Major

OTel shutdown reuses the shared shutdownCtx, and the error log uses an already-cancelled ctx.

Two pre-existing issues remain unaddressed:

  1. telemetry.Shutdown shares the same shutdownCtx (20 s) already partially consumed by healthServer.Shutdown + metricsServer.Shutdown. If those are slow, the OTel flush budget is eroded.
  2. Line 293 passes the root ctx (cancelled at Line 279) to log.Extra(...).Error(...), so the log call is issued with a cancelled context.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/sentinel/main.go` around lines 290 - 295, The OTel shutdown call reuses
the shared shutdownCtx and logs using the already-cancelled root ctx; change
telemetry.Shutdown(shutdownCtx, tp) to use its own fresh context and ensure the
log call uses a live context: create a new context with timeout (e.g., otelCtx,
otelCancel := context.WithTimeout(context.Background(), 20*time.Second)) and
defer otelCancel(), call telemetry.Shutdown(otelCtx, tp), and pass a
non-cancelled context (e.g., otelCtx or context.Background()) to
log.Extra(...).Error(...) so the shutdown has its own budget and the error is
logged with an active context.
🧹 Nitpick comments (1)
cmd/sentinel/main.go (1)

30-33: serviceName is a runtime-configurable value mixed into the build-info var block.

version, commit, and date are set by ldflags at build time and are effectively constants, while serviceName is mutated at runtime inside runServe (Line 162). Keeping them in the same block is misleading. Consider making serviceName a local variable in runServe or, if it needs to be the default value, a separate const.

♻️ Suggested refactor
 var (
 	version = "dev"
 	commit  = "none"
 	date    = "unknown"
-	serviceName = "sentinel"
 )

Then in runServe:

+	serviceName := "sentinel"
 	if envServiceName := os.Getenv("OTEL_SERVICE_NAME"); envServiceName != "" {
 		serviceName = envServiceName
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/sentinel/main.go` around lines 30 - 33, The serviceName variable should
not live in the same build-info var block as the ldflags-populated constants
(version, commit, date); move serviceName out of that var block and make it
either a local variable inside runServe (referencing runServe) if it is mutated
at runtime, or make it a separate const/var with a default value distinct from
version/commit/date if you need a package-level default. Update any references
to serviceName accordingly (e.g., where runServe mutates it) so you no longer
mix runtime-mutable state with the ldflags-injected build constants.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@cmd/sentinel/main.go`:
- Around line 160-176: The OTel branch is never reached because initLogging uses
logger.DefaultConfig() which leaves logCfg.OTel.Enabled false and never reads
OTEL_ENABLED/OTEL_SAMPLING_RATE; update initLogging to read the OTEL_ENABLED and
OTEL_SAMPLING_RATE environment variables and apply them to the returned
LogConfig (i.e., set logCfg.OTel.Enabled based on OTEL_ENABLED parsing and set
logCfg.OTel.SamplingRate from OTEL_SAMPLING_RATE parsing), ensuring any parsing
errors are handled sensibly so the downstream check of logCfg.OTel.Enabled in
main.go will correctly enable telemetry; reference initLogging,
logger.DefaultConfig(), LogConfig and logCfg.OTel.Enabled/SamplingRate in your
changes.
- Around line 290-295: The OTel shutdown call reuses the shared shutdownCtx and
logs using the already-cancelled root ctx; change
telemetry.Shutdown(shutdownCtx, tp) to use its own fresh context and ensure the
log call uses a live context: create a new context with timeout (e.g., otelCtx,
otelCancel := context.WithTimeout(context.Background(), 20*time.Second)) and
defer otelCancel(), call telemetry.Shutdown(otelCtx, tp), and pass a
non-cancelled context (e.g., otelCtx or context.Background()) to
log.Extra(...).Error(...) so the shutdown has its own budget and the error is
logged with an active context.

---

Nitpick comments:
In `@cmd/sentinel/main.go`:
- Around line 30-33: The serviceName variable should not live in the same
build-info var block as the ldflags-populated constants (version, commit, date);
move serviceName out of that var block and make it either a local variable
inside runServe (referencing runServe) if it is mutated at runtime, or make it a
separate const/var with a default value distinct from version/commit/date if you
need a package-level default. Update any references to serviceName accordingly
(e.g., where runServe mutates it) so you no longer mix runtime-mutable state
with the ldflags-injected build constants.

ℹ️ Review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between dae968d and 3844006.

⛔ Files ignored due to path filters (1)
  • go.sum is excluded by !**/*.sum
📒 Files selected for processing (2)
  • cmd/sentinel/main.go
  • go.mod

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/client/client_test.go`:
- Line 804: The test currently ignores the error returned by
NewHyperFleetClient; update the test to capture both return values (client, err
:= NewHyperFleetClient(...)) and assert/fail if err != nil (e.g., t.Fatalf or
require.NoError) before using client so any constructor failures surface
immediately; ensure you also handle the client variable in subsequent
assertions.
- Around line 795-800: In the httptest server handler (the anonymous
http.HandlerFunc passed to httptest.NewServer) replace the unsafe t.Fatalf call
with t.Errorf and then return to avoid invoking FailNow from a handler
goroutine; specifically change the error handling after w.Write to call
t.Errorf("Failed to write response: %v", err) followed by return instead of
t.Fatalf.
- Around line 670-676: Capture the previous global tracer provider before
calling otel.SetTracerProvider(tp) in both
TestNewHyperFleetClient_HTTPInstrumentation and
TestNewHyperFleetClient_HTTPInstrumentation_ErrorCase (e.g., prevTp :=
otel.GetTracerProvider()), then in the defer restore it (e.g., defer
otel.SetTracerProvider(prevTp)) and still shutdown tp as before; ensure you
reference the local tp shutdown logic (tp.Shutdown(ctx)) but add restoring the
previous provider to avoid test-state leakage.

ℹ️ Review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3844006 and 5fb7281.

📒 Files selected for processing (1)
  • internal/client/client_test.go

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

♻️ Duplicate comments (3)
pkg/telemetry/otel.go (1)

93-93: ⚠️ Potential issue | 🟠 Major

Prefer parent-based ratio sampling to preserve incoming trace decisions.

Using trace.TraceIDRatioBased directly can resample child spans independently. Wrap it with trace.ParentBased(...) for distributed trace continuity.

💡 Proposed fix
-		sampler = trace.TraceIDRatioBased(samplingRate)
+		sampler = trace.ParentBased(trace.TraceIDRatioBased(samplingRate))
OpenTelemetry Go: For services receiving remote parent context, is ParentBased(TraceIDRatioBased(x)) recommended over TraceIDRatioBased(x) to preserve parent sampling decisions?
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/telemetry/otel.go` at line 93, Replace the direct TraceIDRatioBased
sampler assignment so child spans respect remote parent sampling: instead of
assigning sampler = trace.TraceIDRatioBased(samplingRate) wrap it with
trace.ParentBased(trace.TraceIDRatioBased(samplingRate)) (use the same
samplingRate variable) so parent-based sampling is preserved for incoming trace
contexts.
cmd/sentinel/main.go (2)

34-34: ⚠️ Potential issue | 🟡 Minor

Set the default OTel service name to hyperfleet-sentinel.

Keeping the default as sentinel can fragment service identity in tracing backends unless explicitly overridden.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/sentinel/main.go` at line 34, The default OTel serviceName value
currently set as serviceName = "sentinel" should be changed to
"hyperfleet-sentinel" to standardize tracing identity; update the assignment of
the serviceName variable in cmd/sentinel/main.go (the serviceName symbol) to use
"hyperfleet-sentinel" as the default so traces are attributed consistently
unless explicitly overridden.

301-305: ⚠️ Potential issue | 🟡 Minor

Use the telemetry shutdown context when logging shutdown errors.

The error log still uses ctx instead of otelShutdownCtx, which may already be canceled during shutdown.

💡 Proposed fix
-			if err := telemetry.Shutdown(otelShutdownCtx, tp); err != nil {
-				log.Extra("error", err).Error(ctx, "Failed to shutdown OpenTelemetry")
+			if err := telemetry.Shutdown(otelShutdownCtx, tp); err != nil {
+				log.Extra("error", err).Error(otelShutdownCtx, "Failed to shutdown OpenTelemetry")
 			}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/sentinel/main.go` around lines 301 - 305, The shutdown logging currently
uses the outer request context variable ctx which may be canceled; change the
log call to use the telemetry shutdown context instead: when calling
telemetry.Shutdown(otelShutdownCtx, tp) and it returns an error, pass
otelShutdownCtx into log.Extra(...).Error(...) (keeping the existing
log.Extra("error", err) and message) so the failure is logged with the
shutdown-specific context created by context.WithTimeout
(otelShutdownCtx/otelShutdownCancel).
🧹 Nitpick comments (1)
test/integration/integration_test.go (1)

902-909: Trace continuity assertion is too broad and can become flaky.

This currently counts trace IDs from all captured spans. If unrelated root spans are emitted, the check can fail even when poll/evaluate/publish hierarchy is correct. Prefer deriving continuity from poll traces (or poll descendants) only.

Suggested tightening
- // Verify trace continuity - spans should form coherent traces
- traceIDs := make(map[string]bool)
- for _, span := range spans {
- 	traceIDs[span.SpanContext.TraceID().String()] = true
- }
-
- if len(traceIDs) > len(pollSpans) {
- 	t.Errorf("Expected spans to belong to %d traces (one per poll), but found %d trace IDs", len(pollSpans), len(traceIDs))
- }
+ // Verify trace continuity for poll roots specifically (one trace per poll span)
+ pollTraceIDs := make(map[string]struct{})
+ for _, pollSpan := range pollSpans {
+ 	pollTraceIDs[pollSpan.SpanContext.TraceID().String()] = struct{}{}
+ }
+ if len(pollTraceIDs) != len(pollSpans) {
+ 	t.Errorf("Expected one unique trace per poll span, got %d unique traces for %d poll spans", len(pollTraceIDs), len(pollSpans))
+ }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/integration/integration_test.go` around lines 902 - 909, The current
check counts trace IDs from all captured spans which can include unrelated roots
and become flaky; change it to derive continuity only from the poll traces by
first building a set of expectedTraceIDs from pollSpans (use each
pollSpan.SpanContext.TraceID().String()) and then only collect/count trace IDs
from spans whose TraceID is in that expectedTraceIDs set (i.e., filter spans by
membership in the poll trace set) before asserting the number of distinct
poll-related traces matches len(pollSpans).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/config/config.go`:
- Around line 110-112: The code currently assigns MESSAGING_SYSTEM even when the
environment variable is present but empty; update the block that reads
os.LookupEnv("MESSAGING_SYSTEM") so it trims and checks the value is non-empty
before assigning to cfg.MessagingSystem (e.g., use strings.TrimSpace on the
looked-up value and only set cfg.MessagingSystem when the trimmed value != "");
this avoids clearing defaults or emitting empty messaging.system attributes.

In `@internal/sentinel/sentinel.go`:
- Around line 114-118: The telemetry spans created via telemetry.StartSpan
(e.g., evalSpan and publishSpan in the sentinel.evaluate flow) are not ended on
error paths that call continue; update each error branch that returns/continues
after calling resource.SetData or publisher.Publish to call the corresponding
span.End() (evalSpan.End() and, if publishSpan was created, publishSpan.End())
before continuing so all spans are closed; search for usages of SetData and
Publish within the sentinel evaluation loop (the branches around evalSpan and
publishSpan creation) and add the missing End() calls on every error path.

In `@pkg/telemetry/otel_test.go`:
- Around line 84-104: The test calls Shutdown twice (once in the deferred
anonymous func and once explicitly), causing flakiness; update
TestInitTraceProvider_OTLPExporter to call Shutdown only once by removing the
redundant call: either eliminate the explicit err = Shutdown(shutdownCtx, tp)
block (keeping the defer) or remove the defer and keep the explicit shutdown,
ensuring only a single call to Shutdown(ctx, tp) on the trace provider
(reference Shutdown and the deferred anonymous func that wraps Shutdown).

In `@test/integration/integration_test.go`:
- Around line 717-718: The test currently ignores the error returned by
client.NewHyperFleetClient; modify the setup to capture the error (e.g.,
hyperfleetClient, err := client.NewHyperFleetClient(...)) and fail fast if it is
non-nil by calling t.Fatalf or using the test assertion helper (e.g.,
require.NoError) so setup failures are explicit; update the code around
NewHyperFleetClient and any test setup that depends on hyperfleetClient to use
the checked variable.
- Around line 675-682: The test sets a new global OTEL tracer provider with
otel.SetTracerProvider(tp) but never restores the previous global provider,
leaking state to other tests; capture the existing provider via
otel.GetTracerProvider() before calling otel.SetTracerProvider(tp) and in the
deferred cleanup restore it by calling otel.SetTracerProvider(prevProvider) (and
then perform tp.Shutdown(cleanupCtx) as currently implemented), referencing
otel.SetTracerProvider, otel.GetTracerProvider, tp, and tp.Shutdown to locate
where to add the save/restore logic.

---

Duplicate comments:
In `@cmd/sentinel/main.go`:
- Line 34: The default OTel serviceName value currently set as serviceName =
"sentinel" should be changed to "hyperfleet-sentinel" to standardize tracing
identity; update the assignment of the serviceName variable in
cmd/sentinel/main.go (the serviceName symbol) to use "hyperfleet-sentinel" as
the default so traces are attributed consistently unless explicitly overridden.
- Around line 301-305: The shutdown logging currently uses the outer request
context variable ctx which may be canceled; change the log call to use the
telemetry shutdown context instead: when calling
telemetry.Shutdown(otelShutdownCtx, tp) and it returns an error, pass
otelShutdownCtx into log.Extra(...).Error(...) (keeping the existing
log.Extra("error", err) and message) so the failure is logged with the
shutdown-specific context created by context.WithTimeout
(otelShutdownCtx/otelShutdownCancel).

In `@pkg/telemetry/otel.go`:
- Line 93: Replace the direct TraceIDRatioBased sampler assignment so child
spans respect remote parent sampling: instead of assigning sampler =
trace.TraceIDRatioBased(samplingRate) wrap it with
trace.ParentBased(trace.TraceIDRatioBased(samplingRate)) (use the same
samplingRate variable) so parent-based sampling is preserved for incoming trace
contexts.

---

Nitpick comments:
In `@test/integration/integration_test.go`:
- Around line 902-909: The current check counts trace IDs from all captured
spans which can include unrelated roots and become flaky; change it to derive
continuity only from the poll traces by first building a set of expectedTraceIDs
from pollSpans (use each pollSpan.SpanContext.TraceID().String()) and then only
collect/count trace IDs from spans whose TraceID is in that expectedTraceIDs set
(i.e., filter spans by membership in the poll trace set) before asserting the
number of distinct poll-related traces matches len(pollSpans).

ℹ️ Review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5fb7281 and 2d3f335.

📒 Files selected for processing (10)
  • cmd/sentinel/main.go
  • configs/dev-example.yaml
  • configs/gcp-pubsub-example.yaml
  • configs/rabbitmq-example.yaml
  • internal/client/client_test.go
  • internal/config/config.go
  • internal/sentinel/sentinel.go
  • pkg/telemetry/otel.go
  • pkg/telemetry/otel_test.go
  • test/integration/integration_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • internal/client/client_test.go

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

♻️ Duplicate comments (2)
cmd/sentinel/main.go (1)

301-305: ⚠️ Potential issue | 🟡 Minor

Use otelShutdownCtx for the shutdown error log context.

Line 304 logs with ctx even though telemetry shutdown is executed under otelShutdownCtx.

💡 Proposed fix
 		if tp != nil {
 			otelShutdownCtx, otelShutdownCancel := context.WithTimeout(context.Background(), 5*time.Second)
 			defer otelShutdownCancel()
 			if err := telemetry.Shutdown(otelShutdownCtx, tp); err != nil {
-				log.Extra("error", err).Error(ctx, "Failed to shutdown OpenTelemetry")
+				log.Extra("error", err).Error(otelShutdownCtx, "Failed to shutdown OpenTelemetry")
 			}
 		}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/sentinel/main.go` around lines 301 - 305, The shutdown error log is using
the wrong context; change the log call to use otelShutdownCtx (created by
context.WithTimeout) instead of ctx so the telemetry.Shutdown error is logged
with the same context it runs under: locate otelShutdownCtx and
otelShutdownCancel and update the log.Extra(...).Error(...) invocation that
follows telemetry.Shutdown(otelShutdownCtx, tp) to pass otelShutdownCtx as the
context argument.
internal/sentinel/sentinel.go (1)

134-143: ⚠️ Potential issue | 🔴 Critical

End and mark evalSpan on SetData failure before continue.

On Line 142, continue skips Line 192, so evalSpan leaks on this error path and the span is not marked failed.

🐛 Proposed fix
 			if err := event.SetData(cloudevents.ApplicationJSON, map[string]interface{}{
 				"kind":       resource.Kind,
 				"id":         resource.ID,
 				"generation": resource.Generation,
 				"href":       resource.Href,
 				"reason":     decision.Reason,
 			}); err != nil {
+				evalSpan.RecordError(err)
+				evalSpan.SetStatus(codes.Error, "set event data failed")
 				s.logger.Errorf(eventCtx, "Failed to set event data resource_id=%s error=%v", resource.ID, err)
+				evalSpan.End()
 				continue
 			}

Also applies to: 192-192

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/sentinel/sentinel.go` around lines 134 - 143, When
event.SetData(...) fails in the block around event.SetData, ensure you mark and
end the `evalSpan` before returning/continuing: after logging the error in the
SetData error branch, record the error on `evalSpan` (e.g., RecordError /
SetStatus to error with the err message) and call `evalSpan.End()` prior to the
`continue` so the span is not leaked; apply the same fix to the other identical
error path near the later SetData call (the occurrence referenced at line ~192).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cmd/sentinel/main.go`:
- Around line 151-155: The code currently ignores parse errors from
strconv.ParseBool for TRACING_ENABLED; modify the block around
otelEnabled/strconv.ParseBool so that when err != nil you either log a clear
error and exit (fail fast) or at minimum emit a warning to stderr/logger
indicating the invalid value and that tracing will remain at its default, e.g.
using log.Printf or fmt.Fprintf(os.Stderr, ...), referencing the TRACING_ENABLED
env var, the strconv.ParseBool call, and cfg.OTel.Enabled to show
current/default state; ensure successful ParseBool still sets cfg.OTel.Enabled
as before.

In `@pkg/telemetry/otel.go`:
- Around line 77-80: The code silently ignores strconv.ParseFloat errors when
reading ENV_OTEL_TRACES_SAMPLER_ARG; update the parsing branch (the block
referencing ENV_OTEL_TRACES_SAMPLER_ARG and samplingRate) to detect a parse
error and handle it explicitly by logging a clear warning or returning an error
that includes the environment variable name and the invalid value, and only
assign samplingRate on successful parse; ensure the log/returned message
references ENV_OTEL_TRACES_SAMPLER_ARG and the raw arg so misconfiguration is
visible.

---

Duplicate comments:
In `@cmd/sentinel/main.go`:
- Around line 301-305: The shutdown error log is using the wrong context; change
the log call to use otelShutdownCtx (created by context.WithTimeout) instead of
ctx so the telemetry.Shutdown error is logged with the same context it runs
under: locate otelShutdownCtx and otelShutdownCancel and update the
log.Extra(...).Error(...) invocation that follows
telemetry.Shutdown(otelShutdownCtx, tp) to pass otelShutdownCtx as the context
argument.

In `@internal/sentinel/sentinel.go`:
- Around line 134-143: When event.SetData(...) fails in the block around
event.SetData, ensure you mark and end the `evalSpan` before
returning/continuing: after logging the error in the SetData error branch,
record the error on `evalSpan` (e.g., RecordError / SetStatus to error with the
err message) and call `evalSpan.End()` prior to the `continue` so the span is
not leaked; apply the same fix to the other identical error path near the later
SetData call (the occurrence referenced at line ~192).

ℹ️ Review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2d3f335 and b3a9ae1.

📒 Files selected for processing (3)
  • cmd/sentinel/main.go
  • internal/sentinel/sentinel.go
  • pkg/telemetry/otel.go

@tirthct
Copy link
Contributor Author

tirthct commented Feb 26, 2026

/retest

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (2)
pkg/telemetry/otel.go (1)

87-96: ParentBased not applied for AlwaysSample and NeverSample cases.

When samplingRate >= 1.0 or <= 0.0, the sampler is set directly to AlwaysSample() or NeverSample() without wrapping in ParentBased(). This means parent sampling decisions are ignored in these cases, potentially breaking trace continuity across service boundaries.

For consistency with the architecture standard that specifies parentbased_traceidratio as the default strategy, consider wrapping all samplers:

♻️ Proposed fix
 	var sampler trace.Sampler
 	switch {
 	case samplingRate >= 1.0:
-		sampler = trace.AlwaysSample() // Sample all
+		sampler = trace.ParentBased(trace.AlwaysSample())
 	case samplingRate <= 0.0:
-		sampler = trace.NeverSample() // Sample none
+		sampler = trace.ParentBased(trace.NeverSample())
 	default:
 		sampler = trace.ParentBased(trace.TraceIDRatioBased(samplingRate))
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/telemetry/otel.go` around lines 87 - 96, The sampler logic sets sampler
directly to trace.AlwaysSample() or trace.NeverSample() for samplingRate >= 1.0
or <= 0.0, which bypasses parent-based behavior; change those branches to wrap
the base samplers with trace.ParentBased(...) so all cases—including
trace.AlwaysSample(), trace.NeverSample() and
trace.TraceIDRatioBased(samplingRate)—are applied via trace.ParentBased, keeping
the existing switch and the sampler variable but replacing direct calls with
trace.ParentBased(trace.AlwaysSample()), trace.ParentBased(trace.NeverSample()),
and trace.ParentBased(trace.TraceIDRatioBased(samplingRate)).
pkg/telemetry/otel_test.go (1)

181-185: Test assertion for expectedSample=false cases may be misleading.

For always_off (line 117-121), the test expects expectedSample: false but only checks IsValid() when expectedSample is true. The span context remains valid even when not sampled — NeverSample still produces a valid SpanContext, it just won't be exported.

To properly verify sampling behavior, check span.SpanContext().IsSampled() instead:

💡 Suggested enhancement
-			if tt.expectedSample {
-				if !span.SpanContext().IsValid() {
-					t.Error("Expected valid span context for sampling=true")
-				}
+			if span.SpanContext().IsSampled() != tt.expectedSample {
+				t.Errorf("Expected IsSampled()=%v, got %v", tt.expectedSample, span.SpanContext().IsSampled())
 			}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/telemetry/otel_test.go` around lines 181 - 185, The test currently only
checks span.SpanContext().IsValid() when tt.expectedSample is true, which
doesn't verify sampling; update the assertion to use
span.SpanContext().IsSampled() (and assert true/false according to
tt.expectedSample) so cases like the "always_off" sampler correctly validate
sampling behavior; locate the table-driven test and replace the IsValid-based
conditional with an assertion on SpanContext().IsSampled() for the test case
keyed by tt.expectedSample in the test function in pkg/telemetry/otel_test.go.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/sentinel/sentinel_test.go`:
- Around line 449-454: The deferred shutdown of the tracer provider uses
t.Fatalf which runs after the test body and can mislead test results; change the
deferred call in the shutdown closure (the anonymous func that calls
tp.Shutdown(ctx) on the trace.TracerProvider instance `tp` using `ctx`) to use
t.Errorf to report errors instead of t.Fatalf, so cleanup failures are reported
without aborting after test completion.
- Line 467: The test currently ignores the error returned by
NewHyperFleetClient; change the setup to capture the error (e.g.,
hyperfleetClient, err := client.NewHyperFleetClient(server.URL, 10*time.Second))
and fail fast on error (e.g., t.Fatalf or require.NoError(t, err)) so setup
failures are reported immediately instead of surfacing as secondary errors later
when hyperfleetClient is used.

In `@pkg/telemetry/otel.go`:
- Around line 77-83: The current logic checks ENV_OTEL_TRACES_SAMPLER_ARG and
logs "Invalid" in the else branch when the env var is empty, but it fails to
warn on strconv.ParseFloat errors; update the logic in the block that reads
ENV_OTEL_TRACES_SAMPLER_ARG so that: if arg == "" do nothing (keep default
samplingRate), if arg != "" attempt strconv.ParseFloat and on err != nil call
log.Warnf with a clear message including ENV_OTEL_TRACES_SAMPLER_ARG and the bad
arg value and that the default samplingRate will be used, and only set
samplingRate when ParseFloat succeeds; adjust references to
ENV_OTEL_TRACES_SAMPLER_ARG, arg, strconv.ParseFloat, samplingRate, and
log.Warnf accordingly.

---

Nitpick comments:
In `@pkg/telemetry/otel_test.go`:
- Around line 181-185: The test currently only checks
span.SpanContext().IsValid() when tt.expectedSample is true, which doesn't
verify sampling; update the assertion to use span.SpanContext().IsSampled() (and
assert true/false according to tt.expectedSample) so cases like the "always_off"
sampler correctly validate sampling behavior; locate the table-driven test and
replace the IsValid-based conditional with an assertion on
SpanContext().IsSampled() for the test case keyed by tt.expectedSample in the
test function in pkg/telemetry/otel_test.go.

In `@pkg/telemetry/otel.go`:
- Around line 87-96: The sampler logic sets sampler directly to
trace.AlwaysSample() or trace.NeverSample() for samplingRate >= 1.0 or <= 0.0,
which bypasses parent-based behavior; change those branches to wrap the base
samplers with trace.ParentBased(...) so all cases—including
trace.AlwaysSample(), trace.NeverSample() and
trace.TraceIDRatioBased(samplingRate)—are applied via trace.ParentBased, keeping
the existing switch and the sampler variable but replacing direct calls with
trace.ParentBased(trace.AlwaysSample()), trace.ParentBased(trace.NeverSample()),
and trace.ParentBased(trace.TraceIDRatioBased(samplingRate)).

ℹ️ Review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b3a9ae1 and 8c2c2bb.

📒 Files selected for processing (8)
  • cmd/sentinel/main.go
  • internal/client/client_test.go
  • internal/config/config.go
  • internal/sentinel/sentinel.go
  • internal/sentinel/sentinel_test.go
  • pkg/telemetry/otel.go
  • pkg/telemetry/otel_test.go
  • test/integration/integration_test.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • internal/client/client_test.go
  • internal/config/config.go

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/telemetry/otel.go`:
- Around line 142-149: Add nil/zero guards in SetTraceContext: return early if
event is nil or span is nil/invalid before calling span.SpanContext(); check
span != nil (or equivalent) and use span.SpanContext().IsValid() only after
those checks to avoid panics from dereferencing nil spans or nil events, and
keep the existing traceparent formatting logic (traceParent := fmt.Sprintf(...)
and event.SetExtension("traceparent", traceParent)) inside the guarded block so
they only run when both inputs are valid.
- Around line 41-43: The current code calls
otlptracehttp.WithEndpointURL(otlpEndpoint) (inside otlptracehttp.New), which
treats the value as a full URL and prevents the SDK from appending the
/v1/traces path; remove that call and either let the SDK read
OTEL_EXPORTER_OTLP_ENDPOINT itself (omit the WithEndpointURL option) or, if
specifying programmatically, replace WithEndpointURL with
otlptracehttp.WithEndpoint(...) using a host:port value (and keep the call site
in otlptracehttp.New).

ℹ️ Review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8c2c2bb and 31b40d6.

📒 Files selected for processing (3)
  • internal/sentinel/sentinel.go
  • internal/sentinel/sentinel_test.go
  • pkg/telemetry/otel.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • internal/sentinel/sentinel_test.go
  • internal/sentinel/sentinel.go

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/telemetry/otel.go`:
- Around line 41-42: The exporter selection currently only checks
ENV_OTEL_EXPORTER_OTLP_ENDPOINT; update the gating logic to also consider
ENV_OTEL_EXPORTER_OTLP_TRACES_ENDPOINT and prefer the traces-specific variable
when present so otlptracehttp.New is attempted for traces; modify the check
around otlptracehttp.New (references: ENV_OTEL_EXPORTER_OTLP_ENDPOINT,
ENV_OTEL_EXPORTER_OTLP_TRACES_ENDPOINT, otlptracehttp.New, exporter, err) to
look for the traces endpoint first and fall back to the generic OTLP endpoint
before defaulting to stdout.

ℹ️ Review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 31b40d6 and f30c9bf.

📒 Files selected for processing (1)
  • pkg/telemetry/otel.go

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (2)
internal/config/config.go (1)

110-112: LGTM on env var handling; consider validating MessagingSystem isn't empty.

The empty string guard for the environment variable is correctly implemented. However, if the config file explicitly sets messaging_system: "", viper will override the default to an empty string, and this won't be caught since no validation exists in Validate(). This could result in empty messaging.system span attributes.

💡 Optional: Add validation for non-empty MessagingSystem
 func (c *SentinelConfig) Validate() error {
+	if c.MessagingSystem == "" {
+		return fmt.Errorf("messaging_system cannot be empty")
+	}
+
 	if c.ResourceType == "" {
 		return fmt.Errorf("resource_type is required")
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/config/config.go` around lines 110 - 112, The config currently
allows cfg.MessagingSystem to be set to an empty string (e.g., via viper reading
messaging_system:""), but Validate() does not catch this; update the Validate()
method to check strings.TrimSpace(cfg.MessagingSystem) != "" and return a clear
error (e.g., "messaging.system must be non-empty") when empty so downstream code
that sets span attributes can't receive an empty value; reference the
cfg.MessagingSystem field in your validation and ensure the error is returned
from Validate().
test/integration/integration_test.go (1)

505-505: Consider checking NewHyperFleetClient error for consistency.

Line 718 now properly checks the client construction error after addressing a previous review comment. For consistency within the same file, this line should follow the same pattern.

♻️ Proposed fix
-	hyperfleetClient, _ := client.NewHyperFleetClient(server.URL, 10*time.Second)
+	hyperfleetClient, clientErr := client.NewHyperFleetClient(server.URL, 10*time.Second)
+	if clientErr != nil {
+		t.Fatalf("failed to create HyperFleet client: %v", clientErr)
+	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/integration/integration_test.go` at line 505, The call to
NewHyperFleetClient currently ignores its error return; update the invocation to
capture the error (e.g., hyperfleetClient, err :=
client.NewHyperFleetClient(server.URL, 10*time.Second)) and handle it
consistently with the rest of the test file (fail the test or t.Fatalf with the
error) so the client construction error is not silently ignored.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/telemetry/otel.go`:
- Line 28: Rename the misspelled constant envOtelExplorerOltpEndpoint to
envOtelExporterOtlpEndpoint (fixing "Explorer"→"Exporter" and "Oltp"→"Otlp") and
update all references to it (e.g., the usage in the telemetry setup where the
constant is read) to use the new name so the constant matches the
OTEL_EXPORTER_OTLP_ENDPOINT environment variable and OpenTelemetry conventions.

---

Nitpick comments:
In `@internal/config/config.go`:
- Around line 110-112: The config currently allows cfg.MessagingSystem to be set
to an empty string (e.g., via viper reading messaging_system:""), but Validate()
does not catch this; update the Validate() method to check
strings.TrimSpace(cfg.MessagingSystem) != "" and return a clear error (e.g.,
"messaging.system must be non-empty") when empty so downstream code that sets
span attributes can't receive an empty value; reference the cfg.MessagingSystem
field in your validation and ensure the error is returned from Validate().

In `@test/integration/integration_test.go`:
- Line 505: The call to NewHyperFleetClient currently ignores its error return;
update the invocation to capture the error (e.g., hyperfleetClient, err :=
client.NewHyperFleetClient(server.URL, 10*time.Second)) and handle it
consistently with the rest of the test file (fail the test or t.Fatalf with the
error) so the client construction error is not silently ignored.

ℹ️ Review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f30c9bf and 53bf028.

📒 Files selected for processing (5)
  • internal/client/client_test.go
  • internal/config/config.go
  • pkg/logger/logger.go
  • pkg/telemetry/otel.go
  • test/integration/integration_test.go

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🧹 Nitpick comments (1)
pkg/telemetry/otel_test.go (1)

65-74: Prefer t.Setenv to reduce cleanup boilerplate and leakage risk.

Using t.Setenv makes env restoration automatic and keeps each test case isolated with less defer code.

Also applies to: 137-158

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/telemetry/otel_test.go` around lines 65 - 74, Replace manual environment
manipulation in the tests with t.Setenv to avoid cleanup boilerplate and
leakage: instead of calling os.Setenv("OTEL_EXPORTER_OTLP_ENDPOINT", server.URL)
and deferring os.Unsetenv("OTEL_EXPORTER_OTLP_ENDPOINT"), call
t.Setenv("OTEL_EXPORTER_OTLP_ENDPOINT", server.URL) so the testing framework
automatically restores the original value; apply the same change to the other
occurrences around the OTEL exporter env setup (the block using
os.Setenv/os.Unsetenv at ~lines 137-158) so both test cases use t.Setenv and
remove the explicit defer cleanup code.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/telemetry/otel_test.go`:
- Around line 49-60: The test's mock OTLP server handler increments
receivedRequests but the test never starts any spans nor asserts that
receivedRequests increased, so exports may never occur; update the test by
creating and ending at least one span (using the package's tracer, e.g., via
TracerProvider/Tracer or Start/End calls used elsewhere in the test) before
shutting down/exporter close, then add assertions that receivedRequests > 0
(and/or equal to expected count) after shutdown to verify the OTLP export
happened; apply the same change to the second occurrence referenced (the block
around lines 90-97) so both server handlers actually receive and assert exported
traces.
- Around line 176-180: The test only asserts the true branch for sampling
(checking span.SpanContext().IsValid()) and never validates the false case;
update the test loop in otel_test.go where expectedSample is used to add an else
branch that asserts the span is not sampled/valid for expectedSample==false (use
span.SpanContext().IsValid() and/or span.SpanContext().IsSampled() to make the
assertion). Locate the table-driven test and the block using expectedSample and
ensure the false branch fails the test when a span is unexpectedly valid or
sampled.
- Around line 311-320: The test uses fixed-index slicing on traceParentStr
(checks at len 55, then slices traceParentStr[:3] and traceParentStr[len-3:])
which can panic if the length check fails; update the assertions in Test
(traceParentStr) to avoid slicing by using safe checks such as verifying length
first and then using strings.HasPrefix(traceParentStr, "00-") and
strings.HasSuffix(traceParentStr, "-01") or perform conditional slicing only
after re-checking len(traceParentStr) >= 55, referencing the traceParentStr
variable to locate and replace the brittle slices.

In `@pkg/telemetry/otel.go`:
- Around line 78-83: The code parsing OTEL_TRACES_SAMPLER_ARG currently accepts
any float and silently lets downstream logic coerce out-of-range values; update
the parsing in pkg/telemetry/otel.go (the block that assigns 'rate' and sets
'samplingRate' using envOtelTracesSamplerArg) to validate that rate is within
[0.0, 1.0], and if not, log a clear warning via log.Warnf including the invalid
value and retain the prior/default samplingRate instead of assigning the
out-of-range value; apply the same explicit range check and warning to the
related parsing block that also assigns samplingRate (the similar code around
the other OTEL_TRACES_* parsing).

---

Nitpick comments:
In `@pkg/telemetry/otel_test.go`:
- Around line 65-74: Replace manual environment manipulation in the tests with
t.Setenv to avoid cleanup boilerplate and leakage: instead of calling
os.Setenv("OTEL_EXPORTER_OTLP_ENDPOINT", server.URL) and deferring
os.Unsetenv("OTEL_EXPORTER_OTLP_ENDPOINT"), call
t.Setenv("OTEL_EXPORTER_OTLP_ENDPOINT", server.URL) so the testing framework
automatically restores the original value; apply the same change to the other
occurrences around the OTEL exporter env setup (the block using
os.Setenv/os.Unsetenv at ~lines 137-158) so both test cases use t.Setenv and
remove the explicit defer cleanup code.

ℹ️ Review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 53bf028 and f641fca.

📒 Files selected for processing (3)
  • cmd/sentinel/main.go
  • pkg/telemetry/otel.go
  • pkg/telemetry/otel_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • cmd/sentinel/main.go

@tirthct
Copy link
Contributor Author

tirthct commented Mar 4, 2026

/retest

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
test/integration/integration_test.go (2)

539-539: ⚠️ Potential issue | 🟡 Minor

Ignoring NewHyperFleetClient error in TestIntegration_BrokerLoggerContext.

The error from NewHyperFleetClient is discarded. This was fixed in TestIntegration_EndToEndSpanHierarchy (lines 759-762) but not here. Fail fast to surface setup issues explicitly.

Proposed fix
-	hyperfleetClient, _ := client.NewHyperFleetClient(server.URL, 10*time.Second)
+	hyperfleetClient, clientErr := client.NewHyperFleetClient(server.URL, 10*time.Second)
+	if clientErr != nil {
+		t.Fatalf("failed to create HyperFleet client: %v", clientErr)
+	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/integration/integration_test.go` at line 539, In
TestIntegration_BrokerLoggerContext update the call to NewHyperFleetClient to
handle its returned error instead of discarding it: capture both return values
from client.NewHyperFleetClient(server.URL, 10*time.Second), check the error and
fail the test immediately (e.g., using t.Fatalf or require.NoError) if non-nil;
this mirrors the handling used in TestIntegration_EndToEndSpanHierarchy and
ensures setup failures surface promptly.

475-476: ⚠️ Potential issue | 🟡 Minor

Data race on callCount in TestIntegration_BrokerLoggerContext.

callCount is a plain int incremented from the HTTP handler goroutine (line 520) without synchronization. This is the same race condition pattern that was fixed in TestIntegration_EndToEndSpanHierarchy using atomic.Int32.

Proposed fix
-	callCount := 0
+	var callCount atomic.Int32

And update the handler (lines 520-521):

-		callCount++
-		if callCount >= 2 {
+		if callCount.Add(1) >= 2 {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/integration/integration_test.go` around lines 475 - 476, The test has a
data race on the plain int variable callCount in
TestIntegration_BrokerLoggerContext; change callCount to an atomic counter
(e.g., atomic.Int32 or atomic.Int64 from sync/atomic) and update the HTTP
handler that increments it to use the atomic increment (Add) and reads (Load)
instead of direct ++/== operations; ensure any comparisons or assignments to
callCount elsewhere in the test use the atomic Load/Store equivalents so all
accesses are synchronized.
🧹 Nitpick comments (3)
test/integration/integration_test.go (3)

153-153: Pre-existing: Error ignored for NewHyperFleetClient in other tests.

For consistency with the fix applied to TestIntegration_EndToEndSpanHierarchy, consider addressing the ignored error here and in TestIntegration_LabelSelectorFiltering (line 269) and TestIntegration_TSLSyntaxMultipleLabels (line 380) as a follow-up cleanup.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/integration/integration_test.go` at line 153, The call to
NewHyperFleetClient in the test ignores its returned error; update the test to
handle the error consistently with TestIntegration_EndToEndSpanHierarchy by
capturing the error from client.NewHyperFleetClient(server.URL, 10*time.Second)
and failing the test if non-nil (e.g., t.Fatalf or t.Fatal with a descriptive
message). Apply the same change to the other tests mentioned
(TestIntegration_LabelSelectorFiltering and
TestIntegration_TSLSyntaxMultipleLabels) so NewHyperFleetClient's error is
checked and handled instead of ignored.

974-990: validateSpanAttribute only validates the first matching span.

The function returns after finding the first span with the given name (line 982-983). If multiple spans exist with the same name, only the first one's attribute is validated. This is fine if all spans of that name are expected to have identical attributes, but could miss inconsistencies on subsequent spans.

Alternative: validate all spans with matching name
 func validateSpanAttribute(t *testing.T, spans []tracetest.SpanStub, spanName, attrKey, expectedValue string) {
+	found := false
 	for _, span := range spans {
 		if span.Name == spanName {
+			found = true
 			for _, attr := range span.Attributes {
 				if string(attr.Key) == attrKey {
 					if attr.Value.AsString() != expectedValue {
-						t.Errorf("Span '%s': expected %s=%s, got %s", spanName, attrKey, expectedValue, attr.Value.AsString())
+						t.Errorf("Span '%s' (%s): expected %s=%s, got %s", spanName, span.SpanContext.SpanID().String()[:8], attrKey, expectedValue, attr.Value.AsString())
 					}
-					return
+					break
 				}
 			}
-			t.Errorf("Span '%s': attribute '%s' not found", spanName, attrKey)
-			return
 		}
 	}
-	t.Errorf("Span '%s' not found", spanName)
+	if !found {
+		t.Errorf("Span '%s' not found", spanName)
+	}
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/integration/integration_test.go` around lines 974 - 990, The helper
validateSpanAttribute currently stops after validating the first span with the
given name, so inconsistencies in later spans are missed; update
validateSpanAttribute to iterate through all spans in the spans slice and check
every span where span.Name == spanName, verifying that each matching span has an
attribute with Key equal to attrKey and Value.AsString() equal to expectedValue,
and report t.Errorf for any missing attribute or mismatched value per-span (or
aggregate failures) instead of returning after the first match so all matching
spans are validated.

493-498: Follow the same tracer provider restoration pattern as TestIntegration_EndToEndSpanHierarchy in TestIntegration_BrokerLoggerContext.

InitTraceProvider sets the global tracer provider via otel.SetTracerProvider(tp) (line 113 of otel.go). The EndToEndSpanHierarchy test correctly saves and restores the previous provider to avoid leaking state to other tests. BrokerLoggerContext should do the same by adding:

prevTP := otel.GetTracerProvider()
defer otel.SetTracerProvider(prevTP)

before the call to InitTraceProvider to maintain consistency and prevent global state pollution.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/integration/integration_test.go` around lines 493 - 498, The
TestIntegration_BrokerLoggerContext test currently calls
telemetry.InitTraceProvider which sets the global tracer provider but does not
restore the previous provider; mirror the pattern used in
TestIntegration_EndToEndSpanHierarchy by capturing the existing provider with
otel.GetTracerProvider() before calling InitTraceProvider and deferring
otel.SetTracerProvider(prevTP) to restore it, so modify the setup in
TestIntegration_BrokerLoggerContext to save prevTP, defer its restoration, then
call InitTraceProvider and the existing deferred telemetry.Shutdown.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@test/integration/integration_test.go`:
- Line 539: In TestIntegration_BrokerLoggerContext update the call to
NewHyperFleetClient to handle its returned error instead of discarding it:
capture both return values from client.NewHyperFleetClient(server.URL,
10*time.Second), check the error and fail the test immediately (e.g., using
t.Fatalf or require.NoError) if non-nil; this mirrors the handling used in
TestIntegration_EndToEndSpanHierarchy and ensures setup failures surface
promptly.
- Around line 475-476: The test has a data race on the plain int variable
callCount in TestIntegration_BrokerLoggerContext; change callCount to an atomic
counter (e.g., atomic.Int32 or atomic.Int64 from sync/atomic) and update the
HTTP handler that increments it to use the atomic increment (Add) and reads
(Load) instead of direct ++/== operations; ensure any comparisons or assignments
to callCount elsewhere in the test use the atomic Load/Store equivalents so all
accesses are synchronized.

---

Nitpick comments:
In `@test/integration/integration_test.go`:
- Line 153: The call to NewHyperFleetClient in the test ignores its returned
error; update the test to handle the error consistently with
TestIntegration_EndToEndSpanHierarchy by capturing the error from
client.NewHyperFleetClient(server.URL, 10*time.Second) and failing the test if
non-nil (e.g., t.Fatalf or t.Fatal with a descriptive message). Apply the same
change to the other tests mentioned (TestIntegration_LabelSelectorFiltering and
TestIntegration_TSLSyntaxMultipleLabels) so NewHyperFleetClient's error is
checked and handled instead of ignored.
- Around line 974-990: The helper validateSpanAttribute currently stops after
validating the first span with the given name, so inconsistencies in later spans
are missed; update validateSpanAttribute to iterate through all spans in the
spans slice and check every span where span.Name == spanName, verifying that
each matching span has an attribute with Key equal to attrKey and
Value.AsString() equal to expectedValue, and report t.Errorf for any missing
attribute or mismatched value per-span (or aggregate failures) instead of
returning after the first match so all matching spans are validated.
- Around line 493-498: The TestIntegration_BrokerLoggerContext test currently
calls telemetry.InitTraceProvider which sets the global tracer provider but does
not restore the previous provider; mirror the pattern used in
TestIntegration_EndToEndSpanHierarchy by capturing the existing provider with
otel.GetTracerProvider() before calling InitTraceProvider and deferring
otel.SetTracerProvider(prevTP) to restore it, so modify the setup in
TestIntegration_BrokerLoggerContext to save prevTP, defer its restoration, then
call InitTraceProvider and the existing deferred telemetry.Shutdown.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: e2248316-857d-4df6-82d5-5e60698916ff

📥 Commits

Reviewing files that changed from the base of the PR and between b7d1044 and ccacfc0.

⛔ Files ignored due to path filters (1)
  • go.sum is excluded by !**/*.sum
📒 Files selected for processing (2)
  • go.mod
  • test/integration/integration_test.go

… gRPC/HTTP exporters

                    - Add OpenTelemetry trace provider initialization with configurable exporters
                    - Support both gRPC (port 4317) and HTTP (port 4318) OTLP exporters
                    - Implement standards-compliant sampler configuration for all OTEL sampler types
                    - Add trace context propagation to CloudEvents via traceparent extension
                    - Enhance integration tests with comprehensive tracing validation
                    - Fix sampler bug where parentbased_traceidratio ignored OTEL_TRACES_SAMPLER_ARG
                    - Add span instrumentation for API calls, event publishing, and polling cycles
                    - Support protocol selection via OTEL_EXPORTER_OTLP_PROTOCOL environment variable
                    - Implement proper error handling and graceful degradation for tracing failures
                    - Add OpenTelemetry dependencies and update go.mod/go.sum
publishSpan.SetStatus(codes.Error, "publish failed")
// Record broker error
metrics.UpdateBrokerErrorsMetric(resourceType, resourceSelector, "publish_error")
s.logger.Errorf(eventCtx, "Failed to publish event resource_id=%s error=%v", resource.ID, err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Category: Bug

The error log on line 190 uses eventCtx but it should use publishCtx instead. Right now, when
a publish fails, the log entry gets the evaluate span's trace/span IDs rather than the publish
span's IDs, so you lose correlation between this error log and the publish span in your
tracing backend.

Suggested change
s.logger.Errorf(eventCtx, "Failed to publish event resource_id=%s error=%v", resource.ID, err)
s.logger.Errorf(publishCtx, "Failed to publish event resource_id=%s error=%v", resource.ID, err)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Used publishCtx

tp = traceProvider
}
} else {
log.Extra("TRACING_ENABLED", false).Info(ctx, "OpenTelemetry disabled")
Copy link
Contributor

@rafabene rafabene Mar 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Category: Standards

The log key "TRACING_ENABLED" on line 184 should be snake_case per the logging spec. All other
log fields in this codebase use lowercase (trace_id, decision_reason, etc.).

Suggested change
log.Extra("TRACING_ENABLED", false).Info(ctx, "OpenTelemetry disabled")
log.Extra("tracing_enabled", false).Info(ctx, "OpenTelemetry disabled")

I think I have pointed this in previous reviews.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

case parentBasedTraceIdRatio, "":
// Default per tracing standard
sampler = trace.ParentBased(trace.TraceIDRatioBased(parseSamplingRate(ctx, log)))
default:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Category: Pattern

The sampler switch is missing parentbased_always_on and parentbased_always_off — both are
standard OTEL_TRACES_SAMPLER values per the OTel spec. Setting either currently triggers the
"unrecognized sampler" warning and falls back to parentbased_traceidratio.

  case "parentbased_always_on":
      sampler = trace.ParentBased(trace.AlwaysSample())
  case "parentbased_always_off":
      sampler = trace.ParentBased(trace.NeverSample())

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added the missing switch cases

const (
samplerAlwaysOn = "always_on"
samplerAlwaysOff = "always_off"
samplerTraceIdRatio = "traceidratio"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Category: Standards

Minor naming nit: Go convention is to capitalize acronyms fully — ID not Id. The rest of the
codebase follows this (e.g., TraceIDCtxKey, SpanIDCtxKey).

Suggested change
samplerTraceIdRatio = "traceidratio"
samplerTraceIDRatio = "traceidratio"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed the naming convention

envOtelTracesSamplerArg = "OTEL_TRACES_SAMPLER_ARG"
envOtelExporterOtlpEndpoint = "OTEL_EXPORTER_OTLP_ENDPOINT"
envOtelExporterOtlpProtocol = "OTEL_EXPORTER_OTLP_PROTOCOL"
parentBasedTraceIdRatio = "parentbased_traceidratio"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Category: Standards

Minor naming nit: Go convention is to capitalize acronyms fully — ID not Id. The rest of the
codebase follows this (e.g., TraceIDCtxKey, SpanIDCtxKey).

Suggested change
parentBasedTraceIdRatio = "parentbased_traceidratio"
parentBasedTraceIDRatio = "parentbased_traceidratio"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack, renamed

}
if err != nil {
log.Errorf(ctx, "Failed to create OTLP exporter (protocol=%s): %v", protocol, err)
return nil, err
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Category: Improvement

The errors returned from InitTraceProvider are bare — the caller just sees the raw
exporter/resource error without knowing which step failed. Consider wrapping them:

Suggested change
return nil, err
return nil, fmt.Errorf("create OTLP exporter (protocol=%s): %w", protocol, err)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wrapped the error message

)
if err != nil {
log.Errorf(ctx, "Failed to create OpenTelemetry stdout exporter: %v", err)
return nil, err
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return nil, err
return nil, fmt.Errorf("create stdout exporter: %w", err)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wrapped the error message

log.Warnf(ctx, "Failed to shutdown exporter: %v", shutdownErr)
}
log.Extra("service_name", serviceName).Extra("service_version", serviceVersion).Errorf(ctx, "Failed to create OpenTelemetry resource: %v", err)
return nil, err
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return nil, err
return nil, fmt.Errorf("create OTel resource: %w", err)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wrapped the error message


// span: publish (child of sentinel.evaluate)
publishCtx, publishSpan := telemetry.StartSpan(eventCtx, fmt.Sprintf("%s publish", topic),
attribute.String("hyperfleet.resource_type", s.config.ResourceType),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hyperfleet.resource_type, hyperfleet.resource_id, and hyperfleet.decision_reason are set on
both the evaluate span (lines 140-142, 145) and the publish span (lines 171-173). Since
publish is a child of evaluate, these are redundant — tracing backends let you see parent
attributes. Consider keeping only the messaging-specific attributes on the publish span:

Suggested change
attribute.String("hyperfleet.resource_type", s.config.ResourceType),

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, makes sense. Removed the redundant attribute

// span: publish (child of sentinel.evaluate)
publishCtx, publishSpan := telemetry.StartSpan(eventCtx, fmt.Sprintf("%s publish", topic),
attribute.String("hyperfleet.resource_type", s.config.ResourceType),
attribute.String("hyperfleet.resource_id", resource.ID),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
attribute.String("hyperfleet.resource_id", resource.ID),

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the redundant attribute

publishCtx, publishSpan := telemetry.StartSpan(eventCtx, fmt.Sprintf("%s publish", topic),
attribute.String("hyperfleet.resource_type", s.config.ResourceType),
attribute.String("hyperfleet.resource_id", resource.ID),
attribute.String("hyperfleet.decision_reason", decision.Reason),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
attribute.String("hyperfleet.decision_reason", decision.Reason),

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the redundant attribute

"go.opentelemetry.io/otel/attribute"
"go.opentelemetry.io/otel/sdk/trace"
"go.opentelemetry.io/otel/sdk/trace/tracetest"
t2 "go.opentelemetry.io/otel/trace"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Category: Improvement

The import alias t2 for go.opentelemetry.io/otel/trace is a bit cryptic and easy to confuse
with t *testing.T. The production code already uses oteltrace for this same package — using
the same alias in tests would be more readable:

Suggested change
t2 "go.opentelemetry.io/otel/trace"
oteltrace "go.opentelemetry.io/otel/trace"

It's interesting because I think I have pointed this in previous reviews.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I remember seeing and fixing this earlier, not sure how it got replaced. Anyways, fixed the alias

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cmd/sentinel/main.go`:
- Around line 175-185: The TracerProvider created by telemetry.InitTraceProvider
(assigned to tp) is only shut down in the signal-handling goroutine, so
non-signal exits skip shutdown; after successfully getting traceProvider from
telemetry.InitTraceProvider in main (where tp/traceProvider are set) add a
deferred shutdown for that provider (call traceProvider.Shutdown(ctx) or
telemetry.ShutdownTraceProvider(ctx, traceProvider) as appropriate) immediately
after init so it runs on any return from runServe/main, and remove or keep the
signal goroutine shutdown as a secondary/duplicate safe call; ensure you
reference the trace.TracerProvider variable (tp/traceProvider) and the
telemetry.InitTraceProvider call when placing the defer.

In `@pkg/telemetry/otel.go`:
- Around line 93-110: The sampler case for samplerTraceIDRatio creates a
standalone trace.TraceIDRatioBased sampler which violates the org standard;
update the sampler assignment in the switch (the samplerTraceIDRatio branch) to
wrap the ratio-based sampler with
trace.ParentBased(trace.TraceIDRatioBased(parseSamplingRate(ctx, log))) so that
ParentBased is used for all ratio samplers (keep parseSamplingRate(ctx, log) as
the rate source and leave other cases unchanged).

In `@test/integration/integration_test.go`:
- Around line 769-776: The SentinelConfig used in
TestIntegration_EndToEndSpanHierarchy is missing MessageData which leads to a
nil payloadBuilder in buildEventData and results in empty payloads; update the
test's cfg (the config.SentinelConfig instance) to include a suitable
MessageData value (matching other tests) so payloadBuilder is non-nil and events
exercise the full publish path, ensuring references to MessageData,
buildEventData, and payloadBuilder in the test are covered.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 163dd6e3-81b9-4631-b1f2-81bd488dec90

📥 Commits

Reviewing files that changed from the base of the PR and between ccacfc0 and 8d97966.

⛔ Files ignored due to path filters (1)
  • go.sum is excluded by !**/*.sum
📒 Files selected for processing (15)
  • cmd/sentinel/main.go
  • configs/dev-example.yaml
  • configs/gcp-pubsub-example.yaml
  • configs/rabbitmq-example.yaml
  • go.mod
  • internal/client/client.go
  • internal/client/client_test.go
  • internal/config/config.go
  • internal/sentinel/sentinel.go
  • internal/sentinel/sentinel_test.go
  • pkg/logger/logger.go
  • pkg/logger/logger_test.go
  • pkg/telemetry/otel.go
  • pkg/telemetry/otel_test.go
  • test/integration/integration_test.go
🚧 Files skipped from review as they are similar to previous changes (4)
  • configs/rabbitmq-example.yaml
  • configs/dev-example.yaml
  • pkg/logger/logger_test.go
  • pkg/telemetry/otel_test.go

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cmd/sentinel/main.go`:
- Around line 181-186: The TracerProvider (traceProvider/tp) is being shut down
immediately; change the immediate call to telemetry.Shutdown(otelShutdownCtx,
tp) so it runs on function exit by deferring a cleanup closure that creates a
timeout context, calls telemetry.Shutdown with tp, and then cancels; keep
otelShutdownCancel only for that deferred closure (remove the current immediate
call), and reference traceProvider/tp, telemetry.Shutdown, otelShutdownCtx and
otelShutdownCancel in the deferred cleanup so OTel remains active for the
lifetime of the process.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 370d4512-d9d0-4a09-b78f-99bc15bfd42b

📥 Commits

Reviewing files that changed from the base of the PR and between 10aff14 and 82574d4.

📒 Files selected for processing (2)
  • cmd/sentinel/main.go
  • pkg/telemetry/otel.go

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
test/integration/integration_test.go (1)

481-499: ⚠️ Potential issue | 🟠 Major

Make the tracer setup test-local.

telemetry.InitTraceProvider in pkg/telemetry/otel.go calls otel.SetTracerProvider(...) and switches to OTLP when OTEL_EXPORTER_OTLP_ENDPOINT is set. In this test that can leak a shutdown provider into later tests and make the run depend on the runner's OTEL env. Please save/restore the previous provider here and neutralize the exporter env, or use the same isolated provider pattern as TestIntegration_EndToEndSpanHierarchy.

♻️ Suggested change
+	prevTP := otel.GetTracerProvider()
+	t.Setenv("OTEL_EXPORTER_OTLP_ENDPOINT", "")
 	// Setup OpenTelemetry for integration test
 	tp, err := telemetry.InitTraceProvider(ctx, "sentinel", "test")
 	if err != nil {
 		t.Fatalf("Failed to initialize OpenTelemetry: %v", err)
 	}
-	defer telemetry.Shutdown(context.Background(), tp)
+	defer func() {
+		otel.SetTracerProvider(prevTP)
+		cleanupCtx, cleanupCancel := context.WithTimeout(context.Background(), 5*time.Second)
+		defer cleanupCancel()
+		if err := telemetry.Shutdown(cleanupCtx, tp); err != nil {
+			t.Logf("Warning: shutdown of tracer provider failed: %v", err)
+		}
+	}()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/integration/integration_test.go` around lines 481 - 499, The tracer
setup in TestIntegration uses telemetry.InitTraceProvider which calls
otel.SetTracerProvider and can leak a shutdown provider or honor
OTEL_EXPORTER_OTLP_ENDPOINT; modify the test to be local by saving the previous
tracer provider (call otel.GetTracerProvider()), unset or clear
OTEL_EXPORTER_OTLP_ENDPOINT before calling InitTraceProvider, then after
creating the test provider restore the previous provider and ensure
telemetry.Shutdown(tp) is called only for the test provider (mirroring the
isolated-provider pattern used in TestIntegration_EndToEndSpanHierarchy);
reference InitTraceProvider, Shutdown, otel.SetTracerProvider and
OTEL_EXPORTER_OTLP_ENDPOINT when locating the code to implement these changes.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@test/integration/integration_test.go`:
- Around line 481-499: The tracer setup in TestIntegration uses
telemetry.InitTraceProvider which calls otel.SetTracerProvider and can leak a
shutdown provider or honor OTEL_EXPORTER_OTLP_ENDPOINT; modify the test to be
local by saving the previous tracer provider (call otel.GetTracerProvider()),
unset or clear OTEL_EXPORTER_OTLP_ENDPOINT before calling InitTraceProvider,
then after creating the test provider restore the previous provider and ensure
telemetry.Shutdown(tp) is called only for the test provider (mirroring the
isolated-provider pattern used in TestIntegration_EndToEndSpanHierarchy);
reference InitTraceProvider, Shutdown, otel.SetTracerProvider and
OTEL_EXPORTER_OTLP_ENDPOINT when locating the code to implement these changes.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 8e226166-8780-4093-88da-50e04bc494be

📥 Commits

Reviewing files that changed from the base of the PR and between 82574d4 and 4bfb9ba.

📒 Files selected for processing (2)
  • cmd/sentinel/main.go
  • test/integration/integration_test.go

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants