Skip to content

feat: Added support of Open Telemetry tracing for Smart Router#2260

Open
sotskov-do wants to merge 3 commits intomainfrom
smartrouter-tracing
Open

feat: Added support of Open Telemetry tracing for Smart Router#2260
sotskov-do wants to merge 3 commits intomainfrom
smartrouter-tracing

Conversation

@sotskov-do
Copy link
Copy Markdown
Contributor

Description

Closes: #XXXX


Author Checklist

All items are required. Please add a note to the item if the item is not applicable and
please add links to any relevant follow up issues.

I have...

  • read the contribution guide
  • included the correct type prefix in the PR title, you can find examples of the prefixes below:
  • confirmed ! in the type prefix if API or client breaking change
  • targeted the main branch
  • provided a link to the relevant issue or specification
  • reviewed "Files changed" and left comments if necessary
  • included the necessary unit and integration tests
  • updated the relevant documentation or specification, including comments for documenting Go code
  • confirmed all CI checks have passed

Reviewers Checklist

All items are required. Please add a note if the item is not applicable and please add
your handle next to the items reviewed if you only reviewed selected items.

I have...

  • confirmed the correct type prefix in the PR title
  • confirmed all author checklist items have been addressed
  • reviewed state machine logic, API design and naming, documentation is accurate, tests and test coverage

@qodo-code-review
Copy link
Copy Markdown

Review Summary by Qodo

Add OpenTelemetry tracing support for Smart Router with comprehensive instrumentation

✨ Enhancement 🧪 Tests

Grey Divider

Walkthroughs

Description
• Adds comprehensive OpenTelemetry tracing support to Smart Router with full SDK initialization and
  lifecycle management
• Implements trace context propagation using W3C standards for HTTP and gRPC protocols
• Instruments Smart Router server with spans at relay entry point and throughout the processing
  pipeline (parse, process, cache lookup, session acquisition)
• Adds client-side span instrumentation for JSON-RPC, REST, and gRPC relay operations with
  request/response details
• Records relay attributes including GUID, chain ID, API interface, method, consistency stats, cache
  results, and retry counts
• Provides span helper utilities for error recording, body recording with truncation, and orphan
  span protection
• Includes comprehensive test coverage for span helpers, trace context propagation, and trace
  manager configuration
• Adds --otel-trace-body CLI flag for enabling request/response body recording
• Delegates standard OpenTelemetry configuration to environment variables per SDK specification
• Updates dependencies to include OpenTelemetry core, SDK, OTLP exporters, and related packages
Diagram
flowchart LR
  A["Smart Router Startup"] -->|Initialize| B["TraceManager"]
  B -->|Configure| C["OTel SDK"]
  C -->|Create| D["Tracer Provider"]
  E["Relay Request"] -->|Extract Context| F["Server Span"]
  F -->|Record Attributes| G["Relay Data"]
  F -->|Create Child Spans| H["Internal Operations"]
  H -->|Parse/Process/Cache| I["Operation Spans"]
  J["Direct RPC Relay"] -->|Inject Context| K["Client Span"]
  K -->|Record Details| L["Request/Response"]
  L -->|Propagate| M["W3C Headers"]
  N["Shutdown"] -->|Graceful| O["Trace Manager"]
Loading

Grey Divider

File Changes

1. protocol/tracing/span_helpers_test.go 🧪 Tests +662/-0

Span helpers test suite with comprehensive coverage

• Comprehensive test suite for span instrumentation helpers with 662 lines of test coverage
• Tests for error recording, body recording with truncation, relay/provider/cache/session attributes
• Tests for HTTP/gRPC request/response recording with semconv compliance
• Tests for span context propagation and orphan span protection mechanisms

protocol/tracing/span_helpers_test.go


2. protocol/tracing/trace_manager.go ✨ Enhancement +339/-0

OpenTelemetry SDK initialization and lifecycle management

• Initializes OpenTelemetry SDK with environment variable configuration
• Implements sampler and propagator builders from OTel spec environment variables
• Manages tracer provider lifecycle with graceful shutdown
• Handles noop provider when tracing is disabled via OTEL_SDK_DISABLED or
 OTEL_TRACES_EXPORTER=none

protocol/tracing/trace_manager.go


3. protocol/tracing/propagation_test.go 🧪 Tests +353/-0

Trace context propagation test suite

• Tests for W3C trace context injection/extraction in HTTP and gRPC carriers
• Validates traceparent and tracestate header propagation round-trips
• Tests case-insensitive header handling and existing header replacement
• Verifies trace context preservation across protocol boundaries

protocol/tracing/propagation_test.go


View more (9)
4. protocol/tracing/span_helpers.go ✨ Enhancement +291/-0

Span instrumentation helpers and attribute recording

• Helper functions for recording span attributes (relay, provider, cache, session stats)
• Span lifecycle management with orphan-root protection for client/internal spans
• HTTP/gRPC request/response recording with semconv compliance
• Context plumbing for relay span access in deeply nested helpers

protocol/tracing/span_helpers.go


5. protocol/tracing/trace_manager_test.go 🧪 Tests +289/-0

Trace manager configuration and initialization tests

• Tests for SDK disable scenarios via environment variables
• Tests for sampler configuration from OTEL_TRACES_SAMPLER and OTEL_TRACES_SAMPLER_ARG
• Tests for propagator configuration from OTEL_PROPAGATORS
• Tests for OTEL_SDK_DISABLED parsing per OTel spec

protocol/tracing/trace_manager_test.go


6. protocol/rpcsmartrouter/rpcsmartrouter_server.go ✨ Enhancement +94/-9

Smart router server instrumentation with tracing spans

• Adds server span creation at relay entry point with trace context extraction
• Records relay attributes (GUID, chain ID, API interface, method, body)
• Instruments internal operations (parse, process, cache lookup, session acquisition) with child
 spans
• Records consistency stats, cache results, retry counts, and provider attempt numbers
• Adds error recording on span failures throughout relay processing pipeline

protocol/rpcsmartrouter/rpcsmartrouter_server.go


7. protocol/rpcsmartrouter/direct_rpc_relay.go ✨ Enhancement +22/-0

Direct RPC relay client span instrumentation

• Adds client spans for JSON-RPC, REST, and gRPC relay operations
• Records HTTP/gRPC request details (method, URL, headers) and response status/size
• Injects trace context headers into outbound requests
• Records errors on failed relay attempts

protocol/rpcsmartrouter/direct_rpc_relay.go


8. protocol/rpcsmartrouter/rpcsmartrouter.go ✨ Enhancement +30/-0

Smart router initialization with tracing setup

• Initializes TraceManager during smart router startup with configuration from environment
• Adds --otel-trace-body CLI flag for enabling request/response body recording
• Delegates all standard OTel configuration to environment variables per SDK spec
• Ensures graceful shutdown of trace manager on exit

protocol/rpcsmartrouter/rpcsmartrouter.go


9. protocol/tracing/propagation.go ✨ Enhancement +62/-0

W3C trace context propagation helpers

• Implements W3C trace context injection into HTTP metadata headers
• Implements W3C trace context injection into gRPC metadata maps
• Implements trace context extraction from HTTP metadata with case-insensitive header handling
• Replaces existing traceparent headers to avoid duplicates

protocol/tracing/propagation.go


10. protocol/tracing/span_names.go ✨ Enhancement +60/-0

Span and attribute name constants

• Defines span name constants for all instrumented operations in smart router
• Defines span attribute key constants for relay, provider, cache, session, and consistency data
• Centralizes naming to prevent drift across instrumentation sites

protocol/tracing/span_names.go


11. go.mod Dependencies +19/-6

OpenTelemetry and related dependencies

• Adds OpenTelemetry core dependencies (go.opentelemetry.io/otel, sdk, trace)
• Adds autoexport contrib package for automatic exporter selection
• Adds OTLP exporter packages for gRPC and HTTP protocols
• Updates indirect dependencies for OTel exporters and Prometheus client

go.mod


12. go.sum Dependencies +34/-8

OpenTelemetry tracing dependencies and version upgrades

• Updated cenkalti/backoff/v4 from v4.1.3 to v4.2.1 with new hash
• Added golang/glog v1.2.0 dependency
• Updated prometheus/client_golang from v1.16.0 to v1.17.0
• Updated prometheus/client_model from v0.4.0 to v0.5.0
• Updated prometheus/procfs from v0.10.1 to v0.11.1
• Added multiple OpenTelemetry exporters and related dependencies (OTLP trace/metric, Prometheus,
 stdout exporters)
• Added grpc-ecosystem/grpc-gateway/v2 v2.16.0 dependency
• Added opentelemetry/contrib/exporters/autoexport v0.46.1
• Added opentelemetry/proto/otlp v1.0.0
• Added opentelemetry/otel/sdk/metric v1.21.0

go.sum


Grey Divider

Qodo Logo

@qodo-code-review
Copy link
Copy Markdown

qodo-code-review bot commented Apr 7, 2026

Code Review by Qodo

🐞 Bugs (2)   📘 Rule violations (1)   📎 Requirement gaps (0)   🎨 UX Issues (0)
🐞\ ≡ Correctness (1) ☼ Reliability (1)
📘\ ⚙ Maintainability (1)

Grey Divider


Action required

1. Trace body flag ignored 🐞
Description
CreateRPCSmartRouterCobraCommand reads TraceConfig.TraceBody via viper.GetBool("otel-trace-body")
even though the flag is only registered on Cobra, so --otel-trace-body won’t enable body recording
at runtime.
Code

protocol/rpcsmartrouter/rpcsmartrouter.go[R1360-1363]

+			traceCfg := tracing.TraceConfig{
+				TraceBody: viper.GetBool(tracing.TraceBodyFlag),
+			}
+			traceManager, err := tracing.New(ctx, traceCfg)
Evidence
The command registers the CLI flag ("otel-trace-body") but never binds it into Viper; TraceBody is
then read from Viper, so the CLI flag value is not observed. The same file demonstrates the intended
pattern via multiple viper.BindPFlag(...) calls for other Cobra flags.

protocol/rpcsmartrouter/rpcsmartrouter.go[1354-1367]
protocol/rpcsmartrouter/rpcsmartrouter.go[1403-1423]
protocol/rpcsmartrouter/rpcsmartrouter.go[1482-1494]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`--otel-trace-body` is registered as a Cobra flag but `TraceBody` is read via `viper.GetBool(tracing.TraceBodyFlag)` without any `viper.BindPFlag` call (or direct `cmd.Flags().GetBool`). As a result, passing `--otel-trace-body` won’t actually enable request/response body recording.
### Issue Context
In this command, some Cobra flags are explicitly bound into viper via `viper.BindPFlag(...)` (e.g. provider optimizer weights). `otel-trace-body` needs the same treatment, or should be read directly from Cobra flags.
### Fix Focus Areas
- protocol/rpcsmartrouter/rpcsmartrouter.go[1354-1367]
- protocol/rpcsmartrouter/rpcsmartrouter.go[1482-1494]
- protocol/rpcsmartrouter/rpcsmartrouter.go[1403-1423]
### Suggested fix
After registering the flag, add:

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools



Remediation recommended

2. Nonstandard test function names 📘
Description
New tests do not consistently follow the required TestComponent_Scenario naming convention
(several are named without an underscore scenario suffix). This reduces consistency and makes it
harder to scan and categorize tests in CI output.
Code

protocol/tracing/propagation_test.go[36]

+func TestInjectHTTP(t *testing.T) {
Evidence
PR Compliance ID 4 requires standard test naming (TestComponent_Scenario). The added test
functions include names like TestInjectHTTP, TestInjectGRPC, TestExtractHTTP, and
TestRecordError which do not include a scenario suffix separated by _.

AGENTS.md
protocol/tracing/propagation_test.go[36-36]
protocol/tracing/propagation_test.go[95-95]
protocol/tracing/propagation_test.go[145-145]
protocol/tracing/span_helpers_test.go[42-42]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
Some newly added Go tests do not follow the required naming convention `TestComponent_Scenario`.
## Issue Context
Compliance requires consistent naming for tests to improve readability and CI discoverability.
## Fix Focus Areas
- protocol/tracing/propagation_test.go[36-210]
- protocol/tracing/span_helpers_test.go[42-662]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


3. GUID forced to zero 🐞
Description
sendRelayToDirectEndpoints unconditionally writes the parent GUID into the goroutine context even
when GetUniqueIdentifier returns (0,false), which can pin GUID=0 as “present” and prevent later code
from generating a real unique identifier.
Code

protocol/rpcsmartrouter/rpcsmartrouter_server.go[R832-834]

+			guid, _ := utils.GetUniqueIdentifier(ctx)
+			goroutineCtx = utils.WithUniqueIdentifier(goroutineCtx, guid)
+
Evidence
GetUniqueIdentifier returns (0,false) when no GUID is present. WithUniqueIdentifier always writes
the value into context (including 0), and other code uses GetUniqueIdentifier’s found result to
decide whether to generate/append a GUID; once 0 is stored, found becomes true and GUID generation
paths are skipped.

protocol/rpcsmartrouter/rpcsmartrouter_server.go[826-840]
utils/uniqueIdentifier.go[10-32]
protocol/lavasession/consumer_session_manager.go[1847-1850]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
In `sendRelayToDirectEndpoints`, the code ignores the `found` boolean from `utils.GetUniqueIdentifier(ctx)` and always calls `utils.WithUniqueIdentifier(goroutineCtx, guid)`. When the GUID is absent, this stores `uint64(0)` into the context, which makes subsequent `GetUniqueIdentifier` calls report `found=true` and can block later GUID generation/append logic.
### Issue Context
`utils.GetUniqueIdentifier` returns `(0,false)` when missing. `utils.WithUniqueIdentifier` writes the value unconditionally. Some downstream flows only generate a GUID when `found==false`, so a stored 0 changes behavior.
### Fix Focus Areas
- protocol/rpcsmartrouter/rpcsmartrouter_server.go[826-840]
- utils/uniqueIdentifier.go[10-32]
### Suggested fix
Use the `found` boolean (and/or guard against 0) before setting the value, or generate a new GUID:
Option A (preserve old behavior):

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


Grey Divider

ⓘ The new review experience is currently in Beta. Learn more

Grey Divider

Qodo Logo

Comment on lines +1360 to +1363
traceCfg := tracing.TraceConfig{
TraceBody: viper.GetBool(tracing.TraceBodyFlag),
}
traceManager, err := tracing.New(ctx, traceCfg)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Action required

1. Trace body flag ignored 🐞 Bug ≡ Correctness

CreateRPCSmartRouterCobraCommand reads TraceConfig.TraceBody via viper.GetBool("otel-trace-body")
even though the flag is only registered on Cobra, so --otel-trace-body won’t enable body recording
at runtime.
Agent Prompt
### Issue description
`--otel-trace-body` is registered as a Cobra flag but `TraceBody` is read via `viper.GetBool(tracing.TraceBodyFlag)` without any `viper.BindPFlag` call (or direct `cmd.Flags().GetBool`). As a result, passing `--otel-trace-body` won’t actually enable request/response body recording.

### Issue Context
In this command, some Cobra flags are explicitly bound into viper via `viper.BindPFlag(...)` (e.g. provider optimizer weights). `otel-trace-body` needs the same treatment, or should be read directly from Cobra flags.

### Fix Focus Areas
- protocol/rpcsmartrouter/rpcsmartrouter.go[1354-1367]
- protocol/rpcsmartrouter/rpcsmartrouter.go[1482-1494]
- protocol/rpcsmartrouter/rpcsmartrouter.go[1403-1423]

### Suggested fix
After registering the flag, add:
```go
if err := viper.BindPFlag(tracing.TraceBodyFlag, cmdRPCSmartRouter.Flags().Lookup(tracing.TraceBodyFlag)); err != nil {
    utils.LavaFormatFatal("failed binding otel trace body flag", err)
}
```

Alternatively, in `RunE` read it from Cobra:
```go
traceBody, _ := cmd.Flags().GetBool(tracing.TraceBodyFlag)
traceCfg := tracing.TraceConfig{ TraceBody: traceBody }
```

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

@codecov
Copy link
Copy Markdown

codecov bot commented Apr 7, 2026

Codecov Report

❌ Patch coverage is 64.42049% with 132 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
protocol/rpcsmartrouter/rpcsmartrouter_server.go 11.11% 48 Missing ⚠️
protocol/tracing/trace_manager.go 71.32% 38 Missing and 3 partials ⚠️
protocol/tracing/span_helpers.go 76.98% 22 Missing and 7 partials ⚠️
protocol/rpcsmartrouter/direct_rpc_relay.go 61.11% 7 Missing ⚠️
protocol/rpcsmartrouter/rpcsmartrouter.go 0.00% 7 Missing ⚠️
Flag Coverage Δ
consensus 8.96% <ø> (ø)
protocol 34.95% <64.42%> (+0.24%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
protocol/tracing/propagation.go 100.00% <100.00%> (ø)
protocol/rpcsmartrouter/direct_rpc_relay.go 53.94% <61.11%> (+0.33%) ⬆️
protocol/rpcsmartrouter/rpcsmartrouter.go 6.76% <0.00%> (-0.05%) ⬇️
protocol/tracing/span_helpers.go 76.98% <76.98%> (ø)
protocol/tracing/trace_manager.go 71.32% <71.32%> (ø)
protocol/rpcsmartrouter/rpcsmartrouter_server.go 13.31% <11.11%> (-0.05%) ⬇️

... and 2 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@github-actions
Copy link
Copy Markdown

github-actions bot commented Apr 7, 2026

Test Results

0 tests  ±0   0 ✅ ±0   0s ⏱️ ±0s
0 suites ±0   0 💤 ±0 
7 files   ±0   0 ❌ ±0 

Results for commit 97c6ffa. ± Comparison against base commit bec08b8.

♻️ This comment has been updated with latest results.

Copy link
Copy Markdown
Collaborator

@NadavLevi NadavLevi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review of the OTel tracing changes — 5 inline comments. Two are functional bugs (body flag dead, GUID forced to 0); the other three are smaller correctness/perf notes.

// because it's a per-invocation debug toggle rather than deployment
// configuration. Body size is delegated to the SDK via
// OTEL_SPAN_ATTRIBUTE_VALUE_LENGTH_LIMIT (SDK default: unlimited).
cmdRPCSmartRouter.Flags().Bool(tracing.TraceBodyFlag, false, "record request/response bodies on trace spans (size limit delegated to OTel SDK via OTEL_SPAN_ATTRIBUTE_VALUE_LENGTH_LIMIT)")
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

--otel-trace-body is registered on Cobra here but read via viper.GetBool(tracing.TraceBodyFlag) at line 1477 without a corresponding viper.BindPFlag. Other flags in this file explicitly bind to viper (lines 1515, 1531–1543, 1593). As-is, viper returns the default (false) regardless of CLI input, so the flag is a no-op.

Suggested fix — add right after the Flags().Bool(...) registration:

if err := viper.BindPFlag(tracing.TraceBodyFlag, cmdRPCSmartRouter.Flags().Lookup(tracing.TraceBodyFlag)); err != nil {
    log.Fatalln("unable to bind otel-trace-body flag", err)
}

goroutineCtx = utils.WithUniqueIdentifier(goroutineCtx, guid)
}
guid, _ := utils.GetUniqueIdentifier(ctx)
goroutineCtx = utils.WithUniqueIdentifier(goroutineCtx, guid)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Behavior change in this PR: utils.WithUniqueIdentifier (utils/uniqueIdentifier.go:14) writes unconditionally, so when GetUniqueIdentifier returns (0, false) this line stores uint64(0) into goroutineCtx.

Downstream, utils.AppendUniqueIdentifier (utils/uniqueIdentifier.go:19) skips when ctx.Value(...) != niluint64(0) satisfies that — so real GUID generation later in the pipeline is blocked. Previously the write was gated by if found.

Suggested fix — restore the guard and still capture guid for the span (0 is fine as 'unknown'):

guid, found := utils.GetUniqueIdentifier(ctx)
if found {
    goroutineCtx = utils.WithUniqueIdentifier(goroutineCtx, guid)
}
// ...
tracing.RecordProviderAttributes(provSpan, guid, endpointAddress)

}

ctx, processSpan := tracing.StartInternalSpan(ctx, tracing.SpanProcessingResult)
defer processSpan.End()
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

processSpan is named SpanProcessingResult but defer processSpan.End() at this scope keeps it open until SendParsedRelay returns — so the measured duration includes appendHeadersToRelayResult, analytics assignment, and LogRelay, not just ProcessingResult(). The span's timing no longer matches its name.

Suggested fix — scope the span to just the ProcessingResult call:

returnedResult, err := func() (*common.RelayResult, error) {
    _, processSpan := tracing.StartInternalSpan(ctx, tracing.SpanProcessingResult)
    defer processSpan.End()
    r, e := relayProcessor.ProcessingResult()
    if r != nil && r.Reply != nil {
        tracing.RecordBody(processSpan, tracing.AttrRelayResponseBody, r.Reply.Data)
    }
    return r, e
}()

cacheLatencyMs := float64(time.Since(cacheStart).Milliseconds())
cacheHit := cacheError == nil && cacheReply != nil && cacheReply.GetReply() != nil
tracing.RecordCacheResult(ctx, cacheSpan, cacheHit, cacheLatencyMs)
cacheSpan.End()
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: cacheSpan.End() is called explicitly rather than deferred. If anything between line 1584 and here panics, the span leaks. Defensive nit — prefer defer cacheSpan.End() right after the span is started (ideally wrapping this whole block in a helper or IIFE so the defer fires on any exit).

// by the SDK's SpanLimits, which reads OTEL_SPAN_ATTRIBUTE_VALUE_LENGTH_LIMIT
// (default: unlimited). Operators who want a body cap should set that env var.
func RecordBody(span trace.Span, attrKey string, body []byte) {
if traceBodyEnabled && span.IsRecording() {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perf note: string(body) copies the full byte slice before SetAttributes applies the SDK's OTEL_SPAN_ATTRIBUTE_VALUE_LENGTH_LIMIT truncation. For a large response body with a small configured limit, the full payload is still allocated only to be thrown away.

Suggested fix — read the env var once during New() and pre-truncate here:

var bodyAttrLimit int // set from OTEL_SPAN_ATTRIBUTE_VALUE_LENGTH_LIMIT, 0 = unlimited

func RecordBody(span trace.Span, attrKey string, body []byte) {
    if !traceBodyEnabled || !span.IsRecording() {
        return
    }
    if bodyAttrLimit > 0 && len(body) > bodyAttrLimit {
        body = body[:bodyAttrLimit]
    }
    span.SetAttributes(attribute.String(attrKey, string(body)))
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants