feat: Added support of Open Telemetry tracing for Smart Router#2260
feat: Added support of Open Telemetry tracing for Smart Router#2260sotskov-do wants to merge 3 commits intomainfrom
Conversation
Review Summary by QodoAdd OpenTelemetry tracing support for Smart Router with comprehensive instrumentation
WalkthroughsDescription• Adds comprehensive OpenTelemetry tracing support to Smart Router with full SDK initialization and lifecycle management • Implements trace context propagation using W3C standards for HTTP and gRPC protocols • Instruments Smart Router server with spans at relay entry point and throughout the processing pipeline (parse, process, cache lookup, session acquisition) • Adds client-side span instrumentation for JSON-RPC, REST, and gRPC relay operations with request/response details • Records relay attributes including GUID, chain ID, API interface, method, consistency stats, cache results, and retry counts • Provides span helper utilities for error recording, body recording with truncation, and orphan span protection • Includes comprehensive test coverage for span helpers, trace context propagation, and trace manager configuration • Adds --otel-trace-body CLI flag for enabling request/response body recording • Delegates standard OpenTelemetry configuration to environment variables per SDK specification • Updates dependencies to include OpenTelemetry core, SDK, OTLP exporters, and related packages Diagramflowchart LR
A["Smart Router Startup"] -->|Initialize| B["TraceManager"]
B -->|Configure| C["OTel SDK"]
C -->|Create| D["Tracer Provider"]
E["Relay Request"] -->|Extract Context| F["Server Span"]
F -->|Record Attributes| G["Relay Data"]
F -->|Create Child Spans| H["Internal Operations"]
H -->|Parse/Process/Cache| I["Operation Spans"]
J["Direct RPC Relay"] -->|Inject Context| K["Client Span"]
K -->|Record Details| L["Request/Response"]
L -->|Propagate| M["W3C Headers"]
N["Shutdown"] -->|Graceful| O["Trace Manager"]
File Changes1. protocol/tracing/span_helpers_test.go
|
Code Review by Qodo
|
| traceCfg := tracing.TraceConfig{ | ||
| TraceBody: viper.GetBool(tracing.TraceBodyFlag), | ||
| } | ||
| traceManager, err := tracing.New(ctx, traceCfg) |
There was a problem hiding this comment.
1. Trace body flag ignored 🐞 Bug ≡ Correctness
CreateRPCSmartRouterCobraCommand reads TraceConfig.TraceBody via viper.GetBool("otel-trace-body")
even though the flag is only registered on Cobra, so --otel-trace-body won’t enable body recording
at runtime.
Agent Prompt
### Issue description
`--otel-trace-body` is registered as a Cobra flag but `TraceBody` is read via `viper.GetBool(tracing.TraceBodyFlag)` without any `viper.BindPFlag` call (or direct `cmd.Flags().GetBool`). As a result, passing `--otel-trace-body` won’t actually enable request/response body recording.
### Issue Context
In this command, some Cobra flags are explicitly bound into viper via `viper.BindPFlag(...)` (e.g. provider optimizer weights). `otel-trace-body` needs the same treatment, or should be read directly from Cobra flags.
### Fix Focus Areas
- protocol/rpcsmartrouter/rpcsmartrouter.go[1354-1367]
- protocol/rpcsmartrouter/rpcsmartrouter.go[1482-1494]
- protocol/rpcsmartrouter/rpcsmartrouter.go[1403-1423]
### Suggested fix
After registering the flag, add:
```go
if err := viper.BindPFlag(tracing.TraceBodyFlag, cmdRPCSmartRouter.Flags().Lookup(tracing.TraceBodyFlag)); err != nil {
utils.LavaFormatFatal("failed binding otel trace body flag", err)
}
```
Alternatively, in `RunE` read it from Cobra:
```go
traceBody, _ := cmd.Flags().GetBool(tracing.TraceBodyFlag)
traceCfg := tracing.TraceConfig{ TraceBody: traceBody }
```
ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools
NadavLevi
left a comment
There was a problem hiding this comment.
Review of the OTel tracing changes — 5 inline comments. Two are functional bugs (body flag dead, GUID forced to 0); the other three are smaller correctness/perf notes.
| // because it's a per-invocation debug toggle rather than deployment | ||
| // configuration. Body size is delegated to the SDK via | ||
| // OTEL_SPAN_ATTRIBUTE_VALUE_LENGTH_LIMIT (SDK default: unlimited). | ||
| cmdRPCSmartRouter.Flags().Bool(tracing.TraceBodyFlag, false, "record request/response bodies on trace spans (size limit delegated to OTel SDK via OTEL_SPAN_ATTRIBUTE_VALUE_LENGTH_LIMIT)") |
There was a problem hiding this comment.
--otel-trace-body is registered on Cobra here but read via viper.GetBool(tracing.TraceBodyFlag) at line 1477 without a corresponding viper.BindPFlag. Other flags in this file explicitly bind to viper (lines 1515, 1531–1543, 1593). As-is, viper returns the default (false) regardless of CLI input, so the flag is a no-op.
Suggested fix — add right after the Flags().Bool(...) registration:
if err := viper.BindPFlag(tracing.TraceBodyFlag, cmdRPCSmartRouter.Flags().Lookup(tracing.TraceBodyFlag)); err != nil {
log.Fatalln("unable to bind otel-trace-body flag", err)
}| goroutineCtx = utils.WithUniqueIdentifier(goroutineCtx, guid) | ||
| } | ||
| guid, _ := utils.GetUniqueIdentifier(ctx) | ||
| goroutineCtx = utils.WithUniqueIdentifier(goroutineCtx, guid) |
There was a problem hiding this comment.
Behavior change in this PR: utils.WithUniqueIdentifier (utils/uniqueIdentifier.go:14) writes unconditionally, so when GetUniqueIdentifier returns (0, false) this line stores uint64(0) into goroutineCtx.
Downstream, utils.AppendUniqueIdentifier (utils/uniqueIdentifier.go:19) skips when ctx.Value(...) != nil — uint64(0) satisfies that — so real GUID generation later in the pipeline is blocked. Previously the write was gated by if found.
Suggested fix — restore the guard and still capture guid for the span (0 is fine as 'unknown'):
guid, found := utils.GetUniqueIdentifier(ctx)
if found {
goroutineCtx = utils.WithUniqueIdentifier(goroutineCtx, guid)
}
// ...
tracing.RecordProviderAttributes(provSpan, guid, endpointAddress)| } | ||
|
|
||
| ctx, processSpan := tracing.StartInternalSpan(ctx, tracing.SpanProcessingResult) | ||
| defer processSpan.End() |
There was a problem hiding this comment.
processSpan is named SpanProcessingResult but defer processSpan.End() at this scope keeps it open until SendParsedRelay returns — so the measured duration includes appendHeadersToRelayResult, analytics assignment, and LogRelay, not just ProcessingResult(). The span's timing no longer matches its name.
Suggested fix — scope the span to just the ProcessingResult call:
returnedResult, err := func() (*common.RelayResult, error) {
_, processSpan := tracing.StartInternalSpan(ctx, tracing.SpanProcessingResult)
defer processSpan.End()
r, e := relayProcessor.ProcessingResult()
if r != nil && r.Reply != nil {
tracing.RecordBody(processSpan, tracing.AttrRelayResponseBody, r.Reply.Data)
}
return r, e
}()| cacheLatencyMs := float64(time.Since(cacheStart).Milliseconds()) | ||
| cacheHit := cacheError == nil && cacheReply != nil && cacheReply.GetReply() != nil | ||
| tracing.RecordCacheResult(ctx, cacheSpan, cacheHit, cacheLatencyMs) | ||
| cacheSpan.End() |
There was a problem hiding this comment.
Minor: cacheSpan.End() is called explicitly rather than deferred. If anything between line 1584 and here panics, the span leaks. Defensive nit — prefer defer cacheSpan.End() right after the span is started (ideally wrapping this whole block in a helper or IIFE so the defer fires on any exit).
| // by the SDK's SpanLimits, which reads OTEL_SPAN_ATTRIBUTE_VALUE_LENGTH_LIMIT | ||
| // (default: unlimited). Operators who want a body cap should set that env var. | ||
| func RecordBody(span trace.Span, attrKey string, body []byte) { | ||
| if traceBodyEnabled && span.IsRecording() { |
There was a problem hiding this comment.
Perf note: string(body) copies the full byte slice before SetAttributes applies the SDK's OTEL_SPAN_ATTRIBUTE_VALUE_LENGTH_LIMIT truncation. For a large response body with a small configured limit, the full payload is still allocated only to be thrown away.
Suggested fix — read the env var once during New() and pre-truncate here:
var bodyAttrLimit int // set from OTEL_SPAN_ATTRIBUTE_VALUE_LENGTH_LIMIT, 0 = unlimited
func RecordBody(span trace.Span, attrKey string, body []byte) {
if !traceBodyEnabled || !span.IsRecording() {
return
}
if bodyAttrLimit > 0 && len(body) > bodyAttrLimit {
body = body[:bodyAttrLimit]
}
span.SetAttributes(attribute.String(attrKey, string(body)))
}
Description
Closes: #XXXX
Author Checklist
All items are required. Please add a note to the item if the item is not applicable and
please add links to any relevant follow up issues.
I have...
!in the type prefix if API or client breaking changemainbranchReviewers Checklist
All items are required. Please add a note if the item is not applicable and please add
your handle next to the items reviewed if you only reviewed selected items.
I have...