feat(sdk): [Enterprise Integration]: Add provider-agnostic tracing #145

Open
namrataghadi-galileo wants to merge 4 commits into main from
feature/59789-add-provider-agnostic-tracing

Conversation

@namrataghadi-galileo
Contributor

Summary
Added a new provider-agnostic telemetry package to the AgentControl Python SDK for external trace context resolution and merged control event emission.
Updated tracing to consult a registered external trace context provider before falling back to OTEL context.
Exported the new telemetry APIs from the top-level agent_control package.
Added focused tests to ensure provider/sink failures do not affect existing behavior.

Scope

  • User-facing/API changes:

    • New SDK APIs:

      1. set_trace_context_provider(...)
      2. get_trace_context_from_provider()
      3. clear_trace_context_provider()
      4. set_control_event_sink(...)
      5. emit_control_events(...)
      6. clear_control_event_sink()
  • Internal changes:

    • Added:

      1. agent_control/telemetry/trace_context.py
      2. agent_control/telemetry/event_sink.py
      3. agent_control/telemetry/__init__.py
      4. Tests for provider behavior, sink behavior, and tracing precedence.

    • Updated tracing.py to consult the external provider before the OTEL fallback.
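The registry pattern behind the new provider APIs can be sketched in a few lines. This is a self-contained, hedged sketch based only on the API names listed above; the actual signatures and internals in `agent_control` may differ.

```python
# Minimal self-contained sketch of the trace-context provider registry
# described in this PR. Names mirror the new public APIs; the exact
# signatures in agent_control are assumptions, not the real interfaces.
from typing import Callable, Optional

TraceContext = dict  # e.g. {"trace_id": "...", "span_id": "..."}

_trace_context_provider: Optional[Callable[[], Optional[TraceContext]]] = None


def set_trace_context_provider(provider: Callable[[], Optional[TraceContext]]) -> None:
    """Register an external provider, e.g. one backed by another tracing framework."""
    global _trace_context_provider
    _trace_context_provider = provider


def get_trace_context_from_provider() -> Optional[TraceContext]:
    """Return the provider's context, or None if unset or failing."""
    if _trace_context_provider is None:
        return None
    try:
        return _trace_context_provider()
    except Exception:
        # Provider failures must not affect existing behavior (per this PR's tests).
        return None


def clear_trace_context_provider() -> None:
    global _trace_context_provider
    _trace_context_provider = None


# Usage: surface an external framework's IDs to the SDK.
set_trace_context_provider(lambda: {"trace_id": "abc123", "span_id": "def456"})
print(get_trace_context_from_provider())  # {'trace_id': 'abc123', 'span_id': 'def456'}
clear_trace_context_provider()
print(get_trace_context_from_provider())  # None
```

The module-level registry keeps the change additive: nothing happens unless a provider is explicitly registered, which matches the rollback story below.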

Risk and Rollout
Risk level: low
Rollback plan:
Revert the new telemetry package and the small tracing/export changes in the SDK.
Since the change is additive and inactive unless a provider or sink is explicitly registered, rollback is straightforward.

Testing

  • Added or updated automated tests
  • Ran make check (or explained why not)
  • Ran targeted SDK tests only; did not run the full project check suite.
  • Manually verified behavior

Checklist

  • Linked issue/spec (if applicable)
  • Updated docs/examples for user-facing changes
  • Included any required follow-up tasks

@namrataghadi-galileo namrataghadi-galileo changed the title feat(sdk): Add provider-agnostic tracing feat(sdk): [Enterprise Integration]: Add provider-agnostic tracing Mar 23, 2026
@codecov

codecov bot commented Mar 23, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.


@lan17
Contributor

lan17 commented Mar 23, 2026

Why not emit these from the server?

@namrataghadi-galileo
Contributor Author

@lan17 Keeping emission in the SDK gives us one merged, ordered batch and a single integration model for both logger and non-logger flows.

@lan17
Contributor

lan17 commented Mar 23, 2026

> @lan17 Keeping emission in the SDK gives us one merged, ordered batch and a single integration model for both logger and non-logger flows.

What about other language SDKs, like TypeScript, Java, etc.?

If we can delegate this to the agent control server, it makes the SDK easier.

Could we just send additional metadata to the agent control server when we send events, to integrate with Galileo spans/traces?

@namrataghadi-galileo
Contributor Author

Server-side emission is attractive for thinner SDKs, but it defeats the main purpose of the logger-based design: reusing Galileo Logger’s in-process trace buffering and flush. For logger integrations, the SDK must have the merged control events locally before flush. Other languages can still be supported through the thinner OTEL non-logger path until they need full logger integration.

@lan17
Contributor

lan17 commented Mar 23, 2026

> Server-side emission is attractive for thinner SDKs, but it defeats the main purpose of the logger-based design: reusing Galileo Logger’s in-process trace buffering and flush. For logger integrations, the SDK must have the merged control events locally before flush. Other languages can still be supported through the thinner OTEL non-logger path until they need full logger integration.

Should this be done outside of the core OSS SDK, then?

This pattern conflicts with one @abhinav-galileo implemented where we emit events to the server (also via a buffer on the SDK), so it may be confusing to have both systems active at the same time.

@namrataghadi-galileo
Contributor Author

namrataghadi-galileo commented Mar 23, 2026

The current PR only adds provider-agnostic hooks to OSS. It does not move logger context into OSS or change the default buffered event-to-server flow. Logger context, span conversion, trace attachment, and flush integration still belong to the external Galileo integration layer, so OSS is not taking on a second concrete observability system.

Also, it would be confusing if both were active default systems. It is much less confusing if the existing buffered SDK-to-server flow remains the only default OSS behavior and the new sink/provider APIs are treated purely as optional extension points for external integrations.

@namrataghadi-galileo
Contributor Author

I’m also waiting to hear back from both Davids on the RFC and to get their perspective on the proposed logger-based and non-logger-based approaches. That said, the hook additions themselves are independent of any Galileo-specific observability or tracing design. They are generic enough to support other third-party observability systems as well. For example, if LangSmith is the external observability framework, these hooks could still be used to publish trace_id and span_id.

@lan17 lan17 left a comment
I like the attempt to keep the new telemetry surface generic, but there are still a few code-level gaps before this is ready to merge. I left inline comments on the sink wiring, provider validation, and the partial integration of the provider across SDK entry points.

_control_event_sink = sink


def emit_control_events(events: list[ControlExecutionEvent]) -> None:
Contributor

This adds a sink registration API, but no SDK evaluation path ever calls emit_control_events(). Local results still flow through evaluation._emit_local_events() into observability.add_event(), and server results are only merged as EvaluationResult. As written, set_control_event_sink() does not observe real control executions unless callers manually construct ControlExecutionEvent lists themselves.

Contributor Author

The goal was to add only the tracing interfaces in this PR and do the heavy lifting in a follow-up PR; otherwise this PR would have been large and difficult to review.
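For readers following this thread, the sink side of the surface can be sketched the same way the provider side is. This is a self-contained illustration; `ControlExecutionEvent` here is a simplified stand-in for the SDK's real event type, and the wiring into the evaluation path is exactly the follow-up work discussed above.

```python
# Hedged sketch of the control-event sink registry under discussion.
# ControlExecutionEvent is a simplified stand-in, not the SDK's type.
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class ControlExecutionEvent:
    control_name: str
    outcome: str


_control_event_sink: Optional[Callable[[List[ControlExecutionEvent]], None]] = None


def set_control_event_sink(sink: Callable[[List[ControlExecutionEvent]], None]) -> None:
    global _control_event_sink
    _control_event_sink = sink


def emit_control_events(events: List[ControlExecutionEvent]) -> None:
    if _control_event_sink is None:
        return  # Inactive unless a sink is explicitly registered (additive change).
    try:
        _control_event_sink(events)
    except Exception:
        pass  # Sink failures must not affect existing behavior.


def clear_control_event_sink() -> None:
    global _control_event_sink
    _control_event_sink = None


# Usage: capture one merged, ordered batch of control events.
received: List[ControlExecutionEvent] = []
set_control_event_sink(received.extend)
emit_control_events([ControlExecutionEvent("pii_filter", "passed")])
print(len(received))  # 1
```

As the review notes, this registry only observes real control executions once the SDK's evaluation path actually calls `emit_control_events()`, which this PR intentionally defers.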


trace_id = trace_context.get("trace_id")
span_id = trace_context.get("span_id")
if not isinstance(trace_id, str) or not isinstance(span_id, str):
Contributor

This validates types, but it still accepts empty strings. A provider returning {"trace_id": "", "span_id": ""} currently passes through as valid context, and the tracing helpers will return those empty IDs unchanged. That can silently drop headers via truthiness checks or produce uncorrelated local events; these values should be rejected here.

Contributor Author

done

return trace_id, span_id

# Try external provider
trace_context = get_trace_context_from_provider()
Contributor

This wires the provider into the tracing helpers, but the public evaluate_controls() path still forwards trace_id / span_id unchanged when callers omit them. So the new provider works for flows that use these helpers, but not for all public SDK evaluation entry points. That inconsistency should be fixed before we rely on this as the generic trace-context hook.

Contributor Author

Good catch. You’re right that just wiring the provider into tracing.py is not enough if the public evaluation path can still bypass it when trace_id / span_id are omitted. I’ll fix that.
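The precedence order agreed in this thread can be sketched as a single resolution helper: explicit caller-supplied IDs first, then the registered external provider, then the OTEL fallback. The function name `resolve_trace_ids` and the stub contexts are hypothetical illustrations, not the SDK's actual helpers.

```python
# Hedged sketch of the trace-ID resolution order discussed above.
from typing import Optional, Tuple


def _provider_context() -> Optional[dict]:
    # Stand-in for get_trace_context_from_provider(); returns None when unset.
    return None


def _otel_context() -> Tuple[Optional[str], Optional[str]]:
    # Stand-in for reading the current OTEL span context.
    return None, None


def resolve_trace_ids(
    trace_id: Optional[str] = None,
    span_id: Optional[str] = None,
) -> Tuple[Optional[str], Optional[str]]:
    # 1. Explicit caller-supplied IDs win.
    if trace_id and span_id:
        return trace_id, span_id
    # 2. Then the registered external provider.
    ctx = _provider_context()
    if ctx and ctx.get("trace_id") and ctx.get("span_id"):
        return ctx["trace_id"], ctx["span_id"]
    # 3. Finally, the OTEL fallback.
    return _otel_context()


print(resolve_trace_ids("t1", "s1"))  # ('t1', 's1')
print(resolve_trace_ids())            # (None, None) with these stub contexts
```

Routing every public entry point, including `evaluate_controls()`, through one helper like this is what closes the inconsistency the review identifies.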
