
feat(llm): MLflow tracing for the Agent SDK subprocess (migration, phase 2) #31

Draft
PalmPalm7 wants to merge 1 commit into rhpds:main from PalmPalm7:migration/sdk-tracing


@PalmPalm7

Contributor

What

The Agent SDK (#24) runs the Claude Code CLI as a child process, so the legacy app-side MLflow collector (src/connections/mlflow_tracking.py) never sees its LLM calls. This adds src/llm/sdk_tracing.py to trace the SDK path the way RHDP already traces Claude Code itself — by environment — so the subprocess exports its own claude_code.* spans (tokens, per-turn latency, tool calls) to the same MLflow server/experiment.

Why

Phase-2 needs trustworthy per-call cost/latency/tool numbers to benchmark the SDK path against legacy. This is the measurement layer.

How

  • build_tracing_env(config) derives MLFLOW_CLAUDE_TRACING_ENABLED / MLFLOW_TRACKING_URI / MLFLOW_EXPERIMENT_NAME (+ basic-auth) from the existing mlflow.* config — the same reads as init_mlflow. Returns {} when tracking_url is empty, so it's a no-op by default, exactly like the collector.
  • AgentSdkClient.from_config merges that env into the SDK subprocess env automatically; an explicit agent.sdk.env wins on conflict.
  • build_hooks_settings() returns the settings.json hooks block (Stop + SessionStart) for deployments that prefer hook-based wiring.
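The first two bullets can be sketched roughly as follows. This is a minimal illustration, not the code in src/llm/sdk_tracing.py: the config key names (`experiment_name`, `username`, `password`), the `"true"` flag value, and the `merge_sdk_env` helper are assumptions made for the example. `MLFLOW_TRACKING_USERNAME` and `MLFLOW_TRACKING_PASSWORD` are the standard MLflow basic-auth variables.

```python
from typing import Any, Mapping


def build_tracing_env(config: Mapping[str, Any]) -> dict[str, str]:
    """Derive MLflow env vars for the SDK subprocess; {} when tracing is off."""
    mlflow_cfg = config.get("mlflow") or {}
    tracking_url = (mlflow_cfg.get("tracking_url") or "").strip()
    if not tracking_url:
        # No tracking server configured: no-op, same default as the collector.
        return {}

    env = {
        "MLFLOW_CLAUDE_TRACING_ENABLED": "true",  # assumed flag value
        "MLFLOW_TRACKING_URI": tracking_url,
    }
    experiment = mlflow_cfg.get("experiment_name")  # assumed key name
    if experiment:
        env["MLFLOW_EXPERIMENT_NAME"] = str(experiment)

    # Basic auth is forwarded only when both halves are present.
    username = mlflow_cfg.get("username")  # assumed key name
    password = mlflow_cfg.get("password")  # assumed key name
    if username and password:
        env["MLFLOW_TRACKING_USERNAME"] = str(username)
        env["MLFLOW_TRACKING_PASSWORD"] = str(password)
    return env


def merge_sdk_env(config: Mapping[str, Any]) -> dict[str, str]:
    """Hypothetical helper: tracing env first, explicit agent.sdk.env on top."""
    explicit = ((config.get("agent") or {}).get("sdk") or {}).get("env") or {}
    # Later keys win in a dict merge, so explicit entries override derived ones.
    return {**build_tracing_env(config), **explicit}
```

Putting the explicit env last in the merge is what gives agent.sdk.env precedence on conflict, while an empty tracking_url leaves the subprocess env untouched.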

Scope / safety

Additive; only affects the sdk runtime, and only when mlflow.tracking_url is set. No change to the legacy path or default deploy.

How to test

pytest tests/test_sdk_tracing.py -q   # 9 tests

Result (local gate)

  • black ✓ · ruff ✓ · mypy ✓
  • pytest tests/test_sdk_tracing.py → 9 passed; full suite → 96 passed, no regressions
  • End-to-end span export against the live MLflow parsec-agent-metrics is pending in-cluster verification — results will be commented below once run.

Builds on #24 (SDK adapter) and #30 (AgentRunner). Plan: artifacts/parsec-agent-sdk-migration-plan.md.

The SDK runs the Claude Code CLI as a child process, so the legacy
app-side MLflow collector doesn't see its work. Trace it the way RHDP
already traces Claude Code itself: by environment, so the subprocess
exports its own claude_code.* spans (tokens, per-turn latency, tool
calls) to the same MLflow server.

- src/llm/sdk_tracing.py: build_tracing_env() derives MLFLOW_* env from
  the existing mlflow.* config (uri, experiment, basic-auth); returns {}
  when tracking_url is empty, so it's a no-op by default like the
  collector. build_hooks_settings() returns the settings.json hooks block
  (Stop + SessionStart) for hook-based wiring.
- AgentSdkClient.from_config now merges the tracing env into the SDK
  subprocess env automatically; an explicit agent.sdk.env wins on conflict.

9 tracing tests; adapter suite still green. End-to-end span export is
pending in-cluster verification (MLflow experiment parsec-agent-metrics).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@PalmPalm7

Contributor Author

Test results.

  • Local gate: black ✓ · ruff ✓ · mypy ✓ · pytest tests/test_sdk_tracing.py → 9 passed; full suite → 96 passed, no regressions.
  • Upstream CI: quality-gates + docker-build + ci-status all green.
  • End-to-end span export to the live MLflow (parsec-agent-metrics) is pending in-cluster verification — the personal NERC cluster is currently at pod/disk capacity (no schedulable nodes). Will post the trace evidence here once capacity frees.

@rut31337

rut31337 commented Jun 8, 2026

Contributor

Code Review — PR #31 (Draft)

Scope: 3 files, +214 lines — MLflow tracing for Agent SDK subprocess
Effort: High

Findings (2)

1. build_hooks_settings() is dead code (src/llm/sdk_tracing.py:80)
Defined but never called anywhere in the codebase. Adds maintenance surface with no current value. Consider removing until there's a concrete caller.

2. Duplicated _section() helper (src/llm/sdk_tracing.py:97)
Identical to _get_section() in agent_sdk_client.py. Bug fixes must be applied to both. Extract to a shared location — both modules already live under src/llm/.
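One possible shape for the shared helper, as a sketch only: it assumes both _section() and _get_section() read a nested mapping out of the config and fall back to an empty dict when the section is missing. The name `get_section` and its variadic path signature are hypothetical, not taken from the PR.

```python
from typing import Any, Mapping


def get_section(config: Mapping[str, Any], *path: str) -> dict[str, Any]:
    """Return the nested config section at `path`, or {} if absent or not a mapping."""
    node: Any = config
    for key in path:
        if not isinstance(node, Mapping):
            # A parent level is missing or holds a scalar; treat as empty.
            return {}
        node = node.get(key)
    # Copy so callers cannot mutate the shared config by accident.
    return dict(node) if isinstance(node, Mapping) else {}
```

Both modules could then import the one helper, for example `get_section(config, "mlflow")` in the tracing module and `get_section(config, "agent", "sdk")` in the client.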

No blocking issues. Core tracing logic is correct — build_tracing_env() properly derives MLflow env vars from config, no-op path returns {} when tracking_url is empty, and env merge precedence in from_config is correct. Test coverage is solid across 9 tests.
