PR: M1 — Drop-in Instrumentation & End-to-End Optimization

Branch: feature/M1-instrument-and-optimize → feature/M0-technical-plan


Summary

This PR delivers Milestone 1 of the LangGraph OTEL Instrumentation API: two function calls to instrument and optimize any LangGraph agent end-to-end.

  • instrument_graph() — one-liner to wrap any StateGraph/CompiledGraph with full OTEL tracing, dual semantic conventions (param.* + gen_ai.*), and auto-derived Binding objects
  • optimize_graph() — one-liner optimization loop: invoke → flush OTLP → TGJ → ingest_tgj → optimizer backward/step → apply_updates() via bindings → next iteration uses updated prompts
  • 63 tests passing (StubLLM-only, CI-safe) including a 21-test E2E integration suite with a real LangGraph
  • M1 notebook with dual mode (StubLLM deterministic + Live OpenRouter), Colab badge, and executed outputs

All 7 M1 acceptance gates from the reviewed technical plan are verified.


Architecture

User Code
─────────
ig = instrument_graph(graph, llm, initial_templates={...})
result = optimize_graph(ig, queries=[...], iterations=3)

Under the Hood
──────────────
┌─────────────────────────────────────────────────────────────┐
│  InstrumentedGraph                                          │
│  ├── .session    → TelemetrySession (TracerProvider +       │
│  │                  InMemorySpanExporter, flush_otlp/tgj)   │
│  ├── .tracing_llm → TracingLLM (dual semconv:               │
│  │                    param.* parent + gen_ai.* child spans) │
│  ├── .templates  → {param_key: template_str}                │
│  ├── .bindings   → {param_key: Binding(get, set)}           │
│  └── .invoke()   → delegates to compiled LangGraph          │
└─────────────────────────────────────────────────────────────┘

Optimization Loop (per iteration)
──────────────────────────────────
invoke() → flush_otlp() → OTLP→TGJ → ingest_tgj()
→ ParameterNode + MessageNode → backward(feedback)
→ optimizer.step() → apply_updates(bindings) → next invoke()
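
Putting the two calls together, a minimal end-to-end sketch (the graph and llm objects are assumed to be built beforehand; the import path and the eval_fn keyword follow the layout described in this PR but may differ in detail from the shipped API):

```python
# Minimal sketch, not the shipped example. Assumes `graph` (a StateGraph or
# CompiledGraph) and `llm` (e.g. a StubLLM or OpenRouter client) already exist.
from opto.trace.io import instrument_graph, optimize_graph

ig = instrument_graph(
    graph,
    llm,
    initial_templates={
        "planner.prompt": "Plan the steps needed to answer: {query}",
        "synthesizer.prompt": "Write the final answer using this plan: {plan}",
    },
)

def eval_fn(result):
    # May return float, str, dict, or EvalResult — all are auto-normalized.
    return 1.0 if result.get("answer") else 0.0

report = optimize_graph(ig, queries=["What is OTEL?"], iterations=3, eval_fn=eval_fn)
print(report.best_parameters)   # snapshot captured at the best-scoring iteration
```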

New Modules (opto/trace/io/)

| File | Lines | Purpose |
|------|-------|---------|
| instrumentation.py | 138 | instrument_graph() + InstrumentedGraph dataclass |
| optimization.py | 412 | optimize_graph() + EvalResult, EvalFn, RunResult, OptimizationResult |
| telemetry_session.py | 188 | TelemetrySession — unified OTEL session with flush_otlp(), flush_tgj(), export_run_bundle() |
| bindings.py | 105 | Binding dataclass + apply_updates() + make_dict_binding() |
| otel_semconv.py | 126 | emit_reward(), emit_trace(), record_genai_chat(), set_span_attributes() |
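
To make the binding layer concrete, here is a minimal sketch of the shape implied by the table above (field and argument names beyond get, set, and strict are assumptions, not the shipped code):

```python
# Sketch of the Binding layer described above — illustrative, not the shipped module.
from dataclasses import dataclass
from typing import Any, Callable, Dict

@dataclass
class Binding:
    get: Callable[[], Any]          # read the current parameter value
    set: Callable[[Any], None]      # write an updated value back to its home

def make_dict_binding(store: Dict[str, Any], key: str) -> Binding:
    """Bind a parameter key to one entry of a plain dict (e.g. ig.templates)."""
    return Binding(get=lambda: store[key], set=lambda v: store.__setitem__(key, v))

def apply_updates(updates: Dict[str, Any], bindings: Dict[str, Binding], strict: bool = True) -> None:
    """Push optimizer outputs through the bindings; strict=True raises on unknown keys."""
    for key, value in updates.items():
        if key not in bindings:
            if strict:
                raise KeyError(f"No binding registered for parameter key: {key}")
            continue
        bindings[key].set(value)
```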

Modified Modules

| File | Change |
|------|--------|
| langgraph_otel_runtime.py | TracingLLM enhanced: dual semconv parent (param.*) + child (gen_ai.*) spans; child spans carry trace.temporal_ignore = "true" to protect TGJ chaining |
| __init__.py | Exports all new M1 public APIs (21 symbols) |
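
For reference, an illustrative OpenTelemetry snippet showing the parent/child span topology described above. The trace.temporal_ignore and gen_ai.provider.name keys are taken from this PR; the remaining param.* and gen_ai.* attribute names, the span names, and the function itself are placeholders, not the TracingLLM implementation:

```python
# Illustrative only — shows the span topology and attribute layout, not TracingLLM itself.
from opentelemetry import trace

tracer = trace.get_tracer("opto.trace.io")

def node_call_sketch(param_key: str, template: str, prompt: str, completion: str) -> None:
    # Parent span: optimization-facing attributes under param.*
    with tracer.start_as_current_span("node_call") as parent:
        parent.set_attribute("param.key", param_key)
        parent.set_attribute("param.template", template)

        # Child span: gen_ai.* observability attributes; marked so the
        # OTLP→TGJ adapter never advances the temporal chain on it.
        with tracer.start_as_current_span("gen_ai.chat") as child:
            child.set_attribute("gen_ai.provider.name", "openrouter")
            child.set_attribute("gen_ai.prompt", prompt)
            child.set_attribute("gen_ai.completion", completion)
            child.set_attribute("trace.temporal_ignore", "true")
```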

Tests (63 passing)

| File | Tests | Scope |
|------|-------|-------|
| test_bindings.py | 10 | Binding, apply_updates(), strict/non-strict modes |
| test_otel_semconv.py | 5 | emit_reward(), emit_trace(), record_genai_chat() |
| test_telemetry_session.py | 6 | flush_otlp(), flush_tgj(), record_spans, span_attribute_filter |
| test_instrumentation.py | 10 | instrument_graph(), child span emission, temporal chaining |
| test_optimization.py | 11 | EvalResult, _normalise_eval(), data contracts |
| test_e2e_m1_pipeline.py | 21 | Full E2E: real LangGraph + StubLLM → invoke → OTLP → TGJ → ParameterNode → mock optimizer → apply_updates() → re-invoke with updated template |

Notebook and Docs

| File | Description |
|------|-------------|
| examples/notebooks/01_m1_instrument_and_optimize.ipynb | 10 sections, 20 code cells, dual mode (StubLLM + Live OpenRouter), Colab badge, 3-item dataset, temperature=0, max_tokens=256 budget guard, committed with executed outputs |
| docs/m1_README.md | Architecture diagrams, full API reference, data flow pipeline, semantic convention design, temporal chaining contract, file map, quick start, acceptance criteria status |
| requirements.txt | Pinned dependencies for uv/pip environments |

Key Design Decisions

  1. Generic — no hard-coded node names. trainable_keys=None means all nodes are trainable; pass an explicit set to restrict optimization to specific nodes. No "planner" or "synthesizer" baked into the optimization API.

  2. Dual semantic conventions. Each TracingLLM.node_call() emits a parent span (param.* for optimization) and an optional child span (gen_ai.* for Agent Lightning observability). Child spans carry trace.temporal_ignore = "true".

  3. Temporal chaining contract. The OTLP → TGJ adapter only advances the temporal chain on root spans (those without a parentSpanId). This prevents child LLM spans from corrupting the planner → synthesizer ordering that the optimizer relies on. Verified by 3 dedicated tests.

  4. Explicit Binding layer. Optimizer output keys map to Binding(get, set) objects. apply_updates() pushes values through the bindings. Auto-derived from initial_templates by default, but users can supply custom bindings for class attributes, DB rows, etc.

  5. Flexible EvalFn contract. eval_fn can return float, str, dict, or EvalResult — all auto-normalized. This avoids forcing users into a rigid evaluation interface.

  6. Closure-based wiring. Node functions close over ig.tracing_llm and the ig.templates dict. When apply_updates() modifies the dict, the next invoke() automatically reads the updated values — no globals needed (see the sketch below).
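
A self-contained illustration of this closure-based wiring (the node name, template key, and fake LLM are invented for the example; real nodes receive LangGraph state and call ig.tracing_llm):

```python
# Minimal sketch of closure-based wiring — not the shipped node functions.
templates = {"planner.prompt": "Plan the steps needed to answer: {query}"}

def fake_llm(prompt: str) -> str:
    return f"LLM saw: {prompt}"

def planner_node(state: dict) -> dict:
    # Closes over `templates`; it always reads the *current* value, so an update
    # written by apply_updates() is picked up on the very next invoke().
    prompt = templates["planner.prompt"].format(query=state["query"])
    return {**state, "plan": fake_llm(prompt)}

print(planner_node({"query": "What is OTEL?"})["plan"])

# After an optimization step, apply_updates() mutates the same dict in place:
templates["planner.prompt"] = "Think step by step, then plan how to answer: {query}"
print(planner_node({"query": "What is OTEL?"})["plan"])   # reflects the updated template
```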


M1 Acceptance Gates

| # | Gate | Criterion | Status |
|---|------|-----------|--------|
| 1 | OTLP export works | flush_otlp(clear=True) returns at least 1 span; a second flush returns 0 | PASS |
| 2 | TGJ conversion works | flush_tgj() docs are consumable by ingest_tgj() | PASS |
| 3 | Temporal chaining | child spans do NOT advance the TGJ temporal chain | PASS |
| 4 | Bindings apply deterministically | strict=True raises on missing keys | PASS |
| 5 | E2E update path (StubLLM) | optimize_graph(iterations>=2) changes at least 1 prompt | PASS |
| 6 | Notebook live validation | OTLP + TGJ with param.* spans from a real provider | PASS |
| 7 | Tests + notebook gate | every new API has at least 1 pytest; Colab badge present | PASS |
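
As an illustration of how a gate reads as a test, a pytest-style sketch of gate 1 (the session and instrumented_graph fixtures and the exact return type of flush_otlp() are assumptions; the shipped checks in the test files above may differ):

```python
# Sketch of the gate-1 check: flush_otlp(clear=True) drains the in-memory exporter.
def test_flush_otlp_clears_exporter(session, instrumented_graph):
    instrumented_graph.invoke({"query": "ping"})

    first = session.flush_otlp(clear=True)
    assert len(first) >= 1            # at least one span was exported

    second = session.flush_otlp(clear=True)
    assert len(second) == 0           # exporter was cleared by the first flush
```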

Test Plan

  • Run python -m pytest tests/unit_tests/test_bindings.py tests/unit_tests/test_otel_semconv.py tests/unit_tests/test_telemetry_session.py tests/unit_tests/test_instrumentation.py tests/unit_tests/test_optimization.py tests/features_tests/test_e2e_m1_pipeline.py -v — all 63 pass
  • Notebook runs end-to-end in StubLLM mode without API keys
  • Notebook live section runs when OPENROUTER_API_KEY is set (Colab Secrets or .env)
  • No secrets committed in notebook outputs
  • notebook_outputs/m1/ artifacts contain valid OTLP JSON and TGJ JSON

doxav and others added 21 commits February 12, 2026 15:01
…tion do not lose initial node to optimize (TODO: trainer might have a better solution)
- Add T1 technical plan for LangGraph OTEL Instrumentation API
- Add architecture & strategy doc (unified OTEL instrumentation design)
- Add M0 README with before/after boilerplate reduction comparison
- Add feedback analysis and API strategy comparison (Trace-first, dual semconv)
- Add prototype_api_validation.py with real LangGraph StateGraph + OpenRouter/StubLLM
- Add Jupyter notebook (prototype_api_validation.ipynb) for Colab-ready demo
- Add example trace output JSON files (notebook_trace_output, optimization_traces)
- Add .env.example for OpenRouter configuration
- Replace hardcoded API key with 3-tier auto-lookup (Colab Secrets → env → .env)
- Save all trace outputs to RUN_FOLDER (Google Drive on Colab, local fallback)
- Add run_summary.json export with scores and history
- Update configuration docs with key setup priority table
- Fix Colab badge URL with actual repo/branch path
Deliver Milestone 1 — drop-in OTEL instrumentation and end-to-end
optimization for any LangGraph agent via two function calls.

New modules (opto/trace/io/):
- instrumentation.py: instrument_graph() + InstrumentedGraph wrapper
- optimization.py: optimize_graph() loop + EvalResult/EvalFn contracts
- telemetry_session.py: TelemetrySession (TracerProvider + flush/export)
- bindings.py: Binding dataclass + apply_updates() + make_dict_binding()
- otel_semconv.py: emit_reward(), emit_trace(), record_genai_chat()

Modified modules:
- langgraph_otel_runtime.py: TracingLLM dual semconv (param.* parent +
  gen_ai.* child spans with trace.temporal_ignore)
- __init__.py: export all new M1 public APIs

Tests (63 passing, StubLLM-only, CI-safe):
- Unit tests for bindings, semconv, session, instrumentation, optimization
- E2E integration test (test_e2e_m1_pipeline.py): real LangGraph with
  StubLLM proving full pipeline instrument → invoke → OTLP → TGJ →
  optimizer → apply_updates → re-invoke with updated template

Notebook + docs:
- 01_m1_instrument_and_optimize.ipynb: dual-mode (StubLLM + live
  OpenRouter), Colab badge, executed outputs, <=3 item dataset,
  temperature=0, max_tokens=256 budget guard
- docs/m1_README.md: architecture, API reference, data flow, semantic
  conventions, acceptance criteria status
- requirements.txt: pinned dependencies for uv/pip environments
A. Live mode error handling:
 - A1: TracingLLM raises LLMCallError on HTTP errors/empty content instead of passing error strings as assistant content
 - A2: Notebook only prints [OK] when provider call actually succeeds with non-empty content
 - A3: gen_ai.provider.name correctly set to "openrouter" (not "openai") when using OpenRouter
 - A4: optimize_graph forces score=0 on invocation failure, bypassing eval_fn

B. TelemetrySession API correctness + redaction:
 - B5: flush_otlp(clear=False) properly peeks at spans without clearing the exporter
 - B6: span_attribute_filter now applied during flush_otlp; supports drop (return {}), redact, and truncate

C. TGJ/ingest correctness and optimizer safety:
 - C7: _deduplicate_param_nodes() strips numeric suffixes to collapse duplicate ParameterNodes
 - C8: _select_output_node() excludes child LLM spans, selects the true sink (synthesizer)

D. OTEL topology and temporal chaining:
 - D9: Root invocation span wraps graph.invoke(), producing a single trace ID per invocation
 - D10: Temporal chaining uses trace.temporal_ignore attribute instead of OTEL parent presence

E. optimize_graph semantics + trace-linked reward:
 - E11: best_parameters is a real snapshot captured at the best-scoring iteration
 - E12: eval.score attached to root invocation span before flush, linking reward to trace

F. Non-saturating scoring for Stub mode:
 - F13: StubLLM and eval_fn are structure-aware; stub optimization demonstrates score improvement

Files changed:
 - langgraph_otel_runtime.py: LLMCallError, _validate_content, flush_otlp(clear=)
 - telemetry_session.py: flush_otlp delegation, _apply_attribute_filter
 - otel_adapter.py: root span exclusion, trace.temporal_ignore chaining
 - instrumentation.py: _root_invocation_span context manager, root span on invoke/stream
 - optimization.py: _deduplicate_param_nodes, _select_output_node, _snapshot_parameters, eval-in-trace
 - __init__.py: export LLMCallError
 - test_optimization.py: updated for best_parameters field
 - 01_m1_instrument_and_optimize.ipynb: all fixes reflected in notebook
 - test_client_feedback_fixes.py: 20 new tests covering all 13 issues