Draft
Commits (21)
- 192949c  checkpoint of WIP JSON OTEL demo (doxav, Oct 3, 2025)
- 2f1794b  working OTEL/LANGGRAPH demo (doxav, Oct 5, 2025)
- bc0b304  converted demo JSON/OpenTelemetry to LangGraph (doxav, Oct 5, 2025)
- e81ad34  checkpoint (doxav, Oct 6, 2025)
- a71e1ed  OTEL/JSON/LANGGRAPH demo: add a mechanism to ensure multiple optimiza… (doxav, Oct 6, 2025)
- 53871aa  ADDED batchify for handling the multiple feedback in a batch + ADDED … (doxav, Nov 6, 2025)
- 87d3c67  working code optimization - TODO: clean, simplify the code (doxav, Nov 7, 2025)
- da80055  fixed code optimization (doxav, Nov 20, 2025)
- d88a779  ADD synthtizer prompt in optim score > High score (doxav, Nov 20, 2025)
- d03fec5  TEST removing span/OTEL from optimized code (doxav, Nov 20, 2025)
- 1692a89  fixed and updated LangGraph/Otel demo README (doxav, Nov 21, 2025)
- 1c75117  restore (doxav, Nov 25, 2025)
- 779db55  ADD demo and tests for native LangGraph integration with OTEL tracing (doxav, Dec 11, 2025)
- 23a377c  ADD refactor run_graph_with_otel to support custom evaluation functio… (doxav, Dec 12, 2025)
- d19ba70  ADD implement run_benchmark function to compare different feedback mode (doxav, Dec 12, 2025)
- 22d1064  ADD M0 technical plan, architecture docs, and prototype API validation (JZOMVI, Feb 6, 2026)
- 30a89c8  Fix Colab badge URL: replace placeholders with actual repo/branch path (JZOMVI, Feb 6, 2026)
- 6760695  ADD Colab Secrets key retrieval and Google Drive auto-save to notebook (JZOMVI, Feb 6, 2026)
- c85baf8  Update T1 tech plan: notebooks + acceptance alignment + fixed opto/tr… (doxav, Feb 8, 2026)
- c8dfc04  ADD M1 core: instrument_graph + optimize_graph + E2E pipeline + notebook (JZOMVI, Feb 12, 2026)
- 2f0d53d  FIX: address all feedback issues (A1-F13) (JZOMVI, Feb 12, 2026)
8 changes: 8 additions & 0 deletions .env.example
@@ -0,0 +1,8 @@
# OpenRouter Configuration
# Copy this file to .env and fill in your values
# Get your API key from: https://openrouter.ai/keys

OPENROUTER_API_KEY=sk-or-v1-your-key-here
OPENROUTER_MODEL=meta-llama/llama-3.1-8b-instruct:free
OPENROUTER_BASE_URL=https://openrouter.ai/api/v1
USE_STUB_LLM=false
4 changes: 3 additions & 1 deletion .gitignore
@@ -168,4 +168,6 @@ OAI_CONFIG_LIST
*.gv.pdf

# jupyter book API output
docs/api/*
docs/api/*

uv.lock
238 changes: 238 additions & 0 deletions docs/OTEL_Graph_Optim_Draft_Feedback_analysis.md
@@ -0,0 +1,238 @@
## 1) What “good M0” means for this job (non-negotiable deliverable shape)

Milestone 0 is not “some code that runs”. It’s a **design contract** that makes M1–M3 mechanical and reviewable:

### M0 must include (minimum)

1. **Boilerplate inventory** (from the existing demo): list the exact blocks to eliminate and where they move (runtime init, exporter setup, node spans, OTLP flush, OTLP→TGJ conversion, diff dumps, optimizer loop, result summaries).
2. **Public API signatures** (exact function/class signatures; see the sketch after this list) for:

* `instrument_graph(...)`
* LLM/tool wrappers (auto span emission)
* `optimize_langgraph(...)` or `LangGraphOptimizer.run(...)`
* `TelemetrySession` / `UnifiedTelemetry` (OTEL + MLflow)
3. **A genericity statement**: “works for any LangGraph graph”, and what “any” means (sync/async nodes? streaming? retries? tools? subgraphs?).
4. **A telemetry coverage plan**: how spans/metrics/artifacts flow across **nodes + LLM + tools + optimizers + trainers** into OTEL and into MLflow.
5. **A deterministic testing plan** (StubLLM mode), including what is asserted in pytest.
6. **A notebook plan** for M1/M2/M3: minimal code path, no secrets committed, “Open in Colab” badge, persistent artifacts.
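
As a concreteness bar for item 2, a signature sketch at roughly this level of detail is what M0 should contain. Everything below is an assumption to be confirmed or replaced, not an existing interface; only the names come from the list above.

```python
# Signature sketch only. Names come from the list above; every parameter and default
# here is an assumption that M0 must confirm or replace.
from dataclasses import dataclass
from typing import Any, Callable, Iterable, List, Optional


@dataclass
class TelemetrySession:
    """Owns OTEL + MLflow setup/teardown for one run (assumed shape)."""
    project: str
    mode: str = "stub"                      # e.g. "stub" | "live"
    otlp_endpoint: Optional[str] = None
    mlflow_tracking_uri: Optional[str] = None

    def start(self) -> "TelemetrySession": ...
    def flush_otlp(self) -> List[dict]: ...
    def close(self) -> None: ...


def instrument_graph(graph: Any, session: TelemetrySession, *,
                     optimizable: Optional[Any] = None,
                     capture_state: bool = True) -> Any:
    """Return a graph whose node/LLM/tool calls emit spans; the input graph is untouched."""
    ...


def optimize_langgraph(graph: Any, dataset: Iterable[Any], *,
                       optimizer: str = "OptoPrimeV2",
                       session: TelemetrySession,
                       eval_fn: Optional[Callable[[Any], float]] = None) -> Any:
    """Run the trace -> TGJ -> optimizer -> apply-update loop and return a result summary."""
    ...
```

Stubs are enough at M0; the point is that parameter names, defaults, and return types are written down before M1 starts.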

---

## 2) Your key concern is correct: the optimization API must not be demo-specific

Your “planner / researcher / synthesizer / evaluator” graph is just a sample. The API needs to be framed around **LangGraph as a graph runtime**, not around that single graph’s roles.

The M0 doc must explicitly answer:

### What is the abstraction boundary?

There are really only two robust patterns (he should pick one and justify the choice):

#### Approach A — Node wrapper / decorator instrumentation (usually most reliable)

* Wrap each node callable with `@trace_node(...)` or `trace_node(fn, ...)`.
* Pros: works even if nodes aren’t LangChain “runnables”; consistent spans.
* Cons: requires touching node registration; but can still be “minimal change”.
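
A minimal sketch of Approach A, assuming only the OpenTelemetry Python API; `trace_node`, `graph.id`, and `node.name` are illustrative placeholders, not an existing Trace or LangGraph API:

```python
# Minimal sketch of Approach A, assuming only the OpenTelemetry Python API.
# `trace_node` and the attribute names are illustrative, not an existing API.
import functools

from opentelemetry import trace

tracer = trace.get_tracer("langgraph.instrumentation")


def trace_node(fn, *, node_name=None, graph_id="demo-graph"):
    """Wrap a LangGraph node callable (state -> state update) in an OTEL span."""
    name = node_name or fn.__name__

    @functools.wraps(fn)
    def wrapped(state, *args, **kwargs):
        with tracer.start_as_current_span(f"node.{name}") as span:
            span.set_attribute("graph.id", graph_id)
            span.set_attribute("node.name", name)
            try:
                return fn(state, *args, **kwargs)
            except Exception as exc:
                span.record_exception(exc)
                span.set_status(trace.StatusCode.ERROR)
                raise

    return wrapped


# Registration stays one line per node, e.g.:
# builder.add_node("planner", trace_node(planner_fn, node_name="planner"))
```

The same wrapper can be applied automatically by `instrument_graph(...)` iterating over the graph's registered nodes, which is what keeps the developer-facing change minimal.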

#### Approach B — Callback-based instrumentation (lowest code change, but not always complete)

LangChain / LangGraph expose a callback system intended for monitoring/logging. In LangChain docs, callbacks are explicitly positioned for observability side effects. ([reference.langchain.com][1])

* Pros: can be “one-liner” when supported (pass a callback handler to the compiled graph).
* Cons: many graphs won’t emit enough callback events unless nodes are implemented as LangChain components; and mixing callbacks with streaming has known foot-guns in practice.

**M0 must pick A or B (or hybrid):**

* Hybrid is common: callbacks for LLM/tool calls; node wrappers for node spans.
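
A hedged sketch of the hybrid: node wrappers as above for node spans, plus a LangChain callback handler for LLM calls. It assumes `langchain_core`'s callback hooks; attribute names are illustrative only, and chat models fire `on_chat_model_start` rather than `on_llm_start` (omitted here for brevity):

```python
# Hybrid sketch: keep node wrappers for node spans, add a LangChain callback handler
# for LLM spans. Assumes langchain_core's callback hooks; attribute names are
# illustrative. Chat models fire on_chat_model_start instead of on_llm_start.
from langchain_core.callbacks import BaseCallbackHandler
from opentelemetry import trace

tracer = trace.get_tracer("langgraph.llm")


class OtelLLMCallback(BaseCallbackHandler):
    """One OTEL span per LLM call, keyed by LangChain's run_id."""

    def __init__(self):
        self._spans = {}

    def on_llm_start(self, serialized, prompts, *, run_id, **kwargs):
        span = tracer.start_span("llm.call")
        span.set_attribute("inputs.prompt.preview", prompts[0][:200])
        self._spans[run_id] = span

    def on_llm_end(self, response, *, run_id, **kwargs):
        span = self._spans.pop(run_id, None)
        if span is not None:
            span.set_attribute("llm.generations", len(response.generations))
            span.end()

    def on_llm_error(self, error, *, run_id, **kwargs):
        span = self._spans.pop(run_id, None)
        if span is not None:
            span.record_exception(error)
            span.end()


# Passed per invocation, no node changes needed when nodes use LangChain LLMs:
# compiled_graph.invoke(inputs, config={"callbacks": [OtelLLMCallback()]})
```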

---

## 3) Boilerplate reduction must be shown as a “before/after” (table + diff)

You’re right to demand a “code before vs after” view. This is the *developer adoption* metric. Agent Lightning’s positioning (“almost zero code changes”) is exactly the framing you want to compete with. ([GitHub][2])

Below is a **ChatGPT-generated example** table he can paste into README (replace names with your actual APIs). This is not a claim about your repo; it’s a template.

### Example “Before vs After” table (template)

| Aspect | Before (manual demo) | After (proposed API) |
| -------------------------- | ---------------------------------------------------------- | ------------------------------------------------------- |
| OTEL init/exporter | manual tracer/provider/exporter wiring in every script | `session = TelemetrySession(...); session.start()` |
| Node spans | `with tracer.start_as_current_span("node"):` everywhere | `instrument_graph(graph, session, ...)` |
| LLM spans + prompt capture | manually `set_attribute("inputs.gen_ai.prompt", ...)` etc. | `llm = TracingLLM(base_llm, session)` (auto `gen_ai.*`) |
| OTLP flush | manual exporter flush | `session.flush_otlp()` |
| OTLP→TGJ | manual conversion calls | `optimize_langgraph(..., session=session)` |
| Apply updates | custom patching | `PatchApplier.apply(update, targets=...)` |
| Artifacts | ad-hoc json dumps | `RunArtifacts.write_run(...)` standard layout |

### Example unified diff snippet (template)

```diff
- tracer, exporter = init_otel_exporter(...)
- graph = build_graph(llm)
- for x in dataset:
- with tracer.start_as_current_span("planner") as sp:
- sp.set_attribute("inputs.gen_ai.prompt", prompt)
- out = llm(prompt)
- otlp = flush(exporter)
- tgj = otlp_to_tgj(otlp)
- upd = optimizer.step(tgj, scores)
- apply_updates(graph, upd)
+ session = TelemetrySession(project="langgraph-demo", mode="stub")
+ llm = TracingLLM(base_llm, session=session)
+ graph = build_graph(llm)
+ graph = instrument_graph(graph, session=session, optimizable=Optimizable(nodes="*"))
+ result = optimize_langgraph(graph, dataset, optimizer="OptoPrimeV2", session=session)
```

If his M0 doesn’t include something like this, he’s not meeting the “boilerplate reduction is top success metric” requirement.

---

## 4) The API surface must be specified as a matrix of optimization “cases”

You requested a table of “all the API in different cases of optimization” (prompts vs code vs params, selection, observability tuning). This is exactly what you need to force now, because otherwise he’ll implement only what the demo uses.

Here is a concrete matrix he should include in M0.

### API matrix (what must exist / be planned)

| Use case | What is optimizable? | How dev selects targets | Required API | What is persisted |
| -------------------------- | ---------------------- | ------------------------------------------------- | --------------------------------------------------- | ----------------------------------------------- |
| Trace-only instrumentation | nothing | n/a | `instrument_graph(...)` | OTLP traces + minimal run metadata |
| Prompt optimization | prompt templates | `nodes=[...]` or `tags=[...]` or `selector=regex` | `TrainablePrompt("key")`, `optimize_langgraph(...)` | OTLP + TGJ + prompt patch/diff + summary |
| Code optimization | node code blocks | `code_nodes=[...]` | `TrainableCode(fn)` + patch applier | OTLP + TGJ + code patch + before/after snapshot |
| Hyperparam optimization | graph/node params | `param_keys=[...]` | `TrainableParam("k")` | param update log + config snapshot |
| Partial graph optimization | subset only | `selector` (node names/tags) | `Optimizable(selector=...)` | includes “skipped nodes” rationale |
| Observability “lite” | minimal spans | `capture_state=False` | `InstrumentOptions(capture=...)` | small artifacts, safe defaults |
| Observability “debug” | state I/O + truncation | `state_keys=[...]` | `CapturePolicy(truncate=..., redact=...)` | large artifacts, deterministic truncation |

This should be in his M0 doc. If it isn’t, ask him to add it.
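
To make the "Required API" column concrete, here is a placeholder sketch of the selector objects and call-site shapes the matrix implies. Every class, field, and kwarg below is an assumption for M0 to pin down, not an implemented interface:

```python
# Placeholder sketch: these classes exist only to make the "Required API" column
# concrete. Every name, field, and default is an assumption for M0 to pin down.
from dataclasses import dataclass, field
from typing import List, Optional, Union


@dataclass
class Optimizable:
    nodes: Union[List[str], str] = "*"        # node names, tags, or "*" for all
    code_nodes: List[str] = field(default_factory=list)
    param_keys: List[str] = field(default_factory=list)
    selector: Optional[str] = None            # regex over node names/tags


@dataclass
class CapturePolicy:
    state_keys: List[str] = field(default_factory=list)
    truncate: int = 2000                      # max chars captured per value
    redact: List[str] = field(default_factory=list)


# Call-site shapes the matrix implies (same placeholder status):
# instrument_graph(graph, session)                                                      # trace-only
# instrument_graph(graph, session, optimizable=Optimizable(nodes=["planner"]))          # prompts
# instrument_graph(graph, session, optimizable=Optimizable(code_nodes=["researcher"]))  # code
# instrument_graph(graph, session, optimizable=Optimizable(param_keys=["temperature"])) # params
# instrument_graph(graph, session, optimizable=Optimizable(selector=r"agent_.*"))       # partial graph
# instrument_graph(graph, session, capture_state=False)                                 # observability "lite"
# instrument_graph(graph, session, capture=CapturePolicy(state_keys=["answer"]))        # observability "debug"
```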

---

## 5) OTEL semantics: define what attributes/spans you emit, and why

This job is explicitly OTEL-first. He should anchor the design to the emerging OpenTelemetry GenAI semantic conventions (even if you store some data as artifacts for size). OpenTelemetry defines GenAI spans and related conventions (status is still evolving, but it’s the right direction). ([OpenTelemetry][3])

### What to insist on in M0

* **Node span contract** (what attributes are always present):

* `graph.id`, `node.name`, `node.type`
* `param.*` (Trace optimization keys)
* `inputs.*` / `outputs.*` (with truncation rules)
* error fields (exception, status)
* **LLM span contract**:

* a dedicated child “LLM call” span is the cleanest separation
* populate `gen_ai.*` keys per OpenTelemetry conventions where feasible ([OpenTelemetry][3])
* put full prompt/response in **artifacts**, not span attributes, if size is large (and store only hashes/short previews in attributes)
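
A sketch of both contracts using the plain OpenTelemetry API. The `gen_ai.*` key follows the still-evolving GenAI semantic conventions, so exact names should be re-checked against the current semconv before the contract is frozen; the preview/hash split implements the "artifacts, not attributes" rule above:

```python
# Sketch of the node + LLM span contract with the plain OpenTelemetry API.
# The gen_ai.* key follows the evolving GenAI semconv; everything else is illustrative.
import hashlib

from opentelemetry import trace

tracer = trace.get_tracer("langgraph.spans")


def run_node_with_contract(node_name, prompt, call_llm, graph_id="demo-graph"):
    with tracer.start_as_current_span(f"node.{node_name}") as node_span:
        node_span.set_attribute("graph.id", graph_id)
        node_span.set_attribute("node.name", node_name)
        node_span.set_attribute("node.type", "llm_node")
        # Dedicated child span for the LLM call, per the contract above.
        with tracer.start_as_current_span("llm.call") as llm_span:
            llm_span.set_attribute("gen_ai.operation.name", "chat")
            # Large payloads go to artifacts; spans keep only a preview and a hash.
            llm_span.set_attribute("inputs.prompt.preview", prompt[:200])
            llm_span.set_attribute(
                "inputs.prompt.sha256", hashlib.sha256(prompt.encode()).hexdigest()
            )
            output = call_llm(prompt)
            llm_span.set_attribute("outputs.completion.preview", str(output)[:200])
        return output
```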

### Agent Lightning compatibility (optional but should be planned cleanly)

If you keep the optional “Agent Lightning semconv compatibility”, his plan must reflect the actual documented conventions:

* Rewards are dedicated spans named `agentlightning.annotation` ([microsoft.github.io][4])
* Reward keys use the `agentlightning.reward` prefix; example `agentlightning.reward.0.value` ([microsoft.github.io][5])
* `emit_reward`/`emit_annotation` exist as the conceptual model (even if you won’t depend on the library) ([microsoft.github.io][6])

So in M0 he should decide:

* Do we emit those spans/attrs **always**, or behind a flag?
* If we emit child spans, how do we ensure TGJ conversion doesn’t break ordering? (Your “temporal_ignore” idea is sensible; if he adopts it, it must be stated explicitly in the M0 design.)
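
For reference, emitting a compatible reward annotation is small; the open M0 question is only whether it runs always or behind a flag. The span name and reward attribute below are taken from the Agent Lightning conventions cited above, while the helper itself is hypothetical:

```python
# Hypothetical helper. Span and attribute names come from the Agent Lightning
# conventions cited above; the flag makes "always vs opt-in" an explicit decision.
from opentelemetry import trace

tracer = trace.get_tracer("agentlightning.compat")


def emit_reward_span(value: float, enabled: bool = False) -> None:
    """Emit an 'agentlightning.annotation' span carrying a single reward value."""
    if not enabled:
        return
    with tracer.start_as_current_span("agentlightning.annotation") as span:
        span.set_attribute("agentlightning.reward.0.value", value)
```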

---

## 6) Telemetry unification: he must show a plan for trainers + optimizers + nodes

Your note is correct: if his work plan doesn’t explicitly cover “how telemetry is initiated and wired across all components,” he will miss M2.

### What to demand in M0: a concrete telemetry table

Below is the table you asked for (a template; he should fill in the exact modules).

| Component | Today | Target telemetry hook | OTEL output | MLflow output |
| ---------------------------------- | ------------ | ---------------------------------------------------- | -------------------------------------------- | ------------------------------------------------- |
| LangGraph node execution | ad-hoc spans | `instrument_graph()` wraps nodes OR callback handler | spans per node | link run_id + store summary as artifact |
| LLM calls inside nodes | manual attrs | `TracingLLM` wrapper (child spans) | `gen_ai.*` spans/events ([OpenTelemetry][3]) | log token/cost metrics; save prompts as artifacts |
| Tool calls | inconsistent | `TracingTool` wrapper | span per tool call | metrics + tool error artifacts |
| Optimizer logs (e.g., summary_log) | in-memory | `TelemetrySession.log_event/artifact` adapter | events or span events | artifacts (jsonl), aggregate metrics |
| Trainer metrics via BaseLogger | fragmented | `BaseLogger → UnifiedTelemetry` adapter | metrics (optional) | `mlflow.log_metric` series |
| Run metadata | scattered | `TelemetrySession(run_id, iteration_id, step)` | resource attrs | params/tags + run dir artifact |

**MLflow thread-safety must be addressed explicitly**: MLflow’s fluent API is not thread-safe; concurrent callers must use mutual exclusion, or use the lower-level client API. ([MLflow][7])
So M0 must state one of:

* “single-thread logging only (v1)” **or**
* “we use an internal lock for mlflow logging calls” **or**
* “we route all MLflow logging through `MlflowClient` in a single worker thread”
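
A sketch combining the last two options: every MLflow write goes through `MlflowClient`, serialized behind one lock. The wrapper class is an assumption; only the `MlflowClient` calls themselves are standard API:

```python
# Sketch: all MLflow writes routed through MlflowClient and serialized behind a lock.
# The SafeMlflowLogger wrapper is an assumption; the MlflowClient calls are real API.
import threading

from mlflow.tracking import MlflowClient


class SafeMlflowLogger:
    def __init__(self, tracking_uri=None, experiment_name="langgraph-optim"):
        self._client = MlflowClient(tracking_uri=tracking_uri)
        self._lock = threading.Lock()
        exp = self._client.get_experiment_by_name(experiment_name)
        exp_id = exp.experiment_id if exp else self._client.create_experiment(experiment_name)
        self._run_id = self._client.create_run(exp_id).info.run_id

    def log_metric(self, key, value, step=0):
        with self._lock:
            self._client.log_metric(self._run_id, key, value, step=step)

    def log_artifact(self, local_path):
        with self._lock:
            self._client.log_artifact(self._run_id, local_path)

    def close(self):
        with self._lock:
            self._client.set_terminated(self._run_id)
```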

### Also: don’t over-assume MLflow auto-tracing will cover LangGraph

There are known gaps/issues around tracing LangGraph top-level calls with some autologging approaches. ([GitHub][8])
So his plan should not hinge on “just turn on mlflow autolog and it traces the graph”.

---

## 7) Tests: what M0 must commit to (StubLLM + deterministic assertions)

He must specify exactly what tests will exist, not just “we’ll add tests”.

Minimum pytest plan:

1. **Unit**: `instrument_graph` produces spans with required attributes (see the pytest sketch after this list) for:

* normal node completion
* node exceptions (status)
* truncation/redaction rules
2. **Unit**: wrapper LLM emits `gen_ai.*` keys (and doesn’t crash on non-JSONable attrs) ([OpenTelemetry][3])
3. **Integration (StubLLM)**: full loop:

* run graph on 2–3 inputs
* flush OTLP
* convert OTLP→TGJ
* optimizer produces an update (even if toy)
* apply update
* rerun shows changed prompt/code snapshot
4. **Integration (MLflow local file store)**:

* start run
* log a metric + artifact
* verify artifact exists in store
* ensure no API keys are required
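
For item 1, the assertion style should look roughly like this: run a wrapped node against an in-memory OTEL exporter and check the attribute contract. `trace_node` here is the hypothetical wrapper sketched in section 2, not an existing API:

```python
# Sketch of the assertion style for test 1: in-memory OTEL exporter + attribute checks.
# `trace_node` is the hypothetical wrapper sketched in section 2.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor
from opentelemetry.sdk.trace.export.in_memory_span_exporter import InMemorySpanExporter

# from your_package.instrumentation import trace_node   # hypothetical module path

exporter = InMemorySpanExporter()
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(exporter))
trace.set_tracer_provider(provider)


def test_node_span_has_required_attributes():
    wrapped = trace_node(lambda state: {"answer": "ok"}, node_name="planner")
    wrapped({"question": "q"})

    spans = exporter.get_finished_spans()
    assert len(spans) == 1
    attrs = spans[0].attributes
    assert attrs["node.name"] == "planner"
    assert "graph.id" in attrs
```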

---

## 8) Notebook notes (add these at the end of your feedback, per your request)

Even without seeing his notebook, the acceptance requirements are clear:

* It’s good that he sent an already-executed notebook (so you can inspect outputs). Keep doing that.
* Once it’s on GitHub, the notebook must:

1. Include an **“Open in Colab” badge** at the top.
2. Use **Colab Secrets** / environment injection for API keys (avoid passing keys as parameters).
3. Auto-save run artifacts to **Google Drive** (or a stable persistent path) to avoid losing long results on runtime reset.
4. Print the **artifact folder path** at the end (so reviewers can find outputs quickly).
5. Provide a clear **StubLLM path** that always runs in <5–10 minutes.

(You can reuse the same Drive helper pattern you used in the Trace‑Bench feedback.)
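
A sketch of those Colab conventions (Secrets retrieval, Drive auto-save, printed artifact path); it only runs inside Colab, so the imports are guarded, and the secret name and Drive folder are placeholders:

```python
# Sketch of the Colab conventions above; runs inside Colab only, so imports are guarded.
# The secret name and Drive folder are placeholders.
import os
from pathlib import Path

try:
    from google.colab import drive, userdata      # only available inside Colab
    os.environ["OPENROUTER_API_KEY"] = userdata.get("OPENROUTER_API_KEY")  # Colab Secrets
    drive.mount("/content/drive")
    artifact_dir = Path("/content/drive/MyDrive/langgraph_optim_runs")
except ImportError:
    artifact_dir = Path("./runs")                 # local / StubLLM fallback

artifact_dir.mkdir(parents=True, exist_ok=True)
print(f"Artifacts will be saved to: {artifact_dir}")
```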


---

## Bottom line

For tomorrow’s meeting, you want to be able to decide in 5–10 minutes whether his M0 is “approval-worthy”. The gating signal is: **does the doc make M1 implementation obvious and generic, with the before/after diff, API matrix, telemetry matrix, and an explicit tests/notebooks plan?**

If you paste or upload his actual M0 README + notebook here later, I can add file-specific comments (naming, module layout, missing knobs, security issues, etc.).

[1]: https://reference.langchain.com/python/langchain_core/callbacks/?utm_source=chatgpt.com "Callbacks | LangChain Reference"
[2]: https://github.com/microsoft/agent-lightning "GitHub - microsoft/agent-lightning: The absolute trainer to light up AI agents."
[3]: https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-spans/?utm_source=chatgpt.com "Semantic conventions for generative client AI spans"
[4]: https://microsoft.github.io/agent-lightning/latest/tutorials/traces/?utm_source=chatgpt.com "Work with Traces - Agent-lightning"
[5]: https://microsoft.github.io/agent-lightning/stable/reference/semconv/?utm_source=chatgpt.com "Semantic Conventions - Agent-lightning"
[6]: https://microsoft.github.io/agent-lightning/latest/reference/agent/?utm_source=chatgpt.com "Agent-lightning"
[7]: https://mlflow.org/docs/latest/python_api/mlflow.html?utm_source=chatgpt.com "module provides a high-level “fluent” API for starting and ..."
[8]: https://github.com/mlflow/mlflow/issues/12798?utm_source=chatgpt.com "[FR] Tracing for Langchain's Runnable.astream_events ..."