feat: Quasi-mature OpenTelemetry integration#223
Conversation
|
@Helveg I tried to get opentelemetry logs from the bsb reconstruction on HPC with the following command: |
|
Can you try some smaller debug examples first?
If none of them work, I think I have a CINECA login, or can use yours but I keep forgetting how hahaha :D I can try to debug the code there then. |
|
Ok I have a small test. The following command produces a jsonlines` file on my local PC but not on login node of cineca: # bla.yaml does not exist
opentelemetry-instrument --traces_exporter jsonlines bsb compile bla.yaml -v4 Should we look into python libs differences? |
|
Ok it seems I was lacking some opentelemetry libraries despite the code running without throwing any errors. opentelemetry-api 1.40.0
opentelemetry-distro 0.61b0
opentelemetry-exporter-otlp 1.40.0
opentelemetry-exporter-otlp-proto-common 1.40.0
opentelemetry-exporter-otlp-proto-grpc 1.40.0
opentelemetry-exporter-otlp-proto-http 1.40.0
opentelemetry-instrumentation 0.61b0
opentelemetry-proto 1.40.0
opentelemetry-sdk 1.40.0
opentelemetry-semantic-conventions 0.61b0
`` |
|
let's just say all of them. it's hard to tell because the otel package ecosystem for python specifically is a MESS.it's multiple monorepos on github,that all contain names pace packages for multiple names paces 🥲 it's really hard to tell what comes from what package. but I'll fix that as part of the move to a |
…bsb packages lint and format pass
Saving trace.get_tracer_provider() returns the proxy provider when no real provider is set. Restoring that into _TRACER_PROVIDER on exit makes ProxyTracerProvider.get_tracer recurse into itself. Snapshot the raw global instead. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Covers tracer registry idempotence, the explicit-version path that skips importlib.metadata lookup, and basic span context-manager behaviour. Keeps the package's CI test target from being a no-op. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…xture - Skip the rank/bcast dance entirely when MPI size == 1; in serial mode it is observably equivalent to a plain start_as_current_span. - When broadcasting, treat an invalid local SpanContext as a hard error. Rank 0 still bcasts None first so non-root ranks unblock and raise the same misconfiguration error instead of deadlocking. - Drop the dead set_span_in_context call (its return value was discarded). - OTelFixture: ignore blank trailing lines so the file-exporter readback works on Windows, where text-mode write doubles \r\n. - Make tests use OTelFixture and add opentelemetry-sdk to the test deps so bsb-otel's tests can exercise the broadcast path under a real provider. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous hard-error stance broke OTel's "API works without SDK" contract. Now rank 0 broadcasts None when its local tracer produced an invalid context, and non-root ranks treat that signal as "no parent to share" and fall through to their own local start_as_current_span (also a no-op when there's no provider). No NonRecordingSpan(invalid_ctx) ever gets attached, so line 65's is_valid check stays honest. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Locks in the OTel "API works without SDK" contract for BsbTracer: with no provider configured, trace() yields a non-recording span, and under MPI the broadcast path stays deadlock-free. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The broadcast-on-no-parent design assumed every rank entered each trace() in lockstep. Rank-divergent callers (e.g. bsb-hdf5's per-rank file ops, serialized via MPILock rather than collectives) deadlock under that contract. Introduce an MPI-communicator contextvar that BsbTracer broadcasts on: - use_communicator(comm) overrides the contextual broadcast comm. - local_tracing() is the COMM_SELF shorthand; spans in the block stay per-rank (size==1 in the contextual comm → no broadcast). A pre-existing broadcast parent set up above the block is still inherited, so cross-rank correlation is preserved through that parent. - Without a configured SDK provider, skip the broadcast branch entirely — there's nothing meaningful to share, and forcing a collective on every trace() call would lock up rank-divergent code. This restores OTel's "API works without SDK" contract. mpi.rank/mpi.size span attributes still report the global communicator's rank/size; only bsb-otel's internal bcast follows the contextual comm. Includes an env-gated (BSB_OTEL_TRACE=1) _btrace diagnostic helper used to track down this very class of deadlock. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Every bsb-hdf5 trace call site (the @handles_handles decorator and the Resource read/write methods, plus _SpannedHandle.__enter__) wraps a per-rank h5py file operation. Cross-rank serialization is provided by MPILock, not by MPI collectives, so different ranks make different sequences of trace() calls. Letting bsb-otel attempt its broadcast on those spans deadlocks rank 1 on bcast as soon as rank 0 takes a divergent path (the test_pre_oob reproducer). Wrap the package's tracer once in _telemetry.py so every existing call site keeps its current shape. Includes an env-gated (BSB_OTEL_TRACE=1) _shtrace diagnostic helper. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tel intersphinx Two related docs failures on PR #223: - bsb-otel/project.json:docs had a corrupted sphinx-build command (`sphinx-build nW -b html . _build/html`) where `nW` was being parsed as the source directory, causing `ApplicationError: Cannot find source directory (.../docs/nW)`. Restore the canonical invocation used by every other package (`-nW` plus the cross-env BSB_LOCAL_INTERSPHINX_ONLY wrapper). - bsb's diagnostics doc references :meth:\`~bsb_otel.BsbTracer.trace\`, but bsb's docs target did not depend on bsb-otel:iso-docs, so the intersphinx inventory was missing and -W elevated the unresolved reference to a build error. Add bsb-otel:iso-docs to bsb's docs dependsOn list.
- Add docs/genindex.rst and docs/py-modindex.rst stubs (already present in every other monorepo package). The toctree in index.rst lists them, so without the files Sphinx -W errors with `nonexisting document 'py-modindex'`. - bsb_otel.use_communicator's docstring used :data:\`bsb.services.MPI\`, but the bsb-core docs do not document that module-level attribute, so intersphinx cannot resolve it and -W elevates the warning to an error. Switch to a literal so the cross-ref is dropped.
The class is already @skip_parallel, but combining that class-level skip with the @unittest.expectedFailure on the method left the test running under mpiexec — its os._exit(1) subprocess then takes the whole MPI job down with it. - Add @skip_parallel directly to the method so the skip is unambiguous. - Sanitize the env passed to the subprocess in _run_trace_subprocess by stripping OMPI_/PMI_/PMIX_/SLURM_ launcher vars, so a future child that inherits these can't accidentally attach to the parent rank's MPI runtime and turn an os._exit(1) into an MPI-wide abort.
Describe the work done
WIP branch for @drodarie to experiment with and see if it helps tracing MPI errors. VERY MUCH A DRAFT.
Still, might help already. First order of business is to make JSONLines replay possible, then make running spans immortal, then refactor everything into its own
bsb-otelpackage.@drodarie you can already work a bit in parallel to me and try to obtain the JSONLines logs the following way:
📚 Documentation preview 📚: https://bsb-core--223.org.readthedocs.build/en/223/
📚 Documentation preview 📚: https://bsb-hdf5--223.org.readthedocs.build/en/223/
📚 Documentation preview 📚: https://bsb--223.org.readthedocs.build/en/223/
📚 Documentation preview 📚: https://bsb-otel--223.org.readthedocs.build/en/223/
📚 Documentation preview 📚: https://nrn-patch--223.org.readthedocs.build/en/223/