feat(observability): add OpenTelemetry building blocks (OtelMetrics / OtelLogger / OtelTracer) #50

Draft
gcacace wants to merge 6 commits into aws-devtools-labs:main from gcacace:feat/otel-observability-blocks
Conversation

@gcacace gcacace commented Jun 19, 2026

Problem

The observability blocks (Metrics, Logger, Tracer) are AWS-native and minimal: CloudWatch EMF, stdout JSON, and the X-Ray SDK. They give up OpenTelemetry's vendor-neutral, richly typed telemetry model (typed metric instruments, span links/events, context propagation, semantic-convention resource attributes) and the ability to send telemetry to any OTLP backend. There was no first-class way to run OpenTelemetry on a Blocks app.

Issue #, if available: N/A

Changes

Adds a new OpenTelemetry observability block family that coexists with the existing blocks (and is recommended for new apps). The blocks initialize an in-process OTel SDK that exports OTLP/HTTP to a standalone opentelemetry-lambda collector layer, which signs requests with SigV4 and forwards them to CloudWatch's native OTLP endpoints (or to any OTLP backend via an override). The shared handler force-flushes before returning, and the collector's decouple processor holds the invocation open until export completes.
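
As a rough illustration of the force-flush step, the sketch below wraps a handler so that a flush always runs before the result is returned, on both the success and the error path. The names `withOtelFlush`, `Handler`, and `Flush` are hypothetical; the real shared handler in @aws-blocks/core and the actual `flushOtel()` signature are not shown in this PR description, so treat this as a minimal sketch of the pattern rather than the implementation.

```typescript
// Hypothetical types: stand-ins for the shared handler and flushOtel(),
// whose real signatures are not part of this description.
type Handler<E, R> = (event: E) => Promise<R>;
type Flush = () => Promise<void>;

// Wrap a handler so telemetry is flushed before the invocation returns.
function withOtelFlush<E, R>(handler: Handler<E, R>, flush: Flush): Handler<E, R> {
  return async (event: E): Promise<R> => {
    try {
      return await handler(event);
    } finally {
      // Runs after the handler settles (success or error) and before the
      // caller sees the outcome, so pending exports are pushed out while the
      // sandbox is still running.
      try {
        await flush();
      } catch (err) {
        // Assumption: a telemetry failure should not fail the invocation.
        console.warn("telemetry flush failed", err);
      }
    }
  };
}
```

Swallowing flush errors is a design choice made here for illustration; a real implementation might surface them differently.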

New packages:

  • @aws-blocks/otel-common — support library: in-process SDK bootstrap + per-invocation flushOtel(), the collector-config renderer, the idempotent CDK shared-infra helper, and zero-arg provider accessors (getOtelMeterProvider / Tracer / Logger) for OTel-compatible libraries.
  • @aws-blocks/bb-otel-metrics (OtelMetrics) — emit() + typed OTel instruments (counter/histogram/upDownCounter/observableGauge) + rawMeter.
  • @aws-blocks/bb-otel-logger (OtelLogger) — debug/info/warn/error + child → OTel LogRecords; rawLogger.
  • @aws-blocks/bb-otel-tracer (OtelTracer) — startSegment → active span (kind/links/events, W3C inject/extract); rawTracer.

Supporting changes:

  • OTel-correct API & semantics — service identity uses semconv resource attributes (serviceName/serviceNamespace/serviceVersion), not a CloudWatch "namespace"; service.name defaults to the Lambda function name → BLOCKS_STACK_NAME → block fullId. AWS Lambda resource attributes (faas.*, cloud.*, aws.log.group.names) are auto-detected in-process (the collector layer omits resourcedetection — see OTel contrib #17584). Metric units are UCUM; no emitBatch/MetricDatum (OTel batches at export).
  • @aws-blocks/bb-dashboard — renders PromQL chart widgets for OtelMetrics (OTLP metrics are PromQL-queryable, with no namespace); classic Metrics dashboards are unaffected.
  • @aws-blocks/core — flushes in-process OTel telemetry after each invocation (no-op unless an OTel block is in use).
  • Umbrella exports, vendorize map, root README catalog, and docs reframed to recommend the OTel blocks over the AWS-native ones.

⚠️ Notable: the collector layer is region-pinned (default opentelemetry-collector-amd64-0_15_0, account 184161586896); traces require CloudWatch Transaction Search enabled in the account/region; the OTLP logs endpoint requires a pre-existing log group + stream (the CDK construct creates a dedicated /aws/otel/<fullId>).

Validation

  • Unit tests added across all packages: otel-common (34 — collector-config rendering, SDK/resource building + Lambda detection, idempotent CDK infra, provider accessors), bb-otel-metrics (14), bb-otel-logger (9), bb-otel-tracer (9), plus bb-dashboard PromQL widget tests. Umbrella conditional-exports (23) and vendorize-map (24) gates pass; full build:packages is clean.
  • Deployed AWS validation (throwaway resources in a sandbox account, since torn down): confirmed end-to-end that the architecture works — all three signals land in CloudWatch with zero export errors, and verified via the CloudWatch PromQL API that metrics carry the expected @resource.faas.* / @resource.service.* labels. This is also how the architecture pivot was discovered (the ADOT layer drops async exports on the sandbox freeze; the standalone collector layer + decouple + force-flush works).
  • Local dev: the mock runtime exercises the same OTel SDK with console exporters (metrics/logs → stdout) and a file span exporter (traces → .bb-data/<fullId>/traces.json).

Checklist

  • PR description included
  • Tests are changed or added
  • Relevant documentation is changed or added (and PR referenced)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

gcacace added 6 commits June 19, 2026 15:11
…Traces)

Add a vendor-neutral OpenTelemetry observability family that complements the
AWS-native Logger/Metrics/Tracer blocks and is recommended for new applications.

New packages:
- otel-common: in-process OTel SDK bootstrap + per-invocation flush, the
  collector-config renderer, and the idempotent CDK shared-infra helper.
- bb-otel-metrics (OtelMetrics): emit/emitBatch/child + typed Counter/Histogram/
  UpDownCounter/ObservableGauge + raw Meter escape hatch.
- bb-otel-logger (OtelLogger): debug/info/warn/error + child -> OTel LogRecords
  with safe attribute coercion; raw Logger escape hatch; dedicated log group.
- bb-otel-tracer (OtelTracer): startSegment -> active span, annotations/metadata/
  events, setHttpStatus, getTraceId, W3C inject/extract, raw Tracer escape hatch.

Architecture (validated by deploy spikes + a smoke test): blocks initialize an
in-process OTel SDK that exports OTLP/HTTP to a standalone opentelemetry-lambda
collector layer (decouple processor + per-service sigv4auth), which signs with
SigV4 and forwards to CloudWatch's native OTLP endpoints. The handler force-flushes
before returning so async exports survive the Lambda sandbox freeze. CloudWatch
ingests OTLP metrics as PromQL-queryable series (no namespace).

Supporting changes:
- bb-dashboard: render PromQL chart widgets for OtelMetrics refs (metricsKind:
  'otlp'); classic Metrics path unchanged.
- core: flush in-process OTel telemetry after each invocation (no-op unless an
  OTel block is in use).
- blocks umbrella: export the new blocks, vendorize entries, sdk-identifiers
  overload, tsconfig refs.
- docs: recommend the OTel blocks over the AWS-native ones across READMEs, the
  block catalog/decision tree, and the issue template.
…uto-enrich Lambda

Align the OTel metrics API with OpenTelemetry semantics instead of the EMF
`Metrics` block it was modelled on, and enrich all signals with AWS Lambda
resource attributes out of the box.

- Service identity now uses semconv resource attributes
  (serviceName/serviceNamespace/serviceVersion) on the SDK Resource — `namespace`
  is removed. `service.name` defaults to BLOCKS_STACK_NAME, then the block fullId.
- otel-common detects AWS Lambda resource attributes in-process via
  @opentelemetry/resource-detector-aws (awsLambdaDetector + envDetector). The
  collector's resourcedetection processor is NOT in the lambda layer (verified;
  see contrib#17584), so detection happens in the SDK. faas.*/cloud.* land as
  @resource.* PromQL labels — confirmed via the CloudWatch PromQL API.
- OtelMetrics: drop `emitBatch`/`MetricDatum` (OTel batches at export, not the API);
  keep `emit` + typed instruments + rawMeter. Units are UCUM. Drop BatchTooLarge.
- Add zero-arg provider accessors getOtelMeterProvider/TracerProvider/LoggerProvider
  to otel-common as the escape hatch for OTel-compatible libraries (plus the
  registered global providers).
- OtelLogger/OtelTracer gain serviceName/serviceNamespace/serviceVersion options.
- bb-dashboard: MetricsBBRef.namespace is optional; the metrics section renders for
  OTLP refs (no namespace) and routes to PromQL widgets.
- Regenerated API reports + READMEs/block docs; updated changeset.

The AWS Lambda detector sets faas.name, not service.name, and an unset
service.name becomes the `unknown_service:node` sentinel. We therefore still
always default it, but the default now resolves to the most specific real identity:
  explicit serviceName → AWS_LAMBDA_FUNCTION_NAME → BLOCKS_STACK_NAME → block fullId.

In local dev (mock runtime) the Lambda/stack env vars are absent, so it falls
through to the block's scope fullId (e.g. "my-app-metrics") — never unknown_service.
Resolution stays in otel-common's buildResource; no API/type changes. Adds unit
coverage for the function-name default and the local-dev fallthrough; READMEs +
changeset updated.
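
The resolution order described above can be sketched as a small pure function. `resolveServiceName` and `ServiceNameInputs` are hypothetical names, not the actual `buildResource` code in otel-common, and treating empty or whitespace-only values as unset is an assumption of this sketch.

```typescript
// Hypothetical input shape; the real block options are not shown here.
interface ServiceNameInputs {
  serviceName?: string; // explicit option passed to the block
  fullId: string;       // block scope fullId, e.g. "my-app-metrics"
}

// Pick the first usable candidate in the documented order:
// explicit serviceName -> AWS_LAMBDA_FUNCTION_NAME -> BLOCKS_STACK_NAME -> fullId.
function resolveServiceName(
  inputs: ServiceNameInputs,
  env: Record<string, string | undefined>,
): string {
  const candidates: Array<string | undefined> = [
    inputs.serviceName,
    env["AWS_LAMBDA_FUNCTION_NAME"],
    env["BLOCKS_STACK_NAME"],
  ];
  // Assumption: blank values count as unset.
  const found = candidates.find((c) => c !== undefined && c.trim() !== "");
  return found ?? inputs.fullId;
}
```

In local dev, where neither environment variable is set, a call such as `resolveServiceName({ fullId: "my-app-metrics" }, {})` falls through to the fullId.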
…eparated)

Metric names and attribute keys in the OTel blocks' examples/tests now follow
OpenTelemetry conventions (lowercase, dot-separated) instead of CloudWatch
PascalCase, which is CW-only:
- bb-dashboard README (OTel example): `orders.placed` + `faas.invoke_duration`
  (a FaaS-appropriate Lambda metric, replacing the server-oriented
  http.server.request.duration). The AWS-native `Metrics` example keeps
  CloudWatch PascalCase, which is correct there.
- bb-otel-tracer README + test: span name `fetch-user`/`fetch.user` and attribute
  `user.id` (was `userId`).
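
For illustration only, the renames above (`userId` to `user.id`, PascalCase to lowercase dot-separated) follow a mechanical pattern that a helper like the one below can approximate. `toOtelStyleName` is a hypothetical function, not part of these packages, and it cannot decide where a multi-word component such as `invoke_duration` should keep an underscore; that still needs a human choice guided by the semantic conventions.

```typescript
// Hypothetical helper: split camelCase / PascalCase word boundaries with dots
// and lowercase the result. Existing dots and underscores are left untouched.
function toOtelStyleName(name: string): string {
  return name.replace(/([a-z0-9])([A-Z])/g, "$1.$2").toLowerCase();
}

const renamed = ["userId", "OrdersPlaced", "faas.invoke_duration"].map(toOtelStyleName);
console.log(renamed);
```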

Add a table to the bb-otel-metrics README listing exactly what the in-process
awsLambdaDetector provides out of the box — cloud.provider/platform/region and
faas.name/version/max_memory/instance (+ aws.log.group.names) — with the source
Lambda env var for each, the SnapStart caveat (faas.instance / log group names
omitted), and that it needs no IAM and is a no-op off Lambda. Link the OTel cloud
+ FaaS resource conventions. Cross-reference the table from otel-common's README.
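
To give a feel for what such a table covers, here is a minimal sketch of deriving those attributes from the Lambda runtime environment. This is not the code of awsLambdaDetector: the function name `sketchLambdaResource` is hypothetical, and the pairing of each attribute with an environment variable is this sketch's assumption (the README table referenced above is the authoritative source). The environment variable names themselves are the standard Lambda runtime variables, and `faas.max_memory` is converted to bytes because the semantic conventions define it in bytes while Lambda reports megabytes.

```typescript
type Attributes = Record<string, string | number | string[]>;

// Assumed mapping from Lambda runtime env vars to resource attributes.
function sketchLambdaResource(env: Record<string, string | undefined>): Attributes {
  const functionName = env["AWS_LAMBDA_FUNCTION_NAME"];
  // Off Lambda (no function name) this is a no-op and needs no IAM.
  if (!functionName) return {};

  const attrs: Attributes = {
    "cloud.provider": "aws",
    "cloud.platform": "aws_lambda",
    "faas.name": functionName,
  };

  const region = env["AWS_REGION"];
  if (region) attrs["cloud.region"] = region;

  const version = env["AWS_LAMBDA_FUNCTION_VERSION"];
  if (version) attrs["faas.version"] = version;

  // Lambda reports memory in MB; convert to bytes.
  const memoryMb = Number(env["AWS_LAMBDA_FUNCTION_MEMORY_SIZE"]);
  if (Number.isFinite(memoryMb) && memoryMb > 0) {
    attrs["faas.max_memory"] = memoryMb * 1024 * 1024;
  }

  // These two may be absent (the SnapStart caveat), so they are optional.
  const logStream = env["AWS_LAMBDA_LOG_STREAM_NAME"];
  if (logStream) attrs["faas.instance"] = logStream;

  const logGroup = env["AWS_LAMBDA_LOG_GROUP_NAME"];
  if (logGroup) attrs["aws.log.group.names"] = [logGroup];

  return attrs;
}
```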

Bring in 5 commits from main (extract @aws-blocks/pipeline package, bb-tracer
per-trace sampling, bb-cron-job stepped-range fix, version bumps). Conflicts were
limited to root package.json and package-lock.json:
- package.json: union of both sides — keep main's `packages/pipeline` workspace +
  build:packages entry alongside our otel-common / bb-otel-* entries.
- package-lock.json: reset to origin/main and regenerated via
  `npm install --package-lock-only` against the merged package.json.

Verified: full build:packages clean (0 TS errors); OTel packages, bb-dashboard,
umbrella conditional-exports/vendorize gates, and the main-changed bb-tracer /
bb-cron-job suites all pass.
@gcacace gcacace requested a review from a team as a code owner June 19, 2026 17:07

@svidgen svidgen left a comment

Contributor

This requires some thought. Having Otel-prefixed blocks be the "recommended" way to do observability is odd when we have bare-named Logger, Metrics, and Tracer blocks. 🤔

Can the existing observability blocks be extended or switched out gracefully under the hood? Are there any cost, performance, or stability considerations?

@svidgen svidgen marked this pull request as draft June 19, 2026 22:08
soberm commented Jun 22, 2026

Contributor

After taking a look, I'm leaning toward keeping these as separate blocks instead of building OTel into the existing blocks. For Logger and Tracer, merging is probably easy because the method surfaces already line up, so a backend: 'aws-native' | 'otel' switch would mostly just work. The problem is that the OTel path pulls in the whole in-process SDK, protobufjs, the OTLP exporters, and the collector-layer CDK wiring, so every consumer would pay the bundle-size and cold-start cost even if they never use OTel.

However, Metrics is what bothers me most. Some of it could be shimmed: dimensions map cleanly to OTel attributes, and we could fold namespace into the service.namespace resource attribute so its value survives as a PromQL filter. But that doesn't reproduce CloudWatch-namespace behavior: existing namespace-keyed alarms and dashboards just won't find OTel metrics. Since this is still in preview, that's not a blocker, but it's worth mentioning. And the rest doesn't map at all: emitBatch has no equivalent, units flip from CloudWatch's enum to UCUM strings, and timestamp/resolution disappear. Reusing the existing block there would mean either faking things that don't map or breaking callers, which feels worse than just keeping it separate.

Regarding making OTel the recommended default, I think some consumers won't care about OTel at all. If they just want lightweight, zero-overhead observability, the AWS-native blocks (plain EMF to stdout, no collector layer, no extra bundle size) are probably better. Given the cold-start, bundle-size (~7x), and per-invocation flush costs that come with OTel, I'm not sure we should be steering people toward it by default. I'd rather present them as equal options: pick OTel if you need vendor-neutral export or third-party backends, stick with AWS-native if you just want simple and cheap.
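
To make the partial mapping discussed in this comment concrete, here is a rough sketch of what a shim could and could not carry over. All type and function names (`NativeDatum`, `ShimmedMeasurement`, `shimDatum`) are hypothetical; neither the Metrics block's datum shape nor the OtelMetrics `emit()` signature appears in this thread. The unit table covers only a handful of CloudWatch units, and `"1"` (UCUM dimensionless) for `Count` is one possible choice among several.

```typescript
// Hypothetical subset of CloudWatch units and their UCUM equivalents.
type CloudWatchUnit = "Seconds" | "Milliseconds" | "Microseconds" | "Bytes" | "Percent" | "Count";

const UCUM_UNITS: Record<CloudWatchUnit, string> = {
  Seconds: "s",
  Milliseconds: "ms",
  Microseconds: "us",
  Bytes: "By",
  Percent: "%",
  Count: "1",
};

// Hypothetical AWS-native style datum.
interface NativeDatum {
  namespace: string;
  name: string;
  value: number;
  unit: CloudWatchUnit;
  dimensions?: Record<string, string>;
  timestamp?: Date;
  resolution?: number;
}

// Hypothetical OTel-side result, including what had to be dropped.
interface ShimmedMeasurement {
  name: string;
  value: number;
  unit: string;
  attributes: Record<string, string>;
  resourceAttributes: { "service.namespace": string };
  unmapped: string[];
}

function shimDatum(d: NativeDatum): ShimmedMeasurement {
  const unmapped: string[] = [];
  // timestamp and resolution have no counterpart on the OTel side.
  if (d.timestamp !== undefined) unmapped.push("timestamp");
  if (d.resolution !== undefined) unmapped.push("resolution");
  return {
    name: d.name,
    value: d.value,
    unit: UCUM_UNITS[d.unit],
    attributes: { ...(d.dimensions ?? {}) },                   // dimensions -> attributes
    resourceAttributes: { "service.namespace": d.namespace },  // namespace survives only as a label
    unmapped,
  };
}
```

The `unmapped` list is there only to surface what such a shim would silently lose; namespace-keyed CloudWatch alarms and dashboards would still not see the resulting metrics.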
