Skip to content

Add remaining Network SDKStats short-interval signals#156

Merged
JacksonWeber merged 4 commits into
microsoft:mainfrom
JacksonWeber:feat/network-sdk-stats-extended
Jun 2, 2026
Merged

Add remaining Network SDKStats short-interval signals#156
JacksonWeber merged 4 commits into
microsoft:mainfrom
JacksonWeber:feat/network-sdk-stats-extended

Conversation

@JacksonWeber

@JacksonWeber JacksonWeber commented May 27, 2026

Copy link
Copy Markdown
Contributor

Follows up #145 by adding the remaining Network SDKStats short-interval metrics defined in the AppInsights SDKStats spec, aligned with the updated sdkstats.md (PRs #938 and #942).

New metrics

  • Request_Failure_Count (endpoint, host, statusCode)
  • Request_Duration (endpoint, host) - per-interval average
  • Retry_Count (endpoint, host, statusCode)
  • Throttle_Count (endpoint, host, statusCode)
  • Exception_Count (endpoint, host, exceptionType)

Existing Request_Success_Count is unchanged.

A365 status code classification (Breeze HTTP semantics)

Per the spec, endpoint=�365 follows the same HTTP code set as endpoint=�reeze :

  • success: 2xx except 206
  • ignored: 206, 307, 308
  • throttle: 402, 439
  • retry: 401, 403, 408, 429, 500, 502, 503, 504
  • failure: everything else

OTLP status code classification (OTLP/HTTP semantics)

The updated spec requires OTLP exporters report HTTP status codes per the OTLP/HTTP response section. The upstream @opentelemetry/otlp-exporter-base delegate exposes a status code on OTLPExporterError.code only for non-retryable failures; retryable HTTP responses (429/502/503/504) are collapsed into a synthetic message with the original code discarded. The wrapper classifies as follows:

  • Numeric HTTP status (100-599) on the error -> Request_Failure_Count (or Retry_Count for 429/502/503/504 if the failure path still carries it).
  • Synthetic "Export failed with retryable status" message -> Retry_Count with statusCode="unknown" (status is lost upstream; spec dimension preserved).
  • AbortError / TimeoutError / "Request timed out" -> Exception_Count(exceptionType="Timeout exception").
  • TypeError or Node socket errors (ECONNREFUSED/ENOTFOUND/...) -> Exception_Count(exceptionType="Network exception").
  • Any other thrown error -> Exception_Count(exceptionType="Client exception").

Throttle (429 + Retry-After) is intentionally bucketed as retry rather than throttle because the upstream delegate does not surface response headers; over-counting throttle would distort customer panic signals.

Wiring

  • A365 exporter: each retry attempt is timed; classifyStatusCode dispatches to the right recorder. Caught exceptions are bucketed (Timeout / Network / Client) mirroring the AzMon ExceptionType enum.
  • OTLP wrapper: every export is timed and Request_Duration is recorded; failures are routed through the OTLP-aware classifier described above. Also adds forceFlush to NetworkStatsLogExporter so it conforms to the current LogRecordExporter interface.

Tests

  • New unit coverage for each record function, duration averaging, classifyStatusCode buckets, A365 failure/retry/throttle/exception/duration paths, and the OTLP wrapper's failure / retry / retryable-loss / network / timeout / non-HTTP-code branches.
  • Full unit suite: 849 passed / 5 todo / 0 failed. Lint and build clean.
image

…eption, duration

Extends PR microsoft#145 with the remaining Network SDKStats signals per the Application Insights SDKStats spec:

- Request_Failure_Count (keyed by endpoint/host/statusCode)

- Retry_Count (keyed by endpoint/host/statusCode)

- Throttle_Count (keyed by endpoint/host/statusCode)

- Exception_Count (keyed by endpoint/host/exceptionType)

- Request_Duration (avg per endpoint/host)

A365 exporter classifies HTTP responses (success/retry/throttle/failure) per spec and records duration on every attempt; thrown fetch errors are bucketed into the Exception_Count metric. OTLP wrappers record duration on every export and emit an Exception_Count for FAILED results.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR completes the SDKStats “Network” short-interval metric set by adding failure/retry/throttle/exception counters plus a per-interval average request duration metric, and wiring them into both the A365 exporter and the OTLP exporter wrappers.

Changes:

  • Expanded sdkstats/networkStats with five new counters, a duration accumulator (sum/count → per-interval average on drain), and a shared classifyStatusCode helper.
  • Updated A365 exporter and OTLP wrapper decorators to record the new signals (including per-attempt duration for A365 retries).
  • Added/extended unit tests covering new recorders, duration averaging, status bucketing, and exporter wiring.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
src/sdkstats/networkStats.ts Adds new network metric constants, counters, duration averaging, and HTTP status classification.
src/sdkstats/metrics.ts Registers the additional observable gauges and maps variable-length keys to attributes via keyAttributes.
src/sdkstats/otlpWrapper.ts Records duration for all exports and emits a stable Exception_Count label on non-success results.
src/sdkstats/index.ts Re-exports the new metric names, recorders, and classifyStatusCode.
src/a365/exporter/Agent365Exporter.ts Records duration + status/exception counters per request attempt; adds exception-type bucketing helper.
test/internal/unit/sdkstats/networkStats.test.ts Adds unit tests for new recorders, duration averaging, and status bucketing.
test/internal/unit/sdkstats/metrics.test.ts Validates OTel metric emission (attributes + avg duration) for the new signals.
test/internal/unit/sdkstats/otlpWrapper.test.ts Verifies OTLP wrapper emits exception count + duration on FAILED and duration on SUCCESS.
test/internal/unit/a365/agent365NetworkStats.test.ts Adds A365 exporter wiring tests for failure/retry/throttle/exception/duration metrics.
Comments suppressed due to low confidence (1)

src/sdkstats/otlpWrapper.ts:89

  • wrapExport doesn't handle the case where the wrapped exporter throws synchronously (before invoking the callback). In that scenario the wrapper won't record Request_Duration / Exception_Count and the exception will escape to callers. Consider wrapping inner(settle) in a try/catch, recording duration/exception, and invoking resultCallback with a FAILED ExportResult (including the caught error) to keep wrapper behavior consistent with its docstring.
function wrapExport<T>(
  host: string,
  inner: (resultCallback: (result: ExportResult) => void) => void,
  resultCallback: (result: ExportResult) => void,
  _items: T,
): void {
  const start = Date.now();
  const settle = (result: ExportResult): void => {
    recordDuration(OTLP_ENDPOINT_CATEGORY, host, Date.now() - start);
    if (result.code === ExportResultCode.SUCCESS) {
      recordSuccess(OTLP_ENDPOINT_CATEGORY, host);
    } else {
      recordException(OTLP_ENDPOINT_CATEGORY, host, OTLP_FAILED_EXCEPTION_TYPE);
    }
    resultCallback(result);
  };

  inner(settle);
}

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread test/internal/unit/a365/agent365NetworkStats.test.ts Outdated
Comment thread test/internal/unit/a365/agent365NetworkStats.test.ts Outdated
- otlpWrapper.wrapExport: catch synchronous throws from inner exporter; record Request_Duration + Exception_Count and invoke resultCallback with FAILED ExportResult so wrapper behavior matches its docstring.
- Add unit test covering the synchronous-throw path.
- a365NetworkStats test: rename misleading '429 throttle' test and remove inaccurate inline comment; the test now exercises only 439 (a pure throttle status) and documents the actual classifyStatusCode ordering (THROTTLE_STATUSES checked first; 429 is retry-only).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@JacksonWeber

Copy link
Copy Markdown
Contributor Author

Also addressed the suppressed-confidence finding about wrapExport not handling synchronous throws from the inner exporter (commit e2985a1):

  • Wrapped inner(settle) in a try/catch.
  • On a synchronous throw, the wrapper now records Request_Duration and Exception_Count(exceptionType=\"ExporterFailed\") and invokes resultCallback with { code: FAILED, error } so the exception never escapes to the caller.
  • Added a settled guard so the success/failure path can't double-record if inner both throws and invokes the callback.
  • New unit test in otlpWrapper.test.ts covers the synchronous-throw path end-to-end (asserts FAILED result, original error propagated, exception + duration recorded).

Full unit suite: 844 passed / 5 todo / 0 failed.

Comment thread src/sdkstats/networkStats.ts
Comment thread src/sdkstats/otlpWrapper.ts
JacksonWeber and others added 2 commits June 1, 2026 15:44
The aep-health-and-standards/Telemetry-Collection-Spec sdkstats spec was updated (#938, #942) to require non-Breeze exporters report network signals per the OTLP/HTTP response specification: success/failure/retry/throttle counts with the HTTP status code dimension, falling back to Exception_Count only when no response code is available.

- Classify FAILED OTLP exports using the error surfaced by the upstream delegate: numeric HTTP status (100-599) on OTLPExporterError -> Request_Failure_Count (or Retry_Count for 429/502/503/504); the delegate's synthetic 'Export failed with retryable status' message -> Retry_Count with statusCode='unknown' (status is lost upstream); network/timeout errors -> Exception_Count with a bounded type (Timeout/Network/Client exception).

- Add forceFlush forwarder to NetworkStatsLogExporter to satisfy the LogRecordExporter interface (pre-existing tsc error on the branch).

- Extend unit tests for the new classification paths and the bounded exceptionType labels.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Resolve conflicts in Agent365Exporter.ts (combine SDKStats imports with OpenTelemetryConstants import) and otlpWrapper.ts (keep extended doc comment, drop duplicate forceFlush).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@JacksonWeber JacksonWeber requested a review from lzchen June 2, 2026 18:30
Comment thread src/sdkstats/otlpWrapper.ts
Comment thread src/sdkstats/networkStats.ts

@rads-1996 rads-1996 left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@JacksonWeber JacksonWeber merged commit 278b781 into microsoft:main Jun 2, 2026
5 checks passed
JacksonWeber added a commit that referenced this pull request Jun 2, 2026
#165)

* refactor(sdkstats): consolidate shared constants under sdkstats/constants.ts

Addresses Radhika's follow-up suggestions on PR #156:

- Move OTLP wrapper magic constants (endpoint category, unknown-status sentinel, retryable-network error codes, exception-type labels, OTLP retryable HTTP statuses) into a new src/sdkstats/constants.ts.

- Move the wire-format metric-name constants (REQUEST_SUCCESS_NAME / REQUEST_FAILURE_NAME / REQUEST_DURATION_NAME / RETRY_COUNT_NAME / THROTTLE_COUNT_NAME / EXCEPTION_COUNT_NAME), the NETWORK_METRIC_NAMES tuple, and the HTTP status-code buckets out of networkStats.ts into the same constants module. networkStats.ts re-exports them for backwards compatibility.

- Have the A365 exporter use the shared A365_ENDPOINT_CATEGORY label and the EXC_TIMEOUT/EXC_NETWORK/EXC_CLIENT exceptionType constants instead of duplicating the literal strings.

- Document why we cannot just import StatsbeatCounter from @azure/monitor-opentelemetry-exporter (the enum lives at dist/{esm,commonjs}/export/statsbeat/types and the package's exports field only publishes '.' and './package.json', so the enum is not part of the public surface; we mirror its values and keep them in lockstep).

All 923 tests pass, lint clean, build green.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* docs(sdkstats/constants): fix broken @link paths and clarify why StatsbeatCounter cannot be imported

- {@link} paths were written as if from a parent of src/sdkstats/; corrected to be relative to src/sdkstats/constants.ts itself (./networkStats, ./otlpWrapper, ../a365/exporter/Agent365Exporter).

- Expanded the StatsbeatCounter rationale with the exact TS2307 error produced under our moduleResolution=NodeNext config and pointed at the upstream Azure SDK repo for the public-export ask.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* test(sdkstats): consume new shared constants in tests instead of duplicating string literals

Updates the four SDKStats test files (networkStats, otlpWrapper, metrics, agent365NetworkStats) to import and use the constants now exposed from sdkstats/constants.ts (A365_ENDPOINT_CATEGORY, OTLP_ENDPOINT_CATEGORY, OTLP_UNKNOWN_STATUS, OTLP_HTTP_RETRYABLE_STATUSES, EXC_TIMEOUT/NETWORK/CLIENT) instead of duplicating the literal strings inline. The single test that intentionally pins the wire-format metric-name strings to their literal values (networkStats.test.ts \�xposes all six SDKStats network metric names\) is preserved so a constant rename still fails loudly.

Also rewrites the OTLP_HTTP_RETRYABLE_STATUSES iteration to drive the loop directly from the shared set so adding/removing a status updates both the wrapper and the test in lockstep.

923 tests pass, lint clean, build green.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants