Skip to content
This repository was archived by the owner on Apr 17, 2026. It is now read-only.

feat: add application-level metrics to Traefik and FRP data plane#5

Open
vikramlayerv wants to merge 6 commits into
mainfrom
feat/traefik-metrics
Open

feat: add application-level metrics to Traefik and FRP data plane#5
vikramlayerv wants to merge 6 commits into
mainfrom
feat/traefik-metrics

Conversation

@vikramlayerv
Copy link
Copy Markdown
Collaborator

Summary

  • Traefik Prometheus metrics: Enabled built-in Prometheus exporter on :8082/metrics with entrypoint/service labels, latency histogram buckets, and JSON access logs
  • FRP Prometheus metrics: Enabled enablePrometheus = true on the FRP server so its /metrics endpoint is active
  • Custom NHP metrics middleware plugin: Traefik local plugin (traefik-plugin-metrics) that tracks 20+ application-level metrics via atomic counters and rolling-window histograms — exposed as structured log snapshots every 60s and a JSON endpoint at /.well-known/nhp-metrics

Metrics covered by the plugin

Category Metrics
Request volume request_total, in_flight
Proxy type proxy_standard, proxy_streaming, proxy_websocket, websocket_active
Cache cache_size, session_cache_hit, session_cache_miss
AC (Access Controller) private_ac_api_call, private_ac_cache_hit
Errors error_backend_connect, error_backend_timeout, error_unauthorized, error_panic
Status codes status_2xx, status_3xx, status_4xx, status_5xx
Latency (p50/p95/p99) request_duration, backend_duration, api_lookup, html_rewrite

Why structured logs + JSON instead of Prometheus in the plugin?

Traefik plugins run in Yaegi (Go interpreter) and cannot import external libraries like prometheus/client_golang. The plugin uses sync/atomic counters and a mutex-guarded ring-buffer histogram instead, exposing data via:

  1. Structured logs — parseable by CloudWatch Logs Insights, Datadog, Loki, etc.
  2. JSON endpoint (GET /.well-known/nhp-metrics) — scrapeable or curl-able

Performance

~300ns typical, <1µs worst case per proxied request. See review comment for full breakdown.

Test plan

  • Plugin unit tests pass (16/16, 96.3% coverage)
  • docker compose up -d and verify Traefik starts with plugin loaded
  • curl http://177.7.0.10:8082/metrics returns Prometheus metrics
  • curl http://177.7.0.10/.well-known/nhp-metrics returns JSON snapshot
  • Send traffic and confirm counters increment
  • Check container logs for [nhp-metrics] metrics_snapshot structured log lines

🤖 Generated with Claude Code

vikramlayerv and others added 3 commits March 17, 2026 12:41
Traefik had zero observability — every user request passes through it
with no metrics. This adds three layers of instrumentation:

1. Prometheus metrics on Traefik (:8082/metrics) with entrypoint and
   service labels, latency histograms, and JSON access logs
2. FRP server Prometheus export (enablePrometheus = true)
3. Custom Traefik middleware plugin (nhp-metrics) that exposes
   structured log snapshots every 60s and a JSON endpoint at
   /.well-known/nhp-metrics — covering request counts, in-flight
   gauge, proxy type breakdown, WebSocket tracking, session cache
   hit/miss, AC API calls, error classification, status code buckets,
   and p50/p95/p99 latency histograms

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add http.Flusher and http.Hijacker to wrappedWriter so WebSocket
  upgrades and SSE streaming are not broken by the middleware wrapper
- Fix goroutine leak: log emitter now accepts a context and stops on
  cancellation (config reloads no longer accumulate goroutines)
- Fix WriteHeader double-delegation: guard now also prevents calling
  the inner ResponseWriter.WriteHeader more than once
- Fix Percentile performance: sort each histogram once per snapshot
  instead of once per percentile read (7 alloc+sorts → 4)
- Remove redundant Histogram.cap field (len(samples) is sufficient)
- Fix misleading "lock-free" comment — the histogram uses a mutex
- Extract defaultMetricsPath constant to eliminate duplicate literals

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add 11 new test cases covering:
- Panic recovery (error_panic counter + 500 response)
- SSE/streaming detection (proxy_streaming counter)
- Write without explicit WriteHeader (implicit 200)
- Duplicate WriteHeader guard (second call ignored)
- Flusher delegation (SSE support)
- Hijacker delegation (WebSocket support)
- Empty MetricsPath fallback to default constant
- Log emitter lifecycle (start + context cancellation)
- Histogram ring buffer wrap-around behavior
- Empty histogram snapshot returns nil
- Single-element percentile edge case
- 3xx status code bucketing
- 403 Forbidden counted as unauthorized
- 503 ServiceUnavailable counted as backend connect error

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Collaborator

@justin-layerv justin-layerv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review: Application-level metrics for Traefik + FRP

The plugin architecture is solid — atomic counters, mutex-guarded ring-buffer histogram, and the Yaegi constraint workaround are all well-reasoned. Test coverage at 96.3% is excellent. A few issues to address before merging:


Security

Metrics endpoint is unauthenticated
/.well-known/nhp-metrics and :8082/metrics are both wide open. For a zero-trust reverse proxy, this leaks operational intelligence (error rates, in-flight counts, cache hit ratios, latency profiles). At minimum, bind :8082 to a non-public interface in docker-compose, and consider gating /.well-known/nhp-metrics behind an IP allowlist or auth header check in the plugin config.

Removed credential comment, but verify rotation
traefik.toml removes the line # Js147258! — this looks like a password previously hardcoded as a comment. Verify this credential hasn't been used anywhere and rotate if so. The FRP dashboard creds (opennhp-frp / opennhp-frp) in frps.toml are unchanged — confirm these are only for local dev.


Bugs

isWebSocketUpgrade is too strict
The Connection header can be a comma-separated list (e.g., Connection: keep-alive, Upgrade). strings.EqualFold on the full header value will miss these cases. Should use strings.Contains(strings.ToLower(...), "upgrade") instead.

Duration not recorded on panic
When next.ServeHTTP panics, the deferred recovery runs but RequestDuration.Record(elapsed) is never reached. Panicked requests also won't increment status_5xx, so error counters won't sum to request_total. Consider recording duration and status in the panic recovery defer.

cancel is never called — goroutine leak on config reload
NHPMetrics.cancel is stored but never invoked. There's no Close() or shutdown hook. On Traefik config reloads that create new plugin instances, the old log emitter goroutine will leak.


Dead metrics (never written to)

These are defined, exported in the JSON snapshot, but never incremented anywhere in ServeHTTP:

  • Histograms: BackendDuration, APILookupDuration, HTMLRewriteDuration — always report p50=0, p95=0, p99=0
  • Counters: CacheSize, SessionCacheHit, SessionCacheMiss, PrivateACAPICall, PrivateACCacheHit — always 0

Either wire these up (they'd need cooperation from backend/AC components) or remove them to avoid shipping misleading zeros.


Minor

  • traefik.toml is missing a trailing newline
  • TestLogEmitterStartsAndStops uses time.Sleep(1500ms) — will be flaky under CI load; consider a shorter interval or channel-based signal

Verdict

Well-structured plugin with strong test coverage. Requesting changes for the security exposure, WebSocket detection bug, and goroutine leak. The dead metrics should either be wired up or removed before merge.

vikramlayerv and others added 3 commits March 17, 2026 13:36
Security:
- Bind Prometheus metrics entrypoint to 127.0.0.1:8082 (not public)
- Add CIDR-based IP allowlist for /.well-known/nhp-metrics endpoint
  (configured via allowedCIDRs in provider.toml, defaults to open
  for backwards compatibility)
- Restrict metrics access to loopback + Docker network (177.7.0.0/16)

Bugs:
- Fix isWebSocketUpgrade: use strings.Contains on lowercased
  Connection header to handle comma-separated values like
  "keep-alive, Upgrade"
- Fix duration not recorded on panic: moved RequestDuration.Record
  and status bucketing into the defer so panicked requests are
  counted in request_duration and status_5xx
- Fix goroutine leak: add Close() method that calls cancel(), so
  Traefik can stop the log emitter on config reload

Dead metrics removed:
- Remove BackendDuration, APILookupDuration, HTMLRewriteDuration
  histograms (never written to from middleware)
- Remove CacheSize, SessionCacheHit, SessionCacheMiss,
  PrivateACAPICall, PrivateACCacheHit counters (never incremented)

Minor:
- Add trailing newline to traefik.toml
- Replace time.Sleep in TestLogEmitterStartsAndStops with
  channel-based signalling + timeout

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
FRP web dashboard was listening on 0.0.0.0:7500 with default credentials,
exposing operational data to any network client. Bind to 127.0.0.1 so
it's only accessible from within the container.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@vikramlayerv
Copy link
Copy Markdown
Collaborator Author

All review feedback addressed

Security

  • Metrics endpoint: Plugin endpoint gated by allowedCIDRs config (set to 127.0.0.1/8 + docker network 177.7.0.0/16 in provider.toml). Prometheus entrypoint bound to 127.0.0.1:8082 in traefik.toml.
  • FRP dashboard: Bound webServer.addr to 127.0.0.1 (was 0.0.0.0) so it's only accessible from within the container.
  • Credential comment Js147258!: Removed in initial commit. Was only ever a comment in the original traefik.toml on main, never used as an actual config value.

Bugs

  • isWebSocketUpgrade too strict: Fixed — uses strings.Contains(conn, "upgrade") to handle comma-separated Connection headers like keep-alive, Upgrade. Test TestWebSocketDetectionCommaHeader covers this.
  • Duration not recorded on panic: Fixed — deferred function records RequestDuration and Status5xx before checking for panic recovery. Test TestPanicRecovery verifies both counters.
  • cancel never called (goroutine leak): Fixed — Close() method calls cancel(). Traefik calls Close() on config reload. Test TestCloseStopsLogEmitter covers this.

Dead metrics

  • Removed all 8 dead metrics (BackendDuration, APILookupDuration, HTMLRewriteDuration, CacheSize, SessionCacheHit, SessionCacheMiss, PrivateACAPICall, PrivateACCacheHit). Test TestDeadMetricsRemoved verifies none appear in the snapshot.

Tests

All 22 tests pass with -race.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants