feat: add application-level metrics to Traefik and FRP data plane#5
feat: add application-level metrics to Traefik and FRP data plane#5vikramlayerv wants to merge 6 commits into
Conversation
Traefik had zero observability — every user request passes through it with no metrics. This adds three layers of instrumentation: 1. Prometheus metrics on Traefik (:8082/metrics) with entrypoint and service labels, latency histograms, and JSON access logs 2. FRP server Prometheus export (enablePrometheus = true) 3. Custom Traefik middleware plugin (nhp-metrics) that exposes structured log snapshots every 60s and a JSON endpoint at /.well-known/nhp-metrics — covering request counts, in-flight gauge, proxy type breakdown, WebSocket tracking, session cache hit/miss, AC API calls, error classification, status code buckets, and p50/p95/p99 latency histograms Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add http.Flusher and http.Hijacker to wrappedWriter so WebSocket upgrades and SSE streaming are not broken by the middleware wrapper - Fix goroutine leak: log emitter now accepts a context and stops on cancellation (config reloads no longer accumulate goroutines) - Fix WriteHeader double-delegation: guard now also prevents calling the inner ResponseWriter.WriteHeader more than once - Fix Percentile performance: sort each histogram once per snapshot instead of once per percentile read (7 alloc+sorts → 4) - Remove redundant Histogram.cap field (len(samples) is sufficient) - Fix misleading "lock-free" comment — the histogram uses a mutex - Extract defaultMetricsPath constant to eliminate duplicate literals Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add 11 new test cases covering: - Panic recovery (error_panic counter + 500 response) - SSE/streaming detection (proxy_streaming counter) - Write without explicit WriteHeader (implicit 200) - Duplicate WriteHeader guard (second call ignored) - Flusher delegation (SSE support) - Hijacker delegation (WebSocket support) - Empty MetricsPath fallback to default constant - Log emitter lifecycle (start + context cancellation) - Histogram ring buffer wrap-around behavior - Empty histogram snapshot returns nil - Single-element percentile edge case - 3xx status code bucketing - 403 Forbidden counted as unauthorized - 503 ServiceUnavailable counted as backend connect error Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Code Review: Application-level metrics for Traefik + FRP
The plugin architecture is solid — atomic counters, mutex-guarded ring-buffer histogram, and the Yaegi constraint workaround are all well-reasoned. Test coverage at 96.3% is excellent. A few issues to address before merging:
Security
Metrics endpoint is unauthenticated
/.well-known/nhp-metrics and :8082/metrics are both wide open. For a zero-trust reverse proxy, this leaks operational intelligence (error rates, in-flight counts, cache hit ratios, latency profiles). At minimum, bind :8082 to a non-public interface in docker-compose, and consider gating /.well-known/nhp-metrics behind an IP allowlist or auth header check in the plugin config.
Removed credential comment, but verify rotation
traefik.toml removes the line # Js147258! — this looks like a password previously hardcoded as a comment. Verify this credential hasn't been used anywhere and rotate if so. The FRP dashboard creds (opennhp-frp / opennhp-frp) in frps.toml are unchanged — confirm these are only for local dev.
Bugs
isWebSocketUpgrade is too strict
The Connection header can be a comma-separated list (e.g., Connection: keep-alive, Upgrade). strings.EqualFold on the full header value will miss these cases. Should use strings.Contains(strings.ToLower(...), "upgrade") instead.
Duration not recorded on panic
When next.ServeHTTP panics, the deferred recovery runs but RequestDuration.Record(elapsed) is never reached. Panicked requests also won't increment status_5xx, so error counters won't sum to request_total. Consider recording duration and status in the panic recovery defer.
cancel is never called — goroutine leak on config reload
NHPMetrics.cancel is stored but never invoked. There's no Close() or shutdown hook. On Traefik config reloads that create new plugin instances, the old log emitter goroutine will leak.
Dead metrics (never written to)
These are defined, exported in the JSON snapshot, but never incremented anywhere in ServeHTTP:
- Histograms:
BackendDuration,APILookupDuration,HTMLRewriteDuration— always reportp50=0, p95=0, p99=0 - Counters:
CacheSize,SessionCacheHit,SessionCacheMiss,PrivateACAPICall,PrivateACCacheHit— always0
Either wire these up (they'd need cooperation from backend/AC components) or remove them to avoid shipping misleading zeros.
Minor
traefik.tomlis missing a trailing newlineTestLogEmitterStartsAndStopsusestime.Sleep(1500ms)— will be flaky under CI load; consider a shorter interval or channel-based signal
Verdict
Well-structured plugin with strong test coverage. Requesting changes for the security exposure, WebSocket detection bug, and goroutine leak. The dead metrics should either be wired up or removed before merge.
Security: - Bind Prometheus metrics entrypoint to 127.0.0.1:8082 (not public) - Add CIDR-based IP allowlist for /.well-known/nhp-metrics endpoint (configured via allowedCIDRs in provider.toml, defaults to open for backwards compatibility) - Restrict metrics access to loopback + Docker network (177.7.0.0/16) Bugs: - Fix isWebSocketUpgrade: use strings.Contains on lowercased Connection header to handle comma-separated values like "keep-alive, Upgrade" - Fix duration not recorded on panic: moved RequestDuration.Record and status bucketing into the defer so panicked requests are counted in request_duration and status_5xx - Fix goroutine leak: add Close() method that calls cancel(), so Traefik can stop the log emitter on config reload Dead metrics removed: - Remove BackendDuration, APILookupDuration, HTMLRewriteDuration histograms (never written to from middleware) - Remove CacheSize, SessionCacheHit, SessionCacheMiss, PrivateACAPICall, PrivateACCacheHit counters (never incremented) Minor: - Add trailing newline to traefik.toml - Replace time.Sleep in TestLogEmitterStartsAndStops with channel-based signalling + timeout Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
FRP web dashboard was listening on 0.0.0.0:7500 with default credentials, exposing operational data to any network client. Bind to 127.0.0.1 so it's only accessible from within the container. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
All review feedback addressedSecurity
Bugs
Dead metrics
TestsAll 22 tests pass with |
Summary
:8082/metricswith entrypoint/service labels, latency histogram buckets, and JSON access logsenablePrometheus = trueon the FRP server so its/metricsendpoint is activetraefik-plugin-metrics) that tracks 20+ application-level metrics via atomic counters and rolling-window histograms — exposed as structured log snapshots every 60s and a JSON endpoint at/.well-known/nhp-metricsMetrics covered by the plugin
request_total,in_flightproxy_standard,proxy_streaming,proxy_websocket,websocket_activecache_size,session_cache_hit,session_cache_missprivate_ac_api_call,private_ac_cache_hiterror_backend_connect,error_backend_timeout,error_unauthorized,error_panicstatus_2xx,status_3xx,status_4xx,status_5xxrequest_duration,backend_duration,api_lookup,html_rewriteWhy structured logs + JSON instead of Prometheus in the plugin?
Traefik plugins run in Yaegi (Go interpreter) and cannot import external libraries like
prometheus/client_golang. The plugin usessync/atomiccounters and a mutex-guarded ring-buffer histogram instead, exposing data via:GET /.well-known/nhp-metrics) — scrapeable or curl-ablePerformance
~300ns typical, <1µs worst case per proxied request. See review comment for full breakdown.
Test plan
docker compose up -dand verify Traefik starts with plugin loadedcurl http://177.7.0.10:8082/metricsreturns Prometheus metricscurl http://177.7.0.10/.well-known/nhp-metricsreturns JSON snapshot[nhp-metrics] metrics_snapshotstructured log lines🤖 Generated with Claude Code