
Introduce jitter metric based on RFC 1889 Appendix A #8586

Open
GGraziadei wants to merge 8 commits into apache:master from GGraziadei:8538-rfc-1889a-jitter-metric


@GGraziadei

What is the purpose of the change

In deterministic real-time processing, the predictability of latency matters as much as latency itself: bounded jitter is a prerequisite for building a deterministic system. A jitter metric helps with:

  • Micro-burst detection: high jitter reveals short spikes that average latency smooths out.
  • Compliance: modern SLAs rely on percentiles (e.g., P99). Jitter is a strong leading indicator of tail-latency degradation.
  • Root cause analysis: high jitter in a single component suggests GC pressure or resource contention, whereas high global jitter with stable components suggests network congestion or shuffle bottlenecks.
  • Bottleneck identification: per-component jitter shows where in the topology a bottleneck occurs and helps distinguish its underlying cause, making performance issues easier to diagnose and resolve.

To keep the performance impact negligible, I propose using an Exponentially Weighted Moving Average (EWMA), following the logic of RFC 1889 Appendix A.8: https://www.rfc-editor.org/rfc/rfc1889#appendix-A.8

Mathematical Model:
J_new = J_old + (|D_current - D_previous| - J_old) * smoothing_factor
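
The update rule above can be sketched in a few lines of Java. This is a minimal single-threaded illustration, not the PR's EwmaGauge: the class name, the method names, and the choice to let the first sample only seed the previous latency are all assumptions.

```java
/** Hypothetical sketch of the EWMA jitter update; not the PR's EwmaGauge. */
final class JitterSketch {
    private final double alpha;               // smoothing factor, e.g. 1.0 / 16
    private double lastLatency = Double.NaN;  // NaN marks "no sample seen yet" (assumption)
    private double jitter;                    // current smoothed jitter, starts at 0

    JitterSketch(double alpha) {
        this.alpha = alpha;
    }

    /** Feeds one latency sample and updates the smoothed jitter. */
    void addLatency(double latency) {
        if (!Double.isNaN(lastLatency)) {
            // |D_current - D_previous|
            double d = Math.abs(latency - lastLatency);
            // J_new = J_old + (d - J_old) * smoothing_factor
            jitter += (d - jitter) * alpha;
        }
        lastLatency = latency;
    }

    double getJitter() {
        return jitter;
    }
}
```

The state is just the previous latency and the current jitter, so each sample costs one subtraction, one absolute value, and one multiply-add.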

Performance impact

  • Minimal computational overhead: an EWMA update takes only a few arithmetic operations per sample.
  • Memory efficiency: only two persistent variables (8 bytes) per task.
  • System calls: none are required to track the jitter, because the latencies are already computed.

How was the change tested

  • Unit test: introduced new test cases for Config, TaskMetrics, EwmaGauge
  • Local smoke test: registered a topology metrics reporter and persisted the captured metrics in the attached file
  • The metrics2 package is unaffected by this change.
    worker_log.zip

Example results in worker logs

2026-05-05 17:52:07.993 c.c.m.ConsoleReporter metrics-console-reporter-1-thread-1 [INFO] storm.worker.WordCountTopology-4-1777995769.ggraziadei-ThinkPad-E14-Gen-5.count.default.10.6700-__emit-count-default.m1_rate
2026-05-05 17:52:07.993 c.c.m.ConsoleReporter metrics-console-reporter-1-thread-1 [INFO]              value = 30.0
2026-05-05 17:52:07.993 c.c.m.ConsoleReporter metrics-console-reporter-1-thread-1 [INFO] storm.worker.WordCountTopology-4-1777995769.ggraziadei-ThinkPad-E14-Gen-5.count.default.10.6700-__execute-count-split:default.m1_rate
2026-05-05 17:52:07.993 c.c.m.ConsoleReporter metrics-console-reporter-1-thread-1 [INFO]              value = 30.0
2026-05-05 17:52:07.993 c.c.m.ConsoleReporter metrics-console-reporter-1-thread-1 [INFO] storm.worker.WordCountTopology-4-1777995769.ggraziadei-ThinkPad-E14-Gen-5.count.default.10.6700-__execute-latency-split:default
2026-05-05 17:52:07.993 c.c.m.ConsoleReporter metrics-console-reporter-1-thread-1 [INFO]              value = 0.0
2026-05-05 17:52:07.993 c.c.m.ConsoleReporter metrics-console-reporter-1-thread-1 [INFO] storm.worker.WordCountTopology-4-1777995769.ggraziadei-ThinkPad-E14-Gen-5.count.default.10.6700-__execute-rfc1889a-jitter-split:default
2026-05-05 17:52:07.993 c.c.m.ConsoleReporter metrics-console-reporter-1-thread-1 [INFO]              value = 0.2557194505051832
2026-05-05 17:52:07.993 c.c.m.ConsoleReporter metrics-console-reporter-1-thread-1 [INFO] storm.worker.WordCountTopology-4-1777995769.ggraziadei-ThinkPad-E14-Gen-5.count.default.10.6700-__process-latency-split:default
2026-05-05 17:52:07.993 c.c.m.ConsoleReporter metrics-console-reporter-1-thread-1 [INFO]              value = 0.3333333333333333
2026-05-05 17:52:07.993 c.c.m.ConsoleReporter metrics-console-reporter-1-thread-1 [INFO] storm.worker.WordCountTopology-4-1777995769.ggraziadei-ThinkPad-E14-Gen-5.count.default.10.6700-__process-rfc1889a-jitter-split:default
2026-05-05 17:52:07.993 c.c.m.ConsoleReporter metrics-console-reporter-1-thread-1 [INFO]              value = 0.145830156234796

In the context of #8583

@rzo1
Contributor

rzo1 commented May 6, 2026

Question on the d <= 0 short-circuit in EwmaGauge.addValue

if (d <= 0) {
    return;
}

Is this skip intentional? Reading RFC 3550 §A.8 (which supersedes RFC 1889 with the same text):

d = transit - s->transit;
s->transit = transit;
if (d < 0) d = -d;
s->jitter += (1./16.) * ((double)d - s->jitter);

The update is unconditional. When d == 0 it collapses to J ← J · (1 − α) = J · 15/16, i.e. the EWMA decays toward zero. The RFC explicitly chooses 1/16 for its "noise reduction ratio while maintaining a reasonable rate of convergence" — convergence back toward zero during quiet periods is part of the spec's intent.
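
The decay during quiet periods can be written out explicitly. A short derivation, assuming every update in the window has d = 0 and writing α for the smoothing factor:

```latex
J_n = (1-\alpha)\,J_{n-1}
\;\Longrightarrow\;
J_n = (1-\alpha)^{n} J_0,
\qquad
n_{1/2} = \frac{\ln(1/2)}{\ln(1-\alpha)} \approx 10.7
\quad \text{for } \alpha = \tfrac{1}{16}.
```

Under these assumptions the jitter halves roughly every 11 stable samples, whereas skipping the update when d == 0 leaves it frozen at its last value.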

For comparison, the major RTP stacks all apply the update unconditionally:

  • GStreamer: rtpsource.c:997
    src->stats.jitter += diff - ((src->stats.jitter + 8) >> 4);
  • PJSIP: rtcp.c:434
    sess->jitter += d - ((sess->jitter + 8) >> 4);
  • WebRTC: receive_statistics_impl.cc:165
    int32_t jitter_diff_q4 = (time_diff_samples << 4) - jitter_q4_;
    jitter_q4_ += ((jitter_diff_q4 + 8) >> 4);
    (the only guard here is a < 450000 anomaly cap, not a d == 0 short-circuit)
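
The integer forms quoted above keep the jitter scaled by 16, so the right shift applies the 1/16 gain with rounding. A minimal Java sketch of that arithmetic, with hypothetical class and method names that are not taken from any of the stacks listed:

```java
/** Hypothetical fixed-point sketch of the RFC-style integer jitter update. */
final class FixedPointJitterSketch {
    private int jitterQ4; // jitter scaled by 16 (four fractional bits)

    /** d is the absolute deviation in unscaled units; the update is unconditional. */
    void update(int d) {
        // Equivalent to J += (d - J) / 16 in the scaled domain, rounding the decay term.
        jitterQ4 += d - ((jitterQ4 + 8) >> 4);
    }

    /** Scaled internal state. */
    int raw() {
        return jitterQ4;
    }

    /** Jitter in unscaled units. */
    int jitter() {
        return jitterQ4 >> 4;
    }
}
```

In this form a call with d == 0 still subtracts the rounded 1/16 share of the accumulator, which is the decay the review comment refers to.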

The test EwmaGaugeTest.zeroDeviationDecays appears to lock in the current behavior, but its display name ("Zero deviation decays jitter toward zero") suggests the original intent matched the spec while the assertion encodes the opposite. Worth double-checking which one was meant.

Suggested fix: drop the if (d <= 0) return; block — the CAS loop below is already correct for d == 0. (Optionally short-circuit the math with updatedJitter = currentJitter * (1.0 - alpha), but you must still write it back.)
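
For illustration, an unconditional update with a compare-and-set loop might look like the sketch below. It is not the PR's EwmaGauge: the class name, the addDeviation method, and the choice to store the double's bits in an AtomicLong are assumptions made for this example.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical lock-free EWMA sketch; every sample, including d == 0, is written back. */
final class CasEwmaSketch {
    private final double alpha;
    // The jitter is stored as raw double bits so it can be updated with CAS (assumption).
    private final AtomicLong jitterBits = new AtomicLong(Double.doubleToLongBits(0.0));

    CasEwmaSketch(double alpha) {
        this.alpha = alpha;
    }

    /** d is the absolute latency deviation (d >= 0). No early return for d == 0. */
    void addDeviation(double d) {
        long oldBits;
        long newBits;
        do {
            oldBits = jitterBits.get();
            double current = Double.longBitsToDouble(oldBits);
            // For d == 0 this reduces to current * (1 - alpha), i.e. the decay step.
            double updated = current + alpha * (d - current);
            newBits = Double.doubleToLongBits(updated);
        } while (!jitterBits.compareAndSet(oldBits, newBits)); // retry if another thread won
    }

    double getValue() {
        return Double.longBitsToDouble(jitterBits.get());
    }
}
```

With alpha = 0.5, feeding d = 8 and then d = 0 twice moves the value from 4.0 to 2.0 to 1.0, so stable latencies pull the gauge back toward zero.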

Was the skip intentional?

@GGraziadei
Author

Hello @rzo1, thank you for your comment.
You are right: there is an implementation error, and I am fixing it.
If the jitter is not updated when the latency is stable (d == 0), it never decreases, which is wrong.
For the test case "Zero deviation decays jitter toward zero" (alpha = 0.5), the correct sequence of states is:

  • lastLat = UNSEED; lat=0; j=0
  • lastLat=0; lat=10; d=5; j=0+alpha* (d-j)=2.5
  • lastLat=10; lat=10; d=0; j=2.5 + alpha * (d-j) = 2.5 - alpha * 2.5 = 1.25

rzo1 added the enhancement and java labels on May 6, 2026
rzo1 added this to the 3.0.0 milestone on May 6, 2026