The national CDA endpoint has become unresponsive at times, so we should discuss adding broader rate limiting. This is both a security concern and a performance/reliability concern.
From the security side, rate limiting helps reduce the blast radius of abusive clients, credential stuffing, accidental request storms, scraping, and basic DoS behavior. From the performance side, it protects shared Tomcat threads, Oracle connections, expensive CWMS package calls, and large response serialization from being consumed by a small number of clients or malformed query patterns.
Current State
Based on the current source, CDA has very little application-level rate limiting (as of 05/07/2026).
Currently rate limited:
- `POST /ratings/rate-ts/{office}/{rating-id}`
- `POST /ratings/reverse-rate-ts/{office}/{rating-id}`
- `POST /ratings/reverse-rate-values/{office}/{rating-id}`
- `POST /ratings/rate-values/{office}/{rating-id}`
Current behavior (of the `/ratings` endpoints above):
- Uses Javalin `NaiveRateLimit`.
- Defaults to `100` requests per minute.
- Configured by the JVM system property `cwms.dataapi.request.limit`. (The name is confusing if the limit is not global.)
- Uses in-memory counters, so limits are per JVM/pod.
- Uses client IP/method/path as the effective key unless Javalin's key function is changed.
- Authorized users with the required route roles can bypass the current limiter.
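To make the per-JVM behavior concrete, here is a minimal sketch of a fixed-window, in-memory counter keyed on IP, method, and path. It is an illustration only, not Javalin's `NaiveRateLimit` source; the class name `FixedWindowLimiter` and its method are hypothetical, and the caller passes in the clock value so the behavior is easy to test.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration of a fixed-window limiter; not the Javalin implementation. */
final class FixedWindowLimiter {
    private final int limit;
    private final long windowMillis;
    // Counters live on this JVM's heap, so every pod keeps its own independent counts.
    private final Map<String, Integer> counts = new HashMap<>();
    private long windowStartMillis;

    FixedWindowLimiter(int limit, long windowMillis, long nowMillis) {
        this.limit = limit;
        this.windowMillis = windowMillis;
        this.windowStartMillis = nowMillis;
    }

    /** Returns true if the request is allowed, false if it should receive a 429. */
    synchronized boolean tryAcquire(String ip, String method, String path, long nowMillis) {
        if (nowMillis - windowStartMillis >= windowMillis) {
            counts.clear();               // a new window starts, all counters reset
            windowStartMillis = nowMillis;
        }
        String key = ip + "|" + method + "|" + path;
        int used = counts.merge(key, 1, Integer::sum);
        return used <= limit;
    }
}
```

Because each pod would hold its own `FixedWindowLimiter`, a client whose requests are spread across N pods can effectively make up to N times the configured limit, which is the main reason a distributed store comes up later in this discussion.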
Important gap: most high-traffic public `GET` endpoints are not currently rate limited, including `/timeseries`, `/catalog`, location/catalog endpoints, and other unauthenticated read paths.
Assumptions
- We should rate limit anonymous users, API key users, and OIDC/CAC users differently.
- Limits should be configurable without code changes.
- Limits should support per-endpoint or endpoint-class policies, not just one global number.
- In production, limits probably need to be distributed across nodes/pods.
- The limiter key should use a trusted client IP from the edge/proxy, not unsanitized `X-Forwarded-For`.
- Expensive endpoints should have stricter limits than cheap metadata endpoints.
- Rate limiting should return `429 Too Many Requests`, ideally with `Retry-After` and/or standard rate limit headers.
- Rate limiting should be paired with max `page-size`, max time-window, max request body, max response size, and metrics/alerting.
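The trusted-client-IP assumption above can be sketched as follows. This assumes a known, fixed number of trusted proxies in front of CDA, each appending the address of its peer to `X-Forwarded-For`; the class `ClientIpResolver` and the `trustedProxyHops` parameter are hypothetical names, and the real edge topology may need a different rule (for example an allow-list of proxy addresses).

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: picks the client IP that the trusted proxies themselves recorded. */
final class ClientIpResolver {
    private ClientIpResolver() {}

    /**
     * @param remoteAddr       socket peer address seen by the servlet container
     * @param xForwardedFor    raw X-Forwarded-For header value, may be null
     * @param trustedProxyHops number of trusted proxies between the client and this app
     */
    static String resolve(String remoteAddr, String xForwardedFor, int trustedProxyHops) {
        if (trustedProxyHops <= 0 || xForwardedFor == null || xForwardedFor.trim().isEmpty()) {
            return remoteAddr; // no trusted proxy: never believe the header
        }
        List<String> hops = new ArrayList<>();
        for (String part : xForwardedFor.split(",")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                hops.add(trimmed);
            }
        }
        if (hops.size() < trustedProxyHops) {
            return remoteAddr; // header is shorter than expected, fall back to the socket peer
        }
        // Entries to the left of this index were supplied by the client and can be spoofed.
        return hops.get(hops.size() - trustedProxyHops);
    }
}
```

With one trusted proxy, `resolve("10.0.0.5", "1.2.3.4, 203.0.113.7", 1)` yields `203.0.113.7`: the leftmost value was sent by the client and is ignored, while the rightmost was appended by the proxy.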
Candidate Approaches / Libraries
| Option | Fits Current Javalin/Tomcat Structure | Per Route | Per IP | Per User/API Key | Distributed | 429 Handling | Headers | Metrics Friendly | Notes |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Javalin `NaiveRateLimit` / `RateLimitPlugin` | [x] | [x] | [x] | [x] with custom key | [ ] | [x] | [ ] manual | [ ] manual | Easiest incremental option, but explicitly basic and in-memory. Better for local/simple protection than national production. |
| Apache Tomcat `RateLimitFilter` | [x] | [~] via filter mapping | [x] | [ ] | [ ] | [x] | [x] | [ ] manual | Very low app-code impact. Good baseline IP protection at servlet layer. Needs trusted proxy/IP handling. |
| Bucket4j | [x] | [x] | [x] | [x] | [x] with Redis/JCache/JDBC/etc. | [x] manual | [x] manual | [x] manual/Micrometer possible | Strong Java option for token-bucket policies. Probably the best app-level fit if we want distributed limits and custom keys. |
| Resilience4j `RateLimiter` | [x] | [x] | [ ] needs wrapping/key map | [ ] needs wrapping/key map | [ ] | [x] manual | [ ] manual | [x] Micrometer support | Good resilience library, but its limiter is more naturally per protected operation than per arbitrary HTTP client key. |
| Guava `RateLimiter` | [x] | [x] | [ ] needs custom map | [ ] needs custom map | [ ] | [x] manual | [ ] manual | [ ] manual | Simple local throttling primitive. Useful internally, less suitable as the main public API limiter. |
| Redisson `RRateLimiter` | [x] | [x] | [x] | [x] | [x] Redis-backed | [x] manual | [x] manual | [ ] manual | Good if Redis is acceptable infrastructure. More direct Redis dependency than Bucket4j. |
| Traefik RateLimit middleware | [x] at edge/proxy | [x] | [x] | [~] header-based | [~] depends on deployment/features | [x] | [~] | [x] via proxy metrics | Good first line of defense before traffic reaches Tomcat. Should not be the only application-aware limiter. |
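References:
- Redisson `RRateLimiter`: https://javadoc.io/doc/org.redisson/redisson/latest/org/redisson/api/RRateLimiter.html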
Possible Policy Model
A starting policy could look something like this:
| Client Type | Suggested Starting Limit | Notes |
| --- | --- | --- |
| Anonymous public users | 60 requests/minute/IP, burst 120 | Good baseline for browsers, scripts, and casual users. Forces high-volume users toward API keys. |
| API key users | 300 requests/minute/key, burst 600 | Supports legitimate automated use while making ownership and abuse tracing easier. |
| Authenticated OIDC/CAC users | 600 requests/minute/user, burst 1,200 | Higher trust, but still not unlimited because DB/API capacity is shared. |
| Internal/trusted service accounts | 1,500 requests/minute/account, burst 3,000 | Only for explicitly approved service identities with monitoring and a contact owner. Optionally set up via CWMS roles/auth? |
Burst: helps avoid punishing normal behavior, such as a web page loading several resources at once, while still limiting sustained high-volume traffic. For example, a client that has been idle can make up to 120 requests quickly, but over time it still averages around 60 requests per minute.
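The burst behavior described above is what a token bucket provides. The following is a minimal single-JVM sketch, assuming the anonymous numbers from the table (capacity 120, refill 60 per minute); `TokenBucket` is a hypothetical class written for illustration, not the API of Bucket4j or any other library listed earlier.

```java
/** Hypothetical single-client token bucket; time is passed in to keep it testable. */
final class TokenBucket {
    private final double capacity;        // burst size, e.g. 120
    private final double refillPerSecond; // sustained rate, e.g. 60/minute = 1 token/second
    private double tokens;
    private long lastRefillNanos;

    TokenBucket(double capacity, double refillPerMinute, long nowNanos) {
        this.capacity = capacity;
        this.refillPerSecond = refillPerMinute / 60.0;
        this.tokens = capacity;           // an idle client starts with the full burst allowance
        this.lastRefillNanos = nowNanos;
    }

    /** Consumes one token if available; false means the caller should answer with a 429. */
    synchronized boolean tryConsume(long nowNanos) {
        double elapsedSeconds = (nowNanos - lastRefillNanos) / 1_000_000_000.0;
        // Refill for the elapsed time, but never above the burst capacity.
        tokens = Math.min(capacity, tokens + elapsedSeconds * refillPerSecond);
        lastRefillNanos = nowNanos;
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false;
    }
}
```

One consequence worth noting when picking numbers: with these parameters an idle client can complete up to 180 requests in its first minute (120 from the full bucket plus 60 refilled), and only then settles to 60 per minute.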
Endpoint classes could also have different limits. Some examples:
| Endpoint Class | Example | Suggested Starting Limit |
| --- | --- | --- |
| Cheap metadata | `/offices`, `/parameters`, static lookup data | 300 requests/minute/IP anonymous; 1,000 requests/minute authenticated |
| Catalog queries | `/catalog`, location catalog, TS catalog | 60 requests/minute anonymous; 300 requests/minute API key/authenticated |
| Time series reads | `/timeseries` | 30 requests/minute anonymous; 180 requests/minute API key; 300 requests/minute authenticated |
| Rating calculations | `/ratings/rate-*`, `/ratings/reverse-rate-*` | Keep current 100 requests/minute anonymous; 300 requests/minute authenticated/API key |
| Writes | `POST`, `PATCH`, `DELETE` | 60 requests/minute authenticated user/API key |
| Auth/key endpoints | `/auth/*`, API key creation/deletion | 10 requests/minute/IP and 30 requests/hour/user |
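One way the endpoint-class table could translate into code is a small classifier plus a limit lookup. This is a rough sketch under several assumptions: the path has already had any servlet context prefix removed, prefix matching is good enough to identify a class, anonymous writes have no limit entry because writes require authentication, and unclassified paths fall back to the 60 requests/minute anonymous baseline. `EndpointPolicy` and its members are hypothetical names.

```java
import java.util.Locale;
import java.util.OptionalInt;

/** Hypothetical mapping from request to endpoint class and anonymous per-minute limit. */
final class EndpointPolicy {
    enum EndpointClass { METADATA, CATALOG, TIMESERIES, RATINGS, WRITE, AUTH, OTHER }

    private EndpointPolicy() {}

    static EndpointClass classify(String method, String path) {
        String m = method.toUpperCase(Locale.ROOT);
        if (path.startsWith("/auth/")) {
            return EndpointClass.AUTH;
        }
        // Rating calculations are POSTs, so check them before the generic write rule.
        if (path.startsWith("/ratings/rate-") || path.startsWith("/ratings/reverse-rate-")) {
            return EndpointClass.RATINGS;
        }
        if (m.equals("POST") || m.equals("PATCH") || m.equals("DELETE")) {
            return EndpointClass.WRITE;
        }
        if (path.startsWith("/timeseries")) {
            return EndpointClass.TIMESERIES;
        }
        if (path.startsWith("/catalog")) {
            return EndpointClass.CATALOG;
        }
        if (path.startsWith("/offices") || path.startsWith("/parameters")) {
            return EndpointClass.METADATA;
        }
        return EndpointClass.OTHER;
    }

    /** Anonymous limits from the table above; empty means no anonymous access is assumed. */
    static OptionalInt anonymousLimitPerMinute(EndpointClass endpointClass) {
        switch (endpointClass) {
            case METADATA:   return OptionalInt.of(300);
            case CATALOG:    return OptionalInt.of(60);
            case TIMESERIES: return OptionalInt.of(30);
            case RATINGS:    return OptionalInt.of(100);
            case AUTH:       return OptionalInt.of(10);
            case WRITE:      return OptionalInt.empty(); // assumption: writes need authentication
            default:         return OptionalInt.of(60);  // assumption: anonymous baseline
        }
    }
}
```

In practice the numbers would come from configuration rather than a `switch`, in line with the assumption that limits should be changeable without code changes.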
Other Mitigations To Discuss
Rate limiting alone probably will not be enough. Related work items:
- Add global max `page-size` validation.
- Reject or cap `page-size=0` where it means unlimited.
- Add maximum time-window limits for time series requests.
  - What about (eventually) cached TS and POR or composite TS?
  - Should 5-minute data allow an unlimited time window?
- Add a maximum number of time series names per request.
- Add maximum request body sizes by endpoint, not only Tomcat `maxPostSize`.
- Ensure response size or row-count caps for expensive formats.
- [Optional] Add request cost metrics by endpoint, office, authenticated principal/API key, response status, and duration.
- Add a `429` response body and headers consistently.
- [Optional] Add dashboards for top clients, top endpoints, 429 counts, DB pool wait time, and slow queries.
- [Optional] Edge/WAF limits as the first layer and app-level limits as the second layer.
- Document rate limits publicly so legitimate users know how to behave.
- Should the 429 response also include a URL to the CDA Read the Docs page on rate limiting, plus the POC below?
- Provide a contact/escalation path for users who need higher limits.
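The `page-size` and time-window items above could be enforced with simple guard functions before any database work starts. The sketch below is only illustrative: `RequestGuards`, the cap of 5000 rows, and the 366-day window are hypothetical placeholders, and it picks "cap" rather than "reject" for `page-size=0`, which is one of the two options still under discussion.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical request guards; the numeric caps are placeholders, not agreed values. */
final class RequestGuards {
    static final int MAX_PAGE_SIZE = 5000;                   // placeholder cap
    static final Duration MAX_WINDOW = Duration.ofDays(366); // placeholder cap

    private RequestGuards() {}

    /** Caps the requested page size; 0 ("unlimited") is treated as a request for the maximum. */
    static int effectivePageSize(int requested) {
        if (requested < 0) {
            throw new IllegalArgumentException("page-size must not be negative");
        }
        if (requested == 0) {
            return MAX_PAGE_SIZE;
        }
        return Math.min(requested, MAX_PAGE_SIZE);
    }

    /** Throws if the requested time window is inverted or longer than MAX_WINDOW. */
    static void requireWindowWithinLimit(Instant begin, Instant end) {
        if (end.isBefore(begin)) {
            throw new IllegalArgumentException("end must not be before begin");
        }
        if (Duration.between(begin, end).compareTo(MAX_WINDOW) > 0) {
            throw new IllegalArgumentException("time window exceeds " + MAX_WINDOW.toDays() + " days");
        }
    }
}
```

A handler would call both guards first and map `IllegalArgumentException` to a 400 response, so an oversized request costs almost nothing.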
Open Questions
- Should rate limits be enforced at the edge, in CDA, or both?
- What should be the initial limits for anonymous, API key, and authenticated users? (See the examples above.)
- Should authenticated users ever fully bypass limits?
- Which endpoints are most expensive today based on production metrics?
- Do we have or want Redis/JCache/JDBC infrastructure for distributed counters?
- Should limits be configured in properties, feature flags, database tables, or deployment config?
- What headers should CDA return for rate-limited responses?
- How do we avoid breaking legitimate automated users of the national API?
- Should we introduce user-facing API tiers or require API keys for high-volume use?
TODO
- Initial endpoint scope.
- Selected library/architecture.
- Rate-limit key strategy.
- Default anonymous/authenticated/API-key limits.
- Required metrics and dashboards (in Grafana?).
- Rollout plan with logging-only/dry-run mode before enforcement.
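For the dry-run item, one possible shape is a small gate between the limiter decision and the HTTP response. In this sketch `EnforcementGate` and its three modes are hypothetical; the idea is that in dry-run mode over-limit requests are counted (and could be logged) but still served, so proposed limits can be checked against real traffic before anyone receives a 429.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical rollout switch: count would-be rejections without enforcing them. */
final class EnforcementGate {
    enum Mode { OFF, DRY_RUN, ENFORCE }

    private final Mode mode;
    private final AtomicLong wouldReject = new AtomicLong();

    EnforcementGate(Mode mode) {
        this.mode = mode;
    }

    /** Returns true only when the request should actually be answered with a 429. */
    boolean shouldReject(boolean overLimit) {
        if (mode == Mode.OFF || !overLimit) {
            return false;
        }
        wouldReject.incrementAndGet(); // counted in both DRY_RUN and ENFORCE
        return mode == Mode.ENFORCE;
    }

    /** Number of over-limit requests seen so far; a candidate metric for dashboards. */
    long wouldRejectCount() {
        return wouldReject.get();
    }
}
```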