feat: inject backend quota rate limit filter for QuotaPolicy#1869
feat: inject backend quota rate limit filter for QuotaPolicy#1869yuzisun wants to merge 83 commits into
Conversation
8a9cfb3 to
4841c05
Compare
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #1869 +/- ##
==========================================
+ Coverage 84.42% 84.58% +0.15%
==========================================
Files 134 139 +5
Lines 19162 20203 +1041
==========================================
+ Hits 16177 17088 +911
- Misses 1998 2092 +94
- Partials 987 1023 +36 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
3580b41 to
2e78aec
Compare
|
Correct me if I am wrong
Question: How much latency is added if the request passes through the second ratelimit filter(if enabled)? |
created #1971
will address
The second ratelimit filter adds 5ms 99% and 3ms 50% to the total latency. |
daf7b79 to
9fcbcad
Compare
nacx
left a comment
There was a problem hiding this comment.
Did another review. Here are some additional notes to the comments:
- The
CostExpressionseems not yet to be implemented. Should we add a comment in the API saying the field will be implemented in the future, so users know the field is not taken into account yet? - The
modeseems not to be used, but it isrequiredin the API. Is this a leftover or something that needs to be taken into account?
33c823e to
2ee6e5a
Compare
Signed-off-by: Dan Sun <dsun20@bloomberg.net>
**Commit Message**
* adding e2e test making sure that we can use a tool and return the
proper response from the LLM
This PR also contains a few fixes that allow this test to work
* added validateToolCallID as its needed for the bedrock message or it
fails
* Validate and cast the openai content value into bedrock content block
- before we assumed it was always a string but it can also be an array
* Merge the content blocks if one has text and the other has toolcall
but no text - bedrockResp.Output.Message.Content is not 1:1 with openai
choice.Message.Content.. we get the result in chunks that should be
merged, for example the text and it's toolconfig were in seperate
elements but openai expects them to be in the same
* in openAIMessageToBedrockMessageRoleAssistant we assume that either
there is a refusal or a content text, but we actually dont pass in a
text. this was causing an error as the length of the array was set to 1
so the first was empty and there must be a key specified in each content
element.
note: i think this is another bug that im trying to look into/create
unit test for, but since there is already a lot in this PR, maybe its
best to follow up with that one the next one. basically, the assistant
openai param message's content doesn't seem to be translating to this
method, and we have no content. this doesn't prevent anything else from
working though, we are just missing a text field like
`{"content":[{"text":"Certainly! I can help you get the weather
information for New York City. To do that, I'll use the available
weather tool. Let me fetch that information for you right away."}`
the rests are tests
---------
Signed-off-by: Alexa Griffith <agriffith96@gmail.com>
Signed-off-by: Alexa Griffith <agriffith96@gmail.com>
Signed-off-by: Dan Sun <dsun20@bloomberg.net>
Co-authored-by: Dan Sun <dsun20@bloomberg.net>
Signed-off-by: Dan Sun <dsun20@bloomberg.net>
Signed-off-by: Dan Sun <dsun20@bloomberg.net>
Signed-off-by: Dan Sun <dsun20@bloomberg.net>
Signed-off-by: Dan Sun <dsun20@bloomberg.net>
Signed-off-by: achoo30 <achoo30@bloomberg.net>
Signed-off-by: achoo30 <achoo30@bloomberg.net>
32907aa to
29ffe6b
Compare
Signed-off-by: achoo30 <achoo30@bloomberg.net>
Signed-off-by: achoo30 <achoo30@bloomberg.net>
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces backend-level quota rate limiting using the new QuotaPolicy CRD, implementing a controller, an xDS config runner, and translation logic to configure Envoy's rate limit service. Key feedback includes addressing non-deterministic map iteration in the controller's config merging to prevent random xDS snapshot flips, performing deep copies of descriptors during merging to avoid cache corruption, and adding validation to ensure parsed rule indices are non-negative to prevent potential out-of-bounds panics.
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Aaron Choo <achoo30@bloomberg.net>
Signed-off-by: achoo30 <achoo30@bloomberg.net>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Aaron Choo <achoo30@bloomberg.net>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Aaron Choo <achoo30@bloomberg.net>
Signed-off-by: achoo30 <achoo30@bloomberg.net>
Signed-off-by: achoo30 <achoo30@bloomberg.net>
Signed-off-by: achoo30 <achoo30@bloomberg.net>
Signed-off-by: achoo30 <achoo30@bloomberg.net>
Signed-off-by: achoo30 <achoo30@bloomberg.net>
Signed-off-by: achoo30 <achoo30@bloomberg.net>
Signed-off-by: achoo30 <achoo30@bloomberg.net>
Signed-off-by: achoo30 <achoo30@bloomberg.net>
Signed-off-by: achoo30 <achoo30@bloomberg.net>
Signed-off-by: achoo30 <achoo30@bloomberg.net>
|
Quota Policy's model name is now based on the model name override. Updated documentation to outline that. RateLimitPerRoute is based on the httproute rule and cross-references the backend name + model name override. Multiple backends for a rule means the route will send multiple descriptors for the potential "match". Response path will only update the override+backend routed. |
Description
This is the first step for quota aware routing by injecting the backend quota rate limit filter when there is QuotaPolicy attached to the AIServiceBackend specified in the upstream cluster.
End-to-End Data Flow
User creates QuotaPolicy CR
↓
QuotaPolicy Controller reconciles
↓ translator.BuildRateLimitConfigs()
RateLimitConfig protobuf built (descriptor tree)
↓ runner.UpdateConfigs()
xDS snapshot pushed to rate limit service via gRPC
↓
Extension Server injects filter + actions into Envoy xDS
Request arrives at Envoy
↓ (request-time rate limit check)
If no available quota → 429; else continue
↓
ext_proc processes request/response, extracts token usage
↓ (stream-done rate limit action)
Rate limit filter sends HitsAddend to service
↓
Rate limit service increments Redis counter by token count
Related Issues/PRs (if applicable)
Related PR: #1709