diff --git a/app/.repos/kuma b/app/.repos/kuma
index bf32790d84..57797fbb3c 160000
--- a/app/.repos/kuma
+++ b/app/.repos/kuma
@@ -1 +1 @@
-Subproject commit bf32790d84f9a05e8f5cddc8430c164913953f8b
+Subproject commit 57797fbb3c28b48b5db457bc45ee3ad6415b3acf
diff --git a/app/_includes/ai-gateway/circuit-breaker.md b/app/_includes/ai-gateway/v1/circuit-breaker.md
similarity index 100%
rename from app/_includes/ai-gateway/circuit-breaker.md
rename to app/_includes/ai-gateway/v1/circuit-breaker.md
diff --git a/app/_includes/ai-gateway/llm-metrics.md b/app/_includes/ai-gateway/v1/llm-metrics.md
similarity index 100%
rename from app/_includes/ai-gateway/llm-metrics.md
rename to app/_includes/ai-gateway/v1/llm-metrics.md
diff --git a/app/_includes/ai-gateway/redis-fallback.md b/app/_includes/ai-gateway/v1/redis-fallback.md
similarity index 100%
rename from app/_includes/ai-gateway/redis-fallback.md
rename to app/_includes/ai-gateway/v1/redis-fallback.md
diff --git a/app/_includes/ai-gateway/v2/circuit-breaker.md b/app/_includes/ai-gateway/v2/circuit-breaker.md
new file mode 100644
index 0000000000..33c8452bbb
--- /dev/null
+++ b/app/_includes/ai-gateway/v2/circuit-breaker.md
@@ -0,0 +1,9 @@
+The [load balancer](/ai-gateway/load-balancing/) supports health checks and circuit breakers to improve reliability. If the number of unsuccessful attempts to a target reaches [`config.balancer.max_fails`](/plugins/ai-proxy-advanced/reference/#schema--config-balancer-max-fails), the load balancer stops sending requests to that target until it reconsiders the target after the period defined by [`config.balancer.fail_timeout`](/plugins/ai-proxy-advanced/reference/#schema--config-balancer-fail-timeout).
The diagram below illustrates this behavior:
+
+![Circuit breaker](/assets/images/ai-gateway/circuit-breaker.jpg){: style="display:block; margin-left:auto; margin-right:auto; width:50%; border-radius:10px" }
+
+Consider an example where [`config.balancer.max_fails`](/plugins/ai-proxy-advanced/reference/#schema--config-balancer-max-fails) is 3 and [`config.balancer.fail_timeout`](/plugins/ai-proxy-advanced/reference/#schema--config-balancer-fail-timeout) is 10 seconds. When failed requests for a target reach 3, the target is marked unhealthy and the load balancer stops sending requests to it. After 10 seconds, the target is reconsidered. If the request to this target still fails, the target remains unhealthy and the load balancer continues to exclude it. If the request succeeds, the target is marked healthy again and recovers from the circuit breaker.
+
+The failure counter tracks total failures, not consecutive failures. If a target receives 2 failed requests, then 1 successful request within the timeout window, the counter remains at 2. The counter resets only when a successful request occurs after [`config.balancer.fail_timeout`](/plugins/ai-proxy-advanced/reference/#schema--config-balancer-fail-timeout) has elapsed since the last failed request.
+
+If all targets become unhealthy simultaneously, requests fail with `HTTP 500`.
\ No newline at end of file
diff --git a/app/_includes/ai-gateway/v2/llm-metrics.md b/app/_includes/ai-gateway/v2/llm-metrics.md
new file mode 100644
index 0000000000..981cda75dd
--- /dev/null
+++ b/app/_includes/ai-gateway/v2/llm-metrics.md
@@ -0,0 +1,30 @@
+### LLM traffic metrics
+
+When the `config.ai_metrics` parameter is set to `true` in the Prometheus plugin, you can get the following [AI LLM metrics](/ai-gateway/monitor-ai-llm-metrics/#llm-traffic-metrics-overview):
+
+- **AI requests**: AI requests sent to LLM providers.
+- **AI cost**: AI cost charged by LLM providers.
+- **AI tokens**: AI tokens counted by LLM providers.
+- **AI LLM latency**: {% new_in 3.8 %} Time taken to return a response by LLM providers.
+- **AI cache fetch latency**: {% new_in 3.8 %} Time taken to return a response from the cache.
+- **AI cache embeddings latency**: {% new_in 3.8 %} Time taken to generate embeddings during caching.
+
+These metrics are available per provider, model, cache, database name (if cached), embeddings provider (if cached), embeddings model (if cached), and Workspace. The AI Tokens metrics are also available per token type.
+
+{:.info}
+> **Note:** Starting with {% new_in 3.11 %}, AI metrics include the `consumer` label. This enables you to attribute AI usage and token counts to individual Consumers, helping you measure cost, performance, and client-specific behavior.
+>
+> Starting with {% new_in 3.12 %}, AI metrics (except `kong_ai_llm_tokens_total`) include the `request_mode` label. This label shows how the request was processed:
+> - `oneshot`: A single response was returned.
+> - `stream`: The response was delivered as a stream of tokens.
+> - `realtime`: The request was handled as a real-time session.
+
+### MCP traffic metrics {% new_in 3.12 %}
+
+When the `config.ai_metrics` parameter is set to `true`, the following [MCP-specific metrics](/ai-gateway/monitor-ai-llm-metrics/#mcp-traffic-metrics-overview) are also available:
+
+- **MCP response body size**: Histogram of response body sizes (in bytes) returned by MCP servers.
+- **MCP latency**: Histogram of request latencies (in milliseconds) for MCP server calls.
+- **MCP error total**: Counter of total MCP server errors, labeled by error type.
+
+These metrics are labeled with `service`, `route`, `method`, `workspace`, and `tool_name`. The MCP error total metric also includes the `type` label.
\ No newline at end of file
diff --git a/app/_includes/ai-gateway/v2/redis-fallback.md b/app/_includes/ai-gateway/v2/redis-fallback.md
new file mode 100644
index 0000000000..659abcbb04
--- /dev/null
+++ b/app/_includes/ai-gateway/v2/redis-fallback.md
@@ -0,0 +1,5 @@
+When the `redis` strategy is used and a {{site.base_gateway}} node is disconnected from Redis, the plugin will fall back to `local` rate limiting.
+This can happen when the Redis server is down or the connection to Redis is broken.
+{{site.base_gateway}} keeps the local counters for rate limiting and syncs with Redis once the connection is re-established.
+{{site.base_gateway}} will still rate limit, but the {{site.base_gateway}} nodes can't sync the counters. As a result, users will be able
+to perform more requests than the limit, but there will still be a limit per node.
\ No newline at end of file
diff --git a/app/ai-gateway/monitor-ai-llm-metrics.md b/app/ai-gateway/monitor-ai-llm-metrics.md
index 8e5ab3685b..d18c07dbee 100644
--- a/app/ai-gateway/monitor-ai-llm-metrics.md
+++ b/app/ai-gateway/monitor-ai-llm-metrics.md
@@ -5,28 +5,20 @@ layout: reference
 products:
   - ai-gateway
-  - gateway
 breadcrumbs:
   - /ai-gateway/
 tags:
   - ai
   - monitoring
-plugins:
-  - prometheus
-  - ai-proxy
-  - ai-proxy-advanced
-
 min_version:
-  gateway: '3.7'
+  ai-gateway: '2.0'
 description: "This guide walks you through collecting AI metrics and sending them to Prometheus."
 related_resources:
   - text: "{{site.ai_gateway}}"
     url: /ai-gateway/
-  - text: "{{site.ai_gateway}} plugins"
-    url: /plugins/?category=ai
   - text: Status API
     url: /api/gateway/status/
   - text: Admin API
@@ -35,45 +27,32 @@ related_resources:
     url: /how-to/visualize-llm-metrics-with-grafana/
 works_on:
-  - on-prem
   - konnect
 ---
-{{site.ai_gateway}} calls LLM-based services according to the settings of the [AI Proxy](/plugins/ai-proxy/) and [AI Proxy Advanced](/plugins/ai-proxy-advanced/) plugins.
-You can aggregate the LLM provider responses to count the number of tokens used by the AI plugins.
-If you have defined input and output costs in the models, you can also calculate cost aggregation.
-The metrics details also expose whether the requests have been cached by {{site.base_gateway}}, saving the cost of contacting the LLM providers, which improves performance.
-
-{% new_in 3.12 %} In addition to LLM usage, {{site.ai_gateway}} also tracks MCP server traffic. MCP metrics provide visibility into latency, response sizes, and error rates when AI plugins invoke external MCP tools and servers.
+{{site.ai_gateway}} calls LLM-based services according to the settings of your [Providers](/ai-gateway/entities/ai-provider/) and [Models](/ai-gateway/entities/ai-model/). You can use the built-in logging and a [Prometheus](/plugins/prometheus/) Policy to aggregate the LLM provider responses to count the number of tokens sent through {{site.ai_gateway}}. If you have defined input and output costs in the models, you can also calculate aggregate costs. You can also track whether the requests have been cached by {{site.ai_gateway}}, saving the cost of contacting the LLM providers, which improves performance.
-{{site.ai_gateway}} exposes metrics related to Kong and proxied upstream services in
-[Prometheus](https://prometheus.io/docs/introduction/overview/)
-exposition format, which can be scraped by a Prometheus server.
+In addition to LLM usage, {{site.ai_gateway}} can also log MCP server traffic. [MCP logging](/ai-gateway/entities/ai-mcp-server/#logging-and-audits) provides visibility into latency, response sizes, and error rates when AI plugins invoke external MCP tools and servers.
-The metrics are available on both the [Admin API](/api/gateway/admin-ee/) and the
-[Status API](/api/gateway/status/) at the `http://{host}:{port}/metrics` endpoint.
-Note that the URL to those APIs is specific to your
-installation.
See [Accessing the metrics](#accessing-the-metrics) for more information.
+Create a [Prometheus Policy](/plugins/prometheus/) to expose metrics in the [Prometheus](https://prometheus.io/docs/introduction/overview/) exposition format, which can be scraped by a Prometheus server.
-The [Prometheus plugin](/plugins/prometheus/) records and exposes metrics at the node level. Your Prometheus
-server will need to discover all Kong nodes via a service discovery mechanism,
-and consume data from each node's configured `/metrics` endpoint.
+The [Prometheus Policy](/plugins/prometheus/) records and exposes metrics at the node level. Your Prometheus server will need to discover all Kong nodes via a service discovery mechanism,
+and consume data from each node's Prometheus `/metrics` endpoint.
-AI metrics exported by the plugin can be graphed in Grafana using [{{site.ai_gateway}} Dashboard](https://grafana.com/grafana/dashboards/21162-kong-cx-ai/).
+AI metrics exported by the Prometheus plugin can be graphed in Grafana using [{{site.ai_gateway}} Dashboard](https://grafana.com/grafana/dashboards/21162-kong-cx-ai/).
 ## Available metrics
 The following sections describe the AI metrics that are available.
-{% include /ai-gateway/llm-metrics.md %}
+{% include /ai-gateway/v2/llm-metrics.md %}
 ## Overview
-AI metrics are disabled by default as it may create high cardinality of metrics and may
-cause performance issues. To enable them:
+AI metrics are disabled by default because they can create a high number of metrics and may cause performance issues. To enable them:
-* Set `config.ai_metrics` to `true` in the [Prometheus plugin configuration](/plugins/prometheus/reference/).
-* Set `config.logging.log_statistics` to `true` in the [AI Proxy](/plugins/ai-proxy/reference/) or [AI Proxy Advanced plugin](/plugins/ai-proxy-advanced/reference/).
+* Set `config.ai_metrics` to `true` in the [Prometheus Policy configuration](/plugins/prometheus/reference/).
+* Set `config.logging.log_statistics` to `true` in the [Model](/ai-gateway/entities/ai-model/).
 ### LLM traffic metrics overview
@@ -112,10 +91,10 @@ ai_llm_provider_latency{ai_provider="provider1",ai_model="model1",cache_status="
 ```
 {:.info}
-> **Note:** If you don't use any cache plugins, then `cache_status`, `vector_db`,
+> **Note:** If you don't use any caching, then `cache_status`, `vector_db`,
 `embeddings_provider`, and `embeddings_model` values will be empty.
 >
-> To expose the `ai_llm_cost_total` metric, you must define the `model.options.input_cost` `model.options.output_cost` parameters. See the [AI Proxy](/plugins/ai-proxy/reference/#schema--config-model-options-input-cost) and [AI Proxy Advanced](/plugins/ai-proxy-advanced/reference/#schema--config-targets-model-options-input-cost) configuration references for more details.
+> To expose the `ai_llm_cost_total` metric, you must define the `model.options.input_cost` and `model.options.output_cost` parameters. See the [Model](/ai-gateway/entities/ai-model/) configuration reference for more details.
 ### MCP traffic metrics overview
@@ -137,12 +116,11 @@ kong_ai_mcp_error_total{service="svc1",route="route1",type="Invalid Request",met
 ## Accessing the metrics
-In most configurations, the Kong Admin API will be behind a firewall or would
+In most configurations, the Kong Admin API and Prometheus Policy will be behind a firewall or would
 need to be set up to require authentication. Here are a couple of options to allow access to the `/metrics` endpoint to Prometheus:
-
-* If the Status API is enabled with the `status_listen` parameter in the [{{site.base_gateway}} configuration](/gateway/configuration/#status-listen), then its `/metrics` endpoint can be used. This is the preferred method, and this is also the only method compatible with {{site.konnect_short_name}}, since Data Planes can't use the Admin API.
+* If the Status API is enabled with the `status_listen` parameter in the [{{site.base_gateway}} configuration](/ai-gateway/configuration/#status-listen), then its `/metrics` endpoint can be used. This is the preferred method, and this is also the only method compatible with {{site.konnect_short_name}}, since Data Planes can't use the Admin API.
 * The `/metrics` endpoint is also available on the Admin API, which can be used if the Status API is not enabled. Note that this endpoint is unavailable