Merged
6 changes: 4 additions & 2 deletions README.md
@@ -4,8 +4,10 @@ Kernel-level per-tenant admission controller built with Rust and eBPF. Enforces

## Use cases

1. **CPU/network tenant admission**: enforce per-cgroup and optional L4/L7 packet admission policies at the kernel boundary, with durable userspace policy management and Prometheus counters.
2. **LLM inference admission**: run the `examples/inference-admission` controller next to an inference stack to translate token-budget usage, KV-cache pressure, and GPU utilization into Vantage base policies and runtime overrides for inference endpoints.
1. **vLLM inference admission**: run the `examples/inference-admission` controller next to vLLM, scrape `/metrics`, and translate KV-cache pressure, queued/running request counts, and token-budget proxy metrics into Vantage base policies and runtime overrides for inference endpoints. File-backed metrics remain available as a portable fallback for demos and tests.
2. **CPU/network tenant admission**: enforce per-cgroup and optional L4/L7 packet admission policies at the kernel boundary, with durable userspace policy management and Prometheus counters.

Vantage enforces network admission at the kernel boundary. Inference controllers are userspace adapters that translate model-serving pressure signals (vLLM metrics, file fixtures, and, in the future, NVML sources) into Vantage admission policy. Semantic scheduling inside vLLM, CUDA, or the inference runtime itself is out of scope.

## How it works

80 changes: 64 additions & 16 deletions examples/inference-admission/README.md
@@ -1,34 +1,71 @@
# Inference Admission Example

`vantage-inference-admission` is a userspace controller for LLM inference workloads. It reads token-budget, KV-cache, and GPU utilization samples, then writes Vantage policies for one cgroup and one inference HTTP endpoint.
`vantage-inference-admission` is a userspace controller for LLM inference workloads. In vLLM mode it scrapes `/metrics`, converts KV-cache pressure, waiting/running request counts, and token-budget proxy metrics into a pressure sample, then writes Vantage policies for one cgroup and one inference HTTP endpoint.

This example keeps inference semantics outside the eBPF ABI. Vantage still enforces packet admission through its existing `tc` classifier; the example maps inference pressure into base policies and manual runtime overrides.
Vantage enforces network admission at the kernel boundary. Inference controllers are userspace adapters that translate model-serving pressure signals (vLLM metrics, file fixtures, and, in the future, NVML sources) into Vantage admission policy. Semantic scheduling inside vLLM, CUDA, or the inference runtime itself is out of scope.

## Run
## Demo

Start `vantage` first, then run the example:
Run the single-command demo from the repository root:

```shell
examples/inference-admission/demo.sh
```

The demo starts a mock vLLM `/metrics` server and a mock Vantage-compatible API server, then runs the controller and prints the `Normal -> Throttled -> Exhausted` transitions as they happen. It does not require a real GPU, a real vLLM process, root privileges, or eBPF attachment.

## vLLM Mode

Start `vantage`, run vLLM, then run the controller:

```shell
cargo run -p inference-admission -- \
--tenant cg:12345 \
--inference-port 8000 \
--inference-http-path /v1/chat/completions \
--metrics-file-path /tmp/vantage-inference-metrics.json \
--gpu-util-file-path /tmp/vantage-gpu-util.json
--metrics-source vllm \
--vllm-metrics-base-url http://127.0.0.1:8000 \
--vllm-metrics-path /metrics
```

The controller writes:
The vLLM adapter parses:

- `PUT /policy/cg:{id}` for the normal base policy.
- `PUT /runtime-policy/cg:{id}` when GPU, KV-cache, or token budget pressure is high.
- `DELETE /runtime-policy/cg:{id}` when all pressure signals recover below their low watermarks.
- `vllm:gpu_cache_usage_perc`
- `vllm:num_requests_waiting`
- `vllm:num_requests_running`
- `vllm:prompt_tokens_total` plus `vllm:generation_tokens_total` as token-budget proxy metrics

Runtime overrides are written through the public API as manual overrides. Do not use the same tenant/flow selector for another manual override while this example is running.
Token counters are converted into scrape-to-scrape deltas for the controller's current budget window. Exhaustion is only enforced when `--disabled-on-exhaustion` is set.
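
The parsing and delta step described above can be sketched in a few lines of Python. This is an illustration only, not the controller's Rust implementation: the function names are hypothetical, and three behaviours are assumptions rather than documented facts (samples with different label sets are summed, the first scrape only sets a baseline, and a counter that goes backwards is treated as a vLLM restart).

```python
from typing import Dict, Optional

# Metric names taken from the list above.
WANTED = (
    "vllm:gpu_cache_usage_perc",
    "vllm:num_requests_waiting",
    "vllm:num_requests_running",
    "vllm:prompt_tokens_total",
    "vllm:generation_tokens_total",
)


def parse_vllm_metrics(text: str) -> Dict[str, float]:
    """Extract the wanted samples from Prometheus text exposition output."""
    values: Dict[str, float] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "{" in line:
            # `name{labels} value [timestamp]`
            name = line[: line.index("{")]
            rest = line[line.rindex("}") + 1 :]
        else:
            # `name value [timestamp]`
            name, _, rest = line.partition(" ")
        fields = rest.split()
        if name not in WANTED or not fields:
            continue
        # Assumption: samples that differ only by labels are summed.
        values[name] = values.get(name, 0.0) + float(fields[0])
    return values


def total_tokens(metrics: Dict[str, float]) -> float:
    """Prompt plus generation tokens, the token-budget proxy."""
    return metrics.get("vllm:prompt_tokens_total", 0.0) + metrics.get(
        "vllm:generation_tokens_total", 0.0
    )


def token_delta(prev_total: Optional[float], curr_total: float) -> float:
    """Tokens consumed since the previous scrape."""
    if prev_total is None:
        return 0.0  # assumption: the first scrape only establishes a baseline
    if curr_total < prev_total:
        return curr_total  # assumption: counter reset, count from zero again
    return curr_total - prev_total
```

With the demo's mock values, moving from the `Normal` scrape (12 + 8 tokens) to the `Throttled` scrape (50 + 30 tokens) yields a delta of 60 tokens for that interval.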

## File-Backed Fallback

## Input files
File mode is the default and remains useful for portable tests and demos:

```shell
cargo run -p inference-admission -- \
--tenant cg:12345 \
--inference-port 8000 \
--inference-http-path /v1/chat/completions \
--metrics-source file \
--metrics-file-path /tmp/vantage-inference-metrics.json \
--gpu-util-file-path /tmp/vantage-gpu-util.json
```

Inference pressure:

```json
{
  "ts_unix_ms": 1710000000000,
  "tokens_used_current_minute": 54000,
  "token_budget_per_minute": 60000,
  "kv_cache_percent": 87.5,
  "active_requests": 12,
  "queued_requests": 3
}
```

The older byte-based KV-cache fields are still accepted:

```json
{
"ts_unix_ms": 1710000000000,
@@ -41,7 +78,7 @@ Inference pressure:
}
```

GPU utilization:
GPU utilization fallback:

```json
{
@@ -50,16 +87,27 @@ GPU utilization:
}
```

Missing input files are treated as empty/no-signal samples. Invalid JSON is treated as a tick failure; the controller retains its previously applied state and retries on the next tick.
Missing input files are treated as empty/no-signal samples. Invalid JSON or invalid vLLM metrics are treated as tick failures; the controller retains its previously applied state and retries on the next tick.
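
The failure handling described above can be sketched as follows. This is a hedged Python illustration, not the controller's Rust code: `load_sample`, `tick`, and the `decide` callback are hypothetical names, and the applied state is treated as an opaque value.

```python
import json
from pathlib import Path
from typing import Any, Callable, Optional


def load_sample(path: str) -> Optional[dict]:
    """Return the parsed sample, or None when the file does not exist."""
    try:
        raw = Path(path).read_text(encoding="utf-8")
    except FileNotFoundError:
        return None  # missing file: empty/no-signal sample
    sample = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on bad JSON
    if not isinstance(sample, dict):
        raise ValueError("metrics file must contain a JSON object")
    return sample


def tick(path: str, applied_state: Any, decide: Callable[[dict], Any]) -> Any:
    """Run one controller tick and return the state to keep applied."""
    try:
        sample = load_sample(path)
    except ValueError:
        # Tick failure: keep what was applied and retry on the next tick.
        return applied_state
    # A missing file is passed on as an empty sample (no signal).
    return decide(sample or {})
```

Separating "missing" from "invalid" keeps a half-written or corrupt file from being mistaken for an all-clear signal.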

## Vantage Writes

The controller writes:

- `PUT /policy/cg:{id}` for the normal base policy.
- `PUT /runtime-policy/cg:{id}` when GPU, KV-cache, or token budget pressure is high.
- `DELETE /runtime-policy/cg:{id}` when all pressure signals recover below their low watermarks.

Runtime overrides are written through the public API as manual overrides. Do not use the same tenant/flow selector for another manual override while this example is running.
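
As a rough sketch, the three writes above map onto controller modes as shown below. The Python function names are hypothetical and the policy bodies are opaque placeholders, because the payload schema is not shown here. The only field used is `enabled`, which the demo's mock Vantage server checks to tell `Exhausted` from `Throttled`; treating `enabled: false` as the exhausted override is an assumption drawn from that mock.

```python
from typing import Optional, Tuple

Request = Tuple[str, str, Optional[dict]]  # (method, path, JSON body)


def base_policy_request(tenant: str, base_policy: dict) -> Request:
    """Write the normal base policy, for example for tenant "cg:12345"."""
    return ("PUT", f"/policy/{tenant}", base_policy)


def runtime_request(mode: str, tenant: str, throttled_policy: dict) -> Request:
    """Map a controller mode onto a runtime-override request."""
    path = f"/runtime-policy/{tenant}"
    if mode == "normal":
        # All signals recovered: drop the manual override.
        return ("DELETE", path, None)
    if mode == "throttled":
        return ("PUT", path, throttled_policy)
    if mode == "exhausted":
        # Assumption: exhaustion is expressed as a disabled policy.
        return ("PUT", path, dict(throttled_policy, enabled=False))
    raise ValueError(f"unknown mode: {mode}")
```

Returning plain tuples keeps the sketch independent of any HTTP client; a caller would pass them to whichever client it already uses.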

## Scope

In scope for this example:
In scope:

- Single tenant cgroup.
- Single TCP inference endpoint.
- `POST` HTTP path selectors.
- File-backed metrics inputs.
- vLLM Prometheus metrics input.
- File-backed metrics fallback.
- Hysteresis-based normal, throttled, and exhausted modes.
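
A minimal Python sketch of hysteresis over the three modes follows, for readers unfamiliar with the pattern. The watermark values, the `token_budget` signal name, and the exact transition rules are assumptions chosen for illustration; the controller's real thresholds and rules are not shown in this README.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Watermarks:
    high: float = 0.90  # assumed: enter throttled at or above this ratio
    low: float = 0.70   # assumed: return to normal only below this ratio


def next_mode(
    current: str,
    signals: Dict[str, float],
    wm: Watermarks = Watermarks(),
    disabled_on_exhaustion: bool = False,
) -> str:
    """Pick the next mode from pressure ratios (1.0 means fully used)."""
    budget = signals.get("token_budget", 0.0)
    if disabled_on_exhaustion and budget >= 1.0:
        return "exhausted"
    if any(value >= wm.high for value in signals.values()):
        return "throttled"
    if all(value < wm.low for value in signals.values()):
        return "normal"
    # Between the watermarks: hold the current mode to avoid flapping.
    # An exhausted tenant whose budget has recovered steps down to throttled.
    return "throttled" if current == "exhausted" else current
```

The gap between the two watermarks is what prevents a signal hovering near a single threshold from toggling the override on every tick.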

Out of scope:
206 changes: 206 additions & 0 deletions examples/inference-admission/demo.sh
@@ -0,0 +1,206 @@
#!/usr/bin/env bash
set -euo pipefail

if ! command -v python3 >/dev/null 2>&1; then
  echo "python3 is required for the inference admission demo" >&2
  exit 1
fi

ROOT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")/../.." && pwd)"
TMP_DIR="$(mktemp -d)"
VLLM_PORT="18080"
VANTAGE_PORT="13000"
CONTROLLER_PID=""
VLLM_PID=""
VANTAGE_PID=""

cleanup() {
  for pid in "$CONTROLLER_PID" "$VLLM_PID" "$VANTAGE_PID"; do
    if [[ -n "$pid" ]] && kill -0 "$pid" >/dev/null 2>&1; then
      kill "$pid" >/dev/null 2>&1 || true
      wait "$pid" >/dev/null 2>&1 || true
    fi
  done
  rm -rf "$TMP_DIR"
}
trap cleanup EXIT INT TERM

cat >"$TMP_DIR/mock_vllm.py" <<'PY'
import socket
import socketserver
import sys
import time

START = time.monotonic()

def metrics():
    # Phases are time-based: 0-20 s Normal, 20-40 s Throttled, then Exhausted.
    elapsed = time.monotonic() - START
    if elapsed < 20:
        phase, gpu, waiting, running, prompt, generation = "Normal", "0.20", 0, 1, 12, 8
    elif elapsed < 40:
        phase, gpu, waiting, running, prompt, generation = "Throttled", "0.95", 8, 16, 50, 30
    else:
        phase, gpu, waiting, running, prompt, generation = "Exhausted", "0.98", 12, 20, 170, 150
    body = f"""# HELP vllm:gpu_cache_usage_perc GPU KV-cache utilization.
# TYPE vllm:gpu_cache_usage_perc gauge
vllm:gpu_cache_usage_perc {gpu}
# HELP vllm:num_requests_waiting Number of waiting requests.
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting {waiting}
# HELP vllm:num_requests_running Number of running requests.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running {running}
# HELP vllm:prompt_tokens_total Prompt tokens in the current synthetic demo window.
# TYPE vllm:prompt_tokens_total counter
vllm:prompt_tokens_total {prompt}
# HELP vllm:generation_tokens_total Generation tokens in the current synthetic demo window.
# TYPE vllm:generation_tokens_total counter
vllm:generation_tokens_total {generation}
"""
    return phase, body.encode()

def read_request(sock):
    # Read until the end of the request headers, a short timeout, or 8 KiB.
    sock.settimeout(0.25)
    data = b""
    while b"\r\n\r\n" not in data and len(data) < 8192:
        try:
            chunk = sock.recv(1024)
        except socket.timeout:
            # socket.timeout is TimeoutError on Python 3.10+ and a separate
            # OSError subclass on older versions, so this works on both.
            break
        if not chunk:
            break
        data += chunk
    return data

def respond(sock, status, body=b"", content_type="text/plain"):
    reason = "OK" if status == 200 else "Not Found"
    headers = (
        f"HTTP/1.1 {status} {reason}\r\n"
        f"Content-Length: {len(body)}\r\n"
        f"Content-Type: {content_type}\r\n"
        "Connection: close\r\n\r\n"
    ).encode()
    sock.sendall(headers + body)

class Handler(socketserver.BaseRequestHandler):
    def handle(self):
        data = read_request(self.request)
        first = data.split(b"\r\n", 1)[0].decode(errors="replace")
        path = first.split(" ")[1] if " " in first else ""
        if path != "/metrics":
            respond(self.request, 404)
            return
        phase, body = metrics()
        print(f"mock-vllm phase={phase}", flush=True)
        respond(self.request, 200, body, "text/plain; version=0.0.4")

class Server(socketserver.ThreadingTCPServer):
    allow_reuse_address = True

Server(("127.0.0.1", int(sys.argv[1])), Handler).serve_forever()
PY

cat >"$TMP_DIR/mock_vantage.py" <<'PY'
import json
import socket
import socketserver
import sys
from urllib.parse import urlparse

LAST_MODE = None

def announce(mode):
    # Print only when the admission mode actually changes.
    global LAST_MODE
    if mode != LAST_MODE:
        print(f"admission transition: {mode}", flush=True)
        LAST_MODE = mode

def read_request(sock):
    # Read the headers first, then the body according to Content-Length.
    sock.settimeout(0.25)
    data = b""
    while b"\r\n\r\n" not in data and len(data) < 65536:
        try:
            chunk = sock.recv(4096)
        except socket.timeout:
            # socket.timeout is TimeoutError on Python 3.10+ and a separate
            # OSError subclass on older versions, so this works on both.
            break
        if not chunk:
            break
        data += chunk
    headers, _, body = data.partition(b"\r\n\r\n")
    length = 0
    for line in headers.decode(errors="replace").splitlines():
        if line.lower().startswith("content-length:"):
            length = int(line.split(":", 1)[1].strip())
    while len(body) < length:
        try:
            chunk = sock.recv(length - len(body))
        except socket.timeout:
            break
        if not chunk:
            break
        body += chunk
    return headers, body

def respond(sock, status=204):
    reason = "No Content" if status == 204 else "Not Found"
    sock.sendall(
        f"HTTP/1.1 {status} {reason}\r\nContent-Length: 0\r\nConnection: close\r\n\r\n".encode()
    )

class Handler(socketserver.BaseRequestHandler):
    def handle(self):
        headers, body = read_request(self.request)
        first = headers.split(b"\r\n", 1)[0].decode(errors="replace")
        parts = first.split(" ")
        method = parts[0] if len(parts) > 0 else ""
        raw_path = parts[1] if len(parts) > 1 else ""
        path = urlparse(raw_path).path
        if method == "PUT" and path.startswith("/policy/cg:"):
            respond(self.request, 204)
            return
        if method == "PUT" and path.startswith("/runtime-policy/cg:"):
            # A runtime override with "enabled": false is reported as Exhausted.
            payload = json.loads(body.decode() or "{}")
            announce("Exhausted" if payload.get("enabled") is False else "Throttled")
            respond(self.request, 204)
            return
        if method == "DELETE" and path.startswith("/runtime-policy/cg:"):
            announce("Normal")
            respond(self.request, 204)
            return
        respond(self.request, 404)

class Server(socketserver.ThreadingTCPServer):
    allow_reuse_address = True

Server(("127.0.0.1", int(sys.argv[1])), Handler).serve_forever()
PY

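# Build before starting the mocks: the mock vLLM phases are time-based, so a
# slow compile would otherwise consume the Normal and Throttled windows.
(cd "$ROOT_DIR" && cargo build -p inference-admission)

python3 "$TMP_DIR/mock_vllm.py" "$VLLM_PORT" &
VLLM_PID="$!"
python3 "$TMP_DIR/mock_vantage.py" "$VANTAGE_PORT" &
VANTAGE_PID="$!"

sleep 1

echo "Starting inference admission controller demo."
echo "Expected visible transitions: Normal -> Throttled -> Exhausted"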

"${ROOT_DIR}/target/debug/vantage-inference-admission" \
  --vantage-base-url "http://127.0.0.1:${VANTAGE_PORT}" \
  --tenant cg:42 \
  --inference-port 8000 \
  --inference-http-path /v1/chat/completions \
  --metrics-source vllm \
  --vllm-metrics-base-url "http://127.0.0.1:${VLLM_PORT}" \
  --vllm-metrics-path /metrics \
  --tick-ms 1000 \
  --token-budget-per-minute 100 \
  --disabled-on-exhaustion &
CONTROLLER_PID="$!"

sleep 62
kill "$CONTROLLER_PID" >/dev/null 2>&1 || true
wait "$CONTROLLER_PID" >/dev/null 2>&1 || true
CONTROLLER_PID=""

echo "Demo complete."
15 changes: 15 additions & 0 deletions examples/inference-admission/fixtures/vllm_metrics_exhausted.prom
@@ -0,0 +1,15 @@
# HELP vllm:gpu_cache_usage_perc GPU KV-cache utilization.
# TYPE vllm:gpu_cache_usage_perc gauge
vllm:gpu_cache_usage_perc 0.98
# HELP vllm:num_requests_waiting Number of waiting requests.
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting 12
# HELP vllm:num_requests_running Number of running requests.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running 20
# HELP vllm:prompt_tokens_total Prompt tokens in the current synthetic demo window.
# TYPE vllm:prompt_tokens_total counter
vllm:prompt_tokens_total 70
# HELP vllm:generation_tokens_total Generation tokens in the current synthetic demo window.
# TYPE vllm:generation_tokens_total counter
vllm:generation_tokens_total 50
15 changes: 15 additions & 0 deletions examples/inference-admission/fixtures/vllm_metrics_normal.prom
@@ -0,0 +1,15 @@
# HELP vllm:gpu_cache_usage_perc GPU KV-cache utilization.
# TYPE vllm:gpu_cache_usage_perc gauge
vllm:gpu_cache_usage_perc 0.20
# HELP vllm:num_requests_waiting Number of waiting requests.
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting 0
# HELP vllm:num_requests_running Number of running requests.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running 1
# HELP vllm:prompt_tokens_total Prompt tokens in the current synthetic demo window.
# TYPE vllm:prompt_tokens_total counter
vllm:prompt_tokens_total 12
# HELP vllm:generation_tokens_total Generation tokens in the current synthetic demo window.
# TYPE vllm:generation_tokens_total counter
vllm:generation_tokens_total 8
15 changes: 15 additions & 0 deletions examples/inference-admission/fixtures/vllm_metrics_throttled.prom
@@ -0,0 +1,15 @@
# HELP vllm:gpu_cache_usage_perc GPU KV-cache utilization.
# TYPE vllm:gpu_cache_usage_perc gauge
vllm:gpu_cache_usage_perc 0.95
# HELP vllm:num_requests_waiting Number of waiting requests.
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting 8
# HELP vllm:num_requests_running Number of running requests.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running 16
# HELP vllm:prompt_tokens_total Prompt tokens in the current synthetic demo window.
# TYPE vllm:prompt_tokens_total counter
vllm:prompt_tokens_total 50
# HELP vllm:generation_tokens_total Generation tokens in the current synthetic demo window.
# TYPE vllm:generation_tokens_total counter
vllm:generation_tokens_total 30