API Reference

OlliteRT exposes an OpenAI-compatible HTTP API on your local network. Default port is 8000 (configurable in Settings).

Endpoints

| Method | Endpoint | Description |
| --- | --- | --- |
| POST | /v1/chat/completions | OpenAI Chat Completions API (streaming + non-streaming) |
| POST | /v1/completions | OpenAI Text Completions API |
| POST | /v1/responses | OpenAI Responses API |
| POST | /v1/messages | Anthropic Messages API (streaming + non-streaming) |
| POST | /v1/messages/count_tokens | Anthropic input-token estimator |
| POST | /v1/audio/transcriptions | Audio transcription |
| GET | /v1/models | List available models |
| GET | /v1/models/{id} | Get detail for a specific model |
| GET | / or /v1 | Server info (version, status, endpoints) |
| GET | /health | Health check (add ?metrics=true for detailed JSON stats) |
| GET | /metrics | Prometheus metrics (exposition format) |
| GET | /ping | Simple liveness check; returns {"status":"ok"} |

Authentication

Bearer token authentication is optional and disabled by default. When disabled, all endpoints are open — no API key or header is needed.

To enable authentication, go to Settings → Server Configuration and toggle Require Bearer Token. When enabled, include the token in the Authorization header:

Authorization: Bearer your-token

Anthropic SDK clients (Claude Code, the official Python/TypeScript SDKs) send credentials in x-api-key instead. OlliteRT accepts either header — x-api-key carries the raw token with no Bearer prefix:

x-api-key: your-token

In every example below the literal string your-token is purely a placeholder — when auth is disabled (the default) OlliteRT ignores the header value entirely, so any non-empty string works. When auth is enabled, the value must match the token configured in Settings → Server Configuration. The phone never relays credentials to the real OpenAI or Anthropic APIs.

See the Security Guide for details on network exposure and credential storage.
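As a rough illustration of the two header forms described above, the following Python sketch builds request headers in either style. The helper name `auth_headers` and its `style` argument are hypothetical and not part of OlliteRT.

```python
def auth_headers(token=None, style="bearer"):
    """Return request headers carrying the token in one of the two accepted forms."""
    headers = {"Content-Type": "application/json"}
    if token is None:
        # Auth disabled (the default): no credential header is needed.
        return headers
    if style == "bearer":
        headers["Authorization"] = f"Bearer {token}"
    elif style == "x-api-key":
        # Anthropic-style header: raw token, no "Bearer " prefix.
        headers["x-api-key"] = token
    else:
        raise ValueError(f"unknown auth style: {style!r}")
    return headers
```

Passing the result as the headers of any HTTP client request should work for both the OpenAI-style and Anthropic-style endpoints, since the server is documented to accept either header.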

Tip

All inference endpoints accept the same core parameters (temperature, top_p, top_k, max_tokens, stream). The parameter tables below document each endpoint's full set.

Chat Completions — POST /v1/chat/completions

Request Body

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| model | string | Yes | Model name (e.g. Gemma-4-E2B-it) |
| messages | array | Yes | Array of message objects (role + content) |
| stream | boolean | No | Enable SSE streaming (default: false) |
| stream_options | object | No | Streaming options. Set {"include_usage": true} to receive a usage chunk before [DONE] |
| temperature | number | No | Sampling temperature (0.0 - 2.0) |
| top_p | number | No | Nucleus sampling threshold |
| top_k | integer | No | Top-k sampling |
| max_tokens | integer | No | Maximum tokens to generate |
| max_completion_tokens | integer | No | Alias for max_tokens |
| stop | string or array | No | Stop sequence(s) |
| tools | array | No | Tool/function definitions for tool calling |
| tool_choice | string or object | No | Tool selection strategy (auto, none, or specific tool) |
| response_format | object | No | Response format ({"type": "json_object"} for JSON mode) |

Message Object

| Field | Type | Description |
| --- | --- | --- |
| role | string | system, user, assistant, or tool |
| content | string or array | Text content, or array of content parts for multimodal |
| tool_call_id | string | Required for role: "tool"; references the tool call being responded to |
| name | string | Function name (for tool messages) |

Multimodal Content

For vision and audio input, use content parts:

Image:

{
  "role": "user",
  "content": [
    {"type": "text", "text": "What's in this image?"},
    {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}}
  ]
}

Audio:

{
  "role": "user",
  "content": [
    {"type": "input_audio", "input_audio": {"data": "<base64-encoded-audio>", "format": "wav"}}
  ]
}

Supported audio formats: wav, mp3, ogg, flac. Audio is processed as mono; stereo input is automatically downmixed.

Tip

For dedicated audio transcription, use the /v1/audio/transcriptions endpoint instead.
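The content parts shown above can be assembled from raw bytes with a few lines of Python. This is a minimal sketch: the helper names (`image_part`, `audio_part`, `user_message`) are hypothetical, and the client-side format check simply mirrors the list of supported audio formats given above.

```python
import base64

AUDIO_FORMATS = {"wav", "mp3", "ogg", "flac"}

def image_part(image_bytes, mime="image/png"):
    """Wrap raw image bytes in an image_url content part using a data: URL."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{encoded}"}}

def audio_part(audio_bytes, fmt="wav"):
    """Wrap raw audio bytes in an input_audio content part."""
    if fmt not in AUDIO_FORMATS:
        raise ValueError(f"unsupported audio format: {fmt!r}")
    encoded = base64.b64encode(audio_bytes).decode("ascii")
    return {"type": "input_audio", "input_audio": {"data": encoded, "format": fmt}}

def user_message(parts, text=None):
    """Build a user message; an optional text part is placed first."""
    content = [{"type": "text", "text": text}] if text else []
    content.extend(parts)
    return {"role": "user", "content": content}
```

For example, `user_message([image_part(png_bytes)], text="What's in this image?")` produces the same shape as the image example above.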

Response

{
  "id": "chatcmpl-...",
  "object": "chat.completion",
  "created": 1234567890,
  "model": "Gemma-4-E2B-it",
  "system_fingerprint": null,
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": "Hello! How can I help you?"
    },
    "finish_reason": "stop"
  }],
  "usage": {
    "prompt_tokens": 10,
    "completion_tokens": 8,
    "total_tokens": 18
  }
}

finish_reason values: "stop" (natural end or stop sequence), "length" (output truncated by max_tokens), "tool_calls" (model invoked a tool).
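To make the request and response shapes concrete, here is a small standard-library sketch that prepares a non-streaming request and reads the first choice of a response. The function names are hypothetical, and the base URL is whatever address the phone is reachable at; nothing is sent until the request is passed to `urllib.request.urlopen`.

```python
import json
import urllib.request

def build_chat_request(base_url, model, user_text, max_tokens=256, token=None):
    """Prepare (but do not send) a POST to /v1/chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
        "max_tokens": max_tokens,
    }
    headers = {"Content-Type": "application/json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(
        base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )

def read_choice(response):
    """Extract the text and finish_reason from a parsed response body."""
    choice = response["choices"][0]
    reason = choice["finish_reason"]
    return {
        "text": choice["message"].get("content") or "",
        "finish_reason": reason,
        "truncated": reason == "length",      # output cut off by max_tokens
        "wants_tool": reason == "tool_calls",  # model invoked a tool
    }
```

Checking `truncated` before using the text is a cheap way to notice that `max_tokens` was too small for the reply.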

Note: The system_fingerprint field is always null. The LiteRT runtime does not expose a tokenizer or model configuration hash, so there is no meaningful fingerprint to generate. Clients that check this field should treat null as "unknown configuration."

Streaming Response

When stream: true, the response is sent as Server-Sent Events:

data: {"id":"chatcmpl-...","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}

data: {"id":"chatcmpl-...","choices":[{"index":0,"delta":{"content":"Hello"},"finish_reason":null}]}

data: {"id":"chatcmpl-...","choices":[{"index":0,"delta":{"content":"!"},"finish_reason":null}]}

data: {"id":"chatcmpl-...","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}

data: [DONE]

When stream_options: {"include_usage": true} is set, a usage chunk is emitted before [DONE]:

data: {"id":"chatcmpl-...","choices":[],"usage":{"prompt_tokens":10,"completion_tokens":8,"total_tokens":18}}

data: [DONE]

Without stream_options (the default), no usage chunk is emitted — the stream ends with the finish_reason chunk followed by [DONE].
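The chunk sequence above can be consumed with a short loop. The following Python sketch takes an iterable of already-decoded lines (for example, lines read from a streaming HTTP response) and accumulates the delta text, the final finish_reason, and the optional usage chunk. The function name is hypothetical.

```python
import json

def collect_chat_stream(lines):
    """Accumulate text, finish_reason and usage from Chat Completions SSE lines."""
    text_parts = []
    finish_reason = None
    usage = None
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # blank separator lines carry no payload
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        if chunk.get("usage"):
            usage = chunk["usage"]  # only present with include_usage
        for choice in chunk.get("choices", []):
            text_parts.append(choice.get("delta", {}).get("content") or "")
            if choice.get("finish_reason"):
                finish_reason = choice["finish_reason"]
    return "".join(text_parts), finish_reason, usage
```

When `stream_options` is not set, the third element of the result stays `None`, matching the behaviour described above.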

Text Completions — POST /v1/completions

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| model | string | Yes | Model name |
| prompt | string | Yes | Text prompt |
| stream | boolean | No | Enable SSE streaming |
| temperature | number | No | Sampling temperature |
| max_tokens | integer | No | Maximum tokens to generate |

Responses API — POST /v1/responses

An alternative request format. The conversation is supplied either as a messages array or through the input field.

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| model | string | Yes | Model name |
| input | string or array | Yes | Input text or messages array |
| stream | boolean | No | Enable SSE streaming |
| tools | array | No | Tool definitions |
| tool_choice | string or object | No | Tool selection strategy (auto, none, or specific tool) |
| temperature | number | No | Sampling temperature |
| top_p | number | No | Nucleus sampling threshold |
| top_k | integer | No | Top-k sampling |
| max_output_tokens | integer | No | Maximum tokens to generate |

Streaming Response

When stream: true, the Responses API uses typed Server-Sent Events: every frame carries an event: line in addition to its data: line, unlike Chat Completions, which sends data: lines only. Each SSE frame has the format:

event: <event-type>
data: <JSON payload>

The full event sequence for a text response:

event: response.created
data: {"type":"response.created","response":{"id":"resp-...","object":"response","created_at":1234567890,"status":"in_progress","model":"Gemma-4-E2B-it","output":[]}}

event: response.in_progress
data: {"type":"response.in_progress","response":{"id":"resp-...","object":"response","created_at":1234567890,"status":"in_progress","model":"Gemma-4-E2B-it","output":[]}}

event: response.output_item.added
data: {"type":"response.output_item.added","item":{"id":"msg-...","type":"message","status":"in_progress","content":[],"role":"assistant"},"output_index":0,"sequence_number":0}

event: response.content_part.added
data: {"type":"response.content_part.added","content_index":0,"item_id":"msg-...","output_index":0,"part":{"type":"output_text","annotations":[],"logprobs":[],"text":""}}

event: response.output_text.delta
data: {"type":"response.output_text.delta","content_index":0,"delta":"Hello","item_id":"msg-...","output_index":0}

event: response.output_text.delta
data: {"type":"response.output_text.delta","content_index":0,"delta":"!","item_id":"msg-...","output_index":0}

event: response.output_text.done
data: {"type":"response.output_text.done","content_index":0,"item_id":"msg-...","output_index":0,"text":"Hello!"}

event: response.content_part.done
data: {"type":"response.content_part.done","content_index":0,"item_id":"msg-...","output_index":0,"part":{"type":"output_text","annotations":[],"logprobs":[],"text":"Hello!"}}

event: response.output_item.done
data: {"type":"response.output_item.done","item":{"id":"msg-...","type":"message","status":"completed","content":[{"type":"output_text","annotations":[],"logprobs":[],"text":"Hello!"}],"role":"assistant"},"output_index":0}

event: response.completed
data: {"type":"response.completed","response":{"id":"resp-...","object":"response","created_at":1234567890,"status":"completed","model":"Gemma-4-E2B-it","output":[...],"usage":{"input_tokens":10,"output_tokens":2,"total_tokens":12}}}

data: [DONE]

The final data: [DONE] line has no event: prefix — it signals the end of the stream (same as Chat Completions).
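Because each frame pairs an event: line with a data: line, a client needs a small frame parser before it can dispatch on the event type. The Python sketch below groups lines into frames at blank-line boundaries and then collects the response.output_text.delta payloads; the function names are hypothetical and the input is assumed to be an iterable of decoded text lines.

```python
import json

def iter_sse_events(lines):
    """Yield (event_name, data_string) pairs; frames are separated by blank lines."""
    event, data = None, []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            if data:
                yield event, "\n".join(data)
            event, data = None, []
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data.append(line[len("data:"):].lstrip())
    if data:
        # Final frame without a trailing blank line (e.g. data: [DONE]).
        yield event, "\n".join(data)

def collect_response_text(lines):
    """Join output_text deltas and pick up usage from response.completed."""
    parts, usage = [], None
    for event, data in iter_sse_events(lines):
        if data == "[DONE]":
            break  # the sentinel frame has no event: line
        payload = json.loads(data)
        if event == "response.output_text.delta":
            parts.append(payload["delta"])
        elif event == "response.completed":
            usage = payload["response"].get("usage")
    return "".join(parts), usage
```

Events the client does not care about (response.created, response.content_part.added, and so on) are parsed and ignored, so the loop keeps working if additional event types appear.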

Anthropic Messages — POST /v1/messages

Anthropic-compatible Messages API. Lets Claude Code and the official Anthropic SDKs (Python, TypeScript) target the phone directly with no proxy. The handler translates the Anthropic request into the internal chat-completion pipeline and re-shapes the response into Anthropic's content-block format.

Warning

Experimental. Wire-level support for the Messages API is implemented and stable, but on-device models in the Gemma-4-E2B / 3n class do not have the context budget or instruction-following headroom to drive Claude Code (large system prompt, dense tool surface) reliably. Expect long prefill, frequent tool-call mistakes, and the LiteRT-LM #2418 parse failures noted below. Use the OpenAI-compatible endpoints for production workflows; treat this surface as a smoke test for the Anthropic API.

Request Body

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| model | string | Yes | Model name (e.g. Gemma-4-E2B-it) |
| messages | array | Yes | Array of message objects (role + content) |
| max_tokens | integer | Yes | Maximum tokens to generate |
| system | string or array | No | System prompt: a string for the simple form, or an array of {type:"text", text:"..."} blocks |
| stream | boolean | No | Enable SSE streaming (default: false) |
| temperature | number | No | Sampling temperature |
| top_p | number | No | Nucleus sampling threshold |
| top_k | integer | No | Top-k sampling |
| stop_sequences | array | No | Stop strings |
| tools | array | No | Tool definitions in Anthropic shape ({name, description, input_schema}) |
| tool_choice | object | No | {type:"auto"}, {type:"any"}, {type:"none"}, or {type:"tool", name:"..."} |
| thinking | object | No | {type:"enabled"} / {type:"disabled"}. Per-request override of the model's persisted thinking setting (only applied when the model supports thinking) |

The following Anthropic features are accepted on the wire but silently dropped because LiteRT-LM has no equivalent: metadata, service_tier, cache_control, parallel_tool_calls, echoed thinking blocks. URL-sourced images, document blocks, and computer_* / text_editor_* / bash_* tool types return HTTP 400.

Response (non-streaming)

{
  "id": "msg_...",
  "type": "message",
  "role": "assistant",
  "model": "Gemma-4-E2B-it",
  "content": [
    {"type": "text", "text": "Hello!"}
  ],
  "stop_reason": "end_turn",
  "stop_sequence": null,
  "usage": {"input_tokens": 12, "output_tokens": 4}
}

stop_reason is one of end_turn, max_tokens, stop_sequence, or tool_use. When stop_reason is stop_sequence, the stop_sequence field echoes the matched string. Tool calls produce {type:"tool_use", id, name, input} content blocks.
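Since the content array can mix text and tool_use blocks, a client usually separates the two before acting on the reply. A minimal Python sketch, with a hypothetical function name, operating on the parsed response body:

```python
def split_content(message):
    """Separate text from tool_use blocks in a Messages response body."""
    text_parts = []
    tool_calls = []
    for block in message.get("content", []):
        if block.get("type") == "text":
            text_parts.append(block["text"])
        elif block.get("type") == "tool_use":
            # Each tool call carries an id, the tool name and its parsed input.
            tool_calls.append((block["id"], block["name"], block["input"]))
    return "".join(text_parts), tool_calls
```

When stop_reason is tool_use, the second element lists the calls the client is expected to execute and answer.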

Streaming

When stream: true, the response is a Server-Sent Events stream that follows Anthropic's documented event sequence:

event: message_start
data: {"type":"message_start","message":{...}}

event: content_block_start
data: {"type":"content_block_start","index":0,"content_block":{"type":"text","text":""}}

event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"Hello"}}

event: content_block_stop
data: {"type":"content_block_stop","index":0}

event: message_delta
data: {"type":"message_delta","delta":{"stop_reason":"end_turn","stop_sequence":null},"usage":{...}}

event: message_stop
data: {"type":"message_stop"}

OlliteRT also emits event: ping events every 10 s while the model is still in prefill, so SDK clients don't time out during long on-device prefill (Gemma-4-E2B routinely takes 30–60 s to first token). Mid-stream errors surface as event: error with a payload of the form {"type":"error","error":{"type":"...","message":"..."}}.
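Every data: payload in this stream carries its own "type" field, so a simple client can dispatch on the payload alone and skip the event: lines. The sketch below does that, ignores keep-alive pings, and raises on a mid-stream error. The names are hypothetical, and the ping payload shape ({"type":"ping"}) is an assumption based on Anthropic's documented format rather than something stated here.

```python
import json

class StreamError(RuntimeError):
    """Raised when the server reports an error in the middle of a stream."""

def collect_message_stream(lines):
    """Accumulate text deltas and the stop_reason from Messages SSE lines."""
    text_parts = []
    stop_reason = None
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # event: lines and blank separators
        payload = json.loads(line[len("data:"):])
        kind = payload.get("type")
        if kind == "ping":
            continue  # keep-alive during prefill (assumed payload shape)
        if kind == "error":
            err = payload.get("error", {})
            raise StreamError(f"{err.get('type')}: {err.get('message')}")
        if kind == "content_block_delta":
            delta = payload.get("delta", {})
            if delta.get("type") == "text_delta":
                text_parts.append(delta["text"])
        elif kind == "message_delta":
            stop_reason = payload.get("delta", {}).get("stop_reason")
        elif kind == "message_stop":
            break
    return "".join(text_parts), stop_reason
```

A client built this way should set its read timeout comfortably above the 10 s ping interval, since pings are the only traffic during prefill.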

Known Issues

Warning

Gemma 4 native tool calling is unreliable. When a tool argument is a string containing quoted content (Bash command, Edit old_string, WebFetch URL, JSON-in-a-string), Gemma-4 emits its trained <|"|> quote delimiter for the inner quotes. LiteRT-LM 0.11.0 / 0.12.0's ANTLR function-call parser does not understand this token and raises INVALID_ARGUMENT, which surfaces as a 500 to the client. Affects every Anthropic tool-using client (notably Claude Code, which always sends Bash / Edit / Read tool definitions). Tracking upstream: google-ai-edge/LiteRT-LM#2418. Workaround: turn off Settings → Schema Injection so tool calls go through the text-mode parser instead.

Example (curl, non-streaming)

curl http://PHONE_IP:8000/v1/messages \
  -H "x-api-key: your-token" \
  -H "anthropic-version: 2023-06-01" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Gemma-4-E2B-it",
    "max_tokens": 256,
    "messages": [{"role": "user", "content": "Say hello"}]
  }'

Example (Claude Code)

ANTHROPIC_BASE_URL=http://PHONE_IP:8000 \
ANTHROPIC_AUTH_TOKEN=your-token \
claude

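Claude Code sends ANTHROPIC_AUTH_TOKEN as Authorization: Bearer your-token (ANTHROPIC_API_KEY is the variable it sends as x-api-key); OlliteRT accepts either header. The /v1 segment is appended automatically.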

Anthropic Token Counter — POST /v1/messages/count_tokens

Estimates the input-token count for a Messages-shaped request without running inference. Works even when no model is loaded.

The body accepts the same fields as /v1/messages; max_tokens is optional here. The response is:

{"input_tokens": 1042}

Counts are estimated as chars / 4 (the same heuristic OlliteRT uses across the request log). This is not a tokenizer-exact count — there is no public LiteRT tokenizer API — but it tracks within ±20% of the runtime count for English chat traffic.
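The chars / 4 heuristic is easy to reproduce client-side for a rough budget check. The sketch below is an approximation under stated assumptions: which request fields OlliteRT actually counts and how it rounds are not documented here, so this version counts only the system prompt and text content of messages and rounds down. The function names are hypothetical.

```python
def _text_chars(value):
    """Count characters in a plain string or in a list of text blocks."""
    if isinstance(value, str):
        return len(value)
    if isinstance(value, list):
        return sum(
            len(block.get("text", ""))
            for block in value
            if isinstance(block, dict) and block.get("type") == "text"
        )
    return 0

def estimate_input_tokens(request):
    """Estimate input tokens as characters // 4 (assumed fields, floor rounding)."""
    chars = _text_chars(request.get("system"))
    for message in request.get("messages", []):
        chars += _text_chars(message.get("content"))
    return chars // 4
```

Under these assumptions, a request carrying 4168 characters of text would be estimated at 4168 / 4 = 1042 tokens.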

Audio Transcriptions — POST /v1/audio/transcriptions

Accepts an audio file via multipart/form-data and returns a text transcription.

Requires a model with audio capability (e.g. Gemma 4, Gemma 3n).

Request Body (multipart/form-data)

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| file | file | Yes | Audio file to transcribe (max 25 MB) |
| model | string | No | Model name (ignored; the currently loaded model is used) |
| language | string | No | Language hint (e.g. en, de, ja) |
| prompt | string | No | Context hint to guide transcription |
| temperature | number | No | Sampling temperature override |
| response_format | string | No | json (default), text, or verbose_json |

Supported audio formats: WAV, MP3, OGG (Vorbis), FLAC. Stereo WAV (16-bit PCM) is automatically downmixed to mono; other formats should be mono before sending.

Response Formats

json (default), with Content-Type: application/json

{"text": "The transcribed text from the audio file."}

text, with Content-Type: text/plain

The transcribed text from the audio file.

verbose_json, with Content-Type: application/json

{
  "task": "transcribe",
  "language": "en",
  "duration": 3.456,
  "text": "The transcribed text from the audio file.",
  "segments": [{
    "id": 0,
    "seek": 0,
    "start": 0.0,
    "end": 3.456,
    "text": "The transcribed text from the audio file."
  }]
}

duration reflects LLM inference time, not audio length. The model returns raw text without word-level timing, so the output contains a single segment spanning the full duration.

Note: srt and vtt formats are not supported — the LiteRT runtime does not provide word-level timing data required for subtitle generation. Requesting these formats returns HTTP 400.

Example (curl)

curl http://PHONE_IP:8000/v1/audio/transcriptions \
  -H "Authorization: Bearer your-token" \
  -F file=@recording.wav \
  -F response_format=json
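For clients without curl, the same multipart/form-data request can be assembled with the Python standard library. This is a sketch under assumptions: the function name is hypothetical, the file part is labelled application/octet-stream (whether the server inspects the part's content type is not documented here), and unsupported formats such as srt are rejected client-side before anything is sent.

```python
import urllib.request
import uuid

SUPPORTED_FORMATS = {"json", "text", "verbose_json"}

def build_transcription_request(base_url, filename, audio_bytes,
                                response_format="json", token=None):
    """Prepare (but do not send) a multipart POST to /v1/audio/transcriptions."""
    if response_format not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported response_format: {response_format!r}")
    boundary = uuid.uuid4().hex
    crlf = b"\r\n"
    body = b"".join([
        # Plain form field: response_format
        f"--{boundary}".encode(), crlf,
        b'Content-Disposition: form-data; name="response_format"', crlf, crlf,
        response_format.encode(), crlf,
        # File field: the raw audio bytes
        f"--{boundary}".encode(), crlf,
        f'Content-Disposition: form-data; name="file"; filename="{filename}"'.encode(), crlf,
        b"Content-Type: application/octet-stream", crlf, crlf,
        audio_bytes, crlf,
        # Closing boundary
        f"--{boundary}--".encode(), crlf,
    ])
    headers = {"Content-Type": f"multipart/form-data; boundary={boundary}"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(
        base_url.rstrip("/") + "/v1/audio/transcriptions",
        data=body, headers=headers, method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would send it; the response body is then JSON or plain text depending on response_format.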

Models — GET /v1/models

Returns a list of available models with their capabilities and update status.

{
  "object": "list",
  "data": [{
    "id": "Gemma-4-E2B-it",
    "object": "model",
    "created": 1234567890,
    "owned_by": "ollitert",
    "capabilities": {
      "image": true,
      "audio": true,
      "thinking": true,
      "speculative_decoding": true
    },
    "update_available": false
  }]
}

| Field | Type | Description |
| --- | --- | --- |
| id | string | Model name |
| object | string | Always "model" |
| created | integer | Unix timestamp |
| owned_by | string | Always "ollitert" |
| capabilities | object | image, audio, thinking, speculative_decoding booleans. thinking indicates the model supports chain-of-thought AND it is currently enabled in settings (not just model capability). speculative_decoding indicates MTP is supported AND enabled. |
| update_available | boolean | true if a newer version of this model is available in the allowlist |
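A client that needs, say, an audio-capable model can filter the list on the capabilities object. A minimal Python sketch with a hypothetical function name, operating on the parsed /v1/models response:

```python
def models_with(models_response, *capabilities):
    """Return ids of models whose capabilities include every requested flag."""
    matches = []
    for model in models_response.get("data", []):
        caps = model.get("capabilities", {})
        if all(caps.get(name) is True for name in capabilities):
            matches.append(model["id"])
    return matches
```

Because thinking and speculative_decoding reflect the current settings as well as model support, the result of such a filter can change when those settings are toggled.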

Model Detail — GET /v1/models/{id}

Returns detail for a specific model by name. The model ID is case-insensitive. Returns 404 unless the model is currently loaded or has been idle-unloaded by keep-alive.

The response has the same shape as a single entry from the /v1/models list.

Health — GET /health

Returns server health status. Also available at /v1/health.

Base Response

{
  "status": "ok",
  "model": "Gemma-4-E2B-it",
  "uptime_seconds": 3600,
  "update_available": false
}

| Field | Type | Description |
| --- | --- | --- |
| status | string | ok, idle (keep-alive unloaded), loading, stopped, error |
| model | string | Currently loaded (or idle-unloaded) model name. Omitted if no model. |
| uptime_seconds | integer | Seconds since server entered RUNNING state. Omitted if not running. |
| update_available | boolean | true if a newer OlliteRT version exists |

Extended Response — GET /health?metrics=true

Appends server info and a metrics object to the base response:

| Field | Type | Description |
| --- | --- | --- |
| version | string | OlliteRT version string |
| thinking_enabled | boolean | Whether chain-of-thought mode is active |
| speculative_decoding_enabled | boolean | Whether speculative decoding (MTP) is active |
| accelerator | string | gpu, cpu, or gpu,cpu |
| is_idle_unloaded | boolean | true if model was unloaded by keep-alive timeout |
| metrics.requests_total | integer | Total requests processed |
| metrics.errors_total | integer | Total request errors |
| metrics.prompt_tokens_total | integer | Total prompt tokens (estimated) |
| metrics.generation_tokens_total | integer | Total generated tokens (estimated) |
| metrics.requests_text | integer | Total text-only requests |
| metrics.requests_image | integer | Total image multimodal requests |
| metrics.requests_audio | integer | Total audio multimodal requests |
| metrics.ttfb_last_ms | number | Last request time to first token (ms) |
| metrics.ttfb_avg_ms | number | Average time to first token (ms) |
| metrics.decode_tokens_per_second | number | Last request decode throughput (tokens/s) |
| metrics.decode_tokens_per_second_peak | number | Peak decode throughput since start |
| metrics.prefill_tokens_per_second | number | Last request prefill throughput (tokens/s) |
| metrics.inter_token_latency_ms | number | Last inter-token latency (ms) |
| metrics.request_latency_last_ms | number | Last request total latency (ms) |
| metrics.request_latency_avg_ms | number | Average request latency (ms) |
| metrics.request_latency_peak_ms | number | Peak request latency (ms) |
| metrics.context_utilization_percent | number | Last request context window usage (%) |
| metrics.model_load_time_seconds | number | Model load/warmup time (seconds) |
| metrics.is_inferring | boolean | true if a request is currently being processed |
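As one example of using the extended response, the error rate can be derived from the two request counters. The sketch assumes the metrics.* fields are nested under a "metrics" key of the parsed JSON, as the phrase "a metrics object" suggests; the function name is hypothetical.

```python
def health_error_rate(health):
    """Return errors_total / requests_total, or 0.0 when no requests were served."""
    metrics = health.get("metrics", {})  # assumed nesting of the metrics.* fields
    total = metrics.get("requests_total", 0)
    if not total:
        return 0.0
    return metrics.get("errors_total", 0) / total
```

A monitoring script could poll /health?metrics=true and alert when this ratio crosses a threshold of its choosing.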

Server Info — GET / or GET /v1

Returns server identity, version, status, update availability, and the full list of supported endpoints. Does not require authentication.

{
  "name": "OlliteRT",
  "version": "1.2.0",
  "build": 42,
  "git_hash": "abc1234",
  "status": "running",
  "model": "Gemma-4-E2B-it",
  "uptime_seconds": 3600,
  "update_available": false,
  "allowlist_content_version": 3,
  "allowlist_source": "asset",
  "model_update_available": false,
  "compatibility": "openai",
  "endpoints": ["/v1/models", "/v1/completions", "/v1/chat/completions", "..."]
}

| Field | Type | Description |
| --- | --- | --- |
| name | string | Always "OlliteRT" |
| version | string | App version (e.g. "1.2.0") |
| build | integer | Version code |
| git_hash | string | Build git commit hash |
| status | string | running, idle (keep-alive unloaded), loading, stopped, error |
| model | string | Currently loaded model name (omitted if none) |
| uptime_seconds | integer | Seconds since RUNNING state (omitted if not running) |
| update_available | boolean | true if a newer OlliteRT version exists |
| latest_version | string | Newest available version (only present when update_available is true) |
| release_url | string | GitHub release URL (only present when update_available is true) |
| allowlist_content_version | integer | Version number of the model allowlist currently cached |
| allowlist_source | string | Source of the active allowlist: "asset", "external:<path>", "empty", or "error" |
| model_update_available | boolean | true if the currently loaded model has a newer version in the allowlist |
| compatibility | string | Always "openai" |
| endpoints | array | List of supported endpoint paths |

Error Responses

Note

All errors follow the standard OpenAI error format, so existing client libraries handle them correctly.

{
  "error": {
    "message": "Model is not loaded",
    "type": "server_error",
    "param": null,
    "code": null
  }
}

| Status | When |
| --- | --- |
| 400 | Malformed request, missing required fields |
| 401 | Missing or invalid bearer token |
| 404 | Not Found: model or endpoint doesn't exist |
| 405 | Method Not Allowed: wrong HTTP method for endpoint |
| 413 | Payload Too Large: request body exceeds size limit |
| 500 | Internal server error |
| 503 | Model not loaded or server not ready |

See Troubleshooting → Connection Issues for detailed explanations of each error code.
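Clients that do not use an OpenAI library can unpack the error body themselves. The sketch below parses the documented shape and adds a retry hint; treating only 503 as retryable is an illustrative policy choice, not something OlliteRT prescribes, and the function name is hypothetical.

```python
import json

RETRYABLE_STATUSES = {503}  # assumption: model still loading or server not ready

def parse_error(status, body):
    """Extract message and type from an error body, tolerating malformed input."""
    try:
        parsed = json.loads(body)
    except ValueError:
        parsed = None
    error = parsed.get("error") if isinstance(parsed, dict) else None
    if not isinstance(error, dict):
        error = {}
    return {
        "status": status,
        "message": error.get("message") or "",
        "type": error.get("type"),
        "retryable": status in RETRYABLE_STATUSES,
    }
```

Falling back to an empty message keeps the caller's error handling simple when a proxy or network device returns a non-JSON body.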

Prometheus Metrics — GET /metrics

Returns server metrics in Prometheus exposition format (text/plain; version=0.0.4). Includes 10 counters and 19 gauges covering throughput, latency, token counts, memory, and more.

For the full list of metrics and Grafana setup, see the Prometheus Integration Guide.
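For quick checks without a Prometheus server, the exposition text can be parsed directly. This is a simplified sketch: it skips comment lines, treats the last space-separated token as the value, and therefore assumes samples carry no trailing timestamp. The function name is hypothetical, and any metric names used with it should come from the integration guide rather than from this example.

```python
def parse_metrics(text):
    """Map each sample line ("name{labels} value") to a float value."""
    samples = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and # HELP / # TYPE comments
        name, _, value = line.rpartition(" ")
        if not name:
            continue  # no value column: not a sample line
        samples[name] = float(value)
    return samples
```

The keys keep any label set verbatim, which is enough for spot checks; anything more involved is better served by a proper Prometheus client library.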