fix(ollama): keep_alive=-1 + bump api timeout 10s→30s by mrviduus · Pull Request #234 · mrviduus/textstack

mrviduus · 2026-05-08T00:26:29Z

Why

A user noticed that prod RAM showed only 3.1 GiB used even though gemma4:e4b needs 9.8 GiB: the model unloads after sitting idle. A cold reload takes ~50-60s on CPU, and Ollama's pre-load memory check is conservative ("requires 9.8 GiB but only 8.2 GiB available" even though the host has 27 GiB available). Both make the fire-and-forget distractor calls fail intermittently.

What

1. `OLLAMA_KEEP_ALIVE=-1` in compose env. Once the model loads, it stays resident for the lifetime of the container. A cold load then happens only on container restart (i.e. on each deploy), not after every 5-min idle window.
2. `Api/appsettings.json`: `Ollama:TimeoutSeconds` 10s → 30s. The worker was already at 30s; the API was at 10s. Warm gemma4 distractor inference was measured at 2.8s, so 30s leaves plenty of headroom, whereas 10s leaves little margin and definitely breaks on a cold load. Aligning both at 30s removes the asymmetry.
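The timeout reasoning above can be sketched as a quick arithmetic check using only the figures quoted in this PR (2.8s warm, ~50s as the lower bound of a cold load). `timeout_headroom` is a hypothetical helper written for illustration; it is not part of the repo.

```bash
#!/usr/bin/env bash
# Illustrative only: compare a measured latency (seconds) with a client timeout.
# timeout_headroom is a hypothetical helper, not something in textstack.
timeout_headroom() {
  local latency="$1" timeout="$2"
  # awk handles the floating-point comparison; "+ 0" forces numeric context.
  awk -v l="$latency" -v t="$timeout" 'BEGIN {
    if (l + 0 < t + 0) printf "ok (%.1fx headroom)\n", t / l
    else               printf "exceeds timeout\n"
  }'
}

timeout_headroom 2.8 10   # warm inference vs. the old API timeout
timeout_headroom 2.8 30   # warm inference vs. the new API timeout
timeout_headroom 50 30    # cold load (lower bound quoted above) vs. the new timeout
```

By these numbers a cold load would still exceed 30s, which suggests the keep-alive change is what covers the cold-load case, while the timeout bump adds margin for warm calls.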

Verified on prod

After manual `ollama run --keepalive=24h gemma4:e4b`:
```
$ docker compose exec ollama ollama ps
NAME          ID              SIZE     PROCESSOR    CONTEXT    UNTIL
gemma4:e4b    c6eb396dbd59    10 GB    100% CPU     4096       24 hours from now

$ free -h
               total        used        free  buff/cache   available
Mem:            30Gi        13Gi       351Mi        18Gi        17Gi
```

Distractor inference timing:
```
$ time ollama run gemma4:e4b "Generate 5 single-word distractors for linearizability..."
consistency, atomicity, serialization, concurrency, visibility
real 0m2.837s
```

Post-deploy step

After this merges and auto-deploys, prod's container will be recreated with the new env. The first load then needs to be triggered once:

```bash
ssh asus
cd ~/projects/onlinelib/textstack
docker compose exec ollama ollama run gemma4:e4b "" --keepalive=-1
docker compose exec ollama ollama ps # verify UNTIL=Forever
```

After that, `OLLAMA_KEEP_ALIVE=-1` in the daemon's environment keeps the model resident across all subsequent inference calls.
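The manual `ollama ps` check in the post-deploy step could be scripted along these lines. This is a minimal sketch: `model_is_pinned` is a hypothetical helper, and it assumes the `UNTIL` column is the last field of the row and reads `Forever` when the model is held indefinitely, as the comment in the step above expects.

```bash
#!/usr/bin/env bash
# Sketch: succeed only if `ollama ps` output (read from stdin) lists the given
# model with UNTIL=Forever. model_is_pinned is a hypothetical helper name.
model_is_pinned() {
  local model="$1"
  # Match the row whose first field is the model name and whose last field is
  # "Forever"; the exit status is 0 when such a row exists, 1 otherwise.
  awk -v m="$model" '$1 == m && $NF == "Forever" { found = 1 } END { exit !found }'
}

# Usage on the prod host (-T disables TTY allocation so the output pipes cleanly):
#   docker compose exec -T ollama ollama ps | model_is_pinned gemma4:e4b \
#     && echo "gemma4:e4b is pinned" \
#     || echo "gemma4:e4b is NOT pinned"
```

A row such as the `24 hours from now` one shown under "Verified on prod" would fail this check, since its last field is not `Forever`.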

🤖 Generated with Claude Code

Two reliability fixes for the gemma4:e4b setup on prod.

1. OLLAMA_KEEP_ALIVE=-1 in compose env

Default keep_alive is 5min. After idle, model unloads. Reload from cold takes ~50-60s on CPU + Ollama's pre-load memory check goes conservative (says 9.8 GiB needed but only 8.2 GiB available, even though host has 27 GiB free); both make the API's fire-and-forget distractor call fail intermittently. -1 keeps model resident as long as the container is up, so cold-load only happens on container restart (deploy).

2. Api/appsettings.json Ollama:TimeoutSeconds 10s → 30s

Worker is already 30s. API was 10s: works for warm gemma4 (measured 2.8s for distractor inference) but leaves no margin and breaks on cold-load. Aligning to 30s removes the asymmetry.

Verified on prod (after manual `ollama run --keepalive=24h`):
- ollama ps: gemma4:e4b loaded, UNTIL=24h
- distractor inference: 2.8s for 5 single-word answers
- free -h: 13 GiB used (was 3 GiB), model is in RAM

Post-deploy step: `docker compose exec ollama ollama run gemma4:e4b ""` once to trigger first load. Then it stays via keep_alive=-1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

mrviduus merged commit a8f3bcf into main May 8, 2026
5 checks passed

mrviduus merged 1 commit into main from fix/ollama-keepalive-and-timeout
