fix(ollama): keep_alive=-1 + bump api timeout 10s→30s #234

Merged: mrviduus merged 1 commit into main from fix/ollama-keepalive-and-timeout on May 8, 2026

mrviduus (Owner) commented on May 8, 2026

Why

Prod RAM showed only 3.1 GiB used even though gemma4:e4b occupies 9.8 GiB, which means the model unloads after an idle period. Two things then make fire-and-forget distractor calls fail intermittently: a cold reload takes ~50-60s on CPU, and Ollama's pre-load memory check is conservative (it reports "requires 9.8 GiB but only 8.2 GiB available" even though the host has 27 GiB available).

What

  1. `OLLAMA_KEEP_ALIVE=-1` in compose env. Once the model loads, it stays resident for the lifetime of the container. A cold load now happens only on container restart (i.e. on each deploy), not after every 5-minute idle window.

  2. `Ollama:TimeoutSeconds` in Api/appsettings.json, 10s → 30s. The Worker was already at 30s; the API was at 10s. Warm gemma4 distractor inference measured 2.8s, well within 30s; 10s works when the model is warm but leaves little margin for slower responses and fails on a cold load. Aligning both to 30s removes the asymmetry.

Verified on prod

After manual `ollama run --keepalive=24h gemma4:e4b`:
```
$ docker compose exec ollama ollama ps
NAME          ID              SIZE     PROCESSOR    CONTEXT    UNTIL
gemma4:e4b    c6eb396dbd59    10 GB    100% CPU     4096       24 hours from now

$ free -h
       total   used   free    buff/cache   available
Mem:   30Gi    13Gi   351Mi   18Gi         17Gi
```
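
The UNTIL column is the quickest way to tell a timed keep-alive (as in the manual 24h run above) from a pinned model, which `ollama ps` reports as `Forever`. A minimal sketch of a scripted check follows; the helper name `is_pinned` is hypothetical, and it assumes UNTIL is the last column of the `ollama ps` table, as in the output above.

```shell
# Hypothetical helper: reads `ollama ps` output on stdin and succeeds only if
# the row for the given model ends in "Forever" (assumes UNTIL is the last column).
is_pinned() {
  awk -v model="$1" '
    $1 == model && $NF == "Forever" { found = 1 }
    END { exit !found }
  '
}

# Example (not run here): -T disables the pseudo-TTY so the pipe gets clean lines.
#   docker compose exec -T ollama ollama ps | is_pinned gemma4:e4b && echo "pinned"
```

With the 24h row shown above the helper returns a non-zero status, since the model is loaded but not pinned.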

Distractor inference timing:
```
$ time ollama run gemma4:e4b "Generate 5 single-word distractors for linearizability..."
consistency, atomicity, serialization, concurrency, visibility
real 0m2.837s
```
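
To make the timeout reasoning concrete, the sketch below compares the latencies quoted in this PR with the old and new API timeouts. The helper name `fits_timeout` is hypothetical, and values are in milliseconds so that shell integer arithmetic is enough; 50000 ms is the low end of the ~50-60s cold-reload estimate.

```shell
# Hypothetical helper: report whether a measured latency fits inside a client
# timeout. Both arguments are integers in milliseconds.
fits_timeout() {
  latency_ms=$1
  timeout_ms=$2
  if [ "$latency_ms" -lt "$timeout_ms" ]; then
    echo "ok: $((timeout_ms - latency_ms)) ms headroom"
  else
    echo "timeout: exceeds limit by $((latency_ms - timeout_ms)) ms"
  fi
}

fits_timeout 2837 10000    # warm inference vs. old API timeout
fits_timeout 2837 30000    # warm inference vs. new API timeout
fits_timeout 50000 30000   # cold reload (low estimate) vs. new API timeout
```

Warm inference fits under either limit, while a cold reload exceeds both, which is why the first load after a deploy is triggered manually instead of being left to a live request.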

Post-deploy step

After this merges and auto-deploys, the prod container is recreated with the new env. The first load then needs to be triggered once:

```bash
ssh asus
cd ~/projects/onlinelib/textstack
docker compose exec ollama ollama run gemma4:e4b "" --keepalive=-1
docker compose exec ollama ollama ps # verify UNTIL=Forever
```

After that, `OLLAMA_KEEP_ALIVE=-1` in the daemon's environment keeps the model resident across all subsequent inference calls.
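
As an alternative to the interactive `ollama run` step, Ollama's HTTP API can preload a model by sending a generate request with no prompt and `keep_alive` set to -1. A rough sketch, assuming the API is reachable on the default port 11434 from wherever this runs (the compose file may not publish it); `OLLAMA_URL`, `preload_payload`, and `preload_model` are hypothetical names, not part of this PR.

```shell
# Build the JSON body for a preload request: no prompt, keep_alive -1 (never unload).
preload_payload() {
  printf '{"model": "%s", "keep_alive": -1}\n' "$1"
}

# Send the preload request. OLLAMA_URL is a hypothetical override; the default
# assumes Ollama's standard port is reachable from the current host.
preload_model() {
  curl -sS "${OLLAMA_URL:-http://localhost:11434}/api/generate" \
    -d "$(preload_payload "$1")"
}

# Example (not run here):
#   preload_model gemma4:e4b
```

Because this form needs no TTY, it may be easier to run from a deploy script than the `docker compose exec` variant.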

🤖 Generated with Claude Code

Two reliability fixes for the gemma4:e4b setup on prod.

1. OLLAMA_KEEP_ALIVE=-1 in compose env
   The default keep_alive is 5 min, so the model unloads after idle.
   A cold reload takes ~50-60s on CPU, and Ollama's pre-load memory
   check is conservative (it says 9.8 GiB is needed but only 8.2 GiB
   is available, even though the host has 27 GiB available). Both make
   the API's fire-and-forget distractor call fail intermittently.
   Setting -1 keeps the model resident as long as the container is up,
   so a cold load only happens on container restart (deploy).

2. Api/appsettings.json Ollama:TimeoutSeconds 10s → 30s
   The Worker is already at 30s. The API was at 10s, which works for
   warm gemma4 (measured 2.8s for distractor inference) but leaves
   little margin and breaks on a cold load. Aligning to 30s removes
   the asymmetry.

Verified on prod (after manual `ollama run --keepalive=24h`):
- ollama ps: gemma4:e4b loaded, UNTIL=24h
- distractor inference: 2.8s for 5 single-word answers
- free -h: 13 GiB used (was 3 GiB) — model is in RAM

Post-deploy step: `docker compose exec ollama ollama run gemma4:e4b ""`
once to trigger first load. Then it stays via keep_alive=-1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
mrviduus merged commit a8f3bcf into main on May 8, 2026
5 checks passed