A fully local voice assistant. No cloud, no subscriptions. It listens, thinks, sees (when asked), and speaks — all on your machine.
- Continuous speech recognition via OpenAI Whisper (runs locally)
- Local LLM responses via Ollama (default: llama3.1)
- Two TTS backends: pyttsx3 (basic) or Piper (neural quality)
- Vision on demand — webcam object/scene description via a multimodal LLM, activated only when you ask
- Face blurring before any image reaches the LLM (MTCNN)
- Animated cat that lip-syncs to audio amplitude (Pygame + Piper)
- Two run modes: classic voice loop or deterministic LangGraph StateGraph agent
| Component | Tool |
|---|---|
| Speech recognition | Whisper |
| Conversation LLM | Ollama (llama3.1) |
| Vision LLM | Ollama (minicpm-v) |
| Text-to-speech | Piper / pyttsx3 |
| Face detection | MTCNN |
| Animation | Pygame |
| Orchestration | LangGraph StateGraph |
- Python 3.11+
- Ollama running locally (`http://localhost:11434`)
- A working microphone
- For vision: a webcam
- For Piper TTS: download a `.onnx` voice model (see Piper voices)

Install the Python dependencies:

    pip install -r requirements.txt

Pull the required Ollama models:

    ollama pull llama3.1    # conversation
    ollama pull minicpm-v   # vision (optional)

Start Cat with the default settings:

    python -m cat.src.main

Or enable agent mode, Piper, vision, and the cat window:

    python -m cat.src.main \
      --agent-mode \
      --tts-backend piper \
      --piper-model voice_models/en_US-lessac-medium.onnx \
      --enable-vision \
      --show-cat

All command-line options:

    Speech Recognition:
      --whisper-model    tiny | base | small | medium | large (default: base)
      --listen-timeout   seconds to wait for speech to start (default: 5.0)
      --phrase-timeout   pause before considering speech done (default: 3.0)

    AI Processing:
      --ai-model         Ollama model name (default: llama3.1)
      --ollama-url       Ollama server URL (default: http://localhost:11434)
      --temperature      response temperature 0.0–1.0 (default: 0.7)
      --system-prompt    system prompt text

    Text-to-Speech:
      --tts-backend      pyttsx3 | piper (default: pyttsx3)
      --piper-model      path to .onnx voice model
      --speech-rate      words per minute (pyttsx3 only) (default: 150)
      --speech-volume    volume 0.0–1.0 (pyttsx3 only) (default: 1.0)

    Agent Mode:
      --agent-mode       use LangGraph StateGraph agent
      --show-cat         show animated cat window
      --assets-dir       directory with cat sprite images
      --enable-vision    enable webcam object detection
      --camera-index     camera device index (default: 0)

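For orientation, the options above map naturally onto `argparse`. The sketch below covers only a subset of the flags, using the defaults listed. The function names `build_parser` and `parse_cli` are hypothetical, and the check that Piper needs a voice model is an assumption. None of this is the project's actual code.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser covering a subset of the documented flags."""
    parser = argparse.ArgumentParser(prog="cat")

    speech = parser.add_argument_group("Speech Recognition")
    speech.add_argument(
        "--whisper-model",
        choices=["tiny", "base", "small", "medium", "large"],
        default="base",
    )
    speech.add_argument("--listen-timeout", type=float, default=5.0)
    speech.add_argument("--phrase-timeout", type=float, default=3.0)

    ai = parser.add_argument_group("AI Processing")
    ai.add_argument("--ai-model", default="llama3.1")
    ai.add_argument("--ollama-url", default="http://localhost:11434")
    ai.add_argument("--temperature", type=float, default=0.7)

    tts = parser.add_argument_group("Text-to-Speech")
    tts.add_argument("--tts-backend", choices=["pyttsx3", "piper"], default="pyttsx3")
    tts.add_argument("--piper-model", default=None)

    agent = parser.add_argument_group("Agent Mode")
    agent.add_argument("--agent-mode", action="store_true")
    agent.add_argument("--show-cat", action="store_true")
    agent.add_argument("--enable-vision", action="store_true")
    agent.add_argument("--camera-index", type=int, default=0)
    return parser


def parse_cli(argv: list[str]) -> argparse.Namespace:
    """Parse argv; assumed rule: the Piper backend needs a voice model path."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.tts_backend == "piper" and not args.piper_model:
        parser.error("--piper-model is required when --tts-backend is piper")
    return args
```

Calling `parse_cli([])` yields the documented defaults, for example `pyttsx3` as the TTS backend and `base` as the Whisper model.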
The core loop is listen → process → speak → repeat. The recognizer pauses while Cat is speaking so it doesn't transcribe its own voice and spiral into a feedback loop.
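The loop and the pause-while-speaking behavior can be sketched as follows. This is a minimal illustration, not Cat's implementation: the `VoiceLoop` class, its callables, and the convention that `listen` returns `None` to stop are all assumptions. Here the `listening` event stands in for whatever mechanism the real recognizer uses to mute itself.

```python
import threading
from typing import Callable, Optional


class VoiceLoop:
    """Hypothetical listen -> process -> speak loop with a mute gate."""

    def __init__(
        self,
        listen: Callable[[], Optional[str]],
        process: Callable[[str], str],
        speak: Callable[[str], None],
    ) -> None:
        self._listen = listen
        self._process = process
        self._speak = speak
        # Set while recognition is allowed; cleared during speech output.
        self.listening = threading.Event()
        self.listening.set()

    def run_once(self) -> bool:
        """Run one turn. Returns False when listen() yields None."""
        self.listening.wait()
        heard = self._listen()
        if heard is None:
            return False
        reply = self._process(heard)
        self.listening.clear()  # pause recognition so the reply is not transcribed
        try:
            self._speak(reply)
        finally:
            self.listening.set()  # resume even if playback raised
        return True

    def run(self) -> None:
        """Repeat turns until run_once() reports there is nothing to hear."""
        while self.run_once():
            pass
```

In a real setup a background recognizer thread would check `listening` before feeding audio to Whisper. Resetting the event in a `finally` block keeps a failed playback from leaving the assistant permanently deaf.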
Instead of a ReAct agent (tried it, unreliable for daily use), Cat uses a deterministic StateGraph:

    listen → analyze → [route] → respond / vision / exit → react → speak → END

The analyze node classifies intent (conversational / vision / exit), and the graph routes accordingly, with no ambiguity.
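To make the routing concrete, here is a plain-Python stand-in for the graph. It deliberately avoids the LangGraph API so it runs without dependencies. The keyword lists, node bodies, and the `run_turn` helper are hypothetical; in particular, the real analyze node may classify intent quite differently (for example with an LLM), and the source does not say what the react node does, so it is a pass-through here.

```python
from typing import Any, Callable, Dict, Tuple

State = Dict[str, Any]
Node = Callable[[State], State]

# Hypothetical trigger phrases; the real classifier is not shown in the README.
VISION_PHRASES = ("what do you see", "look at", "can you see")
EXIT_PHRASES = ("goodbye", "exit", "quit")


def analyze(state: State) -> State:
    """Classify intent with keyword rules (stand-in for the real analyze node)."""
    text = str(state["heard"]).lower()
    if any(phrase in text for phrase in EXIT_PHRASES):
        intent = "exit"
    elif any(phrase in text for phrase in VISION_PHRASES):
        intent = "vision"
    else:
        intent = "conversational"
    return {**state, "intent": intent}


def respond(state: State) -> State:
    return {**state, "reply": f"(LLM reply to: {state['heard']})"}


def vision(state: State) -> State:
    return {**state, "reply": "(description of the blurred webcam frame)"}


def exit_node(state: State) -> State:
    return {**state, "reply": "Goodbye!", "done": True}


def react(state: State) -> State:
    return state  # placeholder: behavior not described in the source


def speak(state: State) -> State:
    return {**state, "spoken": state["reply"]}


# Each intent maps to exactly one branch, so routing is a plain lookup.
ROUTES: Dict[str, Tuple[str, Node]] = {
    "conversational": ("respond", respond),
    "vision": ("vision", vision),
    "exit": ("exit", exit_node),
}


def run_turn(heard: str) -> State:
    """Run one pass: listen -> analyze -> route -> react -> speak."""
    state: State = {"heard": heard, "trace": ["listen"]}
    state = analyze(state)
    state["trace"].append("analyze")
    branch = ROUTES[state["intent"]]
    for name, node in (branch, ("react", react), ("speak", speak)):
        state = node(state)
        state["trace"].append(name)
    return state
```

Because the route is a dictionary lookup on a single `intent` value, every turn follows one fixed path, which is the property that makes the graph predictable compared with an agent that chooses its own tools.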
Vision is off by default. When you ask "what do you see?", the intent classifier routes to the vision node, which captures a webcam frame, blurs any faces with MTCNN, and sends the image to a multimodal LLM for a natural description.
Faces are blurred before the image is analyzed or saved. The LLM never receives identifiable faces.
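As a rough illustration of the last step, the sketch below sends an already-blurred frame to Ollama's `/api/generate` endpoint, which accepts base64-encoded images for multimodal models. The function names and the prompt are hypothetical, and Cat's own client code may differ (for example, it might use the `ollama` Python package). Face detection and blurring are assumed to have happened before these functions are called.

```python
import base64
import json
import urllib.request


def build_vision_request(
    blurred_jpeg: bytes, prompt: str, model: str = "minicpm-v"
) -> dict:
    """Build the JSON body for a single, non-streaming vision request."""
    return {
        "model": model,
        "prompt": prompt,
        # Ollama expects images as a list of base64-encoded strings.
        "images": [base64.b64encode(blurred_jpeg).decode("ascii")],
        "stream": False,
    }


def describe_frame(
    blurred_jpeg: bytes,
    prompt: str = "Describe what you see.",
    ollama_url: str = "http://localhost:11434",
) -> str:
    """POST the blurred frame to a local Ollama server and return its text."""
    body = json.dumps(build_vision_request(blurred_jpeg, prompt)).encode("utf-8")
    request = urllib.request.Request(
        f"{ollama_url}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        return json.loads(response.read())["response"]
```

Taking the blurred bytes as the only image input keeps the privacy guarantee easy to audit: the unblurred frame never reaches the code that talks to the model.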

Run the test suite:

    pytest tests/