Gauravpadam/Cat

Cat

A fully local voice assistant. No cloud, no subscriptions. It listens, thinks, sees (when asked), and speaks — all on your machine.

Features

  • Continuous speech recognition via OpenAI Whisper (runs locally)
  • Local LLM responses via Ollama (default: llama3.1)
  • Two TTS backends: pyttsx3 (basic) or Piper (neural quality)
  • Vision on demand — webcam object/scene description via a multimodal LLM, activated only when you ask
  • Face blurring before any image reaches the LLM (MTCNN)
  • Animated cat that lip-syncs to audio amplitude (Pygame + Piper)
  • Two run modes: classic voice loop or deterministic LangGraph StateGraph agent
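The lip-sync feature above can be pictured as mapping the loudness of each audio chunk to a mouth sprite. The following is a minimal sketch, not Cat's actual animation code: it assumes 16-bit little-endian mono PCM input, and the names `rms`, `mouth_frame`, `n_frames`, and `full_scale` are hypothetical.

```python
import array
import math
import sys


def rms(pcm: bytes) -> float:
    """Root-mean-square amplitude of a chunk of 16-bit mono PCM audio."""
    samples = array.array("h")
    # Drop a trailing odd byte so the buffer divides into whole samples.
    samples.frombytes(pcm[: len(pcm) - len(pcm) % 2])
    if sys.byteorder == "big":
        samples.byteswap()  # the input is assumed to be little-endian
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def mouth_frame(pcm: bytes, n_frames: int = 4, full_scale: float = 8000.0) -> int:
    """Pick a sprite index from 0 (closed) to n_frames - 1 (wide open).

    full_scale is an assumed loudness that counts as fully open;
    a real setup would tune it to the voice model's output level.
    """
    level = min(rms(pcm) / full_scale, 1.0)
    return min(int(level * n_frames), n_frames - 1)
```

Calling `mouth_frame` once per playback chunk and drawing the matching sprite would make the mouth open wider on louder syllables and close during silence.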

Stack

Component            Tool
Speech recognition   Whisper
Conversation LLM     Ollama (llama3.1)
Vision LLM           Ollama (minicpm-v)
Text-to-speech       Piper / pyttsx3
Face detection       MTCNN
Animation            Pygame
Orchestration        LangGraph StateGraph

Requirements

  • Python 3.11+
  • Ollama running locally (http://localhost:11434)
  • A working microphone
  • For vision: a webcam
  • For Piper TTS: download a .onnx voice model (see Piper voices)

Installation

pip install -r requirements.txt

Pull the required Ollama models:

ollama pull llama3.1        # conversation
ollama pull minicpm-v       # vision (optional)

Usage

Classic mode (simple loop)

python -m cat.src.main

Agent mode with vision and animated cat

python -m cat.src.main \
    --agent-mode \
    --tts-backend piper \
    --piper-model voice_models/en_US-lessac-medium.onnx \
    --enable-vision \
    --show-cat

All options

Speech Recognition:
  --whisper-model   tiny | base | small | medium | large  (default: base)
  --listen-timeout  seconds to wait for speech to start   (default: 5.0)
  --phrase-timeout  pause before considering speech done  (default: 3.0)

AI Processing:
  --ai-model        Ollama model name                     (default: llama3.1)
  --ollama-url      Ollama server URL                     (default: http://localhost:11434)
  --temperature     response temperature 0.0–1.0          (default: 0.7)
  --system-prompt   system prompt text

Text-to-Speech:
  --tts-backend     pyttsx3 | piper                       (default: pyttsx3)
  --piper-model     path to .onnx voice model
  --speech-rate     words per minute (pyttsx3 only)       (default: 150)
  --speech-volume   volume 0.0–1.0 (pyttsx3 only)         (default: 1.0)

Agent Mode:
  --agent-mode      use LangGraph StateGraph agent
  --show-cat        show animated cat window
  --assets-dir      directory with cat sprite images
  --enable-vision   enable webcam object detection
  --camera-index    camera device index                   (default: 0)

How it works

Voice loop

The core loop is: listen → process → speak → repeat. The recognizer pauses during speech output so Cat doesn't hear its own voice and end up replying to itself in a loop.
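A minimal sketch of that loop, for illustration only. The `VoiceLoop` class and its `listen`, `think`, and `speak` callables are hypothetical stand-ins for the Whisper, Ollama, and TTS components; none of these names come from the Cat codebase. What it shows is the pause: a flag is cleared while audio plays and set again afterwards.

```python
import threading
from typing import Callable, Optional


class VoiceLoop:
    """Listen -> process -> speak, with the recognizer muted during playback."""

    def __init__(
        self,
        listen: Callable[[], Optional[str]],
        think: Callable[[str], str],
        speak: Callable[[str], None],
    ) -> None:
        self.listen = listen
        self.think = think
        self.speak = speak
        # A background recognizer would check this flag before accepting audio.
        self.listening = threading.Event()
        self.listening.set()

    def run_once(self) -> bool:
        """Run one turn. Returns False when nothing was heard."""
        text = self.listen()
        if text is None:
            return False
        reply = self.think(text)
        self.listening.clear()  # mute the recognizer while speaking
        try:
            self.speak(reply)
        finally:
            self.listening.set()  # resume even if playback raised
        return True
```

The `try`/`finally` is the design choice worth noting: if playback fails, the recognizer is still re-enabled, so the assistant does not stay deaf after a TTS error.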

Agent mode

Instead of a ReAct agent (tried it, unreliable for daily use), Cat uses a deterministic StateGraph:

listen → analyze → [route] → respond / vision / exit → react → speak → END

The analyze node classifies intent (conversational / vision / exit) and the graph routes accordingly — no ambiguity.
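The routing idea can be sketched in plain Python without LangGraph. This is an assumption-laden illustration, not the project's graph: the keyword lists, the node bodies, the `reaction` field, and the `run_turn` helper are all hypothetical, and the real analyze node may classify intent differently (for example with the LLM).

```python
from typing import Callable, Dict

State = Dict[str, str]

# Hypothetical trigger phrases; the real classifier is not documented here.
VISION_PHRASES = ("what do you see", "look at", "can you see")
EXIT_PHRASES = ("goodbye", "bye", "exit", "quit")


def analyze(state: State) -> State:
    """Label the utterance as conversational, vision, or exit."""
    text = state["heard"].lower()
    if any(phrase in text for phrase in EXIT_PHRASES):
        intent = "exit"
    elif any(phrase in text for phrase in VISION_PHRASES):
        intent = "vision"
    else:
        intent = "conversational"
    return {**state, "intent": intent}


def respond(state: State) -> State:
    return {**state, "reply": f"(LLM reply to: {state['heard']})"}


def vision(state: State) -> State:
    return {**state, "reply": "(description of webcam frame)"}


def exit_node(state: State) -> State:
    return {**state, "reply": "Goodbye!"}


def react(state: State) -> State:
    # Placeholder: what the react node does in Cat is not specified here.
    return {**state, "reaction": "talking"}


def speak(state: State) -> State:
    return {**state, "spoken": state["reply"]}


# Each intent maps to exactly one node, so routing is a dictionary lookup.
ROUTES: Dict[str, Callable[[State], State]] = {
    "conversational": respond,
    "vision": vision,
    "exit": exit_node,
}


def run_turn(heard: str) -> State:
    """analyze -> [route] -> respond / vision / exit -> react -> speak."""
    state = analyze({"heard": heard})
    state = ROUTES[state["intent"]](state)
    return speak(react(state))
```

Because the route is a fixed lookup on the classified intent, the same input always takes the same path, which is the property the StateGraph design is after.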

Vision

Vision is off by default. When you ask "what do you see?", the intent classifier routes to the vision node, which captures a webcam frame, blurs any faces with MTCNN, and sends the image to a multimodal LLM for a natural description.
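As a rough sketch of the last step, a frame can be sent to a multimodal model through Ollama's `/api/generate` endpoint, which accepts base64-encoded images. Whether Cat calls the HTTP API directly or goes through a client library is not stated here, so treat this as an assumption; `build_vision_request`, `describe_frame`, and the prompt text are hypothetical.

```python
import base64
import json
import urllib.request


def build_vision_request(
    jpeg_bytes: bytes,
    model: str = "minicpm-v",
    prompt: str = "Describe what you see in one or two sentences.",
) -> dict:
    """Build the JSON body for Ollama's /api/generate endpoint."""
    return {
        "model": model,
        "prompt": prompt,
        # Ollama expects images as base64-encoded strings.
        "images": [base64.b64encode(jpeg_bytes).decode("ascii")],
        "stream": False,  # ask for one complete JSON reply
    }


def describe_frame(jpeg_bytes: bytes, ollama_url: str = "http://localhost:11434") -> str:
    """POST an (already face-blurred) frame to Ollama and return the description."""
    body = json.dumps(build_vision_request(jpeg_bytes)).encode("utf-8")
    request = urllib.request.Request(
        f"{ollama_url}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        return json.loads(response.read())["response"]
```

`describe_frame` needs a running Ollama server with the vision model pulled; `build_vision_request` can be inspected on its own without any network access.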

Privacy

Faces are blurred before the image is analyzed or saved. The LLM never receives identifiable faces.
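To make the ordering concrete, here is a small sketch of anonymizing detected regions before anything else touches the frame. It is deliberately simplified so it runs without OpenCV or a detector: the image is a nested list of grayscale values, the boxes are assumed to be `(x, y, width, height)` tuples from a face detector, and each region is flattened to its mean value as a crude stand-in for a real blur. The `anonymize` function is hypothetical.

```python
from typing import List, Sequence, Tuple

Image = List[List[int]]          # rows of grayscale pixel values
Box = Tuple[int, int, int, int]  # assumed layout: (x, y, width, height)


def anonymize(image: Image, boxes: Sequence[Box]) -> Image:
    """Return a copy of image with every box flattened to its mean value."""
    out = [row[:] for row in image]  # never modify the caller's frame
    height = len(out)
    width = len(out[0]) if out else 0
    for x, y, w, h in boxes:
        # Clamp the box to the image so partly off-screen faces still work.
        x0, y0 = max(0, x), max(0, y)
        x1, y1 = min(width, x + w), min(height, y + h)
        if x0 >= x1 or y0 >= y1:
            continue
        pixels = [out[r][c] for r in range(y0, y1) for c in range(x0, x1)]
        mean = sum(pixels) // len(pixels)
        for r in range(y0, y1):
            for c in range(x0, x1):
                out[r][c] = mean
    return out
```

In a pipeline following this pattern, only the return value of `anonymize` would be passed on to the saving and LLM steps, so the unblurred frame never leaves the capture function.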

Running tests

pytest tests/

About

Cat is my voice/video assistant. He does all the heavy lifting around my house. He is lazy and keeps asking for churu.
