Skip to content

coderTanisha22/Jarvis2.O

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

13 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

🤖✨ Jarvis 2.0

Jarvis 2.0 is a next-generation multimodal conversational AI assistant 🗣️, designed for real-time ⚡, low-latency, and emotionally intelligent ❤️ interaction.

This project integrates 🔗 the high-performance, websocket-based audio streaming 🌊 architecture of Unmute with the powerful audio-language reasoning 🦩 of Audio Flamingo 3.

We utilize Unmute's robust Voice Activity Detection (VAD) 🎙️ and its integration with Kyutai's STT/TTS models to create a seamless, responsive conversational pipeline. Instead of a standard text LLM, Jarvis 2.0 uses Nvidia's Audio Flamingo 3 as its central "brain" 🧠, allowing for a deeper understanding 👂 of not just what is said, but how it's said.

⚙️ How It Works

Jarvis 2.0 functions by creating a real-time, bidirectional audio stream 🔄🔊 between the user and the AI.

  1. VAD & Streaming: 🎤 The frontend captures user audio and, using Unmute's VAD implementation, streams it over a websocket 🕸️ to the backend as the user speaks.
  2. Transcription: ✍️ The backend forwards this audio to Kyutai's Speech-to-Text (STT) model, which generates a live transcription.
  3. Core Reasoning: 💡 The transcribed text is sent to the Audio Flamingo 3 🦩 model. This advanced Audio-Language Model (ALM) generates a context-aware, nuanced, and intelligent response.
  4. Speech Synthesis: 🗣️ The text response from Audio Flamingo 3 is streamed, as it's generated, to Kyutai's Text-to-Speech (TTS) model.
  5. Response: 🎧 The TTS model generates audio, which is streamed back to the user's browser 💻, enabling a fluid, low-latency conversation.
graph LR
    UVI[User Voice Input] --> F(Frontend)
    F -->|Audio File| B(Backend)
    B <-->|WEB SOCKET| STT(STT)
    B <-->|WEB SOCKET| TTS(TTS)
    B <-->|HTTP| AF3(AF3)
    B <--> LLM(LLM)
    LLM <--> SDK(OpenAI Agent SDK)
    SDK <--> TC(Tool Calling)
Loading

🌟 Features

  • ⚡ Extremely Low Latency: Built on Unmute's architecture, streaming STT, LLM, and TTS tokens simultaneously for lower time-to-first-word."
  • 🧠 Advanced AI Reasoning: Powered by Audio Flamingo 3 🦩, providing state-of-the-art responses.
  • 🌊 Real-time Streaming: Full-duplex audio transport over websockets.
  • 🎙️ Robust VAD: Intelligently detects end-of-speech or natural spaces to provide a natural turn-taking experience.
  • 🧩 Modular: Easily swap out the core model (Audio Flamingo 3) for other backends like GPT-4o, Ollama, or Mistral.
  • 👂 Spatial & Emotion Detection: The core model (Audio Flamingo 3) understands audio and is able to detect the surrounding environment 🌍 and the user's tone 😄😢 from the input audio, something which has not yet been achieved by other open source models.

🛠️ Running without Docker (Dockerless)

Alternatively, you can run all services manually. This is more complex due to dependencies.

💻 Software requirements:

  • uv: Install with curl -LsSf https://astral.sh/uv/install.sh | sh
  • cargo: Install with curl https://sh.rustup.rs -sSf | sh
  • pnpm: Install with curl -fsSL https://get.pnpm.io/install.sh | sh -
  • cuda 12.1: Needed for the Rust processes (tts and stt).

▶️ Start services: Start each of the services one by one in a different terminal 🖥️:

./dockerless/start_frontend.sh
./dockerless/start_backend.sh
./dockerless/start_llm.sh        # Requires GPU VRAM
./dockerless/start_stt.sh        # Requires GPU VRAM
./dockerless/start_tts.sh        # Requires GPU VRAM

The website should now be accessible at 🌐 http://localhost:3000.

📡 Connecting to a Remote Server

If you're running Jarvis 2.0 on a remote machine (e.g., jarvis-box) and accessing it from your local machine, you must use SSH port forwarding.

Note

🔒 Browsers restrict microphone 🎤 access on non-secure (http://) connections, except for localhost. Port forwarding makes the remote server accessible via your localhost, bypassing this restriction.

🐳 For Docker Compose: The default setup runs on port 80. Forward this to your local port 3333 🔑:

ssh -N -L 3333:localhost:80 jarvis-box

Now open http://localhost:3333 in your browser.

🛠️ For Dockerless: You must forward the frontend (3000) and backend (8000) ports separately 🔑:

ssh -N -L 8000:localhost:8000 -L 3000:localhost:3000 jarvis-box

Now open http://localhost:3000 in your browser.

🔐 HTTPS Support

For simplicity, HTTPS is not included in the default setups. For production deployments, we recommend using a reverse proxy like Caddy or Nginx, or adapting the Docker Swarm documentation provided by the Unmute project.

🔧 Modifying Jarvis 2.0

💬🐞 Subtitles and Dev Mode

  • Press "S" to toggle subtitles for both you and Jarvis.
  • A dev mode can be enabled in useKeyboardShortcuts.ts by changing ALLOW_DEV_MODE to true. Press "D" to see the debug view.

🎭🗣️ Changing Characters/Voices

All character prompts, voices, and system messages are defined in voices.yaml. To add a new character, simply add a new entry. The backend caches this file on startup, so you will need to restart the backend service to see changes.

🔄🧠 Using Different LLM/ALM Servers

The backend is compatible with any OpenAI-compatible API. While it's configured for our VLLM-hosted Audio Flamingo 3 by default, you can easily point it to another service.

Edit your docker-compose.yml and change the environment variables for the backend service.

Example: Using Ollama (🦙)

  backend:
    image: jarvis-backend:latest
    [..]
    environment:
      [..]
      - KYUTAI_LLM_URL=http://host.docker.internal:11434
      - KYUTAI_LLM_MODEL=llama3 # or any model you have pulled
      - KYUTAI_LLM_API_KEY=ollama
    extra_hosts:
      - "host.docker.internal:host-gateway"

Example: Using OpenAI (🤖)

  backend:
    image: jarvis-backend:latest
    [..]
    environment:
      [..]
      - KYUTAI_LLM_URL=https://api.openai.com/v1
      - KYUTAI_LLM_MODEL=gpt-4o
      - KYUTAI_LLM_API_KEY=sk-..

If you use an external API, you can remove the llm (VLLM) service from your docker-compose.yml to save 💾 GPU resources.

🛠️📞 Tool Calling

Tool calling is not yet natively supported by the backend, but it's a highly requested feature.

The easiest way to integrate it is to make it invisible to the Jarvis backend. You can create a small FastAPI server that wraps VLLM, intercepts the requests, performs tool calls, and then returns the final response. See this comment for a conceptual overview.

🙏 Acknowledgements

Jarvis 2.0 stands on the shoulders of giants 🧑‍🔬. This project would not be possible without the foundational work from the Kyutai team on Unmute. We extend our sincere thanks 💖 to them for open-sourcing their high-performance audio pipeline, which serves as the backbone of this project.

📜 License

This project is licensed under the MIT License. See the LICENSE file for details.

About

Voice-to-voice conversational AI: integrated Kyutai STT/TTS, Unmute VAD & vLLM inference for sub-second response. Scaled to 500+ concurrent users via Docker. FastAPI • React • WebSockets • RAG • MCP.

Resources

License

Contributing

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages

  • Jinja 52.3%
  • Python 29.5%
  • TypeScript 13.6%
  • Shell 1.3%
  • JavaScript 1.0%
  • Dockerfile 1.0%
  • Other 1.3%