A Persistent, Emotionally Reactive 3D Avatar Powered by Gemini 2.5 Flash Native Audio.
📖 View Documentation Gateway | ☁️ View Live Cloud Run App
Bird's Eye View: A fully embodied 3D digital persona built for the Google Gemini Live Agent Challenge. It merges the Gemini Live Multimodal API with a React Three Fiber-driven Ready Player Me avatar, achieving sub-100ms conversational latency alongside procedural, life-like ARKit blendshape expressions.
> [!WARNING]
> **Active Development:** This project was rapidly architected for the Gemini Live Agent Challenge and is under heavy, active development. Some configurations may break, and unstable branches exhibit experimental behavior. Check the issues log before deploying to production.
Digital Persona is a state-of-the-art multimodal AI implementation. It doesn't simply return text; it is an embodied entity capable of sustaining natural face-to-face interactions. Using a Next.js 16 Client-to-Server WebSockets architecture, the avatar leverages Google's Gemini Multimodal Live API to hear, see, and express emotions natively.
The application uses native audio, Voice Activity Detection (barge-in), session management, and Ephemeral Tokens for secure, low-latency browser streaming.
We poured hundreds of hours into solving complex constraints. Traditional chatbots ping APIs; we faced the visceral challenge of translating an LLM's audio output into physical 3D behaviors in under 100 milliseconds. The result: custom real-time audio chunkers, mathematical sine-wave breathers, co-articulating lip-sync extractors, and an entirely custom "Nervous System" that links Gemini Tool Calling directly into Three.js WebGL animations. Every pixel, floating panel, and blink speed curve was obsessively tuned for maximum realism.
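To make the "sine-wave breather" idea concrete, here is a minimal React Three Fiber sketch of the technique, not the project's actual code: a `useFrame` loop drives a single morph target with a slow sine curve. The hook shape, the `chestRaise` morph name, and the rate constants are illustrative assumptions.

```tsx
import type { RefObject } from "react";
import { useFrame } from "@react-three/fiber";
import type { SkinnedMesh } from "three";

// Hypothetical breathing driver: modulate a morph target with a slow sine
// wave so the avatar never sits perfectly still. The morph name and rates
// are illustrative, not the project's tuned values.
export function useBreathing(meshRef: RefObject<SkinnedMesh | null>) {
  const BREATHS_PER_MINUTE = 14; // resting respiratory rate

  useFrame(({ clock }) => {
    const mesh = meshRef.current;
    const index = mesh?.morphTargetDictionary?.["chestRaise"]; // hypothetical blendshape
    if (!mesh?.morphTargetInfluences || index === undefined) return;

    // Map sin(-1..1) into 0..1, then scale down so the motion stays subtle.
    const phase = clock.getElapsedTime() * (BREATHS_PER_MINUTE / 60) * 2 * Math.PI;
    mesh.morphTargetInfluences[index] = ((Math.sin(phase) + 1) / 2) * 0.25;
  });
}
```

Layering simplex noise on top of a curve like this is what produces the involuntary micro-twitches described in the feature list below.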
- Client-to-Server WebSockets: Ephemeral Tokens provisioned by a Next.js backend; the frontend connects directly to Gemini for pure audio streaming, dropping the proxy overhead.
- Seamless Barge-In (Interruption Handling): Gemini's native Voice Activity Detection allows natural interruptions. If a user speaks over the avatar, it instantly discards audio buffers, cancels animations, and listens.
- Session Resumption & Memory: Maintains a 128,000-token context window; if the connection drops, it resumes the session without losing conversation history.
- Affective Dialog (Emotion Awareness): The model processes native raw audio to detect the acoustic nuances of your voice (frustration, joy, hesitation) and automatically adjusts its own spoken tone, pitch, and pace to match.
- Co-articulation & Procedural Life: Combines ARKit blendshapes and Oculus Visemes for smooth phoneme transitions alongside involuntary micro-twitches and respiratory breathing curves.
- Real-Time Vision (Webcam Streaming): Continuously captures the user's environment via webcam (Base64 JPEG @ ~1 FPS), allowing Gemini to literally "see" you and ground its answers in visual context.
- Dual Audio/Text Transcriptions: Real-time transcripts of both the user's spoken words and the model's generated audio are streamed and displayed simultaneously in the UI for seamless accessibility.
- Google Search Grounding: Gives the avatar access to the live internet to answer questions with real-time, factual accuracy while it speaks.
- Unified State Control: The "Nervous System" exposes consolidated tools like `update_persona_state` and the enum-locked `trigger_animation` directly over WebSocket (see the sketch after this list).
- Synchronized Physicality: The Gemini model dynamically maps conversational intent and sentiment to ARKit expressions and gestural triggers concurrently with audio delivery.
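For a sense of what these consolidated tools can look like, here is a hedged sketch of function declarations in the `@google/genai` style. The tool names match the README, but the parameter schemas and enum values are assumptions, not the project's actual definitions.

```ts
import { Type, type FunctionDeclaration } from "@google/genai";

// Illustrative declarations for the "Nervous System" tools; the real
// schemas live in the project's source and may differ.
export const personaTools: FunctionDeclaration[] = [
  {
    name: "update_persona_state",
    description: "Update the avatar's consolidated emotional/expression state.",
    parameters: {
      type: Type.OBJECT,
      properties: {
        emotion: { type: Type.STRING, enum: ["neutral", "happy", "sad", "surprised"] },
        intensity: { type: Type.NUMBER, description: "0.0 to 1.0" },
      },
      required: ["emotion"],
    },
  },
  {
    name: "trigger_animation",
    description: "Play a named gesture clip on the avatar.",
    parameters: {
      type: Type.OBJECT,
      properties: {
        // Enum-locking keeps the model from inventing clip names.
        animation: { type: Type.STRING, enum: ["wave", "point", "nod", "shrug"] },
      },
      required: ["animation"],
    },
  },
];
```

Passing something like `tools: [{ functionDeclarations: personaTools }, { googleSearch: {} }]` in the Live session config would register these alongside the Search Grounding feature listed above.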
A glassmorphism "Control Center" with granular runtime configurations:
- Floating Chatbox – Live transcript feed (user + avatar) with API debugging logs.
- Persona Modes – Switch the system prompt dynamically (Tutor, Therapist, Interviewer); sketched below.
- Gemini Toggles – Enable/disable `Proactive Audio`, `Search Grounding`, and `Affective Dialog`.
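A persona switch ultimately boils down to swapping the session's system instruction. A minimal sketch, with placeholder prompts standing in for the app's real ones:

```ts
// Illustrative persona-to-prompt mapping; the shipped prompts are
// certainly more elaborate than these placeholders.
type PersonaMode = "tutor" | "therapist" | "interviewer";

const SYSTEM_PROMPTS: Record<PersonaMode, string> = {
  tutor: "You are a patient tutor. Explain concepts step by step.",
  therapist: "You are an empathetic listener. Reflect feelings back calmly.",
  interviewer: "You are a rigorous interviewer. Probe answers for depth.",
};

// The system instruction is fixed per Live session, so switching persona
// means tearing down the session and reconnecting with the new prompt.
export function systemInstructionFor(mode: PersonaMode): string {
  return SYSTEM_PROMPTS[mode];
}
```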
The system uses an Ephemeral Token proxy to establish a direct, low-latency WebSocket connection to the Gemini Multimodal Live API from the browser, bypassing heavy backend routing.
```mermaid
graph TD
    %% Styling Definition
    classDef client fill:#f3f4f6,stroke:#374151,stroke-width:2px,color:#111827
    classDef backend fill:#e0e7ff,stroke:#4338ca,stroke-width:2px,color:#312e81
    classDef gemini fill:#dbeafe,stroke:#1d4ed8,stroke-width:2px,color:#1e3a8a
    classDef component fill:#ffffff,stroke:#9ca3af,stroke-width:1px,color:#374151,rx:8px,ry:8px

    subgraph Client ["🖥️ Next.js Frontend (Browser)"]
        A["🎙️ Webcam & Mic"]:::component
        B["⚡ useGeminiLive Hook"]:::component
        C["🎨 React Three Fiber Canvas"]:::component
        D["🤖 Ready Player Me Avatar"]:::component
        A --> B
        B --> C
        C --> D
    end

    subgraph Backend ["☁️ Google Cloud Platform"]
        E["🔑 Next.js Route /api/token"]:::component
        F["🧠 Gemini 2.5 Flash Native Audio"]:::gemini
        E --> F
    end

    %% Add styling to subgraphs
    class Client client
    class Backend backend

    %% Security & Connection Logic
    Client -->|"1️⃣ Request Token"| E
    E -->|"2️⃣ Ephemeral Token"| Client
    Client <===>|"3️⃣ Bi-directional WSS"| F
```
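To make steps 1️⃣ and 2️⃣ concrete, a Next.js route handler can mint the short-lived token with the `@google/genai` SDK's `authTokens.create` and hand it to the browser, which then opens the WebSocket itself. This is a sketch, not the project's route: the model id, expiry window, and response shape are assumptions.

```ts
// app/api/token/route.ts – minimal sketch of an ephemeral-token endpoint.
import { GoogleGenAI } from "@google/genai";

export async function POST() {
  const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

  // Mint a single-use token with a short expiry; the browser uses it as
  // its apiKey, so the long-lived server key never leaves the backend.
  const token = await ai.authTokens.create({
    config: {
      uses: 1,
      expireTime: new Date(Date.now() + 30 * 60 * 1000).toISOString(),
      liveConnectConstraints: {
        model: "gemini-2.5-flash-native-audio-preview-09-2025", // assumed model id
      },
      httpOptions: { apiVersion: "v1alpha" },
    },
  });

  return Response.json({ token: token.name });
}
```

Constraining the token to a specific model (and optionally a fixed session config) limits what a leaked token could be used for.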
This sequence illustrates the sub-100ms latency loop with native interruption handling.
```mermaid
sequenceDiagram
    autonumber
    participant User
    participant Client as Frontend Client (useGeminiLive)
    participant Avatar as R3F Avatar
    participant API as Auth API (/api/token)
    participant Gemini as Gemini 2.5 Live

    %% Authentication Phase
    rect rgb(245,245,245)
        Note over Client,API: Authentication Phase
        Client->>API: POST /api/token
        API-->>Client: 200 OK (ephemeral JWT)
    end

    %% Realtime Session Setup
    rect rgb(230,240,255)
        Note over Client,Gemini: Realtime WebSocket Session
        Client->>Gemini: Establish Live Connection
    end

    %% Multimodal User Interaction
    par Audio & Video Streaming
        User->>Client: Speak: "What am I holding?"
        Client->>Gemini: sendRealtimeInput (16kHz PCM audio stream)
    and
        Client->>Gemini: sendRealtimeInput (Base64 JPEG @ 1fps)
    and Text Chat Fallback
        User->>Client: Types: "Explain quantum physics."
        Client->>Gemini: send({ clientContent: { text } })
    end

    Note right of Gemini: Multimodal processing (Audio + Video + Text)
    Gemini-->>Client: serverContent (audio response stream)
    Gemini-->>Client: toolCall(trigger_animation: point)

    %% Parallel Processing
    par Avatar Rendering
        Client->>Avatar: Stream audio + visemes
        Client->>Avatar: Trigger animation "point"
    and Tool Response
        Client->>Gemini: sendToolResponse(result: ok)
    end

    Avatar-->>User: Avatar gestures and speaks

    %% Barge-in / Interruption
    rect rgb(255,240,240)
        Note over User,Gemini: Interruption Handling (Barge-in)
        User->>Client: "Stop, that's enough"
        Client->>Gemini: Stream new audio input
        Gemini-->>Client: serverContent(interrupted=true)
        Client->>Avatar: Drop current audio queue
    end
```
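The interruption branch above maps to a small amount of client code. Here is a sketch using the `@google/genai` Live callbacks, with a simplified array standing in for the app's real playback queue; the model id and session config are assumptions.

```ts
import { GoogleGenAI, Modality, type LiveServerMessage } from "@google/genai";

// Simplified stand-in for the app's audio playback pipeline:
// base64 PCM chunks awaiting playback.
const audioQueue: string[] = [];

export async function connectLive(ephemeralToken: string) {
  // The ephemeral token from /api/token is used as the browser's apiKey.
  const ai = new GoogleGenAI({
    apiKey: ephemeralToken,
    httpOptions: { apiVersion: "v1alpha" },
  });

  return ai.live.connect({
    model: "gemini-2.5-flash-native-audio-preview-09-2025", // assumed model id
    config: {
      responseModalities: [Modality.AUDIO],
      sessionResumption: {}, // ask the server for resumption handles
    },
    callbacks: {
      onmessage: (msg: LiveServerMessage) => {
        // Barge-in: the server flags the turn as interrupted as soon as
        // VAD detects the user speaking over the model.
        if (msg.serverContent?.interrupted) {
          audioQueue.length = 0; // drop unplayed audio immediately
          return;
        }
        const audio = msg.serverContent?.modelTurn?.parts?.[0]?.inlineData?.data;
        if (audio) audioQueue.push(audio);
      },
      onerror: (e) => console.error("Live session error:", e),
      onclose: () => console.info("Live session closed"),
    },
  });
}
```

Because the queue is flushed the moment `interrupted` arrives, the avatar falls silent within a single playback chunk rather than finishing its sentence.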
| Layer | Technologies |
|---|---|
| Framework | Next.js 16 • React 19 • TypeScript • TanStack Query |
| 3D Rendering | Three.js • React Three Fiber • Drei • Three-stdlib |
| AI Core | @google/genai SDK • Gemini 2.5 Flash • wawa-lipsync • Phonemize |
| Audio Pipeline | 16-bit PCM • 16 kHz mono • WebSocket streaming |
| State Management | Zustand • React Context |
| UI / Styling | Tailwind CSS v4 • Framer Motion • Radix UI • Lucide • CVA |
| Utilities | Pino (Structured Logging) • Sentiment • Simplex-noise • Usehooks-ts |
| Testing | Vitest • Testing Library • Happy DOM • JSDOM |
| Deployment | Docker • Cloud Run • GitHub Actions • GitHub Pages |
| Documentation | VitePress • Mermaid |
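The Audio Pipeline row compresses a real constraint: browsers capture Float32 samples, while the Live API expects base64-encoded 16-bit little-endian PCM at 16 kHz. A minimal conversion sketch (resampling omitted; assumes the capture context already runs at 16 kHz):

```ts
// Convert Web Audio Float32 samples (-1..1) into 16-bit little-endian PCM,
// then base64-encode the bytes for streaming over the WebSocket.
function floatTo16BitPcmBase64(samples: Float32Array): string {
  const buffer = new ArrayBuffer(samples.length * 2);
  const view = new DataView(buffer);
  for (let i = 0; i < samples.length; i++) {
    const s = Math.max(-1, Math.min(1, samples[i])); // clamp to avoid wrap-around
    view.setInt16(i * 2, s < 0 ? s * 0x8000 : s * 0x7fff, true); // little-endian
  }
  let binary = "";
  const bytes = new Uint8Array(buffer);
  for (let i = 0; i < bytes.length; i++) binary += String.fromCharCode(bytes[i]);
  return btoa(binary);
}

// Usage (session/chunk names assumed): stream each mic chunk to the Live API.
// session.sendRealtimeInput({
//   audio: { data: floatTo16BitPcmBase64(chunk), mimeType: "audio/pcm;rate=16000" },
// });
```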
- Node.js v20 or later
- A Google Cloud Project with the Gemini API enabled
- A `GEMINI_API_KEY` (get one here)
```bash
git clone https://github.com/Kshitijm7/digital-persona.git
cd digital-persona
npm install
```

Create a `.env.local` file in the project root:
```env
GEMINI_API_KEY=your_gemini_api_key_here

# Logging (local/dev)
LOG_LEVEL_DEV=debug
NEXT_PUBLIC_LOG_LEVEL_DEV=debug

# Optional generic overrides
# LOG_LEVEL=debug
# NEXT_PUBLIC_LOG_LEVEL=debug

# Production controls (used in deploy pipeline)
# DISABLE_LOGS_IN_PROD=true
# LOG_LEVEL_PROD=silent
# NEXT_PUBLIC_LOG_LEVEL_PROD=silent
```

```bash
# Validate the avatar model and animations
npm run setup-avatar

# Start the development server
npm run dev
```

Navigate to http://localhost:3000. Grant Microphone & Camera permissions when prompted.
| Script | Description |
|---|---|
| `npm run dev` | Start development server |
| `npm run build` | Production build |
| `npm run lint` | Run ESLint |
| `npm run typecheck` | TypeScript type checking |
| `npm run test:run` | Run test suite |
| `npm run preflight` | Lint → Typecheck → Test → Build (full CI locally) |
| `npm run docs:build` | Build VitePress documentation |
| `npm run health-check` | Pre-deploy system validation |
The app ships with a multi-stage Dockerfile optimized for Cloud Run:
```bash
# Build the image
docker build -t digital-persona .

# Run locally
docker run -p 8080:8080 -e GEMINI_API_KEY=your_key digital-persona
```

The CI/CD pipeline automatically builds, pushes to Artifact Registry, and deploys to Cloud Run on every push to `main`.
```
digital-persona/
├── src/
│   ├── app/           # Next.js pages and API routes
│   ├── components/    # React & Three.js components
│   ├── hooks/         # Custom hooks (useGeminiLive, useWebcam, etc.)
│   ├── store/         # Zustand stores (lip-sync, emotion, animation)
│   ├── lib/           # Constants, utilities, and shared logic
│   └── config/        # Runtime configuration
├── public/            # Avatar .glb, animations, and static assets
├── docs/              # VitePress documentation site
├── scripts/           # Automation (avatar validation, animation download)
└── project_docs/      # Internal architecture documents
```
> [!NOTE]
> See the individual subdirectory README.md files for detailed internal documentation.
Contributions are welcome! Please:

- Fork the repository
- Create a feature branch (`git checkout -b feature/my-feature`)
- Make your changes and add tests
- Run the preflight check: `npm run preflight`
- Submit a Pull Request

See the issues log for areas where help is needed.
This project is licensed under the Apache License 2.0 – see the LICENSE file for details.