Digital Persona

A Persistent, Emotionally Reactive 3D Avatar Powered by Gemini 2.5 Flash Native Audio.


📚 View Documentation Gateway | ☁️ View Live Cloud Run App

Bird's Eye View: A fully embodied 3D digital persona built for the Google Gemini Live Agent Challenge. It merges the Gemini Live Multimodal API with a React Three Fiber-driven Ready Player Me avatar, achieving sub-100ms conversational latency with procedural, life-like ARKit blendshape expressions.

Warning

Active Development: This project was rapidly architected for the Gemini Live Agent Challenge and is under heavy, active development. Some configurations may break, and unstable branches may exhibit experimental behavior. Check the issues log before deploying to production.


📖 Overview

Digital Persona is a state-of-the-art multimodal AI implementation. It doesn't simply return text — it is an embodied entity capable of sustaining natural face-to-face interactions. Using a Next.js 16 client-to-server WebSocket architecture, the avatar leverages Google's Gemini Multimodal Live API to hear, see, and express emotions natively.

The application uses native audio, Voice Activity Detection (barge-in), session management, and Ephemeral Tokens for secure, low-latency browser streaming.
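The 16-bit PCM requirement means browser audio (Float32 samples from the Web Audio API) has to be converted before it can be streamed. A minimal sketch of that conversion, with helper names that are illustrative rather than the project's actual code:

```typescript
// Hypothetical helper: convert Web Audio Float32 samples (-1..1) into the
// 16-bit little-endian PCM the Live API's realtime audio input expects.
function floatTo16BitPCM(samples: Float32Array): Int16Array {
  const pcm = new Int16Array(samples.length);
  for (let i = 0; i < samples.length; i++) {
    // Clamp to [-1, 1] before scaling to the signed 16-bit range.
    const s = Math.max(-1, Math.min(1, samples[i]));
    pcm[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }
  return pcm;
}

// Base64-encode the raw bytes for a JSON WebSocket payload (Node's Buffer is
// shown here; a browser build would base64-encode the bytes differently).
function pcmToBase64(pcm: Int16Array): string {
  return Buffer.from(pcm.buffer, pcm.byteOffset, pcm.byteLength).toString("base64");
}
```

Resampling the microphone stream down to 16 kHz mono happens upstream of this step (typically in an AudioWorklet), which is omitted here for brevity.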

💖 The Engineering Effort

We poured hundreds of hours into solving complex constraints. Traditional chatbots ping APIs — we faced the visceral challenge of translating an LLM's audio output into physical 3D behaviors in under 100 milliseconds. The result: custom real-time audio chunkers, mathematical sine-wave breathers, co-articulating lip-sync extractors, and an entirely custom "Nervous System" that links Gemini Tool Calling directly into Three.js WebGL animations. Every pixel, floating panel, and blink speed curve was obsessively tuned for maximum realism.


🚀 Key Features

1. Zero-Latency Conversational Architecture

  • Client-to-Server WebSockets: Ephemeral Tokens provisioned by a Next.js backend — the frontend connects directly to Gemini for pure audio streaming, dropping the proxy overhead.
  • Seamless Barge-In (Interruption Handling): Gemini's native Voice Activity Detection allows natural interruptions. If a user speaks over the avatar, it instantly discards audio buffers, cancels animations, and listens.
  • Session Resumption & Memory: Maintains a 128,000-token context window; if the connection drops, it resumes the session without losing conversation history.
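The barge-in behavior above boils down to flushing the playback buffer the moment an interruption is signaled. A hedged sketch with a hypothetical queue class, not the app's real implementation:

```typescript
// Illustrative playback queue: when the server signals an interruption,
// every queued audio chunk is dropped so the avatar falls silent
// immediately and goes back to listening.
class PlaybackQueue {
  private chunks: Int16Array[] = [];

  enqueue(chunk: Int16Array): void {
    this.chunks.push(chunk);
  }

  /** Called when serverContent arrives flagged as interrupted. */
  onInterrupted(): void {
    this.chunks.length = 0; // discard all buffered audio
  }

  dequeue(): Int16Array | undefined {
    return this.chunks.shift();
  }

  pending(): number {
    return this.chunks.length;
  }
}
```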

2. Emotional Intelligence & Emotive Realism

  • Affective Dialog (Emotion Awareness): The model processes native raw audio to detect the acoustic nuances of your voice (frustration, joy, hesitation) and automatically adjusts its own spoken tone, pitch, and pace to match.
  • Co-articulation & Procedural Life: Combines ARKit blendshapes and Oculus Visemes for smooth phoneme transitions alongside involuntary micro-twitches and respiratory breathing curves.
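The respiratory breathing curve mentioned above can be sketched as a simple sine oscillator mapped into a blendshape weight; the rate constant here is an assumed example, not the project's tuned value:

```typescript
// Illustrative sine-wave "breather": maps elapsed time to a chest/blendshape
// weight in [0, 1]. The default rate is an assumption for demonstration.
function breathingWeight(timeSec: number, breathsPerMinute = 14): number {
  const omega = (2 * Math.PI * breathsPerMinute) / 60; // angular frequency (rad/s)
  return 0.5 + 0.5 * Math.sin(omega * timeSec);        // oscillates within [0, 1]
}
```

In practice a curve like this would be sampled every render frame and layered with noise-driven micro-twitches before being written to the avatar's morph targets.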

3. Deep Multimodal Awareness

  • Real-Time Vision (Webcam Streaming): Continuously captures the user's environment via webcam (Base64 JPEG @ ~1 FPS), allowing Gemini to literally "see" you and ground answers in visual context.
  • Dual Audio/Text Transcriptions: Real-time transcripts of both the user's spoken words and the model's generated audio are streamed and displayed simultaneously in the UI for seamless accessibility.
  • Google Search Grounding: Gives the avatar access to the live internet to answer questions with real-time, factual accuracy while it speaks.
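The ~1 FPS webcam cadence implies a simple time gate in the capture loop: the loop may run every animation frame, but a frame is only encoded and sent once enough time has elapsed. A minimal sketch (the function name and default interval are assumptions):

```typescript
// Hypothetical frame-rate gate for the webcam capture loop: returns true
// only when at least `intervalMs` has passed since the last frame was sent.
function shouldSendFrame(nowMs: number, lastSentMs: number, intervalMs = 1000): boolean {
  return nowMs - lastSentMs >= intervalMs;
}
```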

4. The Puppeteer (Tool Calling Orchestration)

  • Unified State Control: The "Nervous System" exposes consolidated tools like update_persona_state and enum-locked trigger_animation directly over WebSocket.
  • Synchronized Physicality: The Gemini model dynamically maps conversational intent and sentiment to ARKit expressions and gestural triggers concurrently with audio delivery.
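Enum-locking trigger_animation means validating the requested clip name against a closed set before it ever reaches the renderer. An illustrative sketch with made-up animation names (the real tool surface and trigger list live in the app's config):

```typescript
// Hypothetical closed set of animation triggers; a tool call naming anything
// outside this list is rejected instead of reaching the Three.js layer.
const ANIMATION_TRIGGERS = ["wave", "point", "nod", "shrug"] as const;
type AnimationTrigger = (typeof ANIMATION_TRIGGERS)[number];

interface ToolCall {
  name: string;
  args: Record<string, unknown>;
}

function handleToolCall(call: ToolCall): { ok: boolean; error?: string } {
  if (call.name === "trigger_animation") {
    const trigger = call.args["animation"];
    if (!ANIMATION_TRIGGERS.includes(trigger as AnimationTrigger)) {
      return { ok: false, error: `unknown animation: ${String(trigger)}` };
    }
    // In the real app this would fire the matching Three.js animation clip.
    return { ok: true };
  }
  return { ok: false, error: `unknown tool: ${call.name}` };
}
```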

5. 🎨 Extensive UI & Control Panel

A glassmorphism "Control Center" with granular runtime configurations:

  • Floating Chatbox — Live transcript feed (user + avatar) with API debugging logs.
  • Persona Modes — Switch the system prompt dynamically (Tutor, Therapist, Interviewer).
  • Gemini Toggles — Enable/disable Proactive Audio, Search Grounding, and Affective Dialog.

πŸ—οΈ Architecture

The system uses an Ephemeral Token proxy to establish a direct, low-latency WebSocket connection to the Gemini Multimodal Live API from the browser, bypassing heavy backend routing.

graph TD
    %% Styling Definition
    classDef client fill:#f3f4f6,stroke:#374151,stroke-width:2px,color:#111827
    classDef backend fill:#e0e7ff,stroke:#4338ca,stroke-width:2px,color:#312e81
    classDef gemini fill:#dbeafe,stroke:#1d4ed8,stroke-width:2px,color:#1e3a8a
    classDef component fill:#ffffff,stroke:#9ca3af,stroke-width:1px,color:#374151,rx:8px,ry:8px

    subgraph Client ["🖥️ Next.js Frontend (Browser)"]
        A["🎙️ Webcam & Mic"]:::component
        B["⚡ useGeminiLive Hook"]:::component
        C["🎨 React Three Fiber Canvas"]:::component
        D["👤 Ready Player Me Avatar"]:::component
        
        A --> B
        B --> C
        C --> D
    end

    subgraph Backend ["☁️ Google Cloud Platform"]
        E["🔐 Next.js Route /api/token"]:::component
        F["🧠 Gemini 2.5 Flash Native Audio"]:::gemini
        
        E --> F
    end

    %% Add styling to subgraphs
    class Client client
    class Backend backend

    %% Security & Connection Logic
    Client -->|"1️⃣ Request Token"| E
    E -->|"2️⃣ Ephemeral Token"| Client
    Client <===>|"3️⃣ Bi-directional WSS"| F

🔄 Real-Time Interaction Sequence (VAD & Tool Calling)

This sequence illustrates the sub-100ms latency loop with native interruption handling.

sequenceDiagram
    autonumber
    
    participant User
    participant Client as Frontend Client (useGeminiLive)
    participant Avatar as R3F Avatar
    participant API as Auth API (/api/token)
    participant Gemini as Gemini 2.5 Live

    %% Authentication Phase
    rect rgb(245,245,245)
    Note over Client,API: Authentication Phase
    Client->>API: POST /api/token
    API-->>Client: 200 OK (ephemeral JWT)
    end

    %% Realtime Session Setup
    rect rgb(230,240,255)
    Note over Client,Gemini: Realtime WebSocket Session
    Client->>Gemini: Establish Live Connection
    end

    %% Multimodal User Interaction
    par Audio & Video Streaming
        User->>Client: Speak: "What am I holding?"
        Client->>Gemini: sendRealtimeInput (16kHz PCM audio stream)
    and
        Client->>Gemini: sendRealtimeInput (Base64 JPEG @ 1fps)
    and Text Chat Fallback
        User->>Client: Types: "Explain quantum physics."
        Client->>Gemini: send({ clientContent: { text } })
    end

    Note right of Gemini: Multimodal processing (Audio + Video + Text)

    Gemini-->>Client: serverContent (audio response stream)
    Gemini-->>Client: toolCall(trigger_animation: point)

    %% Parallel Processing
    par Avatar Rendering
        Client->>Avatar: Stream audio + visemes
        Client->>Avatar: Trigger animation "point"
    and Tool Response
        Client->>Gemini: sendToolResponse(result: ok)
    end

    Avatar-->>User: Avatar gestures and speaks

    %% Barge-in / Interruption
    rect rgb(255,240,240)
    Note over User,Gemini: Interruption Handling (Barge-in)
    User->>Client: "Stop, that's enough"
    Client->>Gemini: Stream new audio input
    Gemini-->>Client: serverContent(interrupted=true)
    Client->>Avatar: Drop current audio queue
    end

🛠️ Technology Stack

Layer              Technologies
Framework          Next.js 16 • React 19 • TypeScript • TanStack Query
3D Rendering       Three.js • React Three Fiber • Drei • Three-stdlib
AI Core            @google/genai SDK • Gemini 2.5 Flash • wawa-lipsync • Phonemize
Audio Pipeline     16-bit PCM • 16kHz mono • WebSocket streaming
State Management   Zustand • React Context
UI / Styling       Tailwind CSS v4 • Framer Motion • Radix UI • Lucide • CVA
Utilities          Pino (Structured Logging) • Sentiment • Simplex-noise • Usehooks-ts
Testing            Vitest • Testing Library • Happy DOM • JSDOM
Deployment         Docker • Cloud Run • GitHub Actions • GitHub Pages
Documentation      VitePress • Mermaid

⚙️ Getting Started

Prerequisites

  • Node.js v20 or later
  • A Google Cloud Project with the Gemini API enabled
  • A GEMINI_API_KEY (available from Google AI Studio)

Installation

git clone https://github.com/Kshitijm7/digital-persona.git
cd digital-persona
npm install

Configuration

Create a .env.local file in the project root:

GEMINI_API_KEY=your_gemini_api_key_here

# Logging (local/dev)
LOG_LEVEL_DEV=debug
NEXT_PUBLIC_LOG_LEVEL_DEV=debug

# Optional generic overrides
# LOG_LEVEL=debug
# NEXT_PUBLIC_LOG_LEVEL=debug

# Production controls (used in deploy pipeline)
# DISABLE_LOGS_IN_PROD=true
# LOG_LEVEL_PROD=silent
# NEXT_PUBLIC_LOG_LEVEL_PROD=silent

Running Locally

# Validate the avatar model and animations
npm run setup-avatar

# Start the development server
npm run dev

Navigate to http://localhost:3000. Grant Microphone & Camera permissions when prompted.

Available Scripts

Script                  Description
npm run dev             Start development server
npm run build           Production build
npm run lint            Run ESLint
npm run typecheck       TypeScript type checking
npm run test:run        Run test suite
npm run preflight       Lint → Typecheck → Test → Build (full CI locally)
npm run docs:build      Build VitePress documentation
npm run health-check    Pre-deploy system validation

🐳 Docker Deployment

The app ships with a multi-stage Dockerfile optimized for Cloud Run:

# Build the image
docker build -t digital-persona .

# Run locally
docker run -p 8080:8080 -e GEMINI_API_KEY=your_key digital-persona

The CI/CD pipeline automatically builds, pushes to Artifact Registry, and deploys to Cloud Run on every push to main.


📂 Project Structure

digital-persona/
├── src/
│   ├── app/           # Next.js pages and API routes
│   ├── components/    # React & Three.js components
│   ├── hooks/         # Custom hooks (useGeminiLive, useWebcam, etc.)
│   ├── store/         # Zustand stores (lip-sync, emotion, animation)
│   ├── lib/           # Constants, utilities, and shared logic
│   └── config/        # Runtime configuration
├── public/            # Avatar .glb, animations, and static assets
├── docs/              # VitePress documentation site
├── scripts/           # Automation (avatar validation, animation download)
└── project_docs/      # Internal architecture documents

Note

See the individual subdirectory README.md files for detailed internal documentation.


🤝 Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/my-feature)
  3. Make your changes and add tests
  4. Run the preflight check: npm run preflight
  5. Submit a Pull Request

See the issues log for areas where help is needed.


📄 License

This project is licensed under the Apache License 2.0 — see the LICENSE file for details.


Built with ❀️ by Kshitij Mittal with the Google Gemini Live API