Real-time AI physical therapy coaching powered by Vision Language Models and MediaPipe — fully local, no cloud required.
PhysioCoach watches you exercise through your webcam, analyzes your form using a local VLM, counts your reps using pose estimation, and speaks coaching cues aloud in real time. Everything runs on-device.
Built at the Dell × NVIDIA Hackathon 2026 by NYU students. Reached Top-8 out of 30 teams from NYU CDS.
PhysioCoach is built on top of NVIDIA's open-source Live VLM WebUI (Apache-2.0). That project provides the foundation we started from:
- the WebRTC + WebSocket streaming server (
server.py), - the OpenAI-compatible VLM client (
vlm_service.py), - GPU/system monitoring (
gpu_monitor.py) and RTSP camera support (rtsp_track.py).
On top of that base, our team built the physical-therapy layer that makes PhysioCoach: pose-based rep counting, range-of-motion measurement, the exercise library, the coaching/feedback pipeline, and the dual-camera mode. See What We Built for the file-level breakdown, and NOTICE for attribution details. NVIDIA copyright headers and the Apache-2.0 LICENSE are retained throughout.
- 📷 Streams your webcam via WebRTC at 30fps from any browser
- 🤖 Analyzes your form using Qwen2.5-VL-7B running locally (~800ms per frame)
- 🦴 Counts your reps by tracking joint angles with MediaPipe Pose
- 🔊 Speaks coaching cues aloud via browser Text-to-Speech
- 📐 Estimates ROM (range-of-motion) joint angles to track progress
- 📷 Supports dual cameras — front and side view simultaneously
The modules below are PhysioCoach's original contribution — the physical-therapy layer on top of NVIDIA's streaming base:
| Module | What it does | Origin |
|---|---|---|
pose_detector.py |
MediaPipe Pose wrapper: 33 landmarks → joint angles → rep-counting state machine, ROM angles, skeleton overlay | Ours |
exercise_library.py |
15 exercise definitions: joint triplets, rep thresholds, ROM targets, and per-exercise VLM prompt templates | Ours |
session_manager.py |
Session persistence (SQLite), rep-count state, progress tracking | Ours |
video_processor.py |
Per-frame pipeline running the VLM coaching and MediaPipe pose/ROM paths in parallel | Heavily modified |
vlm_service.py |
Structured PT-coaching prompts + JSON feedback parsing | Modified |
static/index.html |
PhysioCoach browser UI: exercise picker, live ROM cards, rep counter, dual-camera grid, TTS | Ours |
server.py, gpu_monitor.py, rtsp_track.py |
WebRTC/VLM server, monitoring, RTSP | NVIDIA (inherited) |
| Category | Exercises |
|---|---|
| Lower body | Bodyweight Squat, Forward Lunge, Calf Raise, Seated Knee Extension, Side-Lying Leg Raise, Standing Hip Abduction, Seated Tennis Ball Squeeze |
| Upper body | Wall Push-Up, Shoulder Raise, Bicep Curl, Hand Tennis Ball Squeeze, Wall Slide with Towel, Seated Water Bottle Overhead Press |
| Stretch | Neck Rotation |
| General | Auto-detect mode — the AI identifies the exercise automatically |
All exercises are defined in exercise_library.py.
Webcam (30fps)
│
▼
WebRTC stream → server.py → video_processor.py
│
├──► Every 15 frames (coaching mode) → Qwen2.5-VL-7B (local)
│ └── JSON coaching cue → natural-language feedback → TTS spoken aloud
│
└──► Every 3rd frame → MediaPipe Pose
└── 33 landmarks → joint angle → threshold crossing → rep count + ROM
The VLM and MediaPipe run in parallel — pose tracking never waits for VLM inference. Pose detection is cheap (~5–15ms on CPU) and runs continuously; the heavier VLM call is throttled to every 15th frame in coaching mode.
src/live_vlm_webui/
├── server.py # WebRTC + WebSocket server (aiohttp + aiortc) [NVIDIA base]
├── video_processor.py # Per-frame pipeline: VLM calls + pose, in parallel
├── vlm_service.py # OpenAI-compatible VLM API client
├── pose_detector.py # MediaPipe Pose: landmarks, joint angles, rep counter, ROM, skeleton overlay
├── exercise_library.py # Exercise definitions, joint configs, ROM targets, VLM prompt templates
├── session_manager.py # Session persistence + rep-counting state
├── gpu_monitor.py # GPU/CPU/RAM monitoring [NVIDIA base]
├── rtsp_track.py # RTSP / IP-camera video track [NVIDIA base]
└── static/index.html # Browser frontend (exercise UI, ROM cards, TTS)
Note: the internal Python import package is still named
live_vlm_webui(kept to preserve the upstream module paths and git history). The installable distribution is namedphysiocoach.
MediaPipe Pose detects 33 body landmarks. For each exercise, a specific 3-joint triplet is tracked and the angle is measured at the middle joint:
| Exercise | Joint triplet (angle at middle joint) | Down threshold | Up threshold |
|---|---|---|---|
| Bodyweight Squat | hip → knee → ankle | 100° | 155° |
| Bicep Curl | shoulder → elbow → wrist | 50° | 140° |
| Calf Raise | hip → knee → ankle | 160° | 172° |
| Shoulder Raise | hip → shoulder → wrist | 30° | 70° |
A rep is counted when the tracked angle passes through both thresholds, completing one full movement cycle. Each exercise has its own joint config and thresholds defined in exercise_library.py.
For shoulder and elbow exercises, the active arm is auto-detected each frame by comparing which wrist is raised or which elbow is more bent. Some exercises (e.g. Neck Rotation, Tennis Ball Squeeze) are tracked by ROM angle or VLM feedback rather than threshold-based rep counting.
We tested four models before choosing Qwen2.5-VL-7B:
| Model | Latency | Result |
|---|---|---|
llama3.2-vision:11b |
4–8s | Too slow for real-time |
llama3.2-vision:90b |
60s+ | OOM |
qwen2.5vl:32b |
— | OOM |
qwen2.5vl:7b |
~800ms | ✅ Used |
The prompt went through three iterations — the final version removes all fallback phrases so the model always comments on what it actually sees.
- Python 3.10+
- Ollama with
qwen2.5vl:7bpulled - A webcam accessible from your browser
git clone https://github.com/deepanshumody/physiocoach.git
cd physiocoach
python3 -m venv .venv
source .venv/bin/activate
pip install -e .ollama pull qwen2.5vl:7b
ollama serve./scripts/start_server.shOpen https://localhost:8090 in your browser. Accept the self-signed certificate warning (Advanced → Proceed), then grant camera access.
- Select an exercise from the dropdown (or leave on General for auto-detection)
- Click Start — the AI begins analyzing your form every ~15 frames
- Listen to coaching cues — spoken aloud via your browser
- Watch the rep counter — increments automatically as you move
- Check ROM angles — displayed live on the video overlay
For exercises where front and side views both matter:
- Connect a second webcam (or use a phone as a second camera)
- Enable Dual Camera in the UI
- The AI receives both feeds and gives form feedback with full 3D context
| Package | Purpose |
|---|---|
aiortc |
WebRTC implementation |
aiohttp |
Async HTTP + WebSocket server |
mediapipe |
Pose landmark detection |
opencv-python |
Frame processing, skeleton overlay |
openai |
OpenAI-compatible VLM API client |
nvidia-ml-py / psutil |
GPU + system monitoring |
Built by Deepanshu Mody, Taruni Nugooru, and Anagha Palandye — NYU Center for Data Science — at the Dell × NVIDIA Hackathon, February 2026. It was a close, hands-on collaboration; the rough split:
- Deepanshu Mody — Real-time pipeline and integration: the pose-based rep-counting engine, the dual-camera WebRTC relay, VLM coaching-prompt engineering, and tying the components together.
- Taruni Nugooru — The range-of-motion (ROM) system end to end: per-exercise joint auto-detection, live on-video and sidebar angle readouts with patient-friendly "degrees-to-go" guidance, the MediaPipe skeleton overlay, active-arm auto-detection for upper-body exercises, and the responsive coaching UI.
- Anagha Palandye — Clinical and exercise design: the physical-therapy exercise library (form criteria, common mistakes, and target ROM angles for every exercise), coaching-feedback UX, cross-exercise testing, and the demo and final presentation.
