A voice-controlled desktop application for Windows, macOS, and Linux — open apps, close browser tabs, run scripts, and ask AI questions, all by speaking
"The best interface is no interface."
Desktop launchers have existed for decades. Every major operating system ships one. Not one of them solves the problem correctly. Spotlight requires typing. PowerToys Run requires typing. Albert and Rofi require typing. They reduce mouse dependency but do not eliminate it, and none of them can close a specific browser tab, execute a developer script, or answer a question — all of which are things developers do dozens of times a day.
Universal AI Launcher is a voice-controlled desktop application built natively for Windows, macOS, and Linux. The user presses a single hotkey, speaks a command, and the application executes it — opening applications, closing individual browser tabs without touching the browser itself, running custom scripts, or querying an AI assistant — then confirms the result aloud. The entire critical path works completely offline. AI features are additive, not required.
This is not an Electron application wrapped in a shell. It is a native Python application compiled to a standalone executable for each platform, with no runtime dependencies required from the end user.
Three decisions separate this project from every other launcher in the space. First, the entire interaction model is voice-first — not voice-added. Every component was designed for spoken commands from the ground up. Second, the application is tab-aware: saying "close YouTube" closes the YouTube tab, not the browser. This requires connecting to the browser's internal debug protocol, and no other open-source launcher implements it. Third, each platform — Windows, macOS, Linux — has its own dedicated codebase using the correct native mechanisms for that platform. Nothing is cross-platform by assumption.
Universal AI Launcher is built for developers. Specifically, developers who spend their working day switching between tools, running scripts, and managing browser sessions — and who want to do all of it without touching a mouse or breaking their flow.
Current status: This repository is the complete theoretical and architectural foundation of Universal AI Launcher. The application has not yet been built. Every design decision, module responsibility, platform-specific implementation detail, and failure mode is fully documented here. When the application is built, it will be published as a versioned GitHub Release attached to this repository.
Universal AI Launcher ships in two versions.
Version 1 — Core Launcher is a completely offline, voice-controlled desktop launcher. It opens and closes applications, manages browser tabs, and provides spoken feedback — with no AI, no external APIs, and no network dependency of any kind.
Version 2 — Developer Edition inherits everything from Version 1 and adds GPT-4o for AI question answering, Sarvam AI for human-quality voice responses, custom script execution bound to voice commands, multi-app command groups, and a full settings panel with encrypted API key storage.
| Feature | V1 Core | V2 Developer |
|---|---|---|
| Global hotkey activation | ✅ | ✅ |
| Offline voice recognition (Vosk) | ✅ | ✅ |
| Open installed applications | ✅ | ✅ |
| Close running applications | ✅ | ✅ |
| Open websites in new browser tab | ✅ | ✅ |
| Close specific browser tabs | ✅ | ✅ |
| Robotic TTS voice feedback (pyttsx3) | ✅ | ✅ |
| Visual waveform during listening | ✅ | ✅ |
| GPT-4o AI assistant | ❌ | ✅ |
| Sarvam AI human voice output | ❌ | ✅ |
| Custom script upload (.py and .bat) | ❌ | ✅ |
| Voice command groups | ❌ | ✅ |
| Settings and customisation panel | ❌ | ✅ |
| Encrypted API key storage | ❌ | ✅ |
| Configurable hotkey | ❌ | ✅ |
The user presses the global hotkey — Ctrl+Space on Windows and Linux, Cmd+Shift+Space on macOS. The launcher window appears at the centre of the screen and the microphone activates. The user speaks a command. Vosk transcribes it offline. The Intent Parser classifies it as one of four types — OPEN, CLOSE, RUN, or ASK — and routes it to the correct handler. The result is spoken aloud and the window closes after three seconds.
Figure 1 — The complete command pipeline from hotkey activation to spoken confirmation, showing all four intent paths and the shared input layer.
The four command types work as follows. An OPEN command first checks whether the target is a website (by domain pattern) or an installed application (by name match), then either opens a new browser tab or launches the application — focusing it instead if it is already running. A CLOSE command checks whether the target is an application or a website tab, then either terminates the process or closes matching tabs via the Chrome DevTools Protocol without touching the browser itself. A RUN command (Developer Edition) looks up the spoken phrase in commands.json, executes the bound script silently in the background, and speaks any captured output. An ASK command (Developer Edition) sends the question to GPT-4o — with real system data injected for PC-related questions — and speaks the response in Sarvam AI's human voice.
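The classification step above can be sketched in a few lines. This is a minimal illustration of a rule-based parser, not the actual `shared/intentParser.py` module — the keyword sets, the domain heuristic, and the function names are assumptions for this example:

```python
import re

# Illustrative keyword sets — the real parser's vocabulary lives in
# shared/intentParser.py per the architecture docs.
OPEN_WORDS = {"open", "launch", "start"}
CLOSE_WORDS = {"close", "quit", "kill"}
RUN_WORDS = {"run", "execute"}
ASK_WORDS = {"ask", "what", "how", "why", "who"}

# Crude check for targets that look like web domains.
DOMAIN_PATTERN = re.compile(r"\b\w+\.(com|org|net|io|dev)\b")

def parse_intent(text: str) -> tuple[str, str]:
    """Classify a spoken command as (intent, target)."""
    words = text.lower().strip().split()
    if not words:
        return ("UNKNOWN", "")
    head, rest = words[0], " ".join(words[1:])
    if head in OPEN_WORDS:
        return ("OPEN", rest)
    if head in CLOSE_WORDS:
        return ("CLOSE", rest)
    if head in RUN_WORDS:
        return ("RUN", rest)
    if head in ASK_WORDS:
        # ASK keeps the full sentence — it is the question itself.
        return ("ASK", text.lower().strip())
    return ("UNKNOWN", text.lower().strip())

def is_website(target: str) -> bool:
    """Heuristic used by OPEN/CLOSE to pick the browser path vs. the app path."""
    return bool(DOMAIN_PATTERN.search(target))
```

An OPEN or CLOSE result would then be routed through `is_website` to decide between the browser-tab handler and the application handler, exactly as the flow above describes.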
Each platform has its own source directory. The shared module contains voice engine, intent parser, and audio output logic that is identical across all three. Platform-specific modules handle app scanning, process management, window focus, hotkey registration, and browser control using the correct native mechanisms for each operating system.
| Capability | Windows | macOS | Linux |
|---|---|---|---|
| App scanning | Registry + Start Menu `.lnk` files | `/Applications` + Spotlight | `.desktop` files + Flatpak + Snap |
| Launch application | `subprocess` | `open -a` command | `subprocess` |
| Focus running app | `win32gui` | AppleScript | `xdotool` |
| Close application | `taskkill` | `osascript` quit + `pkill` | `pkill` |
| Browser tab control | CDP via `pychrome` | CDP via `pychrome` | CDP via `pychrome` |
| Global hotkey | `keyboard` library | `pynput` | `pynput` (X11) / `ydotool` (Wayland) |
| TTS voice engine | SAPI5 via `pyttsx3` | `say` command via `pyttsx3` | `espeak-ng` via `pyttsx3` |
| Packaged output | `.exe` | `.app` | `.AppImage` |
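The tab-control row is the same on every platform because it targets Chrome's DevTools Protocol rather than the OS. As an illustration of how "close YouTube" can close one tab, the sketch below talks to the protocol's plain HTTP endpoints directly (the spec proposes `pychrome`; this stdlib-only version assumes the browser was launched with `--remote-debugging-port=9222`):

```python
import json
import urllib.request

DEBUG_URL = "http://127.0.0.1:9222"  # assumes --remote-debugging-port=9222

def match_tabs(tabs: list[dict], target: str) -> list[dict]:
    """Return the open page tabs whose URL or title mentions the spoken target."""
    target = target.lower()
    return [
        t for t in tabs
        if t.get("type") == "page"
        and (target in t.get("url", "").lower() or target in t.get("title", "").lower())
    ]

def close_matching_tabs(target: str) -> int:
    """Close every tab matching `target` without touching the browser process."""
    # GET /json lists all open targets with their id, url, and title.
    with urllib.request.urlopen(f"{DEBUG_URL}/json") as resp:
        tabs = json.load(resp)
    matches = match_tabs(tabs, target)
    for tab in matches:
        # GET /json/close/<targetId> closes exactly one tab.
        urllib.request.urlopen(f"{DEBUG_URL}/json/close/{tab['id']}").read()
    return len(matches)
```

Keeping `match_tabs` pure means the matching logic can be tested without a running browser; only `close_matching_tabs` needs the live debug port.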
| Layer | Technology | Purpose |
|---|---|---|
| Voice input | Vosk | Offline speech recognition — no internet required |
| Intent classification | Custom rule-based parser | OPEN / CLOSE / RUN / ASK detection |
| App discovery | `winreg`, `plistlib`, `configparser` | Platform-native app scanning |
| Process management | `psutil` | Cross-platform process detection |
| App launch and close | `subprocess`, `win32gui`, AppleScript, `xdotool` | Platform-appropriate app control |
| Browser tab control | `pychrome` (Chrome DevTools Protocol) | Tab-level browser control |
| Hotkey registration | `keyboard` (Windows), `pynput` (macOS/Linux), `ydotool` (Wayland) | Global keyboard shortcut |
| Voice output (V1) | `pyttsx3` | Offline robotic TTS feedback |
| AI assistant (V2) | OpenAI GPT-4o | Natural language question answering |
| Human voice (V2) | Sarvam AI TTS API | Human-quality voice responses |
| API key storage (V2) | `cryptography` (Fernet) + `keyring` | Encrypted key storage in OS keychain |
| UI | PyQt6 | Native cross-platform desktop UI |
| Packaging | PyInstaller | Single executable for each platform |
```
universal-ai-launcher/
│
├── README.md                          # This file
├── CHANGELOG.md                       # Version history
├── LICENSE                            # MIT License
│
├── docs/
│   ├── 01-overview.md                 # Project overview and version comparison
│   ├── 02-architecture.md             # Full system architecture and module specs
│   ├── 03-version-roadmap.md          # Staged delivery plan with validation criteria
│   ├── 04-how-it-works.md             # Complete operational flow for all command types
│   ├── 05-windows.md                  # Windows-specific implementation
│   ├── 06-macos.md                    # macOS-specific implementation
│   ├── 07-linux.md                    # Linux-specific implementation
│   ├── 08-ai-layer.md                 # Developer Edition AI features
│   ├── 09-developer-edition.md        # Scripts, command groups, and settings panel
│   └── 10-fallbacks-and-solutions.md  # All 20 failure modes with exact responses
│
├── shared/
│   ├── voiceEngine.py                 # Vosk base class
│   ├── intentParser.py                # Intent classification logic
│   └── audioOutput.py                 # Audio queue management
│
├── windows/src/                       # Windows implementation
├── macos/src/                         # macOS implementation
├── linux/src/                         # Linux implementation
│
└── assets/
    └── flowcharts/                    # Architecture and flow diagrams
```
The complete specification is in the docs/ directory. The recommended reading order for a new contributor is: overview → architecture → how it works → your target platform → fallbacks. The version roadmap and developer edition documents are relevant when extending the application beyond Version 1.
| Document | What It Covers |
|---|---|
| `01-overview.md` | What the project is, who it is for, version comparison |
| `02-architecture.md` | Three-layer architecture, all module specifications, threading model |
| `03-version-roadmap.md` | Nine-stage delivery plan with validation criteria per stage |
| `04-how-it-works.md` | Full command flow trace for OPEN, CLOSE, RUN, and ASK |
| `05-windows.md` | Registry scanning, win32gui, taskkill, CDP, PyInstaller on Windows |
| `06-macos.md` | Permissions, plistlib, AppleScript, pynput, Gatekeeper on macOS |
| `07-linux.md` | `.desktop` parsing, Wayland/X11 hotkey, espeak, AppImage on Linux |
| `08-ai-layer.md` | GPT-4o integration, Sarvam TTS, encrypted key storage, all AI failure modes |
| `09-developer-edition.md` | Script manager, command groups, settings panel, complete feature reference |
| `10-fallbacks-and-solutions.md` | All 20 failure modes — cause, detection, and exact user-facing response |
This project is currently in the architecture specification phase. No code has been written. The documentation in this repository is the complete pre-build specification.
The build plan is structured in nine stages across two versions. Each stage has a specific validation criterion — a concrete, observable outcome that confirms the stage is complete.
Version 1 — Core Launcher (Stages 1–5)
- Stage 1 — Voice Input and Hotkey: global hotkey activates Vosk; spoken text is returned correctly on all three platforms.
- Stage 2 — App Discovery and Control: apps open, focus, and close by voice on all three platforms.
- Stage 3 — Browser Tab Control: websites open in new tabs; specific tabs close without closing the browser.
- Stage 4 — Error Handling: all 20 failure modes produce their defined spoken responses with no crashes.
- Stage 5 — Packaging: a single executable for each platform passes all validation criteria on a clean machine.
Version 2 — Developer Edition (Stages 6–9)
- Stage 6 — AI Assistant and Human Voice: GPT-4o answers questions via the Sarvam AI voice, with pyttsx3 fallback.
- Stage 7 — Custom Script Execution: scripts upload, bind, execute, and return spoken output correctly.
- Stage 8 — Settings Panel: all configurable options apply instantly without restart.
- Stage 9 — V2 Packaging: the Developer Edition executable passes all V1 and V2 validation criteria.
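Stage 7's voice-to-script binding could look like the sketch below. The `commands.json` shape (spoken phrase mapped to a script path), the 60-second timeout, and the function names are assumptions for illustration, not the documented schema:

```python
import json
import subprocess
import sys
from pathlib import Path

# Assumed commands.json shape: { "spoken phrase": "path/to/script.py" }
def load_bindings(path: Path) -> dict[str, str]:
    """Load the phrase-to-script map, tolerating a missing file."""
    return json.loads(path.read_text()) if path.exists() else {}

def run_bound_script(phrase: str, bindings: dict[str, str]) -> str:
    """Execute the script bound to a spoken phrase; return text for TTS."""
    script = bindings.get(phrase.lower().strip())
    if script is None:
        return f"No script is bound to '{phrase}'."
    if script.endswith(".py"):
        cmd = [sys.executable, script]  # run .py with the current interpreter
    else:
        cmd = [script]  # .bat on Windows; executable script elsewhere
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=60)
    output = result.stdout.strip()
    return output if output else "Script finished with no output."
```

Returning the captured stdout as a plain string keeps the handler decoupled from the audio layer: whatever text comes back is simply handed to the TTS queue, matching the "executes silently, speaks any captured output" behaviour described for RUN commands.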
MIT License. See LICENSE for full terms.
Document Version: 1.0 — April 2026
Status: Architecture specification — application not yet built
Contributions, questions, and challenges to the architecture are welcome via GitHub Issues.