
Universal AI Launcher

A voice-controlled desktop application for Windows, macOS, and Linux — open apps, close browser tabs, run scripts, and ask AI questions, all by speaking


"The best interface is no interface."


1. Introduction — What This Project Is

Desktop launchers have existed for decades. Every major operating system ships one. Not one of them solves the problem correctly. Spotlight requires typing. PowerToys Run requires typing. Albert and Rofi require typing. They reduce mouse dependency but do not eliminate it, and none of them can close a specific browser tab, execute a developer script, or answer a question — all of which are things developers do dozens of times a day.

Universal AI Launcher is a voice-controlled desktop application built natively for Windows, macOS, and Linux. The user presses a single hotkey, speaks a command, and the application executes it — opening applications, closing individual browser tabs without touching the browser itself, running custom scripts, or querying an AI assistant — then confirms the result aloud. The entire critical path works completely offline. AI features are additive, not required.

This is not an Electron application wrapped in a shell. It is a native Python application compiled to a standalone executable for each platform, with no runtime dependencies required from the end user.


What Makes This Different

Three decisions separate this project from every other launcher in the space. First, the entire interaction model is voice-first — not voice-added. Every component was designed for spoken commands from the ground up. Second, the application is tab-aware: saying "close YouTube" closes the YouTube tab, not the browser. This requires connecting to the browser's internal debug protocol, and no other open-source launcher implements it. Third, each platform — Windows, macOS, Linux — has its own dedicated codebase using the correct native mechanisms for that platform. Nothing is cross-platform by assumption.


Who This Is For

Universal AI Launcher is built for developers. Specifically, developers who spend their working day switching between tools, running scripts, and managing browser sessions — and who want to do all of it without touching a mouse or breaking their flow.


What This Repository Contains

Current status: This repository is the complete theoretical and architectural foundation of Universal AI Launcher. The application has not yet been built. Every design decision, module responsibility, platform-specific implementation detail, and failure mode is fully documented here. When the application is built, it will be published as a versioned GitHub Release attached to this repository.


2. Version Overview

Universal AI Launcher ships in two versions.

Version 1 — Core Launcher is a completely offline, voice-controlled desktop launcher. It opens and closes applications, manages browser tabs, and provides spoken feedback — with no AI, no external APIs, and no network dependency of any kind.

Version 2 — Developer Edition inherits everything from Version 1 and adds GPT-4o for AI question answering, Sarvam AI for human-quality voice responses, custom script execution bound to voice commands, multi-app command groups, and a full settings panel with encrypted API key storage.
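The encrypted key storage mentioned above can be sketched with Fernet symmetric encryption. This is an illustrative sketch only, not the project's actual implementation: in the documented design the Fernet key itself would live in the OS keychain via `keyring`, and the sample API key string below is a placeholder.

```python
from cryptography.fernet import Fernet

# Generate a symmetric key; the real design would persist this in the
# OS keychain through `keyring` rather than keeping it in memory.
fernet_key = Fernet.generate_key()
f = Fernet(fernet_key)

# Encrypt the API key before it ever touches disk (placeholder value).
token = f.encrypt(b"sk-example-api-key")

# Decrypt on demand when the AI layer needs the key.
plaintext = f.decrypt(token)
```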

| Feature | V1 Core | V2 Developer |
| --- | --- | --- |
| Global hotkey activation | ✓ | ✓ |
| Offline voice recognition (Vosk) | ✓ | ✓ |
| Open installed applications | ✓ | ✓ |
| Close running applications | ✓ | ✓ |
| Open websites in new browser tab | ✓ | ✓ |
| Close specific browser tabs | ✓ | ✓ |
| Robotic TTS voice feedback (pyttsx3) | ✓ | ✓ |
| Visual waveform during listening | ✓ | ✓ |
| GPT-4o AI assistant | | ✓ |
| Sarvam AI human voice output | | ✓ |
| Custom script upload (.py and .bat) | | ✓ |
| Voice command groups | | ✓ |
| Settings and customisation panel | | ✓ |
| Encrypted API key storage | | ✓ |
| Configurable hotkey | | ✓ |

3. How It Works

The user presses the global hotkey — Ctrl+Space on Windows and Linux, Cmd+Shift+Space on macOS. The launcher window appears at the centre of the screen and the microphone activates. The user speaks a command. Vosk transcribes it offline. The Intent Parser classifies it as one of four types — OPEN, CLOSE, RUN, or ASK — and routes it to the correct handler. The result is spoken aloud and the window closes after three seconds.

Main command flow from hotkey to spoken confirmation
Figure 1 — The complete command pipeline from hotkey activation to spoken confirmation, showing all four intent paths and the shared input layer.
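The intent classification step described above could be sketched as a small rule-based parser. The keywords and the fall-through to ASK are assumptions for illustration; the project's actual grammar lives in `intentParser.py` and is not reproduced here.

```python
import re

# Illustrative keyword rules — not the project's actual command grammar.
INTENT_RULES = [
    ("OPEN",  re.compile(r"^(open|launch|start)\b")),
    ("CLOSE", re.compile(r"^(close|quit|kill|stop)\b")),
    ("RUN",   re.compile(r"^run\b")),
]

def classify(command: str) -> tuple[str, str]:
    """Return (intent, target). Anything unmatched falls through to ASK."""
    text = command.strip().lower()
    for intent, pattern in INTENT_RULES:
        match = pattern.match(text)
        if match:
            return intent, text[match.end():].strip()
    return "ASK", text
```

A rule-based parser like this keeps the critical path fully offline: no model inference is needed to decide where a command is routed.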

The four command types work as follows. An OPEN command first checks whether the target is a website (by domain pattern) or an installed application (by name match), then either opens a new browser tab or launches the application — focusing it instead if it is already running. A CLOSE command checks whether the target is an application or a website tab, then either terminates the process or closes matching tabs via the Chrome DevTools Protocol without touching the browser itself. A RUN command (Developer Edition) looks up the spoken phrase in commands.json, executes the bound script silently in the background, and speaks any captured output. An ASK command (Developer Edition) sends the question to GPT-4o — with real system data injected for PC-related questions — and speaks the response in Sarvam AI's human voice.
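The tab-matching half of a CLOSE command might look like the sketch below. The dictionaries mimic the shape of Chrome's DevTools `/json` tab list, and the match rule (domain or title substring) is an assumption, not the documented algorithm; with `pychrome`, each returned id could then be passed to the browser's close-tab endpoint.

```python
from urllib.parse import urlparse

def tabs_matching(tabs: list[dict], spoken_target: str) -> list[str]:
    """Return ids of page tabs whose domain or title mentions the target."""
    target = spoken_target.lower()
    matched = []
    for tab in tabs:
        # Skip extensions and service workers — only real pages qualify.
        if tab.get("type") != "page":
            continue
        host = urlparse(tab.get("url", "")).hostname or ""
        if target in host or target in tab.get("title", "").lower():
            matched.append(tab["id"])
    return matched
```

Because only the matched tabs are closed, the browser process itself is never touched.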


4. Platform Support

Each platform has its own source directory. The shared module contains voice engine, intent parser, and audio output logic that is identical across all three. Platform-specific modules handle app scanning, process management, window focus, hotkey registration, and browser control using the correct native mechanisms for each operating system.
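The split between shared and platform-specific code implies a dispatch point somewhere at startup. A minimal sketch, assuming the backend package names simply mirror the `windows/`, `macos/`, and `linux/` directories:

```python
import sys

def platform_backend() -> str:
    """Name of the platform-specific backend package (hypothetical names)."""
    if sys.platform.startswith("win"):
        return "windows"
    if sys.platform == "darwin":
        return "macos"
    return "linux"
```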

| Capability | Windows | macOS | Linux |
| --- | --- | --- | --- |
| App scanning | Registry + Start Menu .lnk files | /Applications + Spotlight | .desktop files + Flatpak + Snap |
| Launch application | subprocess | open -a command | subprocess |
| Focus running app | win32gui | AppleScript | xdotool |
| Close application | taskkill | osascript quit + pkill | pkill |
| Browser tab control | CDP via pychrome | CDP via pychrome | CDP via pychrome |
| Global hotkey | keyboard library | pynput | pynput (X11) / ydotool (Wayland) |
| TTS voice engine | SAPI5 via pyttsx3 | say command via pyttsx3 | espeak-ng via pyttsx3 |
| Packaged output | .exe | .app | .AppImage |

5. Tech Stack

| Layer | Technology | Purpose |
| --- | --- | --- |
| Voice input | Vosk | Offline speech recognition — no internet required |
| Intent classification | Custom rule-based parser | OPEN / CLOSE / RUN / ASK detection |
| App discovery | winreg, plistlib, configparser | Platform-native app scanning |
| Process management | psutil | Cross-platform process detection |
| App launch and close | subprocess, win32gui, AppleScript, xdotool | Platform-appropriate app control |
| Browser tab control | pychrome (Chrome DevTools Protocol) | Tab-level browser control |
| Hotkey registration | keyboard (Windows), pynput (macOS/Linux), ydotool (Wayland) | Global keyboard shortcut |
| Voice output (V1) | pyttsx3 | Offline robotic TTS feedback |
| AI assistant (V2) | OpenAI GPT-4o | Natural language question answering |
| Human voice (V2) | Sarvam AI TTS API | Human-quality voice responses |
| API key storage (V2) | cryptography (Fernet) + keyring | Encrypted key storage in OS keychain |
| UI | PyQt6 | Native cross-platform desktop UI |
| Packaging | PyInstaller | Single executable for each platform |

6. Repository Structure

universal-ai-launcher/
│
├── README.md                         # This file
├── CHANGELOG.md                      # Version history
├── LICENSE                           # MIT License
│
├── docs/
│   ├── 01-overview.md                # Project overview and version comparison
│   ├── 02-architecture.md            # Full system architecture and module specs
│   ├── 03-version-roadmap.md         # Staged delivery plan with validation criteria
│   ├── 04-how-it-works.md            # Complete operational flow for all command types
│   ├── 05-windows.md                 # Windows-specific implementation
│   ├── 06-macos.md                   # macOS-specific implementation
│   ├── 07-linux.md                   # Linux-specific implementation
│   ├── 08-ai-layer.md                # Developer Edition AI features
│   ├── 09-developer-edition.md       # Scripts, command groups, and settings panel
│   └── 10-fallbacks-and-solutions.md # All 20 failure modes with exact responses
│
├── shared/
│   ├── voiceEngine.py                # Vosk base class
│   ├── intentParser.py               # Intent classification logic
│   └── audioOutput.py                # Audio queue management
│
├── windows/src/                      # Windows implementation
├── macos/src/                        # macOS implementation
├── linux/src/                        # Linux implementation
│
└── assets/
    └── flowcharts/                   # Architecture and flow diagrams

7. Documentation

The complete specification is in the docs/ directory. The recommended reading order for a new contributor is: overview → architecture → how it works → your target platform → fallbacks. The version roadmap and developer edition documents are relevant when extending the application beyond Version 1.

| Document | What It Covers |
| --- | --- |
| 01-overview.md | What the project is, who it is for, version comparison |
| 02-architecture.md | Three-layer architecture, all module specifications, threading model |
| 03-version-roadmap.md | Nine-stage delivery plan with validation criteria per stage |
| 04-how-it-works.md | Full command flow trace for OPEN, CLOSE, RUN, and ASK |
| 05-windows.md | Registry scanning, win32gui, taskkill, CDP, PyInstaller on Windows |
| 06-macos.md | Permissions, plistlib, AppleScript, pynput, Gatekeeper on macOS |
| 07-linux.md | .desktop parsing, Wayland/X11 hotkey, espeak, AppImage on Linux |
| 08-ai-layer.md | GPT-4o integration, Sarvam TTS, encrypted key storage, all AI failure modes |
| 09-developer-edition.md | Script manager, command groups, settings panel, complete feature reference |
| 10-fallbacks-and-solutions.md | All 20 failure modes — cause, detection, and exact user-facing response |

8. Current Status and Roadmap

This project is currently in the architecture specification phase. No code has been written. The documentation in this repository is the complete pre-build specification.

The build plan is structured in nine stages across two versions. Each stage has a specific validation criterion — a concrete, observable outcome that confirms the stage is complete.

Version 1 — Core Launcher (Stages 1–5)

Stage 1 — Voice Input and Hotkey: global hotkey activates Vosk, spoken text returned correctly on all three platforms.
Stage 2 — App Discovery and Control: apps open, focus, and close by voice on all three platforms.
Stage 3 — Browser Tab Control: websites open in new tabs, specific tabs close without closing the browser.
Stage 4 — Error Handling: all 20 failure modes produce their defined spoken responses with no crashes.
Stage 5 — Packaging: single executable for each platform passes all validation criteria on a clean machine.

Version 2 — Developer Edition (Stages 6–9)

Stage 6 — AI Assistant and Human Voice: GPT-4o answers questions via Sarvam AI voice, with pyttsx3 fallback.
Stage 7 — Custom Script Execution: scripts upload, bind, execute, and return spoken output correctly.
Stage 8 — Settings Panel: all configurable options apply instantly without restart.
Stage 9 — V2 Packaging: Developer Edition executable passes all V1 and V2 validation criteria.


9. License

MIT License. See LICENSE for full terms.


Document Version: 1.0 — April 2026
Status: Architecture specification — application not yet built
Contributions, questions, and challenges to the architecture are welcome via GitHub Issues.

