A real-time speech-to-text transcription system using a Python WebSocket server powered by Vosk, and a modern browser-based client using Vite and vanilla JavaScript.
Supports multiple languages, depending on the Vosk model used. Designed for low latency and efficient bandwidth usage, with silence detection on the client side.
./
├── client/ # Vite-based client
│   ├── index.html # Entry HTML file
│   ├── package.json # Vite + dependencies
│   ├── package-lock.json
│   └── src/
│       ├── main.js # App entrypoint
│       └── style.css # App styling
├── main.py # Python WebSocket server (Vosk-based)
├── pyproject.toml # Python dependency spec
├── uv.lock # Lock file generated by uv
└── README.md
git clone https://github.com/Abdulkhalek-1/realtime-transcriptor.git
cd realtime-transcriptor
It's recommended to use uv:
pip install uv
uv sync
Models are not included in the repository due to their size. Download a suitable model for your language from the official Vosk models page:
- Arabic: https://alphacephei.com/vosk/models/vosk-model-ar-0.22-linto-1.1.0.zip
- English (small): https://alphacephei.com/vosk/models/vosk-model-small-en-us-0.15.zip
- English (large, more accurate): https://alphacephei.com/vosk/models/vosk-model-en-us-0.42-gigaspeech.zip
- All models: https://alphacephei.com/vosk/models
After downloading, extract the model zip and rename the extracted folder to model inside your project root:
unzip ./<your-model-name>.zip
mv ./<your-model-name> ./model
Then start the server:
uv run main.py
The server listens for WebSocket connections at ws://localhost:8765.
In a second terminal, install and start the client:
cd client
npm install
npm run dev
Open your browser at the URL printed by Vite (usually http://localhost:5173).
The server calls recognizer.FinalResult() when a client disconnects to finalize the last transcribed sentence. This ensures no spoken words are lost due to streaming buffering.
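The flush-on-disconnect behavior can be sketched roughly as follows. This is not the code in main.py: the `transcribe` function and the `Recognizer` protocol are hypothetical names, and the sketch iterates over a plain iterable of audio chunks rather than a live WebSocket so that it stays self-contained. The only Vosk calls it assumes are `AcceptWaveform`, `Result`, and `FinalResult` on a `KaldiRecognizer`-like object.

```python
import json
from typing import Iterable, Iterator, Protocol


class Recognizer(Protocol):
    """The subset of vosk.KaldiRecognizer that this sketch relies on."""

    def AcceptWaveform(self, data: bytes) -> int: ...
    def Result(self) -> str: ...
    def FinalResult(self) -> str: ...


def transcribe(recognizer: Recognizer, chunks: Iterable[bytes]) -> Iterator[str]:
    """Yield completed utterances, then flush the remainder when the stream ends."""
    for chunk in chunks:
        # AcceptWaveform returns a truthy value once an utterance is complete.
        if recognizer.AcceptWaveform(chunk):
            text = json.loads(recognizer.Result()).get("text", "")
            if text:
                yield text
    # The chunk stream is exhausted, which stands in for a client disconnect.
    # FinalResult() returns whatever audio is still buffered in the recognizer.
    tail = json.loads(recognizer.FinalResult()).get("text", "")
    if tail:
        yield tail
```

In a real server the loop would presumably consume binary WebSocket messages instead of a list, and how each string gets delivered back to the browser depends on the connection handling in main.py.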
To reduce bandwidth and improve transcription segmentation:
- The client detects silence by monitoring audio levels.
- When silence is detected for more than 1 second, the client disconnects and reconnects the WebSocket.
- This triggers the server to finalize and send the last transcription segment.
- This approach keeps server logic simple and offloads smart behavior to the client.
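The silence check in the steps above can be illustrated with a small sketch. The actual client is browser JavaScript, so this Python version only mirrors the logic. The RMS threshold of 500, the 16 kHz sample rate, the 16-bit mono PCM frame format, and the names `rms` and `SilenceDetector` are all assumptions made for illustration; only the 1-second hold time comes from the description above.

```python
import math
import struct

SILENCE_RMS = 500.0      # assumed level threshold for 16-bit PCM
SILENCE_SECONDS = 1.0    # hold time from the description above


def rms(frame: bytes) -> float:
    """Root-mean-square level of a little-endian 16-bit mono PCM frame."""
    count = len(frame) // 2
    if count == 0:
        return 0.0
    samples = struct.unpack(f"<{count}h", frame[: count * 2])
    return math.sqrt(sum(s * s for s in samples) / count)


class SilenceDetector:
    """Tracks how long the input has stayed below the level threshold."""

    def __init__(self, sample_rate: int = 16000,
                 threshold: float = SILENCE_RMS,
                 hold_seconds: float = SILENCE_SECONDS) -> None:
        self.sample_rate = sample_rate
        self.threshold = threshold
        self.hold_seconds = hold_seconds
        self.quiet_seconds = 0.0

    def feed(self, frame: bytes) -> bool:
        """Return True once, when the quiet stretch first exceeds hold_seconds."""
        if rms(frame) >= self.threshold:
            self.quiet_seconds = 0.0  # speech resets the timer
            return False
        before = self.quiet_seconds
        self.quiet_seconds += (len(frame) // 2) / self.sample_rate
        return before <= self.hold_seconds < self.quiet_seconds
```

When `feed` returns True, the client would close and reopen its WebSocket. Firing only on the crossing, rather than on every quiet frame, avoids reconnecting repeatedly during a long pause.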
MIT License — feel free to use and modify.