A terminal-based LLM chat app that runs locally and interacts with your vLLM server
TermiLLM is a client. It does not run model inference itself.
The intended design is:
- TermiLLM runs as a terminal client in its own Python environment
- The inference backend runs as a separate service
- The two communicate over HTTP using OpenAI-compatible endpoints such as /v1/models and /v1/chat/completions
This means you can:
- Run vLLM in a different local Python environment
- Run vLLM on another machine and point TermiLLM at it
- Replace vLLM with another OpenAI-compatible backend later
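As a rough illustration of what talking to an OpenAI-compatible backend involves, the sketch below builds a streaming chat request and decodes the server-sent-event lines that come back. It is not TermiLLM's actual code: the names `build_chat_request`, `parse_stream_line`, and `stream_chat` are hypothetical, and only the Python standard library is assumed.

```python
import json
import urllib.request


def build_chat_request(base_url, model, messages, temperature=0.7, max_tokens=2048):
    """Return (url, payload) for a streaming chat completion request."""
    url = base_url.rstrip("/") + "/v1/chat/completions"
    payload = {
        "model": model,
        "messages": messages,
        "temperature": temperature,
        "max_tokens": max_tokens,
        "stream": True,  # ask the backend to send incremental chunks
    }
    return url, payload


def parse_stream_line(line):
    """Extract the text delta from one SSE line, or None if it carries no text."""
    line = line.strip()
    if not line.startswith("data:"):
        return None  # blank keep-alive lines and comments
    data = line[len("data:"):].strip()
    if data == "[DONE]":
        return None  # end-of-stream sentinel
    chunk = json.loads(data)
    choices = chunk.get("choices") or []
    if not choices:
        return None
    return choices[0].get("delta", {}).get("content")


def stream_chat(base_url, model, messages):
    """Yield text fragments as the backend produces them (needs a running server)."""
    url, payload = build_chat_request(base_url, model, messages)
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        for raw in response:
            text = parse_stream_line(raw.decode("utf-8"))
            if text:
                yield text
```

Because the client only depends on the URL and the JSON shape, swapping vLLM for another OpenAI-compatible server should only require a different `base_url` and model name.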
TermiLLM works well with vLLM, but vLLM is expected to be started separately from the TermiLLM client. Before using TermiLLM:
- Install vLLM in a separate environment if needed
- Start a vLLM server with your preferred model, for example:
python -m vllm.entrypoints.openai.api_server --model Qwen/Qwen2.5-Coder-3B-Instruct --port 8000
For local development, a common setup is:
- terminal A: activate your vllm environment and start the vLLM server on port 8000
- terminal B: activate TermiLLM's environment and run ./run.sh
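Before launching the client in terminal B, it can help to confirm that the server in terminal A is reachable and serving the expected model. The following is a minimal sketch of such a check against /v1/models; `extract_model_ids` and `check_backend` are hypothetical helper names, not part of TermiLLM, and the response is assumed to follow the usual OpenAI-style `{"data": [{"id": ...}]}` layout.

```python
import json
import urllib.error
import urllib.request


def extract_model_ids(models_response):
    """Pull model ids out of an OpenAI-style /v1/models response body."""
    return [entry["id"] for entry in models_response.get("data", []) if "id" in entry]


def check_backend(base_url, model, timeout=5.0):
    """Return (ok, message) describing whether `model` is served at `base_url`."""
    url = base_url.rstrip("/") + "/v1/models"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            body = json.load(response)
    except (urllib.error.URLError, OSError, ValueError) as exc:
        # Covers refused connections, timeouts, and non-JSON replies.
        return False, f"Cannot reach backend at {url}: {exc}"
    available = extract_model_ids(body)
    if model not in available:
        return False, f"Model {model!r} is not served; available: {available}"
    return True, "ok"


if __name__ == "__main__":
    ok, message = check_backend("http://localhost:8000", "Qwen/Qwen2.5-Coder-3B-Instruct")
    print(message)
```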
- Interactive Chat Interface: Connect to your local vLLM backend with streaming responses
- User Experience:
  - Colorful output using Rich for a pleasant terminal experience
  - Keyboard navigation to review chat history
  - Stream responses from your local LLM in real time
- Command System:
  - /help - Display available commands
  - /clear - Clear the current conversation
  - /exit - Exit the application
  - /model - Change the model on the fly
  - /temp - Adjust temperature setting
  - /max_tokens - Change maximum token output
- Configuration Management:
  - Persistent settings via JSON configuration file
  - Dynamic model switching without restarting
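To make the command system more concrete, here is one possible way to dispatch the slash commands listed above. This is a sketch under assumptions, not TermiLLM's implementation: the `Session` class, its fields, and `handle_command` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Session:
    """Hypothetical in-memory state for one chat session."""
    model: str = "Qwen/Qwen2.5-Coder-3B-Instruct"
    temperature: float = 0.7
    max_tokens: int = 2048
    history: list = field(default_factory=list)
    running: bool = True


def handle_command(session, line):
    """Apply a slash command to the session and return a status message."""
    name, _, arg = line.strip().partition(" ")
    arg = arg.strip()
    if name == "/help":
        return "Commands: /help /clear /exit /model /temp /max_tokens"
    if name == "/clear":
        session.history.clear()
        return "Conversation cleared."
    if name == "/exit":
        session.running = False
        return "Goodbye."
    if name in ("/model", "/temp", "/max_tokens") and not arg:
        return f"Usage: {name} <value>"
    if name == "/model":
        session.model = arg
        return f"Model set to {arg}"
    try:
        if name == "/temp":
            session.temperature = float(arg)
            return f"Temperature set to {session.temperature}"
        if name == "/max_tokens":
            session.max_tokens = int(arg)
            return f"Max tokens set to {session.max_tokens}"
    except ValueError:
        # float()/int() rejected the argument; leave the setting unchanged.
        return f"Invalid value for {name}: {arg!r}"
    return f"Unknown command: {name}"
```

Keeping the handler a plain function over a small state object makes it easy to test without a terminal or a running backend.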
source ./venv.sh
./run.sh
By default, TermiLLM connects to http://localhost:8000.
You can also specify a different model or server:
./run.sh --model Qwen/Qwen2.5-Coder-3B-Instruct --base-url http://localhost:8000
If your inference service is already running in another local environment or on another machine, only the base_url and model name need to match that backend.
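How the --model and --base-url flags reach the Python client is not shown here. Assuming run.sh forwards its arguments to a Python entry point, a minimal sketch of parsing them and layering them over stored settings could look like the following; `parse_args` and `apply_overrides` are hypothetical names.

```python
import argparse


def parse_args(argv=None):
    """Parse the two optional overrides; None means the flag was not given."""
    parser = argparse.ArgumentParser(description="Terminal chat client (sketch)")
    parser.add_argument("--model", default=None,
                        help="model name served by the backend")
    parser.add_argument("--base-url", dest="base_url", default=None,
                        help="root URL of the OpenAI-compatible server")
    return parser.parse_args(argv)


def apply_overrides(config, args):
    """Return a copy of config with any flags that were given applied on top."""
    merged = dict(config)
    if args.model is not None:
        merged["model"] = args.model
    if args.base_url is not None:
        merged["base_url"] = args.base_url.rstrip("/")
    return merged
```

For example, `apply_overrides(stored, parse_args(["--base-url", "http://10.0.0.5:8000"]))` would keep the stored model and change only the server address.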
TermiLLM creates a configuration file named termillm_config.json in the application directory that stores your settings. You can edit this file directly to customize your preferences:
{
"model": "Qwen/Qwen2.5-Coder-3B-Instruct",
"base_url": "http://localhost:8000",
"temperature": 0.7,
"max_tokens": 2048
}
Settings can also be changed from within the application using commands like /model, /temp, and /max_tokens.
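A settings file like this is typically read with defaults as a fallback and rewritten whenever a value changes. The sketch below shows one way to do that; `DEFAULTS`, `load_config`, and `save_config` are hypothetical names, and falling back to defaults on a missing or malformed file is an assumption rather than documented TermiLLM behavior.

```python
import json
from pathlib import Path

# Default values mirror the example configuration shown above.
DEFAULTS = {
    "model": "Qwen/Qwen2.5-Coder-3B-Instruct",
    "base_url": "http://localhost:8000",
    "temperature": 0.7,
    "max_tokens": 2048,
}


def load_config(path):
    """Load settings from a JSON file, filling in defaults for anything missing."""
    config = dict(DEFAULTS)
    path = Path(path)
    if path.exists():
        try:
            stored = json.loads(path.read_text(encoding="utf-8"))
        except json.JSONDecodeError:
            stored = {}  # malformed file: keep the defaults
        if isinstance(stored, dict):
            config.update(stored)
    return config


def save_config(path, config):
    """Write settings back as indented JSON so the file stays hand-editable."""
    Path(path).write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
```

A command such as /temp could then update the in-memory dictionary and call `save_config` so the change persists across runs.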
Interested? Contributions are welcome: feel free to open a PR for features you want or bugs you have found. The plan is as follows:
- Basic Chat Feature
- Connect to vLLM backend
- Send/receive messages to/from the backend
- Support streamed output
- Support keyboard navigation
- Slash Commands
- /help
- /clear
- /exit
- Configurable model
- Support different models through vLLM
- Change the model using /model
- Save previous model selection
- Check model (backend connection) before start
- Move setting to JSON
- Colorful Output: Use rich to make UX more pleasant
- Provide more messages while generating
- Documentation
- Restructure the Python app into replaceable modules
- Add a Python MVP agent loop
- Add confirmation and safety policy for command execution
- Add pytest
- CI/CD
- Highlight code in the output (may use a buffer)
- Local file support
- Read files, such as cpp, py, txt, md
- Write to file
- Generate file
- Linux command
- API Config
- Integrate vLLM as part of the project
- Docker
- A LangChain Mode
- Move to bubbletea style
- Integrate local inference into it
- Integrate the model into it