LlamaFIM is a VS Code extension that provides real‑time inline completions powered by a local Llama.CPP server. It works great for developers who want an on‑premises LLM assistant without sending data to the cloud.
The extension implements the inline completion provider API introduced in VS Code 1.78 and forwards user input to a running Llama.CPP endpoint. The server returns an infill response containing the next chunk of text, which the extension then displays as an inline suggestion.
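As a rough sketch of the exchange, the client sends the text before and after the cursor to the server's `/infill` endpoint. The field names below follow the llama.cpp server API; the helper itself is illustrative, not the extension's actual code:

```typescript
// Illustrative helper (not the extension's actual code) that builds the
// request body for llama.cpp's /infill endpoint.
interface InfillRequest {
  input_prefix: string; // text before the cursor
  input_suffix: string; // text after the cursor
  n_predict: number;    // cap on the number of generated tokens
}

function buildInfillRequest(
  prefix: string,
  suffix: string,
  nPredict = 64
): InfillRequest {
  return { input_prefix: prefix, input_suffix: suffix, n_predict: nPredict };
}
```

The extension would POST this body as JSON; in the llama.cpp server API the generated text comes back in the response's `content` field.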
- Modern inline completion experience (no separate suggestion list).
- Lightweight client – all heavy lifting occurs on the local Llama.CPP server.
- Configurable debounce delay, timeout, and server URL.
- Automatic request cancellation when the cursor moves.
- Built‑in request timeout to avoid hanging requests.
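The cancellation behaviour can be pictured with a small sketch (an assumed pattern, not the extension's actual implementation): each new completion request aborts the previous in-flight one via an `AbortController`.

```typescript
// Sketch of the cancel-on-new-request pattern: whenever the cursor moves
// and a fresh completion is requested, the previous request is aborted.
class RequestGate {
  private controller: AbortController | null = null;

  // Returns a signal for the new request, aborting any earlier one.
  next(): AbortSignal {
    this.controller?.abort();            // cancel the previous in-flight request
    this.controller = new AbortController();
    return this.controller.signal;       // pass to fetch(url, { signal })
  }
}
```

Passing the returned signal to `fetch` causes the stale request to reject promptly instead of racing the newer one.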
- Install the extension from the VS Code Marketplace, or clone the repository and open it with `code .`.
- Ensure a Llama.CPP server is running and reachable. By default the extension expects the endpoint at `http://localhost:8080`. The server can be started with:

```shell
./llama.cpp/main -m <model.gguf> --port 8080
```
- Reload VS Code or run Reload Window.
The extension exposes a handful of workspace settings under the `llamafim` namespace. Open `settings.json` and add any of the following options:
| Setting | Type | Default | Description |
|---|---|---|---|
| `enabled` | boolean | `true` | Set to `false` to disable the extension entirely. |
| `debouncedelay` | number | `250` | Milliseconds to debounce inline completion requests. |
| `url` | string | `http://localhost:8080` | Base URL of the Llama.CPP server (without `/infill`). |
| `timeout` | number | `3500` | Request timeout in milliseconds. |
| `contextsize` | number | `4096` | Maximum context size (in tokens) sent to the Llama.CPP server. |
Example:

```json
{
  "llamafim.enabled": true,
  "llamafim.debouncedelay": 200,
  "llamafim.url": "http://127.0.0.1:8080",
  "llamafim.timeout": 5000,
  "llamafim.contextsize": 4096
}
```
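To illustrate how such settings might be validated before use, here is a sketch of a normalisation step with the defaults from the table above (an assumption about what `src/config.ts` does, not its actual code):

```typescript
// Illustrative normalisation of the llamafim.* settings: fill in defaults
// and clamp values to sane ranges before the provider uses them.
interface FimConfig {
  enabled: boolean;
  debouncedelay: number;
  url: string;
  timeout: number;
  contextsize: number;
}

function normalise(raw: Partial<FimConfig>): FimConfig {
  return {
    enabled: raw.enabled ?? true,
    debouncedelay: Math.max(0, raw.debouncedelay ?? 250),
    url: (raw.url ?? "http://localhost:8080").replace(/\/+$/, ""), // strip trailing slashes
    timeout: Math.max(1, raw.timeout ?? 3500),
    contextsize: Math.max(1, raw.contextsize ?? 4096),
  };
}
```

Stripping trailing slashes from `url` keeps the later concatenation with `/infill` from producing a double slash.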
Once configured, simply type in any file. After you pause for the debounce delay, the extension sends the surrounding context to the server and displays the returned text as an inline suggestion. Accept the suggestion with `Tab`, or reject it by continuing to type.
The provider is registered for all languages ({ pattern: '**' }).
When the extension is active a status bar item appears on the right.
Clicking it toggles the provider's enabled state; the next inline suggestion will be shown or suppressed accordingly. The toggle is runtime-only: the `llamafim.enabled` setting only determines the initial state when VS Code starts.
A quick start guide for contributing:
```shell
# Install dependencies
npm install

# Compile TypeScript
npm run compile

# Run tests (if available)
npm test

# Start a watch build while you edit
npm run watch
```

The project uses ESBuild for bundling and TSLint/ESLint for linting.
- `src/extension.ts` – Entry point; registers the provider.
- `src/provider.ts` – Implements request logic and cancellation.
- `src/config.ts` – Reads and normalises VS Code settings.
- `src/defs.ts` – Type definitions for the Llama.CPP response.
The test suite lives under `test/`. It uses Mocha and Chai. To run the tests:

```shell
npm test
```

The current tests cover configuration parsing and inline completion logic with mocked `fetch` responses.
Pull requests are welcome! Please:
- Fork the repository.
- Create a feature branch.
- Run the test suite and ensure all tests pass.
- Submit a pull request.
Before submitting, run the linter:
```shell
npm run lint
```

This project is licensed under the MIT License. See the LICENSE file for details.
- Llama.CPP – the lightweight inference engine.
- VS Code Extension API
Tip – If you experience performance issues, consider lowering `n_predict` or increasing `debouncedelay` in the settings.