Run Google's Gemma 4 models entirely on-device, embedded in a Node.js process. Text, image, and audio in — text out. No API keys, no cloud, no network required after the initial model download.
Built on @huggingface/transformers + ONNX Runtime. Ships with a built-in agent system for tool-calling workflows.
- Fully local — your data never leaves the machine
- Multimodal — text, image, and audio inputs through one unified API
- 128K context — room for long documents, transcripts, and multi-turn sessions
- Agent system — tool-calling loop with automatic retries and context management
- Streaming — async iterators for real-time output
- Thinking mode — chain-of-thought reasoning when you need it
- Hardware accelerated — GPU auto-detection (CoreML on Mac, CUDA on NVIDIA, DirectML on Windows)
```bash
npm install @kessler/gemma
# or
pnpm add @kessler/gemma
```

```ts
import { Gemma } from '@kessler/gemma'

const gemma = new Gemma()
await gemma.load()

// One-shot
const answer = await gemma.complete('What is the speed of light?')

// Streaming
for await (const chunk of gemma.stream('Write a haiku about TypeScript')) {
  process.stdout.write(chunk)
}

await gemma.unload()
```

Both models support text, image, and audio inputs with a 128K token context window.
| ID | Parameters | Download | HuggingFace |
|---|---|---|---|
| gemma-4-e2b | 2.3B effective | ~500 MB | onnx-community/gemma-4-E2B-it-ONNX |
| gemma-4-e4b (default) | 4B effective | ~1.5 GB | onnx-community/gemma-4-E4B-it-ONNX |
Models are downloaded on first use and cached locally (~/.cache/huggingface/). You can also pass any ONNX-format HuggingFace model ID directly.
```ts
const gemma = new Gemma({ model: 'gemma-4-e4b' })
```

The complete() and stream() methods accept either a plain string or a messages array. Multimodal content goes inline in the messages:
```ts
const response = await gemma.complete([{
  role: 'user',
  content: [
    { type: 'image', image: './photo.jpg' },
    { type: 'text', text: 'What do you see in this image?' },
  ],
}])
```

Images can be a file path, URL, or Buffer.
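Because images can arrive as Buffers, a small helper can assemble the message array. A minimal sketch (the imageMessage helper is illustrative, not part of the library; the shapes mirror the ChatMessage and ContentItem types documented below):

```ts
// Assemble a multimodal user message from an in-memory image plus a question.
// Matches the ChatMessage / ContentItem shapes from the API reference.
function imageMessage(image: Buffer, question: string) {
  return [{
    role: 'user' as const,
    content: [
      { type: 'image' as const, image },
      { type: 'text' as const, text: question },
    ],
  }]
}

// Usage sketch:
// const response = await gemma.complete(
//   imageMessage(await fs.readFile('./photo.jpg'), 'Describe this.'))
```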
```ts
const response = await gemma.complete([{
  role: 'user',
  content: [
    { type: 'audio', audio: './recording.wav' },
    { type: 'text', text: 'Transcribe this.' },
  ],
}])
```

Audio can be a file path, URL, or Buffer. Max 30 seconds.
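Given the 30-second cap, it can be worth checking a clip's length before sending it. A rough sketch for canonical 44-byte-header PCM WAV files (this helper is not part of the library, and a robust implementation would walk the RIFF chunk list rather than assume the data chunk starts at byte 44):

```ts
// Estimate the duration of a canonical PCM WAV buffer in seconds.
// Assumes a 44-byte header: the byte-rate field (bytes of audio per second)
// sits at offset 28, and audio data begins at offset 44.
function wavDurationSeconds(wav: Buffer): number {
  const byteRate = wav.readUInt32LE(28)
  const dataBytes = wav.length - 44
  return dataBytes / byteRate
}

// Usage sketch: reject clips over the documented limit before calling the model.
// if (wavDurationSeconds(clip) > 30) throw new Error('clip exceeds 30 seconds')
```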
Multi-turn conversations use the standard system / user / assistant roles:
```ts
const response = await gemma.complete([
  { role: 'system', content: 'You are a concise technical writer.' },
  { role: 'user', content: 'Explain garbage collection in one paragraph.' },
])
```

```ts
for await (const chunk of gemma.stream('Explain quantum entanglement')) {
  process.stdout.write(chunk)
}
```

Both complete() and stream() accept the same inputs — strings, message arrays, multimodal content.
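Because stream() returns an async iterator, small combinators compose naturally on top of it. A sketch of a helper that drains any AsyncIterable of text chunks into one string, with a mock generator standing in for the model so it runs anywhere:

```ts
// Drain an async iterable of text chunks (e.g. the iterator returned by
// gemma.stream()) into a single string.
async function collect(stream: AsyncIterable<string>): Promise<string> {
  let out = ''
  for await (const chunk of stream) out += chunk
  return out
}

// Stand-in for gemma.stream() so the helper can be exercised without a model:
async function* mockStream() {
  yield 'Hello, '
  yield 'world'
}
```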
Enable chain-of-thought reasoning. The model will reason internally before responding:
```ts
const response = await gemma.complete('What is 137 * 29? Show your work.', {
  thinking: true,
  onThinkingChunk: (t) => process.stderr.write(t),
})
```

By default, device: 'gpu' auto-selects the best available backend:
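A common pattern is to capture the reasoning stream separately from the answer so it can be logged or inspected afterwards. A sketch of a small buffer whose sink function matches the (text: string) => void shape of onThinkingChunk (the helper is illustrative, not part of the library):

```ts
// Accumulate thinking chunks so the full reasoning trace can be read later.
function makeThinkingBuffer() {
  let buffered = ''
  return {
    sink: (t: string) => { buffered += t }, // pass as onThinkingChunk
    get text() { return buffered },
  }
}

// Usage sketch:
// const thoughts = makeThinkingBuffer()
// await gemma.complete(prompt, { thinking: true, onThinkingChunk: thoughts.sink })
// console.error('model reasoning:', thoughts.text)
```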
| Platform | Backend |
|---|---|
| macOS (Apple Silicon) | CoreML / Metal |
| Linux / Windows (NVIDIA) | CUDA |
| Windows (any GPU) | DirectML |
| Fallback | CPU |
Override explicitly if needed:
```ts
const gemma = new Gemma({ device: 'cpu' })
const gemma = new Gemma({ device: 'cuda' })
const gemma = new Gemma({ device: 'coreml' })
```

Track model download and loading:
```ts
const gemma = new Gemma({
  onProgress: (info) => {
    if (info.status === 'loading') console.log(`${info.progress}%`)
    if (info.status === 'ready') console.log('Model ready')
    if (info.status === 'error') console.error(info.error)
  },
})
```

The agent runs an autonomous tool-calling loop: the model decides which tools to call, executes them, reads the results, and continues until it has an answer.
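The progress callback can drive a simple terminal progress bar. A sketch of a pure formatter, assuming info.progress is a 0-100 number as the example above suggests (renderProgress is illustrative, not part of the library):

```ts
// Render a fixed-width text progress bar from a 0-100 percentage.
function renderProgress(progress: number, width = 20): string {
  const clamped = Math.max(0, Math.min(100, progress))
  const filled = Math.round((clamped / 100) * width)
  return `[${'#'.repeat(filled)}${'-'.repeat(width - filled)}] ${clamped}%`
}

// Usage sketch:
// new Gemma({ onProgress: (info) => {
//   if (info.status === 'loading') process.stdout.write('\r' + renderProgress(info.progress))
// }})
```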
```ts
import { Gemma, Agent } from '@kessler/gemma'
import fs from 'fs/promises'

const gemma = new Gemma({ model: 'gemma-4-e4b' })
await gemma.load()

const agent = new Agent({
  gemma,
  systemPrompt: 'You are a helpful file assistant.',
  tools: [
    {
      name: 'read_file',
      description: 'Read a file from disk',
      parameters: {
        type: 'object',
        properties: {
          path: { type: 'string', description: 'File path to read' },
        },
        required: ['path'],
      },
      execute: async (args) => {
        return { content: await fs.readFile(args.path as string, 'utf-8') }
      },
    },
  ],
  onChunk: (text) => process.stdout.write(text),
  onToolCall: (call) => console.log(`\n> ${call.name}(${JSON.stringify(call.arguments)})`),
})

const result = await agent.run('Read package.json and tell me the project name')
console.log(`\nDone in ${result.iterations} iterations, ${result.toolCallCount} tool calls`)
```

- Self-executing tools — each tool definition carries its own execute function; no separate executor needed
- Persistent conversation — call agent.run() multiple times; context carries over
- Truncation recovery — if a tool call gets cut off mid-generation, the agent automatically continues or compresses context
- Image handling — tool results containing screenshots are fed back through the multimodal processor
- Abort support — call agent.abort() to stop mid-generation
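Because each tool carries its own execute function, tools can be unit-tested in isolation, without loading a model. A sketch of a side-effect-free tool matching the ToolDefinition shape from the API reference (add_numbers is an illustrative example, not shipped with the library):

```ts
// A self-contained tool definition whose execute function is pure,
// so it can be exercised directly in tests.
const addTool = {
  name: 'add_numbers',
  description: 'Add two numbers and return the sum',
  parameters: {
    type: 'object',
    properties: {
      a: { type: 'number', description: 'First addend' },
      b: { type: 'number', description: 'Second addend' },
    },
    required: ['a', 'b'],
  },
  execute: async (args: Record<string, unknown>) => {
    return { sum: Number(args.a) + Number(args.b) }
  },
}

// Usage sketch: pass it alongside other tools when constructing the Agent.
// const agent = new Agent({ gemma, systemPrompt: '...', tools: [addTool] })
```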
```ts
// Multi-turn
const r1 = await agent.run('List the files in src/')
const r2 = await agent.run('Now read the main entry point')

// Reset
agent.clearHistory()
```

| Option | Type | Default | Description |
|---|---|---|---|
| model | string | 'gemma-4-e4b' | Model ID or HuggingFace model path |
| device | 'gpu' \| 'cpu' \| 'cuda' \| 'coreml' | 'gpu' | Inference device |
| dtype | string | 'q4f16' | Quantization type |
| onProgress | (info: ProgressInfo) => void | — | Download/load progress callback |
load() downloads the model if needed and loads it into memory. Must be called before complete() or stream().
complete(input, options) generates a completion. input is a string or ChatMessage[].
| Option | Type | Default | Description |
|---|---|---|---|
| maxTokens | number | 1024 | Maximum tokens to generate |
| thinking | boolean | false | Enable chain-of-thought reasoning |
| onChunk | (text: string) => void | — | Streaming text callback |
| onThinkingChunk | (text: string) => void | — | Streaming thinking callback |
stream() takes the same arguments as complete() but yields text chunks as they're generated.
Returns the token count for a string.
unload() disposes the model and frees memory.
Low-level lexer for Gemma 4 model output. Splits a raw string into typed tokens (TOOL_CALL_START, TEXT, STRING_DELIM, etc.). Useful for building custom parsers on top of Gemma's special token format.
Cancel an in-progress generation.
| Option | Type | Default | Description |
|---|---|---|---|
| gemma | Gemma | — | Loaded Gemma instance (required) |
| systemPrompt | string | — | System prompt (required) |
| tools | ToolDefinition[] | — | Available tools (required) |
| maxIterations | number | 10 | Max tool-calling loops |
| thinking | boolean | false | Enable reasoning mode |
| onChunk | (text: string) => void | — | Streaming text callback |
| onThinkingChunk | (text: string) => void | — | Streaming thinking callback |
| onToolCall | (call: ToolCall) => void | — | Called when a tool is invoked |
| onToolResponse | (resp: ToolResponse) => void | — | Called when a tool returns |
agent.run(input) runs the agent. Returns { response, toolCallCount, iterations }.
agent.clearHistory() resets conversation state.
agent.abort() stops the current run.
```ts
interface ChatMessage {
  role: 'system' | 'user' | 'assistant'
  content: string | ContentItem[]
}

type ContentItem =
  | { type: 'text'; text: string }
  | { type: 'image'; image: string | Buffer }
  | { type: 'audio'; audio: string | Buffer }

interface ToolDefinition {
  name: string
  description: string
  parameters?: { type: 'object'; properties: Record<string, ToolParameterDef>; required?: string[] }
  execute: (args: Record<string, unknown>) => Promise<Record<string, unknown>>
}

interface AgentRunResult {
  response: string
  toolCallCount: number
  iterations: number
}
```

Apache-2.0