Releases: RMANOV/smartkey

v0.6.0 — "The Brain"

20 Mar 11:15

┌─────────────────────────────────────────────────────────────────┐
│  SmartKey v0.6.0 — "The Brain"                                  │
│                                                                 │
│  22 commits · 45 files · +6,474 lines · 311 tests              │
│  Master Algorithm · Regime Detection · CapsEngine · Frustration │
│  CorrectionMemory · TechVocab · LangModel · FFI Protocol       │
└─────────────────────────────────────────────────────────────────┘

TL;DR

v0.5.0 taught SmartKey to detect your keyboard layout. v0.6.0 teaches it to understand you. A new proactive Master Algorithm observes your typing patterns, classifies your current regime (fast coding? deliberate prose? language switching?), tracks frustration signals, remembers your corrections, applies smart capitalisation — and adjusts every prediction threshold in real-time. The dual-buffer engine from v0.5.0 is now wrapped in a 4-phase state machine that anticipates, tracks, corrects, and learns. 311 tests verify it all.


Highlights

🧠 Proactive Master Algorithm (master_loop.rs — 750 LoC)

The centrepiece of v0.6.0. A decorator-pattern orchestrator wrapping InputMethodCore with a 4-phase state machine:

ANTICIPATING → TRACKING → CORRECTING → LEARNING
     │              │            │            │
     ▼              ▼            ▼            ▼
 Read context   Follow word   Detect bad   Update profile
 Set lang prior  Build ghost   prediction   Save corrections
 Bias ensemble   Monitor WPM   Suppress it  Adjust thresholds

Key innovation: Hints injection. Before each prediction, MasterLoop computes a Hints struct (lang_prior, suppress_ghost, confidence_boost) and passes it down to the ensemble. This means predictions adapt to context before the user finishes typing — not after.
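A minimal sketch of the phase cycle and the Hints struct, to make the data flow concrete. The field names come from the notes above; the field types, the strictly cyclic transition rule, and the `show_ghost` helper are assumptions for illustration, not the actual master_loop.rs code.

```rust
/// The four phases named in the diagram above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Phase {
    Anticipating,
    Tracking,
    Correcting,
    Learning,
}

impl Phase {
    /// Assumed transition rule: a plain cycle through the four phases.
    pub fn next(self) -> Phase {
        match self {
            Phase::Anticipating => Phase::Tracking,
            Phase::Tracking => Phase::Correcting,
            Phase::Correcting => Phase::Learning,
            Phase::Learning => Phase::Anticipating,
        }
    }
}

/// Field names are from the release notes; the types are guesses.
#[derive(Debug, Clone, PartialEq)]
pub struct Hints {
    /// Assumed meaning: prior probability of the currently favoured language.
    pub lang_prior: f32,
    pub suppress_ghost: bool,
    pub confidence_boost: f32,
}

/// Hypothetical helper: decide whether ghost text is shown for a raw
/// ensemble score, given a confidence floor and the current hints.
pub fn show_ghost(raw_score: f32, floor: f32, hints: &Hints) -> bool {
    !hints.suppress_ghost && raw_score + hints.confidence_boost >= floor
}
```

Under these assumptions, a boost of 0.1 lifts a raw score of 0.45 over a floor of 0.5, while `suppress_ghost` overrides any score.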

📊 Typing Regime Detection (typing_regime.rs — 200 LoC)

Inspired by market regime detection in algorithmic trading. Five EMA signals (WPM, word length, out-of-vocabulary rate, dual-buffer flip rate, language ratio) are tracked per word and classified into 5 regimes:

| Regime | Trigger | Effect |
|---|---|---|
| FastCoding | High WPM + OOV + short words | TechVocab δ weight ↑, ghost threshold ↓ |
| DeliberateProse | Low WPM + long words | Markov β weight ↑, longer ghost text |
| MixedLanguage | High flip rate | Dual-buffer stays unlocked longer |
| NewTopic | OOV spike | Reset session cache, broaden predictions |
| LanguageSwitch | Sharp lang ratio flip | Lock to new language faster |
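The regime triggers can be sketched as an exponential moving average per signal plus a threshold classifier. Every threshold, the `Signals` field layout, the priority order of the checks, and the `None` fallback are assumptions; the notes only name the signals and the regimes.

```rust
/// Exponential moving average: value = alpha * sample + (1 - alpha) * previous.
#[derive(Debug, Clone, Copy)]
pub struct Ema {
    alpha: f64,
    value: Option<f64>,
}

impl Ema {
    pub fn new(alpha: f64) -> Self {
        Ema { alpha, value: None }
    }

    /// Feed one per-word sample and return the smoothed value.
    /// The first sample seeds the average.
    pub fn update(&mut self, sample: f64) -> f64 {
        let next = match self.value {
            None => sample,
            Some(prev) => self.alpha * sample + (1.0 - self.alpha) * prev,
        };
        self.value = Some(next);
        next
    }
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Regime {
    FastCoding,
    DeliberateProse,
    MixedLanguage,
    NewTopic,
    LanguageSwitch,
}

/// Smoothed signals. The two `*_jump` fields (change since the previous
/// word) are a hypothetical way to express "spike" and "sharp flip".
pub struct Signals {
    pub wpm: f64,
    pub word_len: f64,
    pub oov_rate: f64,
    pub oov_jump: f64,
    pub flip_rate: f64,
    pub lang_ratio_jump: f64,
}

/// Illustrative thresholds only; returns None when no trigger fires.
pub fn classify(s: &Signals) -> Option<Regime> {
    if s.lang_ratio_jump.abs() > 0.5 {
        return Some(Regime::LanguageSwitch);
    }
    if s.flip_rate > 0.3 {
        return Some(Regime::MixedLanguage);
    }
    if s.wpm > 70.0 && s.oov_rate > 0.3 && s.word_len < 5.0 {
        return Some(Regime::FastCoding);
    }
    if s.oov_jump > 0.4 {
        return Some(Regime::NewTopic);
    }
    if s.wpm < 35.0 && s.word_len > 6.0 {
        return Some(Regime::DeliberateProse);
    }
    None
}
```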

🔠 CapsEngine (caps.rs — 100 LoC)

The trie stores all words lowercase. CapsEngine detects the caps pattern from what you've typed so far and applies it to predictions:

  • Normal — first-letter capitalisation at sentence start
  • AllCaps → HELLO
  • CamelCase → camelCase
  • PascalCase → PascalCase

Proper nouns are capitalised regardless. This is why the CI tests needed eq_ignore_ascii_case — predictions are no longer always lowercase.
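A simplified sketch of detecting a caps pattern from the typed prefix and applying it to a lowercase trie word. It covers only lowercase, first-letter and all-caps patterns; CamelCase and PascalCase need word-boundary information and are left out. The enum and function names are hypothetical, not the caps.rs API.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CapsPattern {
    Lower,
    Capitalised,
    AllCaps,
}

/// Infer the pattern from what has been typed so far.
/// A single capital letter counts as Capitalised, not AllCaps.
pub fn detect(prefix: &str) -> CapsPattern {
    let letters: Vec<char> = prefix.chars().filter(|c| c.is_alphabetic()).collect();
    let upper = letters.iter().filter(|c| c.is_uppercase()).count();
    if letters.len() >= 2 && upper == letters.len() {
        CapsPattern::AllCaps
    } else if letters.first().is_some_and(|c| c.is_uppercase()) {
        CapsPattern::Capitalised
    } else {
        CapsPattern::Lower
    }
}

/// Apply the pattern to a lowercase word from the trie.
pub fn apply(pattern: CapsPattern, word: &str) -> String {
    match pattern {
        CapsPattern::Lower => word.to_string(),
        CapsPattern::AllCaps => word.to_uppercase(),
        CapsPattern::Capitalised => {
            let mut chars = word.chars();
            match chars.next() {
                // to_uppercase() yields an iterator because some letters
                // uppercase to more than one char.
                Some(first) => first.to_uppercase().chain(chars).collect(),
                None => String::new(),
            }
        }
    }
}
```

Because the helpers use Unicode-aware case methods, the same logic works for Cyrillic prefixes.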

😤 Frustration Detector (frustration.rs — 200 LoC)

Real-time keystroke pattern monitor detecting 4 frustration signals:

  • REJECT — Backspace within 500ms of Tab acceptance → you didn't want that prediction
  • RAPID_DELETE — 3+ Backspace in <1 second → something went wrong
  • RETYPE — Delete all + retype in different script → wrong language was locked
  • ABANDON — Escape + continued manual typing → user gave up on predictions

Each signal carries severity [0.0, 1.0]. When frustration spikes, ghost text is suppressed for the next N words. The Master Algorithm's LEARNING phase uses these signals to adjust LightProfile.confidence_floor.
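Two of the four signals (REJECT and RAPID_DELETE) can be sketched with millisecond timestamps. The type names and the `on_key` interface are hypothetical, and severity scoring is not modelled; only the 500 ms and 1 second windows come from the notes above.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Key {
    Tab,
    Backspace,
    Other,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Signal {
    Reject,
    RapidDelete,
}

#[derive(Default)]
pub struct FrustrationDetector {
    /// Time of the last Tab acceptance not yet followed by another key.
    last_tab_ms: Option<u64>,
    /// Backspace timestamps inside the current 1-second window.
    recent_backspaces_ms: Vec<u64>,
}

impl FrustrationDetector {
    /// Feed one key event with a monotonic timestamp in milliseconds.
    pub fn on_key(&mut self, key: Key, now_ms: u64) -> Option<Signal> {
        match key {
            Key::Tab => {
                self.last_tab_ms = Some(now_ms);
                self.recent_backspaces_ms.clear();
                None
            }
            Key::Backspace => {
                // REJECT: Backspace within 500 ms of a Tab acceptance.
                if let Some(tab) = self.last_tab_ms.take() {
                    if now_ms.saturating_sub(tab) <= 500 {
                        return Some(Signal::Reject);
                    }
                }
                // RAPID_DELETE: three or more Backspaces in under a second.
                self.recent_backspaces_ms
                    .retain(|&t| now_ms.saturating_sub(t) < 1000);
                self.recent_backspaces_ms.push(now_ms);
                if self.recent_backspaces_ms.len() >= 3 {
                    self.recent_backspaces_ms.clear();
                    Some(Signal::RapidDelete)
                } else {
                    None
                }
            }
            Key::Other => {
                self.last_tab_ms = None;
                self.recent_backspaces_ms.clear();
                None
            }
        }
    }
}
```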

🔧 Tech Vocabulary (tech_vocab.rs — 150 LoC)

300+ hardcoded technical terms (programming keywords, framework names, CLI commands) with frequency-based scoring. Active as the δ weight in the ensemble only during FastCoding regime — prevents function from appearing while writing Bulgarian prose.

🧬 Per-Language Model Bank (lang_model.rs — 100 LoC)

Each detected language gets its own isolated model bank (trie + Markov + PPM + Kneser-Ney). Prevents cross-language contamination: Bulgarian Markov chains don't bleed into English predictions and vice versa.

🔄 Correction Memory (correction_memory.rs — 150 LoC)

When the user corrects the same prediction 3+ times in the same context, the bad prediction is suppressed and the correction is shown instead. Uses 2-word context hashing and LRU eviction (max 500 entries). Serialises to PersonalProfile v4 for persistence across sessions.
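A rough sketch of the suppression rule described above: count corrections per (2-word context, bad prediction) pair, substitute the correction once the count reaches 3, and evict the least recently used entry beyond 500. The struct layout, the use of `DefaultHasher`, and the linear-scan eviction are assumptions chosen for brevity; persistence to PersonalProfile is not shown.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

const MAX_ENTRIES: usize = 500;
const SUPPRESS_AFTER: u32 = 3;

struct Entry {
    correction: String,
    count: u32,
    last_used: u64,
}

#[derive(Default)]
pub struct CorrectionMemory {
    entries: HashMap<(u64, String), Entry>,
    clock: u64, // logical time used for LRU ordering
}

/// Hash of the two words preceding the prediction.
fn context_hash(prev2: &str, prev1: &str) -> u64 {
    let mut h = DefaultHasher::new();
    (prev2, prev1).hash(&mut h);
    h.finish()
}

impl CorrectionMemory {
    /// Record that `predicted` was replaced by `corrected` in this context.
    pub fn record(&mut self, prev2: &str, prev1: &str, predicted: &str, corrected: &str) {
        self.clock += 1;
        let clock = self.clock;
        let key = (context_hash(prev2, prev1), predicted.to_string());
        let e = self.entries.entry(key).or_insert(Entry {
            correction: corrected.to_string(),
            count: 0,
            last_used: clock,
        });
        if e.correction != corrected {
            // A different correction restarts the count (assumed policy).
            e.correction = corrected.to_string();
            e.count = 0;
        }
        e.count += 1;
        e.last_used = clock;

        if self.entries.len() > MAX_ENTRIES {
            // O(n) scan for the least recently used key; fine for 500 entries.
            let oldest = self
                .entries
                .iter()
                .min_by_key(|(_, e)| e.last_used)
                .map(|(k, _)| k.clone());
            if let Some(k) = oldest {
                self.entries.remove(&k);
            }
        }
    }

    /// Return the remembered correction once it has been seen 3+ times.
    pub fn lookup(&mut self, prev2: &str, prev1: &str, predicted: &str) -> Option<&str> {
        self.clock += 1;
        let clock = self.clock;
        let key = (context_hash(prev2, prev1), predicted.to_string());
        let e = self.entries.get_mut(&key)?;
        e.last_used = clock;
        if e.count >= SUPPRESS_AFTER {
            Some(e.correction.as_str())
        } else {
            None
        }
    }
}
```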

📡 FFI Protocol (ffi_protocol.rs — 80 LoC)

Single source of truth for all FFI serialisation: ReplaceWord payloads use \x1F (Unit Separator) between replace_len and text; ShowComposing uses \x00 between typed and ghost portions. Shared by PyO3, C FFI, and test harnesses — eliminates parser mismatches.
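To illustrate the two separators, here is a hedged sketch of encode/decode helpers. The function names are hypothetical, and encoding replace_len as decimal ASCII is an assumption; the notes only fix the separator bytes.

```rust
/// Unit Separator between replace_len and text in ReplaceWord payloads.
pub const UNIT_SEP: char = '\x1F';
/// NUL between the typed and ghost portions in ShowComposing payloads.
pub const NUL: char = '\0';

/// Assumption: replace_len is written as a decimal number.
pub fn encode_replace_word(replace_len: usize, text: &str) -> String {
    format!("{}{}{}", replace_len, UNIT_SEP, text)
}

pub fn decode_replace_word(payload: &str) -> Option<(usize, &str)> {
    let (len, text) = payload.split_once(UNIT_SEP)?;
    let n: usize = len.parse().ok()?;
    Some((n, text))
}

pub fn encode_show_composing(typed: &str, ghost: &str) -> String {
    format!("{}{}{}", typed, NUL, ghost)
}

pub fn decode_show_composing(payload: &str) -> Option<(&str, &str)> {
    payload.split_once(NUL)
}
```

Splitting on the first separator keeps the decoder unambiguous as long as the leading field never contains that byte, which holds for a decimal length.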

👤 Light Profile (light_profile.rs — 100 LoC)

Adaptive confidence tuning based on rolling accept/reject ratio:

  • High acceptance → lower confidence floor → more ghost suggestions shown
  • Low acceptance → higher floor → fewer, higher-quality suggestions
  • Formula: max(0.15, 0.50 - 0.35 × accept_rate)
  • Auto-suppresses ghost text for 2 words after a rejection
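The formula and the 2-word suppression rule can be sketched as follows. The struct fields and method names are hypothetical, and plain lifetime counters stand in for the rolling window mentioned above.

```rust
/// Confidence floor from the formula above:
/// max(0.15, 0.50 - 0.35 * accept_rate).
pub fn confidence_floor(accept_rate: f64) -> f64 {
    let rate = accept_rate.clamp(0.0, 1.0);
    (0.50 - 0.35 * rate).max(0.15)
}

#[derive(Default)]
pub struct LightProfile {
    accepted: u32,
    rejected: u32,
    /// Words left for which ghost text stays suppressed.
    suppress_left: u8,
}

impl LightProfile {
    pub fn on_accept(&mut self) {
        self.accepted += 1;
    }

    /// A rejection suppresses ghost text for the next 2 words.
    pub fn on_reject(&mut self) {
        self.rejected += 1;
        self.suppress_left = 2;
    }

    pub fn on_word_end(&mut self) {
        self.suppress_left = self.suppress_left.saturating_sub(1);
    }

    pub fn accept_rate(&self) -> f64 {
        let total = self.accepted + self.rejected;
        if total == 0 {
            0.0
        } else {
            f64::from(self.accepted) / f64::from(total)
        }
    }

    pub fn ghost_allowed(&self, score: f64) -> bool {
        self.suppress_left == 0 && score >= confidence_floor(self.accept_rate())
    }
}
```

With the formula as stated, the floor ranges from 0.50 at an accept rate of 0 down to 0.15 at an accept rate of 1.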

🎮 Playground Simulation Harness (smartkey-playground — new crate)

Closed-loop simulator driving InputMethodCore through target text, measuring keystroke efficiency, prediction accuracy, and ghost acceptance rates. Includes preset scenarios and HTML report generation.


Architecture

v0.6.0 adds a new orchestration layer above the existing prediction stack:

┌──────────────────────────────────────────────────────┐
│  MasterLoop (new)                                     │
│  ┌─ RegimeDetector ── FrustrationDetector            │
│  ├─ LightProfile ──── CorrectionMemory               │
│  └─ Hints → InputMethodCore                          │
│              ┌─ DualBuffer (v0.5.0)                  │
│              ├─ CapsEngine (new)                     │
│              └─ Ensemble                             │
│                  ├─ α  Corpus frequency              │
│                  ├─ β  Markov chain                   │
│                  ├─ γ  CVM counter                    │
│                  └─ δ  TechVocab (new, regime-gated) │
└──────────────────────────────────────────────────────┘

Preedit-First Dual Buffer

input.rs refactored from 1,168 → 1,824 LoC. The dual buffer now operates in preedit mode: text is held in a composition buffer before commit, eliminating character doubling that plagued v0.5.0. Ghost text comparison is now Unicode-safe.

Space Auto-Commit + Syntactic Casing

Pressing Space now auto-commits the current word (no explicit Tab needed for unambiguous predictions). After commit, CapsEngine applies casing rules: sentence-start capitalisation, proper noun detection, regime-based transforms.


By the Numbers

| Metric | v0.5.0 | v0.6.0 |
|---|---|---|
| Files changed | | 45 |
| Lines added | | +6,474 |
| Lines removed | | -358 |
| Net growth | | +6,116 LoC |
| New modules | | 10 |
| New crate | | smartkey-playground |
| master_loop.rs | | 750 LoC (new) |
| typing_regime.rs | | 200 LoC (new) |
| frustration.rs | | 200 LoC (new) |
| tech_vocab.rs | | 150 LoC (new) |
| correction_memory.rs | | 150 LoC (new) |
| caps.rs | | 100 LoC (new) |
| lang_model.rs | | 100 LoC (new) |
| light_profile.rs | | 100 LoC (new) |
| ffi_protocol.rs | | 80 LoC (new) |
| ensemble.rs | ~670 LoC | 1,394 LoC (+108%) |
| input.rs | ~1,168 LoC | 1,824 LoC (+56%) |
| virtual_typer.rs | | 881 LoC (new test) |
| Tests | 207 | 311 (+50%) |

Full Changelog

Features

  • feat(core): master_loop.rs — proactive 4-phase Master Algorithm with Hints injection (750 LoC)
  • feat(core): typing_regime.rs — 5-regime detection via EMA signals (200 LoC)
  • feat(core): caps.rs — CapsEngine: smart capitalisation from typed prefix (100 LoC)
  • feat(core): tech_vocab.rs — 300+ technical terms as ensemble δ signal (150 LoC)
  • feat(core): frustration.rs — 4-signal frustration detector (200 LoC)
  • feat(core): lang_model.rs — per-language model isolation bank (100 LoC)
  • feat(core): correction_memory.rs — repeated correction suppression with LRU (150 LoC)
  • feat(core): light_profile.rs — adaptive confidence floor tuning (100 LoC)
  • feat(core): ffi_protocol.rs — unified FFI serialisation contract (80 LoC)
  • feat(core): preedit-first dual buffer — eliminate character doubling
  • feat(core): aggressive space auto-commit and syntactic casing restoration
  • feat(core): per-language models + smart caps for 2× prediction accuracy
  • feat(core): confidence hardening + eval histogram
  • feat(playground): closed-loop simulation harness for prediction quality eval

Fixes

  • fix(core): bug hunt v15 — 10 fixes + 4 simplifications in Master Algorithm
  • fix(core): wire flip detection, clear momentum on reset, harden ghost text
  • fix(core): Unicode-safe ghost text comparison, remove dead sentence_start field
  • fix(core): resolve compile errors and ghost text regression
  • fix(core): case-insensitive assertions in integration + virtual_typer tests
  • fix: ReplaceWord parser mismatch, macOS handler, overflow guards
  • fix: confidence model, overflow guards, code quality improvements
  • fix: parallel surgery — preedit lifecycle, language detection, prediction quality

Chores

  • chore(core): remove applied caps.patch — history preserved in git blame
  • refactor(core): simplify hypothesis phase checks and reduce allocations
  • ci: add smartkey-playground to test and clippy jobs

Known Issues / Tech Debt

  • Confidence single-candidate weakness: Softmax normalisation gives confidence=1.0 to a single low-score prediction. Mitigated by ghost_text_min_score threshold, but a pro...

v0.5.0 — "Never Switch Again"

14 Mar 10:19

┌─────────────────────────────────────────────────────────────────┐
│  SmartKey v0.5.0 — "Never Switch Again"                         │
│                                                                 │
│  12 commits · 27 files · +1,985 lines · 207 tests              │
│  Dual-buffer · evdev keymaps · IBus replace · layout-agnostic   │
└─────────────────────────────────────────────────────────────────┘

TL;DR

Type zdravej on a QWERTY keyboard. SmartKey intercepts the hardware scancodes, maintains a parallel Bulgarian Phonetic interpretation in a second buffer, compares corpus frequencies, decides you meant Bulgarian, and outputs здравей — without you pressing Alt+Shift, Ctrl+Space, or touching any language-switch shortcut. That is v0.5.0 in one sentence.


Highlights

🔄 Dual-Buffer Layout-Agnostic Input (dual_buffer.rs — 399 LoC)

The centrepiece of this release. Architecture:

Hardware key press
        │
        ▼
   evdev scancode
   (platform-neutral)
        │
   ┌────┴────┐
   ▼         ▼
 EN QWERTY  BG Phonetic
  buffer     buffer
   │         │
   └────┬────┘
        │
   corpus frequency
   comparison
        │
   confidence ≥ 0.7?
        │
   ┌────┴────────────────┐
   ▼                     ▼
 lock + output        keep accumulating
 ReplaceWord action   both buffers

Key design decisions:

  • Confidence threshold 0.7 — SmartKey doesn't guess until it's sure. Below 0.7 both buffers run in parallel, above 0.7 the winning language is locked and ReplaceWord is emitted.
  • Flip detection — if you suddenly type characters that only exist in the non-locked language, SmartKey unlocks, reassesses, and can flip. There's no permanent lock-in.
  • ReplaceWord action — the existing Action enum gains a new variant. All downstream crates (win, py, ibus) handle it.
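The lock decision can be sketched as below. Only the 0.7 threshold comes from the notes; defining confidence as the winning buffer's share of the combined corpus-frequency score is an assumption, as are the type and function names.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Lang {
    En,
    Bg,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Decision {
    /// Below the threshold: keep feeding both buffers.
    KeepAccumulating,
    /// At or above the threshold: lock and emit ReplaceWord.
    Lock(Lang),
}

pub const LOCK_THRESHOLD: f64 = 0.7;

/// Assumed confidence: winner's share of the total frequency score.
pub fn decide(en_score: f64, bg_score: f64) -> Decision {
    let total = en_score + bg_score;
    if total <= 0.0 {
        return Decision::KeepAccumulating;
    }
    let (winner, best) = if en_score >= bg_score {
        (Lang::En, en_score)
    } else {
        (Lang::Bg, bg_score)
    };
    if best / total >= LOCK_THRESHOLD {
        Decision::Lock(winner)
    } else {
        Decision::KeepAccumulating
    }
}
```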

⌨️ evdev Keymap Tables (keymap.rs — 323 LoC)

Previous builds used XKB key codes in lookup tables — which are already layout-remapped. This made dual-buffer impossible (you can't detect layout from remapped codes). v0.5.0 fixes this at the root: lookup tables now use raw evdev scancodes (Linux kernel constants), which are hardware positions, not characters.

Two complete lookup tables:

  • EN_QWERTY_MAP: evdev scancode → ASCII character (+ Shift variant)
  • BG_PHONETIC_MAP: evdev scancode → Cyrillic character (+ Shift variant), following the Bulgarian Phonetic standard layout
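A tiny illustration of why raw scancodes allow two parallel interpretations. The key codes are the Linux evdev constants from input-event-codes.h; the slice-of-tuples table shape, the helper names, and the seven-key subset (just enough for the zdravej example) are assumptions rather than the real keymap.rs layout.

```rust
// Linux evdev key codes (hardware positions, independent of layout).
const KEY_E: u16 = 18;
const KEY_R: u16 = 19;
const KEY_A: u16 = 30;
const KEY_D: u16 = 32;
const KEY_J: u16 = 36;
const KEY_Z: u16 = 44;
const KEY_V: u16 = 47;

/// Illustrative subset; the real tables cover the whole keyboard.
pub const EN_QWERTY_MAP: &[(u16, char)] = &[
    (KEY_E, 'e'), (KEY_R, 'r'), (KEY_A, 'a'), (KEY_D, 'd'),
    (KEY_J, 'j'), (KEY_Z, 'z'), (KEY_V, 'v'),
];

pub const BG_PHONETIC_MAP: &[(u16, char)] = &[
    (KEY_E, 'е'), (KEY_R, 'р'), (KEY_A, 'а'), (KEY_D, 'д'),
    (KEY_J, 'й'), (KEY_Z, 'з'), (KEY_V, 'в'),
];

/// Look up one scancode; Shift selects the uppercase variant.
pub fn lookup(table: &[(u16, char)], code: u16, shift: bool) -> Option<char> {
    let base = table.iter().find(|(c, _)| *c == code).map(|(_, ch)| *ch)?;
    if shift {
        base.to_uppercase().next()
    } else {
        Some(base)
    }
}

/// Interpret a scancode sequence under one layout, skipping unmapped keys.
pub fn render(table: &[(u16, char)], codes: &[u16]) -> String {
    codes.iter().filter_map(|&c| lookup(table, c, false)).collect()
}
```

The same scancode sequence renders as zdravej under the first table and здравей under the second, which is what the two buffers compare.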

🔁 IBus Replace Handler (ibus/)

The IBus adapter gains a replace action handler: when ReplaceWord arrives, it erases the N characters in the composition buffer and commits the corrected text. Bug fixes in this path:

  • Off-by-one in backspace count (was deleting one extra character)
  • Wrong CapsLock bit in key event synthesised for backspace
  • Consumed-key event was being committed as text

🌡️ Cold-Start Wrong-Layout Detection

On session start SmartKey has no typed history — the confidence estimator starts blind. v0.5.0 adds a frequency-based cold-start heuristic: compare the character frequency distribution of the current input prefix against expected EN and BG frequencies. Whichever distribution is less surprising wins the initial guess.

Prior approach used raw character counts, which failed on short prefixes. Frequency-ratio comparison is stable at 2–3 characters.

🛠️ ReplaceWord Across All Crates

The new Action::ReplaceWord(String) variant is handled in:

  • smartkey-core — emitted by DualBuffer::process_key
  • smartkey-py (PyO3 bridge) — exposed as ("replace", text) tuple in Python API
  • smartkey-win (TSF) — Key::RawCode arm added to should_claim match; TSF composition handles replace
  • smartkey-ibus — replace handler implemented with corrected backspace synthesis

🦀 Rust 1.94 Clippy Compliance

New lints in Rust 1.94:

  • is_multiple_of — replaces manual % n == 0 patterns
  • unnecessary_unwrap — replaces if x.is_some() { x.unwrap() } with if let Some(v) = x
  • CI now runs on Rust stable with -D warnings enforced

By the Numbers

| Metric | v0.4.0 | v0.5.0 |
|---|---|---|
| Files changed | | 27 |
| Lines added | | +1,985 |
| Lines removed | | -152 |
| Net growth | | +1,833 LoC |
| dual_buffer.rs | | 399 LoC (new) |
| keymap.rs | | 323 LoC (new) |
| Tests | ~150 | 207 (+38%) |
| Language switch shortcuts needed | Yes | No |
| Rust version compliance | 1.8x | 1.94 |
| CI Node.js actions/checkout | v4 | v5 (Node 24) |

Known Issues / Tech Debt

Being honest about what v0.5.0 does not yet solve:

  • macOS: Dual-buffer is implemented in smartkey-core but the macOS IMK adapter does not yet plumb ReplaceWord through to the Swift layer. macOS gets prediction quality improvements from v0.4.0 but not layout-agnostic input.
  • Windows: Key::RawCode arm added; however the evdev scancode concept is Linux-specific. Windows uses virtual key codes, so the keymap.rs tables don't apply directly. The Windows path uses the existing ToUnicodeEx approach — a unified scancode abstraction across platforms is planned.
  • Performance: dual_buffer.rs is O(n) in prefix length per keypress. For normal typing (< 50 chars) this is imperceptible. Long pastes trigger one full re-evaluation.
  • Confidence tuning: The 0.7 threshold is hardcoded. A config parameter is planned.
  • Single Cyrillic keyboard layout: Only Bulgarian Phonetic is in keymap.rs. Russian Phonetic, Serbian Cyrillic, Ukrainian, etc. are not yet represented.

Full Changelog

Features

  • feat(core): dual_buffer.rs — DualBuffer struct, confidence locking, flip detection, ReplaceWord emission (399 LoC)
  • feat(core): keymap.rs — evdev scancode lookup tables for EN QWERTY + BG Phonetic (323 LoC)
  • feat(core): Action::ReplaceWord(String) variant — handled in all downstream crates
  • feat(core): instant language detection + language-aware scoring (v0.4.1 backport)

Fixes

  • fix(core): Use evdev keycodes instead of XKB in keymap lookup tables — enables true layout-agnostic detection
  • fix(core): Allow wrong-layout detection at cold start — frequency-based comparison replaces broken count-based heuristic
  • fix(core): Replace broken count-based layout detection with frequency-ratio comparison
  • fix(core): Handle Action::ReplaceWord in all downstream crates
  • fix(core): v0.4.1 post-review cleanup — all clippy warnings resolved
  • fix(ibus): Add replace handler — erases composition buffer and commits corrected text
  • fix(ibus): Fix replace action — off-by-one backspace count, wrong CapsLock bit, consumed-key commit
  • fix(win): Add Key::RawCode arm to TSF should_claim match
  • fix(ci): Resolve Rust 1.94 clippy lints (is_multiple_of, unnecessary_unwrap)
  • chore(ci): Bump actions/checkout v4 → v5 (Node.js 24)

What's Next (v0.6.0 and beyond)

  • macOS: plumb ReplaceWord through to IMK Swift layer
  • Windows: unified scancode abstraction (virtual key → evdev mapping)
  • Config file: expose confidence_threshold, flip_sensitivity, feature flags
  • Additional Cyrillic layouts: Russian Phonetic, Ukrainian, Serbian
  • GTK4 popup panel — Alt-hold to see full candidate list
  • Memory-mapped binary corpus format (eliminate JSON parse at startup)
  • Windows MSI installer + macOS .app bundle
  • Mobile: Android IME via JNI

Install

git clone https://github.com/RMANOV/smartkey.git
cd smartkey
cargo build --release
cargo test --workspace   # 207 tests

Linux (IBus — with layout-agnostic input):

maturin develop -m crates/smartkey-py/Cargo.toml --release
# Start typing — SmartKey detects your intended language automatically

Windows (TSF): cargo build -p smartkey-win --release → register DLL (HKCU, no elevation)
macOS (IMK): cargo build -p smartkey-mac --release → link from Swift


Full diff: v0.4.0...v0.5.0

v0.4.0 — "Algorithmic Intelligence"

14 Mar 10:19

┌─────────────────────────────────────────────────────────────────┐
│  SmartKey v0.4.0 — "Algorithmic Intelligence"                   │
│                                                                 │
│  2 commits · 18 files · +2,333 lines · ~150+ tests             │
│  8 new prediction modules · PPM · BPE · Kneser-Ney · Hedge     │
└─────────────────────────────────────────────────────────────────┘

TL;DR

v0.4.0 is SmartKey's largest algorithmic leap. Eight new prediction modules land in one release: Unicode-based language detection, per-language vocabulary counters, character-level PPM compression, BPE subword tokenisation, modified Kneser-Ney smoothing, an adaptive ensemble mixer, acceptance-rate telemetry, and session-burst caching. This is the foundation for multilingual prediction without configuration.


Highlights

🌐 Instant Language Detection (lang_detect.rs — 216 LoC)

Unicode block classification: every character has a block (Latin, Cyrillic, Greek, Arabic, CJK…). SmartKey reads the block distribution of your last N keystrokes and infers the active language in microseconds — no ML model, no external corpus, no network call.
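A simplified sketch of block-based detection for three scripts. The code point ranges are the standard Unicode blocks; the enum, the function names, and the majority-vote rule are assumptions, and the real lang_detect.rs presumably covers more blocks.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Script {
    Latin,
    Cyrillic,
    Greek,
    Other,
}

/// Classify one character by Unicode block.
pub fn script_of(c: char) -> Script {
    if !c.is_alphabetic() {
        return Script::Other;
    }
    match c as u32 {
        // Basic Latin letters, Latin-1 Supplement, Latin Extended-A/B.
        0x0041..=0x005A | 0x0061..=0x007A | 0x00C0..=0x024F => Script::Latin,
        // Greek and Coptic.
        0x0370..=0x03FF => Script::Greek,
        // Cyrillic.
        0x0400..=0x04FF => Script::Cyrillic,
        _ => Script::Other,
    }
}

/// Majority vote over recent keystrokes; None if no letters were seen.
pub fn dominant_script(recent: &str) -> Option<Script> {
    let scripts = [Script::Latin, Script::Cyrillic, Script::Greek];
    let mut counts = [0usize; 3];
    for c in recent.chars() {
        match script_of(c) {
            Script::Latin => counts[0] += 1,
            Script::Cyrillic => counts[1] += 1,
            Script::Greek => counts[2] += 1,
            Script::Other => {}
        }
    }
    let (best, &n) = counts.iter().enumerate().max_by_key(|(_, n)| **n)?;
    if n == 0 {
        None
    } else {
        Some(scripts[best])
    }
}
```

Each character costs one range match, so the whole pass is linear in the window size with no allocation.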

📊 Per-Language CVM Counters (lang_cvm.rs — 182 LoC)

Each detected language gets its own CVM streaming counter. As you type Bulgarian the BG counter accumulates; switch to English and the EN counter takes over. Vocabulary boosting is always language-contextual — you never get Bulgarian suggestions while typing English.

⚡ Session Burst Cache (session_cache.rs — 194 LoC)

Words typed this session are tracked with a burstiness weight: repeat a word three times in a row and it jumps to the front of suggestions immediately. The cache is in-memory only — fast, zero-persistence overhead.

📐 Prediction by Partial Matching (ppm.rs — 323 LoC)

PPM is a character-level statistical compression technique based on context modelling (a different family from dictionary coders such as LZ77). SmartKey uses PPM-C: it maintains an order-5 context trie, falls back through order-4 → 3 → 2 → 1 → 0 → escape, and produces calibrated character probabilities. This is opt-in behind a feature flag (ppm).

🔤 BPE Subword Tokeniser (bpe.rs — 263 LoC)

Byte Pair Encoding builds a vocabulary of subword units from corpus merge rules. SmartKey's BPE tokeniser handles out-of-vocabulary words by decomposing them into known subwords — useful for technical terms, compound words, and transliteration fragments. Feature flag: bpe.

📏 Modified Kneser-Ney Smoothing (kneser_ney.rs — 247 LoC)

Unsmoothed trigram models assign zero probability to unseen sequences. Kneser-Ney fixes this by discounting the counts of seen n-grams and backing off to a continuation probability, which scores a word by how many distinct contexts it has appeared in rather than by its raw frequency. Rare words that appear in many different contexts get a fair share. Feature flag: kneser_ney.

🎲 Adaptive Ensemble Mixer (hedge.rs — 235 LoC)

HEDGE (Hedge Algorithm) is an online learning algorithm that assigns exponential weights to each prediction signal. Signals that are right more often get more weight; signals that miss get down-weighted. The mixer adapts per-session without any offline training. Feature flag: hedge.
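The standard Hedge update multiplies each weight by exp(-η × loss) and renormalises. The sketch below shows that rule; the struct, the learning rate η, and the definition of a loss in [0, 1] per signal are assumptions about how hedge.rs might be wired.

```rust
pub struct Hedge {
    weights: Vec<f64>,
    eta: f64, // learning rate
}

impl Hedge {
    /// Start with uniform weights over `n` prediction signals.
    pub fn new(n: usize, eta: f64) -> Self {
        assert!(n > 0, "need at least one signal");
        Hedge { weights: vec![1.0 / n as f64; n], eta }
    }

    pub fn weights(&self) -> &[f64] {
        &self.weights
    }

    /// Multiplicative update: w_i <- w_i * exp(-eta * loss_i), then normalise.
    /// `losses[i]` is assumed to lie in [0, 1], with 0 meaning a correct signal.
    pub fn update(&mut self, losses: &[f64]) {
        assert_eq!(losses.len(), self.weights.len());
        for (w, &loss) in self.weights.iter_mut().zip(losses) {
            *w *= (-self.eta * loss).exp();
        }
        let sum: f64 = self.weights.iter().sum();
        for w in &mut self.weights {
            *w /= sum;
        }
    }

    /// Blend per-signal scores with the current weights.
    pub fn mix(&self, scores: &[f64]) -> f64 {
        self.weights.iter().zip(scores).map(|(w, s)| w * s).sum()
    }
}
```

After one round in which the first of two signals is right and the second is wrong (η = 1), the weights move from 0.5 / 0.5 to roughly 0.73 / 0.27.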

📈 Prediction Quality Telemetry (eval.rs — 256 LoC)

Acceptance rate metrics: what fraction of predictions were accepted, at what rank, after how many keystrokes. The evaluator runs in a background ring buffer and feeds its signals back to the ensemble mixer.


By the Numbers

| Metric | v0.3.0 | v0.4.0 |
|---|---|---|
| Files changed | | 18 |
| Lines added | | +2,333 |
| Lines removed | | -52 |
| Net growth | | +2,281 LoC |
| New modules | | 8 |
| ppm.rs | | 323 LoC |
| bpe.rs | | 263 LoC |
| eval.rs | | 256 LoC |
| kneser_ney.rs | | 247 LoC |
| hedge.rs | | 235 LoC |
| lang_detect.rs | | 216 LoC |
| session_cache.rs | | 194 LoC |
| lang_cvm.rs | | 182 LoC |
| ensemble.rs | ~400 LoC | ~670 LoC (+270) |
| Tests | ~100 | ~150+ |

Feature Flags

Experimental modules are gated behind Cargo feature flags — the default build is fast and stable, opt in to the research algorithms when you want them:

| Flag | Module | Status |
|---|---|---|
| ppm | Prediction by Partial Matching | Experimental |
| bpe | Byte Pair Encoding tokeniser | Experimental |
| kneser_ney | Modified Kneser-Ney smoothing | Experimental |
| hedge | Adaptive ensemble mixer | Experimental |
| session_cache | Burstiness-weighted session cache | Stable |

# Enable all experimental modules:
cargo build --release --features ppm,bpe,kneser_ney,hedge

Full Changelog

Features

  • feat(core): lang_detect.rs — Unicode block-based language detection (216 LoC)
  • feat(core): lang_cvm.rs — per-language CVM vocabulary counters (182 LoC)
  • feat(core): session_cache.rs — burstiness-weighted session word cache (194 LoC)
  • feat(core): ppm.rs — Prediction by Partial Matching, character-level (323 LoC) [feature: ppm]
  • feat(core): bpe.rs — Byte Pair Encoding subword tokeniser (263 LoC) [feature: bpe]
  • feat(core): eval.rs — acceptance rate metrics and prediction quality measurement (256 LoC)
  • feat(core): kneser_ney.rs — modified Kneser-Ney smoothing for rare words (247 LoC) [feature: kneser_ney]
  • feat(core): hedge.rs — adaptive ensemble mixer with exponential weights (235 LoC) [feature: hedge]
  • feat(core): ensemble.rs expanded — +270 LoC multi-signal blending with per-source normalisation

Fixes

  • fix(win): Make category registration non-fatal for HKCU install — startup no longer aborts if optional registry key is inaccessible

What's Next

  • Dual-buffer layout-agnostic input: type in any script on any keyboard layout (v0.5.0)
  • IBus replace handler for automatic layout switching
  • evdev scancode → char lookup tables (EN QWERTY + BG Phonetic)
  • Cold-start wrong-layout detection via corpus frequency comparison
  • Windows installer (MSI) + macOS .app bundle

Install

git clone https://github.com/RMANOV/smartkey.git
cd smartkey
git checkout v0.4.0
cargo build --release
cargo test --workspace   # ~150+ tests
# Optional: enable experimental modules
cargo build --release --features ppm,bpe,kneser_ney,hedge

Linux (IBus): maturin develop -m crates/smartkey-py/Cargo.toml --release
Windows (TSF): cargo build -p smartkey-win --release → register DLL
macOS (IMK): cargo build -p smartkey-mac --release → link from Swift


Full diff: v0.3.0...v0.4.0

v0.3.0 — "Performance & Prediction Quality"

14 Mar 10:19

┌─────────────────────────────────────────────────────────────────┐
│  SmartKey v0.3.0 — "Performance & Prediction Quality"           │
│                                                                 │
│  7 commits · 19 files · +1,071 lines · ~100 tests              │
│  Hash probe elimination · Tab fix · Non-admin install           │
└─────────────────────────────────────────────────────────────────┘

TL;DR

v0.3.0 is a focused sharpening release: redundant hash lookups eliminated, the Tab key now correctly commits text instead of ghosting, and Windows users can install SmartKey without administrator privileges. Personal vocabulary persistence lands in personal.rs, and the ensemble scorer gets a significant expansion.


Highlights

⚡ Hash Probe Elimination

Post-review optimization pass removed every redundant contains_key + get double-lookup from hot prediction paths. Each probe that survived a get_or_insert pattern is now a single operation. On a 100K corpus the CPU branch-prediction benefit is measurable.
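The pattern being removed looks roughly like the first function below, and the single-probe replacements like the other two. The function names and the frequency-map shape are hypothetical; only the contains_key + get double lookup is taken from the notes.

```rust
use std::collections::HashMap;

/// Before: two hash probes for one logical lookup.
pub fn score_double_probe(freq: &HashMap<String, u32>, word: &str) -> u32 {
    if freq.contains_key(word) {
        *freq.get(word).unwrap()
    } else {
        0
    }
}

/// After: a single probe, with the miss case handled by unwrap_or.
pub fn score_single_probe(freq: &HashMap<String, u32>, word: &str) -> u32 {
    freq.get(word).copied().unwrap_or(0)
}

/// Insert-or-update through the entry API: one probe instead of
/// contains_key followed by insert or get_mut.
pub fn bump(freq: &mut HashMap<String, u32>, word: &str) -> u32 {
    let n = freq.entry(word.to_string()).or_insert(0);
    *n += 1;
    *n
}
```

One trade-off: `entry` needs an owned key, so `bump` allocates a String on every call, including hits.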

⌨️ Tab CommitText Fix

Tab was being forwarded to the application before SmartKey's ghost text was committed. Result: the candidate disappeared silently, the cursor advanced, and the typed prefix was left dangling. Fixed: Tab now issues CommitText first, then forwards — matching user expectation universally.

🪟 Non-Admin HKCU Install (Windows)

SmartKey previously required Administrator rights to write its COM registration to HKLM. v0.3.0 introduces per-user registration via HKCU\Software\Classes — no elevation prompt, no IT-department ticket needed.

📚 Personal Vocabulary Persistence (personal.rs)

219-line module for per-user vocabulary tracking: words you accept get recorded, persisted to JSON, and boosted on next load. The feedback loop is now on disk, surviving restarts.

🧮 Ensemble Scorer Expansion (ensemble.rs)

+200 lines of scoring logic: confidence weighting per signal, per-source normalisation, and improved blending of Markov trigram + CVM frequency scores. Prediction quality measurably improves on mixed-language corpora.

🔧 Clippy Lint Cleanup

map_or(false, |x| pred(x)) → is_some_and(|x| pred(x)) across the codebase. Cleaner and more idiomatic.
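For reference, the two forms are equivalent. `is_long` and the wrapper names below are hypothetical stand-ins for the predicates in the codebase.

```rust
/// Hypothetical predicate standing in for `pred`.
fn is_long(word: &str) -> bool {
    word.chars().count() > 6
}

/// Old form flagged by clippy.
pub fn check_before(candidate: Option<&str>) -> bool {
    candidate.map_or(false, |w| is_long(w))
}

/// New form: Option::is_some_and (stable since Rust 1.70).
pub fn check_after(candidate: Option<&str>) -> bool {
    candidate.is_some_and(|w| is_long(w))
}
```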


By the Numbers

| Metric | v0.2.0 | v0.3.0 |
|---|---|---|
| Files changed | | 19 |
| Lines added | | +1,071 |
| Lines removed | | -393 |
| Net growth | | +678 LoC |
| personal.rs | | 219 LoC (new) |
| ensemble.rs | ~200 LoC | ~400 LoC (+200) |
| Admin required (Windows) | Yes | No (HKCU) |

Full Changelog

Features

  • feat(core): personal.rs — 219 LoC personal vocabulary persistence (JSON round-trip)
  • feat(core): ensemble.rs expanded — +200 LoC, improved confidence weighting and signal blending
  • feat(win): Non-admin HKCU COM registration — install without elevation

Fixes

  • fix(core): Tab CommitText bug — Tab now commits ghost text before forwarding
  • fix(core): Post-review optimizations — eliminate redundant hash probes and allocations
  • fix(core): Remove dead dependencies and dead code
  • fix(clippy): map_or(false, ..) → is_some_and(..) in ngram.rs

Refactoring

  • refactor: Merge feature/perf-optimization and feature/smartkey-v030 — clean history

What's Next

  • Language detection — per-script Unicode block analysis (v0.4.0)
  • Prediction by Partial Matching (PPM) character model
  • Kneser-Ney smoothing for rare-word handling
  • Adaptive ensemble mixer with exponential weights
  • Per-language CVM counters for zero-config multilingual support

Install

git clone https://github.com/RMANOV/smartkey.git
cd smartkey
git checkout v0.3.0
cargo build --release
cargo test --workspace

Linux (IBus): maturin develop -m crates/smartkey-py/Cargo.toml --release
Windows (TSF): cargo build -p smartkey-win --release → register DLL (no elevation needed)
macOS (IMK): cargo build -p smartkey-mac --release → link from Swift


Full diff: v0.2.0...v0.3.0

v0.2.0 — Cross-Platform

12 Mar 14:37

┌─────────────────────────────────────────────────────────────────┐
│  SmartKey v0.2.0 — "Cross-Platform"                             │
│                                                                 │
│  41 commits · 47 files · +9,132 lines · 106 tests              │
│  3 platforms · 4 crates · 8 benchmark groups · 0 unsafe panics  │
└─────────────────────────────────────────────────────────────────┘

   Markov trigram:  125 ns     CVM score:  151 ns
   Full predict:    137 µs     Fuzzy (1K): 505 µs

TL;DR

SmartKey grew from a Linux-only IBus engine into a cross-platform prediction engine with Windows TSF, macOS IMK adapters, fuzzy matching, CVM recency decay, and honest benchmark numbers.


Highlights

🪟 Windows TSF — Full IME Integration

Complete Text Services Framework adapter: COM DLL, class factory, TSF registration, composition lifecycle, ghost text display attributes. Key innovations:

  • ToUnicodeEx key mapping — resolves Shift, CapsLock, Cyrillic, and any installed keyboard layout. No more hardcoded ASCII.
  • Smart key claiming: OnTestKeyDown only claims keys SmartKey processes. Tab/Right/Escape only when ghost text is visible. Stops blocking other IMEs and app shortcuts.
  • Dead-key safe: dwFlags=4 preserves compose sequences (e.g. ^ + e → ê).

🍎 macOS IMK — C FFI + Swift

Complete C FFI surface for Swift IMKInputController:

  • smartkey_handle_key / smartkey_focus_lost / smartkey_reset — full event lifecycle
  • smartkey_load_trigram + smartkey_load_corpus_file — corpus loading (new in v0.2.0)
  • smartkey_save_personal / smartkey_load_personal — vocabulary persistence

🔍 Fuzzy Matching — Damerau-Levenshtein

When exact prefix fails, SmartKey falls back to fuzzy search with configurable edit distance:

  • Handles substitutions, insertions, deletions, and transpositions
  • "functon" → "function" (1 edit), "hlelo" → "hello" (1 transposition)
  • Fuzzy results ranked below exact matches via configurable discount factor
  • Benchmark: 505 µs for 1K corpus / 1 edit — fast enough for real-time

🧠 CVM Recency Decay

The CVM streaming counter now has time-aware vocabulary adaptation:

  • Words you stop typing decay probabilistically over rounds
  • decay_lambda parameter controls half-life (default: 0.1)
  • refresh() resets decay clock on re-encounter — frequent words survive
  • Benchmark: 151 ns per frequency score — nanosecond-class personal boosting

📊 Criterion Benchmark Suite — Real Numbers

8 benchmark groups, 316 lines, measuring every component in isolation:

| Component | Median | Notes |
|---|---|---|
| Markov (no context) | 6.4 ns | Hash lookup, essentially O(1) |
| Markov (trigram backoff) | 125 ns | Katz interpolation |
| CVM frequency score | 151 ns | Streaming cardinality |
| Full predict (1K corpus) | 137 µs | Well under 16ms frame budget |
| Full predict (10K corpus) | 1.74 ms | Still real-time capable |
| Fuzzy search (1K, 1-edit) | 505 µs | Damerau-Levenshtein |
| Personal JSON round-trip | 62.5 µs | CVM snapshot save/load |

Architecture Evolution

v0.1.0:                          v0.2.0:

smartkey-core ← smartkey-py      smartkey-core (Apache 2.0)
      │              │                 │
      │         IBus (Linux)     ┌─────┼─────────┐
      └──────────────┘           │     │         │
                            smartkey-py │    smartkey-mac
                              (PyO3)   │     (C FFI)
                              Linux  smartkey-win  macOS
                                      (TSF/COM)
                                      Windows

From 2 crates to 4. The core remains Apache 2.0 — embed it anywhere.


By the Numbers

| Metric | v0.1.0 | v0.2.0 |
|---|---|---|
| Platforms | 1 (Linux) | 3 (Linux, Windows, macOS) |
| Crates | 2 | 4 |
| Tests | 38 | 106 (+179%) |
| Benchmark groups | 0 | 8 |
| Corpus (EN) | ~3K unigrams | 100K+ unigrams |
| Corpus (BG) | ~2K unigrams | 100K+ unigrams |
| Integration tests | 0 | 5 |
| CI | None | GitHub Actions |

Full Changelog

Features

  • feat(core): Extract InputMethodCore — cross-platform key event state machine
  • feat(core): Fuzzy prefix matching with Damerau-Levenshtein fallback
  • feat(cvm): Recency decay for frequency scoring (time-aware adaptation)
  • feat(win): Windows TSF scaffold → COM DLL → composition → display attributes → ToUnicodeEx
  • feat(mac): macOS IMK scaffold → C FFI → trigram + corpus file loading
  • feat(py): from_config, export/import_personal in PyO3 bridge
  • feat(bench): Criterion benchmark suite (8 groups, 316 lines)
  • feat(corpus): Scale EN and BG corpora to 100K+ unigrams + tech corpus (421 terms)

Fixes

  • fix(win): ToUnicodeEx for layout-aware key mapping (Shift, Cyrillic)
  • fix(win): Smart key claiming — stops blocking other IMEs
  • fix(win): Dead-key composition safety (dwFlags=4)
  • fix(win): Honor riid in COM entry points, track live objects
  • fix(ibus): Graceful daemon shutdown (SIGTERM/SIGINT handlers)
  • fix(ibus): Corpus deduplication (prevent .json + .bin double-loading)
  • fix: 40+ code review findings (4 critical, 10 important, 26 low)
  • fix(ci): Deterministic test_personal_boost + corpus tracking

Refactoring

  • refactor(ibus): Delegate corpus loading to Rust (removes 31 lines of Python)
  • refactor: Consolidate config parsing, remove dead code

Documentation

  • docs(readme): Fix phantom features, mark completed roadmap items
  • docs(readme): Replace aspirational targets with actual benchmark numbers
  • docs: CI workflow + cross-platform README update

What's Next (Roadmap)

  • Memory-mapped binary corpus format (mmap — eliminate JSON parse at startup)
  • GTK4 popup panel for Alt-hold alternatives
  • SQLite persistence for CVM personal vocabulary
  • BK-tree or SymSpell index for sublinear fuzzy lookup at scale
  • Windows installer (MSI) + macOS .app bundle
  • Mobile port (Android IME via JNI)

Install

git clone https://github.com/RMANOV/smartkey.git
cd smartkey
cargo build --release
cargo test --workspace   # 106 tests
cargo bench              # 8 benchmark groups

Linux (IBus): maturin develop -m crates/smartkey-py/Cargo.toml --release
Windows (TSF): cargo build -p smartkey-win --release → register DLL
macOS (IMK): cargo build -p smartkey-mac --release → link from Swift


Full diff: v0.1.0...v0.2.0

v0.1.0 — Genesis: The Keyboard That Learns You

08 Mar 13:38

SmartKey v0.1.0 — Genesis

The keyboard that learns you, not the other way around.

SmartKey is a system-wide predictive keyboard engine for Linux that predicts the next word — not the next paragraph. Pure Rust core. Zero cloud. Sub-microsecond lookups. Your typing patterns never leave your machine.


What's Inside

Prediction Engine (Rust, 1,258 LoC)

A 3-stage ensemble scorer that blends three independent signals into a single ranked prediction:

| Stage | Algorithm | Latency | What It Does |
|---|---|---|---|
| N-gram Trie | Aho-Corasick prefix search | < 1 μs | Finds all words matching your typed prefix |
| Markov Chain | Katz backoff (trigram → bigram → unigram) | < 5 μs | Scores candidates by context probability |
| CVM Counter | Streaming cardinality estimator | < 2 μs | Boosts words you type frequently |
final_score = α · corpus + β · markov + γ · personal
            = 0.4 · "how common" + 0.4 · "how likely here" + 0.2 · "how often YOU say this"
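
To make the blend concrete, the following is a minimal sketch of the weighted sum above. It assumes each of the three signals has already been normalised to the range 0 to 1, which these notes do not state; `Weights`, `Candidate`, `final_score`, and `rank` are hypothetical names, not smartkey-core's API.

```rust
/// Blend weights from the formula above (alpha, beta, gamma).
#[derive(Debug, Clone, Copy)]
struct Weights {
    corpus: f64,
    markov: f64,
    personal: f64,
}

const DEFAULT_WEIGHTS: Weights = Weights {
    corpus: 0.4,
    markov: 0.4,
    personal: 0.2,
};

/// One candidate word with its three signals.
/// Assumption: every signal is pre-normalised to [0, 1].
struct Candidate<'a> {
    word: &'a str,
    corpus: f64,   // "how common"
    markov: f64,   // "how likely here"
    personal: f64, // "how often YOU say this"
}

/// final_score = alpha * corpus + beta * markov + gamma * personal
fn final_score(c: &Candidate, w: Weights) -> f64 {
    w.corpus * c.corpus + w.markov * c.markov + w.personal * c.personal
}

/// Return the candidate words ordered from highest to lowest score.
fn rank<'a>(mut candidates: Vec<Candidate<'a>>, w: Weights) -> Vec<&'a str> {
    candidates.sort_by(|x, y| final_score(y, w).total_cmp(&final_score(x, w)));
    candidates.into_iter().map(|c| c.word).collect()
}
```

With these illustrative numbers, a word that is less common in the corpus (0.5 against 0.8) but has a full personal score (1.0 against 0.0) scores 0.52 against 0.44 at equal context probability (0.3), so it ranks first.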

After a week of coding, async outranks assistant. After switching to Bulgarian, здравей surfaces automatically. No retraining. No config. Just typing.

IBus Integration (Python, 499 LoC)

Full IBus.Engine subclass with:

  • Ghost text — top prediction appears inline as dimmed text
  • Tab to accept entire word, to accept one character, Esc to dismiss
  • Super+Escape kill switch — instantly disables/re-enables
  • Automatic context tracking (5-word sliding window)
  • Multi-language corpus loading (EN + BG out of the box)
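
The 5-word context window in the list above can be pictured as a small ring buffer. The sketch below is written in Rust to match the core, although the IBus layer itself is Python; `ContextWindow` and its methods are hypothetical names, and the real engine may track context differently.

```rust
use std::collections::VecDeque;

/// Fixed-capacity sliding window over the most recently committed words.
struct ContextWindow {
    words: VecDeque<String>,
    capacity: usize,
}

impl ContextWindow {
    fn new(capacity: usize) -> Self {
        ContextWindow {
            words: VecDeque::with_capacity(capacity),
            capacity,
        }
    }

    /// Record a committed word, evicting the oldest one when the window is full.
    fn push(&mut self, word: &str) {
        if self.capacity == 0 {
            return;
        }
        if self.words.len() == self.capacity {
            self.words.pop_front();
        }
        self.words.push_back(word.to_string());
    }

    /// The last `n` words in typing order, e.g. n = 2 for a trigram lookup.
    fn last_n(&self, n: usize) -> Vec<&str> {
        let skip = self.words.len().saturating_sub(n);
        self.words.iter().skip(skip).map(String::as_str).collect()
    }
}
```

A trigram model only needs `last_n(2)`; holding five words leaves room for longer-range signals without unbounded growth.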

Corpus Pipeline (Python + Bash, 793 LoC)

Production-grade Wikipedia-to-JSON pipeline:

  • download_corpus.sh — downloads Wikipedia article dumps (any language)
  • extract_wiki.py — streaming MediaWiki XML extractor (18 regex cleanup passes, skips redirects/stubs)
  • build_corpus.py — sentence-aware tokenizer with contraction + Cyrillic support, streaming n-gram counter
  • corpus_stats.py — quality metrics: Zipf exponent, hapax legomena, coverage curves
# Build a Bulgarian corpus from Wikipedia in one pipeline:
./download_corpus.sh --lang bg
python3 extract_wiki.py raw/bgwiki-*.xml.bz2 | python3 build_corpus.py --stdin --lang bg --min-freq 3

By the Numbers

| Metric | Value |
|---|---|
| Rust LoC | 1,313 |
| Python LoC | 1,193 |
| Test suite | 38 tests, 0 failures |
| Modules tested | 5 (CVM, n-gram, Markov, prefix, ensemble) |
| Dependencies | Minimal (aho-corasick, rand, pyo3, serde) |
| Prediction latency | < 10 μs target |
| Memory footprint | < 50 MB target (EN+BG corpora) |
| Cloud dependency | Zero. Everything is local. |

The CVM Innovation

Most keyboards use frequency tables or neural nets. SmartKey uses the Chakraborty–Vinodchandran–Meel (CVM) algorithm — a streaming cardinality estimator that naturally provides:

  • Vocabulary decay — words you stop typing get probabilistically evicted
  • Automatic adaptation — no timestamps, no TTLs, no explicit forgetting
  • O(log N) memory — tracks your entire vocabulary in kilobytes
  • Zero-config language detection — per-language CVM counters detect script switches mid-sentence
cvm.frequency_score("function")  // → 64.0 (you're a programmer)
cvm.frequency_score("synergy")   // → 0.0  (you stopped writing corporate email)
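
For readers unfamiliar with CVM, here is a minimal sketch of the textbook algorithm as a distinct-element estimator: a bounded buffer, a sampling probability `p`, and a halving step that randomly evicts about half the buffer whenever it fills. How SmartKey derives `frequency_score` from this is not described in these notes, so this shows only the underlying technique. `CvmEstimator` and the small xorshift generator are hypothetical stand-ins (the real crate lists `rand` as a dependency), and the paper's rare failure case, where halving evicts nothing, is ignored.

```rust
use std::collections::HashSet;

/// Tiny xorshift64* generator so the sketch needs no external crates.
struct XorShift64(u64);

impl XorShift64 {
    /// Uniform value in [0, 1).
    fn next_f64(&mut self) -> f64 {
        let mut x = self.0;
        x ^= x >> 12;
        x ^= x << 25;
        x ^= x >> 27;
        self.0 = x;
        let r = x.wrapping_mul(0x2545F4914F6CDD1D);
        (r >> 11) as f64 / (1u64 << 53) as f64
    }
}

/// Textbook CVM estimator for the number of distinct words in a stream.
struct CvmEstimator {
    buffer: HashSet<String>,
    capacity: usize,
    p: f64, // current sampling probability
    rng: XorShift64,
}

impl CvmEstimator {
    fn new(capacity: usize, seed: u64) -> Self {
        assert!(capacity > 0);
        CvmEstimator {
            buffer: HashSet::new(),
            capacity,
            p: 1.0,
            rng: XorShift64(seed.max(1)), // xorshift state must be non-zero
        }
    }

    fn observe(&mut self, word: &str) {
        // Forget any earlier decision about this word, then re-sample it.
        self.buffer.remove(word);
        if self.rng.next_f64() < self.p {
            self.buffer.insert(word.to_string());
        }
        // Buffer full: evict each entry with probability 1/2 and halve p.
        if self.buffer.len() >= self.capacity {
            let rng = &mut self.rng;
            self.buffer.retain(|_| rng.next_f64() < 0.5);
            self.p /= 2.0;
        }
    }

    /// Estimated number of distinct words seen so far.
    fn estimate(&self) -> f64 {
        self.buffer.len() as f64 / self.p
    }
}
```

While the stream has fewer distinct words than `capacity`, `p` stays at 1 and the estimate is exact. Once the buffer fills, the halving step is what gives the probabilistic eviction described in the list above.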

Dual License

  • smartkey-core (Rust library) → Apache 2.0 — embed in your mobile keyboard, IDE plugin, or WASM app
  • Application layer (IBus, corpus, PyO3) → GPL 3.0 — derivatives stay open

Quick Start

git clone https://github.com/RMANOV/smartkey.git
cd smartkey
cargo build --release
cargo test -p smartkey-core    # 38 tests, 0 failures

# Build PyO3 bridge
pip install maturin
maturin develop -m crates/smartkey-py/Cargo.toml --release

# Register with IBus
cp ibus/smartkey.xml ~/.local/share/ibus/component/
ibus restart

What's Next (Roadmap)

  • Criterion benchmark suite with latency targets
  • Memory-mapped binary corpus (mmap, no JSON parse at startup)
  • GTK4 popup panel for Alt-hold alternatives
  • SQLite persistence for CVM personal vocabulary
  • Pre-built EN + BG corpus packages
  • Typo correction via Damerau-Levenshtein distance (shipped in v0.2.0)
  • Android IME port via JNI

Built with Rust, Python, and obsessive attention to keystroke latency.