MikeRust extracts text from .docx files in pure Rust — no LibreOffice, no Pandoc, no external process. The extractor lives at src/pdf/mod.rs (extract_docx_text) and is reused by:
- The folder scanner (
src/sync/scanner.rs::extract_text_dispatch) when indexing.docxfiles into the embedding store. - The chat-upload pipeline (
src/routes/documents.rs::upload_document, cache path) when an attached docx is hashed and pre-extracted todata/storage/cache/<hash>.txt. - The
read_documentbuiltin tool (src/llm/builtin_tools.rs) when the LLM asks for the full text of an attached doc by label.
The upstream willchen96/mike and its forks (e.g. marklok/danishmike) shell out to LibreOffice via libreoffice-convert and parse via mammoth. That's three problems for a sovereign desktop app:
- External binary — LibreOffice is ~300 MB, doesn't bundle, requires PATH wiring.
- Per-extraction process spawn — each docx is a fork+exec with ~1 s startup overhead.
- Mammoth loses redline information — its HTML output drops
<w:del>and visual strikes silently.
zip + quick-xml give us streaming ZIP entry access and a low-allocation XML reader. The extractor stays well under 200 lines and runs in microseconds for typical contracts.
The extractor emits plain text with paragraph newlines plus inline [removed by author: …] markers around content the author indicated as removed. Two distinct OOXML signals trigger the marker:
When a document is edited with track-changes on, deleted runs are wrapped:
<w:del w:id="1" w:author="Alice" w:date="2024-01-01T00:00:00Z">
<w:r><w:delText xml:space="preserve">old text</w:delText></w:r>
</w:del><w:delText> is the same as <w:t> for our purposes — text that was there before the deletion. The extractor enters "removed" mode on <w:del>, exits on </w:del>, and wraps everything inside in [removed by author: old text].
Authors sometimes simulate "removed" without enabling tracked changes by applying strike-through formatting to a run:
<w:r>
<w:rPr><w:strike/></w:rPr>
<w:t>visually struck</w:t>
</w:r><w:dstrike/> is double-strike, treated identically. The extractor flips a current_run_struck flag when it sees either inside <w:rPr>, applies the marker for that run's <w:t> content, and clears the flag at </w:r>.
Both signals collapse into the same surface form so downstream consumers (LLM prompts, embedding chunks) don't have to distinguish:
The contract clauses are: clause 1, [removed by author: clause 2], clause 3.
When a struck or deleted segment fans out across multiple sibling runs the marker spans them; the bracket closes at:
- end of
<w:del>block (for tracked deletions) - end of
<w:r>for strike-through-only runs (since strike scope is run-local) - end of
<w:p>(paragraph) — never spans paragraphs, even if a malformed doc would have it.
| OOXML element | Output |
|---|---|
<w:p> open |
newline (if previous content) |
<w:p> close |
flushes any open removal bracket |
<w:br/> |
newline |
<w:tab/> |
single space |
<w:t> text |
emitted verbatim (XML-unescaped) |
| Anything else | ignored |
After the walk, internal whitespace is collapsed to single spaces but newlines are preserved, then leading/trailing whitespace is trimmed.
The extractor focuses on word/document.xml only. Out of scope at the moment:
- Headers / footers (
word/header*.xml,word/footer*.xml) — usually boilerplate (page numbers, "Confidential" labels). Adding them is a one-line change but bloats embeddings with no signal. - Footnotes / endnotes (
word/footnotes.xml) — relevant for academic / legal citations; planned. - Comments (
word/comments.xml) — distinct from track-changes deletions; planned, would emit as[comment by author: …]. - Tracked insertions (
<w:ins>) — currently rendered as plain text, which matches what a human reader would see in the "all markup accepted" view. Could be marked as[inserted by author: …]symmetrically with deletions if redline analysis becomes a priority.
Chat-attached docx files are extracted once at upload and persisted at data/storage/cache/<sha256>.txt. A file edit changes the SHA-256, so the cache transparently regenerates on the next upload. See CACHE.md for the full storage contract.
Existing cache entries from before redline detection landed are stale. To force regeneration:
rm -r data/storage/cache/*
# next upload re-hashes and re-extracts(or just upload a one-byte-different version, which produces a new hash.)
src/pdf/mod.rs mod docx_tests covers:
- Plain paragraphs round-trip unchanged.
<w:del>tracked deletion produces[removed by author: …].<w:strike/>formatted run produces the same marker.<w:dstrike/>double-strike treated identically.- Removal brackets never span paragraph boundaries.
- Plain runs without strike formatting stay plain (no false positives).
- Empty documents return empty string.
Run with:
cargo test --features rag --lib pdf::docx_tests