CSV Anonymizer

CSV Anonymizer is a local-first desktop app for reducing the exposure of sensitive data in CSV files and pasted text before sharing, testing, demos, or support work. It detects likely personal data, previews transformations, and writes protected output while preserving the original structure where possible.

All non-LLM detection and transformation run locally in Rust. Optional local LLM replacement also runs on your machine through Ollama.

Read the generated project wiki at github.com/ddv1982/csv-data-anonymizer/wiki.

What It Does

  • Detects common sensitive fields: emails, names, phone numbers, UUIDs, timestamps, numeric IDs, addresses, postal codes, IPs, URLs, MAC addresses, tax IDs, VAT/BTW numbers, and more.
  • Auto-selects high and medium risk columns while still letting you choose exactly which columns to transform.
  • Shows a preview before writing output. Rule-based preview replacements are examples only; the final output is generated in its own randomized run, so its values can differ from the preview.
  • Streams CSV file transformations instead of loading the whole file into memory.
  • Supports lightweight paste workflows for CSV, JSON, XML, YAML, plain text, and logs up to 5 MiB; larger CSV inputs should use the streaming file workflow.
  • Includes Quick by Data Type generation for creating protected sample values without first providing input data.
  • Keeps repeated source values consistent within each run.
  • Offers optional Smart replacement with a local LLM for selected columns.
  • Produces a privacy report with transformed column counts, redaction counts, reused values, token counts, Local AI replacement counts, and fallbacks.
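
The within-run consistency and the reused-value count mentioned above can be pictured with a small cache. This is a minimal sketch, not the app's implementation: `RunMapping`, its methods, and the counter-based `tok_` generator are hypothetical names chosen for illustration, and a real run would use randomized values instead of a counter.

```rust
use std::collections::HashMap;

/// Hypothetical per-run cache: each distinct source value gets one replacement.
pub struct RunMapping {
    seen: HashMap<String, String>,
    reused: usize,
}

impl RunMapping {
    pub fn new() -> Self {
        Self { seen: HashMap::new(), reused: 0 }
    }

    /// Returns the replacement for `source`, creating it the first time it is seen.
    pub fn replace(&mut self, source: &str) -> String {
        if let Some(existing) = self.seen.get(source) {
            // Repeated source value: reuse the earlier replacement and count it.
            self.reused += 1;
            return existing.clone();
        }
        // Assumption: a counter stands in for the randomized generator.
        let token = format!("tok_{:06}", self.seen.len() + 1);
        self.seen.insert(source.to_string(), token.clone());
        token
    }

    /// Number of times an existing replacement was reused in this run.
    pub fn reused_count(&self) -> usize {
        self.reused
    }
}
```

Creating a fresh `RunMapping` for every run keeps values consistent inside that run without carrying a mapping from one run to the next.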

Detection Language Coverage

The app UI is currently English. CSV and pasted values are read as UTF-8, and detector rules are Unicode-aware. Detection coverage is fixture-backed, but it is not a claim of full multilingual parity.

Header-based sensitive-field detection includes a maintained taxonomy for English, Dutch, German, French, Spanish, Portuguese, and Italian, plus a small Japanese pilot for unambiguous phone, address, name, and date headers. Header matching handles Unicode normalization, word segmentation, accent folding for Latin terms, camelCase splitting, compact aliases such as apikey, homephone, and person_id, and conservative fuzzy matching for longer taxonomy terms with sample-value confirmation.
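
A rough sketch of the header normalization steps described above (separator and camelCase splitting, lowercasing, accent folding, and compact-alias matching) is shown here. Everything in it is an assumption for illustration: the function names, the tiny `TAXONOMY` table, and the handful of folded letters are hypothetical, and it omits Unicode normalization, fuzzy matching, and sample-value confirmation.

```rust
/// Folds a few accented Latin letters to ASCII (illustrative subset only).
fn fold_accent(c: char) -> char {
    match c {
        'à' | 'á' | 'â' | 'ã' | 'ä' => 'a',
        'ç' => 'c',
        'è' | 'é' | 'ê' | 'ë' => 'e',
        'ì' | 'í' | 'î' | 'ï' => 'i',
        'ñ' => 'n',
        'ò' | 'ó' | 'ô' | 'õ' | 'ö' => 'o',
        'ù' | 'ú' | 'û' | 'ü' => 'u',
        other => other,
    }
}

/// Splits a header into lowercase, accent-folded words.
/// Words end at non-alphanumeric separators and at lower-to-upper camelCase boundaries.
pub fn header_words(header: &str) -> Vec<String> {
    let mut words = Vec::new();
    let mut current = String::new();
    let mut prev_lower = false;
    for c in header.chars() {
        if !c.is_alphanumeric() {
            if !current.is_empty() {
                words.push(std::mem::take(&mut current));
            }
            prev_lower = false;
            continue;
        }
        if c.is_uppercase() && prev_lower && !current.is_empty() {
            words.push(std::mem::take(&mut current));
        }
        prev_lower = c.is_lowercase() || c.is_numeric();
        for lower in c.to_lowercase() {
            current.push(fold_accent(lower));
        }
    }
    if !current.is_empty() {
        words.push(current);
    }
    words
}

/// Hypothetical taxonomy lookup: matches single words or the compact (joined) form.
pub fn classify_header(header: &str) -> Option<&'static str> {
    const TAXONOMY: &[(&str, &str)] = &[
        ("email", "email"),
        ("phone", "phone"),
        ("telephone", "phone"),
        ("telefoon", "phone"),
        ("apikey", "secret"),
        ("personid", "id"),
    ];
    let words = header_words(header);
    let compact = words.concat();
    for (term, label) in TAXONOMY {
        if compact == *term || words.iter().any(|w| w.as_str() == *term) {
            return Some(*label);
        }
    }
    None
}
```

With this sketch, `homePhone`, `E-Mail`, and `person_id` all resolve to a category, while a header such as `notes` does not.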

Value validators run independently of header language for structured values such as email, UUID, IP address, URL, MAC address, IBAN, payment cards, VAT IDs, Dutch BTW/omzetbelastingnummer, US SSN/EIN, and formatted phone numbers. Dutch BTW values without an NL prefix are detected only under Dutch BTW header context.
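
As an illustration of a header-independent value validator, the sketch below applies the Luhn checksum, a common check for payment card numbers. Whether the app's payment card validator uses Luhn is an assumption; the function name `luhn_valid` and the 12 to 19 digit length window are likewise chosen for this example.

```rust
/// Returns true when `value` holds 12 to 19 digits (spaces and hyphens allowed)
/// and those digits pass the Luhn checksum.
pub fn luhn_valid(value: &str) -> bool {
    let mut digits: Vec<u32> = Vec::new();
    for c in value.chars() {
        if c == ' ' || c == '-' {
            continue; // tolerate common card separators
        }
        match c.to_digit(10) {
            Some(d) => digits.push(d),
            None => return false, // any other character disqualifies the value
        }
    }
    if digits.len() < 12 || digits.len() > 19 {
        return false;
    }
    // Walk from the rightmost digit, doubling every second one.
    let sum: u32 = digits
        .iter()
        .rev()
        .enumerate()
        .map(|(i, &d)| {
            if i % 2 == 1 {
                let doubled = d * 2;
                if doubled > 9 { doubled - 9 } else { doubled }
            } else {
                d
            }
        })
        .sum();
    sum % 10 == 0
}
```

A check like this depends only on the cell value, which is why it can flag a column regardless of the language of its header.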

Local LLM Smart Replacement

Smart replacement is optional and off by default. It is designed for columns where rule-based masking is too mechanical and you want more realistic fake values.

The first implementation uses:

  • Ollama running on localhost
  • gemma3:4b as the lightweight default model
  • In-app status checks, setup link, model download, progress, and cancel controls

Usage:

  1. Install or start Ollama.
  2. In CSV Anonymizer, open Local AI setup when Smart replacement prompts for it.
  3. Download gemma3:4b from the app if it is not already available.
  4. Select Smart replacement (Local AI) for the columns that should use the model.
  5. Review the preview, then run the transformation.

The app batches unique values per selected column, asks the local model for realistic fake replacements, validates the response, reuses accepted replacements for repeated source values within the current run, and falls back to rule-based pseudonymization when the model output is missing or unsafe.
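
The flow in the paragraph above can be sketched as follows. This is a simplified illustration under stated assumptions, not the app's code: `smart_replace` and `SmartOutcome` are hypothetical names, the model is passed in as a closure instead of an Ollama call, and "unsafe" is reduced here to "empty, or identical to any source value in the column".

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical result type: one replacement per unique source value.
pub struct SmartOutcome {
    pub replacements: HashMap<String, String>,
    pub fallbacks: usize,
}

/// Batches unique values, asks `model` for replacements, validates each answer,
/// and uses `fallback` when an answer is missing or rejected.
pub fn smart_replace<M, F>(
    values: &[&str],
    batch_size: usize,
    mut model: M,
    fallback: F,
) -> SmartOutcome
where
    M: FnMut(&[String]) -> Vec<Option<String>>,
    F: Fn(&str) -> String,
{
    // Deduplicate while keeping first-seen order.
    let mut originals: HashSet<&str> = HashSet::new();
    let unique: Vec<String> = values
        .iter()
        .filter(|v| originals.insert(**v))
        .map(|v| v.to_string())
        .collect();

    let mut replacements = HashMap::new();
    let mut fallbacks = 0;
    for batch in unique.chunks(batch_size.max(1)) {
        let answers = model(batch);
        for (i, source) in batch.iter().enumerate() {
            let candidate = answers.get(i).cloned().flatten();
            // Assumed validation rule: reject empty output and echoes of real values.
            let accepted = candidate
                .filter(|c| !c.trim().is_empty() && !originals.contains(c.as_str()));
            let replacement = match accepted {
                Some(c) => c,
                None => {
                    fallbacks += 1;
                    fallback(source)
                }
            };
            replacements.insert(source.clone(), replacement);
        }
    }
    SmartOutcome { replacements, fallbacks }
}
```

Because the map is keyed by source value, repeated values in the column resolve to the same replacement, and the fallback count can feed the privacy report.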

Model weights and local runtime binaries are not bundled in the repository or desktop release. The first model download uses network access through Ollama. CSV values selected for Smart replacement are sent only to the configured local Ollama endpoint.

Privacy Boundary

The standard workflow transforms selected values in place: CSV file output keeps the source rows and columns, while pasted structured or text workflows keep the original shape where possible. It redacts, masks, pseudonymizes, tokenizes, or locally replaces selected values. It reduces exposure, but the output is still transformed source data, not guaranteed anonymous data.

It does not produce formal anonymity, differential privacy aggregates, or synthetic datasets. Review previews and privacy reports before sharing generated files.

Strategies

| Strategy | Use |
| --- | --- |
| Redact | Replace values with typed placeholders such as [EMAIL], [PERSON], or [DATE]. |
| Mask | Replace values with simple masked output. |
| Pseudonymize | Generate readable or shape-preserving fake values. |
| Tokenize | Replace values with opaque tok_... tokens that stay consistent within the current run. |
| Smart replacement (Local AI) | Use a local LLM through Ollama for more realistic fake replacements. |
| Pass through | Leave values unchanged. |

Examples of format preservation include email domains, UUID shape, timestamp precision, numeric width and decimals, phone separators, and full-name token count.
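
Two of these cases, email domains and digit layout, can be illustrated with a short sketch. The function names and the caller-supplied digit source are hypothetical; the app's own generators are not shown in this README and may work differently.

```rust
/// Replaces the local part of an email address and keeps the original domain.
/// Returns None when the value has no '@' or an empty domain.
pub fn pseudonymize_email(value: &str, fake_local: &str) -> Option<String> {
    let (_, domain) = value.rsplit_once('@')?;
    if domain.is_empty() {
        return None;
    }
    Some(format!("{}@{}", fake_local, domain))
}

/// Replaces every ASCII digit with one from `next_digit` and leaves all other
/// characters (separators, signs, decimal points) where they are.
pub fn reshape_digits(value: &str, mut next_digit: impl FnMut() -> u8) -> String {
    value
        .chars()
        .map(|c| {
            if c.is_ascii_digit() {
                char::from(b'0' + next_digit() % 10)
            } else {
                c
            }
        })
        .collect()
}
```

Since only digits are swapped, phone separators, numeric width, and decimal places survive unchanged. A real run would pass a random digit source instead of a fixed one.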

Install

Download desktop builds from GitHub Releases.

macOS:

  • Download the .dmg for your Mac.
  • Use aarch64 for Apple Silicon and x64 for Intel.
  • Drag the app into Applications.

Linux:

  • Download the .AppImage, .deb, or .rpm from the latest release.
  • For direct downloads, also download the matching .sha256 and .sha256.asc files and verify them with the release signing key (csv-anonymizer-archive-keyring.pgp) before installing.
  • Debian/Ubuntu users can enable the signed APT repository:
bash <(curl -fsSL https://ddv1982.github.io/csv-data-anonymizer/install-apt-repo.sh)
sudo apt update
sudo apt install csv-anonymizer

After the repository is enabled, normal sudo apt update and sudo apt upgrade runs handle updates.

Development

Requirements:

  • Rust stable
  • Node.js 22.13 or newer
  • Frontend dependencies from frontend/package-lock.json
  • Playwright Chromium for browser e2e checks: cd frontend && npx playwright install chromium

Setup:

npm ci --prefix frontend

Run the desktop app:

npm run tauri:dev

Useful checks:

npm run typecheck
npm run lint
npm run test
npm run fmt
npm run deadcode:required
npm run docs:check
npm run release:check
npm run tauri:prebuilt:check
npm run artifacts:rust:check
npm run linux:package-manager:check
npm run frontend:e2e
npm run frontend:a11y
npm run frontend:audit
npm run cargo:audit
cargo bench -p csv-anonymizer-core --bench csv_streaming
cargo bench -p csv-anonymizer-core --bench detector_matrix -- --sample-size 10
node scripts/rust-smoke.mjs

The root lint, test, typecheck, fmt, docs:check, and deadcode:required scripts are the canonical local gates. The dead-code scans use Knip for the frontend and cargo-machete for Rust dependency drift, and the weekly GitHub Actions maintenance workflow runs the same required dead-code gate. The detector matrix benchmark measures the built-in detector only; the external PII library comparison is archived in docs/detector-library-evaluation.md.

Project Layout

  • frontend - React/Vite desktop UI.
  • src-tauri - Tauri shell, app settings, commands, background jobs, and Ollama integration.
  • crates/csv-anonymizer-core - CSV detection, preview, transformation, reporting, and tests.
  • crates/csv-anonymizer-app - lightweight CLI smoke harness for the shared core.
  • build - package metadata, icons, and platform assets.
  • scripts - release, packaging, metadata, APT, and smoke-test tooling.

Release steps and signing requirements are documented in docs/releasing.md.