From 1a6460c7782773d1dd0e9421e88ca2dc5392e985 Mon Sep 17 00:00:00 2001 From: 0xDarkMatter <0xDarkMatter@users.noreply.github.com> Date: Tue, 31 Mar 2026 11:46:59 +1100 Subject: [PATCH 1/7] feat: add Claude Code skill with DuckDB query layer Adds a Claude Code skill for msgvault that covers the full CLI surface and includes direct DuckDB queries against the Parquet analytics cache for operations the CLI search can't handle (boolean logic, multi-domain, aggregations, thread analysis). Includes: - SKILL.md with verified JSON output shapes, search strategy, safety rules - scripts/query.sh helper wrapping common DuckDB patterns (9 subcommands) - references/duckdb-queries.md with full Parquet schema and query patterns - references/workflows.md with multi-step analysis patterns Tested against a ~755k message archive. All documented commands, jq patterns, and DuckDB queries verified against live data. Ref: #230 --- skills/claude-code/SKILL.md | 279 ++++++++++++++++ .../claude-code/references/duckdb-queries.md | 314 ++++++++++++++++++ skills/claude-code/references/workflows.md | 178 ++++++++++ skills/claude-code/scripts/query.sh | 197 +++++++++++ 4 files changed, 968 insertions(+) create mode 100644 skills/claude-code/SKILL.md create mode 100644 skills/claude-code/references/duckdb-queries.md create mode 100644 skills/claude-code/references/workflows.md create mode 100644 skills/claude-code/scripts/query.sh diff --git a/skills/claude-code/SKILL.md b/skills/claude-code/SKILL.md new file mode 100644 index 00000000..7e9477d2 --- /dev/null +++ b/skills/claude-code/SKILL.md @@ -0,0 +1,279 @@ +--- +name: msgvault-ops +description: "Local email archive operations with msgvault — search, analyze, export, and manage Gmail archives stored in SQLite/Parquet. Use when: querying email history, analyzing senders/domains, exporting messages or attachments, managing Gmail deletions, building sender graphs, running email analytics, importing mbox/emlx, or any task involving msgvault CLI. 
Triggers on: msgvault, email archive, email search, gmail archive, email export, sender analysis, sender graph, email classification, attachment export, email deletion, list senders, list domains, email analytics, mbox import."
---

# msgvault-ops

Operate the msgvault email archive CLI. All data is local (SQLite + Parquet). Queries run in milliseconds against DuckDB-powered indexes. Gmail API is only used for sync and deletion.

## Environment

```
Binary: msgvault (or full path if not on PATH)
Data: ~/.msgvault/ (override with MSGVAULT_HOME)
Config: ~/.msgvault/config.toml
```

Ensure `msgvault` is on PATH or use the full binary path.

## Quick Reference

| Task | Command |
|------|---------|
| Archive status | `msgvault stats` |
| Search | `msgvault search "<query>" --json` |
| Top senders | `msgvault list-senders -n 100 --json` |
| Top domains | `msgvault list-domains -n 100 --json` |
| All labels | `msgvault list-labels --json` |
| Read message | `msgvault show-message <id> --json` |
| Export .eml | `msgvault export-eml <id> -o file.eml` |
| Export attachments | `msgvault export-attachments <id> -o ./dir/` |
| Incremental sync | `msgvault sync` |
| TUI | `msgvault tui` (interactive, not for agents) |

**Always use `--json` for programmatic access.** Parse with `jq`.

## Search

### Operators

Single-operator queries only. `from:` requires an **exact** email address — no fuzzy matching.
+ +| Operator | Example | Notes | +|----------|---------|-------| +| `from:` | `from:alice@example.com` | Exact sender address | +| `from:@domain` | `from:@gmail.com` | All senders from domain | +| `to:` | `to:team@company.com` | Recipient | +| `cc:` / `bcc:` | `cc:manager@co.com` | CC/BCC fields | +| `subject:` | `subject:meeting` | Subject text | +| `label:` / `l:` | `label:INBOX` | Gmail label | +| `has:attachment` | `has:attachment` | Has attachments | +| `before:` / `after:` | `after:2024-01-01` | Date (YYYY-MM-DD) | +| `older_than:` / `newer_than:` | `newer_than:7d` | Relative (d/w/m/y) | +| `larger:` / `smaller:` | `larger:5M` | Size filter (K/M) | +| bare words | `project update` | Full-text search | +| `"quoted"` | `"exact phrase"` | Exact phrase match | + +**Known limitations:** OR, NOT (-), wildcards (*), and parentheses do NOT work. For boolean/multi-domain queries, use DuckDB (see below). + +### Search Strategy + +The CLI search is single-operator and requires exact email addresses for `from:`. Work around this by layering tools. + +**Resolve sender first, then search:** +```bash +# Don't know the email? Find it via the sender index +msgvault list-senders -n 200 --json | jq -r '.[] | .key' | grep -i lastname +# Or use the query helper for domain-scoped lookup +bash scripts/query.sh by-domain gmail.com 20 +# Then search with the resolved address +msgvault search 'from:jdoe@example.com subject:proposal' -n 10 --json +``` + +**Narrow progressively:** Start broad (full-text), add operators (from:, subject:, date range) to filter down. Use `--json | jq` to post-filter results the CLI can't handle. + +**Escape to DuckDB when CLI can't do it:** Multi-domain, boolean logic, aggregations, thread analysis — drop to `query.sh` or raw DuckDB. Don't fight the CLI's limitations. + +**Stop after 5 attempts.** If targeted queries with plausible sender + keywords don't find it, more searching rarely helps. 
Check `msgvault list-accounts` (right account?), `msgvault stats` (sync fresh?), or suggest the user check a different account.

### Pagination

Default limit is 50. Use `--limit` and `--offset`:

```bash
msgvault search "from:@gmail.com" --limit 100 --offset 0 --json
msgvault search "from:@gmail.com" --limit 100 --offset 100 --json
```

## Common Workflows

For complete command reference with all flags, see [references/cli-reference.md](references/cli-reference.md).

For complex multi-step workflows, see [references/workflows.md](references/workflows.md).

### Sender Graph Analysis

```bash
# Top 500 senders with counts
msgvault list-senders -n 500 --json | jq -r '.[] | "\(.count)\t\(.key)"' | sort -rn

# Senders in a date range
msgvault list-senders -n 500 --after 2017-01-01 --before 2020-01-01 --json

# Domain breakdown
msgvault list-domains -n 200 --json | jq -r '.[] | "\(.count)\t\(.key)"'
```

### Message Drill-Down

```bash
# Search → get ID → read full message
msgvault search "from:alice@example.com subject:contract" --json | jq '.[0].id'
msgvault show-message <id> --json

# Extract just the body (avoids context bloat)
msgvault show-message <id> --json | jq '.body_text'

# Extract just attachments list
msgvault show-message <id> --json | jq '.attachments'
```

### Attachment Operations

```bash
# Find messages with large attachments
msgvault search "has:attachment larger:5M" --limit 100 --json

# Export all attachments from a message
msgvault export-attachments <id> -o ./exports/

# Export single attachment by SHA-256 hash (from show-message .attachments[].content_hash)
msgvault export-attachment <hash> -o file.pdf

# Batch export
msgvault search "has:attachment label:Personal" --limit 100 --json | \
  jq -r '.[].id' | while read id; do msgvault export-attachments "$id" -o ./exports/; done
```

### Deletion (Staged, Safe)

**WARNING:** `delete-staged` without `--trash` is PERMANENT and IRREVERSIBLE.
Always `--dry-run` first.

Two-step process — stage in TUI, execute via CLI:

1. `msgvault tui` → navigate → select with `Space` → press `d` to stage
2. Review and execute:

```bash
msgvault list-deletions # review pending batches
msgvault delete-staged --dry-run # preview what would be deleted
msgvault delete-staged --trash # move to Gmail trash (recoverable 30 days)
msgvault delete-staged --yes # permanent delete (IRREVERSIBLE)
msgvault cancel-deletion <batch-id> # cancel a batch
msgvault cancel-deletion --all # cancel all
```

Always confirm with the user before executing. Suggest `--dry-run` first.

## JSON Output Shapes (verified)

### search --json

```json
[{
  "id": 12345,
  "source_message_id": "18f0abc123",
  "conversation_id": 67890,
  "source_conversation_id": "thread-abc",
  "subject": "...",
  "from_email": "alice@example.com",
  "from_name": "Alice Smith",
  "sent_at": "2024-01-15T10:30:00Z",
  "snippet": "...",
  "labels": ["INBOX", "IMPORTANT"],
  "has_attachments": true,
  "attachment_count": 2,
  "size_estimate": 45678
}]
```

Notes:
- search returns `from_email` and `from_name` (not `from`). No `to`/`cc`/`bcc` — use `show-message` for recipients.
- **Empty results return non-JSON error text.** Always check exit code or wrap: `msgvault search "..."
--json 2>/dev/null || echo '[]'`

### list-senders / list-domains / list-labels --json

```json
[{"key": "alice@example.com", "count": 142, "total_size": 5678900, "attachment_size": 1234567}]
```

### show-message --json

```json
{
  "id": 12345,
  "source_message_id": "18f0abc",
  "conversation_id": 67890,
  "source_conversation_id": "thread-abc",
  "subject": "...",
  "from": "Alice Smith <alice@example.com>",
  "to": [{"email": "bob@example.com", "name": "Bob Jones"}],
  "cc": [],
  "bcc": [],
  "sent_at": "2024-01-15T10:30:00Z",
  "labels": ["INBOX"],
  "snippet": "...",
  "has_attachments": true,
  "size_estimate": 45678,
  "body_text": "...",
  "body_html": "...",
  "attachments": [{"id": 123, "filename": "doc.pdf", "mime_type": "application/pdf", "size": 12345, "content_hash": "abc123..."}]
}
```

Notes:
- `to`/`cc`/`bcc` are **arrays of objects**: `[{"email": "...", "name": "..."}]` — extract emails with `.to[].email`
- `attachments[].content_hash` is the SHA-256 hash used by `export-attachment`
- `show-message` can return ~11k tokens for long threads. Always pipe through `jq` to extract only what you need: `.body_text`, `.attachments`, `.to[].email`, etc.

## DuckDB Queries (Advanced)

The CLI `search` is single-operator only. For boolean logic, multi-domain queries, aggregations, or cross-table joins, use DuckDB against the Parquet cache.
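The bare invocation shape, before reaching for the helper — a minimal sketch, assuming `duckdb` is on PATH and the analytics cache has been built (`msgvault build-cache`); `year` is the Hive partition key, so this scans cheaply:

```shell
# Minimal raw query: message volume per year, straight from the Parquet cache.
DATA="${MSGVAULT_HOME:-$HOME/.msgvault}/analytics"
duckdb -c "
SELECT year, COUNT(*) AS emails
FROM read_parquet('$DATA/messages/*/data_0.parquet', hive_partitioning=true)
GROUP BY year ORDER BY year;
"
```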
+ +### Query Helper Script + +`scripts/query.sh` wraps common DuckDB patterns — no raw SQL needed: + +```bash +bash scripts/query.sh senders 50 # Top 50 senders +bash scripts/query.sh senders 50 --after 2020-01-01 # Time-scoped +bash scripts/query.sh by-domain gmail.com,hotmail.com,yahoo.com # Senders from specific domains +bash scripts/query.sh classify example.com,supplier.co,partner.org # Count by domain list +bash scripts/query.sh threads alice@example.com # Thread co-participants +bash scripts/query.sh labels # All labels with counts +bash scripts/query.sh label-messages Personal 20 # Messages with label +bash scripts/query.sh unclassified mycompany.com,asana.com # Domains NOT in list +bash scripts/query.sh sql "SELECT ..." # Raw SQL escape hatch +``` + +### Raw DuckDB (when the script doesn't cover it) + +See [references/duckdb-queries.md](references/duckdb-queries.md) for full schema and query patterns. + +```bash +duckdb -c " +SELECT p.domain, COUNT(*) as emails, COUNT(DISTINCT p.email_address) as senders +FROM read_parquet('~/.msgvault/analytics/messages/*/data_0.parquet', hive_partitioning=true) m +JOIN read_parquet('~/.msgvault/analytics/message_recipients/data.parquet') r + ON r.message_id = m.id AND r.recipient_type = 'from' +JOIN read_parquet('~/.msgvault/analytics/participants/participants.parquet') p + ON p.id = r.participant_id +WHERE p.domain IN ('example.com', 'supplier.co', 'partner.org') +GROUP BY p.domain ORDER BY emails DESC; +" +``` + +### Key tables (Parquet in `~/.msgvault/analytics/`) + +| Table | Path | Key Columns | +|-------|------|-------------| +| messages | `messages/*/data_0.parquet` (hive by year) | id, subject, snippet, sent_at, has_attachments, year, month | +| message_recipients | `message_recipients/data.parquet` | message_id, participant_id, recipient_type (from/to/cc/bcc) | +| participants | `participants/participants.parquet` | id, email_address, domain, display_name | +| message_labels | `message_labels/data.parquet` | 
message_id, label_id |
| labels | `labels/labels.parquet` | id, name |
| attachments | `attachments/data.parquet` | message_id, size, filename |

**Use DuckDB when:** multi-domain IN(), boolean AND/OR/NOT, GROUP BY, JOINs, regex, window functions, CSV/JSON export, thread co-participant analysis.

**Use CLI `search` when:** simple single-field lookup, quick message retrieval by ID, full-text search on body content.

## Safety Rules

1. **Never delete without dry-run first** — `delete-staged --dry-run` before `--yes`
2. **Sync is read-only** — sync/sync-full never modify Gmail
3. **Deletion is two-step** — must stage in TUI first, then execute via CLI
4. **Cancel before execute** — use `cancel-deletion` if unsure
5. **Verify after sync** — `msgvault verify <account>` checks integrity
6. **Control output size** — always use `jq` with `show-message` to avoid context bloat
diff --git a/skills/claude-code/references/duckdb-queries.md b/skills/claude-code/references/duckdb-queries.md
new file mode 100644
index 00000000..ee0d6f2f
--- /dev/null
+++ b/skills/claude-code/references/duckdb-queries.md
@@ -0,0 +1,314 @@
# msgvault DuckDB Query Reference

The CLI `search` command is limited to single-operator queries. For anything complex, query the Parquet analytics cache directly with DuckDB.

**DuckDB CLI must be installed** (`which duckdb` to verify).
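A quick preflight before running any of the queries below — a sketch assuming the default `~/.msgvault` layout (override with `MSGVAULT_HOME`); `msgvault build-cache` is the command that populates the cache:

```shell
# Fail fast with a readable message instead of a DuckDB IO error mid-query.
if ! command -v duckdb >/dev/null 2>&1; then
  echo "duckdb not found on PATH" >&2
elif [ ! -d "${MSGVAULT_HOME:-$HOME/.msgvault}/analytics/messages" ]; then
  echo "analytics cache missing -- run: msgvault build-cache" >&2
else
  echo "preflight ok"
fi
```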
+ +## Data Layout + +``` +~/.msgvault/analytics/ +├── messages/year=YYYY/data_0.parquet # Hive-partitioned by year +├── message_recipients/data.parquet # from/to/cc/bcc links +├── participants/participants.parquet # email addresses + domains +├── message_labels/data.parquet # message ↔ label links +├── labels/labels.parquet # label names +├── attachments/data.parquet # attachment metadata +├── conversations/conversations.parquet # thread grouping +└── sources/sources.parquet # account info +``` + +## Table Aliases + +Use these in all queries for readability: + +```sql +-- Standard table references +read_parquet('~/.msgvault/analytics/messages/*/data_0.parquet', hive_partitioning=true) AS m +read_parquet('~/.msgvault/analytics/message_recipients/data.parquet') AS r +read_parquet('~/.msgvault/analytics/participants/participants.parquet') AS p +read_parquet('~/.msgvault/analytics/message_labels/data.parquet') AS ml +read_parquet('~/.msgvault/analytics/labels/labels.parquet') AS l +read_parquet('~/.msgvault/analytics/attachments/data.parquet') AS a +read_parquet('~/.msgvault/analytics/conversations/conversations.parquet') AS c +``` + +## Schema + +### messages (partitioned by year) +| Column | Type | Notes | +|--------|------|-------| +| id | BIGINT | Primary key | +| source_id | BIGINT | FK → sources | +| source_message_id | VARCHAR | Gmail message ID | +| conversation_id | BIGINT | FK → conversations (thread) | +| subject | VARCHAR | | +| snippet | VARCHAR | Preview text | +| sent_at | TIMESTAMP | | +| size_estimate | BIGINT | Bytes | +| has_attachments | BOOLEAN | | +| deleted_from_source_at | TIMESTAMP | NULL if not deleted | +| month | INTEGER | 1-12 | +| year | BIGINT | Hive partition key | + +### message_recipients +| Column | Type | Notes | +|--------|------|-------| +| message_id | BIGINT | FK → messages | +| participant_id | BIGINT | FK → participants | +| recipient_type | VARCHAR | `from`, `to`, `cc`, `bcc` | +| display_name | VARCHAR | As shown in email | + 
+### participants +| Column | Type | Notes | +|--------|------|-------| +| id | BIGINT | Primary key | +| email_address | VARCHAR | Full address | +| domain | VARCHAR | Extracted domain | +| display_name | VARCHAR | | + +### message_labels +| Column | Type | Notes | +|--------|------|-------| +| message_id | BIGINT | FK → messages | +| label_id | BIGINT | FK → labels | + +### labels +| Column | Type | Notes | +|--------|------|-------| +| id | BIGINT | Primary key | +| name | VARCHAR | Gmail label name | + +### attachments +| Column | Type | Notes | +|--------|------|-------| +| message_id | BIGINT | FK → messages | +| size | BIGINT | Bytes | +| filename | VARCHAR | | + +## Common Joins + +### Message with sender +```sql +SELECT m.id, m.subject, m.sent_at, p.email_address, p.domain +FROM read_parquet('~/.msgvault/analytics/messages/*/data_0.parquet', hive_partitioning=true) m +JOIN read_parquet('~/.msgvault/analytics/message_recipients/data.parquet') r + ON r.message_id = m.id AND r.recipient_type = 'from' +JOIN read_parquet('~/.msgvault/analytics/participants/participants.parquet') p + ON p.id = r.participant_id +``` + +### Message with labels +```sql +SELECT m.id, m.subject, l.name as label +FROM read_parquet('~/.msgvault/analytics/messages/*/data_0.parquet', hive_partitioning=true) m +JOIN read_parquet('~/.msgvault/analytics/message_labels/data.parquet') ml + ON ml.message_id = m.id +JOIN read_parquet('~/.msgvault/analytics/labels/labels.parquet') l + ON l.id = ml.label_id +``` + +## Sender Analysis Queries + +### Full sender graph (top N by volume) +```sql +SELECT p.email_address, p.domain, p.display_name, + COUNT(*) as emails, + MIN(m.sent_at) as first_seen, + MAX(m.sent_at) as last_seen +FROM read_parquet('~/.msgvault/analytics/messages/*/data_0.parquet', hive_partitioning=true) m +JOIN read_parquet('~/.msgvault/analytics/message_recipients/data.parquet') r + ON r.message_id = m.id AND r.recipient_type = 'from' +JOIN 
read_parquet('~/.msgvault/analytics/participants/participants.parquet') p + ON p.id = r.participant_id +GROUP BY p.email_address, p.domain, p.display_name +ORDER BY emails DESC +LIMIT 500; +``` + +### Multi-domain search (impossible via CLI) +```sql +SELECT p.domain, COUNT(*) as emails, COUNT(DISTINCT p.email_address) as unique_senders +FROM read_parquet('~/.msgvault/analytics/messages/*/data_0.parquet', hive_partitioning=true) m +JOIN read_parquet('~/.msgvault/analytics/message_recipients/data.parquet') r + ON r.message_id = m.id AND r.recipient_type = 'from' +JOIN read_parquet('~/.msgvault/analytics/participants/participants.parquet') p + ON p.id = r.participant_id +WHERE p.domain IN ('example.com', 'supplier.co', 'partner.org', 'ledger.com') +GROUP BY p.domain +ORDER BY emails DESC; +``` + +### Emails to/from known personal contacts +```sql +SELECT p.email_address, r.recipient_type, COUNT(*) as emails +FROM read_parquet('~/.msgvault/analytics/messages/*/data_0.parquet', hive_partitioning=true) m +JOIN read_parquet('~/.msgvault/analytics/message_recipients/data.parquet') r + ON r.message_id = m.id +JOIN read_parquet('~/.msgvault/analytics/participants/participants.parquet') p + ON p.id = r.participant_id +WHERE p.email_address IN ('alice@gmail.com', 'bob@example.com', 'carol@example.org') +GROUP BY p.email_address, r.recipient_type +ORDER BY emails DESC; +``` + +### All gmail.com senders (excluding known work contacts) +```sql +SELECT p.email_address, p.display_name, COUNT(*) as emails, + MIN(m.sent_at) as first_seen, MAX(m.sent_at) as last_seen +FROM read_parquet('~/.msgvault/analytics/messages/*/data_0.parquet', hive_partitioning=true) m +JOIN read_parquet('~/.msgvault/analytics/message_recipients/data.parquet') r + ON r.message_id = m.id AND r.recipient_type = 'from' +JOIN read_parquet('~/.msgvault/analytics/participants/participants.parquet') p + ON p.id = r.participant_id +WHERE p.domain = 'gmail.com' + AND p.email_address NOT IN 
('adrian.halliday@gmail.com') -- known work +GROUP BY p.email_address, p.display_name +ORDER BY emails DESC; +``` + +### Senders in a time period +```sql +SELECT p.email_address, p.domain, COUNT(*) as emails +FROM read_parquet('~/.msgvault/analytics/messages/*/data_0.parquet', hive_partitioning=true) m +JOIN read_parquet('~/.msgvault/analytics/message_recipients/data.parquet') r + ON r.message_id = m.id AND r.recipient_type = 'from' +JOIN read_parquet('~/.msgvault/analytics/participants/participants.parquet') p + ON p.id = r.participant_id +WHERE m.year BETWEEN 2017 AND 2019 +GROUP BY p.email_address, p.domain +ORDER BY emails DESC +LIMIT 100; +``` + +## Classification Queries + +### Classify all messages by domain list +```sql +WITH sensitive_domains AS ( + SELECT unnest(['example.com','supplier.co','partner.org','anz.com.au','medibank.com.au']) as domain +), +sender_info AS ( + SELECT m.id, p.email_address, p.domain + FROM read_parquet('~/.msgvault/analytics/messages/*/data_0.parquet', hive_partitioning=true) m + JOIN read_parquet('~/.msgvault/analytics/message_recipients/data.parquet') r + ON r.message_id = m.id AND r.recipient_type = 'from' + JOIN read_parquet('~/.msgvault/analytics/participants/participants.parquet') p + ON p.id = r.participant_id +) +SELECT s.domain, COUNT(*) as emails +FROM sender_info s +JOIN sensitive_domains sd ON s.domain = sd.domain +GROUP BY s.domain +ORDER BY emails DESC; +``` + +### Emails with specific labels +```sql +SELECT l.name as label, COUNT(*) as emails +FROM read_parquet('~/.msgvault/analytics/message_labels/data.parquet') ml +JOIN read_parquet('~/.msgvault/analytics/labels/labels.parquet') l + ON l.id = ml.label_id +WHERE l.name IN ('Personal', '00_Private', 'Travel', 'Fusioneer') +GROUP BY l.name +ORDER BY emails DESC; +``` + +### Messages with label AND from domain +```sql +SELECT m.id, m.subject, m.sent_at, p.email_address, l.name as label +FROM read_parquet('~/.msgvault/analytics/messages/*/data_0.parquet', 
hive_partitioning=true) m +JOIN read_parquet('~/.msgvault/analytics/message_recipients/data.parquet') r + ON r.message_id = m.id AND r.recipient_type = 'from' +JOIN read_parquet('~/.msgvault/analytics/participants/participants.parquet') p + ON p.id = r.participant_id +JOIN read_parquet('~/.msgvault/analytics/message_labels/data.parquet') ml + ON ml.message_id = m.id +JOIN read_parquet('~/.msgvault/analytics/labels/labels.parquet') l + ON l.id = ml.label_id +WHERE l.name = 'Personal' AND p.domain = 'gmail.com' +LIMIT 50; +``` + +### Unclassified domains (not in any known list) +```sql +WITH known_domains AS ( + SELECT unnest([ + -- work + 'mycompany.com','mycompany.io','asana.com','slack.com','github.com', + -- sensitive + 'example.com','supplier.co','anz.com.au','medibank.com.au', + -- personal + 'gmail.com','hotmail.com','yahoo.com' + -- add more... + ]) as domain +) +SELECT p.domain, COUNT(*) as emails, COUNT(DISTINCT p.email_address) as senders +FROM read_parquet('~/.msgvault/analytics/messages/*/data_0.parquet', hive_partitioning=true) m +JOIN read_parquet('~/.msgvault/analytics/message_recipients/data.parquet') r + ON r.message_id = m.id AND r.recipient_type = 'from' +JOIN read_parquet('~/.msgvault/analytics/participants/participants.parquet') p + ON p.id = r.participant_id +WHERE p.domain NOT IN (SELECT domain FROM known_domains) +GROUP BY p.domain +ORDER BY emails DESC +LIMIT 100; +``` + +## Thread Analysis + +### Co-participants in threads with a sender +```sql +WITH target_threads AS ( + SELECT DISTINCT m.conversation_id + FROM read_parquet('~/.msgvault/analytics/messages/*/data_0.parquet', hive_partitioning=true) m + JOIN read_parquet('~/.msgvault/analytics/message_recipients/data.parquet') r + ON r.message_id = m.id + JOIN read_parquet('~/.msgvault/analytics/participants/participants.parquet') p + ON p.id = r.participant_id + WHERE p.email_address = 'person@example.com' +) +SELECT p.email_address, p.domain, COUNT(DISTINCT m.conversation_id) as 
shared_threads
FROM target_threads tt
JOIN read_parquet('~/.msgvault/analytics/messages/*/data_0.parquet', hive_partitioning=true) m
  ON m.conversation_id = tt.conversation_id
JOIN read_parquet('~/.msgvault/analytics/message_recipients/data.parquet') r
  ON r.message_id = m.id
JOIN read_parquet('~/.msgvault/analytics/participants/participants.parquet') p
  ON p.id = r.participant_id
WHERE p.email_address != 'person@example.com'
GROUP BY p.email_address, p.domain
ORDER BY shared_threads DESC
LIMIT 20;
```

## Export Patterns

### Export query to CSV
```sql
COPY (
  SELECT p.email_address, p.domain, COUNT(*) as emails
  FROM read_parquet('~/.msgvault/analytics/messages/*/data_0.parquet', hive_partitioning=true) m
  JOIN read_parquet('~/.msgvault/analytics/message_recipients/data.parquet') r
    ON r.message_id = m.id AND r.recipient_type = 'from'
  JOIN read_parquet('~/.msgvault/analytics/participants/participants.parquet') p
    ON p.id = r.participant_id
  GROUP BY p.email_address, p.domain
  ORDER BY emails DESC
) TO 'senders.csv' (HEADER, DELIMITER ',');
```

### Export to JSON
```sql
COPY (
  SELECT ...
) TO 'output.json' (FORMAT JSON);
```

## Performance Tips

- Messages are **Hive-partitioned by year** — add `WHERE m.year = 2024` to limit scan scope
- Use `LIMIT` to preview before running full queries
- `COUNT(DISTINCT ...)` is expensive on large sets — use approximations if speed matters
- For repeated queries, consider creating persistent views in a DuckDB database file
diff --git a/skills/claude-code/references/workflows.md b/skills/claude-code/references/workflows.md
new file mode 100644
index 00000000..a191619e
--- /dev/null
+++ b/skills/claude-code/references/workflows.md
@@ -0,0 +1,178 @@
# msgvault Workflows

Complex multi-step patterns for email analysis, classification, and export.

## Sender Graph Analysis

Build a complete picture of who emails you and how often.
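The jq reshaping used throughout this section can be sanity-checked against inline sample data first (the shape matches the documented `list-senders --json` output; the addresses are placeholders):

```shell
# Rank senders by count from a hardcoded sample -- no archive needed.
echo '[{"key":"alice@example.com","count":3},{"key":"bob@example.com","count":7}]' |
  jq -r '.[] | "\(.count)\t\(.key)"' | sort -rn
# bob (7) sorts above alice (3)
```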
### Full sender graph
```bash
# All senders ranked by volume
msgvault list-senders -n 1000 --json | jq -r '.[] | "\(.count)\t\(.key)"' | sort -rn

# Domain breakdown
msgvault list-domains -n 500 --json | jq -r '.[] | "\(.count)\t\(.key)"' | sort -rn

# Senders from a specific domain (e.g. all gmail.com senders)
msgvault search "from:@gmail.com" --limit 500 --json | \
  jq -r '[.[].from_email] | group_by(.) | map({sender: .[0], count: length}) | sort_by(-.count) | .[] | "\(.count)\t\(.sender)"'
```

### Time-scoped sender analysis
```bash
# Who emailed during the crypto era (2017-2019)?
msgvault list-senders -n 200 --after 2017-01-01 --before 2019-12-31 --json

# Recent senders only
msgvault list-senders -n 100 --after 2025-01-01 --json

# Compare sender volume across periods
for year in 2020 2021 2022 2023 2024 2025; do
  echo "=== $year ==="
  msgvault list-domains -n 10 --after $year-01-01 --before $year-12-31 --json | \
    jq -r '.[] | "\(.count)\t\(.key)"'
done
```

### Unique sender extraction for classification
```bash
# Extract unique senders with counts, suitable for review spreadsheet
msgvault list-senders -n 5000 --json | \
  jq -r '.[] | [.key, .count, (.total_size / 1024 | floor | tostring) + "K"] | @csv' \
  > senders.csv

# Extract unique domains
msgvault list-domains -n 1000 --json | \
  jq -r '.[] | [.key, .count] | @csv' > domains.csv
```

## Email Classification Pipeline

### Step 1: Domain-based classification
```bash
# Check which domains from a list appear in the archive
for domain in example.com supplier.co partner.org; do
  count=$(msgvault search "from:@$domain" --limit 1 --json | jq 'length')
  printf '%s\t%s\n' "$count" "$domain"
done

# Flag sensitive domains that appear (--limit 1 makes this a presence check, not a full count)
for domain in $(cat sensitive-domains.txt); do
  count=$(msgvault search "from:@$domain" --limit 1 --json 2>/dev/null | jq 'length' 2>/dev/null)
  count=${count:-0}
  [ "$count" -gt 0 ] && printf '%s\t%s\n' "$count" "$domain"
done
```

### Step 2: Sender-based classification
```bash
# Find all emails to/from a known personal contact
msgvault search "from:person@gmail.com" --limit 500 --json
msgvault search "to:person@gmail.com" --limit 500 --json

# Batch check known personal senders
while IFS= read -r sender; do
  count=$(msgvault search "from:$sender" --limit 1 --json | jq 'length')
  [ "$count" -gt 0 ] && printf '%s\t%s\n' "$count" "$sender"
done < known-personal-senders.txt
```

### Step 3: Label-based classification
```bash
# See all labels and their counts
msgvault list-labels --json | jq -r '.[] | "\(.count)\t\(.key)"' | sort -rn

# Emails with a specific label
msgvault search "label:Personal" --limit 500 --json
msgvault search "label:Travel" --limit 500 --json
```

## Attachment Mining

### Find valuable attachments
```bash
# Count messages with attachments in the first 500 matches
# (search output has no filenames; to filter by type, drill into show-message .attachments)
msgvault search "has:attachment" --limit 500 --json | \
  jq '[.[] | select(.has_attachments)] | length'

# Large attachments (likely documents, not inline images)
msgvault search "has:attachment larger:1M" --limit 100 --json

# Attachments from a specific sender
msgvault search "has:attachment from:accountant@firm.com" --json
```

### Batch export
```bash
# Export all attachments from matching messages
mkdir -p exports
msgvault search "has:attachment label:Personal" --limit 200 --json | \
  jq -r '.[].id' | while read id; do
    msgvault export-attachments "$id" -o ./exports/ 2>/dev/null
  done

# Export a single message as .eml for forensics
msgvault export-eml 12345 -o message.eml
```

## Thread Analysis

### Find conversation threads
```bash
# All emails in a thread with a specific person
msgvault search "from:alice@example.com" --limit 100 --json | \
  jq -r '.[].subject' | sort -u

# Cross-reference: who else is on threads with a sender
# (search results omit to/cc; fetch recipients via show-message)
msgvault search "from:alice@example.com" --limit 50 --json | \
  jq -r '.[].id' | while read id; do
    msgvault show-message "$id" --json | jq -r '.to[].email, .cc[].email'
  done | sort -u
```

## Pagination for Large Queries

```bash
# Paginate through all results (50 at
a time)
offset=0
while true; do
  results=$(msgvault search "from:@gmail.com" --limit 50 --offset $offset --json)
  count=$(echo "$results" | jq 'length')
  [ "$count" -eq 0 ] && break
  echo "$results" | jq -c '.[]' >> all_gmail_results.ndjson  # one object per line keeps the file valid
  offset=$((offset + 50))
done

# Simpler: fixed page count
for offset in $(seq 0 50 500); do
  msgvault search "from:@gmail.com" --limit 50 --offset $offset --json
done
```

## Reporting

### Archive overview
```bash
# Full stats
msgvault stats

# Top 20 senders
msgvault list-senders -n 20

# Top 20 domains
msgvault list-domains -n 20

# All labels
msgvault list-labels
```

### Export to CSV for spreadsheet review
```bash
# Senders CSV
msgvault list-senders -n 5000 --json | \
  jq -r '["sender","count","size_kb","attachment_kb"], (.[] | [.key, .count, (.total_size/1024|floor), (.attachment_size/1024|floor)]) | @csv' \
  > senders-report.csv

# Domains CSV
msgvault list-domains -n 1000 --json | \
  jq -r '["domain","count","size_kb"], (.[] | [.key, .count, (.total_size/1024|floor)]) | @csv' \
  > domains-report.csv
```
diff --git a/skills/claude-code/scripts/query.sh b/skills/claude-code/scripts/query.sh
new file mode 100644
index 00000000..cdb82b2c
--- /dev/null
+++ b/skills/claude-code/scripts/query.sh
@@ -0,0 +1,197 @@
#!/usr/bin/env bash
# msgvault DuckDB query helper
# Wraps common analytical queries against the Parquet cache
# Usage: query.sh <command> [args]
#
# Requires: duckdb on PATH
# Respects: MSGVAULT_HOME env var (default: ~/.msgvault)

set -euo pipefail

DATA="${MSGVAULT_HOME:-$HOME/.msgvault}/analytics"

# Verify analytics cache exists
if [ ! -d "$DATA/messages" ]; then
  echo "Error: Analytics cache not found at $DATA" >&2
  echo "Run 'msgvault build-cache' first."
>&2
  exit 1
fi

MSG="read_parquet('$DATA/messages/*/data_0.parquet', hive_partitioning=true)"
RECIP="read_parquet('$DATA/message_recipients/data.parquet')"
PARTS="read_parquet('$DATA/participants/participants.parquet')"
LABELS="read_parquet('$DATA/labels/labels.parquet')"
MLABELS="read_parquet('$DATA/message_labels/data.parquet')"
ATTACH="read_parquet('$DATA/attachments/data.parquet')"

cmd="${1:-help}"
shift || true

case "$cmd" in
  # Full sender graph: query.sh senders [limit] [--after YYYY-MM-DD] [--before YYYY-MM-DD]
  senders)
    limit=100
    where=""
    while [[ $# -gt 0 ]]; do
      case "$1" in
        --after) where="$where AND m.sent_at >= '$2'"; shift 2 ;;
        --before) where="$where AND m.sent_at < '$2'"; shift 2 ;;
        [0-9]*) limit="$1"; shift ;;  # bare numeric argument is the limit
        *) shift ;;
      esac
    done
    duckdb -c "
      SELECT p.email_address, p.domain, p.display_name, COUNT(*) as emails,
             MIN(m.sent_at) as first_seen, MAX(m.sent_at) as last_seen
      FROM $MSG m
      JOIN $RECIP r ON r.message_id = m.id AND r.recipient_type = 'from'
      JOIN $PARTS p ON p.id = r.participant_id
      WHERE 1=1 $where
      GROUP BY p.email_address, p.domain, p.display_name
      ORDER BY emails DESC LIMIT $limit;
    "
    ;;

  # Senders from specific domains: query.sh by-domain gmail.com,hotmail.com [limit]
  by-domain)
    domains="$1"
    limit="${2:-100}"
    in_list=$(echo "$domains" | sed "s/,/','/g")
    duckdb -c "
      SELECT p.email_address, p.display_name, p.domain, COUNT(*) as emails,
             MIN(m.sent_at) as first_seen, MAX(m.sent_at) as last_seen
      FROM $MSG m
      JOIN $RECIP r ON r.message_id = m.id AND r.recipient_type = 'from'
      JOIN $PARTS p ON p.id = r.participant_id
      WHERE p.domain IN ('$in_list')
      GROUP BY p.email_address, p.display_name, p.domain
      ORDER BY emails DESC LIMIT $limit;
    "
    ;;

  # Domain breakdown: query.sh domains [limit]
  domains)
    limit="${1:-100}"
    duckdb -c "
      SELECT p.domain, COUNT(*) as emails, COUNT(DISTINCT p.email_address) as unique_senders,
             SUM(m.size_estimate) as total_bytes
      FROM $MSG m
      JOIN $RECIP r ON r.message_id = m.id AND r.recipient_type = 'from'
      JOIN $PARTS p ON p.id = r.participant_id
      GROUP BY p.domain ORDER BY emails DESC LIMIT $limit;
    "
    ;;

  # Count emails per domain list: query.sh classify domain1,domain2,...
  classify)
    domains="$1"
    in_list=$(echo "$domains" | sed "s/,/','/g")
    duckdb -c "
      SELECT p.domain, COUNT(*) as emails, COUNT(DISTINCT p.email_address) as senders
      FROM $MSG m
      JOIN $RECIP r ON r.message_id = m.id AND r.recipient_type = 'from'
      JOIN $PARTS p ON p.id = r.participant_id
      WHERE p.domain IN ('$in_list')
      GROUP BY p.domain ORDER BY emails DESC;
    "
    ;;

  # Thread co-participants: query.sh threads <email>
  threads)
    email="$1"
    duckdb -c "
      WITH target_threads AS (
        SELECT DISTINCT m.conversation_id
        FROM $MSG m
        JOIN $RECIP r ON r.message_id = m.id
        JOIN $PARTS p ON p.id = r.participant_id
        WHERE p.email_address = '$email'
      )
      SELECT p.email_address, p.domain, COUNT(DISTINCT m.conversation_id) as shared_threads
      FROM target_threads tt
      JOIN $MSG m ON m.conversation_id = tt.conversation_id
      JOIN $RECIP r ON r.message_id = m.id
      JOIN $PARTS p ON p.id = r.participant_id
      WHERE p.email_address != '$email'
      GROUP BY p.email_address, p.domain
      ORDER BY shared_threads DESC LIMIT 20;
    "
    ;;

  # Label counts: query.sh labels
  labels)
    duckdb -c "
      SELECT l.name, COUNT(*) as emails
      FROM $MLABELS ml
      JOIN $LABELS l ON l.id = ml.label_id
      GROUP BY l.name ORDER BY emails DESC;
    "
    ;;

  # Messages with a specific label: query.sh label-messages <label> [limit]
  label-messages)
    label="$1"
    limit="${2:-50}"
    duckdb -c "
      SELECT m.id, m.subject, m.sent_at, p.email_address as sender
      FROM $MSG m
      JOIN $RECIP r ON r.message_id = m.id AND r.recipient_type = 'from'
      JOIN $PARTS p ON p.id = r.participant_id
      JOIN $MLABELS ml ON ml.message_id = m.id
      JOIN $LABELS l ON l.id = ml.label_id
      WHERE l.name = '$label'
      ORDER BY m.sent_at DESC LIMIT $limit;
    "
    ;;

  # Unclassified domains: 
query.sh unclassified domain1,domain2,...
  unclassified)
    domains="$1"
    in_list=$(echo "$domains" | sed "s/,/','/g")
    duckdb -c "
      SELECT p.domain, COUNT(*) as emails, COUNT(DISTINCT p.email_address) as senders
      FROM $MSG m
      JOIN $RECIP r ON r.message_id = m.id AND r.recipient_type = 'from'
      JOIN $PARTS p ON p.id = r.participant_id
      WHERE p.domain NOT IN ('$in_list')
      GROUP BY p.domain ORDER BY emails DESC LIMIT 50;
    "
    ;;

  # Raw SQL: query.sh sql "SELECT ..."
  sql)
    duckdb -c "$1"
    ;;

  help|*)
    cat <<'EOF'
msgvault DuckDB query helper

Queries the Parquet analytics cache directly for operations the CLI
search can't handle (boolean logic, multi-domain, aggregations, JOINs).

Requires: duckdb on PATH, analytics cache built (msgvault build-cache)
Respects: MSGVAULT_HOME env var (default: ~/.msgvault)

Commands:
  senders [limit] [--after DATE] [--before DATE]   Full sender graph
  by-domain <domains> [limit]                      Senders from comma-separated domains
  domains [limit]                                  Domain breakdown with sender counts
  classify <domains>                               Count emails per domain (classification)
  threads <email>                                  Co-participants in threads with sender
  labels                                           All labels with counts
  label-messages