diff --git a/skills/claude-code/SKILL.md b/skills/claude-code/SKILL.md new file mode 100644 index 00000000..c5a39d0d --- /dev/null +++ b/skills/claude-code/SKILL.md @@ -0,0 +1,285 @@ +--- +name: msgvault-ops +description: "Local email archive operations with msgvault — search, analyze, export, and manage Gmail archives stored in SQLite/Parquet. Use when: querying email history, analyzing senders/domains, exporting messages or attachments, managing Gmail deletions, building sender graphs, running email analytics, importing mbox/emlx, or any task involving msgvault CLI. Triggers on: msgvault, email archive, email search, gmail archive, email export, sender analysis, sender graph, email classification, attachment export, email deletion, list senders, list domains, email analytics, mbox import." +--- + +# msgvault-ops + +Operate the msgvault email archive CLI. All data is local (SQLite + Parquet). Queries run in milliseconds against DuckDB-powered indexes. Gmail API is only used for sync and deletion. + +## Environment + +``` +Binary: msgvault (or full path if not on PATH) +Data: ~/.msgvault/ (override with MSGVAULT_HOME) +Config: ~/.msgvault/config.toml +``` + +Ensure `msgvault` is on PATH or use the full binary path. 
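As a sanity check before issuing any commands, a minimal preflight sketch (the fallback binary path here is illustrative, not a known install location):

```shell
# Resolve binary and data directory; warn rather than fail so the caller can decide.
MSGVAULT_BIN="$(command -v msgvault || echo "$HOME/.local/bin/msgvault")"  # fallback path is hypothetical
MSGVAULT_DATA="${MSGVAULT_HOME:-$HOME/.msgvault}"
[ -x "$MSGVAULT_BIN" ] || echo "warning: msgvault not executable at $MSGVAULT_BIN" >&2
[ -f "$MSGVAULT_DATA/config.toml" ] || echo "warning: no config at $MSGVAULT_DATA/config.toml" >&2
```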
+
+## Quick Reference
+
+| Task | Command |
+|------|---------|
+| Archive status | `msgvault stats` |
+| Search | `msgvault search "<query>" --json` |
+| Top senders | `msgvault list-senders -n 100 --json` |
+| Top domains | `msgvault list-domains -n 100 --json` |
+| All labels | `msgvault list-labels --json` |
+| Read message | `msgvault show-message <id> --json` |
+| Export .eml | `msgvault export-eml <id> -o file.eml` |
+| Export attachments | `msgvault export-attachments <id> -o ./dir/` |
+| Incremental sync | `msgvault sync` |
+| Full sync | `msgvault sync-full <account>` (resumable) |
+| Build analytics cache | `msgvault build-cache` (required for DuckDB) |
+| TUI | `msgvault tui` (interactive, not for agents) |
+
+**Always use `--json` for programmatic access.** Parse with `jq`.
+
+## Search
+
+### Operators
+
+Operators combine with implicit AND; there is no boolean syntax. `from:` requires an **exact** email address — no fuzzy matching.
+
+| Operator | Example | Notes |
+|----------|---------|-------|
+| `from:` | `from:alice@example.com` | Exact sender address |
+| `from:@domain` | `from:@gmail.com` | All senders from domain |
+| `to:` | `to:team@company.com` | Recipient |
+| `cc:` / `bcc:` | `cc:manager@co.com` | CC/BCC fields |
+| `subject:` | `subject:meeting` | Subject text |
+| `label:` / `l:` | `label:INBOX` | Gmail label |
+| `has:attachment` | `has:attachment` | Has attachments |
+| `before:` / `after:` | `after:2024-01-01` | Date (YYYY-MM-DD) |
+| `older_than:` / `newer_than:` | `newer_than:7d` | Relative (d/w/m/y) |
+| `larger:` / `smaller:` | `larger:5M` | Size filter (K/M) |
+| bare words | `project update` | Full-text search |
+| `"quoted"` | `"exact phrase"` | Exact phrase match |
+
+**Known limitations:** OR, NOT (-), wildcards (*), and parentheses do NOT work. For boolean/multi-domain queries, use DuckDB (see below).
+
+### Search Strategy
+
+CLI search supports no boolean operators and requires exact email addresses for `from:`. Work around this by layering tools.
+ +**Resolve sender first, then search:** +```bash +# Don't know the email? Find it via the sender index +msgvault list-senders -n 200 --json | jq -r '.[] | .key' | grep -i lastname +# Or use the query helper for domain-scoped lookup +bash scripts/query.sh by-domain gmail.com 20 +# Then search with the resolved address +msgvault search 'from:jdoe@example.com subject:proposal' -n 10 --json +``` + +**Narrow progressively:** Start broad (full-text), add operators (from:, subject:, date range) to filter down. Use `--json | jq` to post-filter results the CLI can't handle. + +**Escape to DuckDB when CLI can't do it:** Multi-domain, boolean logic, aggregations, thread analysis — drop to `query.sh` or raw DuckDB. Don't fight the CLI's limitations. + +**Stop after 5 attempts.** If targeted queries with plausible sender + keywords don't find it, more searching rarely helps. Check `msgvault list-accounts` (right account?), `msgvault stats` (sync fresh?), or suggest the user check a different account. + +### Pagination + +Default limit is 50. Use `--limit` and `--offset`: + +```bash +msgvault search "from:@gmail.com" --limit 100 --offset 0 --json +msgvault search "from:@gmail.com" --limit 100 --offset 100 --json +``` + +## Common Workflows + +For complete command reference with all flags, see [references/cli-reference.md](references/cli-reference.md). + +For complex multi-step workflows, see [references/workflows.md](references/workflows.md). 
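Putting the quick-reference commands together, a sketch of a typical narrow-then-read session (the query string and address are illustrative; the guard against non-JSON output on empty results follows the note under JSON Output Shapes):

```shell
# Illustrative query; fall back to an empty array if the search errors or finds nothing.
results=$(msgvault search "from:@example.com invoice" --json 2>/dev/null || echo '[]')
id=$(echo "$results" | jq -r '.[0].id // empty')
if [ -n "$id" ]; then
  # Read only the body to keep context small
  msgvault show-message "$id" --json | jq -r '.body_text'
fi
```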
+
+### Sender Graph Analysis
+
+```bash
+# Top 500 senders with counts
+msgvault list-senders -n 500 --json | jq -r '.[] | "\(.count)\t\(.key)"' | sort -rn
+
+# Senders in a date range
+msgvault list-senders -n 500 --after 2017-01-01 --before 2020-01-01 --json
+
+# Domain breakdown
+msgvault list-domains -n 200 --json | jq -r '.[] | "\(.count)\t\(.key)"'
+```
+
+### Message Drill-Down
+
+```bash
+# Search → get ID → read full message
+msgvault search "from:alice@example.com subject:contract" --json | jq '.[0].id'
+msgvault show-message <id> --json
+
+# Extract just the body (avoids context bloat)
+msgvault show-message <id> --json | jq '.body_text'
+
+# Extract just attachments list
+msgvault show-message <id> --json | jq '.attachments'
+```
+
+### Attachment Operations
+
+```bash
+# Find messages with large attachments
+msgvault search "has:attachment larger:5M" --limit 100 --json
+
+# Export all attachments from a message
+msgvault export-attachments <id> -o ./exports/
+
+# Export single attachment by SHA-256 hash (from show-message .attachments[].content_hash)
+msgvault export-attachment <content_hash> -o file.pdf
+
+# Batch export
+msgvault search "has:attachment label:Personal" --limit 100 --json | \
+  jq -r '.[].id' | while read id; do msgvault export-attachments "$id" -o ./exports/; done
+```
+
+### Deletion (Staged, Safe)
+
+**WARNING:** `delete-staged` without `--trash` is PERMANENT and IRREVERSIBLE. Always `--dry-run` first.
+
+Two-step process — stage in TUI, execute via CLI:
+
+1. `msgvault tui` → navigate → select with `Space` → press `d` to stage
+2. Review and execute:
+
+```bash
+msgvault list-deletions              # review pending batches
+msgvault delete-staged --dry-run     # preview what would be deleted
+msgvault delete-staged --trash       # move to Gmail trash (recoverable 30 days)
+msgvault delete-staged --yes         # permanent delete (IRREVERSIBLE)
+msgvault cancel-deletion <batch-id>  # cancel a batch
+msgvault cancel-deletion --all       # cancel all
+```
+
+Always confirm with the user before executing.
Suggest `--dry-run` first.
+
+## JSON Output Shapes (verified)
+
+### search --json
+
+```json
+[{
+  "id": 12345,
+  "source_message_id": "18f0abc123",
+  "conversation_id": 67890,
+  "source_conversation_id": "thread-abc",
+  "subject": "...",
+  "from_email": "alice@example.com",
+  "from_name": "Alice Smith",
+  "sent_at": "2024-01-15T10:30:00Z",
+  "snippet": "...",
+  "labels": ["INBOX", "IMPORTANT"],
+  "has_attachments": true,
+  "attachment_count": 2,
+  "size_estimate": 45678
+}]
+```
+
+Notes:
+- search returns `from_email` and `from_name` (not `from`). No `to`/`cc`/`bcc` — use `show-message` for recipients.
+- **Empty results return non-JSON error text.** Always check exit code or wrap: `msgvault search "..." --json 2>/dev/null || echo '[]'`
+
+### list-senders / list-domains / list-labels --json
+
+```json
+[{"key": "alice@example.com", "count": 142, "total_size": 5678900, "attachment_size": 1234567}]
+```
+
+### show-message --json
+
+```json
+{
+  "id": 12345,
+  "source_message_id": "18f0abc",
+  "conversation_id": 67890,
+  "source_conversation_id": "thread-abc",
+  "subject": "...",
+  "from": "Alice Smith <alice@example.com>",
+  "to": [{"email": "bob@example.com", "name": "Bob Jones"}],
+  "cc": [],
+  "bcc": [],
+  "sent_at": "2024-01-15T10:30:00Z",
+  "labels": ["INBOX"],
+  "snippet": "...",
+  "has_attachments": true,
+  "size_estimate": 45678,
+  "body_text": "...",
+  "body_html": "...",
+  "attachments": [{"id": 123, "filename": "doc.pdf", "mime_type": "application/pdf", "size": 12345, "content_hash": "abc123..."}]
+}
+```
+
+Notes:
+- `to`/`cc`/`bcc` are **arrays of objects**: `[{"email": "...", "name": "..."}]` — extract emails with `.to[].email`
+- `attachments[].content_hash` is the SHA-256 hash used by `export-attachment`
+- `show-message` can return ~11k tokens for long threads. Always pipe through `jq` to extract only what you need: `.body_text`, `.attachments`, `.to[].email`, etc.
+
+## DuckDB Queries (Advanced)
+
+The CLI `search` covers only simple conjunctive filters.
For boolean logic, multi-domain queries, aggregations, or cross-table joins, use DuckDB against the Parquet cache. + +### Query Helper Script + +`scripts/query.sh` wraps common DuckDB patterns — no raw SQL needed: + +```bash +bash scripts/query.sh senders 50 # Top 50 senders +bash scripts/query.sh senders 50 --after 2020-01-01 # Time-scoped +bash scripts/query.sh by-domain gmail.com,hotmail.com,yahoo.com # Senders from specific domains +bash scripts/query.sh classify example.com,supplier.co,partner.org # Count by domain list +bash scripts/query.sh threads alice@example.com # Thread co-participants +bash scripts/query.sh labels # All labels with counts +bash scripts/query.sh label-messages Personal 20 # Messages with label +bash scripts/query.sh unclassified mycompany.com,asana.com # Domains NOT in list +bash scripts/query.sh sql "SELECT ..." # Raw SQL escape hatch +``` + +### Raw DuckDB (when the script doesn't cover it) + +See [references/duckdb-queries.md](references/duckdb-queries.md) for full schema and query patterns. 
+ +```bash +duckdb -c " +SELECT p.domain, COUNT(*) as emails, COUNT(DISTINCT p.email_address) as senders +FROM read_parquet('~/.msgvault/analytics/messages/*/data_0.parquet', hive_partitioning=true) m +JOIN read_parquet('~/.msgvault/analytics/message_recipients/data.parquet') r + ON r.message_id = m.id AND r.recipient_type = 'from' +JOIN read_parquet('~/.msgvault/analytics/participants/participants.parquet') p + ON p.id = r.participant_id +WHERE p.domain IN ('example.com', 'supplier.co', 'partner.org') +GROUP BY p.domain ORDER BY emails DESC; +" +``` + +### Key tables (Parquet in `~/.msgvault/analytics/`) + +| Table | Path | Key Columns | +|-------|------|-------------| +| messages | `messages/*/data_0.parquet` (hive by year) | id, subject, snippet, sent_at, has_attachments, year, month | +| message_recipients | `message_recipients/data.parquet` | message_id, participant_id, recipient_type (from/to/cc/bcc) | +| participants | `participants/participants.parquet` | id, email_address, domain, display_name | +| message_labels | `message_labels/data.parquet` | message_id, label_id | +| labels | `labels/labels.parquet` | id, name | +| attachments | `attachments/data.parquet` | message_id, size, filename | + +**Use DuckDB when:** multi-domain IN(), boolean AND/OR/NOT, GROUP BY, JOINs, regex, window functions, CSV/JSON export, thread co-participant analysis. + +**Use CLI `search` when:** simple single-field lookup, quick message retrieval by ID, full-text search on body content. + +**Prerequisite:** DuckDB queries require the analytics cache. Run `msgvault build-cache` if the `analytics/` directory is missing or stale. + +**Security:** The `sql` subcommand blocks write operations but can still read local files. Never pass unsanitised user input to any subcommand. Prefer validated subcommands (senders, by-domain, etc.) over raw SQL. + +## Safety Rules + +1. **Never delete without dry-run first** — `delete-staged --dry-run` before `--yes` +2. 
**Sync is read-only** — sync/sync-full never modify Gmail
+3. **Deletion is two-step** — must stage in TUI first, then execute via CLI
+4. **Cancel before execute** — use `cancel-deletion` if unsure
+5. **Verify after sync** — `msgvault verify <account>` checks integrity
+6. **Control output size** — always use `jq` with `show-message` to avoid context bloat
diff --git a/skills/claude-code/references/duckdb-queries.md b/skills/claude-code/references/duckdb-queries.md
new file mode 100644
index 00000000..a4a66eca
--- /dev/null
+++ b/skills/claude-code/references/duckdb-queries.md
@@ -0,0 +1,316 @@
+# msgvault DuckDB Query Reference
+
+The CLI `search` command is limited to simple conjunctive filters. For anything complex, query the Parquet analytics cache directly with DuckDB.
+
+**DuckDB CLI must be installed** (`which duckdb` to verify).
+
+**Path note:** All examples below use `~/.msgvault/analytics/`. If `MSGVAULT_HOME` is set, substitute that path (e.g. `$MSGVAULT_HOME/analytics/`). The `query.sh` helper script handles this automatically.
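For ad-hoc shells, the path can be resolved the same way the helper script does. A sketch, guarded so it degrades quietly when duckdb or the cache is absent:

```shell
# Resolve the analytics directory, honouring MSGVAULT_HOME
DATA="${MSGVAULT_HOME:-$HOME/.msgvault}/analytics"
# Only query when both duckdb and the cache are actually present
if command -v duckdb >/dev/null 2>&1 && [ -d "$DATA/messages" ]; then
  duckdb -c "SELECT COUNT(*) AS messages FROM read_parquet('$DATA/messages/*/data_0.parquet', hive_partitioning=true);"
fi
```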
+ +## Data Layout + +``` +~/.msgvault/analytics/ +├── messages/year=YYYY/data_0.parquet # Hive-partitioned by year +├── message_recipients/data.parquet # from/to/cc/bcc links +├── participants/participants.parquet # email addresses + domains +├── message_labels/data.parquet # message ↔ label links +├── labels/labels.parquet # label names +├── attachments/data.parquet # attachment metadata +├── conversations/conversations.parquet # thread grouping +└── sources/sources.parquet # account info +``` + +## Table Aliases + +Use these in all queries for readability: + +```sql +-- Standard table references +read_parquet('~/.msgvault/analytics/messages/*/data_0.parquet', hive_partitioning=true) AS m +read_parquet('~/.msgvault/analytics/message_recipients/data.parquet') AS r +read_parquet('~/.msgvault/analytics/participants/participants.parquet') AS p +read_parquet('~/.msgvault/analytics/message_labels/data.parquet') AS ml +read_parquet('~/.msgvault/analytics/labels/labels.parquet') AS l +read_parquet('~/.msgvault/analytics/attachments/data.parquet') AS a +read_parquet('~/.msgvault/analytics/conversations/conversations.parquet') AS c +``` + +## Schema + +### messages (partitioned by year) +| Column | Type | Notes | +|--------|------|-------| +| id | BIGINT | Primary key | +| source_id | BIGINT | FK → sources | +| source_message_id | VARCHAR | Gmail message ID | +| conversation_id | BIGINT | FK → conversations (thread) | +| subject | VARCHAR | | +| snippet | VARCHAR | Preview text | +| sent_at | TIMESTAMP | | +| size_estimate | BIGINT | Bytes | +| has_attachments | BOOLEAN | | +| deleted_from_source_at | TIMESTAMP | NULL if not deleted | +| month | INTEGER | 1-12 | +| year | BIGINT | Hive partition key | + +### message_recipients +| Column | Type | Notes | +|--------|------|-------| +| message_id | BIGINT | FK → messages | +| participant_id | BIGINT | FK → participants | +| recipient_type | VARCHAR | `from`, `to`, `cc`, `bcc` | +| display_name | VARCHAR | As shown in email | + 
+### participants +| Column | Type | Notes | +|--------|------|-------| +| id | BIGINT | Primary key | +| email_address | VARCHAR | Full address | +| domain | VARCHAR | Extracted domain | +| display_name | VARCHAR | | + +### message_labels +| Column | Type | Notes | +|--------|------|-------| +| message_id | BIGINT | FK → messages | +| label_id | BIGINT | FK → labels | + +### labels +| Column | Type | Notes | +|--------|------|-------| +| id | BIGINT | Primary key | +| name | VARCHAR | Gmail label name | + +### attachments +| Column | Type | Notes | +|--------|------|-------| +| message_id | BIGINT | FK → messages | +| size | BIGINT | Bytes | +| filename | VARCHAR | | + +## Common Joins + +### Message with sender +```sql +SELECT m.id, m.subject, m.sent_at, p.email_address, p.domain +FROM read_parquet('~/.msgvault/analytics/messages/*/data_0.parquet', hive_partitioning=true) m +JOIN read_parquet('~/.msgvault/analytics/message_recipients/data.parquet') r + ON r.message_id = m.id AND r.recipient_type = 'from' +JOIN read_parquet('~/.msgvault/analytics/participants/participants.parquet') p + ON p.id = r.participant_id +``` + +### Message with labels +```sql +SELECT m.id, m.subject, l.name as label +FROM read_parquet('~/.msgvault/analytics/messages/*/data_0.parquet', hive_partitioning=true) m +JOIN read_parquet('~/.msgvault/analytics/message_labels/data.parquet') ml + ON ml.message_id = m.id +JOIN read_parquet('~/.msgvault/analytics/labels/labels.parquet') l + ON l.id = ml.label_id +``` + +## Sender Analysis Queries + +### Full sender graph (top N by volume) +```sql +SELECT p.email_address, p.domain, p.display_name, + COUNT(*) as emails, + MIN(m.sent_at) as first_seen, + MAX(m.sent_at) as last_seen +FROM read_parquet('~/.msgvault/analytics/messages/*/data_0.parquet', hive_partitioning=true) m +JOIN read_parquet('~/.msgvault/analytics/message_recipients/data.parquet') r + ON r.message_id = m.id AND r.recipient_type = 'from' +JOIN 
read_parquet('~/.msgvault/analytics/participants/participants.parquet') p + ON p.id = r.participant_id +GROUP BY p.email_address, p.domain, p.display_name +ORDER BY emails DESC +LIMIT 500; +``` + +### Multi-domain search (impossible via CLI) +```sql +SELECT p.domain, COUNT(*) as emails, COUNT(DISTINCT p.email_address) as unique_senders +FROM read_parquet('~/.msgvault/analytics/messages/*/data_0.parquet', hive_partitioning=true) m +JOIN read_parquet('~/.msgvault/analytics/message_recipients/data.parquet') r + ON r.message_id = m.id AND r.recipient_type = 'from' +JOIN read_parquet('~/.msgvault/analytics/participants/participants.parquet') p + ON p.id = r.participant_id +WHERE p.domain IN ('example.com', 'supplier.co', 'partner.org', 'ledger.com') +GROUP BY p.domain +ORDER BY emails DESC; +``` + +### Emails to/from known personal contacts +```sql +SELECT p.email_address, r.recipient_type, COUNT(*) as emails +FROM read_parquet('~/.msgvault/analytics/messages/*/data_0.parquet', hive_partitioning=true) m +JOIN read_parquet('~/.msgvault/analytics/message_recipients/data.parquet') r + ON r.message_id = m.id +JOIN read_parquet('~/.msgvault/analytics/participants/participants.parquet') p + ON p.id = r.participant_id +WHERE p.email_address IN ('alice@gmail.com', 'bob@example.com', 'carol@example.org') +GROUP BY p.email_address, r.recipient_type +ORDER BY emails DESC; +``` + +### All gmail.com senders (excluding known work contacts) +```sql +SELECT p.email_address, p.display_name, COUNT(*) as emails, + MIN(m.sent_at) as first_seen, MAX(m.sent_at) as last_seen +FROM read_parquet('~/.msgvault/analytics/messages/*/data_0.parquet', hive_partitioning=true) m +JOIN read_parquet('~/.msgvault/analytics/message_recipients/data.parquet') r + ON r.message_id = m.id AND r.recipient_type = 'from' +JOIN read_parquet('~/.msgvault/analytics/participants/participants.parquet') p + ON p.id = r.participant_id +WHERE p.domain = 'gmail.com' + AND p.email_address NOT IN 
('adrian.halliday@gmail.com') -- known work +GROUP BY p.email_address, p.display_name +ORDER BY emails DESC; +``` + +### Senders in a time period +```sql +SELECT p.email_address, p.domain, COUNT(*) as emails +FROM read_parquet('~/.msgvault/analytics/messages/*/data_0.parquet', hive_partitioning=true) m +JOIN read_parquet('~/.msgvault/analytics/message_recipients/data.parquet') r + ON r.message_id = m.id AND r.recipient_type = 'from' +JOIN read_parquet('~/.msgvault/analytics/participants/participants.parquet') p + ON p.id = r.participant_id +WHERE m.year BETWEEN 2017 AND 2019 +GROUP BY p.email_address, p.domain +ORDER BY emails DESC +LIMIT 100; +``` + +## Classification Queries + +### Classify all messages by domain list +```sql +WITH sensitive_domains AS ( + SELECT unnest(['example.com','supplier.co','partner.org','anz.com.au','medibank.com.au']) as domain +), +sender_info AS ( + SELECT m.id, p.email_address, p.domain + FROM read_parquet('~/.msgvault/analytics/messages/*/data_0.parquet', hive_partitioning=true) m + JOIN read_parquet('~/.msgvault/analytics/message_recipients/data.parquet') r + ON r.message_id = m.id AND r.recipient_type = 'from' + JOIN read_parquet('~/.msgvault/analytics/participants/participants.parquet') p + ON p.id = r.participant_id +) +SELECT s.domain, COUNT(*) as emails +FROM sender_info s +JOIN sensitive_domains sd ON s.domain = sd.domain +GROUP BY s.domain +ORDER BY emails DESC; +``` + +### Emails with specific labels +```sql +SELECT l.name as label, COUNT(*) as emails +FROM read_parquet('~/.msgvault/analytics/message_labels/data.parquet') ml +JOIN read_parquet('~/.msgvault/analytics/labels/labels.parquet') l + ON l.id = ml.label_id +WHERE l.name IN ('Personal', '00_Private', 'Travel', 'Fusioneer') +GROUP BY l.name +ORDER BY emails DESC; +``` + +### Messages with label AND from domain +```sql +SELECT m.id, m.subject, m.sent_at, p.email_address, l.name as label +FROM read_parquet('~/.msgvault/analytics/messages/*/data_0.parquet', 
hive_partitioning=true) m +JOIN read_parquet('~/.msgvault/analytics/message_recipients/data.parquet') r + ON r.message_id = m.id AND r.recipient_type = 'from' +JOIN read_parquet('~/.msgvault/analytics/participants/participants.parquet') p + ON p.id = r.participant_id +JOIN read_parquet('~/.msgvault/analytics/message_labels/data.parquet') ml + ON ml.message_id = m.id +JOIN read_parquet('~/.msgvault/analytics/labels/labels.parquet') l + ON l.id = ml.label_id +WHERE l.name = 'Personal' AND p.domain = 'gmail.com' +LIMIT 50; +``` + +### Unclassified domains (not in any known list) +```sql +WITH known_domains AS ( + SELECT unnest([ + -- work + 'mycompany.com','mycompany.io','asana.com','slack.com','github.com', + -- sensitive + 'example.com','supplier.co','anz.com.au','medibank.com.au', + -- personal + 'gmail.com','hotmail.com','yahoo.com' + -- add more... + ]) as domain +) +SELECT p.domain, COUNT(*) as emails, COUNT(DISTINCT p.email_address) as senders +FROM read_parquet('~/.msgvault/analytics/messages/*/data_0.parquet', hive_partitioning=true) m +JOIN read_parquet('~/.msgvault/analytics/message_recipients/data.parquet') r + ON r.message_id = m.id AND r.recipient_type = 'from' +JOIN read_parquet('~/.msgvault/analytics/participants/participants.parquet') p + ON p.id = r.participant_id +WHERE p.domain NOT IN (SELECT domain FROM known_domains) +GROUP BY p.domain +ORDER BY emails DESC +LIMIT 100; +``` + +## Thread Analysis + +### Co-participants in threads with a sender +```sql +WITH target_threads AS ( + SELECT DISTINCT m.conversation_id + FROM read_parquet('~/.msgvault/analytics/messages/*/data_0.parquet', hive_partitioning=true) m + JOIN read_parquet('~/.msgvault/analytics/message_recipients/data.parquet') r + ON r.message_id = m.id + JOIN read_parquet('~/.msgvault/analytics/participants/participants.parquet') p + ON p.id = r.participant_id + WHERE p.email_address = 'person@example.com' +) +SELECT p.email_address, p.domain, COUNT(DISTINCT m.conversation_id) as 
shared_threads
+FROM target_threads tt
+JOIN read_parquet('~/.msgvault/analytics/messages/*/data_0.parquet', hive_partitioning=true) m
+  ON m.conversation_id = tt.conversation_id
+JOIN read_parquet('~/.msgvault/analytics/message_recipients/data.parquet') r
+  ON r.message_id = m.id
+JOIN read_parquet('~/.msgvault/analytics/participants/participants.parquet') p
+  ON p.id = r.participant_id
+WHERE p.email_address != 'person@example.com'
+GROUP BY p.email_address, p.domain
+ORDER BY shared_threads DESC
+LIMIT 20;
+```
+
+## Export Patterns
+
+### Export query to CSV
+```sql
+COPY (
+  SELECT p.email_address, p.domain, COUNT(*) as emails
+  FROM read_parquet('~/.msgvault/analytics/messages/*/data_0.parquet', hive_partitioning=true) m
+  JOIN read_parquet('~/.msgvault/analytics/message_recipients/data.parquet') r
+    ON r.message_id = m.id AND r.recipient_type = 'from'
+  JOIN read_parquet('~/.msgvault/analytics/participants/participants.parquet') p
+    ON p.id = r.participant_id
+  GROUP BY p.email_address, p.domain
+  ORDER BY emails DESC
+) TO 'senders.csv' (HEADER, DELIMITER ',');
+```
+
+### Export to JSON
+```sql
+COPY (
+  SELECT ...
+) TO 'output.json' (FORMAT JSON);
+```
+
+## Performance Tips
+
+- Messages are **Hive-partitioned by year** — add `WHERE m.year = 2024` to limit scan scope
+- Use `LIMIT` to preview before running full queries
+- `COUNT(DISTINCT ...)` is expensive on large sets — use approximations if speed matters
+- For repeated queries, consider creating a DuckDB view file
diff --git a/skills/claude-code/references/workflows.md b/skills/claude-code/references/workflows.md
new file mode 100644
index 00000000..b9810e8e
--- /dev/null
+++ b/skills/claude-code/references/workflows.md
@@ -0,0 +1,182 @@
+# msgvault Workflows
+
+Complex multi-step patterns for email analysis, classification, and export.
+
+## Sender Graph Analysis
+
+Build a complete picture of who emails you and how often.
+
+### Full sender graph
+```bash
+# All senders ranked by volume
+msgvault list-senders -n 1000 --json | jq -r '.[] | "\(.count)\t\(.key)"' | sort -rn
+
+# Domain breakdown
+msgvault list-domains -n 500 --json | jq -r '.[] | "\(.count)\t\(.key)"' | sort -rn
+
+# Senders from a specific domain (e.g. all gmail.com senders)
+msgvault search "from:@gmail.com" --limit 500 --json | \
+  jq -r '[.[].from_email] | group_by(.) | map({sender: .[0], count: length}) | sort_by(-.count) | .[] | "\(.count)\t\(.sender)"'
+```
+
+### Time-scoped sender analysis
+```bash
+# Who emailed during the crypto era (2017-2019)?
+msgvault list-senders -n 200 --after 2017-01-01 --before 2019-12-31 --json
+
+# Recent senders only
+msgvault list-senders -n 100 --after 2025-01-01 --json
+
+# Compare sender volume across periods
+for year in 2020 2021 2022 2023 2024 2025; do
+  echo "=== $year ==="
+  msgvault list-domains -n 10 --after $year-01-01 --before $year-12-31 --json | \
+    jq -r '.[] | "\(.count)\t\(.key)"'
+done
+```
+
+### Unique sender extraction for classification
+```bash
+# Extract unique senders with counts, suitable for review spreadsheet
+msgvault list-senders -n 5000 --json | \
+  jq -r '.[] | [.key, .count, (.total_size / 1024 | floor | tostring) + "K"] | @csv' \
+  > senders.csv
+
+# Extract unique domains
+msgvault list-domains -n 1000 --json | \
+  jq -r '.[] | [.key, .count] | @csv' > domains.csv
+```
+
+## Email Classification Pipeline
+
+### Step 1: Domain-based classification
+```bash
+# Check which domains from a list appear in the archive
+# (--limit 1 caps the result, so $count is 0 or 1: a presence check)
+for domain in example.com supplier.co partner.org; do
+  count=$(msgvault search "from:@$domain" --limit 1 --json 2>/dev/null | jq 'length' 2>/dev/null || echo 0)
+  printf '%s\t%s\n' "$count" "$domain"
+done
+
+# List which sensitive domains are present in the archive
+for domain in $(cat sensitive-domains.txt); do
+  count=$(msgvault search "from:@$domain" --limit 1 --json 2>/dev/null | jq 'length' 2>/dev/null || echo 0)
+  [ "$count" -gt 0 ] && printf '%s\t%s\n' "$count" "$domain"
+done
+```
+
+### Step 2: Sender-based classification
+```bash
+# Find all emails to/from a known personal contact
+msgvault search "from:person@gmail.com" --limit 500 --json
+msgvault search "to:person@gmail.com" --limit 500 --json
+
+# Batch check known personal senders (presence check via --limit 1)
+while IFS= read -r sender; do
+  count=$(msgvault search "from:$sender" --limit 1 --json 2>/dev/null | jq 'length' 2>/dev/null || echo 0)
+  [ "$count" -gt 0 ] && printf '%s\t%s\n' "$count" "$sender"
+done < known-personal-senders.txt
+```
+
+### Step 3: Label-based classification
+```bash
+# See all labels and their counts
+msgvault list-labels --json | jq -r '.[] | "\(.count)\t\(.key)"' | sort -rn
+
+# Emails with a specific label
+msgvault search "label:Personal" --limit 500 --json
+msgvault search "label:Travel" --limit 500 --json
+```
+
+## Attachment Mining
+
+### Find valuable attachments
+```bash
+# Count messages with attachments (first 500 hits)
+msgvault search "has:attachment" --limit 500 --json | jq 'length'
+
+# Large attachments (likely documents, not inline images)
+msgvault search "has:attachment larger:1M" --limit 100 --json
+
+# Attachments from a specific sender
+msgvault search "has:attachment from:accountant@firm.com" --json
+```
+
+### Batch export
+```bash
+# Export all attachments from matching messages
+mkdir -p exports
+msgvault search "has:attachment label:Personal" --limit 200 --json | \
+  jq -r '.[].id' | while read id; do
+    msgvault export-attachments "$id" -o ./exports/ 2>/dev/null
+  done
+
+# Export a single message as .eml for forensics
+msgvault export-eml 12345 -o message.eml
+```
+
+## Thread Analysis
+
+### Find conversation threads
+```bash
+# Unique subject lines from a specific sender (a proxy for their threads)
+msgvault search "from:alice@example.com" --limit 100 --json 2>/dev/null \
+  | jq -r '.[].subject' | sort -u
+
+# Cross-reference: who else is on threads with a sender
+# Note: search --json does NOT include to/cc fields.
Use query.sh instead:
+bash scripts/query.sh threads alice@example.com
+
+# Or drill into a specific message for full recipients:
+msgvault show-message <id> --json | jq '.to[].email, .cc[].email'
+```
+
+## Pagination for Large Queries
+
+```bash
+# Paginate through all results (50 at a time)
+# Note: empty results return non-JSON error text, so guard with 2>/dev/null
+offset=0
+while true; do
+  results=$(msgvault search "from:@gmail.com" --limit 50 --offset $offset --json 2>/dev/null) || break
+  count=$(echo "$results" | jq 'length' 2>/dev/null) || break
+  [ "$count" -eq 0 ] && break
+  echo "$results" >> all_gmail_results.json   # appends one JSON array per page
+  offset=$((offset + 50))
+done
+
+# Simpler: fixed page count
+for offset in $(seq 0 50 500); do
+  msgvault search "from:@gmail.com" --limit 50 --offset $offset --json 2>/dev/null || true
+done
+```
+
+## Reporting
+
+### Archive overview
+```bash
+# Full stats
+msgvault stats
+
+# Top 20 senders
+msgvault list-senders -n 20
+
+# Top 20 domains
+msgvault list-domains -n 20
+
+# All labels
+msgvault list-labels
+```
+
+### Export to CSV for spreadsheet review
+```bash
+# Senders CSV
+msgvault list-senders -n 5000 --json | \
+  jq -r '["sender","count","size_kb","attachment_kb"], (.[] | [.key, .count, (.total_size/1024|floor), (.attachment_size/1024|floor)]) | @csv' \
+  > senders-report.csv
+
+# Domains CSV
+msgvault list-domains -n 1000 --json | \
+  jq -r '["domain","count","size_kb"], (.[] | [.key, .count, (.total_size/1024|floor)]) | @csv' \
+  > domains-report.csv
+```
diff --git a/skills/claude-code/scripts/query.sh b/skills/claude-code/scripts/query.sh
new file mode 100644
index 00000000..e49f741e
--- /dev/null
+++ b/skills/claude-code/scripts/query.sh
@@ -0,0 +1,295 @@
+#!/usr/bin/env bash
+# msgvault DuckDB query helper
+# Wraps common analytical queries against the Parquet cache
+# Usage: query.sh <command> [args]
+#
+# Requires: duckdb on PATH
+# Respects: MSGVAULT_HOME env var (default: ~/.msgvault)
+
+set -euo pipefail
+
+# Verify duckdb is
available +command -v duckdb >/dev/null 2>&1 || { + echo "Error: duckdb not found on PATH" >&2 + echo "Install from https://duckdb.org/docs/installation" >&2 + exit 1 +} + +DATA="${MSGVAULT_HOME:-$HOME/.msgvault}/analytics" + +# Verify analytics cache exists +if [ ! -d "$DATA/messages" ]; then + echo "Error: Analytics cache not found at $DATA" >&2 + echo "Run 'msgvault build-cache' first." >&2 + exit 1 +fi + +MSG="read_parquet('$DATA/messages/*/data_0.parquet', hive_partitioning=true)" +RECIP="read_parquet('$DATA/message_recipients/data.parquet')" +PARTS="read_parquet('$DATA/participants/participants.parquet')" +LABELS="read_parquet('$DATA/labels/labels.parquet')" +MLABELS="read_parquet('$DATA/message_labels/data.parquet')" +ATTACH="read_parquet('$DATA/attachments/data.parquet')" + +# --- Input validation helpers --- + +# Validate integer (limit, offset) +validate_int() { + local val="$1" name="$2" + if ! [[ "$val" =~ ^[0-9]+$ ]] || [ "$val" -eq 0 ] || [ "$val" -gt 100000 ]; then + echo "Error: $name must be a positive integer (1-100000), got '$val'" >&2 + exit 1 + fi +} + +# Validate date (YYYY-MM-DD) +validate_date() { + local val="$1" name="$2" + if ! [[ "$val" =~ ^[0-9]{4}-[0-9]{2}-[0-9]{2}$ ]]; then + echo "Error: $name must be YYYY-MM-DD, got '$val'" >&2 + exit 1 + fi +} + +# Validate domain (letters, digits, dots, hyphens — no underscores or specials) +validate_domain() { + local val="$1" + if ! [[ "$val" =~ ^[a-zA-Z0-9]([a-zA-Z0-9.-]*[a-zA-Z0-9])?$ ]]; then + echo "Error: invalid domain '$val'" >&2 + exit 1 + fi +} + +# Validate email address (basic check) +validate_email() { + local val="$1" + if ! 
[[ "$val" =~ ^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9._-]+$ ]]; then
+    echo "Error: invalid email '$val'" >&2
+    exit 1
+  fi
+}
+
+# Validate label name: reject quoting/SQL metacharacters
+# (single quotes, semicolons, backslashes)
+validate_label() {
+  local val="$1"
+  if [[ "$val" == *"'"* ]] || [[ "$val" == *";"* ]] || [[ "$val" == *"\\"* ]]; then
+    echo "Error: invalid label name '$val'" >&2
+    exit 1
+  fi
+}
+
+# Build a validated SQL IN list from comma-separated domains
+build_domain_in_list() {
+  local input="$1"
+  local result=""
+  IFS=',' read -ra domains <<< "$input"
+  for d in "${domains[@]}"; do
+    validate_domain "$d"
+    if [ -n "$result" ]; then
+      result="$result,'$d'"
+    else
+      result="'$d'"
+    fi
+  done
+  echo "$result"
+}
+
+# --- Command parsing ---
+
+cmd="${1:-help}"
+if [ $# -gt 0 ]; then shift; fi
+
+case "$cmd" in
+  # Full sender graph: query.sh senders [--after DATE] [--before DATE] [limit]
+  senders)
+    limit=100
+    where=""
+    while [[ $# -gt 0 ]]; do
+      case "$1" in
+        --after) validate_date "${2:?--after requires a date}" "--after"; where="$where AND m.sent_at >= '$2'"; shift 2 ;;
+        --before) validate_date "${2:?--before requires a date}" "--before"; where="$where AND m.sent_at < '$2'"; shift 2 ;;
+        *)
+          if [[ "$1" =~ ^[0-9]+$ ]]; then
+            limit="$1"
+          fi
+          shift ;;
+      esac
+    done
+    validate_int "$limit" "limit"
+    duckdb -c "
+      SELECT p.email_address, p.domain, p.display_name, COUNT(*) as emails,
+             MIN(m.sent_at) as first_seen, MAX(m.sent_at) as last_seen
+      FROM $MSG m
+      JOIN $RECIP r ON r.message_id = m.id AND r.recipient_type = 'from'
+      JOIN $PARTS p ON p.id = r.participant_id
+      WHERE 1=1 $where
+      GROUP BY p.email_address, p.domain, p.display_name
+      ORDER BY emails DESC LIMIT $limit;
+    "
+    ;;
+
+  # Senders from specific domains: query.sh by-domain gmail.com,hotmail.com [limit]
+  by-domain)
+    in_list=$(build_domain_in_list "${1:?domain list required}")
+    limit="${2:-100}"
+    validate_int "$limit" "limit"
+    duckdb -c "
+      SELECT p.email_address, p.display_name, p.domain, COUNT(*) as emails,
+             MIN(m.sent_at) as first_seen, MAX(m.sent_at) as last_seen
+      FROM $MSG m
+      JOIN $RECIP r ON r.message_id = m.id AND r.recipient_type = 'from'
+      JOIN $PARTS p ON p.id = r.participant_id
+      WHERE p.domain IN ($in_list)
+      GROUP BY p.email_address, p.display_name, p.domain
+      ORDER BY emails DESC LIMIT $limit;
+    "
+    ;;
+
+  # Domain breakdown: query.sh domains [limit]
+  domains)
+    limit="${1:-100}"
+    validate_int "$limit" "limit"
+    duckdb -c "
+      SELECT p.domain, COUNT(*) as emails, COUNT(DISTINCT p.email_address) as unique_senders,
+             SUM(m.size_estimate) as total_bytes
+      FROM $MSG m
+      JOIN $RECIP r ON r.message_id = m.id AND r.recipient_type = 'from'
+      JOIN $PARTS p ON p.id = r.participant_id
+      GROUP BY p.domain ORDER BY emails DESC LIMIT $limit;
+    "
+    ;;
+
+  # Count emails per domain list: query.sh classify domain1,domain2,...
+  classify)
+    in_list=$(build_domain_in_list "${1:?domain list required}")
+    duckdb -c "
+      SELECT p.domain, COUNT(*) as emails, COUNT(DISTINCT p.email_address) as senders
+      FROM $MSG m
+      JOIN $RECIP r ON r.message_id = m.id AND r.recipient_type = 'from'
+      JOIN $PARTS p ON p.id = r.participant_id
+      WHERE p.domain IN ($in_list)
+      GROUP BY p.domain ORDER BY emails DESC;
+    "
+    ;;
+
+  # Thread co-participants: query.sh threads <email>
+  # Finds threads where <email> appears in ANY role (from/to/cc/bcc),
+  # then lists other participants. This is intentional: it answers
+  # "who else is on threads involving this person", not just threads they sent.
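+  # Usage sketch (address is hypothetical):
+  #   query.sh threads alice@example.com
+  # prints up to 20 co-participants ranked by shared thread count.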
+  threads)
+    email="${1:?email address required}"
+    validate_email "$email"
+    duckdb -c "
+      WITH target_threads AS (
+        SELECT DISTINCT m.conversation_id
+        FROM $MSG m
+        JOIN $RECIP r ON r.message_id = m.id
+        JOIN $PARTS p ON p.id = r.participant_id
+        WHERE p.email_address = '$email'
+      )
+      SELECT p.email_address, p.domain, COUNT(DISTINCT m.conversation_id) as shared_threads
+      FROM target_threads tt
+      JOIN $MSG m ON m.conversation_id = tt.conversation_id
+      JOIN $RECIP r ON r.message_id = m.id
+      JOIN $PARTS p ON p.id = r.participant_id
+      WHERE p.email_address != '$email'
+      GROUP BY p.email_address, p.domain
+      ORDER BY shared_threads DESC LIMIT 20;
+    "
+    ;;
+
+  # Label counts: query.sh labels
+  labels)
+    duckdb -c "
+      SELECT l.name, COUNT(*) as emails
+      FROM $MLABELS ml
+      JOIN $LABELS l ON l.id = ml.label_id
+      GROUP BY l.name ORDER BY emails DESC;
+    "
+    ;;
+
+  # Messages with a specific label: query.sh label-messages <label> [limit]
+  label-messages)
+    label="${1:?label name required}"
+    validate_label "$label"
+    limit="${2:-50}"
+    validate_int "$limit" "limit"
+    duckdb -c "
+      SELECT m.id, m.subject, m.sent_at, p.email_address as sender
+      FROM $MSG m
+      JOIN $RECIP r ON r.message_id = m.id AND r.recipient_type = 'from'
+      JOIN $PARTS p ON p.id = r.participant_id
+      JOIN $MLABELS ml ON ml.message_id = m.id
+      JOIN $LABELS l ON l.id = ml.label_id
+      WHERE l.name = '$label'
+      ORDER BY m.sent_at DESC LIMIT $limit;
+    "
+    ;;
+
+  # Unclassified domains: query.sh unclassified domain1,domain2,...
+  unclassified)
+    in_list=$(build_domain_in_list "${1:?domain list required}")
+    duckdb -c "
+      SELECT p.domain, COUNT(*) as emails, COUNT(DISTINCT p.email_address) as senders
+      FROM $MSG m
+      JOIN $RECIP r ON r.message_id = m.id AND r.recipient_type = 'from'
+      JOIN $PARTS p ON p.id = r.participant_id
+      WHERE p.domain NOT IN ($in_list)
+      GROUP BY p.domain ORDER BY emails DESC LIMIT 50;
+    "
+    ;;
+
+  # Raw SQL: query.sh sql "SELECT ..."
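+  # Example (path assumes the default MSGVAULT_HOME; adjust if overridden):
+  #   query.sh sql "SELECT COUNT(*) FROM read_parquet('~/.msgvault/analytics/message_recipients/data.parquet')"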
+  # Allowlist: single read-only statement (SELECT, WITH, DESCRIBE, SHOW)
+  sql)
+    # Reject multi-statement input: a semicolon could smuggle in a second statement
+    if [[ "$1" == *";"* ]]; then
+      echo "Error: multi-statement input not allowed (no semicolons)." >&2
+      exit 1
+    fi
+    normalized=$(echo "$1" | sed 's/^[[:space:]]*//' | tr '[:lower:]' '[:upper:]')
+    if [[ "$normalized" =~ ^(SELECT|WITH|DESCRIBE|SHOW) ]]; then
+      duckdb -c "$1"
+    else
+      echo "Error: only read-only statements allowed (SELECT, WITH, DESCRIBE, SHOW)." >&2
+      echo "Got: $(echo "$normalized" | head -c 40)" >&2
+      exit 1
+    fi
+    ;;
+
+  help|*)
+    cat <<'EOF'
+msgvault DuckDB query helper
+
+Queries the Parquet analytics cache directly for operations the CLI
+search can't handle (boolean logic, multi-domain, aggregations, JOINs).
+
+Requires: duckdb on PATH, analytics cache built (msgvault build-cache)
+Respects: MSGVAULT_HOME env var (default: ~/.msgvault)
+
+Commands:
+  senders [limit] [--after DATE] [--before DATE]   Full sender graph
+  by-domain <domains> [limit]   Senders from comma-separated domains
+  domains [limit]               Domain breakdown with sender counts
+  classify <domains>            Count emails per domain (classification)
+  threads <email>               Co-participants in threads involving person
+  labels                        All labels with counts
+  label-messages