From 1a6460c7782773d1dd0e9421e88ca2dc5392e985 Mon Sep 17 00:00:00 2001 From: 0xDarkMatter <0xDarkMatter@users.noreply.github.com> Date: Tue, 31 Mar 2026 11:46:59 +1100 Subject: [PATCH 1/7] feat: add Claude Code skill with DuckDB query layer Adds a Claude Code skill for msgvault that covers the full CLI surface and includes direct DuckDB queries against the Parquet analytics cache for operations the CLI search can't handle (boolean logic, multi-domain, aggregations, thread analysis). Includes: - SKILL.md with verified JSON output shapes, search strategy, safety rules - scripts/query.sh helper wrapping common DuckDB patterns (9 subcommands) - references/duckdb-queries.md with full Parquet schema and query patterns - references/workflows.md with multi-step analysis patterns Tested against a ~755k message archive. All documented commands, jq patterns, and DuckDB queries verified against live data. Ref: #230 --- skills/claude-code/SKILL.md | 279 ++++++++++++++++ .../claude-code/references/duckdb-queries.md | 314 ++++++++++++++++++ skills/claude-code/references/workflows.md | 178 ++++++++++ skills/claude-code/scripts/query.sh | 197 +++++++++++ 4 files changed, 968 insertions(+) create mode 100644 skills/claude-code/SKILL.md create mode 100644 skills/claude-code/references/duckdb-queries.md create mode 100644 skills/claude-code/references/workflows.md create mode 100644 skills/claude-code/scripts/query.sh diff --git a/skills/claude-code/SKILL.md b/skills/claude-code/SKILL.md new file mode 100644 index 00000000..7e9477d2 --- /dev/null +++ b/skills/claude-code/SKILL.md @@ -0,0 +1,279 @@ +--- +name: msgvault-ops +description: "Local email archive operations with msgvault — search, analyze, export, and manage Gmail archives stored in SQLite/Parquet. Use when: querying email history, analyzing senders/domains, exporting messages or attachments, managing Gmail deletions, building sender graphs, running email analytics, importing mbox/emlx, or any task involving msgvault CLI. 
Triggers on: msgvault, email archive, email search, gmail archive, email export, sender analysis, sender graph, email classification, attachment export, email deletion, list senders, list domains, email analytics, mbox import."
---

# msgvault-ops

Operate the msgvault email archive CLI. All data is local (SQLite + Parquet). Queries run in milliseconds against DuckDB-powered indexes. Gmail API is only used for sync and deletion.

## Environment

```
Binary: msgvault (or full path if not on PATH)
Data: ~/.msgvault/ (override with MSGVAULT_HOME)
Config: ~/.msgvault/config.toml
```

Ensure `msgvault` is on PATH or use the full binary path.

## Quick Reference

| Task | Command |
|------|---------|
| Archive status | `msgvault stats` |
| Search | `msgvault search "<query>" --json` |
| Top senders | `msgvault list-senders -n 100 --json` |
| Top domains | `msgvault list-domains -n 100 --json` |
| All labels | `msgvault list-labels --json` |
| Read message | `msgvault show-message <id> --json` |
| Export .eml | `msgvault export-eml <id> -o file.eml` |
| Export attachments | `msgvault export-attachments <id> -o ./dir/` |
| Incremental sync | `msgvault sync` |
| TUI | `msgvault tui` (interactive, not for agents) |

**Always use `--json` for programmatic access.** Parse with `jq`.

## Search

### Operators

Single-operator queries only. `from:` requires an **exact** email address — no fuzzy matching.
+ +| Operator | Example | Notes | +|----------|---------|-------| +| `from:` | `from:alice@example.com` | Exact sender address | +| `from:@domain` | `from:@gmail.com` | All senders from domain | +| `to:` | `to:team@company.com` | Recipient | +| `cc:` / `bcc:` | `cc:manager@co.com` | CC/BCC fields | +| `subject:` | `subject:meeting` | Subject text | +| `label:` / `l:` | `label:INBOX` | Gmail label | +| `has:attachment` | `has:attachment` | Has attachments | +| `before:` / `after:` | `after:2024-01-01` | Date (YYYY-MM-DD) | +| `older_than:` / `newer_than:` | `newer_than:7d` | Relative (d/w/m/y) | +| `larger:` / `smaller:` | `larger:5M` | Size filter (K/M) | +| bare words | `project update` | Full-text search | +| `"quoted"` | `"exact phrase"` | Exact phrase match | + +**Known limitations:** OR, NOT (-), wildcards (*), and parentheses do NOT work. For boolean/multi-domain queries, use DuckDB (see below). + +### Search Strategy + +The CLI search is single-operator and requires exact email addresses for `from:`. Work around this by layering tools. + +**Resolve sender first, then search:** +```bash +# Don't know the email? Find it via the sender index +msgvault list-senders -n 200 --json | jq -r '.[] | .key' | grep -i lastname +# Or use the query helper for domain-scoped lookup +bash scripts/query.sh by-domain gmail.com 20 +# Then search with the resolved address +msgvault search 'from:jdoe@example.com subject:proposal' -n 10 --json +``` + +**Narrow progressively:** Start broad (full-text), add operators (from:, subject:, date range) to filter down. Use `--json | jq` to post-filter results the CLI can't handle. + +**Escape to DuckDB when CLI can't do it:** Multi-domain, boolean logic, aggregations, thread analysis — drop to `query.sh` or raw DuckDB. Don't fight the CLI's limitations. + +**Stop after 5 attempts.** If targeted queries with plausible sender + keywords don't find it, more searching rarely helps. 
Check `msgvault list-accounts` (right account?), `msgvault stats` (sync fresh?), or suggest the user check a different account.

### Pagination

Default limit is 50. Use `--limit` and `--offset`:

```bash
msgvault search "from:@gmail.com" --limit 100 --offset 0 --json
msgvault search "from:@gmail.com" --limit 100 --offset 100 --json
```

## Common Workflows

For complete command reference with all flags, see [references/cli-reference.md](references/cli-reference.md).

For complex multi-step workflows, see [references/workflows.md](references/workflows.md).

### Sender Graph Analysis

```bash
# Top 500 senders with counts
msgvault list-senders -n 500 --json | jq -r '.[] | "\(.count)\t\(.key)"' | sort -rn

# Senders in a date range
msgvault list-senders -n 500 --after 2017-01-01 --before 2020-01-01 --json

# Domain breakdown
msgvault list-domains -n 200 --json | jq -r '.[] | "\(.count)\t\(.key)"'
```

### Message Drill-Down

```bash
# Search → get ID → read full message
msgvault search "from:alice@example.com subject:contract" --json | jq '.[0].id'
msgvault show-message <id> --json

# Extract just the body (avoids context bloat)
msgvault show-message <id> --json | jq '.body_text'

# Extract just attachments list
msgvault show-message <id> --json | jq '.attachments'
```

### Attachment Operations

```bash
# Find messages with large attachments
msgvault search "has:attachment larger:5M" --limit 100 --json

# Export all attachments from a message
msgvault export-attachments <id> -o ./exports/

# Export single attachment by SHA-256 hash (from show-message .attachments[].content_hash)
msgvault export-attachment <hash> -o file.pdf

# Batch export
msgvault search "has:attachment label:Personal" --limit 100 --json | \
  jq -r '.[].id' | while read id; do msgvault export-attachments "$id" -o ./exports/; done
```

### Deletion (Staged, Safe)

**WARNING:** `delete-staged` without `--trash` is PERMANENT and IRREVERSIBLE.
Always `--dry-run` first.

Two-step process — stage in TUI, execute via CLI:

1. `msgvault tui` → navigate → select with `Space` → press `d` to stage
2. Review and execute:

```bash
msgvault list-deletions # review pending batches
msgvault delete-staged --dry-run # preview what would be deleted
msgvault delete-staged --trash # move to Gmail trash (recoverable 30 days)
msgvault delete-staged --yes # permanent delete (IRREVERSIBLE)
msgvault cancel-deletion <batch-id> # cancel a batch
msgvault cancel-deletion --all # cancel all
```

Always confirm with the user before executing. Suggest `--dry-run` first.

## JSON Output Shapes (verified)

### search --json

```json
[{
  "id": 12345,
  "source_message_id": "18f0abc123",
  "conversation_id": 67890,
  "source_conversation_id": "thread-abc",
  "subject": "...",
  "from_email": "alice@example.com",
  "from_name": "Alice Smith",
  "sent_at": "2024-01-15T10:30:00Z",
  "snippet": "...",
  "labels": ["INBOX", "IMPORTANT"],
  "has_attachments": true,
  "attachment_count": 2,
  "size_estimate": 45678
}]
```

Notes:
- search returns `from_email` and `from_name` (not `from`). No `to`/`cc`/`bcc` — use `show-message` for recipients.
- **Empty results return non-JSON error text.** Always check exit code or wrap: `msgvault search "..."
--json 2>/dev/null || echo '[]'`

### list-senders / list-domains / list-labels --json

```json
[{"key": "alice@example.com", "count": 142, "total_size": 5678900, "attachment_size": 1234567}]
```

### show-message --json

```json
{
  "id": 12345,
  "source_message_id": "18f0abc",
  "conversation_id": 67890,
  "source_conversation_id": "thread-abc",
  "subject": "...",
  "from": "Alice Smith <alice@example.com>",
  "to": [{"email": "bob@example.com", "name": "Bob Jones"}],
  "cc": [],
  "bcc": [],
  "sent_at": "2024-01-15T10:30:00Z",
  "labels": ["INBOX"],
  "snippet": "...",
  "has_attachments": true,
  "size_estimate": 45678,
  "body_text": "...",
  "body_html": "...",
  "attachments": [{"id": 123, "filename": "doc.pdf", "mime_type": "application/pdf", "size": 12345, "content_hash": "abc123..."}]
}
```

Notes:
- `to`/`cc`/`bcc` are **arrays of objects**: `[{"email": "...", "name": "..."}]` — extract emails with `.to[].email`
- `attachments[].content_hash` is the SHA-256 hash used by `export-attachment`
- `show-message` can return ~11k tokens for long threads. Always pipe through `jq` to extract only what you need: `.body_text`, `.attachments`, `.to[].email`, etc.

## DuckDB Queries (Advanced)

The CLI `search` is single-operator only. For boolean logic, multi-domain queries, aggregations, or cross-table joins, use DuckDB against the Parquet cache.
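The bare invocation shape, before reaching for the helper — a minimal sketch, assuming `duckdb` is on PATH and the analytics cache has been built (`msgvault build-cache`); `year` is the Hive partition key, so this scans cheaply:

```shell
# Minimal raw query: message volume per year, straight from the Parquet cache.
DATA="${MSGVAULT_HOME:-$HOME/.msgvault}/analytics"
duckdb -c "
SELECT year, COUNT(*) AS emails
FROM read_parquet('$DATA/messages/*/data_0.parquet', hive_partitioning=true)
GROUP BY year ORDER BY year;
"
```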
+ +### Query Helper Script + +`scripts/query.sh` wraps common DuckDB patterns — no raw SQL needed: + +```bash +bash scripts/query.sh senders 50 # Top 50 senders +bash scripts/query.sh senders 50 --after 2020-01-01 # Time-scoped +bash scripts/query.sh by-domain gmail.com,hotmail.com,yahoo.com # Senders from specific domains +bash scripts/query.sh classify example.com,supplier.co,partner.org # Count by domain list +bash scripts/query.sh threads alice@example.com # Thread co-participants +bash scripts/query.sh labels # All labels with counts +bash scripts/query.sh label-messages Personal 20 # Messages with label +bash scripts/query.sh unclassified mycompany.com,asana.com # Domains NOT in list +bash scripts/query.sh sql "SELECT ..." # Raw SQL escape hatch +``` + +### Raw DuckDB (when the script doesn't cover it) + +See [references/duckdb-queries.md](references/duckdb-queries.md) for full schema and query patterns. + +```bash +duckdb -c " +SELECT p.domain, COUNT(*) as emails, COUNT(DISTINCT p.email_address) as senders +FROM read_parquet('~/.msgvault/analytics/messages/*/data_0.parquet', hive_partitioning=true) m +JOIN read_parquet('~/.msgvault/analytics/message_recipients/data.parquet') r + ON r.message_id = m.id AND r.recipient_type = 'from' +JOIN read_parquet('~/.msgvault/analytics/participants/participants.parquet') p + ON p.id = r.participant_id +WHERE p.domain IN ('example.com', 'supplier.co', 'partner.org') +GROUP BY p.domain ORDER BY emails DESC; +" +``` + +### Key tables (Parquet in `~/.msgvault/analytics/`) + +| Table | Path | Key Columns | +|-------|------|-------------| +| messages | `messages/*/data_0.parquet` (hive by year) | id, subject, snippet, sent_at, has_attachments, year, month | +| message_recipients | `message_recipients/data.parquet` | message_id, participant_id, recipient_type (from/to/cc/bcc) | +| participants | `participants/participants.parquet` | id, email_address, domain, display_name | +| message_labels | `message_labels/data.parquet` | 
message_id, label_id |
| labels | `labels/labels.parquet` | id, name |
| attachments | `attachments/data.parquet` | message_id, size, filename |

**Use DuckDB when:** multi-domain IN(), boolean AND/OR/NOT, GROUP BY, JOINs, regex, window functions, CSV/JSON export, thread co-participant analysis.

**Use CLI `search` when:** simple single-field lookup, quick message retrieval by ID, full-text search on body content.

## Safety Rules

1. **Never delete without dry-run first** — `delete-staged --dry-run` before `--yes`
2. **Sync is read-only** — sync/sync-full never modify Gmail
3. **Deletion is two-step** — must stage in TUI first, then execute via CLI
4. **Cancel before execute** — use `cancel-deletion` if unsure
5. **Verify after sync** — `msgvault verify <account>` checks integrity
6. **Control output size** — always use `jq` with `show-message` to avoid context bloat
diff --git a/skills/claude-code/references/duckdb-queries.md b/skills/claude-code/references/duckdb-queries.md
new file mode 100644
index 00000000..ee0d6f2f
--- /dev/null
+++ b/skills/claude-code/references/duckdb-queries.md
@@ -0,0 +1,314 @@
# msgvault DuckDB Query Reference

The CLI `search` command is limited to single-operator queries. For anything complex, query the Parquet analytics cache directly with DuckDB.

**DuckDB CLI must be installed** (`which duckdb` to verify).
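A quick preflight before running any of the queries below — a sketch assuming the default `~/.msgvault` layout (override with `MSGVAULT_HOME`); `msgvault build-cache` is the command that populates the cache:

```shell
# Fail fast with a readable message instead of a DuckDB IO error mid-query.
if ! command -v duckdb >/dev/null 2>&1; then
  echo "duckdb not found on PATH" >&2
elif [ ! -d "${MSGVAULT_HOME:-$HOME/.msgvault}/analytics/messages" ]; then
  echo "analytics cache missing -- run: msgvault build-cache" >&2
else
  echo "preflight ok"
fi
```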
+ +## Data Layout + +``` +~/.msgvault/analytics/ +├── messages/year=YYYY/data_0.parquet # Hive-partitioned by year +├── message_recipients/data.parquet # from/to/cc/bcc links +├── participants/participants.parquet # email addresses + domains +├── message_labels/data.parquet # message ↔ label links +├── labels/labels.parquet # label names +├── attachments/data.parquet # attachment metadata +├── conversations/conversations.parquet # thread grouping +└── sources/sources.parquet # account info +``` + +## Table Aliases + +Use these in all queries for readability: + +```sql +-- Standard table references +read_parquet('~/.msgvault/analytics/messages/*/data_0.parquet', hive_partitioning=true) AS m +read_parquet('~/.msgvault/analytics/message_recipients/data.parquet') AS r +read_parquet('~/.msgvault/analytics/participants/participants.parquet') AS p +read_parquet('~/.msgvault/analytics/message_labels/data.parquet') AS ml +read_parquet('~/.msgvault/analytics/labels/labels.parquet') AS l +read_parquet('~/.msgvault/analytics/attachments/data.parquet') AS a +read_parquet('~/.msgvault/analytics/conversations/conversations.parquet') AS c +``` + +## Schema + +### messages (partitioned by year) +| Column | Type | Notes | +|--------|------|-------| +| id | BIGINT | Primary key | +| source_id | BIGINT | FK → sources | +| source_message_id | VARCHAR | Gmail message ID | +| conversation_id | BIGINT | FK → conversations (thread) | +| subject | VARCHAR | | +| snippet | VARCHAR | Preview text | +| sent_at | TIMESTAMP | | +| size_estimate | BIGINT | Bytes | +| has_attachments | BOOLEAN | | +| deleted_from_source_at | TIMESTAMP | NULL if not deleted | +| month | INTEGER | 1-12 | +| year | BIGINT | Hive partition key | + +### message_recipients +| Column | Type | Notes | +|--------|------|-------| +| message_id | BIGINT | FK → messages | +| participant_id | BIGINT | FK → participants | +| recipient_type | VARCHAR | `from`, `to`, `cc`, `bcc` | +| display_name | VARCHAR | As shown in email | + 
+### participants +| Column | Type | Notes | +|--------|------|-------| +| id | BIGINT | Primary key | +| email_address | VARCHAR | Full address | +| domain | VARCHAR | Extracted domain | +| display_name | VARCHAR | | + +### message_labels +| Column | Type | Notes | +|--------|------|-------| +| message_id | BIGINT | FK → messages | +| label_id | BIGINT | FK → labels | + +### labels +| Column | Type | Notes | +|--------|------|-------| +| id | BIGINT | Primary key | +| name | VARCHAR | Gmail label name | + +### attachments +| Column | Type | Notes | +|--------|------|-------| +| message_id | BIGINT | FK → messages | +| size | BIGINT | Bytes | +| filename | VARCHAR | | + +## Common Joins + +### Message with sender +```sql +SELECT m.id, m.subject, m.sent_at, p.email_address, p.domain +FROM read_parquet('~/.msgvault/analytics/messages/*/data_0.parquet', hive_partitioning=true) m +JOIN read_parquet('~/.msgvault/analytics/message_recipients/data.parquet') r + ON r.message_id = m.id AND r.recipient_type = 'from' +JOIN read_parquet('~/.msgvault/analytics/participants/participants.parquet') p + ON p.id = r.participant_id +``` + +### Message with labels +```sql +SELECT m.id, m.subject, l.name as label +FROM read_parquet('~/.msgvault/analytics/messages/*/data_0.parquet', hive_partitioning=true) m +JOIN read_parquet('~/.msgvault/analytics/message_labels/data.parquet') ml + ON ml.message_id = m.id +JOIN read_parquet('~/.msgvault/analytics/labels/labels.parquet') l + ON l.id = ml.label_id +``` + +## Sender Analysis Queries + +### Full sender graph (top N by volume) +```sql +SELECT p.email_address, p.domain, p.display_name, + COUNT(*) as emails, + MIN(m.sent_at) as first_seen, + MAX(m.sent_at) as last_seen +FROM read_parquet('~/.msgvault/analytics/messages/*/data_0.parquet', hive_partitioning=true) m +JOIN read_parquet('~/.msgvault/analytics/message_recipients/data.parquet') r + ON r.message_id = m.id AND r.recipient_type = 'from' +JOIN 
read_parquet('~/.msgvault/analytics/participants/participants.parquet') p + ON p.id = r.participant_id +GROUP BY p.email_address, p.domain, p.display_name +ORDER BY emails DESC +LIMIT 500; +``` + +### Multi-domain search (impossible via CLI) +```sql +SELECT p.domain, COUNT(*) as emails, COUNT(DISTINCT p.email_address) as unique_senders +FROM read_parquet('~/.msgvault/analytics/messages/*/data_0.parquet', hive_partitioning=true) m +JOIN read_parquet('~/.msgvault/analytics/message_recipients/data.parquet') r + ON r.message_id = m.id AND r.recipient_type = 'from' +JOIN read_parquet('~/.msgvault/analytics/participants/participants.parquet') p + ON p.id = r.participant_id +WHERE p.domain IN ('example.com', 'supplier.co', 'partner.org', 'ledger.com') +GROUP BY p.domain +ORDER BY emails DESC; +``` + +### Emails to/from known personal contacts +```sql +SELECT p.email_address, r.recipient_type, COUNT(*) as emails +FROM read_parquet('~/.msgvault/analytics/messages/*/data_0.parquet', hive_partitioning=true) m +JOIN read_parquet('~/.msgvault/analytics/message_recipients/data.parquet') r + ON r.message_id = m.id +JOIN read_parquet('~/.msgvault/analytics/participants/participants.parquet') p + ON p.id = r.participant_id +WHERE p.email_address IN ('alice@gmail.com', 'bob@example.com', 'carol@example.org') +GROUP BY p.email_address, r.recipient_type +ORDER BY emails DESC; +``` + +### All gmail.com senders (excluding known work contacts) +```sql +SELECT p.email_address, p.display_name, COUNT(*) as emails, + MIN(m.sent_at) as first_seen, MAX(m.sent_at) as last_seen +FROM read_parquet('~/.msgvault/analytics/messages/*/data_0.parquet', hive_partitioning=true) m +JOIN read_parquet('~/.msgvault/analytics/message_recipients/data.parquet') r + ON r.message_id = m.id AND r.recipient_type = 'from' +JOIN read_parquet('~/.msgvault/analytics/participants/participants.parquet') p + ON p.id = r.participant_id +WHERE p.domain = 'gmail.com' + AND p.email_address NOT IN 
('adrian.halliday@gmail.com') -- known work +GROUP BY p.email_address, p.display_name +ORDER BY emails DESC; +``` + +### Senders in a time period +```sql +SELECT p.email_address, p.domain, COUNT(*) as emails +FROM read_parquet('~/.msgvault/analytics/messages/*/data_0.parquet', hive_partitioning=true) m +JOIN read_parquet('~/.msgvault/analytics/message_recipients/data.parquet') r + ON r.message_id = m.id AND r.recipient_type = 'from' +JOIN read_parquet('~/.msgvault/analytics/participants/participants.parquet') p + ON p.id = r.participant_id +WHERE m.year BETWEEN 2017 AND 2019 +GROUP BY p.email_address, p.domain +ORDER BY emails DESC +LIMIT 100; +``` + +## Classification Queries + +### Classify all messages by domain list +```sql +WITH sensitive_domains AS ( + SELECT unnest(['example.com','supplier.co','partner.org','anz.com.au','medibank.com.au']) as domain +), +sender_info AS ( + SELECT m.id, p.email_address, p.domain + FROM read_parquet('~/.msgvault/analytics/messages/*/data_0.parquet', hive_partitioning=true) m + JOIN read_parquet('~/.msgvault/analytics/message_recipients/data.parquet') r + ON r.message_id = m.id AND r.recipient_type = 'from' + JOIN read_parquet('~/.msgvault/analytics/participants/participants.parquet') p + ON p.id = r.participant_id +) +SELECT s.domain, COUNT(*) as emails +FROM sender_info s +JOIN sensitive_domains sd ON s.domain = sd.domain +GROUP BY s.domain +ORDER BY emails DESC; +``` + +### Emails with specific labels +```sql +SELECT l.name as label, COUNT(*) as emails +FROM read_parquet('~/.msgvault/analytics/message_labels/data.parquet') ml +JOIN read_parquet('~/.msgvault/analytics/labels/labels.parquet') l + ON l.id = ml.label_id +WHERE l.name IN ('Personal', '00_Private', 'Travel', 'Fusioneer') +GROUP BY l.name +ORDER BY emails DESC; +``` + +### Messages with label AND from domain +```sql +SELECT m.id, m.subject, m.sent_at, p.email_address, l.name as label +FROM read_parquet('~/.msgvault/analytics/messages/*/data_0.parquet', 
hive_partitioning=true) m +JOIN read_parquet('~/.msgvault/analytics/message_recipients/data.parquet') r + ON r.message_id = m.id AND r.recipient_type = 'from' +JOIN read_parquet('~/.msgvault/analytics/participants/participants.parquet') p + ON p.id = r.participant_id +JOIN read_parquet('~/.msgvault/analytics/message_labels/data.parquet') ml + ON ml.message_id = m.id +JOIN read_parquet('~/.msgvault/analytics/labels/labels.parquet') l + ON l.id = ml.label_id +WHERE l.name = 'Personal' AND p.domain = 'gmail.com' +LIMIT 50; +``` + +### Unclassified domains (not in any known list) +```sql +WITH known_domains AS ( + SELECT unnest([ + -- work + 'mycompany.com','mycompany.io','asana.com','slack.com','github.com', + -- sensitive + 'example.com','supplier.co','anz.com.au','medibank.com.au', + -- personal + 'gmail.com','hotmail.com','yahoo.com' + -- add more... + ]) as domain +) +SELECT p.domain, COUNT(*) as emails, COUNT(DISTINCT p.email_address) as senders +FROM read_parquet('~/.msgvault/analytics/messages/*/data_0.parquet', hive_partitioning=true) m +JOIN read_parquet('~/.msgvault/analytics/message_recipients/data.parquet') r + ON r.message_id = m.id AND r.recipient_type = 'from' +JOIN read_parquet('~/.msgvault/analytics/participants/participants.parquet') p + ON p.id = r.participant_id +WHERE p.domain NOT IN (SELECT domain FROM known_domains) +GROUP BY p.domain +ORDER BY emails DESC +LIMIT 100; +``` + +## Thread Analysis + +### Co-participants in threads with a sender +```sql +WITH target_threads AS ( + SELECT DISTINCT m.conversation_id + FROM read_parquet('~/.msgvault/analytics/messages/*/data_0.parquet', hive_partitioning=true) m + JOIN read_parquet('~/.msgvault/analytics/message_recipients/data.parquet') r + ON r.message_id = m.id + JOIN read_parquet('~/.msgvault/analytics/participants/participants.parquet') p + ON p.id = r.participant_id + WHERE p.email_address = 'person@example.com' +) +SELECT p.email_address, p.domain, COUNT(DISTINCT m.conversation_id) as 
shared_threads
FROM target_threads tt
JOIN read_parquet('~/.msgvault/analytics/messages/*/data_0.parquet', hive_partitioning=true) m
  ON m.conversation_id = tt.conversation_id
JOIN read_parquet('~/.msgvault/analytics/message_recipients/data.parquet') r
  ON r.message_id = m.id
JOIN read_parquet('~/.msgvault/analytics/participants/participants.parquet') p
  ON p.id = r.participant_id
WHERE p.email_address != 'person@example.com'
GROUP BY p.email_address, p.domain
ORDER BY shared_threads DESC
LIMIT 20;
```

## Export Patterns

### Export query to CSV
```sql
COPY (
  SELECT p.email_address, p.domain, COUNT(*) as emails
  FROM read_parquet('~/.msgvault/analytics/messages/*/data_0.parquet', hive_partitioning=true) m
  JOIN read_parquet('~/.msgvault/analytics/message_recipients/data.parquet') r
    ON r.message_id = m.id AND r.recipient_type = 'from'
  JOIN read_parquet('~/.msgvault/analytics/participants/participants.parquet') p
    ON p.id = r.participant_id
  GROUP BY p.email_address, p.domain
  ORDER BY emails DESC
) TO 'senders.csv' (HEADER, DELIMITER ',');
```

### Export to JSON
```sql
COPY (
  SELECT ...
) TO 'output.json' (FORMAT JSON);
```

## Performance Tips

- Messages are **Hive-partitioned by year** — add `WHERE m.year = 2024` to limit scan scope
- Use `LIMIT` to preview before running full queries
- `COUNT(DISTINCT ...)` is expensive on large sets — use approximations if speed matters
- For repeated queries, consider creating persistent views in a DuckDB database file
diff --git a/skills/claude-code/references/workflows.md b/skills/claude-code/references/workflows.md
new file mode 100644
index 00000000..a191619e
--- /dev/null
+++ b/skills/claude-code/references/workflows.md
@@ -0,0 +1,178 @@
# msgvault Workflows

Complex multi-step patterns for email analysis, classification, and export.

## Sender Graph Analysis

Build a complete picture of who emails you and how often.
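The jq reshaping used throughout this section can be sanity-checked against inline sample data first (the shape matches the documented `list-senders --json` output; the addresses are placeholders):

```shell
# Rank senders by count from a hardcoded sample -- no archive needed.
echo '[{"key":"alice@example.com","count":3},{"key":"bob@example.com","count":7}]' |
  jq -r '.[] | "\(.count)\t\(.key)"' | sort -rn
# bob (7) sorts above alice (3)
```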
### Full sender graph
```bash
# All senders ranked by volume
msgvault list-senders -n 1000 --json | jq -r '.[] | "\(.count)\t\(.key)"' | sort -rn

# Domain breakdown
msgvault list-domains -n 500 --json | jq -r '.[] | "\(.count)\t\(.key)"' | sort -rn

# Senders from a specific domain (e.g. all gmail.com senders)
msgvault search "from:@gmail.com" --limit 500 --json | \
  jq -r '[.[].from_email] | group_by(.) | map({sender: .[0], count: length}) | sort_by(-.count) | .[] | "\(.count)\t\(.sender)"'
```

### Time-scoped sender analysis
```bash
# Who emailed during the crypto era (2017-2019)?
msgvault list-senders -n 200 --after 2017-01-01 --before 2019-12-31 --json

# Recent senders only
msgvault list-senders -n 100 --after 2025-01-01 --json

# Compare sender volume across periods
for year in 2020 2021 2022 2023 2024 2025; do
  echo "=== $year ==="
  msgvault list-domains -n 10 --after $year-01-01 --before $year-12-31 --json | \
    jq -r '.[] | "\(.count)\t\(.key)"'
done
```

### Unique sender extraction for classification
```bash
# Extract unique senders with counts, suitable for review spreadsheet
msgvault list-senders -n 5000 --json | \
  jq -r '.[] | [.key, .count, (.total_size / 1024 | floor | tostring) + "K"] | @csv' \
  > senders.csv

# Extract unique domains
msgvault list-domains -n 1000 --json | \
  jq -r '.[] | [.key, .count] | @csv' > domains.csv
```

## Email Classification Pipeline

### Step 1: Domain-based classification
```bash
# Check which domains from a list appear in the archive
for domain in example.com supplier.co partner.org; do
  count=$(msgvault search "from:@$domain" --limit 1 --json | jq 'length')
  printf '%s\t%s\n' "$count" "$domain"
done

# Flag sensitive domains that appear (--limit 1 makes this a presence check, not a full count)
for domain in $(cat sensitive-domains.txt); do
  count=$(msgvault search "from:@$domain" --limit 1 --json 2>/dev/null | jq 'length' 2>/dev/null)
  count=${count:-0}
  [ "$count" -gt 0 ] && printf '%s\t%s\n' "$count" "$domain"
done
```

### Step 2: Sender-based classification
```bash
# Find all emails to/from a known personal contact
msgvault search "from:person@gmail.com" --limit 500 --json
msgvault search "to:person@gmail.com" --limit 500 --json

# Batch check known personal senders
while IFS= read -r sender; do
  count=$(msgvault search "from:$sender" --limit 1 --json | jq 'length')
  [ "$count" -gt 0 ] && printf '%s\t%s\n' "$count" "$sender"
done < known-personal-senders.txt
```

### Step 3: Label-based classification
```bash
# See all labels and their counts
msgvault list-labels --json | jq -r '.[] | "\(.count)\t\(.key)"' | sort -rn

# Emails with a specific label
msgvault search "label:Personal" --limit 500 --json
msgvault search "label:Travel" --limit 500 --json
```

## Attachment Mining

### Find valuable attachments
```bash
# Count messages with attachments in the first 500 matches
# (search output has no filenames; to filter by type, drill into show-message .attachments)
msgvault search "has:attachment" --limit 500 --json | \
  jq '[.[] | select(.has_attachments)] | length'

# Large attachments (likely documents, not inline images)
msgvault search "has:attachment larger:1M" --limit 100 --json

# Attachments from a specific sender
msgvault search "has:attachment from:accountant@firm.com" --json
```

### Batch export
```bash
# Export all attachments from matching messages
mkdir -p exports
msgvault search "has:attachment label:Personal" --limit 200 --json | \
  jq -r '.[].id' | while read id; do
    msgvault export-attachments "$id" -o ./exports/ 2>/dev/null
  done

# Export a single message as .eml for forensics
msgvault export-eml 12345 -o message.eml
```

## Thread Analysis

### Find conversation threads
```bash
# All emails in a thread with a specific person
msgvault search "from:alice@example.com" --limit 100 --json | \
  jq -r '.[].subject' | sort -u

# Cross-reference: who else is on threads with a sender
# (search results omit to/cc; fetch recipients via show-message)
msgvault search "from:alice@example.com" --limit 50 --json | \
  jq -r '.[].id' | while read id; do
    msgvault show-message "$id" --json | jq -r '.to[].email, .cc[].email'
  done | sort -u
```

## Pagination for Large Queries

```bash
# Paginate through all results (50 at
a time)
offset=0
while true; do
  results=$(msgvault search "from:@gmail.com" --limit 50 --offset $offset --json)
  count=$(echo "$results" | jq 'length')
  [ "$count" -eq 0 ] && break
  echo "$results" | jq -c '.[]' >> all_gmail_results.ndjson  # one object per line keeps the file valid
  offset=$((offset + 50))
done

# Simpler: fixed page count
for offset in $(seq 0 50 500); do
  msgvault search "from:@gmail.com" --limit 50 --offset $offset --json
done
```

## Reporting

### Archive overview
```bash
# Full stats
msgvault stats

# Top 20 senders
msgvault list-senders -n 20

# Top 20 domains
msgvault list-domains -n 20

# All labels
msgvault list-labels
```

### Export to CSV for spreadsheet review
```bash
# Senders CSV
msgvault list-senders -n 5000 --json | \
  jq -r '["sender","count","size_kb","attachment_kb"], (.[] | [.key, .count, (.total_size/1024|floor), (.attachment_size/1024|floor)]) | @csv' \
  > senders-report.csv

# Domains CSV
msgvault list-domains -n 1000 --json | \
  jq -r '["domain","count","size_kb"], (.[] | [.key, .count, (.total_size/1024|floor)]) | @csv' \
  > domains-report.csv
```
diff --git a/skills/claude-code/scripts/query.sh b/skills/claude-code/scripts/query.sh
new file mode 100644
index 00000000..cdb82b2c
--- /dev/null
+++ b/skills/claude-code/scripts/query.sh
@@ -0,0 +1,197 @@
#!/usr/bin/env bash
# msgvault DuckDB query helper
# Wraps common analytical queries against the Parquet cache
# Usage: query.sh <command> [args]
#
# Requires: duckdb on PATH
# Respects: MSGVAULT_HOME env var (default: ~/.msgvault)

set -euo pipefail

DATA="${MSGVAULT_HOME:-$HOME/.msgvault}/analytics"

# Verify analytics cache exists
if [ ! -d "$DATA/messages" ]; then
  echo "Error: Analytics cache not found at $DATA" >&2
  echo "Run 'msgvault build-cache' first."
>&2
  exit 1
fi

MSG="read_parquet('$DATA/messages/*/data_0.parquet', hive_partitioning=true)"
RECIP="read_parquet('$DATA/message_recipients/data.parquet')"
PARTS="read_parquet('$DATA/participants/participants.parquet')"
LABELS="read_parquet('$DATA/labels/labels.parquet')"
MLABELS="read_parquet('$DATA/message_labels/data.parquet')"
ATTACH="read_parquet('$DATA/attachments/data.parquet')"

cmd="${1:-help}"
shift || true

case "$cmd" in
  # Full sender graph: query.sh senders [limit] [--after YYYY-MM-DD] [--before YYYY-MM-DD]
  senders)
    limit=100
    where=""
    while [[ $# -gt 0 ]]; do
      case "$1" in
        --after) where="$where AND m.sent_at >= '$2'"; shift 2 ;;
        --before) where="$where AND m.sent_at < '$2'"; shift 2 ;;
        [0-9]*) limit="$1"; shift ;;  # bare numeric argument is the limit
        *) shift ;;
      esac
    done
    duckdb -c "
      SELECT p.email_address, p.domain, p.display_name, COUNT(*) as emails,
             MIN(m.sent_at) as first_seen, MAX(m.sent_at) as last_seen
      FROM $MSG m
      JOIN $RECIP r ON r.message_id = m.id AND r.recipient_type = 'from'
      JOIN $PARTS p ON p.id = r.participant_id
      WHERE 1=1 $where
      GROUP BY p.email_address, p.domain, p.display_name
      ORDER BY emails DESC LIMIT $limit;
    "
    ;;

  # Senders from specific domains: query.sh by-domain gmail.com,hotmail.com [limit]
  by-domain)
    domains="$1"
    limit="${2:-100}"
    in_list=$(echo "$domains" | sed "s/,/','/g")
    duckdb -c "
      SELECT p.email_address, p.display_name, p.domain, COUNT(*) as emails,
             MIN(m.sent_at) as first_seen, MAX(m.sent_at) as last_seen
      FROM $MSG m
      JOIN $RECIP r ON r.message_id = m.id AND r.recipient_type = 'from'
      JOIN $PARTS p ON p.id = r.participant_id
      WHERE p.domain IN ('$in_list')
      GROUP BY p.email_address, p.display_name, p.domain
      ORDER BY emails DESC LIMIT $limit;
    "
    ;;

  # Domain breakdown: query.sh domains [limit]
  domains)
    limit="${1:-100}"
    duckdb -c "
      SELECT p.domain, COUNT(*) as emails, COUNT(DISTINCT p.email_address) as unique_senders,
             SUM(m.size_estimate) as total_bytes
      FROM $MSG m
      JOIN $RECIP r ON r.message_id = m.id AND r.recipient_type = 'from'
      JOIN $PARTS p ON p.id = r.participant_id
      GROUP BY p.domain ORDER BY emails DESC LIMIT $limit;
    "
    ;;

  # Count emails per domain list: query.sh classify domain1,domain2,...
  classify)
    domains="$1"
    in_list=$(echo "$domains" | sed "s/,/','/g")
    duckdb -c "
      SELECT p.domain, COUNT(*) as emails, COUNT(DISTINCT p.email_address) as senders
      FROM $MSG m
      JOIN $RECIP r ON r.message_id = m.id AND r.recipient_type = 'from'
      JOIN $PARTS p ON p.id = r.participant_id
      WHERE p.domain IN ('$in_list')
      GROUP BY p.domain ORDER BY emails DESC;
    "
    ;;

  # Thread co-participants: query.sh threads <email>
  threads)
    email="$1"
    duckdb -c "
      WITH target_threads AS (
        SELECT DISTINCT m.conversation_id
        FROM $MSG m
        JOIN $RECIP r ON r.message_id = m.id
        JOIN $PARTS p ON p.id = r.participant_id
        WHERE p.email_address = '$email'
      )
      SELECT p.email_address, p.domain, COUNT(DISTINCT m.conversation_id) as shared_threads
      FROM target_threads tt
      JOIN $MSG m ON m.conversation_id = tt.conversation_id
      JOIN $RECIP r ON r.message_id = m.id
      JOIN $PARTS p ON p.id = r.participant_id
      WHERE p.email_address != '$email'
      GROUP BY p.email_address, p.domain
      ORDER BY shared_threads DESC LIMIT 20;
    "
    ;;

  # Label counts: query.sh labels
  labels)
    duckdb -c "
      SELECT l.name, COUNT(*) as emails
      FROM $MLABELS ml
      JOIN $LABELS l ON l.id = ml.label_id
      GROUP BY l.name ORDER BY emails DESC;
    "
    ;;

  # Messages with a specific label: query.sh label-messages <label> [limit]
  label-messages)
    label="$1"
    limit="${2:-50}"
    duckdb -c "
      SELECT m.id, m.subject, m.sent_at, p.email_address as sender
      FROM $MSG m
      JOIN $RECIP r ON r.message_id = m.id AND r.recipient_type = 'from'
      JOIN $PARTS p ON p.id = r.participant_id
      JOIN $MLABELS ml ON ml.message_id = m.id
      JOIN $LABELS l ON l.id = ml.label_id
      WHERE l.name = '$label'
      ORDER BY m.sent_at DESC LIMIT $limit;
    "
    ;;

  # Unclassified domains: 
query.sh unclassified domain1,domain2,...
  unclassified)
    domains="$1"
    in_list=$(echo "$domains" | sed "s/,/','/g")
    duckdb -c "
      SELECT p.domain, COUNT(*) as emails, COUNT(DISTINCT p.email_address) as senders
      FROM $MSG m
      JOIN $RECIP r ON r.message_id = m.id AND r.recipient_type = 'from'
      JOIN $PARTS p ON p.id = r.participant_id
      WHERE p.domain NOT IN ('$in_list')
      GROUP BY p.domain ORDER BY emails DESC LIMIT 50;
    "
    ;;

  # Raw SQL: query.sh sql "SELECT ..."
  sql)
    duckdb -c "$1"
    ;;

  help|*)
    cat <<'EOF'
msgvault DuckDB query helper

Queries the Parquet analytics cache directly for operations the CLI
search can't handle (boolean logic, multi-domain, aggregations, JOINs).

Requires: duckdb on PATH, analytics cache built (msgvault build-cache)
Respects: MSGVAULT_HOME env var (default: ~/.msgvault)

Commands:
  senders [limit] [--after DATE] [--before DATE]   Full sender graph
  by-domain <domains> [limit]                      Senders from comma-separated domains
  domains [limit]                                  Domain breakdown with sender counts
  classify <domains>                               Count emails per domain (classification)
  threads <email>                                  Co-participants in threads with sender
  labels                                           All labels with counts
  label-messages