Add Claude Code skill with DuckDB query layer #236
0xDarkMatter wants to merge 7 commits into wesm:main
Adds a Claude Code skill for msgvault that covers the full CLI surface and includes direct DuckDB queries against the Parquet analytics cache for operations the CLI search can't handle (boolean logic, multi-domain, aggregations, thread analysis).

Includes:
- SKILL.md with verified JSON output shapes, search strategy, safety rules
- scripts/query.sh helper wrapping common DuckDB patterns (9 subcommands)
- references/duckdb-queries.md with full Parquet schema and query patterns
- references/workflows.md with multi-step analysis patterns

Tested against a ~755k message archive. All documented commands, jq patterns, and DuckDB queries verified against live data.

Ref: wesm#230
roborev: Combined Review (
… query

- Add input validation to all query.sh subcommands (integers, dates, domains, emails, labels) to prevent SQL injection via crafted arguments
- Fix senders subcommand to accept flags before or after optional limit
- Fix thread analysis workflow to use query.sh instead of search --json (search does not return to/cc fields)
- Guard all search-to-jq pipelines against non-JSON empty results
- Add note about sql subcommand passing input unvalidated
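To make the input-validation bullet concrete, here is a minimal sketch of what a date check could look like. The function name `validate_date` and the `YYYY-MM-DD` format are assumptions for illustration; the actual checks in query.sh may differ.

```shell
# Hypothetical helper, not copied from query.sh.
# Accepts only the literal shape YYYY-MM-DD (format is an assumption),
# so quotes, spaces, and SQL fragments never reach the query string.
validate_date() {
  case $1 in
    [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]) return 0 ;;
    *) return 1 ;;
  esac
}
```

A crafted argument such as `2024-01-01' OR '1'='1` fails the shape check before any SQL is built. Note that this only checks the shape, not whether the date exists on the calendar.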
The & character in the bash regex character class caused a parse error. Switched to denylist approach (reject single quotes, semicolons, backslashes) which is more robust for label names containing special characters like &.
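A denylist check along these lines might look like the sketch below. The function name `validate_label` is hypothetical; the three rejected characters are the ones named in the commit message. It uses `case` patterns rather than a regex character class, which sidesteps the `&` parsing problem entirely.

```shell
# Hypothetical helper, not copied from query.sh.
# Rejects labels containing a single quote, semicolon, or backslash;
# everything else (including &) is allowed through.
validate_label() {
  case $1 in
    '') return 1 ;;                    # empty labels are rejected too (an assumption)
    *\'*|*\;*|*\\*) return 1 ;;        # the three denylisted characters
    *) return 0 ;;
  esac
}
```

With this shape, a label such as `R&D` passes, while `x'; DROP TABLE messages` is rejected on the quote alone.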
roborev: Combined Review (
Security:
- Add duckdb binary existence check before running queries
- Tighten domain validation: reject underscores, require start/end with alphanumeric (closes injection via underscore identifiers)
- Add write-operation guard to sql subcommand: blocks DROP, DELETE, INSERT, UPDATE, CREATE, ALTER, COPY TO
- Add security note to SKILL.md about sql subcommand risks

Correctness:
- Replace shift || true with explicit guard (prevents masked errors)
- Add bounds check to validate_int (1-100000)

Completeness:
- Add build-cache and sync-full to SKILL.md Quick Reference
- Add MSGVAULT_HOME path note to duckdb-queries.md
- Document analytics cache prerequisite in DuckDB section
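For reference, the binary check, the bounded integer check, and the tightened domain check could be sketched as follows. Only `validate_int` and its 1-100000 range come from the commit message; `require_duckdb`, `validate_domain`, and the error text are hypothetical, and the real script may implement these differently.

```shell
# Illustrative helpers only; not copied from query.sh.

# Fail early with a clear message if the duckdb binary is not on PATH.
require_duckdb() {
  command -v duckdb >/dev/null 2>&1 || {
    echo "error: duckdb not found on PATH" >&2
    return 1
  }
}

# Digits only, at most 6 characters, and within 1-100000.
validate_int() {
  case $1 in
    ''|*[!0-9]*) return 1 ;;
  esac
  [ "${#1}" -le 6 ] && [ "$1" -ge 1 ] && [ "$1" -le 100000 ]
}

# Letters, digits, dots, and hyphens only; must start and end alphanumeric.
validate_domain() {
  case $1 in
    *[!A-Za-z0-9.-]*) return 1 ;;   # rejects underscores, quotes, spaces, newlines
  esac
  case $1 in
    [A-Za-z0-9]|[A-Za-z0-9]*[A-Za-z0-9]) return 0 ;;
    *) return 1 ;;
  esac
}
```

The length cap in `validate_int` is there so that an absurdly long digit string is rejected before the numeric comparison runs.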
roborev: Combined Review (
The write-operation guard used a case-sensitive blacklist that could be bypassed with lowercase or mixed-case statements. Replace with a strict allowlist that normalizes input to uppercase and only permits SELECT, WITH, EXPLAIN, DESCRIBE, SHOW, and PRAGMA statements.
roborev: Combined Review (
- Reject semicolons in sql subcommand input to prevent multi-statement bypass (e.g. "SELECT 1; DROP TABLE messages")
- Remove PRAGMA from allowlist (can modify DuckDB state)
- Clarify threads subcommand matches any participant role (from/to/cc/bcc), not just senders. This is intentional for the "who else is on threads involving this person" use case. Updated help text to document this.
roborev: Combined Review (
EXPLAIN ANALYZE executes the underlying statement, so allowing EXPLAIN breaks the read-only guarantee. Remove EXPLAIN from the allowlist entirely — agents rarely need it and can use DESCRIBE/SHOW instead. Allowlist is now: SELECT, WITH, DESCRIBE, SHOW.
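Putting the last few commits together, the guard on the sql subcommand could look roughly like this. This is a sketch under assumptions, not the code in query.sh: the function name `is_read_only_sql` is hypothetical, and checking only the first keyword is a coarse filter rather than a full SQL parse.

```shell
# Hypothetical guard, not copied from query.sh.
# 1. Reject any semicolon, so a second statement cannot be appended.
# 2. Uppercase the first keyword, so case tricks do not matter.
# 3. Permit only SELECT, WITH, DESCRIBE, SHOW.
is_read_only_sql() {
  case $1 in
    *\;*) return 1 ;;
  esac
  # First whitespace-delimited word on the first non-blank line, uppercased.
  first=$(printf '%s\n' "$1" | awk 'NF { print $1; exit }' | tr '[:lower:]' '[:upper:]')
  case $first in
    SELECT|WITH|DESCRIBE|SHOW) return 0 ;;
    *) return 1 ;;
  esac
}
```

With this shape, `select 1` and `SeLeCt 1` both pass, while `explain analyze delete from messages`, `PRAGMA ...`, and `SELECT 1; DROP TABLE messages` are all rejected. The semicolon rule is deliberately blunt: it also rejects a harmless trailing `;` and semicolons inside string literals.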
roborev: Combined Review (
Hey Wes — thanks for the nudge on #230, here's a PR!
This adds a Claude Code skill under skills/claude-code/ that covers the full CLI and adds a DuckDB query layer for the stuff the CLI search can't do yet (multi-domain, boolean logic, aggregations, thread co-participant analysis).

What's included

- SKILL.md: core skill with verified JSON output shapes, search strategy, and safety rules
- scripts/query.sh: helper that wraps common DuckDB queries so agents don't need raw SQL for everyday operations
- references/duckdb-queries.md: full Parquet schema + query patterns for when the helper doesn't cover it
- references/workflows.md: multi-step patterns for sender graphs, classification pipelines, batch export

How it was tested

All documented commands, jq patterns, and DuckDB queries were verified against a ~755k message archive. The JSON output shapes were checked field-by-field against live data (caught a few surprises: to/cc/bcc are arrays of objects, search uses from_email not from, etc.).

The query.sh script respects MSGVAULT_HOME and checks for the analytics cache before running.

Notes
Closes #230