Add Claude Code skill with DuckDB query layer by 0xDarkMatter · Pull Request #236 · wesm/msgvault

0xDarkMatter · 2026-03-31T00:47:24Z

Hey Wes — thanks for the nudge on #230, here's a PR!

This adds a Claude Code skill under skills/claude-code/ that covers the full CLI and adds a DuckDB query layer for the stuff the CLI search can't do yet (multi-domain, boolean logic, aggregations, thread co-participant analysis).

What's included

SKILL.md — core skill with verified JSON output shapes, search strategy, and safety rules

scripts/query.sh — helper that wraps common DuckDB queries so agents don't need raw SQL for everyday operations:

query.sh senders 50 --after 2020-01-01
query.sh by-domain gmail.com,hotmail.com
query.sh classify example.com,supplier.co
query.sh threads alice@example.com
query.sh labels

references/duckdb-queries.md — full Parquet schema + query patterns for when the helper doesn't cover it
references/workflows.md — multi-step patterns for sender graphs, classification pipelines, batch export

How it was tested

All documented commands, jq patterns, and DuckDB queries were verified against a ~755k message archive. The JSON output shapes were checked field-by-field against live data (caught a few surprises — to/cc/bcc are arrays of objects, search uses from_email not from, etc).

The query.sh script respects MSGVAULT_HOME and checks for the analytics cache before running.

Notes

Sits alongside the MCP server as an alternative for Claude Code users (per the discussion in Consider a Claude Code skill as alternative to MCP server #230)
The DuckDB layer is the main differentiator — it bypasses the CLI search limitations entirely
Happy to adjust the directory structure or naming if you have a preferred convention

Closes #230

Adds a Claude Code skill for msgvault that covers the full CLI surface and includes direct DuckDB queries against the Parquet analytics cache for operations the CLI search can't handle (boolean logic, multi-domain, aggregations, thread analysis). Includes: - SKILL.md with verified JSON output shapes, search strategy, safety rules - scripts/query.sh helper wrapping common DuckDB patterns (9 subcommands) - references/duckdb-queries.md with full Parquet schema and query patterns - references/workflows.md with multi-step analysis patterns Tested against a ~755k message archive. All documented commands, jq patterns, and DuckDB queries verified against live data. Ref: wesm#230

roborev-ci · 2026-03-31T00:51:55Z

roborev: Combined Review (`1a6460c`)

Verdict: Changes are not ready to merge; there is 1 high-severity issue and 3 medium-severity issues to address.

High

SQL injection across query.sh commands
Location: query.sh:33-39, query.sh:49-51, query.sh:57-67, query.sh:73-80, query.sh:85-93, query.sh:98-112, query.sh:126-148
Multiple subcommands build DuckDB SQL by directly interpolating untrusted arguments, including limits, dates, domain lists, email addresses, and label names. This is a security issue, not just a correctness bug: crafted input can alter the query, and DuckDB can access local files. Use strict validation per argument type and avoid string concatenation for SQL construction; prefer parameterized execution or safe intermediate tables/files.

Medium

senders misparses optional arguments when flags come first
Location: query.sh:33
The senders subcommand always treats the first argument as a numeric limit, so a documented call like query.sh senders --after 2020-01-01 turns into LIMIT --after and fails. Parse flags before consuming an optional positional limit, or only treat the first argument as a limit when it is not an option.
Thread-analysis workflow relies on fields search --json does not return
Location: workflows.md:128
The workflow reads .to and .cc from msgvault search --json output, but the documented schema says recipient fields are not included there. As written, the recipe cannot extract the participants it claims to compute. Use show-message per message ID or query message_recipients directly.
Workflow snippets assume zero-result searches still emit JSON
Location: workflows.md:57, workflows.md:76, workflows.md:138
Several examples pipe msgvault search --json straight into jq, but the documented behavior for zero matches is non-JSON error text. Those snippets will fail or miscount on empty results. Guard on exit status or normalize empty results to [] before piping to jq.

Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

… query - Add input validation to all query.sh subcommands (integers, dates, domains, emails, labels) to prevent SQL injection via crafted arguments - Fix senders subcommand to accept flags before or after optional limit - Fix thread analysis workflow to use query.sh instead of search --json (search does not return to/cc fields) - Guard all search-to-jq pipelines against non-JSON empty results - Add note about sql subcommand passing input unvalidated

The & character in the bash regex character class caused a parse error. Switched to denylist approach (reject single quotes, semicolons, backslashes) which is more robust for label names containing special characters like &.

roborev-ci · 2026-03-31T01:52:21Z

roborev: Combined Review (`714bc8d`)

Verdict: Changes are generally solid, but there are 2 medium-severity correctness issues that should be fixed before merge.

Medium

skills/claude-code/scripts/query.sh:98
The senders argument parser silently ignores unknown tokens. A typo like --afetr 2024-01-01 can degrade into an unfiltered query instead of failing, which may return plausible but incorrect analytics output.
Recommended fix: reject unknown flags and unexpected positional arguments, and fail when an option is missing its value.
skills/claude-code/scripts/query.sh:66
validate_label rejects labels containing ', but Gmail labels can legitimately include apostrophes. This causes label-messages to fail for valid labels in real archives.
Recommended fix: escape single quotes for SQL literals instead of banning them, or use a dedicated SQL-quoting helper.

Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

Security: - Add duckdb binary existence check before running queries - Tighten domain validation: reject underscores, require start/end with alphanumeric (closes injection via underscore identifiers) - Add write-operation guard to sql subcommand: blocks DROP, DELETE, INSERT, UPDATE, CREATE, ALTER, COPY TO - Add security note to SKILL.md about sql subcommand risks Correctness: - Replace shift || true with explicit guard (prevents masked errors) - Add bounds check to validate_int (1-100000) Completeness: - Add build-cache and sync-full to SKILL.md Quick Reference - Add MSGVAULT_HOME path note to duckdb-queries.md - Document analytics cache prerequisite in DuckDB section

roborev-ci · 2026-03-31T05:20:21Z

roborev: Combined Review (`f916909`)

Verdict: Changes are not PR-ready due to one medium-severity safety issue in the sql helper.

Medium

skills/claude-code/scripts/query.sh:228, skills/claude-code/scripts/query.sh:240

The sql subcommand is documented as read-only, but its write-operation guard is a small, case-sensitive blacklist that is easy to bypass. Lowercase or mixed-case statements, unlisted write-capable DuckDB syntax, or variants such as copy ... to ... can still execute, which breaks the stated safety boundary and could allow file writes or state changes.

Recommended fix: Replace the blacklist with a strict allowlist of read-only statement types after normalizing input, ideally permitting only commands that begin with SELECT or WITH (and, if needed, other explicitly read-only forms like EXPLAIN, DESCRIBE, or SHOW). If possible, also run DuckDB in a genuinely read-only/sandboxed mode.

Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

The write-operation guard used a case-sensitive blacklist that could be bypassed with lowercase or mixed-case statements. Replace with a strict allowlist that normalizes input to uppercase and only permits SELECT, WITH, EXPLAIN, DESCRIBE, SHOW, and PRAGMA statements.

roborev-ci · 2026-03-31T06:31:23Z

roborev: Combined Review (`04b30ac`)

Verdict: changes are not ready to merge due to two medium-severity issues in query.sh.

Medium

Thread participant matching does not match the documented sender-only behavior
Location: skills/claude-code/scripts/query.sh:165-167
The threads subcommand is described as finding co-participants in threads for a given sender, but the seed query matches the address in any recipient role (from, to, cc, bcc). This broadens results and can include unrelated threads where the address only appeared as a recipient.
Suggested fix: Add AND r.recipient_type = 'from' to target_threads, or update the command/help text to clearly state that it matches any participant rather than only senders.
sql allowlist can be bypassed with multi-statement input
Location: skills/claude-code/scripts/query.sh:217, skills/claude-code/scripts/query.sh:217-220
The sql subcommand claims to allow only read-only DuckDB statements, but the implementation only validates the first token before passing the full string to duckdb -c. Because DuckDB accepts multiple statements, inputs like SELECT ...; COPY ... TO ... or SELECT ...; INSTALL ...; LOAD ... would bypass the restriction and execute non-read-only behavior.
Suggested fix: Enforce single-statement execution, reject semicolons / statement separators, and avoid broad PRAGMA allowance unless you explicitly enumerate safe read-only pragmas. A stronger fix is to parse and validate the full statement list or remove this subcommand for any untrusted input path.

Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

- Reject semicolons in sql subcommand input to prevent multi-statement bypass (e.g. "SELECT 1; DROP TABLE messages") - Remove PRAGMA from allowlist (can modify DuckDB state) - Clarify threads subcommand matches any participant role (from/to/cc/bcc) not just senders — this is intentional for "who else is on threads involving this person" use case. Updated help text to document this.

roborev-ci · 2026-03-31T07:31:55Z

roborev: Combined Review (`e4c2afb`)

Verdict: 1 medium-severity issue should be addressed before merge.

Medium

skills/claude-code/scripts/query.sh:229-237
The new SQL allowlist can be bypassed via EXPLAIN ANALYZE on side-effecting statements. The current prefix check accepts queries starting with EXPLAIN and passes the full statement through to DuckDB, which breaks the intended read-only guarantee.
Recommended fix: either remove EXPLAIN from the allowlist, or explicitly reject EXPLAIN ANALYZE and validate the explained statement itself against the read-only whitelist.

Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

EXPLAIN ANALYZE executes the underlying statement, so allowing EXPLAIN breaks the read-only guarantee. Remove EXPLAIN from the allowlist entirely — agents rarely need it and can use DESCRIBE/SHOW instead. Allowlist is now: SELECT, WITH, DESCRIBE, SHOW.

roborev-ci · 2026-03-31T08:33:16Z

roborev: Combined Review (`0c960a1`)

Verdict: No medium-or-higher findings; the reviewed changes look clean.

Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

0xDarkMatter added 2 commits March 31, 2026 12:02

fix: label validator regex syntax error in bash

714bc8d

The & character in the bash regex character class caused a parse error. Switched to denylist approach (reject single quotes, semicolons, backslashes) which is more robust for label names containing special characters like &.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add Claude Code skill with DuckDB query layer#236

Add Claude Code skill with DuckDB query layer#236
0xDarkMatter wants to merge 7 commits intowesm:mainfrom
0xDarkMatter:feat/claude-code-skill

0xDarkMatter commented Mar 31, 2026

Uh oh!

roborev-ci bot commented Mar 31, 2026

Uh oh!

roborev-ci bot commented Mar 31, 2026

Uh oh!

roborev-ci bot commented Mar 31, 2026

Uh oh!

roborev-ci bot commented Mar 31, 2026

Uh oh!

roborev-ci bot commented Mar 31, 2026

Uh oh!

roborev-ci bot commented Mar 31, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

0xDarkMatter commented Mar 31, 2026

What's included

How it was tested

Notes

Uh oh!

roborev-ci bot commented Mar 31, 2026

roborev: Combined Review (1a6460c)

High

Medium

Uh oh!

roborev-ci bot commented Mar 31, 2026

roborev: Combined Review (714bc8d)

Medium

Uh oh!

roborev-ci bot commented Mar 31, 2026

roborev: Combined Review (f916909)

Medium

Uh oh!

roborev-ci bot commented Mar 31, 2026

roborev: Combined Review (04b30ac)

Medium

Uh oh!

roborev-ci bot commented Mar 31, 2026

roborev: Combined Review (e4c2afb)

Medium

Uh oh!

roborev-ci bot commented Mar 31, 2026

roborev: Combined Review (0c960a1)

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

roborev: Combined Review (`1a6460c`)

roborev: Combined Review (`714bc8d`)

roborev: Combined Review (`f916909`)

roborev: Combined Review (`04b30ac`)

roborev: Combined Review (`e4c2afb`)

roborev: Combined Review (`0c960a1`)