
Reliable dataset merge-on-upload + restore model privacy filter#45

Merged
peteromallet merged 5 commits into main from reliability/merge-and-privacy-filter
May 30, 2026

@peteromallet
Owner

Makes DataClaw's data cleansing and dataset updating reliable. Came out of a multi-perspective audit + sense-check of the cleanse/merge/upload pipeline.

What this fixes

1. Uploads no longer destroy prior data. The old path overwrote the entire remote conversations.jsonl with the latest local export — any session not reproducible locally right now (second machine, log rotation, narrower --source, cleared ~/.claude) was silently deleted. Now it downloads the remote, union-merges by (source, session_id), and reuploads. Fail-closed on download error (never overwrites remote with a local-only file); parent_commit optimistic concurrency with 412 retry; invariant merged >= remote.

2. The "Privacy filter" toggle is real again. A complete model-based PII filter (dataclaw/privacy_filter.py) had been written but never merged to main, and its deps were dropped, so the UI toggle was cosmetic. This PR recovers it, fixes the content_parts field-walk gap, and wires it into both export paths after mechanical redaction. It reads privacy_filter.enabled/.device from config and degrades gracefully if torch is absent. Default model: openai/privacy-filter (Apache-2.0), overridable via config/env.

3. Carried-forward records get re-redacted under the CURRENT policy — so re-publishing old remote data never resurfaces content the user has since tightened away (mechanical + model).

4. Updates are O(new data), not O(history). A redaction-policy stamp lets the merge skip re-redacting records already current, instead of re-running the PII model over the whole dataset every push — while still forcing a one-time full re-scan whenever the policy tightens.

5. Cleaner updates. No-op pushes skip all uploads (no empty-commit churn); corrupt remote lines are preserved verbatim instead of dropping data or wedging future pushes.
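
As an illustration of item 1, here is a minimal sketch of a union merge keyed on (source, session_id). It is not the code from this PR: the function names are hypothetical, and "superset" is approximated by comparing the length of a `messages` list, which is an assumption about the record shape.

```python
from typing import Dict, Iterable, List, Tuple

Key = Tuple[str, str]


def merge_key(record: dict) -> Key:
    # Key on (source, session_id) only; other fields may drift between exports.
    return (record["source"], record["session_id"])


def richer(a: dict, b: dict) -> dict:
    # ASSUMPTION: the "superset" record is the one with more messages; ties keep `a`.
    return b if len(b.get("messages", [])) > len(a.get("messages", [])) else a


def union_merge(remote: Iterable[dict], local: Iterable[dict]) -> List[dict]:
    merged: Dict[Key, dict] = {}
    remote_keys = set()
    for rec in remote:
        key = merge_key(rec)
        remote_keys.add(key)
        merged[key] = richer(merged[key], rec) if key in merged else rec
    for rec in local:
        key = merge_key(rec)
        merged[key] = richer(merged[key], rec) if key in merged else rec
    # Defensive guard for the stated invariant: the merge never shrinks the remote.
    if len(merged) < len(remote_keys):
        raise RuntimeError("union invariant violated: merged < remote")
    return list(merged.values())
```

Sessions that exist only on the remote survive the merge, which is the property the old overwrite path lacked.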

Tests

549 passing (+ a live model-detection test gated behind the pii marker). New coverage for carry-forward, dedup-by-session-id, superset selection, originalFile preservation, re-redaction, dedup invariant, policy-stamp skip, and malformed-line preservation.

Deferred (documented in docs/reliability-merge-and-privacy-filter-plan.md)

Streaming merge (memory at multi-GB), sharding (full re-download/upload per push), atomic create_commit, compaction semantics, model-load publish-time signal, git_branch redaction.

🤖 Generated with Claude Code

peteromallet and others added 5 commits May 30, 2026 01:39
Replace the destructive wholesale-overwrite upload with a safe
download -> union-merge -> re-redact -> reupload in push_to_huggingface:

- merge_jsonl_union in jsonl_tools.py unions remote+local by
  (source, session_id) (avoids start_time/project key drift), keeps the
  superset on conflict, and guarantees merged_total >= remote_total so a
  remote shrink is impossible.
- Carried-forward remote records are re-run through the CURRENT redaction
  pipeline (transform_session + Anonymizer, built exactly as the export
  loop does) so old-policy redaction is never re-published. Idempotent.
- Download fails closed: only EntryNotFoundError/RepositoryNotFoundError
  count as "empty remote"; any other error aborts the push rather than
  overwriting the remote with a local-only file.
- Reads remote records raw (preserves originalFile), uses parent_commit
  for optimistic concurrency with 412 retry, updates meta["sessions"] to
  the merged total.
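
The fail-closed download rule above can be sketched as follows. This is a hedged illustration, not the PR's implementation: the exception classes are local stand-ins for huggingface_hub's EntryNotFoundError and RepositoryNotFoundError so the snippet runs without the library, and `fetch_remote_or_abort` and `PushAborted` are hypothetical names.

```python
from typing import Callable, Optional


class EntryNotFound(Exception):
    """Stand-in for huggingface_hub's EntryNotFoundError."""


class RepoNotFound(Exception):
    """Stand-in for huggingface_hub's RepositoryNotFoundError."""


class PushAborted(Exception):
    """Raised when the remote state is unknown, so overwriting it is unsafe."""


def fetch_remote_or_abort(download: Callable[[], bytes]) -> Optional[bytes]:
    try:
        return download()
    except (EntryNotFound, RepoNotFound):
        # The file or repo genuinely does not exist: treat as an empty remote.
        return None
    except Exception as exc:
        # Network, auth, or rate-limit failures mean "unknown", not "empty".
        raise PushAborted(f"remote download failed: {exc!r}") from exc
```

Only a `None` return lets the push continue with a local-only file; any other failure stops before anything is uploaded.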

Tests: tests/test_jsonl_merge.py covers carry-forward, dedup-by-sid,
superset selection, originalFile preservation, and re-redaction.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Recover the never-merged dataclaw/privacy_filter.py (token-classification
PII NER) and make the UI "Privacy filter" toggle real:

- Default model openai/privacy-filter (Apache-2.0, the originally-intended
  model), overridable via privacy_filter.model config / env var.
- Fix the field-walk gap: scan/redact messages[].content_parts too (HEAD's
  transform_session walks it; the recovered code did not).
- Wire the model filter into the export loop (serial + parallel worker),
  applied in-memory right after transform_session, so the edits are baked
  into the file that confirm hashes and publish enforces. Reads
  privacy_filter.{enabled(default false),device,min_score,model} from config.
- Graceful degradation: if torch/transformers/model is unavailable, warn
  once and continue with mechanical redaction only -- never abort the export.
- Lazy-import privacy_filter so the CLI never pulls in torch unless enabled.
- Drop the dead shard/manifest layer; keep dict/text functions + oversized
  guard. Restore the pii optional-dependency extra + pytest marker.

torch stays out of the Mac sidecar by default (enabled=false); the model
downloads on first use.
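
A rough sketch of the lazy-import and warn-once behaviour described in this commit. The loader function and the `redact_session` attribute are hypothetical names used for illustration; only the module path dataclaw.privacy_filter comes from the PR.

```python
import importlib
import warnings
from typing import Callable, Optional

_warned = False


def load_model_filter(
    enabled: bool, module_name: str = "dataclaw.privacy_filter"
) -> Optional[Callable[[dict], dict]]:
    global _warned
    if not enabled:
        # Disabled: return before any import, so torch is never pulled in.
        return None
    try:
        module = importlib.import_module(module_name)
        return module.redact_session  # HYPOTHETICAL entry point name
    except (ImportError, AttributeError) as exc:
        if not _warned:
            warnings.warn(
                f"privacy filter unavailable ({exc}); "
                "continuing with mechanical redaction only"
            )
            _warned = True
        return None
```

A caller would treat a `None` result as "mechanical redaction only" and keep exporting rather than abort.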

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
1. Carried-forward remote records now get the model privacy filter, not
   just mechanical redaction. _build_carry_forward_redactor reads the
   privacy-filter config and applies _apply_model_privacy_filter after
   transform_session, mirroring the export loop. Without this, the bulk of
   a steady-state push (carry-forwards) shipped with mechanical redaction
   only when the model filter was enabled.

2. merge_jsonl_union counts remote_total/local_total by UNIQUE merge key
   instead of raw line count. A remote file from the old non-deduping
   uploader could contain duplicate (source, session_id) lines; counting
   raw lines made merged_total < remote_total trip the union-invariant
   guard and permanently block publishing even though no session was lost.
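
To make point 2 concrete, here is a small sketch of an invariant check that counts unique merge keys rather than raw lines. The helper names are hypothetical and the record shape is assumed.

```python
from typing import Iterable, List, Tuple


def unique_key_count(records: Iterable[dict]) -> int:
    # Count distinct (source, session_id) pairs, not raw lines.
    return len({(r["source"], r["session_id"]) for r in records})


def check_union_invariant(remote: List[dict], merged: List[dict]) -> Tuple[int, int]:
    remote_total = unique_key_count(remote)
    merged_total = unique_key_count(merged)
    if merged_total < remote_total:
        raise RuntimeError(
            f"union invariant violated: {merged_total} < {remote_total}"
        )
    return remote_total, merged_total
```

With a remote of three lines that hold only two distinct keys, a two-record merge passes this check, whereas a raw line count (3 vs. 2) would have blocked the push.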

Tests: duplicate-remote-key invariant; carry-forward applies model filter.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…rupt-line guard

Addresses the scale/elegance findings from the merge review.

Keystone - redaction-policy stamp:
- Stamp each exported record with redaction_policy_version() (hash of redact
  strings/usernames + model config + a code version). Computed in both export
  paths (serial + parallel worker), written after fingerprinting.
- The merge's carry-forward redactor now SKIPS re-redaction for any record
  already stamped with the current version, re-redacting (and re-stamping) only
  stale ones. This turns a steady-state push from O(total history) -- which
  re-ran the PII model over the whole dataset every push -- into ~O(new data),
  while preserving the tighten-only guarantee (policy change bumps the version
  and forces a one-time full re-scan).
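
The stamp-and-skip idea can be sketched as below. This is an illustration under assumptions, not the PR's code: `policy_version` takes its inputs as arguments (the real redaction_policy_version() signature is not shown in the PR), and `CODE_VERSION` and the `redaction_policy` field name are hypothetical.

```python
import hashlib
import json
from typing import Callable, Iterable, List

CODE_VERSION = "1"  # HYPOTHETICAL: bumped when the redaction code itself changes


def policy_version(redact_strings: Iterable[str], usernames: Iterable[str],
                   model_config: dict) -> str:
    # Hash everything that affects redaction output; sorting keeps it stable.
    payload = json.dumps(
        {
            "strings": sorted(redact_strings),
            "usernames": sorted(usernames),
            "model": model_config,
            "code": CODE_VERSION,
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


def carry_forward(records: Iterable[dict], current: str,
                  redact: Callable[[dict], dict]) -> List[dict]:
    out = []
    for rec in records:
        # "redaction_policy" is an assumed field name for the stamp.
        if rec.get("redaction_policy") != current:
            rec = dict(redact(rec))  # stale or unstamped: re-redact
            rec["redaction_policy"] = current
        out.append(rec)
    return out
```

Records already stamped with the current version pass through untouched, so the expensive `redact` call only runs on new or stale records.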

Cheap wins:
- Skip ALL uploads when the merged file is byte-identical to the remote we just
  downloaded, so a no-change push stops churning the repo with empty commits
  (the metadata timestamp alone otherwise differed every run).
- _load_raw_records preserves unparseable JSONL lines verbatim instead of
  raising: a corrupt remote line can no longer drop data or permanently wedge
  all future pushes. Counted in MergeStats.malformed_preserved.
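
Both cheap wins can be sketched in a few lines. The functions below are hypothetical stand-ins (the real _load_raw_records is not shown in the PR), and treating any non-object JSON line as malformed is an assumption.

```python
import json
from typing import List, Optional, Tuple, Union

Record = Union[dict, str]  # a str is an unparseable line kept verbatim


def load_raw_records(text: str) -> Tuple[List[Record], int]:
    records: List[Record] = []
    malformed = 0
    for line in text.split("\n"):
        if not line.strip():
            continue
        try:
            parsed = json.loads(line)
        except json.JSONDecodeError:
            parsed = None
        if isinstance(parsed, dict):
            records.append(parsed)
        else:
            # ASSUMPTION: anything that is not a JSON object is kept as-is.
            records.append(line)
            malformed += 1
    return records, malformed


def dump_records(records: List[Record]) -> str:
    # Malformed lines are written back unchanged; dicts are re-serialized.
    return "".join((r if isinstance(r, str) else json.dumps(r)) + "\n" for r in records)


def needs_upload(merged: bytes, remote: Optional[bytes]) -> bool:
    # Skip the upload entirely when nothing changed byte-for-byte.
    return remote is None or merged != remote
```

The malformed count plays the role of MergeStats.malformed_preserved, and `needs_upload` returning False corresponds to the no-op push that creates no commit.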

Tests: policy-version sensitivity, carry-forward stamp-skip vs re-redact,
malformed-line preservation + invariant.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@peteromallet peteromallet merged commit f36ac63 into main May 30, 2026
4 of 5 checks passed
@peteromallet peteromallet deleted the reliability/merge-and-privacy-filter branch May 30, 2026 00:29