Reliable dataset merge-on-upload + restore model privacy filter #45
Merged
Replace the destructive wholesale-overwrite upload with a safe download -> union-merge -> re-redact -> reupload in push_to_huggingface:
- merge_jsonl_union in jsonl_tools.py unions remote+local by (source, session_id) (avoids start_time/project key drift), keeps the superset on conflict, and guarantees merged_total >= remote_total so a remote shrink is impossible.
- Carried-forward remote records are re-run through the CURRENT redaction pipeline (transform_session + Anonymizer, built exactly as the export loop does) so old-policy redaction is never re-published. Idempotent.
- Download fails closed: only EntryNotFoundError/RepositoryNotFoundError count as "empty remote"; any other error aborts the push rather than overwriting the remote with a local-only file.
- Reads remote records raw (preserves originalFile), uses parent_commit for optimistic concurrency with 412 retry, updates meta["sessions"] to the merged total.
Tests: tests/test_jsonl_merge.py covers carry-forward, dedup-by-sid, superset selection, originalFile preservation, and re-redaction.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
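The union-merge described in this commit can be illustrated with a minimal sketch. This is not the code of merge_jsonl_union; the helper names, the use of message count as the "superset" test, and the tie-break in favour of the local record are all assumptions made for illustration.

```python
from __future__ import annotations

from typing import Iterable


def merge_key(record: dict) -> tuple:
    # Key on (source, session_id) only, as the commit describes.
    return (record.get("source"), record.get("session_id"))


def record_size(record: dict) -> int:
    # Assumption: the "superset" is approximated by the message count.
    return len(record.get("messages", []))


def merge_union_sketch(remote: Iterable[dict], local: Iterable[dict]) -> list[dict]:
    merged: dict[tuple, dict] = {}
    for rec in remote:
        key = merge_key(rec)
        if key not in merged or record_size(rec) > record_size(merged[key]):
            merged[key] = rec
    remote_total = len(merged)  # unique remote keys

    for rec in local:
        key = merge_key(rec)
        # Assumption: on a tie, prefer the freshly exported local record.
        if key not in merged or record_size(rec) >= record_size(merged[key]):
            merged[key] = rec

    # Union invariant: the merge may grow the dataset but never shrink it.
    if len(merged) < remote_total:
        raise RuntimeError("merge would shrink the remote dataset")
    return list(merged.values())
```

Because records are only added or replaced under an existing key, the invariant check cannot fire in this sketch; it is kept as a guard against later changes to the merge logic.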
Recover the never-merged dataclaw/privacy_filter.py (token-classification
PII NER) and make the UI "Privacy filter" toggle real:
- Default model openai/privacy-filter (Apache-2.0, the originally-intended
model), overridable via privacy_filter.model config / env var.
- Fix the field-walk gap: scan/redact messages[].content_parts too (HEAD's
transform_session walks it; the recovered code did not).
- Wire the model filter into the export loop (serial + parallel worker),
applied in-memory right after transform_session, so edits are baked into
the file confirm hashes and publish enforces. Reads
privacy_filter.{enabled(default false),device,min_score,model} from config.
- Graceful degradation: if torch/transformers/model is unavailable, warn
once and continue with mechanical redaction only -- never abort the export.
- Lazy-import privacy_filter so the CLI never pulls in torch unless enabled.
- Drop the dead shard/manifest layer; keep dict/text functions + oversized
guard. Restore the pii optional-dependency extra + pytest marker.
torch stays out of the Mac sidecar by default (enabled=false); the model
downloads on first use.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
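The lazy-import and graceful-degradation behaviour described in this commit might look roughly like the sketch below. The function name load_model_filter_sketch and the module-level flag are hypothetical; only the module path dataclaw/privacy_filter.py comes from the commit message.

```python
import importlib
import warnings

_warned_once = False  # hypothetical module-level flag for "warn once"


def load_model_filter_sketch(enabled, module_name="dataclaw.privacy_filter"):
    """Return the filter module, or None to fall back to mechanical redaction."""
    global _warned_once
    if not enabled:
        # Disabled: never import, so torch is never pulled in.
        return None
    try:
        # Imported only on demand, inside the function.
        return importlib.import_module(module_name)
    except ImportError as exc:
        if not _warned_once:
            warnings.warn(
                f"model privacy filter unavailable ({exc}); "
                "continuing with mechanical redaction only"
            )
            _warned_once = True
        return None
```

A caller would treat a None result as "skip the model pass" rather than as an error, so the export continues either way.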
1. Carried-forward remote records now get the model privacy filter, not just mechanical redaction. _build_carry_forward_redactor reads the privacy-filter config and applies _apply_model_privacy_filter after transform_session, mirroring the export loop. Without this, the bulk of a steady-state push (carry-forwards) shipped with mechanical redaction only when the model filter was enabled.
2. merge_jsonl_union counts remote_total/local_total by UNIQUE merge key instead of raw line count. A remote file from the old non-deduping uploader could contain duplicate (source, session_id) lines; counting raw lines made merged_total < remote_total trip the union-invariant guard and permanently block publishing even though no session was lost.
Tests: duplicate-remote-key invariant; carry-forward applies model filter.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
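The difference between counting raw lines and counting unique merge keys can be shown with a small sketch. The helper count_unique_keys_sketch is hypothetical and only illustrates why duplicate remote lines would trip a raw-line invariant.

```python
import json


def count_unique_keys_sketch(jsonl_text: str) -> int:
    """Count distinct (source, session_id) pairs in a JSONL string."""
    keys = set()
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        rec = json.loads(line)
        keys.add((rec.get("source"), rec.get("session_id")))
    return len(keys)


# A remote file written by a non-deduping uploader: 3 lines, 2 sessions.
remote_text = "\n".join(
    json.dumps(r)
    for r in [
        {"source": "claude", "session_id": "a"},
        {"source": "claude", "session_id": "a"},  # duplicate key
        {"source": "claude", "session_id": "b"},
    ]
)
raw_lines = len(remote_text.splitlines())
unique_remote = count_unique_keys_sketch(remote_text)
merged_total = 2  # a deduplicated merge keeps one record per key
```

Here merged_total >= raw_lines fails (2 < 3) even though no session was lost, while merged_total >= unique_remote holds.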
…rupt-line guard
Addresses the scale/elegance findings from the merge review.
Keystone - redaction-policy stamp:
- Stamp each exported record with redaction_policy_version() (hash of redact strings/usernames + model config + a code version). Computed in both export paths (serial + parallel worker), written after fingerprinting.
- The merge's carry-forward redactor now SKIPS re-redaction for any record already stamped with the current version, re-redacting (and re-stamping) only stale ones. This turns a steady-state push from O(total history) -- which re-ran the PII model over the whole dataset every push -- into ~O(new data), while preserving the tighten-only guarantee (policy change bumps the version and forces a one-time full re-scan).
Cheap wins:
- Skip ALL uploads when the merged file is byte-identical to the remote we just downloaded, so a no-change push stops churning the repo with empty commits (the metadata timestamp alone otherwise differed every run).
- _load_raw_records preserves unparseable JSONL lines verbatim instead of raising: a corrupt remote line can no longer drop data or permanently wedge all future pushes. Counted in MergeStats.malformed_preserved.
Tests: policy-version sensitivity, carry-forward stamp-skip vs re-redact, malformed-line preservation + invariant.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
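A rough sketch of the policy stamp and the skip logic follows. The field name POLICY_FIELD, the CODE_VERSION constant, the 16-character truncated SHA-256, and both function names are assumptions for illustration; the commit only states that the version is a hash of the redact strings/usernames, the model config, and a code version.

```python
import hashlib
import json

CODE_VERSION = "1"                  # hypothetical code-version constant
POLICY_FIELD = "redaction_policy"   # hypothetical record field name


def redaction_policy_version_sketch(redact_strings, usernames, model_config):
    """Hash everything that influences redaction into a short stamp."""
    payload = json.dumps(
        {
            "redact": sorted(redact_strings),
            "users": sorted(usernames),
            "model": model_config,
            "code": CODE_VERSION,
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


def carry_forward_sketch(record, current_version, redact):
    """Re-redact and re-stamp only records with a stale or missing stamp."""
    if record.get(POLICY_FIELD) == current_version:
        return record  # already redacted under the current policy
    updated = dict(redact(record))
    updated[POLICY_FIELD] = current_version
    return updated
```

Any change to the inputs yields a different stamp, so every previously stamped record looks stale exactly once and is re-scanned on the next push.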
Makes DataClaw's data cleansing and dataset updating reliable. Came out of a multi-perspective audit + sense-check of the cleanse/merge/upload pipeline.
What this fixes
1. Uploads no longer destroy prior data. The old path overwrote the entire remote conversations.jsonl with the latest local export, so any session not reproducible locally right now (second machine, log rotation, narrower --source, cleared ~/.claude) was silently deleted. Now it downloads the remote, union-merges by (source, session_id), and reuploads. Fail-closed on download error (never overwrites remote with a local-only file); parent_commit optimistic concurrency with 412 retry; invariant merged >= remote.
2. The "Privacy filter" toggle is real again. A complete model-based PII filter (dataclaw/privacy_filter.py) had been written but never merged to main, and its deps were dropped, so the UI toggle was cosmetic. Recovered it, fixed the content_parts field-walk gap, wired it into both export paths after mechanical redaction, reads privacy_filter.enabled/.device from config, degrades gracefully if torch is absent. Default model openai/privacy-filter (Apache-2.0), overridable via config/env.
3. Carried-forward records get re-redacted under the CURRENT policy, so re-publishing old remote data never resurfaces content the user has since tightened away (mechanical + model).
4. Updates are O(new data), not O(history). A redaction-policy stamp lets the merge skip re-redacting records already current, instead of re-running the PII model over the whole dataset every push, while still forcing a one-time full re-scan whenever the policy tightens.
5. Cleaner updates. No-op pushes skip all uploads (no empty-commit churn); corrupt remote lines are preserved verbatim instead of dropping data or wedging future pushes.
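The fail-closed download rule in item 1 can be sketched as follows. To keep the example self-contained, the two exception classes here are stand-ins for the EntryNotFoundError and RepositoryNotFoundError named in the commits, and the download argument is a hypothetical callable that returns the remote file's bytes.

```python
class EntryNotFoundStandIn(Exception):
    """Stand-in for the 'file not in repo' error (assumption)."""


class RepoNotFoundStandIn(Exception):
    """Stand-in for the 'repo does not exist' error (assumption)."""


# Only these errors mean "the remote is empty".
EMPTY_REMOTE_ERRORS = (EntryNotFoundStandIn, RepoNotFoundStandIn)


def fetch_remote_or_abort(download):
    """Return remote bytes, b"" for a known-empty remote, else propagate."""
    try:
        return download()
    except EMPTY_REMOTE_ERRORS:
        return b""
    # Any other exception (network, auth, server error) is deliberately not
    # caught: the push aborts instead of overwriting the remote with a
    # local-only file.
```

The design choice is an allow-list: the set of errors that may be read as "nothing to merge" is explicit, and everything else stops the push.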
Tests
549 passing (+ a live model-detection test gated behind the pii marker). New coverage for carry-forward, dedup-by-session-id, superset selection, originalFile preservation, re-redaction, dedup invariant, policy-stamp skip, and malformed-line preservation.
Deferred (documented in docs/reliability-merge-and-privacy-filter-plan.md)
Streaming merge (memory at multi-GB), sharding (full re-download/upload per push), atomic create_commit, compaction semantics, model-load publish-time signal, git_branch redaction.
🤖 Generated with Claude Code