Skip to content

SK-536: Add sx vault copy for lossless cross-vault migration#152

Merged
detkin merged 18 commits into
mainfrom
detkin/sx-vault-copy-client
Jun 1, 2026
Merged

SK-536: Add sx vault copy for lossless cross-vault migration#152
detkin merged 18 commits into
mainfrom
detkin/sx-vault-copy-client

Conversation

@detkin
Copy link
Copy Markdown
Contributor

@detkin detkin commented Jun 1, 2026

Summary

Adds sx vault copy --from <profile> --to <profile> — copy a whole vault (teams, bots, assets + all versions, installation scopes, audit history, usage history) between any backends (skills.new, git, path). Wires the Sleuth vault to the new GraphQL surface from sleuth-io/pulse#3954 and adds a backend-agnostic copy engine.

  • Sleuth vault wiring: team/user-scoped installs via the new installations input; team-repository management via updateTeam + repo-GID resolution; GetUsageStats/ReadUsageEvents via the new assetUsageEvents export; RecordUsageEvents sends actor; ImportAuditEvents; reads asset scopes from skills.new via the installations query.
  • internal/vaultcopy engine: teams → bots → assets(+versions+scopes) → audit → usage. Best-effort (item failures become warnings); always previews, requires --yes to apply.
  • Paginates assetAuditLog at the server's 50-cap (was a single first=1000 call); chunks audit-import and usage POSTs for large vaults.

Depends on: sleuth-io/pulse#3954 (the GraphQL operations this calls). Already deployed to app.skills.new.

Testing

Exercised both directions against deployed skills.new with two real orgs (Docker): skills.new → git vault, then git vault → fresh org. Validated assets+versions, teams, bots, scopes, audit (15,885 events round-tripped), and usage (chunked). Round-trip unit test (path→path) covers losslessness + idempotency + org-wide assets.

Known limitations (candidate follow-ups)

  • Multi-scope asset → skills.new destination currently keeps only the last scope (being addressed).
  • Cross-org copy requires referenced users/repos to exist in the destination org (warns otherwise).
  • Bot API keys are not copyable (regenerate on the destination).

Security implications of changes have been considered

detkin added 8 commits June 1, 2026 13:21
- Sync vendored schema; add importAuditEvents + assetUsageEvents ops
- Team/user-scoped SetAssetInstallation via installations input
- GetUsageStats via assetUsageEvents (paged) + shared summarizer
- RecordUsageEvents sends actor for original attribution
- New Vault.ImportAuditEvents: lossless audit-history import
- Path/Git append JSONL verbatim (git via vault tx); Sleuth
  delegates to importAuditEvents, chunked to the 5000 cap
- Round-trip test on a path vault
- AddTeamRepository/RemoveTeamRepository via updateTeam skillsRepositories
- Resolve repo identifiers (URL/slug/SSH) to GIDs by paging org repos
- Re-send members on update so the replace preserves membership
- Unit test for trailingOwnerName
- internal/vaultcopy: backend-agnostic engine copying teams, bots,
  assets+versions, installation scopes, audit, and usage between vaults
- Vault.ReadUsageEvents (raw usage read) for path/git/sleuth
- ManifestAssetScopes capability for complete file-backed scope reads
- sx vault copy --from/--to with --only, --dry-run, always-preview-then-yes
- Path-to-path round-trip test asserting losslessness + idempotency
- Register org-wide (scopeless) assets on the destination instead of
  copying files without an install row
- ManifestAssetScopes returns presence so files-only assets aren't
  mis-registered as org-wide
- Count scopes during dry-run preview; simplify version counting
- Warn when source teams exceed the list limit (truncation)
- Round-trip test covers an org-wide asset
- QueryAuditEvents pages assetAuditLog at the server's 50-cap with
  endCursor (was a single first=1000 call that the server rejects)
- Add AssetInstallScopes: read an asset's installation scopes from any
  vault (Sleuth via installations query; file-backed via manifest), so
  scopes copy even when the source is skills.new
- Engine uses the backend-neutral assetScopeReader capability
- importAuditEvents: encode event data as a JSON string for the
  JSONString scalar (was sending a raw object; server rejected with
  "must be str, not dict"). Verified against deployed skills.new.
- Copy engine: a category-level failure (e.g. audit) is now a warning
  and the copy continues, so it can't abort later categories (usage).
- RecordUsageEvents posts in batches of 1000 instead of one giant
  /api/usage request, so a whole-vault usage import is enqueued in
  manageable chunks
@detkin detkin requested review from inesp, mrdon and mzgajner as code owners June 1, 2026 22:21
@claude
Copy link
Copy Markdown

claude Bot commented Jun 1, 2026

Claude finished @detkin's task in 3m 43s —— View job


Review: PR #152 — Add sx vault copy

Solid implementation of a backend‑agnostic copy engine; the main concerns are (1) a 50‑asset cap on Sleuth ListAssets that quietly breaks the lossless claim for the Sleuth source, (2) a slug/name mismatch in AssetInstallScopes that can drop scopes silently, and (3) the preview path replays full audit/usage reads twice for large vaults.

[HIGH] Sleuth source silently caps assets to 50 per type — breaks lossless claim

Location: internal/vault/sleuth.go:617-622 (combined with internal/vaultcopy/vaultcopy.go:65,156)

Issue: The copy engine asks for assetListLimit = 10000 from ListAssets, but SleuthVault.listAssetsByType silently clamps any value > 50 back to 50, and there is no pagination loop wrapping that call. The Sleuth backend documents the 50 cap, but the result is that copying from a skills.new vault with more than 50 assets of any one type (skills, MCPs, agents, ...) will silently truncate. There is no warning emitted at the copy layer either — ListAssets returns no HasMore/TotalCount/cursor, so copyAssets cannot even detect the truncation. The PR description claims this is lossless, but for any org with >50 skills it is not.

Suggestion: Either page AssetUsageEvents-style through vaultgql.VaultAssets inside listAssetsByType (preferred — actually lossless), or at minimum surface the cap to the caller (return a HasMore/TotalCount like ListTeamsResult does, and have copyAssets warn loudly when truncation is possible). The existing teams/audit/usage pagination patterns in this PR are the right template.

[HIGH] AssetInstallScopes can silently drop scopes due to slug vs. name mismatch

Location: internal/vault/sleuth_mgmt.go:438-470

Issue: AssetInstallScopes is called by copyAssets with a.Name, which for Sleuth is populated from node.GetSlug() (see sleuth.go:652). Inside AssetInstallScopes, the filter is strings.EqualFold(resp.Vault.Assets.Nodes[i].GetName(), name) — it compares the display name of each returned asset against what is in fact a slug. When an asset's display name differs from its slug (e.g. "My Skill" vs my-skill), the loop never matches; the function returns (nil, false, nil), and copyAssets interprets that as "files-only / no scopes to register". The asset gets uploaded to the destination but its installation scopes are silently dropped. assetInfoByName already established that slug-first matching is the right pattern.

Suggestion: Match on slug and name (slug wins), mirroring assetInfoByName:

if strings.EqualFold(resp.Vault.Assets.Nodes[i].GetSlug(), name) ||
    strings.EqualFold(resp.Vault.Assets.Nodes[i].GetName(), name) {

Also guard against the search returning more than 50 hits for very generic slugs — pin to exact match server-side if possible, or page.

[MEDIUM] Preview replays the full audit + usage reads, doubling wall time on large vaults

Location: internal/commands/vault_copy.go:79-104 and internal/vaultcopy/vaultcopy.go:263-285

Issue: runVaultCopy calls vaultcopy.Copy(..., DryRun: true) for the preview and, on --yes, calls it again for the apply. Both invocations pull the entire audit log and entire usage history from the source: copyAudit does src.QueryAuditEvents(ctx, mgmt.AuditFilter{}) and uses len(events) for the report. The PR description mentions 15,885 audit events round-tripped — that means 318 paginated queries to count them in the preview, then 318 more for the apply (plus another ~318 chunked import calls into the destination). A preview should not pay the cost of a full read.

Suggestion: Either (a) thread a "counts only" mode through the engine so DryRun can use a cheap TotalCount from the connection rather than fetching every node, or (b) cache the read events on the Report in DryRun mode and reuse them in the apply pass. (a) keeps preview light; (b) avoids the double network read entirely.

[MEDIUM] Audit/usage import is non-idempotent; partial failure causes duplicates on retry

Location: internal/vault/sleuth_mgmt.go:1337-1386 (ImportAuditEvents) and :1222-1264 (RecordUsageEvents)

Issue: Both methods chunk through events with for start := 0; start < len(events); start += chunkSize. If chunk N fails (network error, context cancel, server 500), the function returns and chunks < N have already been imported server-side. The copy engine catches the error at the stage level and turns it into a warning, but a re-run of sx vault copy --yes would re-send all events from the start — the server's importAuditEvents mutation has no deduplication (per the schema), so audit history would be silently duplicated.

Suggestion: At minimum, document this in the --yes help text and the copy audit failed warning ("re-running may duplicate audit history"). A stronger fix is to track the last successful chunk in the destination (e.g. cursor by (timestamp, target) and skip events already present) — but that's a follow-up. The roundtrip test only exercises a 1-event import so this edge isn't covered.

[LOW] Bot creation against Sleuth destination auto-issues an API key that is silently discarded

Location: internal/vaultcopy/vaultcopy.go:144-147

Issue: copyBots calls dst.CreateBot(ctx, mgmt.Bot{...}) and discards the returned rawToken. For the Sleuth destination, CreateBot auto-issues a default API key that is returned exactly once on the create response and is otherwise unrecoverable (see comment at sleuth_mgmt.go:662-676). The copy throws that token away — every newly-created bot on the destination now has an active key the user can never use, and the warning at line 150 just tells them to "regenerate". The auto-issued key is essentially orphaned.

Suggestion: After CreateBot returns, immediately call DeleteBotApiKey for the auto-issued key (if the keyID is exposed) or surface a cleaner instruction to the user. At minimum, the warning should say "newly-created bots have an unusable auto-issued key — revoke it with sx bot key delete <bot> <auto-issued-id> and create a fresh one with sx bot key create."

[LOW] Scope copy ordering can leave assets uploaded without scopes if scope read fails

Location: internal/vaultcopy/vaultcopy.go:163-201

Issue: For each asset, copyAssets uploads every version first and only then calls AssetInstallScopes. If the scope read errors after uploads succeed, the asset content is in the destination but no installation rows are recorded — the destination has a dangling, unreferenced asset version set. The error becomes a warning and the user sees a "copied N assets" report with no visible signal that some assets are stranded.

Suggestion: Either (a) read the scope set up front (before any uploads) so a read failure aborts the asset cleanly with no half-state, or (b) include a separate Report.AssetsMissingScopes counter so the report makes the dangling-asset case visible alongside Versions and Scopes.

[LOW] `sleuthBotListSoftCap = 1000` is silently relied on by the copy engine

Location: internal/vault/sleuth_mgmt.go:520-587 combined with internal/vaultcopy/vaultcopy.go:133-153

Issue: copyBots does src.ListBots(ctx), but for Sleuth that returns at most ~1000 (no pagination — only a warn-at-1000 soft cap). An org with >1000 bots would silently lose bot rows in the copy, and the report would show Bots: 1000 with no other signal. Same shape of issue as the assets cap, just at a higher threshold.

Suggestion: Either page the bot list (when/if the server adds a connection), or have copyBots defensively check for len(bots) == sleuthBotListSoftCap and emit a warning that bot count may be truncated. The pattern from listTeamNodesFiltered (it already logs at saturation) could be replicated end-to-end through the copy report.

[LOW] `assetUsageEvents` query selects `botName` but the value is never used

Location: internal/vault/graphql/operations/asset_usage_events.graphql:29 and internal/vault/sleuth_mgmt.go:1308-1319

Issue: The GraphQL query asks for botName, but ReadUsageEvents does not map it onto mgmt.UsageEvent (which has no Bot field), so the field is fetched and discarded on every page. Wasted bytes per row × 15k+ rows during a large export.

Suggestion: Drop botName from asset_usage_events.graphql until mgmt.UsageEvent is extended to carry it (or extend mgmt.UsageEvent if bot attribution matters for the copy). Either direction; the current shape is just dead.

[LOW] Round-trip test only exercises path→path with one audit/usage event each

Location: internal/vaultcopy/vaultcopy_roundtrip_test.go

Issue: The test imports a single audit event and a single usage event, so the chunking paths in sleuth_mgmt.go (ImportAuditEvents/RecordUsageEvents) — which is the actually-novel logic in this PR — are not exercised by automated tests. The PR description says these were validated against a live skills.new instance with 15,885 events, but that one-off manual run doesn't survive regressions. Path-only tests also can't catch the slug/name mismatch flagged above.

Suggestion: Add a unit test that seeds more events than sleuthAuditImportChunkSize (5000) and sleuthUsagePostChunkSize (1000) and asserts chunking calls the underlying writer the right number of times — even with a fake Vault implementation, since the chunking is in the Sleuth methods, not the engine. The slug/name mismatch could be covered by a focused unit test that stubs the GraphQL response.

· [Branch](https://github.com/sleuth-io/sx/tree/detkin/sx-vault-copy-client)

- Sleuth SetAssetInstallations sets an asset's full install set in a
  single server call, so multi-scope assets no longer lose all but the
  last scope to the replace-on-set mutation
- Engine prefers the bulk call for >1 scope, falls back to per-target
  on error (preserves best-effort partial success)
- Unit tests for the dispatch logic
@claude
Copy link
Copy Markdown

claude Bot commented Jun 1, 2026

Claude finished @detkin's task in 2m 55s —— View job


PR Review: #152 — Add sx vault copy

Multi-scope replace‑on‑set and chunked usage POSTs are now fixed, but two HIGH issues from the prior review (the 50‑asset cap on the Sleuth source, and the slug/name mismatch in AssetInstallScopes) are still present and will silently corrupt copies from skills.new in real-world conditions.

[HIGH] Sleuth source still silently caps assets at 50 per type — breaks the lossless claim

Location: internal/vault/sleuth.go:617-622 (consumed by internal/vaultcopy/vaultcopy.go:155-201)

Issue: listAssetsByType still clamps any caller‑supplied limit > 50 back to 50 and does not paginate, even though copyAssets passes assetListLimit = 10000. There is no pagination loop, no HasMore returned, and no warning. Copying from a skills.new vault with more than 50 assets of any single type (skills, MCPs, agents, …) will silently truncate, and the report will report only the first 50 — exactly the "lossless" promise in the PR description. The new commit 434cfb6 fixed multi‑scope replace‑on‑set but did not address this cap.

Suggestion: Page through vaultgql.VaultAssets inside listAssetsByType using the connection's pageInfo.endCursor (mirroring the new audit/usage pagination), or at minimum surface a HasMore/TotalCount so copyAssets can warn loudly. The current shape silently loses data.

[HIGH] `AssetInstallScopes` still uses name-only matching against an input that is a slug — scopes silently dropped

Location: internal/vault/sleuth_mgmt.go:493-507 and internal/vault/graphql/operations/asset_installations.graphql:1-15

Issue: copyAssets calls AssetInstallScopes(ctx, a.Name), where a.Name for a Sleuth source is populated from node.GetSlug() (sleuth.go:652). Inside, the loop is still strings.EqualFold(resp.Vault.Assets.Nodes[i].GetName(), name) — comparing the display name against what is actually a slug. The query in asset_installations.graphql does not even select slug, so a slug-based comparison isn't an option until the operation is updated. When display name and slug differ (e.g. "My Skill" vs my-skill) the node is never matched, AssetInstallScopes returns (nil, false, nil), and copyAssets interprets that as "files only / no scopes" — uploading the asset to the destination with no installation scopes registered. The asset becomes a dangling, never‑installed file set on the destination. assetInfoByName already establishes slug-first matching as the right pattern.

Suggestion: Add slug to asset_installations.graphql's nodes { ... } selection and match slug-first, name-fallback:

if strings.EqualFold(node.GetSlug(), name) ||
    strings.EqualFold(node.GetName(), name) {

Also pin the search: argument to an exact match server-side, or page — the 50-row cap on the search response can independently hide the right node on a generic slug.

[MEDIUM] Preview still replays the full audit + usage reads, doubling wall time on large vaults

Location: internal/commands/vault_copy.go:79-104 and internal/vaultcopy/vaultcopy.go:290-312

Issue: runVaultCopy calls vaultcopy.Copy(..., DryRun: true) for the preview and, on --yes, calls it again for the apply. Both copyAudit/copyUsage invocations pull the entire audit log and entire usage history from the source, only to use len(events) for the report on the dry-run pass. With the PR's stated 15,885 audit events at the 50-per-page server cap that is ~318 paginated queries just to populate the preview's count, then another ~318 for the apply — plus chunked imports. This still hasn't been addressed.

Suggestion: Either thread a "counts only" mode through Copy (cheap TotalCount from the connection when DryRun is set), or cache the read events on the Report in DryRun mode and reuse them on the apply pass. The former keeps preview light; the latter eliminates the double network read.

[MEDIUM] Audit/usage import remains non-idempotent — retry duplicates events

Location: internal/vault/sleuth_mgmt.go:1277-1319 (RecordUsageEvents) and :1392-1441 (ImportAuditEvents)

Issue: Both methods chunk in a for start := 0; start < len(events); start += chunkSize loop and return on the first error. If chunk N fails, chunks < N have already landed on the destination, but the engine catches the error at the stage level and turns it into a warning. A user who re-runs sx vault copy --yes will re-send all events from event 0, and the server's importAuditEvents/postUsageStats have no dedup — the duplicate Co-Authored-By of an audit re-import would show up as ghost history. The roundtrip test only exercises a 1-event import, so this edge isn't covered.

Suggestion: At minimum, update the copy audit failed / copy usage failed warning text to say "re-running may duplicate audit/usage history; consider importing into an empty destination". A stronger fix is to dedupe by (timestamp, target, actor, event) server-side or stream a resume cursor, but that's a follow-up.

[LOW] Bot creation discards the auto-issued API key — destination accumulates orphan keys

Location: internal/vaultcopy/vaultcopy.go:144-147 (combined with sleuth_mgmt.go:705-731)

Issue: copyBots does dst.CreateBot(ctx, mgmt.Bot{...}) and ignores the returned rawToken. For Sleuth, CreateBot auto-issues a default API key, returns the raw token once, and there is no recovery path. The copy throws it away — every newly‑created bot on a skills.new destination ends up with an active, unrecoverable API key sitting on the server. The user gets bot API keys are not copyable; regenerate them on the destination, with no instruction to clean up the orphan.

Suggestion: Either delete the auto-issued key immediately after CreateBot returns (if DeleteBotApiKey accepts the key by ID returned in the create payload), or sharpen the warning to: newly-created bots on the destination have an auto-issued API key you can't see — revoke it with 'sx bot key delete <bot>' and create a fresh one.

[LOW] Scope read happens after version upload — read failure leaves stranded asset versions

Location: internal/vaultcopy/vaultcopy.go:163-201

Issue: For each asset, copyAssets uploads every version first and only then calls AssetInstallScopes. If the scope read errors after uploads succeed, the asset content is on the destination but no installation rows are recorded — the destination has a dangling, unreferenced asset version set. The error becomes a warning; the report just shows "copied N assets" with no signal that some are stranded. Combined with the slug/name mismatch above, this is the path through which scopes get silently dropped today.

Suggestion: Read scopes before uploading versions, so a scope-read failure aborts the asset cleanly with no half-state. Or add a Report.AssetsMissingScopes counter so the dangling case is visible alongside Versions and Scopes.

[LOW] `sleuthBotListSoftCap = 1000` is silently relied on by the copy engine

Location: internal/vault/sleuth_mgmt.go:568-642 consumed by internal/vaultcopy/vaultcopy.go:133-153

Issue: copyBots does src.ListBots(ctx). For Sleuth, listBotNodes returns at most ~1000 (no pagination — bots is a plain list, not a connection — with a warnIfBotListSoftCap heuristic). An org with >1000 bots silently loses bot rows during copy, and the report shows Bots: 1000 with no further signal. Same shape as the assets cap, just at a higher threshold.

Suggestion: Either upgrade bots to a paginated connection on the server, or have copyBots defensively check for len(bots) == sleuthBotListSoftCap and propagate that as a Report.Warnings entry (not just a structured-log line).

[LOW] `assetUsageEvents` query selects `botName` but the value is never consumed

Location: internal/vault/graphql/operations/asset_usage_events.graphql:29 and internal/vault/sleuth_mgmt.go:1362-1375

Issue: The query asks for botName, but ReadUsageEvents does not map it onto mgmt.UsageEvent (which has no Bot field). Wasted bytes per row on every paginated export.

Suggestion: Drop botName from asset_usage_events.graphql until mgmt.UsageEvent carries bot attribution, or extend mgmt.UsageEvent if bot attribution matters for the copy.

[LOW] Round-trip test does not exercise the chunking paths that are the novel logic in this PR

Location: internal/vaultcopy/vaultcopy_roundtrip_test.go:58-67

Issue: The test seeds 1 audit event and 1 usage event. The new chunking logic in sleuth_mgmt.go (ImportAuditEvents chunked at 5000, RecordUsageEvents chunked at 1000) is not covered by automated tests — the only validation is the 15,885-event manual run mentioned in the PR description, which doesn't survive regressions. Path-only tests also can't catch the slug/name mismatch flagged above.

Suggestion: Add a unit test that seeds more events than sleuthAuditImportChunkSize and asserts the chunk boundaries (via a fake server transport or by intercepting PostUsageStats/vaultgql.ImportAuditEvents). Plus a unit test that stubs the GraphQL AssetInstallations response with name != slug and exercises AssetInstallScopes to catch the slug/name regression.

· [Branch](https://github.com/sleuth-io/sx/tree/detkin/sx-vault-copy-client)

@detkin detkin changed the title Add sx vault copy for lossless cross-vault migration SK-536: Add sx vault copy for lossless cross-vault migration Jun 1, 2026
@linear-code
Copy link
Copy Markdown

linear-code Bot commented Jun 1, 2026

SK-536

- AssetInstallScopes returns present=false when a skills.new asset has
  no installations (only an explicit ORGANIZATION install is org-wide),
  so the copy no longer marks uninstalled source assets org-wide.
  (Note: skills.new still defaults uploads to org-wide server-side —
  separate follow-up.)
@claude
Copy link
Copy Markdown

claude Bot commented Jun 1, 2026

Claude finished @detkin's task in 2m 16s —— View job


PR Review: #152 — Add sx vault copy

The latest commits fix multi-scope replace-on-set and tighten "zero-install ≠ org-wide", but the two HIGH issues from the prior review (the 50-asset cap on the Sleuth source and the slug/name mismatch in AssetInstallScopes) are still present and will silently corrupt copies from skills.new with realistic data.

[HIGH] Sleuth source still silently caps assets at 50 per type — breaks the lossless claim

Location: internal/vault/sleuth.go:617-622 (consumed by internal/vaultcopy/vaultcopy.go:155-201)

Issue: listAssetsByType still clamps any caller-supplied limit > 50 back to 50 and does not paginate, even though copyAssets passes assetListLimit = 10000. No pagination loop, no HasMore returned, no warning. Copying from a skills.new vault with more than 50 assets of any single type (skills, MCPs, agents, …) silently truncates, and the report shows only the first 50 — the opposite of the "lossless" claim. None of the recent commits (5c5acf9, 434cfb6, da33380) touched this path.

Suggestion: Page through vaultgql.VaultAssets inside listAssetsByType using the connection's pageInfo.endCursor (mirroring the new audit/usage pagination in sleuth_mgmt.go:1455-1485 and :1341-1390), or at minimum surface a HasMore/TotalCount from ListAssetsResult so copyAssets can warn loudly when the source is truncated.

[HIGH] `AssetInstallScopes` still uses name-only matching against an input that is actually a slug — scopes silently dropped

Location: internal/vault/sleuth_mgmt.go:493-507 and internal/vault/graphql/operations/asset_installations.graphql:1-15

Issue: copyAssets calls AssetInstallScopes(ctx, a.Name), where a.Name for a Sleuth source is populated from node.GetSlug() (sleuth.go:652). Inside, the loop is still strings.EqualFold(resp.Vault.Assets.Nodes[i].GetName(), name) — comparing the display name against what is actually a slug. The AssetInstallations query doesn't even select slug, so a slug-based comparison isn't even possible without an operation change. When display name and slug differ (e.g. "My Skill" vs my-skill), the node never matches, the function returns (nil, false, nil) (which copyAssets now interprets as "asset uploaded but not registered" since 5c5acf9), and the destination ends up with the asset file content but no installation scope — silently dangling. assetInfoByName already establishes slug-first matching as the right pattern.

Suggestion: Add slug to asset_installations.graphql's nodes { … } selection and match slug-first, name-fallback in AssetInstallScopes:

if strings.EqualFold(node.GetSlug(), name) ||
    strings.EqualFold(node.GetName(), name) {

Also pin the search: argument to an exact match server-side, or page — the 50-row cap on the search response can independently hide the right node on a common slug stem.

[MEDIUM] Preview still replays the full audit + usage reads, doubling wall time on large vaults

Location: internal/commands/vault_copy.go:79-104 and internal/vaultcopy/vaultcopy.go:290-312

Issue: runVaultCopy calls vaultcopy.Copy(..., DryRun: true) for the preview and, on --yes, calls it again to apply. Both copyAudit/copyUsage invocations pull the entire audit log and entire usage history from the source on the dry-run pass solely to populate len(events) for the report. With the PR's 15,885 audit events at a 50-per-page server cap, that's ~318 paginated queries just to show the preview count, then another ~318 for apply (plus chunked imports into the destination). Still not addressed by the recent commits.

Suggestion: Either thread a "counts only" mode through Copy (cheap TotalCount from the connection when DryRun is set), or cache the events on the Report in DryRun mode and reuse them on apply. The former keeps preview lightweight; the latter eliminates the double network read entirely.

[MEDIUM] Audit/usage import remains non-idempotent — retry duplicates events

Location: internal/vault/sleuth_mgmt.go:1285-1327 (RecordUsageEvents) and :1400-1449 (ImportAuditEvents)

Issue: Both methods chunk in for start := 0; start < len(events); start += chunkSize and return on the first error. If chunk N fails (network blip, server 500, ctx cancel), chunks < N have already landed; the engine catches and demotes the error to a warning at vaultcopy.go:90. A user who re-runs sx vault copy --yes re-sends all events from 0 — importAuditEvents and postUsageStats have no dedup, so audit/usage history silently doubles. The roundtrip test only exercises a 1-event import so this regression has no automated coverage.

Suggestion: At minimum, update the warning text in vaultcopy.go:90 to say "re-running may duplicate audit/usage history; consider importing into an empty destination". A stronger follow-up is to dedup server-side by (timestamp, target, actor, event) or stream a resume cursor — but that's tracked.

[LOW] Bot creation discards the auto-issued API key — destination accumulates orphan keys

Location: internal/vaultcopy/vaultcopy.go:144-147 (combined with sleuth_mgmt.go:725-738)

Issue: copyBots does dst.CreateBot(ctx, mgmt.Bot{…}) and ignores the returned rawToken. The Sleuth CreateBot auto-issues a default API key, returns the raw token once (see the comment at sleuth_mgmt.go:725-731), and there's no recovery path. The copy drops it; every newly-created bot on a skills.new destination ends up with an active, unrecoverable API key sitting on the server. The user just sees "bot API keys are not copyable; regenerate them on the destination" with no instruction to clean up the orphan.

Suggestion: Either delete the auto-issued key immediately after CreateBot returns (would need the keyID exposed from the create response), or sharpen the warning: "newly-created bots on the destination have an auto-issued API key you can't see — revoke it with 'sx bot key delete <bot>' and create a fresh one with 'sx bot key create'".

[LOW] Scope read happens after version upload — read failure leaves stranded asset versions

Location: internal/vaultcopy/vaultcopy.go:163-201

Issue: For each asset, copyAssets uploads every version first and only then calls AssetInstallScopes. If the scope read errors after uploads succeed, the asset content is on the destination but no installation rows are recorded — a dangling, unreferenced asset version set. The error becomes a warning and the report just shows "copied N assets" with no signal that some are stranded. Combined with the slug/name mismatch above, this is the silent path through which scopes get dropped today.

Suggestion: Read scopes before uploading versions, so a scope-read failure aborts the asset cleanly. Or add a Report.AssetsMissingScopes counter so the dangling case is visible alongside Versions and Scopes.

[LOW] `sleuthBotListSoftCap = 1000` is silently relied on by the copy engine

Location: internal/vault/sleuth_mgmt.go:583-650 consumed by internal/vaultcopy/vaultcopy.go:133-153

Issue: copyBots does src.ListBots(ctx). For Sleuth, listBotNodes returns at most ~1000 (no pagination — bots is a plain GraphQL list with only a warnIfBotListSoftCap log heuristic). An org with >1000 bots silently loses rows during copy, and the user-visible report shows Bots: 1000 with no other signal. The structured-log warning never surfaces in the CLI output. Same shape as the assets cap, just at a higher threshold.

Suggestion: Either upgrade bots to a paginated connection server-side, or have copyBots check len(bots) >= sleuthBotListSoftCap (export the constant) and propagate as a Report.Warnings entry rather than just a log line.

[LOW] `assetUsageEvents` query selects `botName` but the value is never consumed

Location: internal/vault/graphql/operations/asset_usage_events.graphql:29 and internal/vault/sleuth_mgmt.go:1371-1382

Issue: The query asks for botName, but ReadUsageEvents doesn't map it onto mgmt.UsageEvent (which has no Bot field). Wasted bytes per row on every paginated export — at 15k+ rows it adds up.

Suggestion: Drop botName from asset_usage_events.graphql until mgmt.UsageEvent carries bot attribution, or extend mgmt.UsageEvent if bot attribution matters for the copy. The current shape is dead weight.

[LOW] Round-trip test does not exercise the chunking paths that are the novel logic in this PR

Location: internal/vaultcopy/vaultcopy_roundtrip_test.go:58-67

Issue: The test seeds 1 audit event and 1 usage event. The new chunking logic in sleuth_mgmt.go (ImportAuditEvents at 5000, RecordUsageEvents at 1000) is not covered by automated tests — the only validation is the 15,885-event manual run mentioned in the PR description, which won't survive regressions. The path-only test also can't catch the slug/name mismatch flagged above; the slug path only exists in the Sleuth backend.

Suggestion: Add a unit test that seeds more events than sleuthAuditImportChunkSize and asserts chunk boundaries (via a fake transport intercepting vaultgql.ImportAuditEvents / PostUsageStats). A separate focused unit test that stubs the AssetInstallations GraphQL response with name != slug would catch the slug/name regression.

· [Branch](https://github.com/sleuth-io/sx/tree/detkin/sx-vault-copy-client)

- HIGH: paginate Sleuth ListAssets (was capped at 50/type, silently
  truncating); prod now reads 90 assets/354 versions vs 79/249
- HIGH: AssetInstallScopes matches on slug (caller passes the slug),
  not display name — scopes were silently dropped when they differ
- MEDIUM: --yes applies directly instead of preview-then-apply, so a
  large vault's audit/usage history isn't read twice
- MEDIUM: document that audit/usage import is additive (retry duplicates)
- LOW: drop unused botName from the usage export query
@claude
Copy link
Copy Markdown

claude Bot commented Jun 1, 2026

Claude finished @detkin's task in 2m 15s —— View job


PR Review: #152 — Add sx vault copy

Both HIGH issues from the prior review are now fixed (asset pagination + slug-first scope matching), and the MEDIUM concerns are addressed at the minimum suggested level (--yes skips the preview re-read; idempotency limitation is documented); three LOWs remain (bot auto-issued key, scope-read-after-upload, bot-list cap surfacing) plus thin chunking test coverage.

[RESOLVED ✓] 50-asset cap on Sleuth source

Location: internal/vault/sleuth.go:629-665

Verified: listAssetsByType now pages through vaultgql.VaultAssets with endCursor until !HasNextPage, bounded by opts.Limit. Page size is sleuthAssetsPageSize = 50 (the server cap). The PR description confirms 90 assets / 354 versions are now read where the prior cap yielded 79 / 249.

[RESOLVED ✓] AssetInstallScopes slug vs name mismatch

Location: internal/vault/sleuth_mgmt.go:493-507 + internal/vault/graphql/operations/asset_installations.graphql:5

Verified: slug is now selected on the GraphQL nodes, and the loop matches EqualFold(node.GetSlug(), name) || EqualFold(node.GetName(), name). Slug-first matching with name fallback — matches the assetInfoByName pattern. Scopes will no longer silently drop when display name ≠ slug.

[RESOLVED ✓] Preview replays full audit + usage reads

Location: internal/commands/vault_copy.go:84-94

Verified: --yes now calls Copy exactly once with DryRun: false and skips the preview pass entirely. The double full-read on apply is eliminated. (The --dry-run path still pulls the entire audit/usage history to populate the report count — that's the explicit "I asked for a preview" cost, which is reasonable, though a future cheap-count optimization would still be welcome.)

[RESOLVED ✓] `botName` selected but unused

Location: internal/vault/graphql/operations/asset_usage_events.graphql:23-29

Verified: botName is no longer in the selection set. Wasted bytes per row eliminated.

[ADDRESSED] Audit/usage import non-idempotency now documented

Location: internal/commands/vault_copy.go:42-44

Verified: The command long-help now states "audit and usage import is additive — re-running a copy whose audit/usage stage previously failed part-way will duplicate the already-imported events on the destination". This matches the suggested minimum mitigation. Stronger server-side dedup (by (timestamp, target, actor, event)) is still worth filing as a follow-up — the warning currently lives only in the long-help, not in the runtime warning text emitted at vaultcopy.go:90 when copy audit failed fires. Consider also appending "re-running may duplicate already-imported events" to that runtime warning so operators see it at the moment of failure.

[LOW — open] Bot creation discards the auto-issued API key on Sleuth destinations

Location: internal/vaultcopy/vaultcopy.go:143-150

Issue: copyBots still does dst.CreateBot(ctx, mgmt.Bot{…}) and ignores the returned rawToken. For Sleuth destinations, every newly-created bot ends up with an active, unrecoverable auto-issued API key on the server. The user-facing warning at line 150 ("bot API keys are not copyable; regenerate them on the destination") doesn't tell them they have orphan keys to clean up.

Suggestion: Either revoke the auto-issued key immediately after CreateBot returns (would need the keyID surfaced from the create response), or sharpen the warning text: "newly-created bots on the destination have an auto-issued API key you can't see — revoke it with 'sx bot key delete <bot>' and create a fresh one with 'sx bot key create'". The cleanup-instruction variant is a one-line change and unblocks operators without server-side work.

[LOW — open] Scope read still happens after version upload

Location: internal/vaultcopy/vaultcopy.go:163-201

Issue: For each asset, copyAssets uploads every version first and only then calls AssetInstallScopes. If the scope read errors after uploads succeed, the asset content is on the destination but no installation rows are recorded — a dangling, unreferenced asset version set, and the report shows "copied N assets" with no signal that some are stranded. The slug-fix lands the common case, but a network blip on the scope read still leaves the same shape.

Suggestion: Read scopes before uploading versions so a scope-read failure aborts the asset cleanly with no half-state, or add a Report.AssetsMissingScopes counter so the dangling case is visible in the report alongside Versions and Scopes.

[LOW — open] `sleuthBotListSoftCap = 1000` warning isn't surfaced in the copy report

Location: internal/vault/sleuth_mgmt.go:647-653 consumed by internal/vaultcopy/vaultcopy.go:133-153

Issue: warnIfBotListSoftCap emits a structured-log warning when len(nodes) >= 1000, but copyBots never sees it — the CLI report shows Bots: 1000 with no surface signal that the source may have been truncated. Same shape as the asset cap that was just fixed, just at a higher threshold.

Suggestion: Either upgrade bots to a paginated connection server-side, or have ListBots return a HasMore (mirroring ListTeamsResult) so copyBots can append a Report.Warnings entry. The current structured-log warning is invisible to users running sx vault copy.

[LOW — open] Chunking paths (5000-event audit, 1000-event usage) still uncovered by tests

Location: internal/vaultcopy/vaultcopy_roundtrip_test.go:58-67 and internal/vault/import_audit_e2e_test.go:46-80

Issue: Both the round-trip and the import e2e test seed only 1-3 events, far below sleuthAuditImportChunkSize = 5000 and sleuthUsagePostChunkSize = 1000. The chunking logic is the novel and easily-regressed part of the Sleuth path; today only the 15,885-event manual run from the PR description validates it. Net: a future refactor that breaks chunk boundaries (off-by-one on end, accidentally serial-vs-batched encoding) wouldn't trip any test.

Suggestion: Add a focused unit test against sleuth_mgmt.go that seeds sleuthAuditImportChunkSize + 1 audit events (and the same for usage), uses a fake HTTP/GraphQL transport that records call counts and per-call event counts, and asserts the expected (chunks, last-chunk-size). Even a small bump beyond the boundary catches the regression class.

Fix this →

· [Branch](https://github.com/sleuth-io/sx/tree/detkin/sx-vault-copy-client)

@detkin detkin force-pushed the detkin/sx-vault-copy-client branch from d97b170 to 745e9c2 Compare June 1, 2026 22:58
@claude
Copy link
Copy Markdown

claude Bot commented Jun 1, 2026

Claude finished @detkin's task in 5m 8s —— View job


PR Review: #152 — Add sx vault copy

Both prior HIGH issues are confirmed fixed (asset pagination at sleuth.go:629-665, slug-first scope matching at sleuth_mgmt.go:493-507 + asset_installations.graphql:5) and the prior MEDIUMs are addressed; one new MEDIUM emerges in the bulk-scope fallback path, plus the four previously-flagged LOWs remain open.

[RESOLVED ✓] 50-asset cap, slug/name mismatch, preview replay, `botName` selected-but-unused

Verified:

  • internal/vault/sleuth.go:629-665 now pages through vaultgql.VaultAssets using pageInfo.endCursor with sleuthAssetsPageSize = 50, bounded by opts.Limit. Lossless for any asset count.
  • internal/vault/sleuth_mgmt.go:493-507 matches EqualFold(node.GetSlug(), name) || EqualFold(node.GetName(), name); asset_installations.graphql now selects both slug and name. Scopes will no longer drop silently when display-name ≠ slug.
  • internal/commands/vault_copy.go:84-94 skips the dry-run pass when --yes is set, so the full audit/usage read isn't paid twice. (--dry-run still does the full read by design; reasonable cost for an explicit preview.)
  • asset_usage_events.graphql:23-29 no longer selects botName. Wasted bytes per row eliminated.
[MEDIUM — new] Bulk-scope fallback silently drops scopes on Sleuth destinations

Location: internal/vaultcopy/vaultcopy.go:255-269 (combined with internal/vault/sleuth_mgmt.go:389-431)

Issue: When the bulk SetAssetInstallations fails on a Sleuth destination, copyAssetScopes falls through to per-target SetAssetInstallation. But on Sleuth every SetAssetInstallation call goes through setAssetInstallationsGraphQL, which replaces the asset's whole install set — exactly the failure mode commit 434cfb6 was supposed to eliminate. So with N targets where target K is unresolvable (e.g. an unknown user/team in the destination org), the per-target loop ends with only the last successful target installed; targets 1 … N-1 that succeeded server-side along the way get clobbered by the next call. The comment on line 252-254 ("one unresolvable target doesn't drop the rest — matching best-effort") describes the intent, but the actual end state is the opposite: only one scope survives instead of N-1.

A concrete failure path: targets = [repo-a, user-not-in-dest, team-platform] → bulk fails resolving the user → fallback runs → after the loop only team-platform is installed; repo-a was overwritten by the team-platform call. The report logs r.Scopes += 2 (the two that returned no error from the per-target call), but the destination only carries one of them.

Suggestion: Either drop the failing target from targets and retry bulk with the remaining N-1 (preserves "best-effort" semantics atomically), or for replace-on-set backends abort on bulk failure with a single warning rather than per-target replay. The first preserves the most data; the second is the smaller, safer change. Either way, the current per-target fallback path is a regression of the bug 434cfb6 already fixed.

Fix this →

[LOW — open] Bot creation discards the auto-issued Sleuth API key

Location: internal/vaultcopy/vaultcopy.go:143-150 (combined with sleuth_mgmt.go:716-742)

Issue: copyBots ignores the rawToken returned by dst.CreateBot. On Sleuth the auto-issued key is returned exactly once on the create response (per the comment at sleuth_mgmt.go:728-733) and is unrecoverable. Every newly-created bot ends up with an active, orphaned API key on the destination, and the user-facing warning at line 150 doesn't instruct them to clean it up.

Suggestion: Sharpen the warning to: "newly-created bots on the destination have an auto-issued API key you can't see — revoke it with 'sx bot key delete <bot>' and create a fresh one with 'sx bot key create'". (Auto-revoking the key after create would be cleaner but requires surfacing the keyID from the create payload.)

[LOW — open] Scope read happens after version upload — read failure leaves stranded asset versions

Location: internal/vaultcopy/vaultcopy.go:163-201

Issue: For each asset, copyAssets uploads every version first and only then calls AssetInstallScopes (or scopeReader.AssetInstallScopes on line 194). If the scope read errors after uploads succeed, the asset content is on the destination but no installation rows are recorded — a dangling, unreferenced asset version set. The error becomes a warning and the report just shows "copied N assets" with no signal that some are stranded. With the slug fix the common case is covered, but a network blip on the scope read still hits the same shape.

Suggestion: Read scopes before uploading versions so a scope-read failure aborts the asset cleanly with no half-state, or add a Report.AssetsMissingScopes counter so the dangling case is visible alongside Versions and Scopes.

[LOW — open] `sleuthBotListSoftCap = 1000` warning isn't surfaced in the copy report

Location: internal/vault/sleuth_mgmt.go:647-653 consumed by internal/vaultcopy/vaultcopy.go:133-153

Issue: warnIfBotListSoftCap emits a structured-log warning when len(nodes) >= 1000, but the CLI report just shows Bots: 1000 with no surface signal that the source may be truncated. Same shape as the asset cap that was just fixed, at a higher threshold and not surfaced through Report.Warnings.

Suggestion: Either upgrade bots to a paginated connection server-side, or have ListBots return a HasMore (mirroring ListTeamsResult) so copyBots can append a Report.Warnings entry.

[LOW — open] Chunking paths (5000-event audit, 1000-event usage) still uncovered by tests

Location: internal/vaultcopy/vaultcopy_roundtrip_test.go:58-67 and internal/vault/import_audit_e2e_test.go:33-50

Issue: Both tests seed 1-3 events, far below sleuthAuditImportChunkSize = 5000 and sleuthUsagePostChunkSize = 1000. The chunking logic in sleuth_mgmt.go:1303-1330 and :1407-1449 is the novel and easily-regressed part of the Sleuth path; only the 15,885-event manual run from the PR description validates it. A future refactor that breaks chunk boundaries (off-by-one on end, accidentally serial-vs-batched encoding) wouldn't trip any test. The new import_audit_e2e_test.go only exercises path vault, where chunking doesn't apply.

Suggestion: Add a focused unit test against sleuth_mgmt.go that seeds sleuthAuditImportChunkSize + 1 audit events and the same for usage, uses a fake HTTP/GraphQL transport that records per-call event counts, and asserts the expected (chunks, last-chunk-size). Even a small bump beyond the boundary catches the regression class.

**Fix this →


· Branch

@detkin detkin force-pushed the detkin/sx-vault-copy-client branch from 44d867a to 82abfdb Compare June 1, 2026 23:09
@claude
Copy link
Copy Markdown

claude Bot commented Jun 1, 2026

Claude finished @detkin's task in 3m 33s —— View job


PR Review: #152 — Add sx vault copy

The new MEDIUM from the prior review (bulk-scope fallback clobbering) is now confirmed fixed in 82abfdb via the unresolved return + no-per-target-fallback; the previously-resolved HIGH/MEDIUM items still hold; four LOWs remain open.

[RESOLVED ✓] Bulk-scope fallback clobbering

Location: internal/vault/sleuth_mgmt.go:433-504 + internal/vaultcopy/vaultcopy.go:255-265

Verified: SetAssetInstallations now returns unresolved for team/user/bot targets that don't resolve in the destination (rather than failing the whole call), still applies the resolvable targets atomically in one GraphQL call, and copyAssetScopes emits a per-unresolved-target warning without falling back to per-target calls. The new scopes_test.go covers the bulk-success, bulk-unresolved, bulk-error, append-dest, org-wide, and not-present paths. The "all targets unresolved" edge case at sleuth_mgmt.go:495-499 correctly returns without sending an empty set (which would have wrongly globalized the asset).

[RESOLVED ✓] Prior fixes confirmed

Verified:

  • Asset pagination: internal/vault/sleuth.go:629-665 pages through VaultAssets with pageInfo.endCursor at sleuthAssetsPageSize = 50, bounded by opts.Limit. No more 50-per-type silent truncation.
  • Slug-first scope matching: internal/vault/sleuth_mgmt.go:511-528 + asset_installations.graphql:5 selects slug and matches EqualFold(GetSlug(), name) || EqualFold(GetName(), name). Scopes no longer drop when display name ≠ slug.
  • Preview no longer replayed for --yes: internal/commands/vault_copy.go:88-94 runs Copy(..., DryRun: false) exactly once when --yes is set, skipping the preview pass — the full audit/usage read isn't paid twice on apply.
  • botName removed: asset_usage_events.graphql:23-29 no longer selects the unused field.
  • Idempotency limitation documented: vault_copy.go:42-44 long-help warns that audit/usage import is additive and retry duplicates.
[LOW — open] Bot creation discards the auto-issued Sleuth API key

Location: internal/vaultcopy/vaultcopy.go:143-150 (combined with sleuth_mgmt.go:734-760)

Issue: copyBots still does dst.CreateBot(ctx, mgmt.Bot{…}) and ignores the returned rawToken. On Sleuth, CreateBot auto-issues a default API key and returns the raw token exactly once on the create response (per the comment at sleuth_mgmt.go:746-751) — there is no recovery path. Every newly-created bot on a skills.new destination ends up with an active, unrecoverable auto-issued API key on the server, and the user-facing warning at vaultcopy.go:150 ("bot API keys are not copyable; regenerate them on the destination") doesn't tell them they have orphan keys to revoke.

Suggestion: Sharpen the warning to: "newly-created bots on the destination have an auto-issued API key you can't see — revoke it with 'sx bot key delete <bot>' and create a fresh one with 'sx bot key create'". Or, cleaner: auto-revoke the just-issued key by surfacing the keyID from the create payload and calling DeleteBotApiKey immediately after CreateBot.

[LOW — open] Scope read still happens after version upload — read failure leaves stranded asset versions

Location: internal/vaultcopy/vaultcopy.go:163-201

Issue: For each asset, copyAssets uploads every version first (lines 171-192) and only then calls scopeReader.AssetInstallScopes (line 194). If the scope read errors after uploads succeed, the asset content is on the destination but no installation rows are recorded — a dangling, unreferenced asset version set. The error becomes a warning (r.warnf("read scopes for %q: %v", …)) and the report just shows "copied N assets" with no signal that some are stranded. With the slug-fix the common case is covered, but a network blip on the scope read still hits the same shape.

Suggestion: Read scopes before uploading versions so a scope-read failure aborts the asset cleanly with no half-state. Or add a Report.AssetsMissingScopes counter so the dangling case is visible in the report alongside Versions and Scopes.

[LOW — open] `sleuthBotListSoftCap = 1000` warning isn't surfaced in the copy report

Location: internal/vault/sleuth_mgmt.go:665-671 consumed by internal/vaultcopy/vaultcopy.go:133-153

Issue: warnIfBotListSoftCap emits a structured-log warning when len(nodes) >= 1000, but the CLI report (printReport in vault_copy.go:145-167) shows only Bots: 1000 with no surface signal that the source may be truncated. The structured-log line goes to the logger, not into Report.Warnings, so operators running sx vault copy don't see it. Same shape as the asset cap that was fixed, just at a higher threshold and not propagated.

Suggestion: Either upgrade bots to a paginated connection server-side, or have ListBots return a HasMore (mirroring ListTeamsResult) so copyBots can append a Report.Warnings entry. The current structured-log warning is invisible to users.

[LOW — open] Chunking paths (5000-event audit, 1000-event usage) still uncovered by tests

Location: internal/vaultcopy/vaultcopy_roundtrip_test.go:58-67 and internal/vault/import_audit_e2e_test.go:33-50

Issue: The round-trip test seeds 1 audit event and 1 usage event; the new import_audit_e2e_test.go seeds 2 events. Both are far below sleuthAuditImportChunkSize = 5000 and sleuthUsagePostChunkSize = 1000. The chunking logic in sleuth_mgmt.go:1321-1346 (RecordUsageEvents) and :1425-1458 (ImportAuditEvents) is the novel, easily-regressed part of the Sleuth path; only the 15,885-event manual run from the PR description validates it. A future refactor that breaks chunk boundaries (off-by-one on end, accidentally serial-vs-batched encoding) wouldn't trip any test.

Suggestion: Add a focused unit test against sleuth_mgmt.go that seeds sleuthAuditImportChunkSize + 1 audit events (and sleuthUsagePostChunkSize + 1 usage events), uses a fake HTTP/GraphQL transport that records per-call event counts, and asserts the expected (chunks, last-chunk-size). Even a small bump beyond each boundary catches the regression class.

Fix this →

· [Branch](https://github.com/sleuth-io/sx/tree/detkin/sx-vault-copy-client)

@claude
Copy link
Copy Markdown

claude Bot commented Jun 1, 2026

Claude finished @detkin's task in 3m 26s —— View job


PR Review: #152 — Add sx vault copy

Commit 82abfdb cleanly resolves the prior MEDIUM (the per-target fallback no longer clobbers scopes on Sleuth — the bulk call is now best-effort with unresolved returned for the caller to warn on); one new MEDIUM emerges in AssetInstallScopes (path scopes are silently dropped on read), and four LOWs from the previous round remain open.

[MEDIUM — new] Path-scoped installs silently widen to whole-repo on copy from Sleuth

Location: internal/vault/sleuth_mgmt.go:537-554 and internal/vault/graphql/operations/asset_installations.graphql:7-12

Issue: AssetInstallScopes reads VaultAssetInstallation rows and maps EntityTypeRepository to manifest.ScopeKindRepo only — monoRepoConfigName is fetched in the query (line 11) but never read. On the Sleuth schema, path scoping for monorepos is encoded as a REPOSITORY install with a monoRepoConfigName (the enum has no PATH value — schema.graphql:3427-3433), and on the write side SetAssetInstallations does support RepositoryInstallationInput{Url, Paths} (lines 466-467). The round-trip is therefore lossy in one direction: a path-scoped install (manifest.ScopeKindPath) on the source becomes a whole-repository install on the destination, silently broadening the asset's access.

Concrete failure: source has my-skill scoped to github.com/acme/mono#services/api. After copy, the destination has my-skill installed on the entire github.com/acme/mono repo. The report still shows Scopes: 1 and no warning fires.

Suggestion: Map inst.MonoRepoConfigName to manifest.ScopeKindPath when it is non-empty (with the path stored in Scope.Paths), and leave the ScopeKindRepo mapping for the empty case. Worth also confirming with the server team whether monoRepoConfigName carries a path or a named-config-id — if the latter, you also need a lookup to translate the GID back to the path string used by RepositoryInstallationInput.Paths on write.

Fix this →

[RESOLVED ✓] Bulk-scope fallback no longer clobbers

Verified: internal/vault/sleuth_mgmt.go:434-503 (SetAssetInstallations now returns (unresolved []InstallTarget, err error)) and internal/vaultcopy/vaultcopy.go:255-266 (no per-target fallback — resolvable targets are applied atomically; unresolved ones become warnings). The "all targets unresolved" case is also handled correctly (line 495-499 short-circuits to avoid sending an empty set, which the server would read as org-wide). The new tests in scopes_test.go cover bulk success, partial unresolved, hard error, append-on-set behavior, and present-without-scopes — solid coverage of the dispatch logic.

[LOW — open] Audit/usage runtime warning still doesn't mention re-run duplicates

Location: internal/vaultcopy/vaultcopy.go:90

Issue: The command long-help (internal/commands/vault_copy.go:42-44) now states that re-running a partly-failed copy will duplicate audit/usage. Good. But the runtime warning emitted when copyAudit/copyUsage actually fails (r.warnf("copy %s failed: %v", s.name, err)) is generic — the operator who sees copy audit failed: connection reset has no in-the-moment signal that retrying will double-up history. The long-help is easy to miss when the error is what triggers the action.

Suggestion: Special-case the audit/usage stages with a sharper message, e.g. r.warnf("copy %s failed: %v — re-running may duplicate already-imported events on the destination", s.name, err).

[LOW — open] Bot creation discards the auto-issued Sleuth API key

Location: internal/vaultcopy/vaultcopy.go:143-150 (combined with sleuth_mgmt.go:716-742)

Issue: copyBots ignores the rawToken returned by dst.CreateBot. On Sleuth destinations every newly-created bot ends up with an active, unrecoverable auto-issued API key on the server. The current warning ("bot API keys are not copyable; regenerate them on the destination") doesn't tell the user to revoke the orphan.

Suggestion: Sharpen the warning to include the cleanup instruction: "newly-created bots on the destination have an auto-issued API key you can't see — revoke it with 'sx bot key delete <bot>' and create a fresh one with 'sx bot key create'". (Auto-revoking right after CreateBot would be cleaner but needs the keyID surfaced from the mutation.)

[LOW — open] Scope read after version upload leaves stranded versions on transient read failure

Location: internal/vaultcopy/vaultcopy.go:163-200

Issue: Each asset uploads every version first (lines 171-192), then reads scopes (line 194). If the scope read errors after uploads succeeded, the asset content is on the destination but no installation is recorded — a dangling, never-installed version set. The error becomes a warning and the report just shows "copied N assets". The slug-first fix removes the common silent-drop path but a network blip on the scope read still hits the same shape, and the new path-scope issue above can quietly compound this.

Suggestion: Read scopes before uploading versions so a scope-read failure aborts the asset cleanly. Or add a Report.AssetsMissingScopes counter so the dangling-asset case is visible in the report alongside Versions and Scopes.

[LOW — open] `sleuthBotListSoftCap = 1000` warning still doesn't surface in the copy report

Location: internal/vault/sleuth_mgmt.go:597-632 consumed by internal/vaultcopy/vaultcopy.go:133-153

Issue: warnIfBotListSoftCap emits a structured-log warning when len(nodes) >= 1000, but the CLI report just shows Bots: 1000 with no surface signal. Same shape as the asset cap that was already fixed via pagination — at a higher threshold and not surfaced through Report.Warnings.

Suggestion: Either upgrade bots to a paginated GraphQL connection server-side, or have ListBots return a HasMore (mirroring ListTeamsResult) so copyBots can append a Report.Warnings entry.

[LOW — open] Chunking paths (5000-event audit, 1000-event usage) still uncovered by tests

Location: internal/vault/import_audit_e2e_test.go:34-50 and internal/vaultcopy/vaultcopy_roundtrip_test.go:58-67

Issue: Both tests seed 1-3 events, well below sleuthAuditImportChunkSize = 5000 and sleuthUsagePostChunkSize = 1000. The novel Sleuth chunking logic in sleuth_mgmt.go is only validated by the 15,885-event manual run from the PR description, which won't survive regressions. The new import_audit_e2e_test.go only exercises the path backend, where chunking doesn't apply.

Suggestion: Add a focused unit test against sleuth_mgmt.go that seeds sleuthAuditImportChunkSize + 1 audit events and the same for usage, uses a fake HTTP/GraphQL transport that records per-call event counts, and asserts the expected (chunks, last-chunk-size).

Fix this →

· Branch ·

detkin added 2 commits June 1, 2026 16:18
- AssetInstallScopes resolves a repo install's mono-repo config to its
  include paths (via repository.monoRepoConfigs), emitting a path scope
  instead of silently widening to the whole repo
@claude
Copy link
Copy Markdown

claude Bot commented Jun 1, 2026

Claude finished @detkin's task in 3m 36s —— View job


PR Review: #152 — Add sx vault copy

Both prior HIGH issues remain fixed and the prior-round MEDIUM (path scopes silently widening to whole-repo) is now resolved via monoRepoConfigPaths; one mild new issue is the unused monoRepoConfigName selection (dead bytes pattern), plus the previously-open LOWs (bot key discarded, scope-read-after-upload, bot soft-cap not surfaced in report, chunking test coverage, and runtime audit/usage warning text not mentioning duplicates).

[RESOLVED ✓] Path-scoped installs no longer widen to whole-repo on Sleuth → Sleuth copy

Location: internal/vault/sleuth_mgmt.go:543-555 + internal/vault/sleuth_mgmt.go:572-593 + internal/vault/graphql/operations/repo_mono_configs.graphql

Verified: AssetInstallScopes now branches on inst.MonoRepoConfigId != nil && inst.EntityId != nil and calls monoRepoConfigPaths to resolve the actual subpaths from the source server's repository.monoRepoConfigs. When paths come back populated, the scope is emitted as ScopeKindPath with the real subpaths; only an empty-path config falls back to ScopeKindRepo. The write side at :462-463 was already path-aware (RepositoryInstallationInput{Url: t.Repo, Paths: t.Paths}), so the round-trip is now end-to-end correct.

[RESOLVED ✓] Prior fixes still in place

Verified:

  • Asset pagination: sleuth.go:629-665 pages through VaultAssets with pageInfo.endCursor at sleuthAssetsPageSize = 50, bounded by opts.Limit. No more 50-per-type cap.
  • Slug-first scope matching: sleuth_mgmt.go:518-525 + asset_installations.graphql:5-6 selects both slug and name and matches EqualFold(GetSlug(), name) || EqualFold(GetName(), name).
  • Bulk-scope best-effort: sleuth_mgmt.go:445-503 returns unresolved (no clobbering per-target fallback); vaultcopy.go:255-265 consumes that contract atomically.
  • --yes skips preview pass: vault_copy.go:88-94 calls Copy exactly once.
  • botName removed from asset_usage_events.graphql:23-29; idempotency limitation documented in vault_copy.go:42-44 long-help.
[LOW — new] `monoRepoConfigName` is selected by `asset_installations.graphql` but never consumed

Location: internal/vault/graphql/operations/asset_installations.graphql:12 and internal/vault/sleuth_mgmt.go:543-555

Issue: The new install scope reader matches on MonoRepoConfigId (line 546) and resolves paths via monoRepoConfigPaths. MonoRepoConfigName is fetched but is never read — same shape as the botName issue that was just removed from asset_usage_events.graphql. Wasted bytes per install row on every paginated read of an asset's installations.

Suggestion: Drop monoRepoConfigName from asset_installations.graphql's installations { … } selection — monoRepoConfigId plus the follow-up RepoMonoRepoConfigs query already cover the path resolution path.

Fix this →

[LOW — new] `monoRepoConfigPaths` is N+1 across an asset's path-scoped installs

Location: internal/vault/sleuth_mgmt.go:538-563 (per-installation loop) + :572-593 (per-call fetch)

Issue: For each path-scoped install row, AssetInstallScopes makes one extra GraphQL round-trip to RepoMonoRepoConfigs(repoGID). An asset path-scoped against the same repo twice (e.g. mono#services/api and mono#services/web) fetches the same repo's configs twice; across many assets the same repos are re-fetched repeatedly. For a vault with N assets averaging M path installs, that's N×M extra calls in the copy.

Suggestion: Cache monoRepoConfigPaths results per (repoGID, configGID) for the duration of a single AssetInstallScopes call, or — cheaper — once per copyAssets pass. A map[string]map[string][]string keyed by repoGID → configGID → paths populated lazily would collapse the duplicates without changing call shape.

[LOW — open] Bot creation discards the auto-issued Sleuth API key

Location: internal/vaultcopy/vaultcopy.go:143-150 (combined with sleuth_mgmt.go:773-799)

Issue: copyBots still does dst.CreateBot(ctx, mgmt.Bot{…}) and ignores the returned rawToken. On Sleuth destinations the auto-issued key is returned once (per the comment at sleuth_mgmt.go:785-790) and is unrecoverable. The user-facing warning at vaultcopy.go:150 ("bot API keys are not copyable; regenerate them on the destination") doesn't tell the operator they have orphan keys to revoke.

Suggestion: Sharpen the warning to: "newly-created bots on the destination have an auto-issued API key you can't see — revoke it with 'sx bot key delete <bot>' and create a fresh one with 'sx bot key create'". (Auto-revoking right after CreateBot is cleaner but needs the keyID surfaced from the mutation payload.)

[LOW — open] Scope read still happens after version upload — read failure leaves stranded versions

Location: internal/vaultcopy/vaultcopy.go:163-201

Issue: For each asset, copyAssets uploads every version first (lines 171-192) and only then calls scopeReader.AssetInstallScopes (line 194). If the scope read errors after uploads succeed, the asset content is on the destination but no installation rows are recorded — a dangling, unreferenced asset version set. With the new monoRepoConfigPaths round-trip the read path now has more failure surfaces (each path-scoped install adds a network call), so this risk is slightly bigger than before.

Suggestion: Read scopes before uploading versions so a scope-read failure aborts the asset cleanly, or add a Report.AssetsMissingScopes counter so the dangling case is visible alongside Versions and Scopes.

[LOW — open] `sleuthBotListSoftCap = 1000` warning still isn't surfaced in the copy report

Location: internal/vault/sleuth_mgmt.go:704-710 consumed by internal/vaultcopy/vaultcopy.go:133-153

Issue: warnIfBotListSoftCap emits a structured-log warning when len(nodes) >= 1000, but printReport in vault_copy.go:145-167 just shows Bots: 1000 with no surface signal. The logger.Warn goes to stderr-as-JSON; users running sx vault copy against an org with 1000+ bots see no in-band signal that the source may have been truncated.

Suggestion: Either upgrade bots to a paginated connection server-side, or have ListBots return a HasMore (mirroring ListTeamsResult) so copyBots can append a Report.Warnings entry.

[LOW — open] Audit/usage runtime warning still doesn't mention re-run duplicates

Location: internal/vaultcopy/vaultcopy.go:90

Issue: The command long-help (vault_copy.go:42-44) now states that re-running a partly-failed audit/usage stage duplicates events. But the runtime warning emitted when copyAudit/copyUsage actually fails — r.warnf("copy %s failed: %v", s.name, err) — is generic. The operator who sees copy audit failed: connection reset in the report has no in-the-moment signal that a retry will doubly-import already-landed chunks. The long-help is easy to miss when the failure itself is what triggers the next action.

Suggestion: Special-case the audit/usage stages with a sharper message, e.g. r.warnf("copy %s failed: %v — re-running may duplicate already-imported events on the destination", s.name, err).

[LOW — open] Chunking paths (5000-event audit, 1000-event usage) still uncovered by tests

Location: internal/vaultcopy/vaultcopy_roundtrip_test.go:58-67 and internal/vault/import_audit_e2e_test.go:33-50

Issue: The round-trip test seeds 1 audit event + 1 usage event; the path-vault e2e seeds 2 audit events. Both well below sleuthAuditImportChunkSize = 5000 and sleuthUsagePostChunkSize = 1000. The Sleuth chunking loops at sleuth_mgmt.go:1360-1386 (RecordUsageEvents) and :1464-1507 (ImportAuditEvents) are the novel, easily-regressed parts of this PR; the only validation today is the 15,885-event manual run from the PR description, which won't survive future refactors. A future off-by-one on end, or a switch from serial to batched encoding inside the chunk, would not trip any test.

Suggestion: Add a unit test against sleuth_mgmt.go that seeds sleuthAuditImportChunkSize + 1 audit events (and sleuthUsagePostChunkSize + 1 usage events), uses a fake HTTP/GraphQL transport that records per-call event counts, and asserts the expected (chunks, last-chunk-size). Even a small bump beyond each boundary catches the regression class.

Fix this →

· [Branch](https://github.com/sleuth-io/sx/tree/detkin/sx-vault-copy-client)

@claude
Copy link
Copy Markdown

claude Bot commented Jun 1, 2026

Claude finished @detkin's task in 2m 31s —— View job


PR Review: #152 — Add sx vault copy

The most recent commit (21072a5) resolves the prior-round dead-field selection on monoRepoConfigName; both HIGH issues remain fixed, the bulk-scope replace-on-set contract is sound, and path-scoped installs survive the round-trip via monoRepoConfigPaths. The remaining four LOWs (bot key discarded, scope-read-after-upload, bot soft-cap not surfaced, chunking test coverage) plus a still-generic audit/usage failure warning are the same items flagged in the previous round.

[RESOLVED ✓] `monoRepoConfigName` no longer selected from `AssetInstallations`

Location: internal/vault/graphql/operations/asset_installations.graphql:7-12

Verified: The selection set is now entityType, entityName, entityId, monoRepoConfigId only. Commit 21072a5 matches the prior-round suggestion — monoRepoConfigPaths already resolves the actual paths server-side, so the name field was pure dead weight.

[RESOLVED ✓] Prior fixes still in place

Verified:

  • Asset pagination: internal/vault/sleuth.go:629-665 pages through VaultAssets with pageInfo.endCursor at sleuthAssetsPageSize = 50, bounded by opts.Limit.
  • Slug-first scope matching: internal/vault/sleuth_mgmt.go:516-525 + asset_installations.graphql:5-6 select both slug and name; the loop matches slug-first with name fallback.
  • Path-scoped installs preserved: sleuth_mgmt.go:543-555 resolves MonoRepoConfigId → real subpaths via monoRepoConfigPaths and emits ScopeKindPath. The write side at :462-463 already supports RepositoryInstallationInput{Url, Paths}.
  • Bulk-scope best-effort, no clobbering fallback: sleuth_mgmt.go:445-503 returns unresolved rather than failing the whole call; vaultcopy.go:255-265 consumes that contract atomically.
  • --yes skips preview pass: commands/vault_copy.go:88-94 runs Copy exactly once when --yes is set.
  • Idempotency limitation documented: vault_copy.go:42-44 long-help warns that re-running a partly-failed audit/usage stage duplicates events.
[LOW — open] Audit/usage runtime warning is still generic — operators won't see the duplicate-on-retry caveat at the moment of failure

Location: internal/vaultcopy/vaultcopy.go:90

Issue: The command long-help (vault_copy.go:42-44) explains that re-running a partly-failed audit/usage stage will duplicate events. But the runtime warning emitted when copyAudit/copyUsage actually fails — r.warnf("copy %s failed: %v", s.name, err) — is generic. An operator who sees copy audit failed: connection reset in the printed report has no in-the-moment signal that a retry will doubly-import already-landed chunks. Long-help text is easy to miss when the failure is itself what triggers the next action.

Suggestion: Special-case the audit/usage stages with a sharper message, e.g. r.warnf("copy %s failed: %v — re-running may duplicate already-imported events on the destination", s.name, err).

[LOW — open] Bot creation discards the auto-issued Sleuth API key — destination accumulates orphan keys

Location: internal/vaultcopy/vaultcopy.go:143-150 (combined with sleuth_mgmt.go:773-799)

Issue: copyBots does dst.CreateBot(ctx, mgmt.Bot{…}) and ignores the returned rawToken. On Sleuth, CreateBot auto-issues a default API key and returns the raw token exactly once on the create response (per the comment at sleuth_mgmt.go:785-790) — there is no recovery path. Every newly-created bot on a skills.new destination ends up with an active, unrecoverable auto-issued API key on the server. The current warning at line 150 ("bot API keys are not copyable; regenerate them on the destination") doesn't tell the operator they have orphan keys to revoke.

Suggestion: Sharpen the warning to include the cleanup instruction: "newly-created bots on the destination have an auto-issued API key you can't see — revoke it with 'sx bot key delete <bot>' and create a fresh one with 'sx bot key create'". Auto-revoking right after CreateBot would be cleaner but requires surfacing the keyID from the mutation payload.

[LOW — open] Scope read still happens after version upload — read failure leaves stranded versions

Location: internal/vaultcopy/vaultcopy.go:163-201

Issue: For each asset, copyAssets uploads every version first (lines 171-192) and only then calls scopeReader.AssetInstallScopes (line 194). If the scope read errors after uploads succeed, the asset content is on the destination but no installation rows are recorded — a dangling, unreferenced asset version set. With the new monoRepoConfigPaths round-trip the scope-read path now has more failure surfaces (each path-scoped install adds a network call), so this risk is slightly bigger than the prior round. The error becomes a warning and the report just shows "copied N assets" with no signal that some are stranded.

Suggestion: Read scopes before uploading versions so a scope-read failure aborts the asset cleanly with no half-state. Or add a Report.AssetsMissingScopes counter so the dangling case is visible in the report alongside Versions and Scopes.

[LOW — open] `monoRepoConfigPaths` is N+1 across an asset's path-scoped installs

Location: internal/vault/sleuth_mgmt.go:538-563 (per-installation loop) + :572-593 (per-call fetch)

Issue: For each path-scoped install row, AssetInstallScopes makes one extra RepoMonoRepoConfigs(repoGID) round-trip. An asset path-scoped against the same repo twice (e.g. mono#services/api and mono#services/web) fetches the same repo's configs twice; across many assets the same repos are re-fetched repeatedly. For N assets averaging M path installs that's N×M extra calls — and these dominate a copy whose other reads have all been moved to bulk/paginated forms.

Suggestion: Cache monoRepoConfigPaths results per (repoGID, configGID) — at minimum for the duration of a single AssetInstallScopes call, ideally across a copyAssets pass. A map[string]map[string][]string keyed by repoGID → configGID → paths populated lazily collapses duplicates without changing call shape.

[LOW — open] `sleuthBotListSoftCap = 1000` warning isn't surfaced in the copy report

Location: internal/vault/sleuth_mgmt.go:704-710 consumed by internal/vaultcopy/vaultcopy.go:133-153

Issue: warnIfBotListSoftCap emits a structured-log warning when len(nodes) >= 1000, but printReport in vault_copy.go:145-167 just shows Bots: 1000 with no surface signal. The logger.Get().Warn goes to stderr-as-JSON; users running sx vault copy against an org with 1000+ bots see no in-band signal that the source may have been truncated. Same shape as the asset cap that was already fixed via pagination — at a higher threshold and not propagated through Report.Warnings.

Suggestion: Either upgrade bots to a paginated GraphQL connection server-side, or have ListBots return a HasMore (mirroring ListTeamsResult) so copyBots can append a Report.Warnings entry.

[LOW — open] Chunking paths (5000-event audit, 1000-event usage) still uncovered by tests

Location: internal/vaultcopy/vaultcopy_roundtrip_test.go:58-67 and internal/vault/import_audit_e2e_test.go:33-50

Issue: The round-trip test seeds 1 audit event + 1 usage event; the path-vault e2e seeds 2 audit events. Both far below sleuthAuditImportChunkSize = 5000 and sleuthUsagePostChunkSize = 1000. The Sleuth chunking loops at sleuth_mgmt.go:1360-1386 (RecordUsageEvents) and :1464-1507 (ImportAuditEvents) are the novel, easily-regressed parts of this PR; the only validation today is the 15,885-event manual run from the PR description, which won't survive future refactors. A future off-by-one on end, or a switch from serial to batched encoding inside the chunk, would not trip any test. The path-vault e2e can't exercise this since chunking is Sleuth-only.

Suggestion: Add a unit test against sleuth_mgmt.go that seeds sleuthAuditImportChunkSize + 1 audit events (and sleuthUsagePostChunkSize + 1 usage events), uses a fake HTTP/GraphQL transport that records per-call event counts, and asserts the expected (chunks, last-chunk-size). Even a small bump beyond each boundary catches the regression class.

Fix this →

· [Branch](https://github.com/sleuth-io/sx/tree/detkin/sx-vault-copy-client)

@claude
Copy link
Copy Markdown

claude Bot commented Jun 1, 2026

Claude finished @detkin's task in 3m 33s —— View job


PR Review: #152 — Add sx vault copy

Commit 057a201 ("Clear auto-applied install for uninstalled assets") closes the !present corner where Sleuth dst's auto-publish-on-upload silently widened scope; the matching present + all-targets-unresolved corner is still open and produces the same wrongly-widened outcome. Both prior HIGH issues remain fixed; the same four LOWs from the previous round are still open, plus a small symmetry gap.

[MEDIUM] All-unresolved scope set leaves Sleuth dst with the auto-applied org-wide install

Location: internal/vaultcopy/vaultcopy.go:265-275 combined with internal/vault/sleuth_mgmt.go:495-498

Issue: 057a201 correctly clears the auto-applied install when the source asset is not installed (!present), but the equivalent "asset present, but every target is unresolvable in the destination" path still leaves the auto-applied org-wide install in place. Trace:

  1. Source: my-skill installed only to user@nowhere.com (a user that doesn't exist in the destination org).
  2. copyAssets uploads the version — skills.new auto-publishes it org-wide on the destination.
  3. AssetInstallScopes returns (scopes=[user@nowhere], present=true).
  4. copyAssetScopes enters the bulk branch with targets=[InstallKindUser{user@nowhere}].
  5. SleuthVault.SetAssetInstallations fails to resolve the user GID → unresolved=[user], no GraphQL call sent (line 495 short-circuits to avoid the "empty set ⇒ org-wide" pitfall).
  6. Back in copyAssetScopes: unresolved is warned, r.Scopes += 0.
  7. End state: my-skill is installed org-wide on the destination (the skills.new auto-install), even though the source intended it for a single user. The report says Scopes: 0 and warns about the skipped target, but there is no signal that the destination ended up more permissive than requested.

This is the same widening-by-omission the new !present clear was meant to prevent — it's just gated on a different shape of the input.

Suggestion: When the bulk path ends up applying zero targets (either the explicit short-circuit at sleuth_mgmt.go:495-498 or len(targets) - len(unresolved) == 0 in copyAssetScopes), call dst.ClearAssetInstallations(ctx, name) to undo any auto-applied install, mirroring the !present branch:

applied := len(targets) - len(unresolved)
if applied == 0 {
    if err := dst.ClearAssetInstallations(ctx, name); err != nil {
        r.warnf("clear auto-applied install on %q: %v", name, err)
    }
    return
}

A test stub that returns all-unresolved would catch this regression in scopes_test.go.

[LOW — open] Bot creation discards the auto-issued Sleuth API key — destination accumulates orphan keys

Location: internal/vaultcopy/vaultcopy.go:143-150 (combined with sleuth_mgmt.go:773-799)

Issue: copyBots discards the rawToken returned from dst.CreateBot. On Sleuth, CreateBot auto-issues a default API key returned exactly once and unrecoverable (sleuth_mgmt.go:785-790). The user-facing warning at line 150 ("bot API keys are not copyable; regenerate them on the destination") doesn't tell the operator they have orphan, active keys to revoke.

Suggestion: Sharpen the warning to include the cleanup instruction: "newly-created bots on the destination have an auto-issued API key you can't see — revoke it with 'sx bot key delete <bot>' and create a fresh one with 'sx bot key create'". Auto-revoking right after CreateBot would be cleaner but needs the keyID surfaced from the mutation payload.

[LOW — open] Audit/usage runtime warning still doesn't mention re-run duplicates

Location: internal/vaultcopy/vaultcopy.go:90

Issue: The command long-help (commands/vault_copy.go:42-44) explains that re-running a partly-failed audit/usage stage duplicates events. The runtime warning at line 90 — r.warnf("copy %s failed: %v", s.name, err) — is generic. An operator who sees copy audit failed: connection reset in the report has no in-the-moment signal that a retry will doubly-import already-landed chunks; the long-help is easy to miss when the failure itself is what triggers the next action.

Suggestion: Special-case the audit/usage stages, e.g. if s.name == "audit" || s.name == "usage" { r.warnf("copy %s failed: %v — re-running may duplicate already-imported events on the destination", s.name, err) } else { r.warnf("copy %s failed: %v", s.name, err) }.

[LOW — open] Scope read still happens after version upload — read failure leaves stranded versions and (on Sleuth) a widened install

Location: internal/vaultcopy/vaultcopy.go:163-201

Issue: For each asset, copyAssets uploads every version first (lines 171-192) and only then calls scopeReader.AssetInstallScopes (line 194). If the scope read errors after uploads succeed, the asset content is on the destination but no installation rows are recorded — a dangling, unreferenced asset version set, and on a Sleuth destination the auto-applied org-wide install survives (with 057a201 only the success path clears it). The error becomes a warning and the report just shows "copied N assets" with no signal that some are stranded. The new monoRepoConfigPaths round-trip adds failure surface (one extra GraphQL call per path-scoped install), so this risk is slightly bigger than the prior round.

Suggestion: Read scopes before uploading versions so a scope-read failure aborts the asset cleanly with no half-state. Or add a Report.AssetsMissingScopes counter so the dangling case is visible in the report alongside Versions and Scopes.

[LOW — open] `sleuthBotListSoftCap = 1000` warning isn't surfaced in the copy report

Location: internal/vault/sleuth_mgmt.go:704-710 consumed by internal/vaultcopy/vaultcopy.go:133-153

Issue: warnIfBotListSoftCap emits a structured-log warning via logger.Get().Warn when len(nodes) >= 1000, but the CLI report (printReport in vault_copy.go:145-167) just shows Bots: 1000 with no surface signal. The structured-log line goes to stderr-as-JSON and is invisible to users running sx vault copy interactively.

Suggestion: Either upgrade bots to a paginated GraphQL connection server-side, or have ListBots return a HasMore (mirroring ListTeamsResult) so copyBots can append a Report.Warnings entry rather than only a structured-log line.

[LOW — open] `monoRepoConfigPaths` is N+1 across an asset's path-scoped installs

Location: internal/vault/sleuth_mgmt.go:538-563 (per-installation loop) + :572-593 (per-call fetch)

Issue: For each path-scoped install row, AssetInstallScopes makes one extra RepoMonoRepoConfigs(repoGID) round-trip. An asset path-scoped against the same repo twice (e.g. mono#services/api and mono#services/web) fetches the same repo's configs twice; across many assets the same repos are re-fetched repeatedly. For N assets averaging M path installs that's N×M extra calls — and these dominate a copy whose other reads have all been moved to bulk/paginated forms.

Suggestion: Cache monoRepoConfigPaths results per (repoGID, configGID) — at minimum for the duration of a single AssetInstallScopes call, ideally across a copyAssets pass. A map[string]map[string][]string keyed by repoGID → configGID → paths populated lazily collapses duplicates without changing call shape.

[LOW — open] Chunking paths (5000-event audit, 1000-event usage) still uncovered by tests

Location: internal/vaultcopy/vaultcopy_roundtrip_test.go:58-67 and internal/vault/import_audit_e2e_test.go:33-50

Issue: The round-trip test seeds 1 audit event + 1 usage event; the path-vault e2e seeds 2 audit events. Both far below sleuthAuditImportChunkSize = 5000 and sleuthUsagePostChunkSize = 1000. The Sleuth chunking loops at sleuth_mgmt.go:1360-1386 (RecordUsageEvents) and :1464-1507 (ImportAuditEvents) are the novel, easily-regressed parts of this PR; the only validation today is the 15,885-event manual run from the PR description, which won't survive future refactors. A future off-by-one on end or a switch from serial to batched encoding inside the chunk would not trip any test. The path-vault e2e can't exercise this since chunking is Sleuth-only.

Suggestion: Add a unit test against sleuth_mgmt.go that seeds sleuthAuditImportChunkSize + 1 audit events (and sleuthUsagePostChunkSize + 1 usage events), uses a fake HTTP/GraphQL transport that records per-call event counts, and asserts the expected (chunks, last-chunk-size). Even a small bump beyond each boundary catches the regression class.

Fix this →

· [Branch](https://github.com/sleuth-io/sx/tree/detkin/sx-vault-copy-client) ·

- MEDIUM: clear auto-applied install when a present asset's scopes all
  fail to resolve (or the set errors), so it isn't left org-wide
- Revoke the auto-issued bot API key on copy (no orphan keys)
- Audit/usage failure warning notes re-run duplicates
- Read asset scopes before uploading versions (no stranded uploads)
- Surface the Sleuth bot-list cap in the report
- Cache mono-repo configs per read (drop monoRepoConfigPaths N+1)
- Extract chunkSlice helper for audit/usage chunking + unit test
@claude
Copy link
Copy Markdown

claude Bot commented Jun 1, 2026

Claude finished @detkin's task in 3m 15s —— View job


PR Review: #152 — Add sx vault copy

Commit 3936c87 closes every remaining item from the previous round — the MEDIUM (clear-on-all-unresolved), all four LOWs (bot-key revoke, scope-read-before-upload, bot soft-cap surfacing, monoRepoConfigPaths N+1), the audit/usage runtime warning, and the chunking test gap — so this is now a clean approval pass from my side, with two small follow-ups worth considering.

[RESOLVED ✓] Clear-on-empty for all-unresolved (prior MEDIUM)

Location: internal/vaultcopy/vaultcopy.go:344-353 + internal/vaultcopy/scopes_test.go:93-107

Verified: After the bulk call returns unresolved, copyAssetScopes now computes applied := len(targets) - len(unresolved) and calls clearAutoInstall when applied == 0. TestCopyAssetScopes_BulkAllUnresolvedClears pins the contract. The matching bulkErr path at :337-339 also clears, so a hard error doesn't leave the asset wrongly org-wide either.

[RESOLVED ✓] Bot creation revokes the auto-issued Sleuth API key

Location: internal/vaultcopy/vaultcopy.go:160-203 (copyBots + revokeAutoIssuedBotKeys)

Verified: After CreateBot returns, if rawToken != "" the engine type-asserts the destination to a botKeyManager and revokes every key listed for the new bot. For freshly-created bots the only keys present are the auto-issued ones, so no orphan keys accumulate. The warning at line 174 now correctly says "create fresh ones on the destination" rather than "regenerate".

[RESOLVED ✓] Scopes read before version upload — no stranded versions on read failure

Location: internal/vaultcopy/vaultcopy.go:220-236

Verified: The scopeReader.AssetInstallScopes call now runs before any AddAsset; on error the asset is skipped entirely (continue), so a transient scope-read blip can't leave dangling versions or a widened auto-applied install on the destination.

[RESOLVED ✓] Audit/usage failure warning mentions duplicate-on-retry

Location: internal/vaultcopy/vaultcopy.go:84-92

Verified: When audit or usage stages fail, the warning is now "copy %s failed: %v — re-running may duplicate already-imported events on the destination". Other stages keep the generic warning. The long-help text already documented this; the runtime path now matches.

[RESOLVED ✓] Sleuth bot-list soft cap surfaced in the report

Location: internal/vaultcopy/vaultcopy.go:50-53, 152-154

Verified: copyBots checks len(bots) >= sleuthBotListSoftCap and emits an in-report warning rather than only the structured-log line the operator would miss. The const lives in the copy package with a comment that it's a heuristic.

[RESOLVED ✓] monoRepoConfigPaths N+1 collapsed via per-call cache

Location: internal/vault/sleuth_mgmt.go:537-601

Verified: AssetInstallScopes instantiates repoConfigCache := map[string]map[string][]string{} once and passes it into every monoRepoConfigPaths(..., repoConfigCache) call. Multiple path-scoped installs against the same repo now share a single RepoMonoRepoConfigs round-trip.

[RESOLVED ✓] Chunking now has direct unit coverage

Location: internal/vault/utils.go:18-33 + internal/vault/chunk_test.go

Verified: Audit (sleuth_mgmt.go:1471) and usage (:1368) chunking both go through the new chunkSlice helper. TestChunkSlice exercises the chunkSize + 1 boundaries for both 5000 and 1000 (plus 0, exact-multiple, and non-positive-size cases). The novel chunking logic is no longer manual-test-only.

[LOW] `len(bots) >= sleuthBotListSoftCap` can warn falsely on a vault with exactly 1000 bots

Location: internal/vaultcopy/vaultcopy.go:152-154

Issue: The heuristic fires whenever len(bots) >= 1000, but the underlying server cap returns at most 1000. So a vault with exactly 1000 bots produces a "any beyond that were not copied" warning even though the copy was complete. Not a correctness issue (the copy is still right), but the warning is noise on a perfectly-sized vault. The prior structured-log heuristic at warnIfBotListSoftCap has the same shape, but at least it lives in JSON logs the operator doesn't read.

Suggestion: Soften the message to "source returned %d bots — this matches the server's list cap; if your vault has more, the extras were not copied". Or have the Sleuth side return a HasMore (mirroring ListTeamsResult) so the warning is fact-based rather than heuristic.

[LOW] Unsupported scope kinds on a present asset silently leave the auto-applied install in place

Location: internal/vaultcopy/vaultcopy.go:294-321

Issue: For a present == true asset, if every source scope returns false from scopeToTarget (e.g. a future scope kind added without updating the switch), targets ends up empty, the function returns at :319-321 without setting or clearing, and the destination keeps its auto-applied org-wide install — silently widened. Every current manifest.ScopeKind is wired up, so this is a "should never happen" today, but the matching all-unresolved path (line 344-351) does clear; this path doesn't, and the divergence won't show up in any test.

Suggestion: Treat len(targets) == 0 after kind mapping the same as applied == 0 — call clearAutoInstall so a future unmapped scope kind doesn't silently globalize the asset:

if len(targets) == 0 {
    clearAutoInstall(ctx, dst, name, r)
    return
}

Add a regression test that stubs scopeToTarget to return false for every kind and asserts a single clear call.

Fix this →

  • Read CLAUDE.md and project standards
  • Read installed skill SKILL.md files (Sleuth-Python-focused — not directly applicable to this Go change)
  • Diff origin/main...HEAD to assess current state
  • Verify prior-review MEDIUM + LOWs against latest commits
  • Post review
    · Branch

@claude
Copy link
Copy Markdown

claude Bot commented Jun 1, 2026

Claude finished @detkin's task in 1m 43s —— View job


PR Review: #152 — Add sx vault copy

Commit 4310547 closes both prior-round LOWs cleanly — copyAssetScopes now calls clearAutoInstall on the unmapped-scopes path (mirroring the all-unresolved path), and the bot soft-cap warning is reworded so it no longer asserts truncation at exactly 1000; no new issues found, just one small symmetry suggestion on test coverage.

[RESOLVED ✓] Unmapped scope kinds no longer leave the auto-applied install in place

Location: internal/vaultcopy/vaultcopy.go:319-327

Verified: When every source scope fails scopeToTarget mapping on a present asset, copyAssetScopes now calls clearAutoInstall(ctx, dst, name, r) before returning, exactly mirroring the applied == 0 and bulkErr branches at :344 and :355. A future unmapped scope kind can no longer silently widen an asset to org-wide on Sleuth destinations.

[RESOLVED ✓] Bot soft-cap warning softened — no longer false-positive at exactly 1000

Location: internal/vaultcopy/vaultcopy.go:152-154

Verified: The warning now reads "source returned N bots, the most the server lists in one call; if the org has more, they were not copied" — fact-stating rather than asserting truncation. On a vault with exactly 1000 bots the operator gets a conditional notice rather than a wrong claim. (A real HasMore from ListBots would be cleaner still, but that requires the server-side bots field to become a paginated connection — fair as a follow-up.)

[RESOLVED ✓] All prior fixes from earlier rounds still in place

Verified:

  • Asset pagination (sleuth.go:629-665, page cursor at sleuthAssetsPageSize = 50, bounded by opts.Limit).
  • Slug-first scope matching (sleuth_mgmt.go:518-525 + asset_installations.graphql:5-6).
  • Path-scoped installs preserved via monoRepoConfigPaths with per-call cache (sleuth_mgmt.go:543-555, :572-593).
  • Bulk-scope best-effort, no per-target clobbering fallback (sleuth_mgmt.go:445-503 + vaultcopy.go:337-360).
  • --yes skips the dry-run preview pass (commands/vault_copy.go:88-94).
  • Bot creation revokes auto-issued Sleuth keys via the botKeyManager type assertion (vaultcopy.go:169-203).
  • Scopes read before version upload — read errors abort the asset cleanly (vaultcopy.go:220-235).
  • Audit/usage failure warnings mention duplicate-on-retry (vaultcopy.go:84-92).
  • Chunking has direct unit coverage via chunkSlice + chunk_test.go.
[LOW] Unmapped-scope-clear branch has no direct test (symmetry gap with all-unresolved)

Location: internal/vaultcopy/scopes_test.go

Issue: TestCopyAssetScopes_BulkAllUnresolvedClears (line 93) and TestCopyAssetScopes_BulkErrorClearsNoFallback (line 109) pin the clear-on-empty contract for the bulk paths, but the new len(targets) == 0 clear at vaultcopy.go:319-327 (the post-scopeToTarget unmapped-kind branch) isn't covered. The branch is defensive against a future scope kind being added without wiring scopeToTarget — exactly the kind of code that quietly stops working in a refactor and needs a test to lock the behavior in.

Suggestion: Add a test that stubs scopeToTarget to return false (or seeds a manifest.Scope with a kind that the current switch doesn't handle, e.g. via a fake manifest.ScopeKind zero value) and asserts f.clearCalls == 1 and r.Scopes == 0. A 10-line test that future-proofs the symmetry.

Fix this →

  • Read CLAUDE.md and installed-skill SKILL.md files (test-writer/graphql-field are Sleuth-Python-focused — not directly applicable to this Go change)
  • Diff origin/main...HEAD to assess current state
  • Verify prior-review LOW items against latest commits
  • Post review
    · Branch

@detkin detkin merged commit a4f1a8e into main Jun 1, 2026
4 checks passed
@detkin detkin deleted the detkin/sx-vault-copy-client branch June 1, 2026 23:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant