feat(state): add scan-state package for npm/python delta uploads#131
Merged
ashishkurmi merged 14 commits intoJun 19, 2026
Merged
Conversation
Signed-off-by: Swarit Pandey <swarit@stepsecurity.io>
Signed-off-by: Swarit Pandey <swarit@stepsecurity.io>
There was a problem hiding this comment.
Pull request overview
This PR introduces a new internal/state package that persists device-side scan state (projects, globals, and pending removals) to enable delta uploads for npm and Python across runs.
Changes:
- Add
StateJSON schema with load/save, reconciliation, partitioning, and commit-after-upload semantics. - Add canonical JSON hashing (
sha256:) to make output hashing stable across JSON key reordering. - Add unit tests covering load/save behavior, partitioning rules, full-sync horizon logic, and removal pending/ack lifecycle.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| internal/state/state.go | Implements the on-disk scan state model and core reconciliation/commit logic. |
| internal/state/state_test.go | Adds tests for load/save, reconcile/partition behavior, full-sync rules, and removals lifecycle. |
| internal/state/hash.go | Adds canonical JSON hashing to stabilize hashes across key reordering. |
| internal/state/hash_test.go | Adds tests for hash stability, format, and malformed JSON fallback behavior. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Comment on lines
+86
to
+88
| // Load reads scan-state.json. Missing file, parse error, or schema mismatch | ||
| // returns a fresh empty state — the next run becomes a full sync naturally. | ||
| // The error is non-nil only to surface why fallback happened, for logging. |
Comment on lines
+124
to
+126
| func (s *State) Save(path string) error { | ||
| dir := filepath.Dir(path) | ||
| if err := os.MkdirAll(dir, 0o750); err != nil { |
Comment on lines
+164
to
+168
| if s.LastFullSyncAt.IsZero() { | ||
| return true | ||
| } | ||
| return now.Sub(s.LastFullSyncAt) > horizon | ||
| } |
Comment on lines
+421
to
+429
| func (s *State) projectMap(ecosystem string) map[string]ProjectEntry { | ||
| switch ecosystem { | ||
| case EcosystemNPM: | ||
| return s.NPMProjects | ||
| case EcosystemPython: | ||
| return s.PythonProjects | ||
| } | ||
| return nil | ||
| } |
Comment on lines
+431
to
+439
| func (s *State) globalMap(ecosystem string) map[string]GlobalEntry { | ||
| switch ecosystem { | ||
| case EcosystemNPM: | ||
| return s.NPMGlobal | ||
| case EcosystemPython: | ||
| return s.PythonGlobal | ||
| } | ||
| return nil | ||
| } |
- Load doc: clarify error is non-nil only for read/parse failures; missing file and schema mismatch return nil. - Save: filepath.Clean the input so '..' segments don't surprise os.MkdirAll/Rename. Matches Load. - IsFullSyncDue: use >= against the horizon so exactly-at-boundary counts as due (doc says 'maximum gap'). - projectMap/globalMap: panic on unknown ecosystem instead of returning nil. Caller can only pass EcosystemNPM or EcosystemPython. Signed-off-by: Swarit Pandey <swarit@stepsecurity.io>
Phase 1 integration of the internal/state package. Payload shape unchanged — this verifies state load/save/commit-after-confirm reliability with zero backend dependency before later phases flip the wire format. - paths.ScanStateFile() resolves Home()/scan-state.json, returns '' when Home() is disabled. - state.ScanRecordFromBase64 builds a ScanRecord from a NodeScanResult by canonical-hashing its base64-decoded raw stdout. Falls back to raw-bytes hash on a malformed base64 string. - NodeScanner.ScanProjects gains a knownLastVerified map. orderScanProjects puts never-seen projects first by mtime desc, then known projects by LastVerifiedAt asc (stalest first). When the cap is hit, known-recent is dropped — the long tail finally gets reached. nil map = legacy mtime-desc order. - telemetry.Run loads scan-state.json after device_info, computes IsFullSyncDue (7-day horizon + agent-version drift), passes a known map to ScanProjects (skipped on fullSync). After a successful uploadToS3 it partitions the results and CommitAfterUpload + Save. Honors STEPSEC_DISABLE_SCAN_STATE=1 as an escape hatch. Python integration, global packages, and removed-project tracking are intentionally out of scope for this commit. Signed-off-by: Swarit Pandey <swarit@stepsecurity.io>
…-state Extends Phase 1 integration to cover the rest of the design: - pythonproject.go: split discover from scan so the caller can reorder by state. ListProjects now takes knownLastVerified and returns the un-capped discovered set as a second value. - nodescan.go: ScanProjects also returns the un-capped discovered set. Lets callers tell 'missing from disk' apart from 'dropped by the cap' when reconciling against prior state. - state.ScanRecordFromValue: build a ScanRecord by JSON-marshaling arbitrary data, for ecosystems like Python that hand back parsed package lists instead of raw PM stdout. - telemetry.commitScanState now handles npm projects, python projects, npm globals (per PM), python globals (per PM), and removed-project pending-ack appends. Cap-dropped projects survive untouched in state (they are discovered, so Reconcile doesn't see them as removed). - scan_state_test.go: round-trip tests covering first-run population, steady-state unchanged hashing, removed-project pending-ack, the cap-dropped invariant, failed-scan hash preservation, python flow, globals, and full-sync horizon refresh. Signed-off-by: Swarit Pandey <swarit@stepsecurity.io>
Adds an opt-out for the scan-state delta-upload optimization. Default remains the optimized path; setting use_legacy_package_scan=true in ~/.stepsecurity/config.json reverts to the pre-1.13 full-snapshot upload mechanism for npm and Python projects. Resolution order for the opt-out (any wins, all are non-destructive): 1. config.use_legacy_package_scan = true (persistent, in config.json) 2. STEPSEC_DISABLE_SCAN_STATE=1 (env, for incident response) 3. paths.Home() unresolvable (no place to write the file) When opted out, scanState stays nil; ScanProjects/ListProjects receive a nil 'known' map (legacy mtime-desc order) and commitScanState is skipped. State file is left untouched. Signed-off-by: Swarit Pandey <swarit@stepsecurity.io>
Comment on lines
+98
to
+103
| // orderVenvs prioritises never-seen venvs (mtime desc) before known venvs | ||
| // (LastVerifiedAt asc). A nil/empty map preserves discovery order. | ||
| func orderVenvs(candidates []venvCandidate, knownLastVerified map[string]time.Time) []venvCandidate { | ||
| if len(knownLastVerified) == 0 { | ||
| return candidates | ||
| } |
| _ = os.Remove(tmpPath) | ||
| return err | ||
| } | ||
| return os.Rename(tmpPath, cleanedPath) |
Completes the wire-format flip the state package was built for. When
scan-state is enabled, the payload now emits payload_schema_version=1
and routes each project to one of three slots:
- changed -> NodeProjects / PythonProjects (full body, today's slot)
- unchanged -> NodeProjectsUnchanged / PythonProjectsUnchanged
({path, scan_output_hash, last_uploaded_execution_id})
- removed -> NodeProjectsRemoved / PythonProjectsRemoved
({path, last_uploaded_execution_id})
Same three buckets for globals, except 'removed' never applies (a PM
disappearing from the payload is implicit per the design).
Wire types in internal/model: UnchangedProjectRef, RemovedProjectRef,
UnchangedGlobalRef. PayloadSchemaVersion field on Payload (omitempty so
legacy/disabled runs are wire-compatible with pre-1.13 backends).
internal/telemetry/delta.go:
- buildDeltaSnapshot runs after scans complete; partitions records,
reconciles against discovered, marks removals pending (in-memory),
and builds the wire-ready ref slices. nil state -> nil snapshot ->
caller uses legacy payload shape.
- commitDeltaSnapshot runs after a successful upload: CommitAfterUpload
(upserts records), AckRemovals (drops reported removals from
pending_ack), DropRemovedFromProjects (deletes from main maps),
atomic Save.
- removedRefsFor filters pending_ack against the discovered set so a
project that reappears before its prior removal was confirmed isn't
reported as removed again (would conflict with the backend's
changed-branch upsert).
telemetry.Run: builds the snapshot before assembling the payload, uses
its outputs to fill the legacy slots (changed bodies only) and the new
ref slots. After upload success, commitDeltaSnapshot replaces the old
commitScanState path. Legacy commitScanState function deleted.
Backwards compatibility preserved:
- config.UseLegacyPackageScan = true OR STEPSEC_DISABLE_SCAN_STATE=1
-> scanState is nil -> snap is nil -> legacy payload shape, no
schema version field set.
- Old backends seeing the new payload: payload_schema_version is
omitempty (defaults to 0 for legacy decoders); new fields are all
additive with omitempty tags. agent-api dispatches on version per
the companion PR.
Tests (internal/telemetry/{scan_state_test.go,payload_delta_test.go}):
- First-run population (everything changed)
- Steady-state unchanged refs carry hash + prior exec id
- Hash diff ships as changed body
- Removed projects -> RemovedRef, cleared from state after ack
- Cap-dropped projects stay in state, never reported removed
- Re-discovered project filtered from pending removals
- Failed scan doesn't overwrite prior hash
- Globals: first run changed, second run unchanged refs
- Full sync forces all changed
- nil state -> nil snapshot (delta disabled)
- JSON wire shape: legacy payload omits delta fields, delta payload
ships them, round-trips through marshal/unmarshal
Signed-off-by: Swarit Pandey <swarit@stepsecurity.io>
Reintegrates the two perf ideas from the closed PR step-security#81, layered on top of the delta-protocol foundation. Two effects: 1. Lockfile-mtime cache: per-project NodeScanResult cached at paths.Home()/node-scan-cache.json. A subsequent run reuses the cached result (skipping the npm/yarn/pnpm/bun CLI entirely) when package.json, the lockfile, AND node_modules mtimes are all <= LastScanUnix. AgentVersion is pinned per-entry so a post-upgrade run always re-scans. 2. Worker pool: cache misses run concurrently, bounded by min(NumCPU, 8). Override with STEPSEC_NODE_SCAN_WORKERS. The cache composes cleanly with the delta protocol — cached raw stdout canonical-hashes to the same value as state's stored hash, so delta classifies the project as unchanged and ships only a ref. Composition on a steady-state run: discover -> cache hit on every project -> no PM CLI -> ref-only payload -> backend bumps LastSeenAt. Four design improvements over step-security#81's original cut: - node_modules mtime is the third invalidation signal. The lockfile alone misses cases like 'rm -rf node_modules/foo' where the lockfile doesn't change but the installed state does. - Cache lives under paths.Home() (next to scan-state.json) instead of a hardcoded /var/lib path, matching the rest of the agent's per-device file layout. - AgentVersion pinning: a cache entry produced by v1.12 is treated as stale once v1.13 runs, defensive against parsing-format drift. - Bypass via STEPSEC_NODE_SCAN_CACHE_BYPASS=1 for incident response. Cache file is corruption-safe: missing, parse failure, or schema version mismatch (currently 2) all return an empty cache. Successful fresh scans only — failed scans are never cached so the next run retries them, matching the same invariant used by state.Partition. executor.Mock gains SetFileMtime(path, unixSec) that registers both the mtime and file existence in one call. Used by the cache invalidation tests; harmless general-purpose helper. Tests (internal/detector/nodescan_cache_test.go): - Cache roundtrip, miss/corrupt/wrong-version fallbacks - Hit when all three mtimes <= LastScanUnix - Miss when lockfile mtime moves - Miss when node_modules mtime moves (the rm -rf case) - Miss when no lockfile present (can't trust mtimes) - Miss on agent version drift - Miss on PM change (npm -> yarn) - Bypass=true short-circuits all checks - pruneCacheToDiscovered drops orphaned entries - Worker count: env override + sane default Signed-off-by: Swarit Pandey <swarit@stepsecurity.io>
The --telemetry-out dev flag was short-circuiting before commitDeltaSnapshot, so dev/stress-test runs that use it produced no scan-state.json and every subsequent run looked 'cold' (always classified as changed). Treat a successful local dump the same as a successful S3 upload for the purposes of persisting state — same in-memory snapshot, same commit, same save. This unblocks the offline integration test in step-security/integration-test which dumps payloads via --telemetry-out and asserts on scan-state.json / node-scan-cache.json across runs. Signed-off-by: Swarit Pandey <swarit@stepsecurity.io>
scanProjectsConcurrent dispatches scanProject across a worker pool. Each worker calls binaryAvailable, which reads + writes pmAvailability — a shared map on the scanner. Under enough projects + workers Go's runtime crashes the process with 'fatal error: concurrent map read and map write'. Surfaced by the integration-test workflow on a 30-project corpus; not caught by unit tests because nodescan_test's mock executor makes binaryAvailable resolve via a single shared lookup table that doesn't trigger the race. Add a sync.Mutex around the cache. Hot path stays cheap (at most one PM binary per device drives N lookups; first call writes, all subsequent calls are read-and-return). Signed-off-by: Swarit Pandey <swarit@stepsecurity.io>
- orderVenvs: only short-circuit on nil knownLastVerified. An empty map means state exists but has no Python entries yet, so every candidate is unknown and should still sort by mtime desc. - State.Save: remove destination before rename (Windows os.Rename fails if dest exists) and clean up the tmp file on rename failure so we don't leave .scan-state-*.tmp debris. Matches the pattern in internal/progress/filelog/filelog.go RotateIfOverCap. Signed-off-by: Swarit Pandey <swarit@stepsecurity.io>
Signed-off-by: Swarit Pandey <swarit@stepsecurity.io>
ashishkurmi
approved these changes
Jun 19, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What does this PR do?
Type of change
Testing
./stepsecurity-dev-machine-guard --verbose./stepsecurity-dev-machine-guard --json | python3 -m json.toolmake lintmake testRelated Issues