105 changes: 105 additions & 0 deletions ChangeLog.md
@@ -17,6 +17,111 @@ _No unreleased changes._

---

## 26.15 — 2026-05-27

Per-subnet scan profiles (P2-05). Operators can now run aggressive
hourly deep scans on critical infrastructure while leaving the guest
network on a lazy daily liveness sweep — all from one config file, one
agent process. Closes the original P2 operator-feedback batch.

### Added

- **`config.SubnetProfile`** — per-subnet overrides for `ScanInterval`,
`Timeout`, `ProbePorts`, `DeepProbe`, `DeepProbePorts`, `UDPPorts`,
`EnrichARP`. Bool fields are `*bool` so a profile can explicitly
disable deep probing even when the global default is on (with a plain
`bool`, an explicit `false` would be indistinguishable from "not set").
- **`config.ScannerConfig.Profiles []SubnetProfile`** — the new
per-subnet list. Mutually exclusive with the legacy `Subnets`
field; both are validated at boot.
- **`config.ResolvedProfile` + `ScannerConfig.Resolve()`** — flattens
Subnets + Profiles into one fully-defaulted list. Every field is
populated from either the profile override or the global default,
so the agent's runtime path has no further fallback logic.
- **`scanner.SubnetOptions`** — new struct passed to `Scan(ctx,
subnet, opts)`. Carries the per-call probe configuration so one
Scanner can serve multiple profiles without being rebuilt for each scan.
- **`config.True()` / `config.False()`** — bool-pointer helpers for
building profiles programmatically.
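
The pointer-bool override described above can be sketched roughly as follows. This is a minimal illustration, not the code shipped in `internal/config`: the trimmed `SubnetProfile` struct and the `resolveBool` helper are hypothetical names, and only `True()` / `False()` mirror helpers named in this entry.

```go
package main

import "fmt"

// SubnetProfile is a trimmed, hypothetical stand-in for config.SubnetProfile.
type SubnetProfile struct {
	Subnet    string
	DeepProbe *bool // nil means "inherit the global default"
}

// True and False mirror the bool-pointer helpers named in the changelog.
func True() *bool  { b := true; return &b }
func False() *bool { b := false; return &b }

// resolveBool (hypothetical) returns the profile override when one is set,
// otherwise the global default.
func resolveBool(override *bool, global bool) bool {
	if override != nil {
		return *override
	}
	return global
}

func main() {
	globalDeepProbe := true
	profiles := []SubnetProfile{
		{Subnet: "10.0.0.0/24"},                        // no override: inherits the global
		{Subnet: "192.168.1.0/24", DeepProbe: False()}, // explicit false beats global true
	}
	for _, p := range profiles {
		fmt.Println(p.Subnet, resolveBool(p.DeepProbe, globalDeepProbe))
	}
}
```

With a plain `bool` field, the second profile could not switch deep probing off, because its `false` would look the same as an unset field.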

### Changed

- **`scanner.Scan(ctx, subnet)` → `scanner.Scan(ctx, subnet, SubnetOptions{})`**.
In-tree callers updated; out-of-tree callers pass `SubnetOptions{}`
to retain pre-26.15 behaviour.
- **`scanner.probe`, `deepScan`, `udpScan` helpers** now take their
timeout + port-list parameters explicitly rather than reading from
the Scanner struct. The Scanner-level fields remain as defaults
consulted by `resolve()` at the top of Scan.
- **`agent.New` now returns `(*Agent, error)`** — `Resolve()` runs at
construction so config errors (duplicate subnets, mutually-exclusive
flat list + profiles) surface at boot, not on the first scan tick.
Test suites updated.
- **Agent scheduling**: the single global `ScanInterval` ticker is
replaced with one that ticks at the *shortest* per-profile
interval. Each profile keeps its own `nextDue` timestamp, computed
from the start of the cycle that scanned it; only profiles past
their due time get scanned on each tick. The initial cycle and
on-demand triggers scan every profile regardless of due time. The
housekeeping pass (prune, change-detect diff, tracker updates) runs
every tick regardless, so zero-profile deployments (watchdog-only
mode) still work as before.
- **Tick-interval safety floor** of 1 second. Prevents pathological
busy-loops if an operator types `"1ns"` by accident.

### Tests

- 7 new tests in `internal/config` covering: legacy-Subnets path,
profile-overrides-win, explicit-False-beats-global-True, mutually-
exclusive validation, duplicate-subnet rejection, empty-subnet
rejection, zero-profile happy path.
- Scanner + agent test suites updated for the new `SubnetOptions{}`
third argument and `(a, err)` constructor.

### Migration notes

Existing configs keep working — set `scanner.subnets` and the global
fields (`scan_interval`, `timeout`, `probe_ports`, …) as before.
**Operators who want per-subnet tuning** switch to:

```json
{
"scanner": {
"profiles": [
{ "subnet": "10.0.0.0/24", "scan_interval": "1h", "deep_probe": true },
{ "subnet": "192.168.1.0/24", "scan_interval": "24h" }
],
"scan_interval": "5m",
"timeout": "2s",
"workers": 50
}
}
```

Any field absent from a profile inherits the corresponding global.
The flat `scanner.subnets` and per-subnet `scanner.profiles` fields
are mutually exclusive — boot fails fast if both are set.

---

## P2 operator-feedback batch — complete

Original asks from the operator pass: **all five shipped.**

| # | Item | Sprint |
|---|---|---|
| 1 | Service / application discovery | 26.12 |
| 2 | Change detection + webhook/syslog alerts | 26.13 |
| 3 | Device-type classifier | 26.11 |
| 4 | Query API beyond bulk export | 26.14 |
| 5 | Per-subnet scan profiles | 26.15 |

The agent now does end-to-end inventory: discovery → enrichment →
classification → change detection → alerting → queryable API, with
per-subnet scheduling. Next-feature backlog is empty; future work
should be driven by a fresh round of operator feedback or `/ultrareview`
findings.

---

## 26.14 — 2026-05-27

JSON query API (P2-04). Adds filterable, paginated `/api/v1/hosts` and
6 changes: 5 additions & 1 deletion cmd/internal/runtime/runtime.go
@@ -138,7 +138,11 @@ func Run(opts Options) int {
slog.Info("alert sinks configured", "count", len(alertSinks))
}

a := agent.New(opts.Name, cfg.Scanner, db.Hosts(), db.Ports(), db.Scans(), tracker, mux)
a, err := agent.New(opts.Name, cfg.Scanner, db.Hosts(), db.Ports(), db.Scans(), tracker, mux)
if err != nil {
slog.Error("agent setup failed", "err", err)
return 1
}

adminSrv, err := admin.NewServer(
cfg.Admin.Addr, opts.Name,
4 changes: 2 additions & 2 deletions internal/admin/api.go
@@ -163,9 +163,9 @@ func (s *Server) handleAPIHostDetail(w http.ResponseWriter, r *http.Request) {
type hostFilter struct {
vendor string
deviceType string
hostname string // lowercase, for case-insensitive substring
hostname string // lowercase, for case-insensitive substring
subnet *net.IPNet // nil = no subnet filter
port int // 0 = no port filter
port int // 0 = no port filter
}

func parseHostFilter(q url.Values) (hostFilter, error) {
120 changes: 94 additions & 26 deletions internal/agent/agent.go
@@ -27,6 +27,12 @@ type Agent struct {
alerts alerts.Emitter
now func() time.Time

// profiles is the resolved per-subnet config built at New time.
// Each entry carries its own ScanInterval; nextDue tracks when its
// next scan is permitted.
profiles []config.ResolvedProfile
nextDue map[string]time.Time

// trigger is a buffered channel that lets external callers
// (e.g. POST /scan) force an immediate cycle without waiting for the
// next ticker firing. Capacity 1 so concurrent triggers coalesce.
@@ -46,10 +52,14 @@ func New(
scans store.ScanStore,
tracker *health.Tracker,
alertEmitter alerts.Emitter,
) *Agent {
) (*Agent, error) {
if alertEmitter == nil {
alertEmitter = alerts.NoopEmitter()
}
profiles, err := cfg.Resolve()
if err != nil {
return nil, err
}
return &Agent{
name: name,
cfg: cfg,
@@ -67,11 +77,13 @@ func New(
UDPPorts: cfg.UDPPorts,
EnrichARP: cfg.EnrichARP,
}),
tracker: tracker,
alerts: alertEmitter,
now: time.Now,
trigger: make(chan struct{}, 1),
}
tracker: tracker,
alerts: alertEmitter,
now: time.Now,
profiles: profiles,
nextDue: make(map[string]time.Time, len(profiles)),
trigger: make(chan struct{}, 1),
}, nil
}

// Trigger requests an out-of-cycle scan. Returns true if the request was
@@ -88,15 +100,20 @@ func (a *Agent) Trigger() bool {
}
}

// Run starts the scan loop. It executes one scan immediately, then repeats on
// cfg.ScanInterval. It blocks until ctx is cancelled.
// Run starts the scan loop. It executes one cycle immediately (every
// profile, regardless of its own interval), then ticks at the shortest
// configured per-profile interval. Each subsequent tick only scans
// profiles whose next-due time has passed. Blocks until ctx is cancelled.
func (a *Agent) Run(ctx context.Context) {
log := slog.With("agent", a.name)
log.Info("scan loop started", "subnets", a.cfg.Subnets, "interval", a.cfg.ScanInterval)
log.Info("scan loop started",
"profiles", len(a.profiles),
"tick_interval", a.tickInterval(),
)

a.runCycle(ctx, log)
a.runCycle(ctx, log, true /* forceAll */)

ticker := time.NewTicker(a.cfg.ScanInterval.Duration)
ticker := time.NewTicker(a.tickInterval())
defer ticker.Stop()

for {
@@ -105,39 +122,82 @@ func (a *Agent) Run(ctx context.Context) {
log.Info("scan loop stopped")
return
case <-ticker.C:
a.runCycle(ctx, log)
a.runCycle(ctx, log, false)
case <-a.trigger:
log.Info("on-demand scan triggered")
a.runCycle(ctx, log)
a.runCycle(ctx, log, true)
}
}
}

// tickInterval is the shortest configured profile interval. The main
// loop ticks at this cadence and selects per-tick which profiles are
// actually due. A safety floor of one second prevents pathological
// busy-loops when an operator types "1ns" by accident.
func (a *Agent) tickInterval() time.Duration {
const floor = time.Second
min := time.Duration(0)
for _, p := range a.profiles {
if p.ScanInterval <= 0 {
continue
}
if min == 0 || p.ScanInterval < min {
min = p.ScanInterval
}
}
if min < floor {
min = floor
}
return min
}

func (a *Agent) runCycle(ctx context.Context, log *slog.Logger) {
log.Info("scan cycle started", "subnets", len(a.cfg.Subnets))
// runCycle scans every profile whose nextDue has passed (or all of them
// when forceAll is true — used for the initial cycle and on-demand
// triggers).
func (a *Agent) runCycle(ctx context.Context, log *slog.Logger, forceAll bool) {
started := a.now()

// Snapshot the pre-cycle host inventory so we can diff it against
// the post-cycle list and fire HostDiscovered / HostVanished events.
// Snapshotting before the scan (rather than tracking what the
// scanner returned) means the diff correctly reflects "ground truth
// changed", including hosts the operator added or removed
// externally.
prevHosts := snapshotByIP(ctx, a.hosts, log)

// Select due profiles. With no profiles configured (zero-config
// "watchdog only" deployment), due is empty but housekeeping below
// — prune, diff, tracker updates — still runs.
due := make([]config.ResolvedProfile, 0, len(a.profiles))
for _, p := range a.profiles {
if forceAll || !started.Before(a.nextDue[p.Subnet]) {
due = append(due, p)
}
}

if len(due) > 0 {
log.Info("scan cycle started", "due_profiles", len(due), "total_profiles", len(a.profiles))
}

cycleHosts := 0
cycleHealthy := true
for _, subnet := range a.cfg.Subnets {
for _, p := range due {
metrics.ScansTotal.Inc()
n, err := a.scanner.Scan(ctx, subnet)
n, err := a.scanner.Scan(ctx, p.Subnet, scanner.SubnetOptions{
Timeout: p.Timeout,
ProbePorts: p.ProbePorts,
DeepProbe: boolPtr(p.DeepProbe),
DeepProbePorts: p.DeepProbePorts,
UDPPorts: p.UDPPorts,
EnrichARP: boolPtr(p.EnrichARP),
})
if err != nil {
metrics.ScanErrorsTotal.Inc()
log.Warn("subnet scan failed", "subnet", subnet, "err", err)
log.Warn("subnet scan failed", "subnet", p.Subnet, "err", err)
cycleHealthy = false
continue
}
log.Debug("subnet scanned", "subnet", subnet, "hosts", n)
log.Debug("subnet scanned", "subnet", p.Subnet, "hosts", n, "interval", p.ScanInterval)
cycleHosts += n
// Schedule the next due time. Computed off the start of THIS
// cycle (not now) so a slow scan doesn't drift the cadence.
a.nextDue[p.Subnet] = started.Add(p.ScanInterval)
}

if pruned := a.pruneStale(ctx, log, started); pruned > 0 {
@@ -167,11 +227,13 @@ func (a *Agent) runCycle(ctx context.Context, log *slog.Logger) {
a.tracker.SetHealthy(cycleHealthy)

duration := a.now().Sub(started)
interval := a.cfg.ScanInterval.Duration
// Warning threshold: half the tick interval. Slower than that and
// the loop is at risk of dropped firings.
interval := a.tickInterval()
if interval > 0 && duration > interval/2 {
log.Warn("scan cycle nearly exceeded interval",
log.Warn("scan cycle nearly exceeded tick interval",
"duration", duration.Round(time.Millisecond),
"interval", interval,
"tick_interval", interval,
)
}
log.Info("scan cycle complete",
@@ -182,6 +244,12 @@ func (a *Agent) runCycle(ctx context.Context, log *slog.Logger) {
)
}

// boolPtr is a small inline helper for passing a value bool to a *bool
// option field. Pointer-bools let a profile distinguish "explicit false"
// from "inherit default"; from the resolved-profile side we know which
// way the bool was already resolved, so we always pass a non-nil pointer.
func boolPtr(b bool) *bool { return &b }

// pruneStale deletes hosts whose last_seen is older than the configured
// HostTTL. Returns the number of hosts pruned. Disabled when HostTTL is 0
// (the default), so existing deployments don't lose history silently.