Skip to content

feat: implement core/aggsigdb#430

Open
emlautarom1 wants to merge 12 commits into
mainfrom
emlautarom1/core-aggsigdb
Open

feat: implement core/aggsigdb#430
emlautarom1 wants to merge 12 commits into
mainfrom
emlautarom1/core-aggsigdb

Conversation

@emlautarom1
Copy link
Copy Markdown
Collaborator

Closes #169


There are two implementations in Charon: V1 uses an Actor model with commands and queries, and V2 uses a Mutex alongside Go runtime trickery. The implementation in use in Charon is V1 due to concurrency issues in V2.

This implementation is closer to V2, and we use tokio's tools (ex. Notify) to replicate the Go runtime features.

Tests have been ported when appropriate, and some new ones were added to increase coverage.

- Use Actor model
- Implement `store`
- Test `store`
- Expose top-level `new`
- Remove impossible errors
- Add inline docs
- Use Mutex/Notify pattern
- Remove need for handles
- Adjust tests and docs
- Use existing Deadliner API (Arc)
- Add tests
- Rust specific
- Use `try_for_each`
- Proper naming convention
@emlautarom1
Copy link
Copy Markdown
Collaborator Author

@claude

@claude
Copy link
Copy Markdown

claude Bot commented May 21, 2026

Claude finished @emlautarom1's task in 3m 32s —— View job


Claude PR Review — core/aggsigdb

  • Read PR diff and changed files
  • Compare with Charon Go implementation (memory.go, memory_v2.go)
  • Review correctness, concurrency, and Rust style
  • Post review findings

Overall: the implementation is clean, well-tested, and faithfully follows Charon's V2 mutex+notify design with idiomatic Tokio primitives. The Notify::notified() → enable() → check map → await pattern is correctly applied to avoid lost wakeups, and the eviction task lifetime via Weak<MemDBInner> is elegant. A few items below are worth addressing before merge — most are design questions rather than bugs.


🟠 Cancellation: wait_for has no way to be aborted

crates/core/src/aggsigdb/memory.rs:95-118MemDB::wait_for returns Box<dyn SignedData> and has no timeout/cancellation parameter. Charon's Await(ctx, ...) takes a context.Context and returns an error on cancel or DB stop. In Rust this can be papered over by callers with tokio::select!/tokio::time::timeout, but the lost flexibility is worth a conscious choice: e.g. accepting a CancellationToken or returning a Result so the caller can distinguish "I gave up" from "value not yet stored." Worth confirming this is the intended Pluto API surface and not just a temporary simplification, since the consumers of AggSigDB (sigagg / fetcher‐glue) need a cancel path.

🟠 Deadliner::c() returning None is silently a no-op

crates/core/src/aggsigdb/memory.rs:39-46:

match deadliner.c() {
    Some(evictions) => {
        tokio::spawn(Self::evict(Arc::downgrade(&this.0), evictions));
    }
    None => {
        // TODO: In Charon, `deadliner.c()` always returns `Some`
    }
}

A None here disables eviction entirely with no log/warn, which makes misconfiguration (e.g. somebody else already took the receiver via c()) extremely hard to diagnose — a long‑running node will quietly leak memory. Three options worth considering:

  1. expect("aggsigdb requires exclusive access to the deadliner") — matches Charon's invariant.
  2. tracing::warn! at minimum so it's visible.
  3. Reshape the Deadliner API so the receiver is passed in directly (then misuse is a type error).

🟡 tokio::sync::Mutex for purely-synchronous critical sections

crates/core/src/aggsigdb/memory.rs:25: data: Mutex<HashMap<...>> uses tokio::sync::Mutex, but the lock is never held across .await (the only post‑lock work is notify.notify_waiters() which is sync). std::sync::Mutex (or parking_lot::Mutex) would be cheaper and more idiomatic per Tokio's own guidance. The notify_waiters call comes after the lock drops, so swapping is safe.

🟡 Eviction is O(n) over the full map instead of O(k) per duty

crates/core/src/aggsigdb/memory.rs:51-61:

inner.data.lock().await.retain(|(d, _), _| d != &duty);

Charon V2 maintains a keysByDuty map[Duty][]memDBKey precisely to avoid scanning the entire data map on every eviction (memory_v2.go:18, 54, 128-138). At steady state with many active duties per slot, this matters. Not a correctness issue, but it drops a deliberate optimization from the original. Consider adding the secondary index — it's a few lines and removes a perf regression vs. Charon. Fix this →

🟡 notify_waiters is fired even on no-op stores

crates/core/src/aggsigdb/memory.rs:86-89: the TODO already calls this out. Two waste modes:

  1. An idempotent re-store (existing == new) still wakes every waiter.
  2. An error path (MismatchingData) also wakes them.

In a steady-state DV node with many parallel waiters this is fine but not free. The fix is a bool inserted_any flag built in the try_for_each, and only calling notify_waiters when true. Cheap to do now and the TODO is right there. The fancier per‑key wakeup mentioned in the second half of the TODO is overkill — the boolean is enough.

🟡 store calls deadliner.add(duty) before taking the data lock

crates/core/src/aggsigdb/memory.rs:64-68 — Charon V2 calls m.deadliner.Add(duty) inside the write lock (memory_v2.go:41, called from store while Store holds m.Lock()). The Rust port calls it before locking. The window matters only if add could schedule an immediate eviction for an already‑expired duty before we even take the lock: in that case evict removes nothing (key not yet inserted), then we insert, and the entry survives past its deadline. Charon has the same race in V2 actually (since Add posts to an mpsc, the eviction is scheduled async), so this isn't a regression — but the TODO(charon) at line 65 about "no deadline supported vs already expired" is exactly the same hole. Worth a comment acknowledging this, or filing a follow-up.

🟢 SignedDataSet type-erased rewrite (crates/core/src/types.rs)

The change from SignedDataSet<T: SignedData> to SignedDataSet(HashMap<PubKey, Box<dyn SignedData>>) makes sense for aggsigdb's heterogeneous storage. Verified there are no current callers of the old generic API outside this PR, so no migration risk. The new IntoIterator for both owned and & flavors is a nice touch. Minor: the &'a SignedDataSet IntoIterator yields &'a Box<dyn SignedData> — Clippy may eventually nudge you to &'a dyn SignedData via .map(|(k, v)| (k, v.as_ref())), but that's an ergonomic call.

🟢 Tests are good — one fragility note

The tokio::task::yield_now().await; assert!(!reader.is_finished()) pattern (memory.rs:225-226, 320-321, 350-356, 388-389) is racy on loaded CI: a single yield doesn't guarantee the spawned reader has reached notified.await. In practice this rarely flakes because the executor schedules eagerly, but on contended runners it can. Consider replacing with a small tokio::time::sleep(Duration::from_millis(50)) or — better — a Notify-based "reader is parked" signal in the test. Same critique applies to Charon's runtime.Gosched(), so this isn't unique to Pluto.

The new test coverage (write/read, write_unblocks, cannot_overwrite, write_idempotent, eviction, write_unblocks_many, unrelated_write_does_not_unblock) is more thorough than Charon's V2 internal tests — nice work.

🔵 Nits

  • crates/core/src/aggsigdb/mod.rs has only pub mod memory; — consider a one‑line module doc describing the AggSigDB role (aggregated signed duty data store; key invariants).
  • MemDB::new calls tokio::spawn, so it requires a Tokio runtime to be active at construction. Worth a # Panics note on the rustdoc, since a non-test caller could trip this from a builder run outside an async context.
  • crates/core/src/aggsigdb/memory.rs:99wait_for could take &Duty instead of Duty by value; you only need an owned Duty to build the lookup key, which Clone covers. Minor ergonomics.
  • The 692/-222 diff in Cargo.lock looks like an unrelated cargo update (alloy 1.5→1.6, winnow 0.7→1.0, secp256k1 versioning, etc.). No new direct deps were introduced by the aggsigdb crate itself, so the churn isn't load-bearing, but you may want to land it as a separate commit/PR for cleaner bisection.

Summary

Severity Item
🟠 wait_for has no cancellation/timeout primitive — confirm intent
🟠 Deadliner::c() == None silently disables eviction — log or expect
🟡 tokio::sync::Mutex unnecessary; std::sync::Mutex is sufficient
🟡 Drop keysByDuty index makes eviction O(n) vs Charon's O(k)
🟡 notify_waiters fires even on no-op stores
🟡 deadliner.add before lock — re-confirm the eviction race comment
🔵 Doc/ergonomic nits, Cargo.lock churn separability

Nothing here is a blocker; the implementation is correct and matches Charon V2 semantics. Most actionable is the keysByDuty index (Charon left it in for a reason) and the Deadliner::c() == None warn/expect. The wait_for cancellation question is worth nailing down before downstream consumers pin against this API.
· Branch: emlautarom1/core-aggsigdb

@emlautarom1 emlautarom1 marked this pull request as draft May 21, 2026 00:37
- Annotate `wait_for` abort mechanism
- Log when deadliner does not have a channel
@emlautarom1 emlautarom1 force-pushed the emlautarom1/core-aggsigdb branch from 8b30fe0 to b341668 Compare May 21, 2026 05:22
@emlautarom1
Copy link
Copy Markdown
Collaborator Author

@claude

@emlautarom1 emlautarom1 marked this pull request as ready for review May 21, 2026 05:23
@claude
Copy link
Copy Markdown

claude Bot commented May 21, 2026

Claude finished @emlautarom1's task in 2m 41s —— View job


Claude PR Review — core/aggsigdb (re-run)

  • Check what's changed since previous review
  • Re-read changed files
  • Compare with Charon Go implementation
  • Post updated review findings

Re-reviewed after the "Apply Claude suggestions" / "Refactor" commits (b341668, 33c15ac, 98993ba, 1ecfe68, 6a91cbe). Most of the previous round was addressed cleanly. One correction to the earlier review, two open items, and a couple of new nits below.


Correction to the previous review

🔴 tokio::sync::Mutex is required here, not optional — the previous review suggested swapping to std::sync::Mutex. That suggestion was wrong: crates/core/src/aggsigdb/memory.rs:65-68 holds the data lock across self.0.deadliner.add(duty.clone()).await, so the lock crosses an .await point and must be the async tokio::sync::Mutex. This also matches Charon V2 (memory_v2.go:60-78), which calls m.deadliner.Add() while holding m.Lock(). Keep the current code; ignore the prior nudge. The only alternative would be to restructure store to drop the lock before calling deadliner.add, but that introduces a TOCTOU vs. the eviction task that Charon explicitly avoids.


Addressed since last review

  • Deadliner::c() == None now emits tracing::warn! (memory.rs:43-45).
  • notify_waiters only fires when something was actually inserted (memory.rs:71, 82, 88-92), via the inserted flag — exactly the fix sketched last round.
  • ✅ Store now consumes SignedDataSet matching Charon's API (memory.rs:64).
  • try_for_each cleans up the per-entry insert loop (memory.rs:72-86); partial-insertion-on-error semantics preserved with an explicit comment, matching Charon V2 (memory_v2.go:72-76).
  • wait_for indefinite-blocking semantics documented (memory.rs:97-100).
  • mod.rs doc string added.

Still open

🟡 keysByDuty index for O(k) eviction (memory.rs:51-61) — Charon V2 explicitly maintains keysByDuty map[Duty][]memDBKey to make eviction touch only the keys for the evicted duty (memory_v2.go:18, 54, 128-138). The Rust port still uses retain over the whole data map. Not a correctness issue, but it drops a deliberate Charon optimization that matters at steady state with many active duties. The fix is ~10 lines: append to keysByDuty in the Vacant branch of store, then in evict look up keysByDuty.remove(&duty) and delete each of those keys from data directly. Fix this →

🟡 wait_for still takes Duty by value (memory.rs:101-105) — minor ergonomics: only an owned Duty is needed to build the lookup key, and the body already clones nothing of the duty (the tuple (duty, pub_key) consumes the parameter directly). Taking &Duty + &PubKey would let call sites avoid a clone they don't need. Not load-bearing.


New observations from this pass

🟡 store still holds the data lock across deadliner.add(...).await (memory.rs:65-68) — this is correct (matches Charon V2 and is why tokio::sync::Mutex is needed), but the Deadliner::add impl in crates/core/src/deadline.rs should be audited to confirm it can't ever take a long blocking action (network call, etc.) while we hold the AggSigDB write lock. From a quick read of deadline.rs it looks like add just pushes onto a channel/spawns a timer, so this is fine — but worth a one-line comment in store noting "lock is held across add().await deliberately; add must remain non-blocking" to lock that contract in. Otherwise a future change to deadline.rs could silently introduce head-of-line blocking on every aggsigdb write.

🟢 Tests tokio::task::yield_now().await + assert!(!reader.is_finished()) pattern (memory.rs:231-232, 326-327, 356-362, 394-395, 404-408) — flagged last round and worth re-flagging: a single yield_now() doesn't guarantee the spawned reader has reached notified.await on a contended runner. It's the same shape as Charon's runtime.Gosched() reliance, so this isn't unique to Pluto, but if you ever see a flake on CI this is the first thing to suspect. A tokio::time::sleep(50ms) or an explicit Notify-based "reader is parked" signal is more robust. Low priority — current behavior is correct, just fragile.

🟢 SignedDataSet IntoIterator yields &Box<dyn SignedData> (crates/core/src/types.rs:747-754) — Clippy's borrowed_box may eventually nudge to &dyn SignedData via .map(|(k, v)| (k, v.as_ref())). Not actionable until clippy actually fires; mentioning so it's not a surprise.

🔵 MemDB::new calls tokio::spawn (memory.rs:41) — flagged last round; still worth a # Panics rustdoc note since a non-test caller constructing this outside an async/Tokio context will panic. One line on the rustdoc.

🔵 Cargo.lock churn — the +241/-193 diff on Cargo.lock still looks like an unrelated dependency refresh (alloy, winnow, secp256k1, etc.) folded into this PR. No new direct deps were added by aggsigdb itself, so it's safe to land, but a follow-up cleanup PR splitting the lockfile bump from feature work would make bisects easier going forward.


Summary

Severity Item Status
🔴 Correction: tokio::sync::Mutex is required (lock crosses await)
🟡 keysByDuty index for O(k) eviction Open
🟡 Audit/comment that deadliner.add stays non-blocking New
🟡 wait_for could take &Duty / &PubKey Open
🟢 yield_now + is_finished test pattern is racy on contended CI New
🟢 IntoIterator yields &Box<dyn SignedData> — clippy may nudge New
🔵 MemDB::new # Panics rustdoc note Open
🔵 Cargo.lock churn separability Open

Nothing blocking. The keysByDuty optimization is the most worthwhile follow-up (it's a deliberate Charon design and ~10 lines of code); everything else is documentation/ergonomics.
· Branch: emlautarom1/core-aggsigdb

@emlautarom1
Copy link
Copy Markdown
Collaborator Author

There is a bug on Charon regarding immediately expired duties, which result in readers getting SignedData associated to an expired duty:

  1. Goroutine A calls Await(duty, pubKey) and blocks waiting for data.
  2. Goroutine B calls Store(duty, pubKey, data) for the same key.
  3. While executing the write, the deadliner determines the duty is already expired.
  4. The Await goroutine gets resolved with data tied to an expired duty.

Both V1 and V2 showcase this behaviour. The cause is that the result of deadliner.Add is ignored while also being imprecise:

  1. If the duty can be scheduled, proceed with the write.
  2. If the duty never expires, also proceed with the write.
  3. If the duty is already expired skip the write.

Right now, cases 2 and 3 are treated under false (note the existing TODO in Charon).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Implement core/aggsigdb

1 participant