Skip to content

fix: tighten oneshot memory ordering#111

Open
tisonkun wants to merge 4 commits into
mainfrom
codex/oneshot-ordering-fixes
Open

fix: tighten oneshot memory ordering#111
tisonkun wants to merge 4 commits into
mainfrom
codex/oneshot-ordering-fixes

Conversation

@tisonkun
Copy link
Copy Markdown
Collaborator

Summary

  • tighten oneshot sender/receiver state transitions to synchronize final channel ownership handoff
  • avoid racing with the sender when dropping a pending async receiver by first moving RECEIVING back to EMPTY
  • add concurrent send/drop with try_recv/poll regression tests inspired by upstream oneshot Miri coverage

Verification

  • cargo x test
  • cargo +nightly x lint

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Tightens the oneshot channel’s state-transition memory orderings to better synchronize message/waker visibility and final channel deallocation ownership, and adds regression tests covering concurrent send/drop vs try_recv/poll.

Changes:

  • Adjusted atomic orderings and added dedicated acquire fences around MESSAGE-handling paths to explicitly synchronize message visibility and final deallocation handoff.
  • Updated drop paths (Sender/Receiver/Recv) to better coordinate which side frees the channel under races.
  • Added concurrent send/drop completion tests for both try_recv and Future::poll paths.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
mea/src/oneshot/mod.rs Updates atomic orderings/fences and drop-path state transitions to strengthen synchronization and ownership handoff.
mea/src/oneshot/tests.rs Adds multi-threaded regression tests for concurrent send/drop vs try_recv/poll, plus a small thread helper.
Comments suppressed due to low confidence (3)

mea/src/oneshot/tests.rs:363

  • This test’s receiver thread loops with spin_loop() until Disconnected is observed; without any timeout/iteration cap, failures can turn into an infinite hang during join(). Consider adding a timeout/budget and failing fast if the expected state isn’t reached.
    let receiver_thread = spawn_named("receiver", move || {
        loop {
            match receiver.try_recv() {
                Ok(value) => panic!("unexpected value: {value}"),
                Err(TryRecvError::Empty) => spin_loop(),
                Err(TryRecvError::Disconnected) => break,
            }
        }

mea/src/oneshot/tests.rs:389

  • This test polls in a tight unbounded spin_loop() until completion. If the wake/state transition breaks, the test will hang indefinitely at join(). Add a bounded timeout/iteration budget (and panic on timeout) to keep CI reliable.
        loop {
            match Pin::new(&mut receiver).poll(&mut context) {
                Poll::Ready(Ok(999)) => break,
                Poll::Ready(result) => panic!("unexpected result: {result:?}"),
                Poll::Pending => spin_loop(),
            }
        }

mea/src/oneshot/tests.rs:415

  • Similar to the other concurrent tests, this thread spins in an unbounded loop until it observes the disconnect. A regression could cause an infinite hang. Add a timeout/iteration cap and fail fast rather than spinning forever.
        loop {
            match Pin::new(&mut receiver).poll(&mut context) {
                Poll::Ready(Err(oneshot::RecvError::Disconnected)) => break,
                Poll::Ready(result) => panic!("unexpected result: {result:?}"),
                Poll::Pending => spin_loop(),
            }
        }

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread mea/src/oneshot/mod.rs Outdated
Comment thread mea/src/oneshot/tests.rs Outdated
@tisonkun tisonkun marked this pull request as draft May 13, 2026 03:20
@orthur2
Copy link
Copy Markdown
Contributor

orthur2 commented May 13, 2026

Memory ordering here is tough to think through and guarantee correctness, so my review may not be very comprehensive.

First and foremost, following Copilot's review, Recv::drop needs acquire synchronization before dropping the stored waker. Channel::write_waker publishes the waker with a Release CAS, but the new RECEIVING -> EMPTY CAS in Recv::drop is Relaxed. Since Recv<T> is Send, the future may be polled on one thread and dropped on another, so Relaxed is too weak here. And the Recv::poll waker-replacement path also has the same issue.

Then, I found an issue that existed before this PR. I am concerned that the AWAKING branch in Recv::drop can still deallocate the channel too early. This path writes DISCONNECTED before it handles AWAKING. So if the sender has just moved the state to AWAKING and is still taking the waker, the drop path can read back its own DISCONNECTED write and free the channel too early.

As Copilot noted, the new spin-loop tests should also be bounded with a timeout or iteration budget.

Comment thread mea/src/oneshot/mod.rs Outdated
Use acquire ordering before dropping stored wakers in Recv::poll and Recv::drop.

When Recv::drop observes AWAKING, wait for the sender to finish the transition instead of writing DISCONNECTED first.

Also bound the spin-loop tests and add coverage for pending receiver drop and cross-thread waker replacement.
@orthur2
Copy link
Copy Markdown
Contributor

orthur2 commented May 22, 2026

I pushed a follow-up commit to address it.

@tisonkun tisonkun marked this pull request as ready for review May 22, 2026 14:24
Signed-off-by: tison <wander4096@gmail.com>
Copy link
Copy Markdown
Contributor

@orthur2 orthur2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM from my side.

@tisonkun
Copy link
Copy Markdown
Collaborator Author

I'm reviewing the comments and order. From agent review, it says:

Blockers:

  1. Please explain the Recv::drop interleaving around the RECEIVING -> EMPTY -> DISCONNECTED path. In particular: if a sender concurrently succeeds in send during the temporary EMPTY window, who owns dropping the message, the waker, and the allocation, and in what order? The current code may rely on the next loop iteration doing EMPTY -> DISCONNECTED, but that safety argument is not written down clearly enough for this kind of code.
  2. Fix the misleading comments in oneshot/mod.rs, especially the poll MESSAGE branch saying the sender was dropped when it actually sent a message, and the stale memory-barrier comment around RECEIVING -> EMPTY. In unsafe/atomic code, wrong comments are not cosmetic.

Non-blocking: the new 5s spin/concurrency tests are useful as smoke/stress tests, but they should not be presented as proof of the memory-ordering argument. A loom/model test would be a good follow-up if this code is staying subtle.

@orthur2
Copy link
Copy Markdown
Contributor

orthur2 commented May 23, 2026

I'm reviewing the comments and order. From agent review, it says:

Blockers:

  1. Please explain the Recv::drop interleaving around the RECEIVING -> EMPTY -> DISCONNECTED path. In particular: if a sender concurrently succeeds in send during the temporary EMPTY window, who owns dropping the message, the waker, and the allocation, and in what order? The current code may rely on the next loop iteration doing EMPTY -> DISCONNECTED, but that safety argument is not written down clearly enough for this kind of code.
  2. Fix the misleading comments in oneshot/mod.rs, especially the poll MESSAGE branch saying the sender was dropped when it actually sent a message, and the stale memory-barrier comment around RECEIVING -> EMPTY. In unsafe/atomic code, wrong comments are not cosmetic.

Non-blocking: the new 5s spin/concurrency tests are useful as smoke/stress tests, but they should not be presented as proof of the memory-ordering argument. A loom/model test would be a good follow-up if this code is staying subtle.

The agent review makes sense. I focused on the state-machine change and missed that some existing comments became stale or misleading.

I added a small follow-up to fix the MESSAGE / RECEIVING -> EMPTY comments and spell out the Recv::drop case a bit more.

For the temporary EMPTY window, after Recv::drop successfully CAS RECEIVING -> EMPTY with Acquire, it has retracted the published waker and owns dropping that waker. The Acquire CAS synchronizes with write_waker's Release CAS, so the waker is initialized before drop_waker().If a sender's send transition wins during this temporary EMPTY window, it observes EMPTY, publishes MESSAGE, and returns Ok. Because it did not observe RECEIVING, it never takes or drops the waker, and because send succeeded from EMPTY, it does not own deallocating the channel. Recv::drop still owns the retracted waker, drops it, and loops. Once the sender has published MESSAGE, Recv::drop will observe MESSAGE with Acquire, drop the message, and deallocate the channel.If no sender wins that temporary EMPTY window, Recv::drop's next loop iteration can complete EMPTY -> DISCONNECTED. In that case, any later cleanup is owned by the sender side: either Sender::send observes DISCONNECTED and the SendError path owns the message/allocation cleanup, or Sender::drop observes DISCONNECTED and deallocates the channel.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants