Some pipeline fixes in Bitreq#584
Conversation
    -    request_id = conn.next_request_id.fetch_add(1, Ordering::Relaxed);
    +    request_id = conn
    +        .next_request_id
What kind of weird platform are you running on that you overflowed this? ISTM the right solution is just to make it an `AtomicU64` for the broken machines trying to use this on 32-bit.
I am using amd64; I don't see how this can be platform related.
When some failures occur, `conn.next_request_id` is set to `usize::MAX`.
I understand the intent of setting `usize::MAX` is to stop using this connection.
However, `AtomicUsize::fetch_add` wraps on overflow on every platform according to the docs.
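For illustration, a tiny standalone sketch (not bitreq code) of the wrap: an atomic `fetch_add` on a counter parked at `usize::MAX` silently wraps to 0, regardless of platform or debug assertions.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

fn main() {
    // Counter parked at the usize::MAX sentinel value.
    let next_request_id = AtomicUsize::new(usize::MAX);

    // fetch_add wraps on overflow: the returned previous value is usize::MAX
    // and the stored value becomes 0.
    let prev = next_request_id.fetch_add(1, Ordering::Relaxed);
    assert_eq!(prev, usize::MAX);
    assert_eq!(next_request_id.load(Ordering::Relaxed), 0);
}
```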
Oh sorry, I misread this as counting all the way up to overflow; I forgot it was writing MAX elsewhere.
    -    request_id = conn.next_request_id.fetch_add(1, Ordering::Relaxed);
    +    request_id = conn
    +        .next_request_id
    +        .fetch_update(Ordering::Relaxed, Ordering::Relaxed, |current| {
Meh, wouldn't this be simpler by just changing the sentinel to `usize::MAX / 2` and accepting that the race isn't going to manage to run through `usize::MAX / 2` elements before the first thread can get into `retry_new_connection` and re-write the sentinel value? That avoids the read+CAS in the optimistic case.
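A rough sketch of that suggestion, assuming a hypothetical `SENTINEL` constant and helper names rather than bitreq's actual code:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical half-range sentinel: any id at or above it means "connection dead".
const SENTINEL: usize = usize::MAX / 2;

/// Optimistic path: a plain fetch_add plus a range check, no read+CAS loop.
fn next_id(next_request_id: &AtomicUsize) -> Option<usize> {
    let id = next_request_id.fetch_add(1, Ordering::Relaxed);
    if id >= SENTINEL {
        // The connection was marked dead; it would take usize::MAX / 2 racing
        // increments to overflow before the sentinel gets rewritten.
        None
    } else {
        Some(id)
    }
}

/// Poison the connection by parking the counter at the sentinel.
fn mark_dead(next_request_id: &AtomicUsize) {
    next_request_id.store(SENTINEL, Ordering::Relaxed);
}
```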
Force-pushed from 2f25cfc to 729cbd6
Adds an integration test that spins up a raw-TCP server returning a malformed HTTP response with `keep-alive: max 1`. Marked `#[ignore]` for now: it hangs on the current code. Subsequent commits in this series fix the underlying bugs, and the final commit removes the `#[ignore]`.

HINT: You can run `cargo test -p bitreq --features async --test pipelining -- --ignored` to run this test and verify that it fails on master.
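Such a fixture can be built on `std::net` alone; the sketch below is only a guess at its shape (the function name, canned response bytes, and one-reply-per-read behaviour are assumptions, not the actual test in this PR):

```rust
use std::io::{Read, Write};
use std::net::{SocketAddr, TcpListener};
use std::thread;

/// Spawn a throwaway server that answers every read with a canned response
/// carrying the malformed `keep-alive: max 1` header and keeps the
/// connection open.
fn spawn_fake_server() -> SocketAddr {
    let listener = TcpListener::bind("127.0.0.1:0").expect("bind");
    let addr = listener.local_addr().expect("local_addr");
    thread::spawn(move || {
        for stream in listener.incoming() {
            let Ok(mut stream) = stream else { continue };
            thread::spawn(move || {
                let mut buf = [0u8; 4096];
                // Naively treat each successful read as one full request.
                while let Ok(n) = stream.read(&mut buf) {
                    if n == 0 {
                        break;
                    }
                    let response = b"HTTP/1.1 200 OK\r\n\
                        content-length: 2\r\n\
                        keep-alive: max 1\r\n\
                        connection: keep-alive\r\n\
                        \r\nok";
                    let _ = stream.write_all(response);
                }
            });
        }
    });
    addr
}
```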
Pipelined `send_async` calls on a single `Client` could hang indefinitely on responses the server had already sent. `Response::create_async` wrapped its input in a fresh `BufReader`. On a busy socket, that buffer could prefetch bytes for the next pipelined response; when the function returned, those bytes were dropped with it, stranding the next reader. Move the `BufReader` into `AsyncConnectionState::read` so it lives as long as the connection.
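The underlying pitfall is easy to reproduce with plain `std::io` types; a self-contained sketch (not bitreq code) of how a per-response `BufReader` strands prefetched bytes:

```rust
use std::io::{BufRead, BufReader, Cursor, Read};

fn main() {
    // Two pipelined "responses" sitting in one stream.
    let mut stream = Cursor::new(b"response-1\nresponse-2\n".to_vec());

    {
        // Anti-pattern: a temporary BufReader per response.
        let mut tmp = BufReader::new(&mut stream);
        let mut first = String::new();
        tmp.read_line(&mut first).unwrap();
        assert_eq!(first, "response-1\n");
        // The BufReader prefetched the rest of the stream into its internal
        // buffer; those bytes are dropped here together with `tmp`.
    }

    // The underlying stream is already past everything that was prefetched,
    // so the second response is gone for the next reader.
    let mut rest = String::new();
    stream.read_to_string(&mut rest).unwrap();
    assert_eq!(rest, "");
}
```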
Force-pushed from 729cbd6 to ee002eb
I added a test to reproduce some of the issues I observed. The old code set `next_request_id` to `usize::MAX` as a sentinel; I switched the approach a bit.
    readable_request_id: AtomicUsize,
    /// Set when the read half is in an indeterminate state and pending pipelined readers must
    /// abort and retry on a fresh connection.
    reads_dead: AtomicBool,
Hmm, why should we not just keep using `readable_request_id`? I'm not sure "more atomics" is really more readable, since it now means we have to carefully review for cases where the atomic reads split-brain.
You are right. It doesn't help readability.
    /// Remaining permits for sending new requests. Acquired via `fetch_sub(1)`; an underflow is
    /// detected as `prev == 0 || prev > MAX_SEND_PERMITS`. Poisoned by storing `0`; capped by
    /// keep-alive `max=N` via `fetch_min(N)`.
    permits: AtomicUsize,
This would probably be more readable as an `AtomicIsize`, where we can just check against 0.
Thanks, that is a very good suggestion.
Track per-connection send admission in a dedicated `permits` counter on `AsyncConnectionState`.

`permits` starts at `usize::MAX / 2` and is consumed via `fetch_sub(1)`; a wrap (a returned previous value of 0, or one already greater than `MAX_SEND_PERMITS`) signals that the connection is exhausted or poisoned. Poisoning becomes `permits.store(0)`, and the keep-alive `max=N` directive maps to `permits.fetch_min(N)`. `next_request_id` is now a pure monotonic counter used only for pipelining order. Indeterminate read state moves to a sibling `reads_dead` `AtomicBool` so `readable_request_id` can also stay a pure counter.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
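A standalone sketch of that permit scheme (struct and method names here are assumptions, not bitreq's API; `MAX_SEND_PERMITS` mirrors the doc comment quoted above):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Initial permit budget; also the threshold used to detect a wrapped counter.
const MAX_SEND_PERMITS: usize = usize::MAX / 2;

struct ConnState {
    /// Remaining permits for sending new requests on this connection.
    permits: AtomicUsize,
}

impl ConnState {
    fn new() -> Self {
        Self { permits: AtomicUsize::new(MAX_SEND_PERMITS) }
    }

    /// Optimistically take one permit; `false` means exhausted or poisoned.
    fn try_acquire_send_permit(&self) -> bool {
        let prev = self.permits.fetch_sub(1, Ordering::Relaxed);
        // prev == 0: this call wrapped the counter (poisoned or exhausted).
        // prev > MAX_SEND_PERMITS: an earlier call already wrapped it.
        !(prev == 0 || prev > MAX_SEND_PERMITS)
    }

    /// Poison the connection: no further sends are admitted.
    fn poison(&self) {
        self.permits.store(0, Ordering::Relaxed);
    }

    /// Apply a `keep-alive: max=N` cap learned from the server.
    fn cap_to(&self, n: usize) {
        self.permits.fetch_min(n, Ordering::Relaxed);
    }
}
```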
Force-pushed from ee002eb to 11098da
We switched recently to bitcoind-async-client and I saw some panics and requests getting stuck. See #583.

I identified two issues:

- overflow of `next_request_id`, which results in panics
- pipelined responses being dropped by a temporary `BufReader`, which leaves requests stuck

I didn't take the time to get deeply familiar with how bitreq works.
I'm happy to see suggestions on other approaches to fix the problem.