Fix that NewClientHandler() could hang indefinitely, preventing new connection attempts#10833
Open
julianbrost wants to merge 1 commit into
Open
Fix that NewClientHandler() could hang indefinitely, preventing new connection attempts#10833julianbrost wants to merge 1 commit into
julianbrost wants to merge 1 commit into
Conversation
…onnection attempts
There is some race condition when the `async_write()`/`async_flush()` operation
for the `icinga::Hello` message fails (connection reset by peer for example)
around the same time the connect timeout fires and calls `cancel()` on the
stream, the following call to `async_shutdown()` may block indefinitely. If
that happens, the endpoint remains in the connecting state and no new
connection attemps are initiated.
This commit fixes the issue by removing the `Defer` containing the
`async_shutdown()`. The purpose of `async_shutdown()` is to signal a clean
termination of the connection to the peer, which really isn't something that
makes sense to to in a `Defer` block that is also executed in case of errors.
For the one situation where doing a clean TLS shutdown makes some sense
(closing anonymous client connections), a call to GracefulShutdown() is added
to that specific code path.
A large part of the change is just changing the indentation of the code, given
that a now unnecessary `try`/`catch` block is removed.
The following Go code creates a TLS server that can be used to demonstrate the
issue. Note that given that a race condition is involved, this is not reliable
and the sleep duration may need some fine-tuning. For this to work,
`ApiListener.tls_handshake_timeout` needs to be set to a large-enough value
like 60s to disable the timeout for `async_handshake()` itself so that the
overall connect timeout is the one that fires. However, changing the timeout is
not a prerequisite for the problem, it just makes it easier to reproduce. The
error can also happen with the default timeouts if the TCP connect takes long
enough so that the handshake is started late enough that its timeout expires
after the connect timeout.
package main
import (
"crypto/tls"
"log"
"net"
"time"
)
func main() {
cert, err := tls.LoadX509KeyPair("bad-agent.crt", "bad-agent.key")
if err != nil {
panic(err)
}
listener, err := tls.Listen("tcp", ":1337", &tls.Config{
Certificates: []tls.Certificate{cert},
})
if err != nil {
panic(err)
}
log.Println("Listening on", listener.Addr())
for {
conn, err := listener.Accept()
if err != nil {
panic(err)
}
go handle(conn.(*tls.Conn))
}
}
func handle(conn *tls.Conn) {
addr := conn.RemoteAddr().String()
log.Println(addr, "new connection")
time.Sleep(15*time.Second - 10*time.Millisecond)
log.Println(addr, "SetLinger(0)", conn.NetConn().(*net.TCPConn).SetLinger(0))
log.Println(addr, "Handshake()", conn.Handshake())
log.Println(addr, "conn.NetConn().Close()", conn.NetConn().Close())
}
With additional logging in the `catch` block for `boost::system::system_error`
and `Defer shutdownSslConn` (both removed by this commit), this showed the
following. Note that in particular, `async_shutdown()` never returned,
indicating that it hangs in there.
[2026-04-24 17:32:56 +0200] information/ApiListener: Reconnecting to endpoint 'bad-agent' via host 'host.docker.internal' and port '1337'
[2026-04-24 17:33:11 +0200] critical/ApiListener: Timeout while reconnecting to endpoint 'bad-agent' via host 'host.docker.internal' and port '1337', cancelling attempt
[2026-04-24 17:33:11 +0200] information/ApiListener: New client connection for identity 'bad-agent' to [172.17.0.1]:1337
[2026-04-24 17:33:12 +0200] information/ApiListener: rethrowing for bad-agent: Error: Connection reset by peer [system:104 at /usr/include/boost/asio/detail/reactive_socket_send_op.hpp:137 in function 'do_complete']
[2026-04-24 17:33:12 +0200] information/ApiListener: doing async_shutdown for bad-agent
417c3c1 to
3302c9b
Compare
Member
Author
|
Just a rebase to make the GitHub Actions happy again after fixes to them, no code changes inside the PR. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
There is some race condition when the
async_write()/async_flush()operation for theicinga::Hellomessage fails (connection reset by peer for example) around the same time the connect timeout fires and callscancel()on the stream, the following call toasync_shutdown()may block indefinitely. If that happens, the endpoint remains in the connecting state and no new connection attemps are initiated.This commit fixes the issue by removing the
Defercontaining theasync_shutdown(). The purpose ofasync_shutdown()is to signal a clean termination of the connection to the peer, which really isn't something that makes sense to to in aDeferblock that is also executed in case of errors. For the one situation where doing a clean TLS shutdown makes some sense (closing anonymous client connections), a call to GracefulShutdown() is added to that specific code path.A large part of the change is just changing the indentation of the code, given that a now unnecessary
try/catchblock is removed.Testing
The following Go code creates a TLS server that can be used to demonstrate the issue. Note that given that a race condition is involved, this is not reliable and the sleep duration may need some fine-tuning. For this to work,
ApiListener.tls_handshake_timeoutneeds to be set to a large-enough value like 60s to disable the timeout forasync_handshake()itself so that the overall connect timeout is the one that fires. However, changing the timeout is not a prerequisite for the problem, it just makes it easier to reproduce. The error can also happen with the default timeouts if the TCP connect takes long enough so that the handshake is started late enough that its timeout expires after the connect timeout.Obviously, with that server, you will never get a persistent working JSON-RPC connection, so Icinga 2 should continously try to reconnect. However, when running this with the current
masterbranch (75361cb), the reconnect attempts will eventually stop.With additional logging in the
catchblock forboost::system::system_errorandDefer shutdownSslConn(both removed by this commit), this showed the following. Note that in particular,async_shutdown()never returned, indicating that it hangs in there.When doing the same with this PR branch (417c3c1), Icinga 2 keeps trying to reconnect (which kind of makes sense given that the function call that was hanging does not exist any more).
Acknowledgements
Big thanks to @yhabteab who pointed me in the right direction with a theory what might be going on, finally allowing me to actually reproduce the problem.
References
fixes #10139