
Fix that NewClientHandler() could hang indefinitely, preventing new connection attempts #10833

Open
julianbrost wants to merge 1 commit into master from connection-hang

Conversation

@julianbrost
Member

There is a race condition: when the `async_write()`/`async_flush()` operation for the `icinga::Hello` message fails (connection reset by peer, for example) around the same time the connect timeout fires and calls `cancel()` on the stream, the following call to `async_shutdown()` may block indefinitely. If that happens, the endpoint remains in the connecting state and no new connection attempts are initiated.
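For context, the pattern involved looks roughly like the following sketch (an assumed shape for illustration, not the actual Icinga 2 code): the connect timeout is a timer whose handler calls `cancel()` on the stream. `cancel()` only aborts operations that are pending at that moment, so an `async_shutdown()` issued afterwards runs with no timeout covering it and can wait forever for the peer's close_notify.

```cpp
#include <boost/asio.hpp>
#include <chrono>

int main()
{
    boost::asio::io_context io;
    boost::asio::ip::tcp::socket socket(io);
    boost::asio::steady_timer connectTimeout(io, std::chrono::seconds(15));

    // Connect-timeout handler (sketch): abort whatever I/O is pending on
    // the socket once the deadline expires.
    connectTimeout.async_wait([&socket](const boost::system::error_code& ec) {
        if (!ec && socket.is_open()) {
            // Only operations pending *right now* are aborted; an
            // async_shutdown() started later (e.g. from a Defer running
            // during stack unwinding) is not covered by this timeout.
            socket.cancel();
        }
    });

    io.run();
}
```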

This commit fixes the issue by removing the `Defer` containing the `async_shutdown()`. The purpose of `async_shutdown()` is to signal a clean termination of the connection to the peer, which isn't something that makes sense to do in a `Defer` block that is also executed in case of errors. For the one situation where a clean TLS shutdown makes some sense (closing anonymous client connections), a call to `GracefulShutdown()` is added to that specific code path.
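As an illustration of why a `Defer` is the wrong place for this, here is a self-contained sketch with a minimal stand-in for icinga::Defer (the class and the `shutdownSslConn` name come from the code this PR removes; everything else is invented for the demo): the deferred function also runs when the scope is left via an exception, i.e. exactly on the error paths where the stream may already be broken or cancelled.

```cpp
#include <functional>
#include <iostream>
#include <stdexcept>
#include <utility>

// Minimal stand-in for icinga::Defer: runs the given function when the
// enclosing scope is left, on every exit path, including exceptions.
class Defer
{
public:
    explicit Defer(std::function<void()> fn) : m_Fn(std::move(fn)) { }
    ~Defer() { if (m_Fn) m_Fn(); }

private:
    std::function<void()> m_Fn;
};

void NewClientHandlerLike(bool fail)
{
    Defer shutdownSslConn([] {
        // Before this PR, the async_shutdown() call lived here, so the
        // clean TLS shutdown was also attempted after errors, against a
        // stream the timeout handler may just have cancel()ed.
        std::cout << "deferred shutdown runs\n";
    });

    if (fail)
        throw std::runtime_error("connection reset by peer");
}

int main()
{
    try {
        NewClientHandlerLike(true);
    } catch (const std::exception& ex) {
        std::cout << "caught: " << ex.what() << "\n";
    }
}
```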

A large part of the change is just re-indentation of existing code, since a now-unnecessary `try`/`catch` block is removed.

Testing

The following Go code creates a TLS server that can be used to demonstrate the issue. Note that since a race condition is involved, reproduction is not reliable and the sleep duration may need some fine-tuning. For this to work, `ApiListener.tls_handshake_timeout` needs to be set to a large enough value like 60s to effectively disable the timeout for `async_handshake()` itself, so that the overall connect timeout is the one that fires. However, changing the timeout is not a prerequisite for the problem, it just makes it easier to reproduce: the error can also happen with the default timeouts if the TCP connect takes long enough that the handshake starts late enough for its timeout to expire after the connect timeout.
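(For example, that would be something like `tls_handshake_timeout = 60` on the `ApiListener` object; this exact syntax is an assumption, see the ApiListener documentation.)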

```go
package main

import (
    "crypto/tls"
    "log"
    "net"
    "time"
)

func main() {
    // Certificate and key presented to the connecting Icinga 2 instance.
    cert, err := tls.LoadX509KeyPair("bad-agent.crt", "bad-agent.key")
    if err != nil {
        panic(err)
    }

    listener, err := tls.Listen("tcp", ":1337", &tls.Config{
        Certificates: []tls.Certificate{cert},
    })
    if err != nil {
        panic(err)
    }

    log.Println("Listening on", listener.Addr())

    for {
        conn, err := listener.Accept()
        if err != nil {
            panic(err)
        }

        go handle(conn.(*tls.Conn))
    }
}

func handle(conn *tls.Conn) {
    addr := conn.RemoteAddr().String()
    log.Println(addr, "new connection")

    // Wait until just before Icinga 2's 15s connect timeout fires, so the
    // failing write races the cancel() on the stream.
    time.Sleep(15*time.Second - 10*time.Millisecond)

    // SetLinger(0) makes the final Close() send a TCP RST, which is what
    // surfaces as "connection reset by peer" on the Icinga 2 side.
    log.Println(addr, "SetLinger(0)", conn.NetConn().(*net.TCPConn).SetLinger(0))
    log.Println(addr, "Handshake()", conn.Handshake())
    log.Println(addr, "conn.NetConn().Close()", conn.NetConn().Close())
}
```
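For reference (not part of the PR): any certificate/key pair the server can load works as bad-agent.crt/bad-agent.key; a throwaway self-signed one can be generated with e.g. `openssl req -x509 -newkey rsa:2048 -nodes -keyout bad-agent.key -out bad-agent.crt -subj "/CN=bad-agent"`.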

Obviously, with that server, you will never get a persistent, working JSON-RPC connection, so Icinga 2 should continuously try to reconnect. However, when running this against the current master branch (75361cb), the reconnect attempts will eventually stop.

With additional logging in the `catch` block for `boost::system::system_error` and in `Defer shutdownSslConn` (both removed by this commit), this showed the following. Note in particular that `async_shutdown()` never returned, indicating that it hangs there.

```
[2026-04-24 17:32:56 +0200] information/ApiListener: Reconnecting to endpoint 'bad-agent' via host 'host.docker.internal' and port '1337'
[2026-04-24 17:33:11 +0200] critical/ApiListener: Timeout while reconnecting to endpoint 'bad-agent' via host 'host.docker.internal' and port '1337', cancelling attempt
[2026-04-24 17:33:11 +0200] information/ApiListener: New client connection for identity 'bad-agent' to [172.17.0.1]:1337
[2026-04-24 17:33:12 +0200] information/ApiListener: rethrowing for bad-agent: Error: Connection reset by peer [system:104 at /usr/include/boost/asio/detail/reactive_socket_send_op.hpp:137 in function 'do_complete']
[2026-04-24 17:33:12 +0200] information/ApiListener: doing async_shutdown for bad-agent
```
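The `system:104` (ECONNRESET) in the second-to-last line matches the RST produced by the server's `SetLinger(0)` + `Close()`, and the last line is the `Defer` entering the `async_shutdown()` that never returns.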

When doing the same with this PR branch (417c3c1), Icinga 2 keeps trying to reconnect (which makes sense, given that the function call that was hanging no longer exists).

Acknowledgements

Big thanks to @yhabteab, who pointed me in the right direction with a theory of what might be going on, finally allowing me to actually reproduce the problem.

References

fixes #10139

@julianbrost added the bug (Something isn't working), area/distributed (Distributed monitoring (master, satellites, clients)), ref/IP and consider backporting (Should be considered for inclusion in a bugfix release) labels on Apr 30, 2026
@cla-bot added the cla/signed label on Apr 30, 2026
@julianbrost added this to the 2.17.0 milestone on Apr 30, 2026
@julianbrost
Member Author

Just a rebase to make the GitHub Actions happy again after fixes to them, no code changes inside the PR.



Development

Successfully merging this pull request may close these issues.

Possible race condition with Boost.ASIO cancel() in timeout handling
