fix(tcpclient, iostream): close SSLIOStream on TLS handshake timeout to prevent fd leak#3636
Conversation
…to prevent fd leak (tornado#3614) When TCPClient.connect() times out during the TLS handshake phase, start_tls() has already transferred socket ownership to a new SSLIOStream and set the original stream's socket to None. The caller's TimeoutError handler could only call stream.close(), which is a no-op since the socket is already gone. Meanwhile the SSLIOStream (holding the real fd) remains registered on the IOLoop forever — leaking the file descriptor. Fix: - In IOStream.start_tls(): store the new SSLIOStream as self._tls_stream so callers can reach it after timeout - In TCPClient.connect(): on gen.with_timeout TimeoutError during start_tls(), close stream._tls_stream to release the fd - Add regression test: TestTLSHandshakeTimeoutCleanup verifies no file descriptor leak after TLS handshake timeout This is a targeted fix with minimal API surface change. The _tls_stream attribute is only set during start_tls() and cleared after cleanup.
| # start_tls() transferred socket ownership to a new | ||
| # SSLIOStream (self._tls_stream). Close it to prevent | ||
| # a file descriptor leak. See tornado#3614. | ||
| tls_stream = getattr(stream, "_tls_stream", None) |
There was a problem hiding this comment.
This feels dirty - we don't cancel the coroutine, so it could wake back up and find that its internal state (i.e. self._tls_stream) has been modified.
I think in this case it's all more or less harmless (except for the logging that is currently failing CI) because we don't rely on self._tls_stream after setting it and closing the socket will cause it to generate an internal StreamClosedError, but it's difficult to reason about.
I think it would probably be better to push awareness of the timeout into start_tls() so that we don't handle the timeout at a different level from other errors. Alternately, cancelling the task might do the job more cleanly for us (or at least it's supposed to. Tornado was originally built with non-cancelable futures and I'm not sure how much cancellation works now that we've picked that feature up from asyncio).
| ctx.check_hostname = False | ||
| ctx.verify_mode = _ssl.CERT_NONE | ||
|
|
||
| # Count open fds before (Linux /proc/self/fd) |
There was a problem hiding this comment.
This is platform-specific and at least needs to be guarded.
Is counting open FDs the best way to do this? This might be a time when mocks would be most appropriate, but that seems messy too.
|
|
||
| @gen_test | ||
| def test_tls_handshake_timeout_closes_stream(self) -> None: | ||
| import ssl as _ssl |
Problem (tornado#3614)
When
TCPClient.connect()is called with bothssl_optionsandtimeout, a TLS handshake timeout causes permanent file descriptor leak.Root Cause
start_tls()transfers socket ownership in three steps:IOStream→self.socket = NoneSSLIOStream(local variable only)Future[SSLIOStream]that resolves when handshake completesWhen
gen.with_timeout()fires:TimeoutError.socketis alreadyNone→close()is no-opSSLIOStream(holding the real fd) is only reachable through the inner Futuregen.with_timeoutexplicitly does not cancel the wrapped FutureImpact
Every timed-out TLS connection leaks one file descriptor. Long-running services that retry connections (e.g. HTTP clients with short timeouts) will eventually hit the OS fd limit (
EMFILE/ "too many open files").Fix
Two files, minimal change:
tornado/iostream.pyIn
start_tls(): store the SSLIOStream asself._tls_streambefore returning the future.tornado/tcpclient.pyIn
connect(): onTimeoutErrorduringstart_tls(), close_tls_stream.Testing
New test:
TestTLSHandshakeTimeoutCleanup.test_tls_handshake_timeout_closes_streamTCPClient.connect()with SSL + 0.01s timeoutTimeoutError/proc/self/fdcountAll existing tests pass:
tcpclient_test.py: ✅ 27 passed, 4 skippediostream_test.py: ✅ 142 passed, 61 skippedFixes #3614