Skip to content

fix(tcpclient, iostream): close SSLIOStream on TLS handshake timeout to prevent fd leak#3636

Closed
armorbreak001 wants to merge 1 commit into
tornadoweb:masterfrom
armorbreak001:fix/tls-handshake-timeout-socket-leak
Closed

fix(tcpclient, iostream): close SSLIOStream on TLS handshake timeout to prevent fd leak#3636
armorbreak001 wants to merge 1 commit into
tornadoweb:masterfrom
armorbreak001:fix/tls-handshake-timeout-socket-leak

Conversation

@armorbreak001

Copy link
Copy Markdown

Problem (tornado#3614)

When TCPClient.connect() is called with both ssl_options and timeout, a TLS handshake timeout causes permanent file descriptor leak.

Root Cause

start_tls() transfers socket ownership in three steps:

  1. Extract raw socket from original IOStreamself.socket = None
  2. Wrap in SSL → create new SSLIOStream (local variable only)
  3. Return Future[SSLIOStream] that resolves when handshake completes

When gen.with_timeout() fires:

  • Caller gets TimeoutError
  • Original stream's .socket is already Noneclose() is no-op
  • The SSLIOStream (holding the real fd) is only reachable through the inner Future
  • gen.with_timeout explicitly does not cancel the wrapped Future
  • SSLIOStream stays on IOLoop forever → fd leaked permanently

Impact

Every timed-out TLS connection leaks one file descriptor. Long-running services that retry connections (e.g. HTTP clients with short timeouts) will eventually hit the OS fd limit (EMFILE / "too many open files").

Fix

Two files, minimal change:

tornado/iostream.py

In start_tls(): store the SSLIOStream as self._tls_stream before returning the future.

self._tls_stream = ssl_stream  # Keep reference for timeout cleanup

tornado/tcpclient.py

In connect(): on TimeoutError during start_tls(), close _tls_stream.

except gen.TimeoutError:
    tls_stream = getattr(stream, "_tls_stream", None)
    if tls_stream is not None:
        tls_stream.close()
        stream._tls_stream = None
    raise

Testing

New test: TestTLSHandshakeTimeoutCleanup.test_tls_handshake_timeout_closes_stream

  • Opens TCP server socket (no TLS accept)
  • Calls TCPClient.connect() with SSL + 0.01s timeout
  • Expects TimeoutError
  • Verifies no fd leaked via /proc/self/fd count

All existing tests pass:

  • tcpclient_test.py: ✅ 27 passed, 4 skipped
  • iostream_test.py: ✅ 142 passed, 61 skipped

Fixes #3614

…to prevent fd leak (tornado#3614)

When TCPClient.connect() times out during the TLS handshake phase,
start_tls() has already transferred socket ownership to a new
SSLIOStream and set the original stream's socket to None. The caller's
TimeoutError handler could only call stream.close(), which is a no-op
since the socket is already gone. Meanwhile the SSLIOStream (holding
the real fd) remains registered on the IOLoop forever — leaking the
file descriptor.

Fix:
- In IOStream.start_tls(): store the new SSLIOStream as self._tls_stream
  so callers can reach it after timeout
- In TCPClient.connect(): on gen.with_timeout TimeoutError during
  start_tls(), close stream._tls_stream to release the fd
- Add regression test: TestTLSHandshakeTimeoutCleanup verifies
  no file descriptor leak after TLS handshake timeout

This is a targeted fix with minimal API surface change. The _tls_stream
attribute is only set during start_tls() and cleared after cleanup.

@bdarnell bdarnell left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is intended to replace #3615, right? Please update existing PRs instead of starting new ones. And make sure that at least the checks run in the "quick" ci config pass before pushing changes.

Comment thread tornado/tcpclient.py
# start_tls() transferred socket ownership to a new
# SSLIOStream (self._tls_stream). Close it to prevent
# a file descriptor leak. See tornado#3614.
tls_stream = getattr(stream, "_tls_stream", None)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels dirty - we don't cancel the coroutine, so it could wake back up and find that its internal state (i.e. self._tls_stream) has been modified.

I think in this case it's all more or less harmless (except for the logging that is currently failing CI) because we don't rely on self._tls_stream after setting it and closing the socket will cause it to generate an internal StreamClosedError, but it's difficult to reason about.

I think it would probably be better to push awareness of the timeout into start_tls() so that we don't handle the timeout at a different level from other errors. Alternately, cancelling the task might do the job more cleanly for us (or at least it's supposed to. Tornado was originally built with non-cancelable futures and I'm not sure how much cancellation works now that we've picked that feature up from asyncio).

ctx.check_hostname = False
ctx.verify_mode = _ssl.CERT_NONE

# Count open fds before (Linux /proc/self/fd)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is platform-specific and at least needs to be guarded.

Is counting open FDs the best way to do this? This might be a time when mocks would be most appropriate, but that seems messy too.


@gen_test
def test_tls_handshake_timeout_closes_stream(self) -> None:
import ssl as _ssl

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why?

@armorbreak001

Copy link
Copy Markdown
Author

Closing in favor of #3615, which has been updated with the improved approach (tls_future.cancel() + _tls_stream cleanup + platform guard for test). Thanks for the review @bdarnell!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Socket leak in TCPClient.connect when TLS handshake times out

2 participants