Skip to content

Conversation

Copy link

Copilot AI commented Jan 23, 2026

Fix connection contention issue in http_utils.py - reuse ClientSession ✅

Summary

Fixed a critical connection contention issue where the Python Durable Functions SDK was creating a new aiohttp.ClientSession for every HTTP request to the internal RPC endpoint. This anti-pattern was causing intermittent ConnectionTimeoutErrors (~30s) under concurrent load, as observed in production (ICM 695094479).

Plan:

  • Explore repository structure and understand current implementation
  • Understand the problem: new ClientSession created per request causes connection contention
  • Design solution to reuse ClientSession with proper lifecycle management
  • Implement session singleton with lazy initialization and async lock
  • Add configurable timeout and connection pooling settings
  • Handle session cleanup and automatic reconnection on errors
  • Add comprehensive tests for the new session reuse functionality
  • Run existing tests to ensure no regressions (all 54 tests passing)
  • Run linters to ensure code quality (all checks passing)
  • Address code review feedback - refactor error handling
  • Run security checks (no vulnerabilities found)
  • Optimize settings for localhost IPC per reviewer feedback
  • Add try/finally blocks for safe session cleanup

Implementation Details:

  1. Session Singleton: Created global _client_session with async lock for thread-safe initialization
  2. Timeout Configuration: Optimized for localhost IPC - 240s total (4 minutes for slow operations like purge), 10s sock_connect (fast for localhost), sock_read=None
  3. Connection Pooling: TCPConnector with limit=30 (single host), limit_per_host=30
  4. Error Handling: Refactored into _handle_request_error() helper with session lock and try/finally for safe cleanup
  5. Lifecycle Management: Added _close_session() with session lock and try/finally for worker shutdown cleanup
  6. All three HTTP methods updated: POST, GET, DELETE now use shared session

Key considerations addressed:

  1. ✅ Thread-safe session initialization with async lock (double-check locking pattern)
  2. ✅ Session lifecycle management (cleanup on errors with lock protection)
  3. ✅ Connection limits optimized for localhost IPC (30 connections, no DNS cache)
  4. ✅ Handle remote host process recycles (automatic reconnect on connection errors)
  5. ✅ Code review feedback - reduced duplication via helper function
  6. ✅ Security checks passed - no vulnerabilities introduced
  7. ✅ Session locks added to error handler and close functions
  8. ✅ Try/finally blocks ensure session is reset even if close() fails

Test Coverage:

  • Session reuse across multiple requests ✅
  • Session recreation after close ✅
  • Session cleanup on connection errors ✅
  • All HTTP methods (GET, POST, DELETE) use shared session ✅
  • Timeout and connector configuration (updated for localhost IPC) ✅
  • Trace header propagation ✅

Security Summary:

No vulnerabilities found during CodeQL security analysis. The changes are safe and improve the reliability of the SDK under concurrent load.

Original prompt

This section details on the original issue you should resolve

<issue_title>Fix connection contention issue in http_utils.py - reuse ClientSession</issue_title>
<issue_description>## Summary

The Python Durable Functions SDK creates a new aiohttp.ClientSession for every HTTP request to the internal RPC endpoint. This is an anti-pattern that can cause connection contention and timeouts under concurrent load.

Problem

In http_utils.py, the post_async_request function creates a new session for each request:

python async def post_async_request(url: str, data: Any = None, ...) -> List[Union[int, Any]]: async with aiohttp.ClientSession() as session: # New session per request # ...

The aiohttp documentation explicitly warns against this pattern:

"Don't create a session per request. Most likely you need a session per application which performs all requests together."

Impact

During a production investigation (ICM 695094479), we observed intermittent ConnectionTimeoutError (~30s) when calling client.start_new() to start orchestrations. The error occurs in aiohttp/connector.py:_wrap_create_connection, indicating TCP connection establishment failures.

Under concurrent load (bursts of 6-9 simultaneous requests), multiple requests compete to establish new TCP connections instead of reusing pooled connections from a shared session.

Proposed Fix

Modify http_utils.py to reuse a single ClientSession with configurable timeout and connection pooling.

Considerations

  1. Thread safety - May need to use an async lock when initializing the session
  2. Session lifecycle - Need to handle session cleanup on worker shutdown
  3. Connection limits - The TCPConnector limits should be tuned appropriately

Additional Context

  • The 30s timeout matches aiohttp's default sock_connect timeout
  • Similar issues have been reported in the aiohttp community
    </issue_description>

<agent_instructions>Please fix the bug described in this issue. Make sure to follow the coding conventions of the existing code. Be mindful of any considerations around cached session lifecycle. For example, whether there's anything to do if the remote host process recycles while the client in the worker remains active.</agent_instructions>

Comments on the Issue (you are @copilot in this section)


💡 You can make Copilot smarter by setting up custom instructions, customizing its development environment and configuring Model Context Protocol (MCP) servers. Learn more Copilot coding agent tips in the docs.

Copilot AI and others added 3 commits January 23, 2026 23:09
Co-authored-by: cgillum <2704139+cgillum@users.noreply.github.com>
Co-authored-by: cgillum <2704139+cgillum@users.noreply.github.com>
Co-authored-by: cgillum <2704139+cgillum@users.noreply.github.com>
Copilot AI changed the title [WIP] Fix connection contention issue in http_utils.py Fix connection contention by reusing aiohttp ClientSession Jan 23, 2026
Copilot AI requested a review from cgillum January 23, 2026 23:14
…sion locks

Co-authored-by: cgillum <2704139+cgillum@users.noreply.github.com>
Copilot AI requested a review from cgillum January 24, 2026 00:17
Co-authored-by: cgillum <2704139+cgillum@users.noreply.github.com>
Copilot AI requested a review from cgillum January 24, 2026 00:31
@cgillum cgillum marked this pull request as ready for review January 24, 2026 00:46
Copy link
Collaborator

@andystaples andystaples left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved with one tiny nit comment

async def _close_session() -> None:
"""Close the shared ClientSession if it exists.
This function should be called during worker shutdown.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: This comment says we should call _close_session on worker shutdown, but it is only called by _handle_request_error. Can we add the call during shutdown or update the comment?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have a way to hook into the shutdown? If not, I'll just update the comment since process shutdown should clean up all resources anyways.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Fix connection contention issue in http_utils.py - reuse ClientSession

3 participants