Unable to update the BottlerocketShadow associated with this node #801

Description

@Mieszko96

Intermittent "error sending request" timeout between brupop agents and API server

Summary

brupop agents (v1.6.0) sporadically fail to send requests to the API server with a generic "error sending request for url" error, recurring approximately every 30 seconds. The errors occur despite successful requests moments before.

Environment

  • brupop version: 1.6.0
  • brupop-crd version: 1.0.0
  • Kubernetes: EKS
  • Node OS: Bottlerocket 1.52.0 - 1.53.0
  • Agent Pod Count: 3 (DaemonSet)
  • API Server Replicas: 3
  • Namespace: brupop-bottlerocket-aws

Problem Description

Agents fail with:

'error sending request for url (https://brupop-apiserver.brupop-bottlerocket-aws.svc.cluster.local/bottlerocket-node-resource)'

The error pattern:

  1. Agent successfully connects to API server multiple times (3-6 successful requests)
  2. Connection attempt fails with "error sending request"
  3. Error repeats periodically (~every 30 seconds)
  4. Agents retry and eventually recover (a sketch of this retry behavior follows this list)
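
The retry-and-recover behavior suggests a client-side retry loop around the shadow-update request. Below is a minimal Rust sketch of such a loop, purely illustrative: the function name, attempt count, and backoff values are assumptions, not brupop's actual implementation, and reqwest/tokio are assumed here only because the agent is written in Rust.

use std::time::Duration;

// Hypothetical retry wrapper; names and backoff values are illustrative,
// not taken from brupop's source.
async fn post_with_retry(
    client: &reqwest::Client,
    url: &str,
    body: String,
) -> reqwest::Result<reqwest::Response> {
    let mut delay = Duration::from_secs(1);
    for _attempt in 0..5 {
        match client.post(url).body(body.clone()).send().await {
            // Success: hand the response back immediately.
            Ok(resp) => return Ok(resp),
            // Transport-level failures ("error sending request for url" is
            // the Display text of this class of reqwest errors): back off
            // and retry.
            Err(e) if e.is_timeout() || e.is_connect() || e.is_request() => {
                tokio::time::sleep(delay).await;
                delay = (delay * 2).min(Duration::from_secs(30));
            }
            // Anything else is treated as non-retryable.
            Err(e) => return Err(e),
        }
    }
    // Final attempt; propagate whatever it returns.
    client.post(url).body(body).send().await
}

With a loop like this, a transient transport failure surfaces as a periodic ERROR line and then disappears once a retry succeeds, which matches the observed pattern.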

Investigation Results

What was tried:

  1. Infrastructure checks ✅ All healthy:

    • All pods Running, no errors in API server logs
    • Service endpoints: 3 available
    • Certificates: Valid (expiry April 24, 2026)
    • Network policies: None blocking traffic
    • DNS resolution: Working
  2. Resource tuning ❌ No improvement:

    • Increased API server CPU: 10m → 500m
    • Increased memory: 40Mi → 128Mi
    • Increased limits: Added 200m/1000m CPU limits
    • Error frequency unchanged (~every 30 seconds)
  3. Debug investigation 🤔 Heisenbug behavior:

    • Enabling debug logging reduced error frequency (from ~30s to ~2min)
    • Suggests timing-sensitive or race-condition issue
    • Not a simple resource exhaustion problem

Likely Root Causes:

  1. Agent-side timeout: the client timeout appears to be hardcoded in the agent and is not configurable via Helm (see the configuration sketch after this list)
  2. Connection pool issue: Possible connection reuse/cleanup problem
  3. TLS session timeout: HTTP/2 connection lifecycle issue
  4. Intermittent network issue: MTU, packet loss, or CNI quirk
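
The first three hypotheses all correspond to client-builder knobs in common Rust HTTP stacks. A minimal configuration sketch, assuming a reqwest-based client (an assumption: the builder methods are real reqwest API, but whether and where brupop constructs its client this way is unverified):

use std::time::Duration;

// Hypothetical client configuration; every value here is illustrative.
fn build_client() -> reqwest::Result<reqwest::Client> {
    reqwest::Client::builder()
        // Hypothesis 1: make the per-request timeout explicit and tunable
        // rather than relying on a hardcoded default.
        .timeout(Duration::from_secs(10))
        .connect_timeout(Duration::from_secs(5))
        // Hypothesis 2: evict idle pooled connections before the server
        // can close them out from under the pool.
        .pool_idle_timeout(Duration::from_secs(15))
        // Hypothesis 3: send HTTP/2 PING frames so a dead connection is
        // detected proactively instead of on the next real request.
        .http2_keep_alive_interval(Duration::from_secs(10))
        .http2_keep_alive_timeout(Duration::from_secs(5))
        .http2_keep_alive_while_idle(true)
        .build()
}

If knobs like these were exposed through Helm values, the idle-timeout hypothesis discussed below could be tested without rebuilding the agent.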

What We Can EXCLUDE:

  • DNS issues: ❌ Would fail ALL requests, not just periodic ones
  • TLS/Certificate issues: ❌ Would fail ALL requests; we see successful handshakes
  • Network connectivity: ❌ Multiple successful requests prove network works
  • API server unavailability: ❌ Requests succeed within same 2-minute cycle
  • CPU throttling: ❌ Increasing CPU to 500m had no effect on error pattern

Error Log Examples

ERROR Level (Agent) - Generic Error

2026-02-03T14:33:03.553686Z ERROR run:update_status_in_shadow:update_metadata_shadow: agent::agentclient: 
error=Unable to operate on BottlerocketShadow: 'Unable to update the BottlerocketShadow associated with this node: 
'Unable to update BottlerocketShadow status (ip-10-1-26-121.ec2.internal, 139c4e30-0233-4fe0-8d48-a9de4dc35c3a): 
'error sending request for url (https://brupop-apiserver.brupop-bottlerocket-aws.svc.cluster.local/bottlerocket-node-resource)'''

DEBUG Level Pattern - Successful Connections Followed by Failure

# Multiple successful connections:
2026-02-03T14:15:19.050735Z DEBUG run:update_status_in_shadow:shadow_status_with_refreshed_system_matadata:get_os_info:invoke_apiclient: 
agent::apiclient::api: Invoking apiclient: ["raw", "-u", "/os"]

2026-02-03T14:15:24.150812Z DEBUG run:update_status_in_shadow:shadow_status_with_refreshed_system_matadata:get_os_info:invoke_apiclient: 
agent::apiclient::api: Invoking apiclient: ["raw", "-u", "/os"]

2026-02-03T14:15:29.153777Z DEBUG run:update_status_in_shadow:shadow_status_with_refreshed_system_matadata:get_os_info:invoke_apiclient: 
agent::apiclient::api: Invoking apiclient: ["raw", "-u", "/os"]

2026-02-03T14:15:34.316921Z DEBUG run:update_status_in_shadow:shadow_status_with_refreshed_system_matadata:get_os_info:invoke_apiclient: 
agent::apiclient::api: Invoking apiclient: ["raw", "-u", "/os"]

2026-02-03T14:15:39.417173Z DEBUG run:update_status_in_shadow:shadow_status_with_refreshed_system_matadata:get_os_info:invoke_apiclient: 
agent::apiclient::api: Invoking apiclient: ["raw", "-u", "/os"]

2026-02-03T14:15:39.549674Z  INFO run:update_status_in_shadow:shadow_status_with_refreshed_system_matadata: 
agent::agentclient: Checking for Bottlerocket updates.

2026-02-03T14:15:39.549702Z DEBUG run:update_status_in_shadow:shadow_status_with_refreshed_system_matadata:refresh_updates:invoke_apiclient: 
agent::apiclient::api: Invoking apiclient: ["raw", "-u", "/actions/refresh-updates", "-m", "POST"]

2026-02-03T14:15:39.652275Z DEBUG run:update_status_in_shadow:shadow_status_with_refreshed_system_matadata:get_update_status:invoke_apiclient: 
agent::apiclient::api: Invoking apiclient: ["raw", "-u", "/updates/status"]

# Then suddenly fails:
2026-02-03T14:21:31.713861Z ERROR run:update_status_in_shadow:update_metadata_shadow: agent::agentclient: 
error=Unable to operate on BottlerocketShadow: 'Unable to update the BottlerocketShadow associated with this node: 
'Unable to update BottlerocketShadow status (ip-10-1-26-121.ec2.internal, 139c4e30-0233-4fe0-8d48-a9de4dc35c3a): 
'error sending request for url (https://brupop-apiserver.brupop-bottlerocket-aws.svc.cluster.local/bottlerocket-node-resource)'''

DEBUG Level - API Server Connection Logs (H2 Protocol)

2026-02-03T14:18:30.314222Z DEBUG h2::codec::framed_write: send frame=Settings { flags: (0x0) }
2026-02-03T14:18:30.314302Z DEBUG Connection: h2::codec::framed_read: received frame=Settings { flags: (0x0), enable_push: 0, initial_window_size: 4194304, max_frame_size: 16384, max_header_list_size: 10485760 } peer=Server
2026-02-03T14:18:30.314312Z DEBUG Connection: h2::codec::framed_write: send frame=Settings { flags: (0x1: ACK) } peer=Server
2026-02-03T14:18:30.314316Z DEBUG Connection: h2::codec::framed_read: received frame=WindowUpdate { stream_id: StreamId(0), size_increment: 1073741824 } peer=Server
2026-02-03T14:18:30.314336Z DEBUG Connection: h2::codec::framed_read: received frame=Headers { stream_id: StreamId(1), flags: (0x5: END_HEADERS | END_STREAM) } peer=Server
2026-02-03T14:18:30.314483Z DEBUG Connection: h2::codec::framed_write: send frame=Headers { stream_id: StreamId(1), flags: (0x4: END_HEADERS) } peer=Server
2026-02-03T14:18:30.314503Z DEBUG Connection: h2::codec::framed_write: send frame=Data { stream_id: StreamId(1) } peer=Server
2026-02-03T14:18:30.314507Z DEBUG Connection: h2::codec::framed_write: send frame=Data { stream_id: StreamId(1), flags: (0x1: END_STREAM) } peer=Server
2026-02-03T14:18:30.314527Z DEBUG Connection: h2::codec::framed_read: received frame=Settings { flags: (0x1: ACK) } peer=Server

Key Observations from Debug Logs

  1. Successful Connection Pattern: Agents successfully invoke the apiclient multiple times in quick succession (every ~5 seconds)

  2. CRITICAL TIMING PATTERN - EXACT SEQUENCE FOUND:

    • Every ~2 minutes the agent calls run:update_status_in_shadow:shadow_status_with_refreshed_system_matadata
    • This logs "Checking for Bottlerocket updates" - SUCCEEDS
    • Then, immediately or shortly after, it calls run:update_status_in_shadow:update_metadata_shadow
    • THIS SECOND CALL FAILS with "error sending request"
    • Gap between success and failure: variable, from instant up to ~90 seconds
    • Pattern repeats identically every ~2-2.5 minutes
  3. Variable Timing Between Calls:

    • Success at 15:03:22 → Fail at 15:03:22 = same timestamp (instant fail)
    • Success at 15:04:53 → Fail at 15:05:15 = 22 seconds gap
    • Success at 15:06:53 → Fail at 15:07:12 = 19 seconds gap
    • This variability suggests a blocking wait or timeout inside the second call, not network latency
  4. Interval Consistency:

    • All agent pods follow the exact same pattern
    • Errors occur every ~2-2.5 minutes across all nodes
    • The "Checking" call never fails; the failure is always in update_metadata_shadow
  5. H2 Protocol: Using HTTP/2 framed communication with proper ACK/window management

  6. No API Server Errors: API server logs show no errors, clean protocol frames

  7. Heisenbug: Error frequency significantly reduced when debug logging enabled (from 30s to 2+ minutes)

    • Suggests thread timing issue or race condition
    • Debug overhead changes timing characteristics
  8. Root Cause Hypothesis:

    • Not CPU-related: Increasing CPU to 500m didn't help
    • Not TLS: No certificate/handshake errors found
    • Likely: Connection reuse/pooling bug - the second call on the same connection fails
    • Likely: Timeout on an idle connection after a ~10-20 second wait between calls
    • Possible: Connection pool keeps the connection alive but the server closes it silently
    • Possible: Agent waits too long before updating metadata and the connection times out (a diagnostic sketch for the pooling hypothesis follows this list)
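
A cheap experiment for the pooling hypothesis is to disable connection reuse entirely and watch whether the ~2-2.5 minute failure pattern disappears. A diagnostic sketch, again assuming a reqwest-based client (the builder calls are real reqwest API; applying them to brupop would require a patched build, so this is a what-if, not a ready-made fix):

use std::time::Duration;

// Hypothetical diagnostic build: keep zero idle connections per host so
// every request dials a fresh connection. If the periodic failures vanish
// under this client, stale pooled connections are the culprit.
fn build_no_pool_client() -> reqwest::Result<reqwest::Client> {
    reqwest::Client::builder()
        .pool_max_idle_per_host(0)
        .timeout(Duration::from_secs(10))
        .build()
}

The cost is a fresh TCP+TLS handshake per request, acceptable for a diagnostic run but not as a permanent configuration.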

Current Workaround

Increasing API server resources significantly (CPU: 500m+) was attempted as a mitigation, but it does not fix the underlying issue. Agents retry automatically and eventually recover.

Request

Please investigate:

  1. Timeout configuration: Is there a configurable timeout in the agent? Are there any Helm options we missed?
  2. Connection pooling: Is there a known issue with HTTP/2 connection reuse?
  3. Known issues: Has this been reported before in v1.6.0 or newer versions?
  4. Suggested fix: Would upgrading to a newer version help?

Steps to Reproduce

  1. Deploy brupop v1.6.0 in EKS cluster
  2. Monitor agent logs: kubectl logs -n brupop-bottlerocket-aws daemonset/brupop-agent --since=5m | grep "error sending request"
  3. The error should appear regularly, every 20-40 seconds (at the default log level)

Additional Context

  • This is blocking automated Bottlerocket node updates
  • Error frequency is regular (~30s intervals at the default log level)
  • Heisenbug: Problem manifests differently with debug logging
  • All components appear healthy but communication fails periodically
