Unable to update the BottlerocketShadow associated with this node #801

Description

@Mieszko96

Intermittent "error sending request" timeout between brupop agents and API server

Summary

brupop agents (v1.6.0) sporadically fail to send requests to the API server with a generic "error sending request for url" error, recurring approximately every 30 seconds. The errors occur despite successful requests moments before.

Environment

  • brupop version: 1.6.0
  • brupop-crd version: 1.0.0
  • Kubernetes: EKS
  • Node OS: Bottlerocket 1.52.0 - 1.53.0
  • Agent Pod Count: 3 (DaemonSet)
  • API Server Replicas: 3
  • Namespace: brupop-bottlerocket-aws

Problem Description

Agents fail with:

'error sending request for url (https://brupop-apiserver.brupop-bottlerocket-aws.svc.cluster.local/bottlerocket-node-resource)'

The error pattern:

  1. Agent successfully connects to API server multiple times (3-6 successful requests)
  2. Connection attempt fails with "error sending request"
  3. Error repeats periodically (~every 30 seconds)
  4. Agents retry and eventually recover (a sketch of this retry behavior follows this list)
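
The retry-and-recover behavior suggests a client-side retry loop around the shadow-update request. Below is a minimal Rust sketch of such a loop, purely illustrative: the function name, attempt count, and backoff values are assumptions, not brupop's actual implementation, and reqwest/tokio are assumed here only because the agent is written in Rust.

use std::time::Duration;

// Hypothetical retry wrapper; names and backoff values are illustrative,
// not taken from brupop's source.
async fn post_with_retry(
    client: &reqwest::Client,
    url: &str,
    body: String,
) -> reqwest::Result<reqwest::Response> {
    let mut delay = Duration::from_secs(1);
    for _attempt in 0..5 {
        match client.post(url).body(body.clone()).send().await {
            // Success: hand the response back immediately.
            Ok(resp) => return Ok(resp),
            // Transport-level failures ("error sending request for url" is
            // the Display text of this class of reqwest errors): back off
            // and retry.
            Err(e) if e.is_timeout() || e.is_connect() || e.is_request() => {
                tokio::time::sleep(delay).await;
                delay = (delay * 2).min(Duration::from_secs(30));
            }
            // Anything else is treated as non-retryable.
            Err(e) => return Err(e),
        }
    }
    // Final attempt; propagate whatever it returns.
    client.post(url).body(body).send().await
}

With a loop like this, a transient transport failure surfaces as a periodic ERROR line and then disappears once a retry succeeds, which matches the observed pattern.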

Investigation Results

What was tried:

  1. Infrastructure checks ✅ All healthy:

    • All pods Running, no errors in API server logs
    • Service endpoints: 3 available
    • Certificates: Valid (expiry April 24, 2026)
    • Network policies: None blocking traffic
    • DNS resolution: Working
  2. Resource tuning ❌ No improvement:

    • Increased API server CPU: 10m → 500m
    • Increased memory: 40Mi → 128Mi
    • Increased limits: Added 200m/1000m CPU limits
    • Error frequency unchanged (~every 30 seconds)
  3. Debug investigation 🤔 Heisenbug behavior:

    • Enabling debug logging reduced error frequency (from ~30s to ~2min)
    • Suggests timing-sensitive or race-condition issue
    • Not a simple resource exhaustion problem

Likely Root Causes:

  1. Agent-side timeout: the client timeout appears to be hardcoded in the agent and is not configurable via Helm (see the configuration sketch after this list)
  2. Connection pool issue: Possible connection reuse/cleanup problem
  3. TLS session timeout: HTTP/2 connection lifecycle issue
  4. Intermittent network issue: MTU, packet loss, or CNI quirk
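
The first three hypotheses all correspond to client-builder knobs in common Rust HTTP stacks. A minimal configuration sketch, assuming a reqwest-based client (an assumption: the builder methods are real reqwest API, but whether and where brupop constructs its client this way is unverified):

use std::time::Duration;

// Hypothetical client configuration; every value here is illustrative.
fn build_client() -> reqwest::Result<reqwest::Client> {
    reqwest::Client::builder()
        // Hypothesis 1: make the per-request timeout explicit and tunable
        // rather than relying on a hardcoded default.
        .timeout(Duration::from_secs(10))
        .connect_timeout(Duration::from_secs(5))
        // Hypothesis 2: evict idle pooled connections before the server
        // can close them out from under the pool.
        .pool_idle_timeout(Duration::from_secs(15))
        // Hypothesis 3: send HTTP/2 PING frames so a dead connection is
        // detected proactively instead of on the next real request.
        .http2_keep_alive_interval(Duration::from_secs(10))
        .http2_keep_alive_timeout(Duration::from_secs(5))
        .http2_keep_alive_while_idle(true)
        .build()
}

If knobs like these were exposed through Helm values, the idle-timeout hypothesis discussed below could be tested without rebuilding the agent.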

What We Can EXCLUDE:

  • DNS issues: ❌ Would fail ALL requests, not just periodic ones
  • TLS/Certificate issues: ❌ Would fail ALL requests; we see successful handshakes
  • Network connectivity: ❌ Multiple successful requests prove network works
  • API server unavailability: ❌ Requests succeed within same 2-minute cycle
  • CPU throttling: ❌ Increasing CPU to 500m had no effect on error pattern

Error Log Examples

ERROR Level (Agent) - Generic Error

2026-02-03T14:33:03.553686Z ERROR run:update_status_in_shadow:update_metadata_shadow: agent::agentclient: 
error=Unable to operate on BottlerocketShadow: 'Unable to update the BottlerocketShadow associated with this node: 
'Unable to update BottlerocketShadow status (ip-10-1-26-121.ec2.internal, 139c4e30-0233-4fe0-8d48-a9de4dc35c3a): 
'error sending request for url (https://brupop-apiserver.brupop-bottlerocket-aws.svc.cluster.local/bottlerocket-node-resource)'''

DEBUG Level Pattern - Successful Connections Followed by Failure

# Multiple successful connections:
2026-02-03T14:15:19.050735Z DEBUG run:update_status_in_shadow:shadow_status_with_refreshed_system_matadata:get_os_info:invoke_apiclient: 
agent::apiclient::api: Invoking apiclient: ["raw", "-u", "/os"]

2026-02-03T14:15:24.150812Z DEBUG run:update_status_in_shadow:shadow_status_with_refreshed_system_matadata:get_os_info:invoke_apiclient: 
agent::apiclient::api: Invoking apiclient: ["raw", "-u", "/os"]

2026-02-03T14:15:29.153777Z DEBUG run:update_status_in_shadow:shadow_status_with_refreshed_system_matadata:get_os_info:invoke_apiclient: 
agent::apiclient::api: Invoking apiclient: ["raw", "-u", "/os"]

2026-02-03T14:15:34.316921Z DEBUG run:update_status_in_shadow:shadow_status_with_refreshed_system_matadata:get_os_info:invoke_apiclient: 
agent::apiclient::api: Invoking apiclient: ["raw", "-u", "/os"]

2026-02-03T14:15:39.417173Z DEBUG run:update_status_in_shadow:shadow_status_with_refreshed_system_matadata:get_os_info:invoke_apiclient: 
agent::apiclient::api: Invoking apiclient: ["raw", "-u", "/os"]

2026-02-03T14:15:39.549674Z  INFO run:update_status_in_shadow:shadow_status_with_refreshed_system_matadata: 
agent::agentclient: Checking for Bottlerocket updates.

2026-02-03T14:15:39.549702Z DEBUG run:update_status_in_shadow:shadow_status_with_refreshed_system_matadata:refresh_updates:invoke_apiclient: 
agent::apiclient::api: Invoking apiclient: ["raw", "-u", "/actions/refresh-updates", "-m", "POST"]

2026-02-03T14:15:39.652275Z DEBUG run:update_status_in_shadow:shadow_status_with_refreshed_system_matadata:get_update_status:invoke_apiclient: 
agent::apiclient::api: Invoking apiclient: ["raw", "-u", "/updates/status"]

# Then suddenly fails:
2026-02-03T14:21:31.713861Z ERROR run:update_status_in_shadow:update_metadata_shadow: agent::agentclient: 
error=Unable to operate on BottlerocketShadow: 'Unable to update the BottlerocketShadow associated with this node: 
'Unable to update BottlerocketShadow status (ip-10-1-26-121.ec2.internal, 139c4e30-0233-4fe0-8d48-a9de4dc35c3a): 
'error sending request for url (https://brupop-apiserver.brupop-bottlerocket-aws.svc.cluster.local/bottlerocket-node-resource)'''

DEBUG Level - API Server Connection Logs (H2 Protocol)

2026-02-03T14:18:30.314222Z DEBUG h2::codec::framed_write: send frame=Settings { flags: (0x0) }
2026-02-03T14:18:30.314302Z DEBUG Connection: h2::codec::framed_read: received frame=Settings { flags: (0x0), enable_push: 0, initial_window_size: 4194304, max_frame_size: 16384, max_header_list_size: 10485760 } peer=Server
2026-02-03T14:18:30.314312Z DEBUG Connection: h2::codec::framed_write: send frame=Settings { flags: (0x1: ACK) } peer=Server
2026-02-03T14:18:30.314316Z DEBUG Connection: h2::codec::framed_read: received frame=WindowUpdate { stream_id: StreamId(0), size_increment: 1073741824 } peer=Server
2026-02-03T14:18:30.314336Z DEBUG Connection: h2::codec::framed_read: received frame=Headers { stream_id: StreamId(1), flags: (0x5: END_HEADERS | END_STREAM) } peer=Server
2026-02-03T14:18:30.314483Z DEBUG Connection: h2::codec::framed_write: send frame=Headers { stream_id: StreamId(1), flags: (0x4: END_HEADERS) } peer=Server
2026-02-03T14:18:30.314503Z DEBUG Connection: h2::codec::framed_write: send frame=Data { stream_id: StreamId(1) } peer=Server
2026-02-03T14:18:30.314507Z DEBUG Connection: h2::codec::framed_write: send frame=Data { stream_id: StreamId(1), flags: (0x1: END_STREAM) } peer=Server
2026-02-03T14:18:30.314527Z DEBUG Connection: h2::codec::framed_read: received frame=Settings { flags: (0x1: ACK) } peer=Server

Key Observations from Debug Logs

  1. Successful Connection Pattern: Agents successfully invoke the apiclient multiple times in quick succession (every ~5 seconds)

  2. CRITICAL TIMING PATTERN - EXACT SEQUENCE FOUND:

    • Every ~2 minutes the agent calls run:update_status_in_shadow:shadow_status_with_refreshed_system_matadata
    • This logs "Checking for Bottlerocket updates" - SUCCEEDS
    • Then, immediately or shortly after, it calls run:update_status_in_shadow:update_metadata_shadow
    • THIS SECOND CALL FAILS with "error sending request"
    • Gap between success and failure: variable, from instant up to ~90 seconds
    • Pattern repeats identically every ~2-2.5 minutes
  3. Variable Timing Between Calls:

    • Success at 15:03:22 → Fail at 15:03:22 = same timestamp (instant fail)
    • Success at 15:04:53 → Fail at 15:05:15 = 22 seconds gap
    • Success at 15:06:53 → Fail at 15:07:12 = 19 seconds gap
    • This variability suggests a blocking wait or timeout inside the second call, not network latency
  4. Interval Consistency:

    • All agent pods follow the exact same pattern
    • Errors occur every ~2-2.5 minutes across all nodes
    • The "Checking" call never fails; the failure is always in update_metadata_shadow
  5. H2 Protocol: Using HTTP/2 framed communication with proper ACK/window management

  6. No API Server Errors: API server logs show no errors, clean protocol frames

  7. Heisenbug: Error frequency significantly reduced when debug logging enabled (from 30s to 2+ minutes)

    • Suggests thread timing issue or race condition
    • Debug overhead changes timing characteristics
  8. Root Cause Hypothesis:

    • Not CPU-related: Increasing CPU to 500m didn't help
    • Not TLS: No certificate/handshake errors found
    • Likely: Connection reuse/pooling bug - the second call on the same connection fails
    • Likely: Timeout on an idle connection after a ~10-20 second wait between calls
    • Possible: Connection pool keeps the connection alive but the server closes it silently
    • Possible: Agent waits too long before updating metadata and the connection times out (a diagnostic sketch for the pooling hypothesis follows this list)
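
A cheap experiment for the pooling hypothesis is to disable connection reuse entirely and watch whether the ~2-2.5 minute failure pattern disappears. A diagnostic sketch, again assuming a reqwest-based client (the builder calls are real reqwest API; applying them to brupop would require a patched build, so this is a what-if, not a ready-made fix):

use std::time::Duration;

// Hypothetical diagnostic build: keep zero idle connections per host so
// every request dials a fresh connection. If the periodic failures vanish
// under this client, stale pooled connections are the culprit.
fn build_no_pool_client() -> reqwest::Result<reqwest::Client> {
    reqwest::Client::builder()
        .pool_max_idle_per_host(0)
        .timeout(Duration::from_secs(10))
        .build()
}

The cost is a fresh TCP+TLS handshake per request, acceptable for a diagnostic run but not as a permanent configuration.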

Current Workaround

Increasing API server resources significantly (CPU: 500m+) was attempted as a mitigation, but it does not fix the underlying issue. Agents retry automatically and eventually recover.

Request

Please investigate:

  1. Timeout configuration: Is there a configurable timeout in the agent? Are there any Helm options we missed?
  2. Connection pooling: Is there a known issue with HTTP/2 connection reuse?
  3. Known issues: Has this been reported before in v1.6.0 or newer versions?
  4. Suggested fix: Would upgrading to a newer version help?

Steps to Reproduce

  1. Deploy brupop v1.6.0 in EKS cluster
  2. Monitor agent logs: kubectl logs -n brupop-bottlerocket-aws daemonset/brupop-agent --since=5m | grep "error sending request"
  3. The error should appear regularly, every 20-40 seconds (at the default log level)

Additional Context

  • This is blocking automated Bottlerocket node updates
  • Error frequency is regular (~30s intervals at the default log level)
  • Heisenbug: Problem manifests differently with debug logging
  • All components appear healthy but communication fails periodically
