Deflake test_linux_network_stacksmash_64 #122

Open
DanielBotnik wants to merge 1 commit into angr:master from DanielBotnik:fix-flaky-network-stacksmash-test

@DanielBotnik

Summary

test_linux_network_stacksmash_64 was wrapped in @flaky(max_runs=3, min_passes=1) to paper over two races in the phase-2 "run the exploit" step. This replaces the retry wrapper with deterministic synchronization.

The exploit itself is not the flaky part: network_overflow is a non-PIE binary whose talk() memcpys the received bytes to a fixed .bss address (0x4040c0) and the exploit overwrites the saved return address to jump there, so it works independently of ASLR. The flakiness was entirely in how the test launched and connected to the target:

  1. Random ports (random.randint(...)) could already be in use. The target server does not set SO_REUSEADDR, so its bind() then fails and the test errors out through no fault of the exploit.
  2. time.sleep(.5) was used to wait for the server before launching the exploit, whose generated script connects once with no retry. On a loaded CI runner the server can take longer than 0.5s to reach listen(), so the connection is refused.

(Phase 1 doesn't suffer this — archr connects with retry=30.)
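A minimal sketch of the two failure modes above, using only the standard library. It is not the test's code: `bind_error` and `connect_once` are hypothetical helpers, and a bound-but-not-yet-listening socket stands in for a target that is still starting up (behaviour described is for Linux).

```python
import errno
import socket


def bind_error(port, host="127.0.0.1"):
    """Bind like a server that sets no SO_REUSEADDR; return the errno, or None on success."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        s.bind((host, port))
        return None
    except OSError as exc:
        return exc.errno
    finally:
        s.close()


def connect_once(port, host="127.0.0.1"):
    """Connect a single time with no retry; return True on success, False if refused."""
    try:
        with socket.create_connection((host, port), timeout=1):
            return True
    except ConnectionRefusedError:
        return False


# Stand-in for a target that has bound its port but not yet reached listen().
starting = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
starting.bind(("127.0.0.1", 0))
port = starting.getsockname()[1]

port_taken = bind_error(port) == errno.EADDRINUSE  # race 1: the port is already in use
refused_early = not connect_once(port)             # race 2: connect before listen()

starting.listen(1)
accepted_late = connect_once(port)                 # the same connect works once listening
starting.close()
```

A fixed sleep only shrinks the window for the second failure; it cannot close it, because nothing ties the sleep duration to the moment the server calls listen().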

Changes

  • _get_free_tcp_port() — bind to port 0 to get a port the OS reports as free, instead of a random one that may be taken.
  • _wait_until_listening() — poll the socket's LISTEN state (via psutil, already a transitive dependency through angr/pwntools) instead of sleeping a fixed interval. We check state rather than connecting because the target accept()s exactly one connection, so a probe connection would be consumed instead of the exploit's.
  • Remove the @flaky decorator and the now-unused flaky dependency.
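The two helpers could look roughly like the sketch below. This is an illustration under assumptions, not the PR's implementation: the function names, the `timeout` and `interval` parameters, and the choice of the system-wide `psutil.net_connections()` are all hypothetical. On some platforms (macOS, for example) that call may need elevated privileges.

```python
import socket
import time

import psutil


def get_free_tcp_port(host="127.0.0.1"):
    """Ask the OS for an unused TCP port by binding to port 0, then release it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind((host, 0))
        return s.getsockname()[1]


def is_listening(port):
    """Return True if some TCP socket on this machine is in LISTEN state on `port`."""
    for conn in psutil.net_connections(kind="tcp"):
        # laddr is an empty tuple for sockets without a local address.
        if conn.status == psutil.CONN_LISTEN and conn.laddr and conn.laddr.port == port:
            return True
    return False


def wait_until_listening(port, timeout=10.0, interval=0.05):
    """Poll socket state until `port` is listening; never opens a connection to it."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if is_listening(port):
            return True
        time.sleep(interval)
    return False
```

Because the probe socket is closed before the target starts, another process could in principle grab the port in between; the window is small compared with picking a random port. Inspecting state instead of connecting leaves the target's single accept() untouched.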

Testing

Ran the test with flaky retries disabled (-p no:flaky):

  • 30/30 passes back-to-back.
  • 5/5 passes under full-CPU stress (the slow-startup condition that triggered the original flake).

Also verified directly that the old sleep(0.5) + no-retry connect fails with ConnectionRefusedError against a server that takes >0.5s to listen(), while _wait_until_listening handles it.

🤖 Generated with Claude Code

@ltfish
Member

ltfish commented May 29, 2026

Ask Claude Code to try harder ;)

The test was wrapped in @flaky(max_runs=3) to paper over several timing races
in driving the network exploit. Fix the underlying races and drop the wrapper.

Root cause of the CI failure (exploit subprocess times out, empty stdout):
the target services a connection with a single recv(), but the generated
call_shellcode exploit sends the shellcode payload and the follow-up shell
commands (e.g. "echo hello\n", "exit\n") back to back. Under load these
coalesce into that one recv(), so the commands are consumed before the shell
is up, leaving the popped shell to block forever on an empty stdin.

Fixes:
- call_shellcode: wait SHELL_SPAWN_DELAY seconds after sending the shellcode
  before sending shell commands, so the payload is the only thing in the
  target's first read() and the shell is reading by the time the commands
  arrive. This makes generated network shellcode exploits reliable in general,
  not just this test.
- test: pick guaranteed-free ports (bind to port 0) instead of random ports,
  which may already be in use (the target sets no SO_REUSEADDR, so its bind()
  then fails).
- test: wait until the target is actually listening (via psutil) before
  launching the exploit, instead of a fixed time.sleep() that races the
  server's startup under load.
- Drop the now-unnecessary @flaky wrapper and the flaky dependency.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@DanielBotnik force-pushed the fix-flaky-network-stacksmash-test branch from 9d086e2 to 7938586 on May 29, 2026 08:11
@DanielBotnik
Author

Took another pass at this and found the real root cause behind the flakiness:

  1. The target services a connection with a single recv(), but the generated call_shellcode exploit sent the shellcode payload and the follow-up commands (echo hello\n, exit\n) back to back — under load they coalesce into that one read, so the commands get consumed before the shell is up and it blocks forever on empty stdin (the 30s timeout you saw).
  2. Fix: pace the exploit — wait SHELL_SPAWN_DELAY after the payload before sending commands, so the payload is alone in the first read.
  3. Also made the test deterministic: guaranteed-free ports (no EADDRINUSE) and wait-until-listening via psutil instead of a fixed sleep(.5).
  4. Dropped the @flaky wrapper and dependency since the races are now fixed at the source.
  5. Reproduced the hang deterministically and verified the paced exploit works; the call_shellcode change also makes real network shellcode exploits more reliable, not just this test.
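The pacing idea can be sketched with a socket pair standing in for the network connection. This is a simplified model, not the generated exploit: `send_exploit` and `simulate` are hypothetical names, the `SHELL_SPAWN_DELAY` value is made up for the demo, and `target_lag` is an assumed stand-in for a loaded target that reaches its single recv() late.

```python
import socket
import threading
import time

SHELL_SPAWN_DELAY = 0.2  # hypothetical value; the PR does not state the real one


def send_exploit(sock, payload, commands, delay):
    """Send the payload, pause for `delay` seconds, then send the shell commands."""
    sock.sendall(payload)
    if delay:
        time.sleep(delay)
    sock.sendall(commands)


def simulate(payload, commands, delay, target_lag=0.05):
    """Return (bytes taken by the target's single recv(), bytes left for the shell)."""
    exploit_end, target_end = socket.socketpair()
    seen = {}

    def target():
        time.sleep(target_lag)                 # a loaded target reaches recv() late
        seen["first"] = target_end.recv(4096)  # the one read that feeds the overflow
        rest = bytearray()
        while chunk := target_end.recv(4096):  # whatever the popped shell would read
            rest += chunk
        seen["shell_stdin"] = bytes(rest)

    worker = threading.Thread(target=target)
    worker.start()
    send_exploit(exploit_end, payload, commands, delay)
    exploit_end.close()                        # EOF ends the target's read loop
    worker.join()
    target_end.close()
    return seen["first"], seen["shell_stdin"]


payload = b"\x90" * 8                          # placeholder bytes, not real shellcode
commands = b"echo hello\nexit\n"

unpaced = simulate(payload, commands, delay=0)
paced = simulate(payload, commands, delay=SHELL_SPAWN_DELAY)
```

With `delay=0` both sends are already queued when the target reads, so the commands can end up in the first read and the shell is left with nothing. With the delay, the payload is alone in the first read and the commands arrive afterwards.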
