feat: handle Ctrl-Z / fg transparently for wrapped commands#174
Conversation
When fence wrapped an interactive command on a real TTY, pressing Ctrl-Z left the child stopped while fence kept blocking in Wait(); the user's outer shell never printed "[1]+ Stopped" and the session felt hung. Add a stop-aware wait4 loop (jobcontrol.go) that runs whenever fence is managing the host TTY foreground. On a child stop it hands the terminal back to fence's own pgrp, SIGSTOPs fence so the outer shell prints the familiar Stopped line, and on resume re-grants the TTY to the child and SIGCONTs the child group. Notably, the loop deliberately does NOT pass WCONTINUED to wait4. On Darwin the kernel reports a continued child with status 0x137F — the standard 0x7F "stopped" low byte plus SIGCONT (0x13) in the high byte — and golang.org/x/sys/unix's WaitStatus.Stopped() returns true for that encoding (it only checks high \!= SIGSTOP). With WCONTINUED enabled, every SIGCONT this loop issued to the child came back looking like a fresh Ctrl-Z, sending fence into an infinite [1]+ Stopped loop on every fg. WUNTRACED-only avoids the encoding mismatch and is all the main Ctrl-Z / fg flow needs; the spec's "third-party SIGCONT to fence" case still works because the resume handshake runs inline after each kill(self, SIGSTOP) returns. Also adds: - integration_jobcontrol_darwin_test.go: end-to-end PTY test that pins the Ctrl-Z / SIGCONT round-trip against `ps` state checks. - wait4_probe_test.go: gated diagnostic that records the Darwin wait4 encoding so a future contributor who tries to re-add WCONTINUED sees why it breaks. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jy-tan
left a comment
There was a problem hiding this comment.
Thanks for the PR! A few comments, let me know what you think.
Since shouldManageHostTTYForeground is platform-agnostic, shall we also add an equivalent test for Linux?
| if err := unix.IoctlSetPointerInt(stdinFd, unix.TIOCSPGRP, childPgrp); err != nil { | ||
| if debug { | ||
| fencelog.Printf("[fence] Warning: failed to set child as foreground: %v\n", err) | ||
| } | ||
| } | ||
| defer func() { | ||
| _ = unix.IoctlSetPointerInt(stdinFd, unix.TIOCSPGRP, savedFgPgrp) | ||
| signal.Reset(syscall.SIGTTOU) | ||
| }() | ||
|
|
||
| jobControlEnabled = true | ||
| jobControlStdinFd = stdinFd | ||
| jobControlChildPgrp = childPgrp | ||
| } |
There was a problem hiding this comment.
do we want to set jobControlEnabled only if TIOCSPGRP to the child actually succeeds?
There was a problem hiding this comment.
Done in 2623075 — moved the defer for restoring savedFgPgrp and the three jobControl* assignments inside the else branch, so a failed TIOCSPGRP falls back to execCmd.Wait() without activating job-control state that cannot function.
While testing this I found two more bugs in the same Ctrl-Z/bg flow (fixed in b5d31f7):
-
waitWithJobControlre-grant afterbg: after resuming from self-SIGSTOP, the code always calledTIOCSPGRP(childPgrp). Withfgthat is correct; withbgthe shell is still the terminal foreground owner, so the call stole the TTY from the shell, causing itsread()to returnEIO(interpreted as EOF → shell exit). Fix: checkTIOCGPGRPafter resuming and only re-grant when fence is the foreground. -
Deferred cleanup on exit: the
deferinrunCommandunconditionally calledTIOCSPGRP(savedFgPgrp). In an interactive bash sessionsavedFgPgrpis fence's own pgrp (bash assigns each pipeline its own pgrp before handing it the terminal), so the defer was stealing the TTY back from the shell right before exit, leaving an orphaned foreground pgrp and triggering EIO on the shell's next read. Fix: only callTIOCSPGRP(savedFgPgrp)when the child's pgrp is still the terminal foreground; skip it when the shell has already reclaimed the TTY.
| // in Wait() while the inner pgrp was stopped, leaving the user's outer | ||
| // shell wedged with no `[1]+ Stopped …` notification. This test pins that | ||
| // regression down. | ||
| func TestMacOS_CtrlZSuspendsFence(t *testing.T) { |
There was a problem hiding this comment.
thinking of a test that resumes and then observes the wrapped command completes (fence is not blocked forever waiting).
something of this shape:
func TestMacOS_CtrlZSuspendsFenceAndResumesChild(t *testing.T) {
skipIfAlreadySandboxed(t)
fenceBin := t.TempDir() + "/fence"
build := exec.Command("go", "build", "-o", fenceBin, "../../cmd/fence") // #nosec G204
build.Stdout = os.Stdout
build.Stderr = os.Stderr
if err := build.Run(); err != nil {
t.Fatalf("failed to build fence: %v", err)
}
cmd := exec.Command(fenceBin, "sh", "-c", "printf 'READY\\n'; sleep 1; printf 'DONE\\n'") // #nosec G204
ptmx, err := pty.Start(cmd)
if err != nil {
t.Fatalf("failed to start fence command with PTY: %v", err)
}
var output lockedBuffer
done := make(chan struct{})
go func() {
_, _ = io.Copy(&output, ptmx)
close(done)
}()
defer func() {
_ = ptmx.Close()
if cmd.Process != nil {
// If the process is still stopped, SIGCONT first so Kill/Wait cleanup
// cannot leave a stopped child behind.
_ = syscall.Kill(cmd.Process.Pid, syscall.SIGCONT)
_ = cmd.Process.Kill()
}
_ = cmd.Wait()
}()
if !waitForOutput(&output, "READY", 2*time.Second) {
t.Fatalf("command did not become ready\noutput so far:\n%s", output.String())
}
if _, err := ptmx.Write([]byte{0x1A}); err != nil {
t.Fatalf("failed to write Ctrl-Z: %v", err)
}
if !waitForProcessState(t, cmd.Process.Pid, "T", 2*time.Second) {
t.Fatalf("fence did not enter stopped state after Ctrl-Z\noutput so far:\n%s", output.String())
}
if err := syscall.Kill(cmd.Process.Pid, syscall.SIGCONT); err != nil {
t.Fatalf("failed to SIGCONT fence: %v", err)
}
if !waitForOutput(&output, "DONE", 5*time.Second) {
t.Fatalf("child did not resume and complete\noutput so far:\n%s", output.String())
}
select {
case <-done:
case <-time.After(2 * time.Second):
t.Fatalf("PTY output did not finish after child completed\noutput so far:\n%s", output.String())
}
if err := cmd.Wait(); err != nil {
t.Fatalf("fence exited with error: %v\noutput:\n%s", err, output.String())
}
}with the following race-safe helpers:
type lockedBuffer struct {
mu sync.Mutex
buf bytes.Buffer
}
func (b *lockedBuffer) Write(p []byte) (int, error) {
b.mu.Lock()
defer b.mu.Unlock()
return b.buf.Write(p)
}
func (b *lockedBuffer) String() string {
b.mu.Lock()
defer b.mu.Unlock()
return b.buf.String()
}
func waitForOutput(output *lockedBuffer, needle string, timeout time.Duration) bool {
deadline := time.Now().Add(timeout)
for time.Now().Before(deadline) {
if strings.Contains(output.String(), needle) {
return true
}
time.Sleep(25 * time.Millisecond)
}
return false
}what do you think?
There was a problem hiding this comment.
Added TestMacOS_CtrlZSuspendsFenceAndResumesChild in bf130b9, following exactly the shape you suggested. The race-safe helpers (lockedBuffer, waitForOutput) are extracted into a separate integration_jobcontrol_darwin_helpers_test.go (same package and build tag) so both Darwin test functions share one copy.
Also added integration_jobcontrol_linux_test.go (//go:build linux) with TestLinux_CtrlZSuspendsFence and TestLinux_CtrlZSuspendsFenceAndResumesChild — same PTY approach, same ps -o state= polling — to cover the stop/resume paths on Linux where WUNTRACED and TIOCSPGRP are also live.
There was a problem hiding this comment.
Thanks for adding the stronger resume tests. One cleanup from my earlier suggestion: since CI runs with -race, async PTY capture should use the locked buffer anywhere the test goroutine reads it. The new full round-trip tests do this, but the older duplicate Test*_CtrlZSuspendsFence tests still use plain bytes.Buffer, can we either convert those too or remove the weaker duplicates?
There was a problem hiding this comment.
Done in f68efcc — went with "remove the weaker duplicates": the Darwin/Linux harness is now a single //go:build darwin || linux helper (integration_jobcontrol_helpers_test.go) sharing one lockedBuffer, and the per-OS files are thin entry points, so there's no plain bytes.Buffer left and nothing's duplicated. (The locked-buffer conversion of the older tests itself landed earlier in 57dd5e7.)
If IoctlSetPointerInt(TIOCSPGRP) fails (e.g. EPERM when stdin is not the controlling terminal), the child is not the foreground pgrp and the wait4 loop will never see a Stopped event. Guarding the defer and the three jobControl* assignments inside the else branch lets the code fall back to execCmd.Wait(), which is safe. Previously the defer for restoring savedFgPgrp would also fire even though the fg pgrp was never changed. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Extract lockedBuffer and waitForOutput into a darwin helpers file so both Darwin test functions share one race-safe copy. - Add TestMacOS_CtrlZSuspendsFenceAndResumesChild to cover the full Ctrl-Z/fg round-trip: stop half was already tested; this pins the resume-and-complete path (waitWithJobControl re-grant + SIGCONT). - Add integration_jobcontrol_linux_test.go (//go:build linux) with TestLinux_CtrlZSuspendsFence and TestLinux_CtrlZSuspendsFenceAndResumesChild to verify the same stop/resume paths on Linux, where WUNTRACED and TIOCSPGRP are also available. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
After `bg`, the shell is the terminal foreground owner. Two places were incorrectly calling TIOCSPGRP and handing the TTY elsewhere: 1. waitWithJobControl: after resuming from self-SIGSTOP, always re-granted the terminal to the child pgrp. With `fg` that is correct; with `bg` the shell is still the foreground and the TIOCSPGRP call caused the shell's stdin read to return EIO (interpreted as EOF → shell exit). Fix: check TIOCGPGRP after resuming and only re-grant when fence is the foreground (fg path). 2. The deferred cleanup in runCommand: unconditionally called TIOCSPGRP(savedFgPgrp) on exit. In an interactive bash session savedFgPgrp is fence's own pgrp (bash gave fence its own pgrp before handing it the terminal), so the defer was stealing the TTY back from the shell right before exit, leaving an empty foreground pgrp and causing the shell's next read to return EIO. Fix: only restore savedFgPgrp when the child's pgrp is still the terminal foreground (i.e. we own it); skip the call when the shell has already reclaimed the TTY. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
So with claude code's help I tried to address the comments and found some extra issues. I've tested on both MacOS and Linux. Looking more I'm finding an issue I want to investigate more, I'll let you know when this is ready. |
When fence calls execCmd.Start() the child is in its own pgrp but not yet the terminal foreground — TIOCSPGRP is called a few syscalls later. Interactive apps (e.g. Claude Code / Node.js) may read from stdin during that window, receive SIGTTIN (background read from controlling terminal), and stop. waitWithJobControl was treating every ws.Stopped() event as a user Ctrl-Z: it handed the TTY back to fence's pgrp and stopped fence itself, leaving both processes stopped and the session appearing hung. Check ws.StopSignal() before the Ctrl-Z path: if the signal is SIGTTIN, the child stopped due to the start-up race rather than user input. By the time wait4 returns, TIOCSPGRP has already made the child the foreground pgrp, so sending SIGCONT lets it retry the read successfully. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
ok fixed one issue I found, it's in a reviewable state. |
jy-tan
left a comment
There was a problem hiding this comment.
Thanks for this! Just a few more comments.
| case ws.Stopped(): | ||
| // SIGTTIN means the child read from the terminal while it was | ||
| // still in the background — a race between Start() and the | ||
| // initial TIOCSPGRP call in the caller. The child is now the | ||
| // foreground pgrp, so resume it; do not treat this as Ctrl-Z. | ||
| if ws.StopSignal() == syscall.SIGTTIN { | ||
| if debug { | ||
| fencelog.Printf("[fence:jobctl] child stopped on SIGTTIN (start race); resuming\n") | ||
| } | ||
| _ = syscall.Kill(-childPgrp, syscall.SIGCONT) | ||
| continue | ||
| } |
There was a problem hiding this comment.
This SIGTTIN branch assumes the child is already the foreground pgrp. That’s true for the startup race, but not after Ctrl-Z -> bg; a background child reading from stdin should remain stopped, not be auto-SIGCONTed. Can we gate this on TIOCGPGRP == childPgrp before resuming?
i.e.,
if ws.StopSignal() == syscall.SIGTTIN {
currentFg, fgErr := unix.IoctlGetInt(stdinFd, unix.TIOCGPGRP)
if fgErr == nil && currentFg == childPgrp {
if debug {
fencelog.Printf("[fence:jobctl] child stopped on SIGTTIN after foreground handoff; resuming\n")
}
_ = syscall.Kill(-childPgrp, syscall.SIGCONT)
continue
}
if debug {
fencelog.Printf("[fence:jobctl] child stopped on SIGTTIN while not foreground (currentFg=%d); treating as job-control stop\n", currentFg)
}
}There was a problem hiding this comment.
Done in a8cead2, implemented as you suggested: the SIGTTIN branch now reads TIOCGPGRP and only auto-SIGCONTs when currentFg == childPgrp (the foreground-handoff race). When the child stopped on SIGTTIN while not foreground (after Ctrl-Z → bg), it falls through to the normal job-control stop so it stays stopped and you can fg it.
One alternative Claude proposed: in the not-foreground case, continue and leave the child stopped without fence touching the terminal at all — instead of falling through to tcsetpgrp(self) + self-SIGSTOP while the shell currently owns the terminal. That sidesteps any momentary TTY handoff while the shell is foreground (the concern behind b5d31f7), but it doesn't surface the stop to the outer shell, so fg wouldn't recover the child. I went with your fall-through since the transient handoff is the same one the normal Ctrl-Z path already relies on — but if you'd rather not touch the terminal in the bg case, the continue variant is a one-line change. Preference?
There was a problem hiding this comment.
Yes, leaning toward the continue variant because the shell already owns the terminal in this path, and a background fence calling TIOCSPGRP(self) seems like the thing we just fixed for Ctrl-Z -> bg. But I agree this needs proof either way. Can we add/manual-test Ctrl-Z -> bg, then a child stdin read, and verify the shell stays alive and fg recovers cleanly?
There was a problem hiding this comment.
Went with the third path rather than a bare continue, because continue alone deadlocks on fg recovery: the child stays stopped on SIGTTIN, fence keeps running (never self-SIGSTOPs), so there is nothing for fg to bring back and nobody to re-grant the TTY.
| if err := unix.IoctlSetPointerInt(stdinFd, unix.TIOCSPGRP, childPgrp); err != nil { | ||
| if debug { | ||
| fencelog.Printf("[fence] Warning: failed to set child as foreground: %v\n", err) | ||
| } |
There was a problem hiding this comment.
If TIOCSPGRP(childPgrp) fails, we've already ignored SIGTTOU but never reset it. Can we reset SIGTTOU in the error branch before falling back to normal Wait()?
There was a problem hiding this comment.
Done in f6f9f40. I register signal.Reset(SIGTTOU) with defer immediately after signal.Ignore(SIGTTOU), so it's reset on every exit path — including the TIOCSPGRP-failure fallback to Wait() you flagged. Used a defer rather than only the error branch so no later early-return can leak the ignored disposition; the foreground-pgrp restore stays success-only.
| // in Wait() while the inner pgrp was stopped, leaving the user's outer | ||
| // shell wedged with no `[1]+ Stopped …` notification. This test pins that | ||
| // regression down. | ||
| func TestMacOS_CtrlZSuspendsFence(t *testing.T) { |
There was a problem hiding this comment.
Thanks for adding the stronger resume tests. One cleanup from my earlier suggestion: since CI runs with -race, async PTY capture should use the locked buffer anywhere the test goroutine reads it. The new full round-trip tests do this, but the older duplicate Test*_CtrlZSuspendsFence tests still use plain bytes.Buffer, can we either convert those too or remove the weaker duplicates?
Both TestMacOS_CtrlZSuspendsFence and TestLinux_CtrlZSuspendsFence were writing to a plain bytes.Buffer from a goroutine while the test body read it on error paths, creating a data race under -race. Switch them to the existing locked-buffer types so all four job-control tests are race-safe. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Both the main.go foreground-restore defer and the jobcontrol.go fg re-grant encode the same invariant — only call TIOCSPGRP if we still own the terminal foreground — with different reference/target pgrps. Extract a single setForegroundIfOwner helper used by both so a future tweak to the guard can't diverge between the two sites. No behavior change. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The SIGTTIN branch treated every SIGTTIN stop as the Start()/TIOCSPGRP foreground-handoff race and unconditionally re-SIGCONTed the child. After Ctrl-Z -> bg the shell owns the terminal and the child is in the background; if it then reads stdin the kernel stops it with SIGTTIN, this loop resumed it, it read again, and spun at 100% CPU. Gate the resume fast path on TIOCGPGRP == childPgrp; when the child is not the foreground pgrp, fall through to the normal job-control stop so it stays stopped and the outer shell can fg it later. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
signal.Ignore(SIGTTOU) ran unconditionally, but signal.Reset lived only in the TIOCSPGRP-success defer. A failed TIOCSPGRP fell through to the execCmd.Wait() path with SIGTTOU still ignored process-wide, which can mask later background terminal writes/ioctls. Register the reset immediately after Ignore so it runs on every exit path; the foreground-pgrp restore stays success-only. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The job-control path reaps the child via wait4 and never calls execCmd.Wait(), so the os.Process is never marked done and the SignalForwarder stays live until the LIFO-deferred cleanup() runs (after the TTY restore and linuxMonitors.Stop). On macOS (no pidfd) a programmatic signal in that window could route to a reaped/recycled PID. Call cleanup() (idempotent) right after waitWithJobControl returns to close the multi-defer window. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The job-control path returns 128+signal for a signaled child, but the non-job-control execCmd.Wait() fallback used exitErr.ExitCode(), which reports -1 for that case, so an identical "child killed by signal" yielded different fence exit codes depending on whether stdin was a TTY. Extract resolveExitCode to apply the shell convention on both paths, and add a unit test covering the normal-exit and signaled cases. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The wait4 switch handled only Stopped/Exited/Signaled. WUNTRACED should never report anything else, but the surrounding logic is platform-sensitive, so a status matching none of them would silently re-block with no diagnostic. Add a default that logs the raw status under debug. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
lockedBuffer/waitForOutput/waitForProcessState and the two Ctrl-Z test bodies were duplicated verbatim between the Darwin file and the linux-prefixed copies. Move the shared harness into a //go:build darwin || linux helper (integration_jobcontrol_helpers_test.go) and reduce the per-OS files to thin OS-named entry points. `ps -o state=` behaves the same on both, so there were no platform-specific assertions to keep. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
|
After addressing your three comments I ran a double Codex + Claude code review of this round and it turned up more than I expected. I folded the fixes into this PR, let me know if it's still manageable. Extra changes:
Two further findings were considered and rejected: assuming |
After Ctrl-Z -> bg, the shell reclaims the terminal. If the bg child then reads stdin it stops on SIGTTIN. The previous fall-through called TIOCSPGRP(selfPgrp) unconditionally — stealing the TTY from the shell, the same class of bug fixed in b5d31f7 for the Ctrl-Z resume path. Restructure waitWithJobControl's Stopped case into an if/else: - SIGTTIN + child is foreground: start-race auto-resume (unchanged). - SIGTTIN + child not foreground: self-SIGSTOP without TIOCSPGRP so the shell can fg fence; the common resume path below then re-grants the TTY to the child. - All other stops (SIGTSTP): TIOCSPGRP(selfPgrp) + self-SIGSTOP (unchanged). Add ctrlZBgStdinReadFgRecovery integration test (Darwin + Linux) that covers the Ctrl-Z / fg / child-stdin-read round-trip. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
Let me know how it looks now. I did some manual tests (simple ones) and added some more automated tests |
|
Thanks @voidexpr! I'll add to a new release within a couple days. |
This is all done by claude code but seems reasonable. I did some simple manual testing. I really appreciate your work with fence, I'm just starting to use it but I'm hoping to use it a lot more so hopefully this patch can help. The use case motivating this: fence around claude code and trying to put it in the background to type a few commands before going back to claude code.
When fence wrapped an interactive command on a real TTY, pressing Ctrl-Z left the child stopped while fence kept blocking in Wait(); the user's outer shell never printed "[1]+ Stopped" and the session felt hung.
Add a stop-aware wait4 loop (jobcontrol.go) that runs whenever fence is managing the host TTY foreground. On a child stop it hands the terminal back to fence's own pgrp, SIGSTOPs fence so the outer shell prints the familiar Stopped line, and on resume re-grants the TTY to the child and SIGCONTs the child group.
Notably, the loop deliberately does NOT pass WCONTINUED to wait4. On Darwin the kernel reports a continued child with status 0x137F — the standard 0x7F "stopped" low byte plus SIGCONT (0x13) in the high byte — and golang.org/x/sys/unix's WaitStatus.Stopped() returns true for that encoding (it only checks high != SIGSTOP). With WCONTINUED enabled, every SIGCONT this loop issued to the child came back looking like a fresh Ctrl-Z, sending fence into an infinite [1]+ Stopped loop on every fg. WUNTRACED-only avoids the encoding mismatch and is all the main Ctrl-Z / fg flow needs; the spec's "third-party SIGCONT to fence" case still works because the resume handshake runs inline after each kill(self, SIGSTOP) returns.
Also adds:
psstate checks.