Skip to content

fix(node): reap leaked git child processes (#53)#61

Open
beardthelion wants to merge 5 commits into
mainfrom
fix/git-child-reaping
Open

fix(node): reap leaked git child processes (#53)#61
beardthelion wants to merge 5 commits into
mainfrom
fix/git-child-reaping

Conversation

@beardthelion

@beardthelion beardthelion commented Jun 19, 2026

Copy link
Copy Markdown
Collaborator

A node serving git over smart-HTTP slowly accumulates zombie (defunct) processes until the container's PID table is full, after which every git subprocess fails to fork with EAGAIN and sync/serve break until the container is restarted. Reported in #53 with roughly 2300 defunct processes after about three weeks of uptime.

Root cause

The node runs as PID 1 in the container with no init/reaper. Its own directly-spawned git subprocesses are awaited and reaped fine, so the leak is not un-waited direct children. The vector is a reparented grandchild: when a clone or fetch client disconnects mid upload-pack, the served git dies and its pack-objects child reparents to PID 1, where nothing ever reaps it. One leak per aborted clone, accumulating over weeks. The sync worker is only where the symptom first surfaces (its git clone --mirror is the first thing that cannot fork); it is not the source.

Two smaller in-process leaks: create_issue and build_filtered_pack dropped a spawned child unwaited if the write to its stdin failed.

Fix

  • Run the node under tini and set init: true in the compose quickstart so reparented orphans are always reaped, plus pids_limit as a blast-radius cap.
  • The serve path now runs git in its own process group and SIGTERMs that group when the request future is dropped (client disconnect) or returns early, so git and pack-objects are torn down together instead of orphaning the grandchild. SIGTERM rather than SIGKILL so git can run its cleanup and remove its .git/*.lock files first, which keeps an aborted push from leaving a stale lock that blocks the next one. A drop guard owns this and is disarmed once the request completes; it reaps the child before surfacing any error, so it can never signal a pid that was already reaped and reused.
  • create_issue and build_filtered_pack now always wait the child before surfacing a stdin-write error.

Tests

cargo test -p gitlawb-node passes (114), including new coverage for direct-child termination, grandchild teardown via the process group (the #53 case), and the disarmed no-signal path. clippy -D warnings and rustfmt are clean.

Not in scope (follow-ups)

  • No timeout on a hung (as opposed to disconnected) served git: a wedged upload-pack/pack-objects can still block a request and hold a PID. Worth a bounded timeout, tracked separately.
  • pids_limit is compose-only and counts threads; a cross-environment app-level concurrency cap (and a Fly equivalent) would be more robust than relying on the cgroup limit.
  • A run_git_service wiring test (a fake git on PATH) to lock in the disconnect/reap path end to end.

Fixes #53

Summary by CodeRabbit

  • Bug Fixes

    • Improved process lifecycle management to prevent zombie processes
    • Enhanced error handling in Git subprocess operations
    • Strengthened signal handling for graceful shutdown
  • Chores

    • Updated container initialization for improved process management and orphaned process cleanup
  • Tests

    • Added regression tests for process group lifecycle management

The node runs as PID 1 in the container and never reaps reparented
orphans, so git grandchildren (e.g. pack-objects left behind when a
served clone disconnects) accumulate as zombies until fork() fails with
EAGAIN. Run under tini and set compose init: true so reparented orphans
are reaped, and add pids_limit as a blast-radius cap.

Refs #53
Serving git over smart-HTTP could leak zombie processes that, over weeks,
exhaust the PID table until fork() fails with EAGAIN and all git
operations break until restart.

The serve path (run_git_service) now runs git in its own process group and
SIGKILLs that group when the request future is dropped (client disconnect)
or returns early, so git and its pack-objects child are torn down together
instead of orphaning the grandchild to PID 1 as a zombie. A drop guard,
disarmed once the request completes cleanly, owns this teardown.

create_issue and build_filtered_pack spawned a std Child and dropped it
unwaited if the stdin write failed (and build_filtered_pack could panic on
a missing stdin handle). Both now always wait the child before surfacing
any error, so the failure path cannot leak.

Adds libc for the group-kill syscall and three regression tests covering
direct-child kill, grandchild reaping via the group, and the disarmed
no-kill path.

Refs #53
Address code-review findings on the #53 fix:

- run_git_service no longer early-returns on a stdin-write error. It captures
  the write result, always reaps the child, then disarms the group guard
  before surfacing the error. Previously a write/wait error path could drop
  the still-armed guard after the child was already reaped, so the group
  SIGKILL could in theory land on a reused pgid. (correctness, reliability,
  adversarial all flagged this region.)
- Drop tini's -g flag: the node manages its own children and graceful
  shutdown, so group-wide signalling would only disturb in-flight git helpers
  on shutdown.
- Correct the SAFETY comment (kill(2) is not relevant to async-signal-safety
  here) and the alive() test-helper comment (EPERM does not read as alive).
SIGKILL gave git no chance to run its signal cleanup, so an aborted
receive-pack could leave a stale .git/*.lock that blocks the next push.
SIGTERM lets git remove its lock files before exiting. Signalling the
whole process group still reaches pack-objects directly, so the grandchild
is terminated rather than orphaned, and a single immediate signal avoids
any delayed-kill pid-reuse window.

Refs #53
@coderabbitai

coderabbitai Bot commented Jun 19, 2026

Copy link
Copy Markdown

Review Change Stack

Warning

Review limit reached

@beardthelion, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 54 minutes and 11 seconds. Learn how PR review limits work.

To continue reviewing without waiting, enable usage-based billing in the billing tab.

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based credits.

🚦 How do rate limits work?

CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan refill rate.

For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, the refill rate gradually slows as usage increases. The highest same-day bursts are limited more strictly.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 12678b6b-e30f-4f01-885d-dc3f5eb1b9de

📥 Commits

Reviewing files that changed from the base of the PR and between 610f93e and b3a0599.

📒 Files selected for processing (1)
  • crates/gitlawb-node/src/git/smart_http.rs
📝 Walkthrough

Walkthrough

The PR fixes zombie process accumulation by adding tini-based PID 1 supervision in the Dockerfile and init: true / pids_limit in docker-compose, a Unix KillGroupOnDrop guard that sends SIGTERM to the git subprocess process group on drop, consistent write_result/wait_with_output child reaping in run_git_service, build_filtered_pack, and create_issue, plus regression tests for the guard.

Changes

Zombie Process Reaping & Lifecycle Management

Layer / File(s) Summary
Container-level process supervision
Dockerfile, docker-compose.yml
Dockerfile installs tini and updates ENTRYPOINT to tini -- gitlawb-node; docker-compose.yml adds init: true and pids_limit: 1024 to the node service with explanatory comments.
KillGroupOnDrop guard, libc dep, and run_git_service process-group management
crates/gitlawb-node/Cargo.toml, crates/gitlawb-node/src/git/smart_http.rs
Adds libc = "0.2" dependency; implements a Unix-only KillGroupOnDrop struct that sends SIGTERM to a child's process group on drop and exposes a disarm() method; updates run_git_service to spawn git with process_group(0), install the guard, capture stdin writes via write_result, and disarm after wait_with_output.
Consistent write_result + wait in build_filtered_pack and create_issue
crates/gitlawb-node/src/git/smart_http.rs, crates/gitlawb-node/src/git/issues.rs
build_filtered_pack captures pack-objects stdin write in a write_result closure, returns BrokenPipe when stdin is unavailable, and always reaches wait_with_output; create_issue applies the same pattern so a write failure cannot skip child reaping.
KillGroupOnDrop regression tests
crates/gitlawb-node/src/git/smart_http.rs
Adds an alive helper and three Unix-only async tests: process-group termination on drop, signal propagation to a grandchild, and no-kill behavior when the guard is disarmed.

Sequence Diagram(s)

sequenceDiagram
  participant Client
  participant run_git_service
  participant KillGroupOnDrop
  participant git

  Client->>run_git_service: HTTP request
  run_git_service->>git: spawn(process_group(0))
  run_git_service->>KillGroupOnDrop: install guard(pgid)
  run_git_service->>git: write stdin (write_result)

  alt Client disconnects / future dropped early
    KillGroupOnDrop->>git: SIGTERM(process group)
  else Request completes normally
    run_git_service->>git: wait_with_output()
    run_git_service->>KillGroupOnDrop: disarm()
    run_git_service-->>Client: response
  end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related issues

  • #53 (Zombie process leak in repo sync): This PR directly addresses the root cause described in issue #53 — child processes not being reaped — by adding write_result/wait_with_output consistency, the KillGroupOnDrop process-group guard, and container-level tini/init: true supervision.
  • #62: The PR implements the KillGroupOnDrop guard and process-group infrastructure that issue #62 lists as a prerequisite foundation for follow-up timeout and concurrency work.

Poem

🐇 No more zombies haunting the PID tree,
Tini stands guard at PID one with glee.
KillGroupOnDrop whispers SIGTERM goodnight,
write_result and wait keep reaping airtight.
Every git child now rests in peace — hooray!
The fork table is free, and sync's here to stay. 🌿

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately summarizes the main change: fixing zombie git child processes that leak in the node.
Description check ✅ Passed The description is comprehensive, covering summary, root cause, fix, tests, and follow-ups; it fully matches the template structure with all required sections populated.
Linked Issues check ✅ Passed All objectives from issue #53 are addressed: container-level reaping with tini/init, process-group termination in run_git_service, stdin-write handling in create_issue and build_filtered_pack, and comprehensive test coverage.
Out of Scope Changes check ✅ Passed All changes are tightly scoped to zombie process reaping via container init, process groups, drop guards, and stdin-write safety; no unrelated modifications detected.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/git-child-reaping

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@beardthelion

Copy link
Copy Markdown
Collaborator Author

@coderabbitai review

@coderabbitai

coderabbitai Bot commented Jun 19, 2026

Copy link
Copy Markdown
✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/gitlawb-node/src/git/smart_http.rs`:
- Line 790: The comment at line 790 in the smart_http.rs file incorrectly
references SIGKILL when describing the guard behavior, but the actual
implementation sends SIGTERM (as seen in the assertion at line 796). Update the
comment from "SIGKILL to the group" to "SIGTERM to the group" to accurately
reflect the signal being sent when the guard is dropped.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 98011bbb-530c-4363-9dfc-22bf6020d374

📥 Commits

Reviewing files that changed from the base of the PR and between e37ea7f and 610f93e.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (5)
  • Dockerfile
  • crates/gitlawb-node/Cargo.toml
  • crates/gitlawb-node/src/git/issues.rs
  • crates/gitlawb-node/src/git/smart_http.rs
  • docker-compose.yml

Comment thread crates/gitlawb-node/src/git/smart_http.rs Outdated
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Zombie process leak in repo sync — git children not reaped, eventually exhausts fork() (EAGAIN "Resource temporarily unavailable")

1 participant