fix(node): reap leaked git child processes (#53)#61
Conversation
The node runs as PID 1 in the container and never reaps reparented orphans, so git grandchildren (e.g. pack-objects left behind when a served clone disconnects) accumulate as zombies until fork() fails with EAGAIN. Run under tini and set compose init: true so reparented orphans are reaped, and add pids_limit as a blast-radius cap. Refs #53
Serving git over smart-HTTP could leak zombie processes that, over weeks, exhaust the PID table until fork() fails with EAGAIN and all git operations break until restart. The serve path (run_git_service) now runs git in its own process group and SIGKILLs that group when the request future is dropped (client disconnect) or returns early, so git and its pack-objects child are torn down together instead of orphaning the grandchild to PID 1 as a zombie. A drop guard, disarmed once the request completes cleanly, owns this teardown. create_issue and build_filtered_pack spawned a std Child and dropped it unwaited if the stdin write failed (and build_filtered_pack could panic on a missing stdin handle). Both now always wait the child before surfacing any error, so the failure path cannot leak. Adds libc for the group-kill syscall and three regression tests covering direct-child kill, grandchild reaping via the group, and the disarmed no-kill path. Refs #53
Address code-review findings on the #53 fix: - run_git_service no longer early-returns on a stdin-write error. It captures the write result, always reaps the child, then disarms the group guard before surfacing the error. Previously a write/wait error path could drop the still-armed guard after the child was already reaped, so the group SIGKILL could in theory land on a reused pgid. (correctness, reliability, adversarial all flagged this region.) - Drop tini's -g flag: the node manages its own children and graceful shutdown, so group-wide signalling would only disturb in-flight git helpers on shutdown. - Correct the SAFETY comment (kill(2) is not relevant to async-signal-safety here) and the alive() test-helper comment (EPERM does not read as alive).
SIGKILL gave git no chance to run its signal cleanup, so an aborted receive-pack could leave a stale .git/*.lock that blocks the next push. SIGTERM lets git remove its lock files before exiting. Signalling the whole process group still reaches pack-objects directly, so the grandchild is terminated rather than orphaned, and a single immediate signal avoids any delayed-kill pid-reuse window. Refs #53
|
Warning Review limit reached
More reviews will be available in 54 minutes and 11 seconds. Learn how PR review limits work. To continue reviewing without waiting, enable usage-based billing in the billing tab. ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based credits. 🚦 How do rate limits work?CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan refill rate. For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, the refill rate gradually slows as usage increases. The highest same-day bursts are limited more strictly. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Plus Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughThe PR fixes zombie process accumulation by adding ChangesZombie Process Reaping & Lifecycle Management
Sequence Diagram(s)sequenceDiagram
participant Client
participant run_git_service
participant KillGroupOnDrop
participant git
Client->>run_git_service: HTTP request
run_git_service->>git: spawn(process_group(0))
run_git_service->>KillGroupOnDrop: install guard(pgid)
run_git_service->>git: write stdin (write_result)
alt Client disconnects / future dropped early
KillGroupOnDrop->>git: SIGTERM(process group)
else Request completes normally
run_git_service->>git: wait_with_output()
run_git_service->>KillGroupOnDrop: disarm()
run_git_service-->>Client: response
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related issues
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
@coderabbitai review |
✅ Action performedReview finished.
|
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@crates/gitlawb-node/src/git/smart_http.rs`:
- Line 790: The comment at line 790 in the smart_http.rs file incorrectly
references SIGKILL when describing the guard behavior, but the actual
implementation sends SIGTERM (as seen in the assertion at line 796). Update the
comment from "SIGKILL to the group" to "SIGTERM to the group" to accurately
reflect the signal being sent when the guard is dropped.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: 98011bbb-530c-4363-9dfc-22bf6020d374
⛔ Files ignored due to path filters (1)
Cargo.lockis excluded by!**/*.lock
📒 Files selected for processing (5)
Dockerfilecrates/gitlawb-node/Cargo.tomlcrates/gitlawb-node/src/git/issues.rscrates/gitlawb-node/src/git/smart_http.rsdocker-compose.yml
A node serving git over smart-HTTP slowly accumulates zombie (defunct) processes until the container's PID table is full, after which every
gitsubprocess fails to fork with EAGAIN and sync/serve break until the container is restarted. Reported in #53 with roughly 2300 defunct processes after about three weeks of uptime.Root cause
The node runs as PID 1 in the container with no init/reaper. Its own directly-spawned git subprocesses are awaited and reaped fine, so the leak is not un-waited direct children. The vector is a reparented grandchild: when a clone or fetch client disconnects mid
upload-pack, the servedgitdies and itspack-objectschild reparents to PID 1, where nothing ever reaps it. One leak per aborted clone, accumulating over weeks. The sync worker is only where the symptom first surfaces (itsgit clone --mirroris the first thing that cannot fork); it is not the source.Two smaller in-process leaks:
create_issueandbuild_filtered_packdropped a spawned child unwaited if the write to its stdin failed.Fix
tiniand setinit: truein the compose quickstart so reparented orphans are always reaped, pluspids_limitas a blast-radius cap.gitandpack-objectsare torn down together instead of orphaning the grandchild. SIGTERM rather than SIGKILL so git can run its cleanup and remove its.git/*.lockfiles first, which keeps an aborted push from leaving a stale lock that blocks the next one. A drop guard owns this and is disarmed once the request completes; it reaps the child before surfacing any error, so it can never signal a pid that was already reaped and reused.create_issueandbuild_filtered_packnow always wait the child before surfacing a stdin-write error.Tests
cargo test -p gitlawb-nodepasses (114), including new coverage for direct-child termination, grandchild teardown via the process group (the #53 case), and the disarmed no-signal path.clippy -D warningsandrustfmtare clean.Not in scope (follow-ups)
upload-pack/pack-objectscan still block a request and hold a PID. Worth a bounded timeout, tracked separately.pids_limitis compose-only and counts threads; a cross-environment app-level concurrency cap (and a Fly equivalent) would be more robust than relying on the cgroup limit.run_git_servicewiring test (a fakegiton PATH) to lock in the disconnect/reap path end to end.Fixes #53
Summary by CodeRabbit
Bug Fixes
Chores
Tests