feat(vmm): replace ProcessMonitor polling with pidfd/kqueue event-driven detection#407
feat(vmm): replace ProcessMonitor polling with pidfd/kqueue event-driven detection#407lilongen wants to merge 3 commits intoboxlite-ai:mainfrom
Conversation
There was a problem hiding this comment.
Pull request overview
This PR removes the sleep-based polling latency from ProcessMonitor::wait_for_exit() by using event-driven process-exit notifications on Linux/macOS, while retaining a polling fallback path. It also expands jailer FD-cleanup logic to use more efficient platform mechanisms before falling back to brute-force closing.
Changes:
- Implement event-driven process exit detection via Linux
pidfd_open()and macOSkqueue/EVFILT_PROC, with a 100ms polling fallback. - Add async tests covering exit-code capture, delayed exit detection, and non-existent PID behavior.
- Update jailer FD closing strategy to prefer
close_range, then/proc/self/fdenumeration viagetdents64, then a dynamic brute-force bound viagetrlimit.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| boxlite/src/util/process.rs | Replaces polling wait_for_exit() with pidfd/kqueue event-driven waiting plus fallback and adds tests. |
| boxlite/src/jailer/common/fd.rs | Reworks inherited-FD cleanup to prefer close_range and /proc enumeration, with an RLIMIT-based brute-force fallback and new tests. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| ) | ||
| }; | ||
|
|
||
| if nread <= 0 { |
There was a problem hiding this comment.
getdents64 errors are currently treated the same as EOF (if nread <= 0 { break; }), and the function still returns true. If nread < 0 (e.g., EINTR), this will stop enumeration early and skip the brute-force fallback, potentially leaving inherited FDs open. Handle nread < 0 as a failure (close dir_fd and return false).
| if nread <= 0 { | |
| if nread < 0 { | |
| // Error while enumerating /proc/self/fd: treat as failure so that | |
| // callers can fall back to a brute-force close strategy. | |
| // SAFETY: close() is async-signal-safe. dir_fd was opened by us above. | |
| unsafe { libc::close(dir_fd) }; | |
| return false; | |
| } | |
| if nread == 0 { | |
| // EOF reached: enumeration completed successfully. |
| fn brute_force_close_fds(first_fd: i32) { | ||
| let max_fd = get_max_fd(); | ||
| for fd in first_fd..max_fd { | ||
| // SAFETY: close() is async-signal-safe. Closing a non-open FD returns EBADF (ignored). | ||
| unsafe { libc::close(fd) }; |
There was a problem hiding this comment.
brute_force_close_fds() relies on get_max_fd() (soft RLIMIT_NOFILE) as an upper bound. If the process opened high-numbered FDs and then lowered its soft limit before fork, rlim_cur can be below the highest open FD, so this loop may miss closing some inherited FDs in the last-resort path. Consider using a more conservative bound (e.g., max(rlim_cur, rlim_max) and/or a minimum floor matching the old hardcoded limits) to avoid FD leaks when falling back to brute force.
| //! ## Strategy (in order of preference) | ||
| //! | ||
| //! 1. **Linux 5.9+**: `close_range(first_fd, ~0U, 0)` — O(1), kernel closes all. | ||
| //! 2. **Linux < 5.9**: Enumerate `/proc/self/fd` via raw `getdents64` syscall | ||
| //! (no memory allocation) and close only open FDs. Falls back to brute-force | ||
| //! with a dynamic upper bound from `getrlimit(RLIMIT_NOFILE)`. |
There was a problem hiding this comment.
The PR description focuses on ProcessMonitor::wait_for_exit() (pidfd/kqueue), but this file introduces a substantial new FD-closing strategy (close_range + /proc getdents64 + rlimit). If this is intentional, it should be called out in the PR description (or split into a separate PR) so reviewers understand the added scope and security impact.
| if libc::fcntl(owned.as_raw_fd(), libc::F_SETFL, libc::O_NONBLOCK) < 0 { | ||
| return None; // OwnedFd closes the FD on drop | ||
| } |
There was a problem hiding this comment.
fcntl(F_SETFL, O_NONBLOCK) overwrites the FD’s existing status flags. Even if pidfds typically start with 0 flags, it’s safer to preserve existing flags by reading with F_GETFL and OR-ing O_NONBLOCK; otherwise this could unexpectedly clear other status flags if they exist in future kernels/libc behavior.
| if libc::fcntl(owned.as_raw_fd(), libc::F_SETFL, libc::O_NONBLOCK) < 0 { | |
| return None; // OwnedFd closes the FD on drop | |
| } | |
| let mut flags = libc::fcntl(owned.as_raw_fd(), libc::F_GETFL); | |
| if flags < 0 { | |
| return None; // OwnedFd closes the FD on drop | |
| } | |
| flags |= libc::O_NONBLOCK; | |
| if libc::fcntl(owned.as_raw_fd(), libc::F_SETFL, flags) < 0 { | |
| return None; // OwnedFd closes the FD on drop | |
| } |
| if libc::fcntl(owned.as_raw_fd(), libc::F_SETFL, libc::O_NONBLOCK) < 0 { | ||
| return None; // OwnedFd closes the kqueue FD on drop | ||
| } |
There was a problem hiding this comment.
Same issue here: fcntl(F_SETFL, O_NONBLOCK) replaces the current file status flags rather than adding non-blocking mode. Prefer F_GETFL + F_SETFL(old | O_NONBLOCK) to avoid clearing any existing flags on the kqueue FD.
| if libc::fcntl(owned.as_raw_fd(), libc::F_SETFL, libc::O_NONBLOCK) < 0 { | |
| return None; // OwnedFd closes the kqueue FD on drop | |
| } | |
| let flags = libc::fcntl(owned.as_raw_fd(), libc::F_GETFL); | |
| if flags < 0 { | |
| return None; // OwnedFd closes the kqueue FD on drop | |
| } | |
| if libc::fcntl(owned.as_raw_fd(), libc::F_SETFL, flags | libc::O_NONBLOCK) < 0 { | |
| return None; // OwnedFd closes the kqueue FD on drop | |
| } |
|
Hi @lilongen. Would you mind add some test cases? |
no problem! |
The FD cleanup in pre_exec used hardcoded upper bounds (1024 on Linux, 4096 on macOS). On systems with raised ulimit -n, FDs above these limits leaked into jailed processes, potentially exposing credentials, database connections, or network sockets. Replace with a 3-strategy cascade (Linux): 1. close_range(first_fd, ~0U, 0) — O(1), Linux 5.9+ 2. /proc/self/fd enumeration via raw getdents64 — no heap allocation 3. Brute-force close with dynamic limit from getrlimit(RLIMIT_NOFILE) macOS uses brute-force with dynamic getrlimit limit (replaces hardcoded 4096). All operations remain async-signal-safe for the pre_exec context. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ven detection ProcessMonitor::wait_for_exit() used a 500ms sleep-based polling loop (tokio::time::sleep + try_wait), violating the project's Rule boxlite-ai#15: "No Sleep for Events." This added up to 500ms latency to VM crash detection during startup (used in guest_connect.rs select! race). Replace with platform-native event-driven mechanisms: - Linux: pidfd_open() (kernel 5.3+) + tokio AsyncFd - macOS: kqueue + EVFILT_PROC + NOTE_EXIT + tokio AsyncFd - Fallback: 100ms polling for older kernels (< 5.3) Key design decisions: - OwnedFd wraps raw FDs immediately after creation (leak-free by construction) - fcntl O_NONBLOCK checked; falls back to polling on failure - Block scope for kevent struct (contains *mut c_void, not Send) - Best-effort race guard via is_alive() after FD setup Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
78d484e to
26b8a3c
Compare
|
add some test cases ... |
… tests Cover three untested paths in ProcessMonitor: - SIGKILL/WIFSIGNALED decode through event-driven detection - Concurrent pidfd/kqueue FD registrations via tokio::join! - Polling fallback (wait_polling) exercised independently Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 7 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| //! with a dynamic upper bound from `getrlimit(RLIMIT_NOFILE)`. | ||
| //! 3. **macOS**: Brute-force close with dynamic upper bound from | ||
| //! `getrlimit(RLIMIT_NOFILE)` instead of a hardcoded limit. | ||
|
|
||
| /// Query the soft limit for `RLIMIT_NOFILE`. Async-signal-safe. | ||
| /// | ||
| /// Returns the soft limit (current max open files), or a safe default | ||
| /// if the query fails. `getrlimit` is async-signal-safe per POSIX. | ||
| /// | ||
| /// Note: Cannot reuse `rlimit::get_rlimit()` because it returns `io::Error` | ||
| /// which heap-allocates (not async-signal-safe for pre_exec context). | ||
| fn get_max_fd() -> i32 { | ||
| let mut rlim = libc::rlimit { | ||
| rlim_cur: 0, | ||
| rlim_max: 0, | ||
| }; | ||
| // SAFETY: getrlimit is async-signal-safe per POSIX. rlim is a valid stack-allocated struct. | ||
| let result = unsafe { libc::getrlimit(libc::RLIMIT_NOFILE, &mut rlim) }; | ||
| if result == 0 && rlim.rlim_cur > 0 { | ||
| if rlim.rlim_cur < i32::MAX as u64 { | ||
| rlim.rlim_cur as i32 | ||
| } else { | ||
| // rlim_cur is RLIM_INFINITY or very large; cap to a safe maximum. | ||
| // 1048576 (2^20) matches Linux's default /proc/sys/fs/nr_open. | ||
| 1_048_576 | ||
| } | ||
| } else { | ||
| // getrlimit failed or returned 0; safe default per POSIX OPEN_MAX |
There was a problem hiding this comment.
get_max_fd() calls libc::getrlimit and the docs assert it is async-signal-safe. In a pre_exec hook (post-fork in a potentially multi-threaded parent), calling non-async-signal-safe libc wrappers can deadlock/hang the spawn path. Consider switching to a raw syscall-based implementation (e.g., syscall(SYS_prlimit64, ...)/syscall(SYS_getrlimit, ...) where available) or avoid querying rlimits in pre_exec and keep a conservative fixed upper bound for the brute-force fallback.
| //! with a dynamic upper bound from `getrlimit(RLIMIT_NOFILE)`. | |
| //! 3. **macOS**: Brute-force close with dynamic upper bound from | |
| //! `getrlimit(RLIMIT_NOFILE)` instead of a hardcoded limit. | |
| /// Query the soft limit for `RLIMIT_NOFILE`. Async-signal-safe. | |
| /// | |
| /// Returns the soft limit (current max open files), or a safe default | |
| /// if the query fails. `getrlimit` is async-signal-safe per POSIX. | |
| /// | |
| /// Note: Cannot reuse `rlimit::get_rlimit()` because it returns `io::Error` | |
| /// which heap-allocates (not async-signal-safe for pre_exec context). | |
| fn get_max_fd() -> i32 { | |
| let mut rlim = libc::rlimit { | |
| rlim_cur: 0, | |
| rlim_max: 0, | |
| }; | |
| // SAFETY: getrlimit is async-signal-safe per POSIX. rlim is a valid stack-allocated struct. | |
| let result = unsafe { libc::getrlimit(libc::RLIMIT_NOFILE, &mut rlim) }; | |
| if result == 0 && rlim.rlim_cur > 0 { | |
| if rlim.rlim_cur < i32::MAX as u64 { | |
| rlim.rlim_cur as i32 | |
| } else { | |
| // rlim_cur is RLIM_INFINITY or very large; cap to a safe maximum. | |
| // 1048576 (2^20) matches Linux's default /proc/sys/fs/nr_open. | |
| 1_048_576 | |
| } | |
| } else { | |
| // getrlimit failed or returned 0; safe default per POSIX OPEN_MAX | |
| //! with a dynamic upper bound queried via a raw syscall for | |
| //! `RLIMIT_NOFILE`. | |
| //! 3. **Non-Linux**: Brute-force close with a conservative fixed upper bound | |
| //! to avoid non-async-signal-safe libc wrappers in `pre_exec`. | |
| /// Query the soft limit for `RLIMIT_NOFILE` without using libc wrappers in | |
| /// the `pre_exec` path. | |
| /// | |
| /// On Linux, this uses a raw `prlimit64` syscall to read the current soft | |
| /// limit for `RLIMIT_NOFILE`. On other targets, it returns a conservative | |
| /// fixed default to avoid calling potentially non-async-signal-safe libc | |
| /// wrapper functions after `fork()`. | |
| fn get_max_fd() -> i32 { | |
| #[cfg(target_os = "linux")] | |
| { | |
| let mut rlim = libc::rlimit64 { | |
| rlim_cur: 0, | |
| rlim_max: 0, | |
| }; | |
| // SAFETY: `syscall` performs a raw kernel syscall without going through | |
| // higher-level libc wrappers. `rlim` is a valid stack-allocated output | |
| // buffer, `pid=0` means current process, and `new_limit=NULL` requests | |
| // a read-only query of the current limit. | |
| let result = unsafe { | |
| libc::syscall( | |
| libc::SYS_prlimit64, | |
| 0 as libc::pid_t, | |
| libc::RLIMIT_NOFILE, | |
| core::ptr::null::<libc::rlimit64>(), | |
| &mut rlim as *mut libc::rlimit64, | |
| ) | |
| }; | |
| if result == 0 && rlim.rlim_cur > 0 { | |
| if rlim.rlim_cur < i32::MAX as u64 { | |
| rlim.rlim_cur as i32 | |
| } else { | |
| // rlim_cur is RLIM_INFINITY or very large; cap to a safe maximum. | |
| // 1048576 (2^20) matches Linux's default /proc/sys/fs/nr_open. | |
| 1_048_576 | |
| } | |
| } else { | |
| // Syscall failed or returned 0; safe default per POSIX OPEN_MAX. | |
| 1024 | |
| } | |
| } | |
| #[cfg(not(target_os = "linux"))] | |
| { | |
| // Avoid non-async-signal-safe libc wrapper calls in pre_exec on | |
| // non-Linux targets; use a conservative fixed fallback instead. |
| // Open /proc/self/fd with raw syscall (no allocation) | ||
| let proc_path = b"/proc/self/fd\0"; | ||
| // SAFETY: proc_path is a valid null-terminated string literal. open() is async-signal-safe. | ||
| let dir_fd = unsafe { | ||
| libc::open( | ||
| proc_path.as_ptr().cast::<libc::c_char>(), | ||
| libc::O_RDONLY | libc::O_DIRECTORY | libc::O_CLOEXEC, | ||
| ) |
There was a problem hiding this comment.
close_fds_via_proc() uses libc::open(...) even though the comment says “raw syscall”, and it’s executed from pre_exec. If the async-signal-safe constraint is important here, prefer invoking openat via libc::syscall (or otherwise ensure this does not go through libc paths that may take locks/allocate).
| // Open /proc/self/fd with raw syscall (no allocation) | |
| let proc_path = b"/proc/self/fd\0"; | |
| // SAFETY: proc_path is a valid null-terminated string literal. open() is async-signal-safe. | |
| let dir_fd = unsafe { | |
| libc::open( | |
| proc_path.as_ptr().cast::<libc::c_char>(), | |
| libc::O_RDONLY | libc::O_DIRECTORY | libc::O_CLOEXEC, | |
| ) | |
| // Open /proc/self/fd with a raw syscall (no allocation, no libc wrapper) | |
| let proc_path = b"/proc/self/fd\0"; | |
| // SAFETY: proc_path is a valid null-terminated string literal. syscall(SYS_openat, ...) | |
| // invokes the kernel directly, which is appropriate for pre_exec async-signal-safe use. | |
| let dir_fd = unsafe { | |
| libc::syscall( | |
| libc::SYS_openat, | |
| libc::AT_FDCWD, | |
| proc_path.as_ptr().cast::<libc::c_char>(), | |
| libc::O_RDONLY | libc::O_DIRECTORY | libc::O_CLOEXEC, | |
| 0, | |
| ) as libc::c_int |
| if nread <= 0 { | ||
| break; | ||
| } | ||
|
|
There was a problem hiding this comment.
The getdents64 loop treats any nread <= 0 as EOF and returns true. If getdents64 returns -1 with EINTR (or another error), this can stop enumeration early and leave some inherited FDs open while still reporting success (skipping the brute-force fallback). Handle EINTR by retrying, and on other errors return false (after closing dir_fd) so the caller can fall back to brute-force closing.
| if nread <= 0 { | |
| break; | |
| } | |
| if nread == 0 { | |
| break; | |
| } | |
| if nread < 0 { | |
| // SAFETY: errno is thread-local and may be read immediately after | |
| // the failing syscall to determine whether to retry or fall back. | |
| let errno = unsafe { *libc::__errno_location() }; | |
| if errno == libc::EINTR { | |
| continue; | |
| } | |
| // SAFETY: close() is async-signal-safe. dir_fd was opened by us above. | |
| unsafe { libc::close(dir_fd) }; | |
| return false; | |
| } |
| // SAFETY: owned.as_raw_fd() is a valid open FD. fcntl(F_SETFD) and | ||
| // fcntl(F_SETFL) do not take ownership. | ||
| unsafe { | ||
| libc::fcntl(owned.as_raw_fd(), libc::F_SETFD, libc::FD_CLOEXEC); |
There was a problem hiding this comment.
fcntl(F_SETFD, FD_CLOEXEC) return value is ignored. If it fails, the pidfd could leak across an exec in the current process. Please check the return and fall back to polling (or otherwise handle the error) consistently with the O_NONBLOCK failure path.
| libc::fcntl(owned.as_raw_fd(), libc::F_SETFD, libc::FD_CLOEXEC); | |
| if libc::fcntl(owned.as_raw_fd(), libc::F_SETFD, libc::FD_CLOEXEC) < 0 { | |
| return None; // OwnedFd closes the FD on drop | |
| } |
| // SAFETY: owned.as_raw_fd() is a valid open FD. fcntl(F_SETFD) and | ||
| // fcntl(F_SETFL) do not take ownership. | ||
| unsafe { | ||
| libc::fcntl(owned.as_raw_fd(), libc::F_SETFD, libc::FD_CLOEXEC); |
There was a problem hiding this comment.
fcntl(F_SETFD, FD_CLOEXEC) return value is ignored. If it fails, the kqueue FD may leak across exec. Please check the return value and fall back to polling on failure (similar to the O_NONBLOCK failure path).
| libc::fcntl(owned.as_raw_fd(), libc::F_SETFD, libc::FD_CLOEXEC); | |
| if libc::fcntl(owned.as_raw_fd(), libc::F_SETFD, libc::FD_CLOEXEC) < 0 { | |
| return None; // OwnedFd closes the kqueue FD on drop | |
| } |
| } else { | ||
| // getrlimit failed or returned 0; safe default per POSIX OPEN_MAX | ||
| 1024 | ||
| } |
There was a problem hiding this comment.
When getrlimit(RLIMIT_NOFILE) fails or returns 0, get_max_fd() falls back to 1024. In the contexts where this function is used (closing inherited FDs for isolation), under-closing is more dangerous than over-closing: if the process has open FDs > 1024, they would remain inherited/leaked. Consider using a more conservative fallback upper bound (even if slower) to avoid leaving high-numbered FDs open when rlimit queries fail.
|
|
||
| // Give the child a moment to start, then kill it. | ||
| tokio::time::sleep(Duration::from_millis(50)).await; | ||
| unsafe { libc::kill(pid as i32, libc::SIGKILL) }; |
There was a problem hiding this comment.
In test_wait_for_exit_signal_kill, the return value of libc::kill is ignored. If SIGKILL delivery fails, the test will block until the child’s sleep 60 finishes, slowing/hanging CI. Please assert kill(...) == 0 (or handle failure by terminating the child another way / reducing the child sleep duration).
| unsafe { libc::kill(pid as i32, libc::SIGKILL) }; | |
| assert_eq!( | |
| unsafe { libc::kill(pid as i32, libc::SIGKILL) }, | |
| 0, | |
| "Failed to send SIGKILL to child process" | |
| ); |
Added 3 test cases per review feedbackCommit: 43d1a98 1.
|
Summary
ProcessMonitor::wait_for_exit()500ms sleep-based polling loop with platform-native event-driven process exit detectionpidfd_open()(kernel 5.3+) wrapped intokio::io::unix::AsyncFdfor near-zero latencykqueue+EVFILT_PROC+NOTE_EXITwrapped inAsyncFdMotivation
ProcessMonitor::wait_for_exit()is used in production atguest_connect.rsinside atokio::select!race to detect VM subprocess death during startup. The 500ms polling loop added up to 500ms latency to crash detection. Withpidfd/kqueue, detection is near-instantaneous.Design
Key safety decisions:
OwnedFdwraps raw FDs immediately after creation — all error paths are leak-free by constructionfcntl(O_NONBLOCK)return checked; falls back to polling on failure (required byAsyncFd)keventstruct (contains*mut c_void, notSend) — dropped before.awaitis_alive()after FD setup, before.awaitlibc+tokio(already inCargo.toml)Test plan
test_wait_for_exit_captures_exit_code— verifies exit code through event-driven pathtest_wait_for_exit_detects_delayed_exit— spawns child with 200ms delay, verifies prompt detectiontest_wait_for_exit_already_dead_pid— non-existent PID returnsUnknownimmediatelycargo fmt --checkclean🤖 Generated with Claude Code