fix(bootstrap): detect low inotify limits before gateway startup#654

Closed
RPirruccio wants to merge 1 commit into NVIDIA:main from RPirruccio:552-inotify-preflight-check/RPirruccio

@RPirruccio RPirruccio commented Mar 29, 2026

Summary

Detect low fs.inotify.max_user_instances before k3s starts and surface it in openshell doctor check. On hosts running existing container workloads, the default limit (128) can be exhausted, causing containerd's CRI plugin to fail with "too many open files" when creating its fsnotify watcher. The RuntimeService never registers, which surfaces as the "K8s namespace not ready" timeout.

The fix is warn-only. It doesn't change any kernel params, just tells you what's wrong and how to fix it yourself.

Related Issue

Related to #552

Changes

  • cluster-entrypoint.sh: check inotify limit before starting k3s, warn if below 256 with exact fix command
  • doctor_check() in run.rs: add inotify instances check (Linux only) after existing Docker check
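
The warn-only check described above could look roughly like the following. This is a minimal sketch, not the code from this PR: the function name check_inotify_limit, the message wording, and the overridable file path are assumptions made for illustration. Only the 256 threshold, the sysctl key, and the suggested fix command come from the description.

```shell
#!/bin/sh
# Hypothetical sketch of a warn-only inotify preflight check.
# The function name, message text, and path parameter are illustrative
# assumptions; they are not taken from cluster-entrypoint.sh.

INOTIFY_MIN_INSTANCES=256

check_inotify_limit() {
    # The path is a parameter so the check can be run against a fixture file.
    limit_file="${1:-/proc/sys/fs/inotify/max_user_instances}"

    # Skip silently where the value is unreadable (non-Linux host, restricted /proc).
    [ -r "$limit_file" ] || return 0

    current="$(cat "$limit_file")"

    # Skip if the content is not a plain non-negative integer.
    case "$current" in
        ''|*[!0-9]*) return 0 ;;
    esac

    if [ "$current" -lt "$INOTIFY_MIN_INSTANCES" ]; then
        echo "WARNING: fs.inotify.max_user_instances is $current (recommended: >= $INOTIFY_MIN_INSTANCES)." >&2
        echo "  To raise it: sudo sysctl -w fs.inotify.max_user_instances=512" >&2
    fi

    # Warn-only: never fail startup and never modify kernel parameters.
    return 0
}

check_inotify_limit
```

The function always returns 0, so a low limit produces a warning on stderr but does not block whatever runs next.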

Testing

Tested on Ubuntu 24.04, Docker 29.0.4, kernel 6.17.0 with existing k3s cluster on host. Root cause confirmed via docker exec:

failed to load plugin: failed to create CRI service: failed to create cni conf monitor:
failed to create fsnotify watcher: too many open files

Running sudo sysctl -w fs.inotify.max_user_instances=512 resolved the failure immediately.
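
To judge how close a host is to the limit, one rough approach is to count open inotify instances through /proc. The sketch below is not part of this PR. It relies on the fact that each inotify instance appears as a file-descriptor symlink pointing at anon_inode:inotify, and it assumes a find that supports -lname (GNU findutils does). The function name count_inotify_instances is hypothetical. The limit is enforced per user, while this counts instances across all processes it can read, so treat the number as an upper-bound indication and run it as root for a full view.

```shell
#!/bin/sh
# Rough diagnostic, not part of this PR: count inotify instances currently open.
# Each instance shows up as /proc/<pid>/fd/<n> -> anon_inode:inotify.
# Assumes find supports -lname (true for GNU findutils).

count_inotify_instances() {
    # The proc root is a parameter so the function can be pointed at a fixture tree.
    proc_root="${1:-/proc}"
    find "$proc_root"/[0-9]*/fd -lname 'anon_inode:inotify' 2>/dev/null | wc -l | tr -d ' '
}

echo "inotify instances in use: $(count_inotify_instances)"
echo "fs.inotify.max_user_instances: $(cat /proc/sys/fs/inotify/max_user_instances 2>/dev/null || echo unknown)"
```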

  • mise run pre-commit passes
  • Unit tests added/updated
  • E2E tests added/updated (if applicable)

Checklist

  • Follows Conventional Commits
  • Commits are signed off (DCO)
  • Architecture docs updated (if applicable)

…DIA#552)

The embedded k3s cluster and its components (containerd, kubelet, flannel,
CoreDNS) create many inotify instances. On hosts that already run Kubernetes
or other container workloads, the default fs.inotify.max_user_instances limit
(128) can be exhausted, causing containerd's CRI plugin to fail with "too many
open files" when creating its fsnotify watcher. This prevents the RuntimeService
from registering, which surfaces as the opaque "K8s namespace not ready" timeout.

Add inotify limit checks in two places:

1. cluster-entrypoint.sh: warn before starting k3s if the limit is below 256.
   Prints the current value and the exact fix command. Does not auto-modify
   kernel parameters -- enterprise environments may audit sysctl changes.

2. doctor_check() in the CLI: adds an inotify instances check after the existing
   Docker check, so `openshell doctor check` catches the issue diagnostically.

Closes NVIDIA#552

Signed-off-by: Riccardo Pirruccio <rickp1795@gmail.com>
@RPirruccio RPirruccio requested a review from a team as a code owner March 29, 2026 00:37
@github-actions

Thank you for your submission! We ask that you sign our Developer Certificate of Origin before we can accept your contribution. You can sign the DCO by adding a comment below using this text:


I have read the DCO document and I hereby sign the DCO.


You can retrigger this bot by commenting recheck in this Pull Request. Posted by the DCO Assistant Lite bot.

@github-actions

Thank you for your interest in contributing to OpenShell, @RPirruccio.

This project uses a vouch system for first-time contributors. Before submitting a pull request, you need to be vouched by a maintainer.

To get vouched:

  1. Open a Vouch Request discussion.
  2. Describe what you want to change and why.
  3. Write in your own words — do not have an AI generate the request.
  4. A maintainer will comment /vouch if approved.
  5. Once vouched, open a new PR (preferred) or reopen this one after a few minutes.

See CONTRIBUTING.md for details.

@github-actions github-actions bot closed this Mar 29, 2026
@RPirruccio
Author

I have read the DCO document and I hereby sign the DCO.
