[Comgr/Caching/Support] Don't use mmap in localCache, on NFS #3049

Open
jmmartinez wants to merge 1 commit into amd-staging from users/jmmartinez/nfs

jmmartinez commented Jun 24, 2026

When the cache directory is on NFS, we may hit a SIGBUS when there is contention on the cache.

This typically happens with MPI jobs running from an NFS-mounted $HOME directory.

NFS clients cache file attributes, so a process can act on stale metadata. The mmap call may succeed, but accessing the mapped memory can raise SIGBUS if the file was modified by another process on another machine.

If we go through the regular read() path instead, we also fail, this time with a Stale file handle (ESTALE) error. But at least that failure is reported as an ordinary error, so it is properly handled and we can continue without using the cache.

To achieve this, the patch makes two changes:

  1. getOpenFileImpl() does not use mmap if IsVolatile is true,
  2. localCache passes IsVolatile=true if the cache directory is
    on an NFS mount.
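The two changes above can be sketched roughly as follows. This is a simplified, Linux-only illustration, not the code in this PR: `isOnNfs`, `LoadStrategy`, `pickStrategy` and `strategyForCacheDir` are hypothetical names, and the NFS check assumes `statfs()` with the NFS filesystem magic number (`NFS_SUPER_MAGIC`, 0x6969 in `<linux/magic.h>`).

```cpp
#include <string>
#include <sys/vfs.h> // Linux-specific statfs()

// Value of NFS_SUPER_MAGIC from <linux/magic.h>, repeated here so the
// sketch does not depend on kernel headers.
constexpr long kNfsSuperMagic = 0x6969;

// Hypothetical check: is Dir on an NFS mount?
// Assumption: if statfs() fails, conservatively answer "yes" so that
// the caller avoids mmap.
bool isOnNfs(const std::string &Dir) {
  struct statfs Info;
  if (::statfs(Dir.c_str(), &Info) != 0)
    return true;
  return static_cast<long>(Info.f_type) == kNfsSuperMagic;
}

enum class LoadStrategy { Mmap, Read };

// Change (1): a volatile file is never mapped.
LoadStrategy pickStrategy(bool IsVolatile) {
  return IsVolatile ? LoadStrategy::Read : LoadStrategy::Mmap;
}

// Change (2): the cache treats files on NFS as volatile.
LoadStrategy strategyForCacheDir(const std::string &CacheDir) {
  return pickStrategy(isOnNfs(CacheDir));
}
```

Treating a failed `statfs()` as "volatile" is a design choice made for this sketch: it trades a possibly slower read for never mapping a file whose filesystem could not be identified.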

Note that (1) may affect paths other than the localCache. From what I checked, though, IsVolatile is only true for tools like clangd on user files; all other callers pass IsVolatile as false.

This fix is worth upstreaming with some extra work, but I haven't done so yet because I want to explore other solutions before committing to it.

Related to LCOMPILER-2307.

jmmartinez requested review from chinmaydd and lamb-j June 24, 2026 13:10
jmmartinez self-assigned this Jun 24, 2026
jmmartinez added the comgr (Related to Code Object Manager) label Jun 24, 2026
lamb-j commented Jun 24, 2026

Collaborator

Given that LCOMPILER-2307 is a P2, can we attempt an upstream fix first? I think it's OK if it requires some iteration.

@chinmaydd

Hmm, I tend to agree with @lamb-j here, since the downstream customer has a temporary fix.
