Skip to content

mm: Unconditional per-VMA locks and cleanups#12438

Open
kernel-patches-daemon-bpf[bot] wants to merge 5 commits into
bpf-next_basefrom
series/1109563=>bpf-next
Open

mm: Unconditional per-VMA locks and cleanups#12438
kernel-patches-daemon-bpf[bot] wants to merge 5 commits into
bpf-next_basefrom
series/1109563=>bpf-next

Conversation

@kernel-patches-daemon-bpf

Copy link
Copy Markdown

Pull request for series with
subject: mm: Unconditional per-VMA locks and cleanups
version: 2
url: https://patchwork.kernel.org/project/netdevbpf/list/?series=1109563

@kernel-patches-daemon-bpf

Copy link
Copy Markdown
Author

Upstream branch: 2e8ad1f
series: https://patchwork.kernel.org/project/netdevbpf/list/?series=1109563
version: 2

hansendc added 5 commits June 10, 2026 16:25
The per-VMA locks have been around for several years. They've had some
bugs worked out of them and have seen quite wide use. However, they
are still only available when architectures explicitly enable them.
Remove the conditional compilation around the per-VMA locks, making
them available on all architectures and configs.

The approach up to now seemed to be to add ARCH_SUPPORTS_PER_VMA_LOCK
when the architecture started using per-VMA locks in the fault
handler. But, contrary to the naming, the Kconfig option does not
really indicate whether the architecture supports per-VMA locks or
not. It is more of a marker for whether the architecture is likely to
benefit from per-VMA locks.

To me, the most important thing side-effect of universal availability
is letting per-VMA locks be used in SMP=n configs. This lets us use
per-VMA locking in all x86 code without fallbacks.

Overall, this just generally makes the kernel simpler. Just look at
the diffstat. It also opens the door to users that want to use the
per-VMA locks in common code. Doing *that* brings additional
simplifications.

The downside of this is adding some fields to vm_area_struct and
mm_struct. There are likely ways to optimize this, especially for
things like SMP=n configs. For now, do the simplest thing: use the
same implementation everywhere.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Vlastimil Babka <vbabka@kernel.org>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: linux-mm@kvack.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Arve Hjønnevåg <arve@android.com>
Cc: Todd Kjos <tkjos@android.com>
Cc: Christian Brauner <christian@brauner.io>
Cc: Carlos Llamas <cmllamas@google.com>
Cc: Alice Ryhl <aliceryhl@google.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: David Ahern <dsahern@kernel.org>
Cc: netdev@vger.kernel.org
tl;dr: lock_vma_under_rcu() is already a trylock. No need to do both
it and mmap_read_trylock().

Long Version:

== Background ==

Historically, binder used an mmap_read_trylock() in its shrinker code.
This ensures that reclaim is not blocked on an mmap_lock. Commit
95bc2d4 ("binder: use per-vma lock in page reclaiming") added
support for the per-VMA lock, but left mmap_read_trylock() as a
fallback.

This was presumably because the per-VMA locking can fail for several
reasons and most (all?) lock_vma_under_rcu() callers have a fallback
to mmap_read_trylock().

== Problem ==

The fallback is not worth the complexity here. lock_vma_under_rcu() is
essentially already a non-blocking trylock. The main reason it fails
is also the reason mmap_read_trylock() fails: something is holding
mmap_write_lock().

The only remedy for a collision with mmap_write_lock() is to wait,
which this code can not do. So the "fallback" after
lock_vma_under_rcu() failure is not really a fallback: it is really
likely to just be retrying in vain. That retry in an of itself isn't
horrible. But it adds complexity.

== Solution ==

Now that per-VMA locks are universally available, lock_vma_under_rcu()
will not persistently fail. Rely on it alone and simplify the code.

Full disclosure: I originally tried to do this with
lock_vma_under_rcu_wait(), but it did not fit well with the mmap_lock
trylock semantics. Claude caught this in a review and suggested the
approach in this path. It seemed sane to me. So, Suggesed-by: Claude,
I guess.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Lorenzo Stoakes <ljs@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: linux-mm@kvack.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Arve Hjønnevåg <arve@android.com>
Cc: Todd Kjos <tkjos@android.com>
Cc: Christian Brauner <christian@brauner.io>
Cc: Carlos Llamas <cmllamas@google.com>
Cc: Alice Ryhl <aliceryhl@google.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: David Ahern <dsahern@kernel.org>
Cc: netdev@vger.kernel.org
== Background ==

There are basically two parallel ways to look up a VMA: the
traditional way, which is protected by mmap_lock, and the RCU-based
per-VMA lock way which is based on RCU and refcounts.

== Problem ==

The mmap_lock one is more straightforward to use but it has a big
disadvantage in that it can not be mixed with page faults since those
can take mmap_lock for read, which can deadlock when mixed with page
faults. For example:

	mmap_read_lock(mm);
	// Another thread does mmap_write_lock().
	// New mmap_lock readers are blocked.
	vma = vma_lookup(mm, address);
	// This deadlocks on mmap_read_lock() if it faults:
	copy_from_user(address);
	mmap_read_unlock(mm);

The RCU one can be mixed with faults, but it is not available in all
configs, so all RCU users need to be able to fall back to the
traditional way.

== Solution ==

Add a variant of the RCU-based lookup that waits for writers. This is
basically the same as the existing RCU-based lookup, but it also takes
mmap_lock for read and waits for writers to finish before returning
the VMA. This has some advantages:

 1. Callers do not need to have a fallback path for when they
    collide with writers.
 2. It can be used in contexts where page faults can happen because
    it can take the mmap_lock for read but never *holds* it.
 3. Its fast path does not require taking mmap_lock for read.

Basically, when applied correctly, this approach results in faster
*and* simpler code.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Vlastimil Babka <vbabka@kernel.org>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: linux-mm@kvack.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Arve Hjønnevåg <arve@android.com>
Cc: Todd Kjos <tkjos@android.com>
Cc: Christian Brauner <christian@brauner.io>
Cc: Carlos Llamas <cmllamas@google.com>
Cc: Alice Ryhl <aliceryhl@google.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: David Ahern <dsahern@kernel.org>
Cc: netdev@vger.kernel.org
Previously, the per-VMA locking could fail in the face of writers
which necessitate a fallback to mmap_lock. The new
vma_start_read_unlocked() will wait for writers instead of failing.

Use the new helper. Wait for writers. Remove the fallback to mmap_lock.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Lorenzo Stoakes <ljs@kernel.org>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: linux-mm@kvack.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Arve Hjønnevåg <arve@android.com>
Cc: Todd Kjos <tkjos@android.com>
Cc: Christian Brauner <christian@brauner.io>
Cc: Carlos Llamas <cmllamas@google.com>
Cc: Alice Ryhl <aliceryhl@google.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: David Ahern <dsahern@kernel.org>
Cc: netdev@vger.kernel.org
Previously, the per-VMA locking could fail in the face of writers
which necessitates a fallback to mmap_lock. The new
lock_vma_under_rcu_wait() will wait for writers instead of failing.

Use the new helper. Wait for writers. Remove the fallback to mmap_lock.

This really is a nice cleanup. It removes the need to pass the lock
state back and forth to find_tcp_vma().

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Lorenzo Stoakes <ljs@kernel.org>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: linux-mm@kvack.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Arve Hjønnevåg <arve@android.com>
Cc: Todd Kjos <tkjos@android.com>
Cc: Christian Brauner <christian@brauner.io>
Cc: Carlos Llamas <cmllamas@google.com>
Cc: Alice Ryhl <aliceryhl@google.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: David Ahern <dsahern@kernel.org>
Cc: netdev@vger.kernel.org
@kernel-patches-daemon-bpf

Copy link
Copy Markdown
Author

Upstream branch: 30dee2c
series: https://patchwork.kernel.org/project/netdevbpf/list/?series=1109563
version: 2

@kernel-patches-daemon-bpf kernel-patches-daemon-bpf Bot force-pushed the series/1109563=>bpf-next branch from f69c59b to ebe6901 Compare June 10, 2026 23:25
@kernel-patches-review-bot

Copy link
Copy Markdown
commit b687b1a163de1cfd4e91b9f8f9537d2782de82a0
Author: Dave Hansen <dave.hansen@linux.intel.com>

mm: Add RCU-based VMA lookup helper that waits for writers

This adds vma_start_read_unlocked(), a VMA lookup that uses the RCU-based
per-VMA lock on its fast path and falls back to taking mmap_lock for read
to wait out writers on its slow path. The intent is to let callers avoid
maintaining their own mmap_lock fallback path.

> diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h
> index 68489dbef0c4..924dd3e865fc 100644
> --- a/include/linux/mmap_lock.h
> +++ b/include/linux/mmap_lock.h
> @@ -257,6 +257,9 @@ static inline bool vma_start_read_locked(struct vm_area_struct *vma)
>  	return vma_start_read_locked_nested(vma, 0);
>  }
>
> +struct vm_area_struct *vma_start_read_unlocked(struct mm_struct *mm,
> +					       unsigned long address);
> +

> diff --git a/mm/mmap_lock.c b/mm/mmap_lock.c
> index e20d01e8d38f..441d2beb325d 100644
> --- a/mm/mmap_lock.c
> +++ b/mm/mmap_lock.c
> @@ -338,6 +338,33 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
> +struct vm_area_struct *vma_start_read_unlocked(struct mm_struct *mm,
> +					       unsigned long address)
> +{
> +	struct vm_area_struct *vma;
> +
> +	/* Fast path: return stable VMA covering 'address': */
> +	vma = lock_vma_under_rcu(mm, address);
> +	if (vma)
> +		return vma;
> +
> +	/* Slow path: preclude VMA writers by getting mmap read lock. */
> +	guard(rwsem_read)(&mm->mmap_lock);
> +	if (!vma_start_read_locked(vma))
> +		return NULL;
> +
> +	return vma;
> +}

Can the slow path here dereference a NULL vma?

The slow path is only reached when lock_vma_under_rcu() returned NULL, since
the 'if (vma) return vma;' branch was not taken. So 'vma' is NULL at the point
vma_start_read_locked(vma) is called.

vma_start_read_locked() forwards to vma_start_read_locked_nested(vma, 0), which
starts with mmap_assert_locked(vma->vm_mm) and then operates on vma->vm_refcnt,
both of which dereference 'vma':

  static inline bool vma_start_read_locked(struct vm_area_struct *vma)
  {
  	return vma_start_read_locked_nested(vma, 0);
  }

This looks reachable on normal runtime paths: lock_vma_under_rcu() returns NULL
when no VMA covers the address, on a writer collision, or when the address
falls outside the locked VMA's range.

Separately, should the slow path perform its own lookup before locking? The
comment says the function should "Find the VMA covering 'address' and lock it
for reading," but the slow path never calls vma_lookup()/find_vma() and instead
reuses the stale 'vma' local. Would something like the following be the intent?

  vma = vma_lookup(mm, address);
  if (!vma)
  	return NULL;
  if (!vma_start_read_locked(vma))
  	return NULL;
  return vma;

Callers are wired up later in this series ('tcp: Remove mmap_lock fallback
path' and 'binder: Remove mmap_lock fallback'), so the path appears reachable
once those land, and no later fix for vma_start_read_unlocked() shows up in
between.



AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

In-Reply-To-Subject: mm: Add RCU-based VMA lookup helper that waits for writers
CI run summary: https://github.com/kernel-patches/bpf/actions/runs/27312528837

@kernel-patches-review-bot

Copy link
Copy Markdown
commit 3e3467afd8b4d369ea7e4532b91f259416b5bc9c
Author: Dave Hansen <dave.hansen@linux.intel.com>

binder: Make shrinker rely solely on per-VMA lock

This drops the mmap_read_trylock() fallback in binder_alloc_free_page()
and relies only on lock_vma_under_rcu(), removing the mm_locked tracking
and the associated cleanup branches.

> diff --git a/drivers/android/binder_alloc.c b/drivers/android/binder_alloc.c
> index e4488ad86a65..9d8239b2640e 100644
> --- a/drivers/android/binder_alloc.c
> +++ b/drivers/android/binder_alloc.c

[ ... ]

> @@ -1203,11 +1197,9 @@ enum lru_status binder_alloc_free_page(struct list_head *item,
>  err_invalid_vma:
>  	mutex_unlock(&alloc->mutex);
>  err_get_alloc_mutex_failed:
> -	if (mm_locked)
> -		mmap_read_unlock(mm);
> -	else
> +	if (vma)
>  		vma_end_read(vma);
> -err_mmap_read_lock_failed:
> +err_vma_lock_failed:
>  	mmput_async(mm);
>  err_mmget:
>  	return LRU_SKIP;

The old err_mmap_read_lock_failed: label was reached by a goto from the
mmap_read_trylock() fallback that this patch removes. After renaming it to
err_vma_lock_failed:, is there any remaining goto that targets it?

Grepping the file, err_vma_lock_failed appears only at the label definition
and nowhere as a goto target, so the label is now reached only by fall-through
from err_get_alloc_mutex_failed:.

A label reached purely by fall-through is not considered "used", so wouldn't
-Wunused-label (enabled via -Wall) warn here, and turn into a build failure
under the default CONFIG_WERROR=y?

Would dropping the err_vma_lock_failed: label and letting mmput_async(mm) fall
through avoid that?



AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

In-Reply-To-Subject: binder: Make shrinker rely solely on per-VMA lock
CI run summary: https://github.com/kernel-patches/bpf/actions/runs/27312528837

@kernel-patches-daemon-bpf

Copy link
Copy Markdown
Author

@kernel-patches-daemon-bpf

Copy link
Copy Markdown
Author

@kernel-patches-review-bot

Copy link
Copy Markdown
commit c4e197dd47640d9ed0b93394ce249569f60dce76
Author: Dave Hansen <dave.hansen@linux.intel.com>

mm: Make per-VMA locks available universally

The per-VMA locks have been around for several years. They've had some
bugs worked out of them and have seen quite wide use. However, they
are still only available when architectures explicitly enable them.
Remove the conditional compilation around the per-VMA locks, making
them available on all architectures and configs.

> diff --git a/mm/Kconfig b/mm/Kconfig
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -1394,19 +1394,6 @@
>  config LRU_GEN_WALKS_MMU
>  	def_bool y
>  	depends on LRU_GEN && ARCH_HAS_HW_PTE_YOUNG
> -# }
> -
> -config ARCH_SUPPORTS_PER_VMA_LOCK
> -       def_bool n
> -
> -config PER_VMA_LOCK
> -	def_bool y
> -	depends on ARCH_SUPPORTS_PER_VMA_LOCK && MMU && SMP
> -	help
> -	  Allow per-vma locking during page fault handling.
> -
> -	  This feature allows locking each virtual memory area separately when
> -	  handling page faults instead of taking mmap_lock.
>
>  config LOCK_MM_AND_FIND_VMA
>  	bool

Removing the CONFIG_PER_VMA_LOCK config symbol makes it permanently
undefined, but kernel/bpf/stackmap.c (not modified in this commit) still
keys its locking protocol on it. This creates a lock imbalance in the BPF
stack build-id collection path.

In stack_map_lock_vma(), lock_vma_under_rcu() now always executes the real
implementation (not a no-op stub). When it succeeds, it returns a VMA
holding a per-VMA read lock, with mmap_lock NOT held:

kernel/bpf/stackmap.c:stack_map_lock_vma() {
    ...
    vma = lock_vma_under_rcu(mm, ip);
    if (vma) {
        lock->vma = vma;
        return vma;
    }
    ...
}

But stack_map_unlock_vma() still has the #ifdef CONFIG_PER_VMA_LOCK around
its cleanup:

kernel/bpf/stackmap.c:stack_map_unlock_vma() {
#ifdef CONFIG_PER_VMA_LOCK
    vma_end_read(lock->vma);
#else
    mmap_read_unlock(lock->mm);
#endif
    lock->vma = NULL;
}

With CONFIG_PER_VMA_LOCK now undefined, the #else branch executes
mmap_read_unlock(lock->mm) on an mmap_lock rwsem that was never acquired.

This corrupts the rwsem state, and also leaks the per-VMA read lock (never
released via vma_end_read), permanently blocking VMA detach/free. Should
kernel/bpf/stackmap.c be updated to remove these #ifdef guards as well?

[ ... ]

> diff --git a/rust/kernel/mm.rs b/rust/kernel/mm.rs
> --- a/rust/kernel/mm.rs
> +++ b/rust/kernel/mm.rs
> @@ -174,7 +174,6 @@
>      /// When per-vma locks are disabled, this always returns `None`.
>      #[inline]
>      pub fn lock_vma_under_rcu(&self, vma_addr: usize) -> Option<VmaReadGuard<'_>> {
> -        #[cfg(CONFIG_PER_VMA_LOCK)]
>          {
>              // SAFETY: Calling `bindings::lock_vma_under_rcu` is always okay given an mm where
>              // `mm_users` is non-zero.
> @@ -188,12 +187,6 @@
>                  });
>              }
>          }
> -
> -        // Silence warnings about unused variables.
> -        #[cfg(not(CONFIG_PER_VMA_LOCK))]
> -        let _ = vma_addr;
> -
> -        None
>      }

Removing the trailing None here breaks compilation. The function signature
is pub fn lock_vma_under_rcu(&self, vma_addr: usize) ->
Option<VmaReadGuard<'_>>. The return Some(...) only executes when
!vma.is_null(). On the vma.is_null() path (lookup miss or lock contention)
the if has no else, so the inner block evaluates to (), and the function's
tail expression is (). This is a type mismatch (rustc E0308: expected
Option<VmaReadGuard<'_>>, found ()).

Shouldn't the trailing None be kept to serve as the return value for the
null/contention path?



AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

In-Reply-To-Subject: mm: Make per-VMA locks available universally
CI run summary: https://github.com/kernel-patches/bpf/actions/runs/27312528837

@kernel-patches-daemon-bpf

Copy link
Copy Markdown
Author

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant