mm/memcontrol: optimize __memcg_slab_post_alloc_hook batch charging #3
Draft: Copilot wants to merge 2 commits
Rewrite the per-object slab memcg charging loop with several improvements:
1. Hoist the loop-invariant max_run computation before the loop.
The cap (INT_MAX - PAGE_SIZE) / obj_size is constant across all
iterations.
2. Replace the single upfront bulk charge + per-object vmstat updates
with per-pgdat-run batching. The scan-ahead loop groups consecutive
objects on the same pgdat node, then:
- obj_cgroup_charge() is called once per run (not once for all
objects upfront)
- mod_objcg_state() is called once per run instead of once per
object, reducing repeated local_lock_irqsave/irqrestore from
N acquisitions to approximately one per pgdat run
- Objects that fail alloc_slab_obj_exts() are simply skipped
rather than being charged upfront and then individually uncharged
3. Avoid redundant virt_to_slab() for the first object of each run
by reusing the slab pointer already computed at the top of the
outer loop.
4. Use run_end absolute index with while (++i < run_end) to advance i
directly, eliminating a separate run_len counter and relative
indexing. The skip_next flag pattern is not needed: when
alloc_slab_obj_exts() fails in the scan-ahead, we simply break
and the next outer iteration retries that object naturally.
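The scan-ahead grouping in item 2 can be modeled outside the kernel. The following standalone sketch is illustrative only: run_end_for and the plain int node array are hypothetical stand-ins for the per-object pgdat lookup, not kernel code.

```c
#include <stddef.h>

/* Sketch: given each object's node id, return the exclusive end of the
 * run starting at i: consecutive objects on the same node, capped at
 * max_run so the run's byte total stays within range for accounting. */
static size_t run_end_for(const int *node, size_t size, size_t i,
                          size_t max_run)
{
    size_t end = i + 1;

    while (end < size && end - i < max_run && node[end] == node[i])
        end++;
    return end;
}
```

With this grouping, one charge and one vmstat update cover the whole run, so a same-node bulk allocation of N objects takes roughly one lock round-trip instead of N.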
Agent-Logs-Url: https://github.com/teawater/linux/sessions/cd4a349b-c91d-46eb-889a-fa3f75c83e73
Co-authored-by: teawater <432382+teawater@users.noreply.github.com>
Copilot changed the title from "[WIP] Fix performance issues in batch memcg charging implementation" to "mm/memcontrol: optimize __memcg_slab_post_alloc_hook batch charging" on Apr 16, 2026.

teawater pushed a commit that referenced this pull request on May 7, 2026:
Eduard Zingerman says:
====================
bpf: replace min/max fields with struct cnum{32,64}
This RFC replaces the s64, u64, s32, u32 scalar range domains tracked
by the verifier with a pair of circular numbers (cnums): one for the
64-bit domain and another for the 32-bit domain. Each cnum represents
a range as a single arc on the circular number line, from which signed
and unsigned bounds are derived on demand. See also the wrapped
intervals representation in [1].
This representation simplifies arithmetic and conditional handling in
verifier.c and allows 32 <-> 64 bit deductions to be expressed in a
more mathematically rigorous way.
[1] https://jorgenavas.github.io/papers/ACM-TOPLAS-wrapped.pdf
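For intuition, the single-arc representation can be modeled outside the verifier. The struct layout and helper names below are assumptions for illustration; the patch's actual definitions may differ.

```c
#include <stdint.h>
#include <stdbool.h>

/* Model of a 64-bit circular number: an arc on the circle of 2^64
 * values, starting at `start` and covering size + 1 consecutive
 * values with wrap-around (field names are assumed, not the patch's). */
struct cnum64 {
    uint64_t start;
    uint64_t size;
};

/* v lies on the arc iff its wrapping offset from start is within size. */
static bool cnum64_contains(struct cnum64 c, uint64_t v)
{
    return v - c.start <= c.size;
}

/* Addition wraps naturally: starts add, sizes add, saturating to the
 * full circle when the combined span overflows. */
static struct cnum64 cnum64_add(struct cnum64 a, struct cnum64 b)
{
    struct cnum64 r = { a.start + b.start, a.size + b.size };

    if (r.size < a.size)    /* span overflowed: arc covers everything */
        r.size = UINT64_MAX;
    return r;
}
```

With this shape the set {U64_MAX, 0, 1} is one contiguous arc even though its unsigned bounds would be [0, U64_MAX], which is what lets signed and unsigned bounds be derived on demand from a single representation.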
Changelog
=========
v2 -> v3:
- Fixes in source code comments highlighted by bot+bpf-ci.
RFCv1 -> v2:
- Dropped RFC tag.
- Dropped cnum{32,64}_mul(), too much complexity and no veristat
or selftests gains.
- cnum32_from_cnum64() normalizes to CNUM32_EMPTY when input
cnum64.size >= U32_MAX, previously only checked for > U32_MAX
(bot+bpf-ci).
- cnum32_from_cnum64() and cnum64_cnum32_intersect() now check for
empty inputs (sashiko).
- In regs_refine_cond_op() case BPF_JSLT use cnum{32,64}_from_srange
constructors instead of unsigned variants (sashiko).
- cnum{32,64}_intersect_with{,_urange,_srange}() helpers added
(Alexei)
[RFCv1] https://lore.kernel.org/bpf/0c47b0b7ea476647746806c46fded4353be885f7.camel@gmail.com/T/
[v2] https://lore.kernel.org/bpf/c4fb60bafd526ae2d92b86e2250255aef0ba5ee1.camel@gmail.com/T/
Patch set layout
================
Patch #1 introduces the cnum (circular number) data type and basic
operations: intersection, addition, multiplication, negation, 32-to-64
and 64-to-32 projections.
CBMC based proofs for these operations are available at [2].
(The proofs lag behind the current patch set and need an update if
this RFC moves further, although I don't expect any significant
changes.)
[2] https://github.com/eddyz87/cnum-verif
Patch #2 mechanically converts all direct accesses to bpf_reg_state's
min/max fields to use accessor functions, preparing for the field
replacement.
Patch #3 replaces the eight independent min/max fields in
bpf_reg_state with two cnum fields. The signed and unsigned bounds are
derived on demand from the cnum via the accessor functions.
This eliminates the separate signed<->unsigned cross-deduction logic,
simplifies reg_bounds_sync(), regs_refine_cond_op() and some
arithmetic operations.
Patch #4 adds selftest cases for improved 32->64 range refinements
enabled by cnum64_cnum32_intersect().
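The 64-to-32 projection mentioned in the changelog can be sketched as follows. This is only a guess at its shape: the field names, the `empty` flag standing in for CNUM32_EMPTY, and the truncating projection are assumptions; the changelog only states that inputs with cnum64.size >= U32_MAX normalize to CNUM32_EMPTY.

```c
#include <stdint.h>
#include <stdbool.h>

/* Assumed layouts, for illustration only. */
struct cnum64 { uint64_t start; uint64_t size; };
struct cnum32 { uint32_t start; uint32_t size; bool empty; };

static struct cnum32 cnum32_from_cnum64(struct cnum64 c)
{
    struct cnum32 r = { 0, 0, false };

    /* Per the v2 changelog: arcs spanning U32_MAX or more 64-bit
     * values have no useful 32-bit projection, so normalize to the
     * CNUM32_EMPTY stand-in. */
    if (c.size >= UINT32_MAX) {
        r.empty = true;
        return r;
    }
    /* Otherwise each step on the 64-bit circle is one step on the
     * 32-bit circle, so the arc projects by truncation. */
    r.start = (uint32_t)c.start;
    r.size = (uint32_t)c.size;
    return r;
}
```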
Precision trade-offs
====================
A cnum represents a contiguous arc on the circular number line.
Current master tracks four independent ranges (s64, u64, s32, u32)
whose intersection could implicitly represent value sets that are
two disjoint arcs on the circle - something a single cnum cannot.
Cnums lose precision when a cross-domain conditional comparison
(e.g., unsigned comparison on a register with a signed-derived range)
produces an intersection that would be two disjoint arcs.
The cnum must over-approximate by returning one of the input arcs.
E.g., the pair of ranges u64:[1,U64_MAX] and s:[-1,1] intersects to a
set of two values, {-1,1}; a cnum can only over-approximate this as
the three-value set {-1,0,1}.
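To make the over-approximation concrete, here is a small standalone model (field names and helpers are assumed for illustration, not the patch's definitions):

```c
#include <stdint.h>
#include <stdbool.h>

/* Minimal model of a 64-bit arc (assumed layout). */
struct cnum64 { uint64_t start; uint64_t size; };

static bool cnum64_contains(struct cnum64 c, uint64_t v)
{
    return v - c.start <= c.size;
}

/* The two-value set {-1, 1} has no exact single-arc representation:
 * the tightest arc starts at -1 and spans two steps forward, which
 * unavoidably also covers 0. */
static struct cnum64 tightest_arc_for_minus1_and_1(void)
{
    return (struct cnum64){ (uint64_t)-1, 2 };
}
```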
Cnums gain precision in arithmetic: when addition, subtraction or
multiplication causes a register's value range to cross both the
signed and unsigned min/max boundaries, current master resets the
register to unbound, while cnums preserve the exact wrapping arc.
In practice, the precision loss turned out to matter only for a set of
specifically crafted selftests from reg_bounds.c (see patch #3 for
more details). On the other hand, precision gains are observable for a
set of real-life programs.
Verification performance measurements
=====================================
Tested on 6683 programs across four corpora: Cilium (134 programs),
selftests (4635), sched_ext (376), and Meta internal BPF corpus
(1540). Program verification verdicts are identical across all four
program sets - no program changes between success and failure.
Instruction count comparison (cnum vs signed/unsigned pair):
- 83 programs improve, 44 regress, 6596 unchanged
- Total saved: ~290K insns, total added: 10K insns
- Largest improvements:
-26% insns in an internal network firewall program;
-47% insns in rusty/wd40 sched_ext schedulers.
- Largest regression: +24% insns in one Cilium program (5062 insns
added).
Raw performance data is at the bottom of the email.
Raw verification performance metrics
====================================
Filtered out by -f insns_pct>5 -f !insns<10000
========= selftests: master vs cnums-everywhere-v2 =========
File Program Insns (A) Insns (B) Insns (DIFF)
---- ------- --------- --------- ------------
Total progs: 4665
total_insns diff min: -15.71%
total_insns diff max: 45.45%
-20 .. -10 %: 3
-5 .. 0 %: 6
0 .. 5 %: 4654
45 .. 50 %: 2
========= scx: master vs cnums-everywhere-v2 =========
File Program Insns (A) Insns (B) Insns (DIFF)
--------------- -------------- --------- --------- ----------------
scx_rusty.bpf.o rusty_enqueue 39842 22053 -17789 (-44.65%)
scx_rusty.bpf.o rusty_stopping 37738 19949 -17789 (-47.14%)
scx_wd40.bpf.o wd40_stopping 37729 19880 -17849 (-47.31%)
Total progs: 376
total_insns diff min: -47.31%
total_insns diff max: 19.61%
-50 .. -40 %: 3
-5 .. 0 %: 5
0 .. 5 %: 366
5 .. 15 %: 1
15 .. 20 %: 1
========= meta: master vs cnums-everywhere-v2 =========
File Program Insns (A) Insns (B) Insns (DIFF)
--------------------------------- ---------- --------- --------- ----------------
<meta-internal-firewall-program> ...egress 222327 164648 -57679 (-25.94%)
<meta-internal-firewall-program> ..._tc_eg 222839 164772 -58067 (-26.06%)
<meta-internal-firewall-program> ...egress 222327 164648 -57679 (-25.94%)
<meta-internal-firewall-program> ..._tc_eg 222839 164772 -58067 (-26.06%)
Total progs: 1540
total_insns diff min: -26.06%
total_insns diff max: 6.69%
-30 .. -20 %: 4
-15 .. -5 %: 6
-5 .. 0 %: 36
0 .. 5 %: 1493
5 .. 10 %: 1
========= cilium: master vs cnums-everywhere-v2 =========
File Program Insns (A) Insns (B) Insns (DIFF)
---------- ------------------------------- --------- --------- ---------------
bpf_host.o tail_handle_ipv4_cont_from_host 20962 26024 +5062 (+24.15%)
bpf_host.o tail_handle_ipv6_cont_from_host 17036 18672 +1636 (+9.60%)
Total progs: 134
total_insns diff min: -3.32%
total_insns diff max: 24.15%
-5 .. 0 %: 12
0 .. 5 %: 120
5 .. 15 %: 1
20 .. 25 %: 1
---
====================
Link: https://patch.msgid.link/20260424-cnums-everywhere-rfc-v1-v3-0-ca434b39a486@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The per-object memcg charging loop in __memcg_slab_post_alloc_hook() re-acquired stock_lock once per object via mod_objcg_state(), and charged all objects upfront, requiring individual uncharge calls when alloc_slab_obj_exts() failed.

Changes

- Per-pgdat run batching: a scan-ahead groups consecutive objects sharing a pgdat node into a run; obj_cgroup_charge() and mod_objcg_state() are called once per run instead of once per object, reducing local_lock_irqsave/irqrestore from N acquisitions to ~1 for typical same-node bulk allocations.
- Charge-on-assign: the upfront bulk charge (size * obj_full_size(s)) is replaced with per-run charging; objects that fail alloc_slab_obj_exts() are skipped entirely, eliminating the per-failure obj_cgroup_uncharge() call.
- Hoist loop-invariant max_run: (INT_MAX - PAGE_SIZE) / obj_size is computed once before the loop; it also caps run length to keep batch_bytes within int range for mod_objcg_state().
- Reuse the first object's slab pointer: slab from the outer loop head is used directly for the first object's obj_ext assignment, saving a virt_to_slab() call per run.
- run_end absolute indexing: replaces run_len + relative j + the skip_next flag; while (++i < run_end) advances i to the correct position automatically. A failed alloc_slab_obj_exts() in the scan-ahead simply breaks, and the next outer iteration retries naturally.

Original prompt
Problem
Commit 7ef269f ("mm/memcontrol: batch memcg charging in __memcg_slab_post_alloc_hook") introduced batch memcg charging, but the implementation has several performance issues that cause overhead even when the optimization helps.

Issues in the current code (mm/memcontrol.c, function __memcg_slab_post_alloc_hook, around lines 3280-3364):

1. max_size recomputed every outer iteration (lines 3307-3308). This value is constant across all iterations (obj_size and size don't change) and should be hoisted before the loop.

2. skip_next flag adds unnecessary complexity (lines 3289, 3318-3319, 3360-3363). When alloc_slab_obj_exts() fails for an object in the scan-ahead loop, the code sets skip_next = true, breaks, and then handles the skip at the end. This is unnecessary. Without skip_next, setting i += run_len makes the outer loop land on the failed object. The next iteration calls virt_to_slab() + alloc_slab_obj_exts(), which fails again, so it does i++; continue; with the same net result, and since this is the rare error path the one extra check is negligible.

3. Redundant virt_to_slab() calls. Each object in a run gets virt_to_slab() called twice:

   struct slab *slab_j = virt_to_slab(p[j]);
   slab = virt_to_slab(p[i + j]);

   For the first object of each run it is called three times in total (line 3291, the scan-ahead check for the next run, and the assignment). The first object's slab pointer (slab from line 3291) is already available but thrown away in the assignment loop, which re-computes it at j=0.

4. Confusing index arithmetic. The scan-ahead loop uses absolute index j from i+1, but the assignment loop uses relative j from 0 with p[i + j]. This is error-prone and harder to read.

5. run_len initialized to 0 then immediately set to 1 (lines 3286, 3300). Minor but sloppy: size_t run_len = 0; followed by run_len = 1; a few lines later.

Proposed Fix

Optimize the loop in __memcg_slab_post_alloc_hook() with these changes:

- Hoist the max_run computation before the loop; it is a loop-invariant constant.
- Remove the skip_next flag entirely to simplify the control flow: when alloc_slab_obj_exts() fails in the scan-ahead, just break. The next outer iteration handles it naturally.
- Avoid the redundant virt_to_slab() for the first object: handle it separately using the slab pointer already computed at the top of the outer loop, then loop over the remaining objects.
- Use a run_end absolute index: replace run_len + relative j indexing with an absolute run_end. The assignment loop uses do { ... } while (++i < run_end), which advances i directly, eliminating separate index tracking.

Here is the optimized replacement for the loop body in __memcg_slab_post_alloc_hook() (mm/memcontrol.c). Replace lines 3280-3364 (the for (i = 0; i < size; ) loop and its body) with:
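The replacement code itself was not captured above. Purely as an illustration of the described control flow, here is a standalone model with hypothetical names and plain arrays in place of slab objects; it is not the kernel patch body.

```c
#include <stddef.h>
#include <stdbool.h>

/* Model of the run_end control flow: process objects grouped into runs
 * of equal node ids; a (simulated) obj_exts allocation failure ends the
 * run, and the next outer iteration retries that object, which fails
 * again and is skipped with i++. All names are hypothetical. */
static size_t process_all(const int *node, const bool *ext_ok, size_t size,
                          size_t max_run, int *charge_calls)
{
    size_t i = 0, assigned = 0;

    while (i < size) {
        if (!ext_ok[i]) {        /* first object fails: skip it */
            i++;
            continue;
        }
        /* scan-ahead: find the end of the run of same-node objects */
        size_t run_end = i + 1;
        while (run_end < size && run_end - i < max_run &&
               node[run_end] == node[i] && ext_ok[run_end])
            run_end++;

        (*charge_calls)++;       /* one charge per run, not per object */
        do {
            assigned++;          /* assign obj_ext for object i */
        } while (++i < run_end); /* advance i directly to run_end */
    }
    return assigned;
}
```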