
mm/memcontrol: optimize __memcg_slab_post_alloc_hook batch charging #3

Draft

Copilot wants to merge 2 commits into master from copilot/fix-batch-memcg-charging


Conversation


Copilot AI commented Apr 16, 2026

The per-object memcg charging loop in __memcg_slab_post_alloc_hook() re-acquired stock_lock once per object via mod_objcg_state(), and charged all objects upfront, which required individual obj_cgroup_uncharge() calls whenever alloc_slab_obj_exts() failed.

Changes

  • Per-pgdat run batching: scan-ahead groups consecutive objects sharing a pgdat node into a run; obj_cgroup_charge() and mod_objcg_state() are called once per run instead of once per object — reducing local_lock_irqsave/irqrestore from N to ~1 for typical same-node bulk allocations

  • Charge-on-assign: upfront bulk charge (size * obj_full_size(s)) replaced with per-run charging; objects that fail alloc_slab_obj_exts() are skipped entirely, eliminating the per-failure obj_cgroup_uncharge() call

  • Hoist loop-invariant max_run: (INT_MAX - PAGE_SIZE) / obj_size computed once before the loop; also caps run length to keep batch_bytes within int range for mod_objcg_state()

  • Reuse first object's slab pointer: slab from the outer loop head is used directly for the first object's obj_ext assignment, saving a virt_to_slab() call per run

  • run_end absolute indexing: replaces run_len + relative j + skip_next flag; while (++i < run_end) advances i to the correct position automatically — failed alloc_slab_obj_exts() in scan-ahead simply breaks, and the next outer iteration retries naturally

/* Before: N lock acquisitions for N objects */
for (i = 0; i < size; i++) {
    ...
    mod_objcg_state(objcg, slab_pgdat(slab), cache_vmstat_idx(s), obj_full_size(s));
}

/* After: ~1 lock acquisition per pgdat run */
while (run_end < size && (run_end - i) < max_run) { /* scan ahead */ }
batch_bytes = (int)(obj_size * (run_end - i));
obj_cgroup_charge(objcg, flags, batch_bytes);
/* assign obj_ext per object ... */
mod_objcg_state(objcg, pgdat, cache_vmstat_idx(s), batch_bytes);
Original prompt

Problem

Commit 7ef269f ("mm/memcontrol: batch memcg charging in __memcg_slab_post_alloc_hook") introduced batch memcg charging but the implementation has several performance issues that cause overhead even when the optimization helps:

Issues in the current code (mm/memcontrol.c, function __memcg_slab_post_alloc_hook, around lines 3280-3364):

1. max_size recomputed every outer iteration (line 3307-3308)

max_size = min((size_t)((INT_MAX - PAGE_SIZE) / obj_size), size);

This value is constant across all iterations — obj_size and size don't change. It should be hoisted before the loop.

2. skip_next flag adds unnecessary complexity (lines 3289, 3318-3319, 3360-3363)
When alloc_slab_obj_exts() fails for an object in the scan-ahead loop, the code sets skip_next = true, breaks, then at the end does:

if (skip_next)
    i = i + run_len + 1;
else
    i += run_len;

This is unnecessary. Without skip_next, setting i += run_len makes the outer loop land on the failed object. The next iteration calls virt_to_slab() + alloc_slab_obj_exts() which fails again → i++; continue; — same net result, and this is the rare error path so the one extra check is negligible.

3. Redundant virt_to_slab() calls
Each object in a run gets virt_to_slab() called twice:

  • Once in the scan-ahead loop (line 3311): struct slab *slab_j = virt_to_slab(p[j]);
  • Once again in the obj_ext assignment loop (line 3350): slab = virt_to_slab(p[i + j]);

For the first object of each run it is called up to three times in total: once at the top of the outer loop (line 3291), once in the previous run's scan-ahead check, and once again in the assignment loop.

The first object's slab pointer (slab from line 3291) is already available but is thrown away by the assignment loop, which recomputes it at j=0.

4. Confusing index arithmetic
The scan-ahead loop uses absolute index j from i+1, but the assignment loop uses relative j from 0 with p[i + j]. This is error-prone and harder to read.

5. run_len initialized to 0 then immediately set to 1 (lines 3286, 3300)
Minor but sloppy: size_t run_len = 0; followed by run_len = 1; a few lines later.

Proposed Fix

Optimize the loop in __memcg_slab_post_alloc_hook() with these changes:

  1. Hoist max_run computation before the loop — it's a loop-invariant constant.

  2. Remove the skip_next flag entirely — simplify the control flow. When alloc_slab_obj_exts() fails in scan-ahead, just break. The next outer iteration handles it naturally.

  3. Avoid redundant virt_to_slab() for the first object — handle it separately using the slab pointer already computed at the top of the outer loop, then loop over remaining objects.

  4. Use run_end absolute index — replace run_len + j relative indexing with run_end absolute index. The assignment loop uses do { ... } while (++i < run_end) which advances i directly, eliminating separate index tracking.

Here is the optimized replacement for the loop body in __memcg_slab_post_alloc_hook() (mm/memcontrol.c). Replace lines 3280-3364 (the for (i = 0; i < size; ) loop and its body) with:

	/*
	 * Cap run length to prevent integer overflow in the final
	 * accumulation performed by __account_obj_stock() where
	 * batch_bytes is added to an int field.
	 */
	const size_t max_run = min_t(size_t,
				     (INT_MAX - PAGE_SIZE) / obj_size, size);

	for (i = 0; i < size; ) {
		unsigned long obj_exts;
		struct slabobj_ext *obj_ext;
		struct obj_stock_pcp *stock;
		struct pglist_data *pgdat;
		int batch_bytes;
		size_t run_end;

		slab = virt_to_slab(p[i]);

		if (!slab_obj_exts(slab) &&
		    alloc_slab_obj_exts(slab, s, flags, false)) {
			i++;
			continue;
		}

		pgdat = slab_pgdat(slab);
		run_end = i + 1;

		/*
		 * Scan ahead for objects on the same pgdat node so we
		 * can batch the memcg charge into a single call.
		 */
		while (run_end < size && (run_end - i) < max_run) {
			struct slab *slab_j = virt_to_slab(p[run_end]);

			if (slab_pgdat(slab_j) != pgdat)
				break;

			if (!slab_obj_exts(slab_j) &&
			    alloc_slab_obj_exts(slab_j, s, flags, false))
				break;

			run_end++;
		}

		/*
		 * If we fail and size is 1, memcg_alloc_abort_single() will
		 * just free the object, which is ok as we have not assigned
		 * objcg to its obj_ext yet.
		 *
		 * For larger sizes, kmem_cache_free_bulk() will uncharge
		 * any objects that were already charged and obj_ext assigned.
		 */
		batch_bytes = obj_size * (run_end - i);
		stock = trylock_stock();
		if (!stock || !__consume_obj_stock(objcg, stock, batch_bytes)) {
			size_t remainder;

			unlock_stock(stock);
			if (__obj_cgroup_charge(objcg, flags, batch_bytes, &remainder))
				return false;
			stock = trylock_stock();
			if (remainder)
				__refill_obj_stock(objcg, stock, remainder, false);
		}
		__account_obj_stock(objcg, stock, batch_bytes,
				    pgdat, cache_vmstat_idx(s));
		unlock_stock(stock);

		/*
		 * Assign obj_ext for each object in this run.
		 * Re...




*This pull request was created from Copilot chat.*
…oc_hook

Rewrite the per-object slab memcg charging loop with several improvements:

1. Hoist the loop-invariant max_run computation before the loop.
   The cap (INT_MAX - PAGE_SIZE) / obj_size is constant across all
   iterations.

2. Replace the single upfront bulk charge + per-object vmstat updates
   with per-pgdat-run batching. The scan-ahead loop groups consecutive
   objects on the same pgdat node, then:
   - obj_cgroup_charge() is called once per run (not once for all
     objects upfront)
   - mod_objcg_state() is called once per run instead of once per
     object, reducing repeated local_lock_irqsave/irqrestore from
     N acquisitions to approximately one per pgdat run
   - Objects that fail alloc_slab_obj_exts() are simply skipped
     rather than being charged upfront and then individually uncharged

3. Avoid redundant virt_to_slab() for the first object of each run
   by reusing the slab pointer already computed at the top of the
   outer loop.

4. Use run_end absolute index with while (++i < run_end) to advance i
   directly, eliminating a separate run_len counter and relative
   indexing. The skip_next flag pattern is not needed: when
   alloc_slab_obj_exts() fails in the scan-ahead, we simply break
   and the next outer iteration retries that object naturally.

Agent-Logs-Url: https://github.com/teawater/linux/sessions/cd4a349b-c91d-46eb-889a-fa3f75c83e73

Co-authored-by: teawater <432382+teawater@users.noreply.github.com>
Copilot AI changed the title [WIP] Fix performance issues in batch memcg charging implementation mm/memcontrol: optimize __memcg_slab_post_alloc_hook batch charging Apr 16, 2026
Copilot AI requested a review from teawater April 16, 2026 09:17
teawater pushed a commit that referenced this pull request May 7, 2026
Eduard Zingerman says:

====================
bpf: replace min/max fields with struct cnum{32,64}

This RFC replaces s64, u64, s32, u32 scalar range domains tracked by
verifier by a pair of circular numbers (cnums): one for 64-bit domain
and another for 32-bit domain. Each cnum represents a range as a
single arc on the circular number line, from which signed and unsigned
bounds are derived on demand. See also wrapped intervals
representation as in [1].

The use of such a representation simplifies arithmetic and conditional
handling in verifier.c and makes it possible to express 32 <-> 64 bit
deductions in a more mathematically rigorous way.

[1] https://jorgenavas.github.io/papers/ACM-TOPLAS-wrapped.pdf

Changelog
=========
v2 -> v3:
- Fixes in source code comments highlighted by bot+bpf-ci.
RFCv1 -> v2:
- Dropped RFC tag.
- Dropped cnum{32,64}_mul(), too much complexity and no veristat
  or selftests gains.
- cnum32_from_cnum64() normalizes to CNUM32_EMPTY when input
  cnum64.size >= U32_MAX, previously only checked for > U32_MAX
  (bot+bpf-ci).
- cnum32_from_cnum64() and cnum64_cnum32_intersect() now check for
  empty inputs (sashiko).
- In regs_refine_cond_op() case BPF_JSLT use cnum{32,64}_from_srange
  constructors instead of unsigned variants (sashiko).
- cnum{32,64}_intersect_with{,_urange,_srange}() helpers added
  (Alexei)

[RFCv1] https://lore.kernel.org/bpf/0c47b0b7ea476647746806c46fded4353be885f7.camel@gmail.com/T/
[v2] https://lore.kernel.org/bpf/c4fb60bafd526ae2d92b86e2250255aef0ba5ee1.camel@gmail.com/T/

Patch set layout
================

Patch #1 introduces the cnum (circular number) data type and basic
operations: intersection, addition, multiplication, negation, 32-to-64
and 64-to-32 projections.

CBMC based proofs for these operations are available at [2].
(The proofs lag behind the current patch set and will need an update if
this RFC moves forward, though I don't expect any significant changes.)

[2] https://github.com/eddyz87/cnum-verif

Patch #2 mechanically converts all direct accesses to bpf_reg_state's
min/max fields to use accessor functions, preparing for the field
replacement.

Patch #3 replaces the eight independent min/max fields in
bpf_reg_state with two cnum fields. The signed and unsigned bounds are
derived on demand from the cnum via the accessor functions.
This eliminates the separate signed<->unsigned cross-deduction logic,
simplifies reg_bounds_sync(), regs_refine_cond_op() and some
arithmetic operations.

Patch #4 adds selftest cases for improved 32->64 range refinements
enabled by cnum64_cnum32_intersect().

Precision trade-offs
====================

A cnum represents a contiguous arc on the circular number line.
Current master tracks four independent ranges (s64, u64, s32, u32)
whose intersection can implicitly represent value sets forming
two disjoint arcs on the circle - something a single cnum cannot express.

Cnums lose precision when a cross-domain conditional comparison
(e.g., unsigned comparison on a register with a signed-derived range)
produces an intersection that would be two disjoint arcs.
The cnum must over-approximate by returning one of the input arcs.
E.g. a pair of ranges u64:[1,U64_MAX] s:[-1,1] represents a set of two
values: {-1,1}, this can only be approximated as a set of three values
by cnum: {-1,0,1}.

Cnums gain precision in arithmetic: when addition, subtraction or
multiplication causes register value to cross both signed and unsigned
min/max boundaries. Current master resets the register as unbound in
such cases, while cnums preserve the exact wrapping arc.

In practice precision loss turned out to matter only for a set of
specifically crafted selftests from reg_bounds.c (see patch #3 for
more details). On the other hand, precision gains are observable for a
set of real-life programs.

Verification performance measurements
=====================================

Tested on 6683 programs across four corpora: Cilium (134 programs),
selftests (4635), sched_ext (376), and Meta internal BPF corpus
(1540). Program verification verdicts are identical across all four
program sets - no program changes between success and failure.

Instruction count comparison (cnum vs signed/unsigned pair):
- 83 programs improve, 44 regress, 6596 unchanged
- Total saved: ~290K insns, total added: 10K insns
- Largest improvements:
  -26% insns in an internal network firewall program;
  -47% insns in rusty/wd40 sched_ext schedulers.
- Largest regression: +24% insns in one Cilium program (5062 insns
  added).

Raw performance data is at the bottom of the email.

Raw verification performance metrics
====================================

Filtered out by -f insns_pct>5 -f !insns<10000

========= selftests: master vs cnums-everywhere-v2 =========

File  Program  Insns (A)  Insns (B)  Insns (DIFF)
----  -------  ---------  ---------  ------------

Total progs: 4665
total_insns diff min:  -15.71%
total_insns diff max:   45.45%
 -20 .. -10  %: 3
  -5 .. 0    %: 6
   0 .. 5    %: 4654
  45 .. 50   %: 2

========= scx: master vs cnums-everywhere-v2 =========

File             Program         Insns (A)  Insns (B)  Insns     (DIFF)
---------------  --------------  ---------  ---------  ----------------
scx_rusty.bpf.o  rusty_enqueue       39842      22053  -17789 (-44.65%)
scx_rusty.bpf.o  rusty_stopping      37738      19949  -17789 (-47.14%)
scx_wd40.bpf.o   wd40_stopping       37729      19880  -17849 (-47.31%)

Total progs: 376
total_insns diff min:  -47.31%
total_insns diff max:   19.61%
 -50 .. -40  %: 3
  -5 .. 0    %: 5
   0 .. 5    %: 366
   5 .. 15   %: 1
  15 .. 20   %: 1

========= meta: master vs cnums-everywhere-v2 =========

File                               Program     Insns (A)  Insns (B)  Insns     (DIFF)
---------------------------------  ----------  ---------  ---------  ----------------
<meta-internal-firewall-program>   ...egress   222327     164648     -57679 (-25.94%)
<meta-internal-firewall-program>   ..._tc_eg   222839     164772     -58067 (-26.06%)
<meta-internal-firewall-program>   ...egress   222327     164648     -57679 (-25.94%)
<meta-internal-firewall-program>   ..._tc_eg   222839     164772     -58067 (-26.06%)

Total progs: 1540
total_insns diff min:  -26.06%
total_insns diff max:    6.69%
 -30 .. -20  %: 4
 -15 .. -5   %: 6
  -5 .. 0    %: 36
   0 .. 5    %: 1493
   5 .. 10   %: 1

========= cilium: master vs cnums-everywhere-v2 =========

File        Program                          Insns (A)  Insns (B)  Insns    (DIFF)
----------  -------------------------------  ---------  ---------  ---------------
bpf_host.o  tail_handle_ipv4_cont_from_host      20962      26024  +5062 (+24.15%)
bpf_host.o  tail_handle_ipv6_cont_from_host      17036      18672   +1636 (+9.60%)

Total progs: 134
total_insns diff min:   -3.32%
total_insns diff max:   24.15%
  -5 .. 0    %: 12
   0 .. 5    %: 120
   5 .. 15   %: 1
  20 .. 25   %: 1
---
====================

Link: https://patch.msgid.link/20260424-cnums-everywhere-rfc-v1-v3-0-ca434b39a486@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>