groupby-sum pads every group to the largest group size #745

@FBumann

Note: This issue was drafted by an AI agent (Claude Code) during a profiling/investigation session with @FBumann. The measurements (memray C-level profiling, benchmarks, the 284× / 132 MB figures) and the git-history archaeology are real and reproducible, but the framing and proposed directions are a starting point for discussion — please sanity-check before acting on them.

LinearExpression.groupby(...).sum() pads every group to the largest group size along _term (via unstack(group_dim, fill_value=...) in expressions.py). For skewed groups, e.g. incidence/balance constraints where a few buses host most units, the result blows up: a 300k-element expression over zipf-distributed groups produces a ~12 GiB padded result, more than 99% of which is fill values.
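To make the blow-up concrete, here is a minimal back-of-the-envelope sketch in plain Python. It does not call linopy or xarray; it only counts cells, assuming the padded result is a dense rectangle of (number of groups) × (largest group size). The helper names and the toy group sizes are hypothetical, not taken from the benchmark above.

```python
def padded_cells(group_sizes):
    """Cells in a dense rectangle where every group is padded to the largest one."""
    return len(group_sizes) * max(group_sizes)


def fill_fraction(group_sizes):
    """Share of the padded rectangle that holds fill values rather than real terms."""
    return 1.0 - sum(group_sizes) / padded_cells(group_sizes)


# Hypothetical skewed layout: one hub with 1000 terms, 999 groups with one term each.
sizes = [1000] + [1] * 999

print("real terms:   ", sum(sizes))
print("padded cells: ", padded_cells(sizes))
print("fill fraction:", round(fill_fraction(sizes), 4))
```

In this toy case a single large group forces every other group up to its size, so almost the whole rectangle is fill even though the number of real terms is small.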

Two separable costs:

Real-world evidence: PyPSA already works around this

PyPSA splits the nodal-balance constraint into two separate constraints — strongly-meshed vs weakly-meshed buses — purely to contain this padding (pypsa/optimization/optimize.py, with its own comment "This reduces memory usage for large networks"):

meshed_buses = get_strongly_meshed_buses(n, threshold=45)   # buses with >45 attached components
define_nodal_balance_constraints(n, sns, buses=weakly_meshed_buses)             # small groups
define_nodal_balance_constraints(n, sns, buses=meshed_buses, suffix="-meshed")  # large groups

A single grouped balance over all buses would pad every bus to the largest hub's term count; bucketing by meshedness keeps each rectangle padded only to its own bucket max. This is the "eventually do a separation of short and long linear expressions" noted in the original groupby commit (PyPSA #557, 2023) — and it is actively maintained: PyPSA #1591 (2026) promoted meshed_thresholds to a tunable user parameter.
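The effect of bucketing can be sketched with the same kind of cell counting. This is an illustration under stated assumptions, not PyPSA's or linopy's implementation: the helper names and group sizes are hypothetical, each bucket is assumed to be padded to its own maximum, and the threshold of 45 simply mirrors the value in the snippet above.

```python
def padded_cells(group_sizes):
    """Cells in a dense rectangle padded to the largest group (0 if empty)."""
    return len(group_sizes) * max(group_sizes) if group_sizes else 0


def bucketed_cells(group_sizes, threshold):
    """Total cells when small and large groups are padded separately."""
    small = [s for s in group_sizes if s <= threshold]
    large = [s for s in group_sizes if s > threshold]
    return padded_cells(small) + padded_cells(large)


# Hypothetical layout: one hub with 1000 terms, 999 groups with one term each.
sizes = [1000] + [1] * 999

print("single rectangle:", padded_cells(sizes))
print("two buckets:     ", bucketed_cells(sizes, threshold=45))
```

With a threshold that separates the hub from the small groups, each rectangle is padded only to its own bucket maximum. If the threshold is set above the largest group, everything lands in one bucket and the saving disappears.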

Note: on typical PyPSA networks the realistic group skew is small (max ≈ 8 generators/bus, ~2.7× padding), so groupby is not the build's peak allocation there — merge is (#749). The pathological skew matters for detailed unit-commitment / rooftop-PV aggregation and for the meshed hubs the split above exists to handle.

Related

Filing as a tracking note so this doesn't get lost.

Labels: performance (This improves performance while not (meaningfully) altering behaviour for users)