groupby-sum pads every group to the largest group size #745

> **Note:** This issue was drafted by an AI agent (Claude Code) during a profiling/investigation session with @FBumann. The measurements (memray C-level profiling, benchmarks, the 284× / 132 MB figures) and the git-history archaeology are real and reproducible, but the framing and proposed directions are a starting point for discussion — please sanity-check before acting on them.

`LinearExpression.groupby(...).sum()` pads every group to the largest group size along `_term` (via `unstack(group_dim, fill_value=...)` in `expressions.py`). For skewed groups, such as incidence/balance constraints where a few buses host most units, the padded result dominates memory: a 300k-element expression over Zipf-distributed groups produces a ~12 GiB padded result that is >99% fill values.
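To make the padding arithmetic concrete, here is a minimal NumPy sketch that compares the number of real terms with the number of cells in a dense, rectangular layout. It does not call linopy; `padded_stats` and the toy group sizes are hypothetical and chosen only for illustration, not taken from the benchmark above.

```python
import numpy as np

def padded_stats(group_ids: np.ndarray) -> dict:
    """Compare the real term count with the size of a dense, padded layout."""
    # Number of terms that land in each group.
    _, counts = np.unique(group_ids, return_counts=True)
    n_groups = counts.size
    width = int(counts.max())          # every group is padded to this width
    padded_cells = n_groups * width    # cells in the rectangular result
    real_cells = int(counts.sum())     # cells that hold an actual term
    return {
        "n_groups": n_groups,
        "width": width,
        "padded_cells": padded_cells,
        "real_cells": real_cells,
        "fill_fraction": 1.0 - real_cells / padded_cells,
    }

# Toy skewed grouping (hypothetical): one hub with 1000 terms,
# plus 999 groups holding a single term each.
group_ids = np.concatenate([np.zeros(1000, dtype=np.int64), np.arange(1, 1000)])
stats = padded_stats(group_ids)
print(stats)
```

In this toy case a single large group sets the width for all 1000 groups, so fewer than 2000 real terms occupy a million-cell rectangle.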

Two separable costs:
- **Transient**: the unstack/stack machinery makes extra copies. This is addressable in the current representation. Prototype: FBumann/linopy#25 scatters terms directly into the result (~32× faster, ~2.6× lower peak memory on the skewed case).
- **Inherent**: the padded result itself, forced by the dense rectangular `_term` layout. Only a long-format/sparse kernel removes it. See #756.
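The "scatter" idea behind the transient fix can be sketched in plain NumPy as follows. This is not the code from the prototype PR; `scatter_groupby_terms` is a hypothetical helper, and the fill values (`NaN` for coefficients, `-1` for variable labels) are assumptions made for the example. The point is only that each term can be written once into its final slot, without an intermediate unstack/stack round trip.

```python
import numpy as np

def scatter_groupby_terms(coeffs, var_ids, group_ids,
                          fill_coeff=np.nan, fill_var=-1):
    """Write each term directly into a padded (n_groups, width) result."""
    groups, inverse, counts = np.unique(
        group_ids, return_inverse=True, return_counts=True
    )
    width = int(counts.max())

    # Stable sort keeps the original term order within each group.
    order = np.argsort(inverse, kind="stable")
    rows = inverse[order]

    # Slot of each sorted term inside its own group: 0, 1, 2, ...
    starts = np.cumsum(counts) - counts
    slots = np.arange(order.size) - np.repeat(starts, counts)

    # Allocate the padded result once and scatter the terms into it.
    out_coeffs = np.full((groups.size, width), fill_coeff, dtype=float)
    out_vars = np.full((groups.size, width), fill_var, dtype=np.int64)
    out_coeffs[rows, slots] = coeffs[order]
    out_vars[rows, slots] = var_ids[order]
    return groups, out_coeffs, out_vars

coeffs = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
var_ids = np.array([10, 11, 12, 13, 14])
group_ids = np.array([0, 1, 0, 0, 1])

groups, out_coeffs, out_vars = scatter_groupby_terms(coeffs, var_ids, group_ids)
print(out_vars)
```

The result is still the dense padded rectangle, so this sketch only touches the transient cost. The inherent cost stays as long as the output layout is rectangular.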

### Real-world evidence: PyPSA already works around this

PyPSA splits the nodal-balance constraint into **two separate constraints** — strongly-meshed vs weakly-meshed buses — purely to contain this padding (`pypsa/optimization/optimize.py`, with its own comment *"This reduces memory usage for large networks"*):

```python
meshed_buses = get_strongly_meshed_buses(n, threshold=45)   # buses with >45 attached components
define_nodal_balance_constraints(n, sns, buses=weakly_meshed_buses)             # small groups
define_nodal_balance_constraints(n, sns, buses=meshed_buses, suffix="-meshed")  # large groups
```

A single grouped balance over all buses would pad *every* bus to the largest hub's term count; bucketing by meshedness keeps each rectangle padded only to its own bucket max. This is the *"eventually do a separation of short and long linear expressions"* noted in the original groupby commit (PyPSA #557, 2023) — and it is actively maintained: PyPSA #1591 (2026) promoted `meshed_thresholds` to a tunable user parameter.
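The effect of bucketing can be checked with a few lines of arithmetic. The sketch below uses hypothetical per-bus term counts, not data from a real PyPSA network, and compares the padded cell count of one rectangle over all buses with two rectangles split at a threshold. The helper names are made up for the example.

```python
import numpy as np

def padded_cells(counts) -> int:
    """Cells in a dense rectangle that pads every group to the largest one."""
    counts = np.asarray(counts)
    return int(counts.size * counts.max()) if counts.size else 0

def bucketed_padded_cells(counts, threshold: int) -> int:
    """Cells when small and large groups get separate rectangles."""
    counts = np.asarray(counts)
    small = counts[counts <= threshold]
    large = counts[counts > threshold]
    return padded_cells(small) + padded_cells(large)

# Hypothetical network: 990 buses with 3 terms each, 10 hubs with 200 terms each.
counts = np.array([3] * 990 + [200] * 10)

single = padded_cells(counts)                        # one rectangle, width 200
split = bucketed_padded_cells(counts, threshold=45)  # two rectangles
real = int(counts.sum())                             # terms that actually exist
print(single, split, real)
```

With these made-up counts the single rectangle is about 40× larger than the bucketed layout, which in this case contains no padding at all because each bucket is uniform.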

Note: on **typical** PyPSA networks the realistic group skew is small (max ≈ 8 generators/bus, ~2.7× padding), so `groupby` is not the build's peak allocation there — `merge` is (#749). The pathological skew matters for detailed unit-commitment / rooftop-PV aggregation and for the meshed hubs the split above exists to handle.

### Related

- #748 — `@`/`dot` against a sparse matrix densifies the same way (KVL).
- #749 — `merge` of ragged expressions is the actual build peak.
- #756 — long-format/sparse `_term` kernel that subsumes all three.

Filing as a tracking note so this doesn't get lost.


