Commit c4ed86f

docs(aggregation): add grouping usage example and fix GradVac note

Add a Grouping example page covering all four strategies from the GradVac
paper (whole_model, enc_dec, all_layer, all_matrix), with a runnable code
block for each. Update the GradVac docstring note to link to the new page
instead of the previous placeholder text. Fix trailing whitespace in
CHANGELOG.md.

Made-with: Cursor

1 parent 1034bbf · commit c4ed86f

File tree

4 files changed: +178 −4 lines changed


CHANGELOG.md

Lines changed: 1 addition & 1 deletion

@@ -10,7 +10,7 @@ changelog does not include internal changes that do not affect the user.
 
 ### Added
 
-- Added `GradVac` and `GradVacWeighting` from
+- Added `GradVac` and `GradVacWeighting` from
   [Gradient Vaccine: Investigating and Improving Multi-task Optimization in Massively Multilingual Models](https://arxiv.org/pdf/2010.05874).
 - Added a fallback for when the inner optimization of `NashMTL` fails (which can happen for example
   on the matrix [[0., 0.], [0., 1.]]).

docs/source/examples/grouping.rst

Lines changed: 167 additions & 0 deletions (new file)

@@ -0,0 +1,167 @@

Grouping
========

When applying a conflict-resolving aggregator such as :class:`~torchjd.aggregation.GradVac` in
multi-task learning, the cosine similarities between task gradients can be computed at different
granularities. The GradVac paper introduces four strategies, each partitioning the shared
parameter vector differently:

1. **Whole Model** (default) — one group covering all shared parameters.
2. **Encoder-Decoder** — one group per top-level sub-network (e.g. encoder and decoder separately).
3. **All Layers** — one group per leaf module of the encoder.
4. **All Matrices** — one group per individual parameter tensor.

In TorchJD, grouping is achieved by calling :func:`~torchjd.autojac.jac_to_grad` once per group
after :func:`~torchjd.autojac.mtl_backward`, with a dedicated aggregator instance per group.
For stateful aggregators such as :class:`~torchjd.aggregation.GradVac`, each instance
independently maintains its own EMA state :math:`\hat{\phi}`, matching the per-block targets from
the original paper.
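For intuition, the pairwise adjustment that such an instance applies can be sketched in plain Python. This is an illustrative sketch of the gradient-alteration rule from the GradVac paper, not TorchJD code, and the helper names are ours:

```python
import math

def cosine(u, v):
    """Cosine similarity between two gradient vectors given as plain lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def gradvac_adjust(g_i, g_j, phi_hat):
    """Rotate g_i towards g_j so that cos(g_i', g_j) reaches the EMA target phi_hat.

    Illustrative sketch of the GradVac alteration rule (not the TorchJD implementation).
    """
    phi = cosine(g_i, g_j)
    if phi >= phi_hat:  # already at least as aligned as the target: leave g_i unchanged
        return g_i
    norm_i = math.sqrt(sum(a * a for a in g_i))
    norm_j = math.sqrt(sum(b * b for b in g_j))
    coef = norm_i * (phi_hat * math.sqrt(1 - phi**2) - phi * math.sqrt(1 - phi_hat**2))
    coef /= norm_j * math.sqrt(1 - phi_hat**2)
    return [a + coef * b for a, b in zip(g_i, g_j)]

# Two conflicting task gradients (negative cosine) and a positive EMA target.
g1, g2, target = [1.0, 0.0], [-1.0, 1.0], 0.5
g1_new = gradvac_adjust(g1, g2, target)
# After the adjustment, cos(g1_new, g2) equals the target (0.5 here).
# The EMA target itself evolves as phi_hat <- (1 - beta) * phi_hat + beta * phi.
```

With per-group aggregation, each group effectively runs this rule on its own block of the gradients, with its own target.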
.. note::
    The grouping is orthogonal to the choice of
    :func:`~torchjd.autojac.backward` vs :func:`~torchjd.autojac.mtl_backward`. Those functions
    determine *which* parameters receive Jacobians; grouping then determines *how* those Jacobians
    are partitioned for aggregation. Calling :func:`~torchjd.autojac.jac_to_grad` once on all shared
    parameters corresponds to the Whole Model strategy. Splitting those parameters into
    sub-networks and calling :func:`~torchjd.autojac.jac_to_grad` separately on each — with a
    dedicated aggregator per sub-network — gives an arbitrary custom grouping, such as the
    Encoder-Decoder strategy described in the GradVac paper for encoder-decoder architectures.
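Formally, grouping can be viewed as block-wise aggregation of the stacked Jacobian. The notation below is a sketch of ours, not from the TorchJD docs:

```latex
% J is the m x n Jacobian of the m task losses w.r.t. the n shared parameters.
% A partition of the parameters into K groups splits J column-wise:
%   J = \left[\, J^{(1)} \;\middle|\; J^{(2)} \;\middle|\; \dots \;\middle|\; J^{(K)} \,\right],
%   \qquad J^{(k)} \in \mathbb{R}^{m \times n_k}, \quad \textstyle\sum_k n_k = n.
% Each group gets its own aggregator A_k (e.g. a dedicated GradVac instance),
% and the final update is the concatenation of the per-block aggregations:
%   g = \left( A_1\!\big(J^{(1)}\big),\, \dots,\, A_K\!\big(J^{(K)}\big) \right) \in \mathbb{R}^{n}.
% The Whole Model strategy is the special case K = 1.
```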
.. note::
    The examples below use :class:`~torchjd.aggregation.GradVac`, but the same pattern applies to
    any aggregator.

1. Whole Model
--------------

A single :class:`~torchjd.aggregation.GradVac` instance aggregates all shared parameters
together. Cosine similarities are computed between the full task gradient vectors.

.. testcode::
    :emphasize-lines: 14, 19

    import torch
    from torch.nn import Linear, MSELoss, ReLU, Sequential
    from torch.optim import SGD

    from torchjd.aggregation import GradVac
    from torchjd.autojac import jac_to_grad, mtl_backward

    encoder = Sequential(Linear(10, 5), ReLU(), Linear(5, 3), ReLU())
    task1_head, task2_head = Linear(3, 1), Linear(3, 1)
    optimizer = SGD([*encoder.parameters(), *task1_head.parameters(), *task2_head.parameters()], lr=0.1)
    loss_fn = MSELoss()
    inputs, t1, t2 = torch.randn(8, 16, 10), torch.randn(8, 16, 1), torch.randn(8, 16, 1)

    gradvac = GradVac()

    for x, y1, y2 in zip(inputs, t1, t2):
        features = encoder(x)
        mtl_backward([loss_fn(task1_head(features), y1), loss_fn(task2_head(features), y2)], features=features)
        jac_to_grad(encoder.parameters(), gradvac)
        optimizer.step()
        optimizer.zero_grad()

2. Encoder-Decoder
------------------

One :class:`~torchjd.aggregation.GradVac` instance per top-level sub-network. Here the model
is split into an encoder and a decoder; cosine similarities are computed separately within each.
Passing ``features=dec_out`` to :func:`~torchjd.autojac.mtl_backward` causes both sub-networks
to receive Jacobians, which are then aggregated independently.

.. testcode::
    :emphasize-lines: 8-9, 15-16, 22-23

    import torch
    from torch.nn import Linear, MSELoss, ReLU, Sequential
    from torch.optim import SGD

    from torchjd.aggregation import GradVac
    from torchjd.autojac import jac_to_grad, mtl_backward

    encoder = Sequential(Linear(10, 5), ReLU())
    decoder = Sequential(Linear(5, 3), ReLU())
    task1_head, task2_head = Linear(3, 1), Linear(3, 1)
    optimizer = SGD([*encoder.parameters(), *decoder.parameters(), *task1_head.parameters(), *task2_head.parameters()], lr=0.1)
    loss_fn = MSELoss()
    inputs, t1, t2 = torch.randn(8, 16, 10), torch.randn(8, 16, 1), torch.randn(8, 16, 1)

    encoder_gradvac = GradVac()
    decoder_gradvac = GradVac()

    for x, y1, y2 in zip(inputs, t1, t2):
        enc_out = encoder(x)
        dec_out = decoder(enc_out)
        mtl_backward([loss_fn(task1_head(dec_out), y1), loss_fn(task2_head(dec_out), y2)], features=dec_out)
        jac_to_grad(encoder.parameters(), encoder_gradvac)
        jac_to_grad(decoder.parameters(), decoder_gradvac)
        optimizer.step()
        optimizer.zero_grad()

3. All Layers
-------------

One :class:`~torchjd.aggregation.GradVac` instance per leaf module. Cosine similarities are
computed between the per-layer blocks of the task gradients.

.. testcode::
    :emphasize-lines: 14-15, 20-21

    import torch
    from torch.nn import Linear, MSELoss, ReLU, Sequential
    from torch.optim import SGD

    from torchjd.aggregation import GradVac
    from torchjd.autojac import jac_to_grad, mtl_backward

    encoder = Sequential(Linear(10, 5), ReLU(), Linear(5, 3), ReLU())
    task1_head, task2_head = Linear(3, 1), Linear(3, 1)
    optimizer = SGD([*encoder.parameters(), *task1_head.parameters(), *task2_head.parameters()], lr=0.1)
    loss_fn = MSELoss()
    inputs, t1, t2 = torch.randn(8, 16, 10), torch.randn(8, 16, 1), torch.randn(8, 16, 1)

    leaf_layers = [m for m in encoder.modules() if not list(m.children()) and list(m.parameters())]
    gradvacs = [GradVac() for _ in leaf_layers]

    for x, y1, y2 in zip(inputs, t1, t2):
        features = encoder(x)
        mtl_backward([loss_fn(task1_head(features), y1), loss_fn(task2_head(features), y2)], features=features)
        for layer, gradvac in zip(leaf_layers, gradvacs):
            jac_to_grad(layer.parameters(), gradvac)
        optimizer.step()
        optimizer.zero_grad()

4. All Matrices
---------------

One :class:`~torchjd.aggregation.GradVac` instance per individual parameter tensor. Cosine
similarities are computed between the per-tensor blocks of the task gradients (e.g. weights and
biases of each layer are treated as separate groups).

.. testcode::
    :emphasize-lines: 14-15, 20-21

    import torch
    from torch.nn import Linear, MSELoss, ReLU, Sequential
    from torch.optim import SGD

    from torchjd.aggregation import GradVac
    from torchjd.autojac import jac_to_grad, mtl_backward

    encoder = Sequential(Linear(10, 5), ReLU(), Linear(5, 3), ReLU())
    task1_head, task2_head = Linear(3, 1), Linear(3, 1)
    optimizer = SGD([*encoder.parameters(), *task1_head.parameters(), *task2_head.parameters()], lr=0.1)
    loss_fn = MSELoss()
    inputs, t1, t2 = torch.randn(8, 16, 10), torch.randn(8, 16, 1), torch.randn(8, 16, 1)

    shared_params = list(encoder.parameters())
    gradvacs = [GradVac() for _ in shared_params]

    for x, y1, y2 in zip(inputs, t1, t2):
        features = encoder(x)
        mtl_backward([loss_fn(task1_head(features), y1), loss_fn(task2_head(features), y2)], features=features)
        for param, gradvac in zip(shared_params, gradvacs):
            jac_to_grad([param], gradvac)
        optimizer.step()
        optimizer.zero_grad()

docs/source/examples/index.rst

Lines changed: 4 additions & 0 deletions

@@ -29,6 +29,9 @@ This section contains some usage examples for TorchJD.
 - :doc:`PyTorch Lightning Integration <lightning_integration>` showcases how to combine
   TorchJD with PyTorch Lightning, by providing an example implementation of a multi-task
   ``LightningModule`` optimized by Jacobian descent.
+- :doc:`Grouping <grouping>` shows how to apply an aggregator independently per parameter group
+  (e.g. per layer), so that conflict resolution happens at a finer granularity than the full
+  shared parameter vector.
 - :doc:`Automatic Mixed Precision <amp>` shows how to combine mixed precision training with TorchJD.
 
 .. toctree::
@@ -43,3 +46,4 @@ This section contains some usage examples for TorchJD.
     monitoring.rst
     lightning_integration.rst
     amp.rst
+    grouping.rst

src/torchjd/aggregation/_gradvac.py

Lines changed: 6 additions & 3 deletions

@@ -43,9 +43,12 @@ class GradVac(GramianWeightedAggregator):
     you need reproducibility.
 
     .. note::
-        To apply GradVac with per-layer or per-parameter-group granularity, first aggregate the
-        Jacobian into groups, apply GradVac per group, and sum the results. See the grouping usage
-        example for details.
+        To apply GradVac with per-layer or per-parameter-group granularity, create a separate
+        :class:`GradVac` instance for each group and call
+        :func:`~torchjd.autojac.jac_to_grad` once per group after
+        :func:`~torchjd.autojac.mtl_backward`. Each instance maintains its own EMA state,
+        matching the per-block targets :math:`\hat{\phi}_{ijk}` from the original paper. See
+        the :doc:`Grouping </examples/grouping>` example for details.
     """
 
     def __init__(self, beta: float = 0.5, eps: float = 1e-8) -> None:
