In a self-improvement pipeline, model capacity may become a bottleneck as training progresses.
Instead of restarting training with a larger model, we can expand the existing model while preserving its learned function.
Two mechanisms:
- Net2Net expansion (general neural networks)
- Residual expansion for transformers
Pipeline:
model_t
↓
train on dataset_t
↓
detect capacity saturation
↓
expand model capacity
↓
continue training
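A minimal sketch of this loop, assuming a simple loss-plateau heuristic for detecting saturation; the callables (train_round, eval_loss, widen, deepen) are illustrative placeholders for the expansion mechanisms described below, not part of any library.

```python
def training_loop(model, data_rounds, train_round, eval_loss, widen, deepen,
                  plateau_tol=1e-3):
    """Sketch of the expand-when-saturated loop; all callables are supplied
    by the caller, since saturation detection and expansion are model-specific."""
    prev_loss = float("inf")
    for dataset_t in data_rounds:
        train_round(model, dataset_t)      # train on dataset_t
        loss = eval_loss(model, dataset_t)
        # Crude saturation heuristic: the loss stops improving between rounds.
        if prev_loss - loss < plateau_tol:
            model = widen(model)           # function-preserving width expansion
            model = deepen(model)          # function-preserving depth expansion
        prev_loss = loss
    return model
```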
Reference:
Chen, Goodfellow, and Shlens. "Net2Net: Accelerating Learning via Knowledge Transfer" (2015).
Net2Net introduces function-preserving transformations that increase model capacity without changing the function the network computes.
Net2WiderNet. Goal: increase the hidden dimension.
Example:
hidden_dim: 512 → 1024
Procedure:
- Select existing neurons to replicate until the target width is reached.
- Copy their incoming weights to the new neurons.
- Divide each replicated neuron's outgoing weights by its replication count so downstream outputs are unchanged.
Conceptually:
original neuron
↓
duplicate neuron
↓
split outgoing weights
Result:
f_new(x) = f_old(x)
Training can continue immediately.
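A minimal numpy sketch of the widening step for a single ReLU hidden layer; the name net2wider and the layer shapes are illustrative, but the copy-then-divide logic follows the procedure above, and the final assertion checks that the output is unchanged.

```python
import numpy as np

def net2wider(W1, b1, W2, new_width, rng=None):
    """Function-preserving widening of one hidden layer (Net2WiderNet style).

    Layer:  h = relu(W1 @ x + b1),  y = W2 @ h + b2  (b2 is unchanged)
    W1: (old_width, in_dim), b1: (old_width,), W2: (out_dim, old_width)
    """
    rng = np.random.default_rng() if rng is None else rng
    old_width = W1.shape[0]
    assert new_width >= old_width
    # Mapping g: each new unit j copies an existing unit g[j].
    # The first old_width units map to themselves; extras are random copies.
    g = np.concatenate([np.arange(old_width),
                        rng.integers(0, old_width, new_width - old_width)])
    counts = np.bincount(g, minlength=old_width)  # replication count per original unit
    # Incoming weights are copied; outgoing weights are divided by the
    # replication count so the downstream sum is unchanged.
    W1_new = W1[g]
    b1_new = b1[g]
    W2_new = W2[:, g] / counts[g]
    return W1_new, b1_new, W2_new

# Sanity check that the function is preserved when widening 512 -> 1024.
rng = np.random.default_rng(0)
x = rng.normal(size=8)
W1, b1 = rng.normal(size=(512, 8)), rng.normal(size=512)
W2, b2 = rng.normal(size=(4, 512)), rng.normal(size=4)
W1n, b1n, W2n = net2wider(W1, b1, W2, 1024, rng)
old = W2 @ np.maximum(W1 @ x + b1, 0) + b2
new = W2n @ np.maximum(W1n @ x + b1n, 0) + b2
assert np.allclose(old, new)
```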
Net2DeeperNet. Goal: add new layers.
Example:
Layer1
Layer2
Layer3
becomes
Layer1
Layer2
IdentityLayer
Layer3
Initialization:
IdentityLayer(x) = x (identity weight matrix, zero bias)
So
f_new(x) = f_old(x)
After expansion, training continues normally.
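A minimal numpy sketch of identity-layer insertion; it assumes ReLU activations, for which an identity-initialized linear layer preserves the output exactly because ReLU outputs are already non-negative. Names and shapes are illustrative.

```python
import numpy as np

def net2deeper(width):
    """Return weights for an identity-initialized layer (Net2DeeperNet style).

    Inserted as  h' = relu(W_id @ h + b_id)  on top of an existing ReLU
    layer; because h >= 0 already, relu(I @ h + 0) = h and the network's
    output is unchanged.
    """
    W_id = np.eye(width)
    b_id = np.zeros(width)
    return W_id, b_id

# Example: insert the identity layer after an existing hidden layer.
rng = np.random.default_rng(0)
x = rng.normal(size=8)
W1, b1 = rng.normal(size=(16, 8)), rng.normal(size=16)
W_id, b_id = net2deeper(16)
h = np.maximum(W1 @ x + b1, 0)          # original hidden activation
h_deep = np.maximum(W_id @ h + b_id, 0)  # after the inserted identity layer
assert np.allclose(h, h_deep)
```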
Standard transformer block:
x_{l+1} = x_l + F(x_l)
where F contains:
- attention
- MLP
- normalization
Naively duplicating layers changes the function:
x_{l+2} = x_l + F(x_l) + F(x_l + F(x_l))
which is not the original mapping, because the duplicated F is applied to an input that the first residual update has already shifted.
Fix: split the residual update across the duplicated layers.
Original block:
x = x + F(x)
Expanded blocks:
x = x + (1/k) * F(x)
x = x + (1/k) * F(x)
...
k times
Example for doubling depth:
x = x + 0.5 * F1(x)
x = x + 0.5 * F2(x)
Initialization:
F1 = copy of original block
F2 = copy of original block
This approximates the same function while increasing depth.
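A toy numpy sketch of residual expansion; block() merely stands in for the attention/MLP/normalization sub-block, and the numeric check shows that the 1/k-scaled duplicate stack stays close to, but not exactly equal to, the original mapping.

```python
import numpy as np

def block(x, W):
    """A toy residual update F(x); stands in for attention + MLP + norm."""
    return np.tanh(W @ x)

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8)) * 0.1
x = rng.normal(size=8)

# Original: one residual block.
y_old = x + block(x, W)

# Expanded: k copies of the same block, each scaled by 1/k.
k = 2
y_new = x
for _ in range(k):
    y_new = y_new + (1.0 / k) * block(y_new, W)

# The expansion is approximate, not exact: the second copy sees a slightly
# shifted input, so y_new is close to y_old but not identical.
print(np.max(np.abs(y_new - y_old)))  # small, but nonzero
```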
A transformer stack can be read as a discretized ODE: each residual block is an Euler step x_{l+1} = x_l + h * F(x_l) with step size h = 1.
Splitting one block into k copies scaled by 1/k integrates the same dynamics with a smaller step size h = 1/k, which is why the expansion increases depth while approximately preserving the function.