Learning state-of-the-art deep learning optimization algorithms.
A zij (Arabic: زِيج, pronounced "zeej") is an astronomical handbook from the
Islamic golden age: a set of tables and computational methods that astronomers
consulted instead of re-deriving the field from scratch. The best known is the
Zīj al-Sindhind of Muḥammad ibn Mūsā al-Khwārizmī (محمد بن موسى الخوارزمي,
c. 820 CE). His Latinized name, Algoritmi, became the word algorithm, and his
book al-Jabr gave us algebra.
This project takes the name in that spirit. It gathers the equation, the paper,
and runnable code for the optimization methods used in machine learning.
Installation
git clone https://github.com/junaidaliop/zij.git
cd zij
conda env create -f environment.yml
conda activate zij-optim
Quick start
import zij

# torch.optim, vendored at tag v2.12.0
opt = zij.AdamW(model.parameters(), lr=3e-4)

# research optimizers, same interface
opt = zij.Muon([p for p in model.parameters() if p.ndim == 2], lr=2e-2)
opt = zij.Prodigy(model.parameters())  # no learning rate to set
opt = zij.SAM(model.parameters(), base_optimizer=zij.SGD, lr=0.1, rho=0.05)

# memory-efficient low-rank training (per-group rank)
opt = zij.GaLoreAdamW(
    [{"params": params, "rank": 128, "update_proj_gap": 200, "scale": 0.25, "proj_type": "std"}],
    lr=1e-2,
)

# look up by name
zij.list_optimizers("adam*")
opt_cls = zij.load_optimizer("soap")
zij.optim mirrors torch.optim, so zij.optim.AdamW is the same class as
zij.AdamW, and zij.optim.lr_scheduler is available. Use whichever import
reads better in your code.
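The name lookup in the quick start can be pictured as a registry keyed by lowercase name with shell-style pattern matching. The sketch below is an assumption about that behavior, not zij's code: `_REGISTRY`, `register`, and the empty placeholder classes are hypothetical, and the two functions only mimic what `list_optimizers` and `load_optimizer` appear to do.

```python
from fnmatch import fnmatch

# Hypothetical registry: lowercase name -> optimizer class.
_REGISTRY = {}

def register(cls):
    """Class decorator that records an optimizer under its lowercase name."""
    _REGISTRY[cls.__name__.lower()] = cls
    return cls

@register
class Adam: ...

@register
class AdamW: ...

@register
class SOAP: ...

def list_optimizers(pattern="*"):
    """Return sorted registered names that match a shell-style pattern."""
    return sorted(n for n in _REGISTRY if fnmatch(n, pattern.lower()))

def load_optimizer(name):
    """Return the class registered under `name` (case-insensitive)."""
    try:
        return _REGISTRY[name.lower()]
    except KeyError:
        raise ValueError(f"unknown optimizer: {name!r}") from None

print(list_optimizers("adam*"))   # ['adam', 'adamw']
print(load_optimizer("soap"))
```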
Note
A few families use a documented non-standard call protocol. Schedule-Free needs
opt.train() and opt.eval(); the SAM family takes a closure or an explicit
first_step / second_step pair; Adam-mini and LOMO are built from a model
rather than a parameter list. Each class docstring states which.
Library
The PyTorch package ships 106 ready-to-use optimizers. zij.core mirrors
torch.optim at tag v2.12.0 (Adam, AdamW, SGD, Muon, LBFGS, Adafactor, and the
rest, plus lr_scheduler and swa_utils). zij.contrib adds research methods
grouped by family: first-order, second-order, memory-efficient,
learning-rate-free, and sharpness-aware. In every Canon table below, the zij
column names the class where an implementation exists; a dash (—) means the
method is listed but not yet implemented (paper-only, or its source is under a
license that cannot be vendored).
zij is a PyTorch library today. The Canon is framework-agnostic: it covers each
method regardless of the framework of its original code. JAX and TensorFlow ports
are planned and will follow the same standards.
Canon
The Canon below covers 740 methods across 11 categories. Each row
records the canonical name, venue, paper, the best available implementation, and
the zij class where one exists.
First-Order Optimizers
First-order optimizers update parameters using only gradients and accumulated gradient statistics such as momentum and second-moment estimates. This page covers the stochastic gradient descent lineage, the Adam family, and more recent sign-based and variance-reduced methods. The zij column gives the class name for optimizers already implemented in the package.
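As a concrete instance of "gradients plus accumulated statistics", here is a minimal, dependency-free sketch of the standard Adam update for one scalar parameter. It is an illustration only, not the vendored `zij.AdamW` code, and the toy objective and hyperparameter values are assumptions.

```python
import math

def adam_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; returns (w, m, v)."""
    m = beta1 * m + (1 - beta1) * g          # momentum (first moment)
    v = beta2 * v + (1 - beta2) * g * g      # second-moment estimate
    m_hat = m / (1 - beta1 ** t)             # bias correction, t starts at 1
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

# Minimize f(w) = w^2, whose gradient is 2w.
w, m, v = 1.0, 0.0, 0.0
for t in range(1, 201):
    w, m, v = adam_step(w, 2 * w, m, v, t, lr=0.05)
print(abs(w))
```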
Memory-Efficient Optimizers
Memory-efficient optimizers reduce the optimizer-state memory that dominates large-model training budgets, where Adam-style methods store two extra full-precision values per parameter. The methods below cover factored second moments, 8-bit and 4-bit state quantization, low-rank gradient projection, block-coordinate updates, zeroth-order gradient estimates, and stateless update rules.
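The arithmetic behind that claim is easy to check. The sketch below counts optimizer-state bytes for an Adam-style method and compares a full second-moment matrix with a factored one; the 7-billion-parameter model size and the 4096 x 4096 layer are illustrative assumptions, and per-block quantization metadata is ignored.

```python
def adam_state_bytes(n_params, bytes_per_value=4):
    """Adam keeps two values (first and second moment) per parameter."""
    return 2 * n_params * bytes_per_value

def factored_second_moment_values(rows, cols):
    """A factored second moment keeps one row vector and one column vector."""
    return rows + cols

n = 7_000_000_000  # illustrative model size, not taken from the source
print(adam_state_bytes(n, 4) / 1e9)   # 56.0 (GB with 32-bit state)
print(adam_state_bytes(n, 1) / 1e9)   # 14.0 (GB with 8-bit state)
print(4096 * 4096, factored_second_moment_values(4096, 4096))  # 16777216 8192
```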
HuggingFace transformers exposes many of these methods through the optim argument of TrainingArguments. Each string value below maps to a memory-efficient optimizer; all backing libraries except the built-in Adafactor must be installed separately.
bitsandbytes AdEMAMix with 8-bit quantized and paged state (MIT).
rmsprop_bnb_8bit
bitsandbytes RMSprop with block-wise 8-bit quantized state (MIT).
adamw_torch_4bit / adamw_torch_8bit
torchao pure-PyTorch AdamW with 4-bit or 8-bit optimizer states (BSD-3-Clause).
galore_adamw / galore_adamw_8bit / galore_adafactor and *_layerwise variants
galore-torch, the official GaLore release (Apache-2.0).
apollo_adamw / apollo_adamw_layerwise
apollo-torch, the official APOLLO release (CC-BY-NC-4.0).
lomo / adalomo
lomo-optim, the official LOMO and AdaLomo release (MIT).
Fractional-Order Optimizers
Fractional-order optimizers generalize the integer-order gradient step with fractional-calculus operators, most commonly the Caputo, Riemann-Liouville, or Grünwald-Letnikov derivative, which weight past gradient information through power-law memory kernels. The field is young: the first neural-network training results date to 2015, convergence theory is still being settled, and most papers ship no code.
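The power-law memory is visible in the Grünwald-Letnikov coefficients themselves. The sketch below computes them with the standard recurrence and applies them as a fractional difference over a history of values; it illustrates the operator only and is not any specific optimizer from the Canon. The function names are hypothetical.

```python
def gl_weights(alpha, n):
    """First n Grünwald-Letnikov coefficients (-1)^k * binom(alpha, k)."""
    w = [1.0]
    for k in range(1, n):
        w.append(w[-1] * (1.0 - (alpha + 1.0) / k))
    return w

def gl_fractional_difference(values_newest_first, alpha):
    """Order-alpha fractional difference with unit step size."""
    w = gl_weights(alpha, len(values_newest_first))
    return sum(wk * vk for wk, vk in zip(w, values_newest_first))

# alpha = 1 recovers the ordinary first difference: only two nonzero weights.
print(gl_weights(1.0, 4))
# 0 < alpha < 1 keeps a slowly decaying tail over the whole history.
print(gl_weights(0.5, 4))
```

With `alpha = 1` the history beyond one step drops out, while a fractional order keeps every past value with a weight that shrinks polynomially rather than exponentially.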
Note: FAdam (arXiv 2405.12807) is a Fisher-information variant of Adam and is unrelated to fractional calculus despite the name.
Distributed and Communication-Efficient Optimizers
Optimizers in this category target training across many devices or nodes, where memory and inter-worker communication are the main bottlenecks. They shard optimizer state, compress gradient exchange, or synchronize infrequently so that training scales without a proportional increase in bandwidth. Some entries are standalone update rules, while others wrap an inner optimizer with a communication-efficient outer loop.
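The "synchronize infrequently" idea can be sketched as local SGD with periodic averaging: each worker takes several steps on its own objective, then one communication round replaces every copy with the mean. This is a toy, single-process simulation under assumed scalar objectives, not a distributed implementation.

```python
def local_sgd(workers, grads, lr=0.1, rounds=20, local_steps=5):
    """workers: list of scalar parameters; grads: one gradient function per worker."""
    for _ in range(rounds):
        for _ in range(local_steps):
            # Each worker steps on its own data shard with no communication.
            workers = [w - lr * g(w) for w, g in zip(workers, grads)]
        # One communication round: replace every copy with the average.
        avg = sum(workers) / len(workers)
        workers = [avg] * len(workers)
    return workers

# Two workers with different local objectives (w - 1)^2 and (w - 3)^2.
grads = [lambda w: 2 * (w - 1.0), lambda w: 2 * (w - 3.0)]
result = local_sgd([0.0, 0.0], grads)
print(result)
```

Here 100 gradient steps per worker cost only 20 communication rounds, and the averaged parameter settles near 2, the minimizer of the summed objective.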
Second-Order and Orthogonalized Optimizers
Second-order and orthogonalized optimizers exploit curvature information or the matrix structure of gradients rather than purely elementwise first-order statistics. This group spans quasi-Newton and Hessian-diagonal methods (L-BFGS, AdaHessian, Sophia), full-matrix and Kronecker-factored preconditioning (PSGD, Shampoo, SOAP), and orthogonalized-update methods in the Muon family. Venues reflect peer-reviewed acceptance where applicable; otherwise the arXiv year is listed.
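To make "orthogonalized update" concrete, the sketch below pushes a small matrix toward its nearest orthogonal factor with a Newton-Schulz iteration. It uses the classical cubic form X <- 1.5 X - 0.5 X X^T X, not the tuned coefficients used by Muon implementations, and it assumes a full-rank input; the helper names are hypothetical.

```python
def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def orthogonalize(g, steps=30):
    """Classical cubic Newton-Schulz iteration toward an orthogonal matrix."""
    fro = sum(x * x for row in g for x in row) ** 0.5
    x = [[v / fro for v in row] for row in g]   # singular values now at most 1
    for _ in range(steps):
        xxtx = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * a - 0.5 * b for a, b in zip(ra, rb)] for ra, rb in zip(x, xxtx)]
    return x

G = [[3.0, 1.0], [0.0, 2.0]]        # stand-in for a 2-D gradient
Q = orthogonalize(G)
print(matmul(Q, transpose(Q)))      # close to the identity matrix
```

The iteration drives every singular value toward 1 while leaving the singular vectors unchanged, so the update keeps the gradient's directions but discards its scale.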
Zeroth-Order Optimizers
Zeroth-order (gradient-free) methods train models using only function evaluations, estimating gradients from randomized perturbations of the parameters instead of backpropagation. Because they need no backward pass or activation storage, they run at roughly inference-level memory, which has made them a practical option for fine-tuning large language models on constrained hardware. The lineage runs from SPSA in classical stochastic approximation to recent variance-reduced and low-rank variants built on MeZO.
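A minimal SPSA-style step shows how two loss evaluations stand in for a backward pass. This is a sketch on an assumed toy quadratic with fixed step sizes, not MeZO or any Canon implementation; classical SPSA also decays `lr` and `c` over time.

```python
import random

def loss(w):
    return sum(x * x for x in w)

def spsa_step(w, lr=0.05, c=1e-3, rng=random):
    """One SPSA update using two loss evaluations and no backward pass."""
    delta = [rng.choice((-1.0, 1.0)) for _ in w]          # random +/-1 direction
    plus = loss([x + c * d for x, d in zip(w, delta)])
    minus = loss([x - c * d for x, d in zip(w, delta)])
    scale = (plus - minus) / (2.0 * c)                     # directional derivative
    # Dividing by delta_i equals multiplying by it, because delta_i is +/-1.
    return [x - lr * scale * d for x, d in zip(w, delta)]

rng = random.Random(0)
w = [1.0, -2.0, 3.5]
start = loss(w)
for _ in range(500):
    w = spsa_step(w, rng=rng)
print(start, loss(w))
```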
Privacy-Preserving Optimizers
Privacy-preserving optimizers train models under differential privacy, typically by clipping per-sample gradients and adding calibrated noise to updates. This page lists differentially private optimization methods and reference libraries, from the original DP-SGD to later variants that reduce clipping bias, correct moment estimates, or filter privacy noise.
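The clip-then-noise recipe can be sketched in a few lines. The example below follows the usual DP-SGD shape (clip each per-sample gradient, sum, add Gaussian noise scaled to the clip bound, average); the parameter names are assumptions, and it performs no privacy accounting, so it says nothing about the resulting privacy budget.

```python
import math
import random

def dp_sgd_step(w, per_sample_grads, lr=0.1, clip=1.0, noise_multiplier=1.0, rng=random):
    """One DP-SGD update: clip each sample's gradient, sum, add noise, average."""
    total = [0.0] * len(w)
    for g in per_sample_grads:
        norm = math.sqrt(sum(x * x for x in g))
        factor = min(1.0, clip / (norm + 1e-12))    # scale down only if norm > clip
        total = [t + factor * x for t, x in zip(total, g)]
    sigma = noise_multiplier * clip                  # noise std tied to the clip bound
    noisy = [t + rng.gauss(0.0, sigma) for t in total]
    batch = len(per_sample_grads)
    return [x - lr * n / batch for x, n in zip(w, noisy)]

rng = random.Random(0)
sample_grads = [[3.0, 4.0], [0.1, 0.0], [-6.0, 8.0]]   # norms 5, 0.1, 10
print(dp_sgd_step([0.0, 0.0], sample_grads, rng=rng))
```

Clipping bounds how much any single example can move the parameters, which is what lets the added noise mask an individual's contribution.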
Sharpness-Aware Optimizers
Sharpness-aware methods seek parameters that lie in neighborhoods with uniformly low loss rather than at isolated minima, which tends to improve generalization. Introduced by SAM (Foret et al., ICLR 2021), these methods wrap a base optimizer such as SGD or AdamW and add a gradient ascent perturbation step before the descent update. Later work makes the perturbation scale-invariant, closes the surrogate gap, reweights the sharpness term, amortizes the extra forward-backward cost, or extends the idea to second-order optimization.
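The ascent-then-descent pattern can be sketched with plain gradient descent as the base optimizer. This is a toy illustration on an assumed quadratic loss, not the `zij.SAM` class; it only shows why each update needs two gradient evaluations.

```python
import math

def grad(w):
    """Gradient of the toy loss f(w) = 0.5 * (10 * w0^2 + w1^2)."""
    return [10.0 * w[0], w[1]]

def sam_step(w, lr=0.05, rho=0.05):
    """One SAM update with plain gradient descent as the base optimizer."""
    g = grad(w)
    norm = math.sqrt(sum(x * x for x in g)) + 1e-12
    w_adv = [x + rho * gx / norm for x, gx in zip(w, g)]   # ascent step to the perturbed point
    g_adv = grad(w_adv)                                    # second gradient evaluation
    return [x - lr * gx for x, gx in zip(w, g_adv)]        # descend from the original point

w = [1.0, 1.0]
for _ in range(200):
    w = sam_step(w)
print(w)
```

With a fixed `rho` the iterate settles in a small neighborhood of the minimum rather than exactly on it, since the gradient is always taken at a point `rho` away.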
Quantum and Quantum-Inspired Optimizers
This page collects optimizers from two adjacent settings. The first is the optimization of variational quantum circuits, where shot noise and the quantum geometry of the parameter space drive the design of measurement-frugal, gradient-free, and natural-gradient methods. The second is quantum-inspired and quantum-hardware optimization of classical neural networks, where quantum fluctuations, adiabatic evolution, or annealer sampling replace or augment the classical training loop.
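Shot noise is easy to see in a classical simulation. The sketch below assumes the simplest possible circuit, a single qubit rotated by RY(theta) and measured in the Z basis, whose exact expectation is cos(theta); the estimate from a finite number of shots fluctuates around that value, which is why measurement-frugal optimizers matter.

```python
import math
import random

def estimate_expectation(theta, shots, rng):
    """Estimate <Z> after RY(theta) on |0> from a finite number of measurements."""
    p_plus = (1.0 + math.cos(theta)) / 2.0          # probability of outcome +1
    outcomes = (1.0 if rng.random() < p_plus else -1.0 for _ in range(shots))
    return sum(outcomes) / shots

rng = random.Random(0)
theta = 1.0
exact = math.cos(theta)
for shots in (100, 10_000):
    print(shots, estimate_expectation(theta, shots, rng), exact)
```

The standard error shrinks only as one over the square root of the shot count, so halving the noise costs four times as many circuit executions.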
Learning-Rate-Free Optimizers
Learning-rate-free (also called parameter-free or tuning-free) optimizers select their step size automatically during training instead of requiring a manually tuned learning rate. Most methods in this family estimate a quantity such as the distance from the initial point to the solution and set the effective step size from observed gradients, while others wrap an existing base optimizer and tune its global scale factor online. The goal is to match the performance of a well-tuned baseline without a learning-rate search.
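One simple way to turn "distance travelled" and "observed gradients" into a step size is the distance-over-gradients rule, sketched here for a scalar problem. It is written in the style of DoG and is not the Prodigy or D-Adaptation update; the seed constant `alpha` and the toy objective are assumptions.

```python
import math

def dog_minimize(grad, x0, steps=300, alpha=1e-4):
    """Scalar gradient descent with a DoG-style step size and no learning rate."""
    x = x0
    max_dist = alpha * (1.0 + abs(x0))    # small seed distance so the first step is nonzero
    grad_sq_sum = 0.0
    for _ in range(steps):
        g = grad(x)
        grad_sq_sum += g * g
        if grad_sq_sum == 0.0:
            break                          # already at a stationary point
        eta = max_dist / math.sqrt(grad_sq_sum)
        x -= eta * g
        max_dist = max(max_dist, abs(x - x0))
    return x

# Minimize f(x) = 0.5 * (x - 3)^2 starting from x = 0.
x_final = dog_minimize(lambda x: x - 3.0, 0.0)
print(x_final)
```

The step size starts tiny, grows as the iterate moves away from its starting point, and then shrinks as gradients accumulate, so no learning rate is supplied by the caller.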
Learning Rate Schedulers
zij.core.lr_scheduler vendors the PyTorch core learning rate schedulers under their original class names. The first table lists every vendored class, including the LRScheduler base class, with the published work it derives from where one exists. The second table covers notable schedules from the literature that zij does not yet implement.
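As an example of what such a scheduler computes, the sketch below evaluates the closed-form cosine annealing curve (the schedule PyTorch exposes as `CosineAnnealingLR`). It is a standalone function for illustration, not the vendored class, and the argument names are assumptions.

```python
import math

def cosine_annealing_lr(step, total_steps, base_lr, eta_min=0.0):
    """Closed-form cosine decay from base_lr at step 0 to eta_min at total_steps."""
    progress = min(step, total_steps) / total_steps
    return eta_min + 0.5 * (base_lr - eta_min) * (1.0 + math.cos(math.pi * progress))

for step in (0, 25, 50, 75, 100):
    print(step, round(cosine_annealing_lr(step, 100, base_lr=3e-4), 8))
```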
Schedule-Free is not a schedule on top of an optimizer but a replacement for scheduling, achieved through online iterate averaging inside the optimizer; see the learning-rate-free optimizers.
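The averaging can be sketched for a scalar problem. The code below follows the Schedule-Free SGD recursion as commonly stated (a base iterate `z`, a uniform running average `x`, and an interpolated point `y` where gradients are taken), up to indexing conventions; it is a toy illustration, not the zij class. It also shows why train and eval modes exist: training evaluates gradients at `y`, while evaluation should use `x`.

```python
def schedule_free_sgd(grad, x0, lr=0.1, beta=0.9, steps=500):
    """Scalar Schedule-Free SGD; returns (y, x) = (training point, evaluation point)."""
    z = x = y = x0
    for t in range(1, steps + 1):
        y = (1.0 - beta) * z + beta * x     # gradients are taken here ("train" mode)
        z = z - lr * grad(y)                # plain SGD step on the base iterate
        c = 1.0 / (t + 1)
        x = (1.0 - c) * x + c * z           # uniform running average ("eval" mode)
    return y, x

# Minimize f(w) = 0.5 * (w - 3)^2 starting from w = 0.
y_train, x_eval = schedule_free_sgd(lambda w: w - 3.0, 0.0)
print(y_train, x_eval)
```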
Weight averaging is available separately in zij.core.swa_utils, which provides stochastic weight averaging and exponential moving average utilities (AveragedModel, SWALR, update_bn, and the SWA/EMA averaging functions), following Averaging Weights Leads to Wider Optima and Better Generalization (Izmailov et al., UAI 2018).
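The two averaging rules differ only in how they weight history. The sketch below shows both on plain lists of floats; it illustrates the arithmetic and is not the `AveragedModel` implementation. The function names are hypothetical.

```python
def swa_update(avg, new, num_averaged):
    """Equal-weight running mean, as used in stochastic weight averaging."""
    return [a + (w - a) / (num_averaged + 1) for a, w in zip(avg, new)]

def ema_update(avg, new, decay=0.999):
    """Exponential moving average that favors recent weights."""
    return [decay * a + (1.0 - decay) * w for a, w in zip(avg, new)]

checkpoints = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
swa = checkpoints[0]
for n, w in enumerate(checkpoints[1:], start=1):
    swa = swa_update(swa, w, n)
print(swa)   # [2.0, 20.0]
```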
How zij compares
Two kinds of project cover this ground: curated awesome-lists, and installable
optimizer collections. zij (زِيج) is both.
| Capability | Awesome-lists | Library collections | zij |
| --- | --- | --- | --- |
| Curated reference of the whole field | Yes | — | Yes |
| Installable, tested implementations | — | Yes | Yes |
| Paper-only methods included | Yes | — | Yes |
| Update rule in standard notation | — | — | Yes |
| Per-file provenance (upstream, commit, license) | — | Partial | Yes |
| Dedicated fractional-order coverage | — | — | Yes |
| Dedicated quantum / quantum-inspired coverage | — | — | Yes |
Engineering standards
The Canon and the code are one project. Every Canon row links the paper and, where it exists, the implementation. Every implementation links back to its source and paper.
Provenance is explicit. Vendored files record their upstream repository, pinned commit, and license; THIRD_PARTY_NOTICES.md aggregates the attributions. Sources under GPL, non-commercial, or no license are not vendored and remain listed only.
Mathematics is explicit. Each update rule is written in standard notation. Where an official implementation diverges from its own paper, the docstring records what the code computes.
Everything is tested. Every registered optimizer has convergence and state-dict round-trip tests.
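The two test kinds named above can be sketched against a stand-in optimizer. `ToyMomentumSGD` is hypothetical and exists only so the example runs without dependencies; the actual zij test suite is not shown in this document and may be structured differently.

```python
import copy

class ToyMomentumSGD:
    """Hypothetical stand-in for a registered optimizer; scalar parameters only."""
    def __init__(self, params, lr=0.1, momentum=0.9):
        self.params, self.lr, self.momentum = params, lr, momentum
        self.buf = [0.0] * len(params)

    def step(self, grads):
        for i, g in enumerate(grads):
            self.buf[i] = self.momentum * self.buf[i] + g
            self.params[i] -= self.lr * self.buf[i]

    def state_dict(self):
        return {"lr": self.lr, "momentum": self.momentum, "buf": list(self.buf)}

    def load_state_dict(self, state):
        self.lr, self.momentum = state["lr"], state["momentum"]
        self.buf = list(state["buf"])

def grads_of(params):
    """Gradient of sum(p^2)."""
    return [2.0 * p for p in params]

def test_convergence():
    opt = ToyMomentumSGD([1.0, -2.0], lr=0.05)
    for _ in range(300):
        opt.step(grads_of(opt.params))
    assert sum(p * p for p in opt.params) < 1e-6

def test_state_dict_round_trip():
    a = ToyMomentumSGD([1.0, -2.0], lr=0.05)
    for _ in range(5):
        a.step(grads_of(a.params))
    # A fresh optimizer with different settings must match after loading state.
    b = ToyMomentumSGD(copy.deepcopy(a.params), lr=0.5, momentum=0.0)
    b.load_state_dict(a.state_dict())
    a.step(grads_of(a.params))
    b.step(grads_of(b.params))
    assert a.params == b.params   # resumed run matches the uninterrupted one

test_convergence()
test_state_dict_round_trip()
```

The round-trip test catches optimizers that keep state outside `state_dict`, which would otherwise silently change behavior after a checkpoint resume.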
Contributing
New implementations, Canon entries, and corrections are welcome. See CONTRIBUTING.md. A Canon correction counts as much as a code change.
Acknowledgments
kozistr/pytorch_optimizer: a comprehensive, maintained PyTorch optimizer collection, and a reference for several vendored implementations.
jettify/pytorch-optimizer: an early community optimizer collection and the source of several classic implementations.
timm: tested optimizer implementations and packaging conventions.
PyTorch: the torch.optim core that zij.core mirrors.
The optimizer authors: each method is someone's research. The canonical paper is cited in every Canon row and class docstring, and the original repository is credited per file in THIRD_PARTY_NOTICES.md.
Citation
If you use an optimizer from this library, cite two works: the original paper of
the algorithm (linked in its Canon row and docstring), and zij as the software
you ran. The paper credits the method; the software credits the implementation.
@software{raja_zij,
  author = {Raja, Muhammad Junaid Ali Asif},
  title  = {zij: A Canon and Library of Deep Learning Optimizers},
  year   = {2026},
  url    = {https://github.com/junaidaliop/zij}
}