junaidaliop/zij

zij  زِيج

Learning state-of-the-art deep learning optimization algorithms.

A zij (Arabic: زِيج, pronounced "zeej") is an astronomical handbook from the Islamic Golden Age: a set of tables and computational methods that astronomers consulted instead of re-deriving the field from scratch. The best known is the Zīj al-Sindhind of Muḥammad ibn Mūsā al-Khwārizmī (محمد بن موسى الخوارزمي, c. 820 CE). His Latinized name, Algoritmi, became the word "algorithm", and his book al-Jabr gave us "algebra".

This project takes the name in that spirit. It gathers the equation, the paper, and runnable code for the optimization methods used in machine learning.

Installation

pip install zij

From source, with the pinned environment:

git clone https://github.com/junaidaliop/zij.git
cd zij
conda env create -f environment.yml
conda activate zij-optim

Quick start

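import torch
import zij

# placeholder model so the calls below have parameters to optimize; substitute your own
model = torch.nn.Sequential(torch.nn.Linear(32, 64), torch.nn.ReLU(), torch.nn.Linear(64, 10))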

# torch.optim, vendored at tag v2.12.0
opt = zij.AdamW(model.parameters(), lr=3e-4)

# research optimizers, same interface
opt = zij.Muon([p for p in model.parameters() if p.ndim == 2], lr=2e-2)
opt = zij.Prodigy(model.parameters())                       # no learning rate to set
opt = zij.SAM(model.parameters(), base_optimizer=zij.SGD, lr=0.1, rho=0.05)

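# memory-efficient low-rank training (per-group rank)
# assumption for this example: only the 2-D weight matrices go into the projected group
params = [p for p in model.parameters() if p.ndim == 2]
opt = zij.GaLoreAdamW(
    [{"params": params, "rank": 128, "update_proj_gap": 200, "scale": 0.25, "proj_type": "std"}],
    lr=1e-2,
)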

# look up by name
zij.list_optimizers("adam*")
opt_cls = zij.load_optimizer("soap")

zij.optim mirrors torch.optim, so zij.optim.AdamW is the same class as zij.AdamW, and zij.optim.lr_scheduler is available. Use whichever import reads better in your code.

Note

A few families use a documented non-standard call protocol. Schedule-Free needs opt.train() and opt.eval(); the SAM family takes a closure or an explicit first_step / second_step pair; Adam-mini and LOMO are built from a model rather than a parameter list. Each class docstring states which.
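
To make the first_step / second_step idea concrete, here is a minimal sketch of the two-phase sharpness-aware update on a toy quadratic, written in plain Python. It illustrates the general SAM technique only. The function names, the toy loss, and the hyperparameter values are assumptions, and zij's own SAM class may organize the two phases differently.

```python
import math


def loss_and_grad(w):
    """Toy quadratic loss 0.5 * (4*w0^2 + w1^2) and its gradient."""
    scales = (4.0, 1.0)
    loss = 0.5 * sum(s * x * x for s, x in zip(scales, w))
    grad = [s * x for s, x in zip(scales, w)]
    return loss, grad


def sam_step(w, lr=0.1, rho=0.05):
    """One SAM update: probe a nearby worst-case point, then descend from the original weights."""
    # First step: perturb the weights along the normalized gradient, scaled by rho.
    _, g = loss_and_grad(w)
    norm = math.sqrt(sum(x * x for x in g)) + 1e-12  # guard against a zero gradient
    w_adv = [x + rho * gx / norm for x, gx in zip(w, g)]

    # Second step: evaluate the gradient at the perturbed point,
    # but apply it to the original (unperturbed) weights.
    _, g_adv = loss_and_grad(w_adv)
    return [x - lr * gx for x, gx in zip(w, g_adv)]


w = [1.0, -2.0]
start_loss, _ = loss_and_grad(w)
for _ in range(50):
    w = sam_step(w)
final_loss, _ = loss_and_grad(w)
print(start_loss, final_loss)
```

Each update needs two gradient evaluations, which is why a single plain step() call is not enough and the optimizer needs either a closure it can re-run or an explicit pair of calls.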

Library

The PyTorch package ships 106 ready-to-use optimizers. zij.core mirrors torch.optim at tag v2.12.0 (Adam, AdamW, SGD, Muon, LBFGS, Adafactor, and the rest, plus lr_scheduler and swa_utils). zij.contrib adds research methods grouped by family: first-order, second-order, memory-efficient, learning-rate-free, and sharpness-aware. In every Canon table below, the zij column names the class where an implementation exists; a dash (—) means the method is listed but not yet implemented (paper-only, or its source is under a license that cannot be vendored).

zij is a PyTorch library today. The Canon is framework-agnostic: it covers each method regardless of the framework of its original code. JAX and TensorFlow ports are planned and will follow the same standards.

Canon

The Canon below covers 740 methods across 11 categories. Each row records the canonical name, venue, paper, the best available implementation, and the zij class where one exists.

First-Order Optimizers

First-order optimizers update parameters using only gradients and accumulated gradient statistics such as momentum and second-moment estimates. This page covers the stochastic gradient descent lineage, the Adam family, and more recent sign-based and variance-reduced methods. The zij column gives the class name for optimizers already implemented in the package.
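
As a concrete instance of "gradients plus accumulated statistics", the sketch below applies the standard Adam update to a single scalar parameter in plain Python. The function name adam_step, the toy objective, and the learning rate are chosen for illustration and are not taken from zij.

```python
import math


def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter; returns (param, m, v)."""
    m = beta1 * m + (1.0 - beta1) * grad          # momentum: running mean of gradients
    v = beta2 * v + (1.0 - beta2) * grad * grad   # second moment: running mean of squared gradients
    m_hat = m / (1.0 - beta1 ** t)                # bias correction for the zero initialization
    v_hat = v / (1.0 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


# Minimize f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
x, m, v = 0.0, 0.0, 0.0
for t in range(1, 2001):
    grad = 2.0 * (x - 3.0)
    x, m, v = adam_step(x, grad, m, v, t, lr=0.05)
print(x)
```

The two running averages m and v are the per-parameter state that the memory-efficient methods later in the Canon try to shrink.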

| Optimizer | Venue | Paper | Code | zij |
| --- | --- | --- | --- | --- |
| ASGD | SIAM Journal on Control and Optimization 1992 | Acceleration of Stochastic Approximation by Averaging | community | ASGD |
| Rprop | ICNN 1993 | A direct adaptive method for faster backpropagation learning: the RPROP algorithm | community | Rprop |
| Adagrad | JMLR 2011 | Adaptive Subgradient Methods for Online Learning and Stochastic Optimization | community | Adagrad |
| Adadelta | arXiv 2012 | ADADELTA: An Adaptive Learning Rate Method | community | Adadelta |
| RMSprop | Lecture notes 2012 | Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude | community | RMSprop |
| FTRL | KDD 2013 | Ad Click Prediction: a View from the Trenches |  |  |
| SGD | ICML 2013 | On the importance of initialization and momentum in deep learning | community | SGD |
| Adam | ICLR 2015 | Adam: A Method for Stochastic Optimization | community | Adam |
| AdaMax | ICLR 2015 | Adam: A Method for Stochastic Optimization | community | Adamax |
| Nadam | ICLR Workshop 2016 | Incorporating Nesterov Momentum into Adam | community | NAdam |
| LARS | arXiv 2017 | Large Batch Training of Convolutional Networks | community | LARS |
| SWATS | arXiv 2017 | Improving Generalization Performance by Switching from Adam to SGD | community | SWATS |
| A2Grad | arXiv 2018 | Optimal Adaptive and Accelerated Stochastic Gradient Descent | community | A2GradUni, A2GradInc, A2GradExp |
| AccSGD | ICLR 2018 | On the insufficiency of existing momentum schemes for Stochastic Optimization | official | AccSGD |
| AMSGrad | ICLR 2018 | On the Convergence of Adam and Beyond | community |  |
| GADAM | arXiv 2018 | GADAM: Genetic-Evolutionary ADAM for Deep Neural Network Optimization |  |  |
| M-SVAG | ICML 2018 | Dissecting Adam: The Sign, Magnitude and Variance of Stochastic Gradients | official |  |
| PID | CVPR 2018 | A PID Controller Approach for Stochastic Optimization of Deep Networks | official | PID |
| VR-SGD | IEEE TKDE 2018 | VR-SGD: A Simple Stochastic Variance Reduction Method for Machine Learning |  |  |
| Yogi | NeurIPS 2018 | Adaptive Methods for Nonconvex Optimization | community | Yogi |
| AdaBound | ICLR 2019 | Adaptive Gradient Methods with Dynamic Bound of Learning Rate | official | AdaBound, AdaBoundW |
| AdaMod | arXiv 2019 | An Adaptive and Momental Bound Method for Stochastic Learning | official | AdaMod |
| AdamW | ICLR 2019 | Decoupled Weight Decay Regularization | official | AdamW |
| AdaShift | ICLR 2019 | AdaShift: Decorrelation and Convergence of Adaptive Learning Rate Methods | community | AdaShift |
| AggMo | ICLR 2019 | Aggregated Momentum: Stability Through Passive Damping | official | AggMo |
| AvaGrad | arXiv 2019 | Domain-independent Dominance of Adaptive Methods | official | AvaGrad |
| HAdam | NeurIPS Workshop 2019 | On Higher-order Moments in Adam |  |  |
| HyperAdam | AAAI 2019 | HyperAdam: A Learnable Task-Adaptive Adam for Network Training |  |  |
| Lookahead | NeurIPS 2019 | Lookahead Optimizer: k steps forward, 1 step back | community | Lookahead |
| NosAdam | IJCAI 2019 | Nostalgic Adam: Weighting more of the past gradients when designing the adaptive learning rate |  |  |
| NovoGrad | arXiv 2019 | Stochastic Gradient Methods with Layer-wise Adaptive Moments for Training of Deep Networks | community | NovoGrad |
| QHAdam / QHM | ICLR 2019 | Quasi-hyperbolic momentum and Adam for deep learning | official | QHAdam, QHM |
| Ranger |  | RAdam and Lookahead combination | official | Ranger |
| Sadam | arXiv 2019 | Calibrating the Adaptive Learning Rate to Improve Convergence of ADAM |  |  |
| AdaBelief | NeurIPS 2020 | AdaBelief Optimizer: Adapting Stepsizes by the Belief in Observed Gradients | official | AdaBelief |
| Adam+ | arXiv 2020 | Adam+: A Stochastic Method with Adaptive Variance Reduction |  |  |
| AdamBS | NeurIPS 2020 | Adam with Bandit Sampling for Deep Learning |  |  |
| AdaSGD | arXiv 2020 | AdaSGD: Bridging the gap between SGD and Adam |  |  |
| Cayley SGD | ICLR 2020 | Efficient Riemannian Optimization on the Stiefel Manifold via the Cayley Transform | official |  |
| clipped-SGD | NeurIPS 2020 | Stochastic Optimization with Heavy-Tailed Noise via Accelerated Gradient Clipping | official |  |
| DEAM | ASONAM 2020 | DEAM: Adaptive Momentum with Discriminative Weight for Stochastic Optimization |  |  |
| diffGrad | IEEE TNNLS 2020 | diffGrad: An Optimization Method for Convolutional Neural Networks | official | DiffGrad |
| EAdam | arXiv 2020 | EAdam Optimizer: How ε Impact Adam | official |  |
| Fromage | NeurIPS 2020 | On the distance between two neural networks and the stability of learning | official |  |
| Gradient Centralization (GC) | ECCV 2020 | Gradient Centralization: A New Optimization Technique for Deep Neural Networks | official |  |
| LAMB | ICLR 2020 | Large Batch Optimization for Deep Learning: Training BERT in 76 minutes | community | Lamb |
| LaProp | arXiv 2020 | LaProp: Separating Momentum and Adaptivity in Adam | official | LaProp |
| NIGT | ICML 2020 | Momentum Improves Normalized SGD | official |  |
| Padam | IJCAI 2020 | Closing the Generalization Gap of Adaptive Gradient Methods in Training Deep Neural Networks | official | PAdam |
| signSGD | ICML 2018 | signSGD: Compressed Optimisation for Non-Convex Problems | community | SignSGD |
| pbSGD | IJCAI 2020 | pbSGD: Powered Stochastic Gradient Descent Methods for Accelerated Non-Convex Optimization | official |  |
| PCGrad | NeurIPS 2020 | Gradient Surgery for Multi-Task Learning | official |  |
| RAdam | ICLR 2020 | On the Variance of the Adaptive Learning Rate and Beyond | official | RAdam |
| SGD-G2 | ICPR 2020 | Stochastic Runge-Kutta methods and adaptive SGD-G2 stochastic gradient descent |  |  |
| ACMo | AAAI 2021 | ACMo: Angle-Calibrated Moment Methods for Stochastic Optimization |  |  |
| ACProp | NeurIPS 2021 | Momentum Centering and Asynchronous Update for Adaptive Gradient Methods | official |  |
| AdaL | arXiv 2021 | AdaL: Adaptive Gradient Transformation Contributes to Convergences and Generalizations |  |  |
| AdamD | arXiv 2021 | AdamD: Improved bias-correction in Adam |  |  |
| AdamP | ICLR 2021 | AdamP: Slowing Down the Slowdown for Momentum Optimizers on Scale-invariant Weights | official | AdamP |
| Adaptive Gradient Clipping (AGC) | ICML 2021 | High-Performance Large-Scale Image Recognition Without Normalization | official |  |
| AngularGrad | arXiv 2021 | AngularGrad: A New Optimization Technique for Angular Convergence of Convolutional Neural Networks | official |  |
| BGADAM | IJCNN 2021 | BGADAM: Boosting based Genetic-Evolutionary ADAM for Neural Network Optimization |  |  |
| Gravity | arXiv 2021 | Gravity Optimizer: a Kinematic Approach on Optimization in Deep Learning | official | Gravity |
| MADGRAD | arXiv 2021 | Adaptivity without Compromise: A Momentumized, Adaptive, Dual Averaged Gradient Method for Stochastic Optimization | official | MADGRAD, MirrorMADGRAD |
| MaxVA | ECML PKDD 2021 | MaxVA: Fast Adaptation of Step Sizes by Maximizing Observed Variance of Gradients | official |  |
| Nero | ICML 2021 | Learning by Turning: Neural Architecture Aware Optimisation | official |  |
| PNM | ICML 2021 | Positive-Negative Momentum: Manipulating Stochastic Gradient Noise to Improve Generalization | official |  |
| AdaPNM | ICML 2021 | Positive-Negative Momentum: Manipulating Stochastic Gradient Noise to Improve Generalization | official | AdaPNM |
| Ranger21 | arXiv 2021 | Ranger21: a synergistic deep learning optimizer | official | Ranger21 |
| SGDP | ICLR 2021 | AdamP: Slowing Down the Slowdown for Momentum Optimizers on Scale-invariant Weights | official | SGDP |
| AdaFamily | arXiv 2022 | AdaFamily: A family of Adam-like adaptive gradient methods |  |  |
| Adai | ICML 2022 | Adaptive Inertia: Disentangling the Effects of Adaptive Learning Rate and Momentum | official | Adai |
| AdamMC | CVMI 2022 | Moment Centralization based Gradient Descent Optimizers for Convolutional Neural Networks |  |  |
| Adan | arXiv 2022 | Adan: Adaptive Nesterov Momentum Algorithm for Faster Optimizing Deep Models | official | Adan |
| AdaSmooth | arXiv 2022 | AdaSmooth: An Adaptive Learning Rate Method based on Effective Ratio |  | AdaSmooth |
| AEGDM | Annals of Applied Mathematics 2022 | An Adaptive Gradient Method with Energy and Momentum | official |  |
| Amos | arXiv 2022 | Amos: An Adam-style Optimizer with Adaptive Weight Decay towards Model-Oriented Scale | official | Amos |
| GDA-AM | ICLR 2022 | GDA-AM: On the effectiveness of solving minimax optimization via Anderson Acceleration | official |  |
| KOALA | AAAI 2022 | KOALA: A Kalman Optimization Algorithm with Loss Adaptivity | official |  |
| RotoGrad | ICLR 2022 | RotoGrad: Gradient Homogenization in Multitask Learning | official |  |
| SRSGD | SIAM Journal on Imaging Sciences 2022 | Scheduled Restart Momentum for Accelerated Stochastic Gradient Descent |  |  |
| Step-Tuned SGD | Neural Processing Letters 2022 | Second-order step-size tuning of SGD for non-convex optimization |  |  |
| AdaInject | IEEE TAI 2023 | AdaInject: Injection Based Adaptive Gradient Descent Optimizers for Convolutional Neural Networks | official |  |
| AdaNorm | WACV 2023 | AdaNorm: Adaptive Gradient Norm Correction based Optimizer for CNNs | official | AdaNorm |
| AGD | NeurIPS 2023 | AGD: an Auto-switchable Optimizer using Stepwise Gradient Difference for Preconditioning Matrix |  |  |
| Aida | TMLR 2023 | A DNN Optimizer that Improves over AdaBelief by Suppression of the Adaptive Stepsize Range | official |  |
| Lion | NeurIPS 2023 | Symbolic Discovery of Optimization Algorithms | official | Lion |
| Lookaround | NeurIPS 2023 | Lookaround Optimizer: k steps around, 1 step average |  |  |
| MultiAdam | ICML 2023 | MultiAdam: Parameter-wise Scale-invariant Optimizer for Multiscale Training of Physics-informed Neural Networks |  |  |
| RLEKF | AAAI 2023 | RLEKF: An Optimizer for Deep Potential with Ab Initio Accuracy |  |  |
| Scheduled Weight Decay (SWD) | NeurIPS 2023 | On the Overlooked Pitfalls of Weight Decay and How to Mitigate Them: A Gradient-Norm Perspective | official |  |
| SGDF | arXiv 2023 | Signal Processing Meets SGD: From Momentum to Filter |  |  |
| StableAdamW | NeurIPS 2023 | Stable and low-precision training for large-scale vision-language models | community | StableAdamW |
| AdaAct | ICDMW 2024 | An Adaptive Method Stabilizing Activations for Enhanced Generalization |  |  |
| Adam-atan2 | ICML 2024 | Scaling Exponents Across Parameterizations and Optimizers | community | AdamAtan2 |
| Adam-Rel | NeurIPS 2024 | Adam on Local Time: Addressing Nonstationarity in RL with Relative Adam Timesteps |  |  |
| AdEMAMix | arXiv 2024 | The AdEMAMix Optimizer: Better, Faster, Older | official | AdEMAMix |
| ADOPT | NeurIPS 2024 | ADOPT: Modified Adam Can Converge with Any β₂ with the Optimal Rate | official | ADOPT |
| AGS-GD | arXiv 2024 | Anisotropic Gaussian Smoothing for Gradient-based Optimization |  |  |
| BADM | arXiv 2024 | BADM: Batch ADMM for Deep Learning |  |  |
| CaAdam | arXiv 2024 | CaAdam: Improving Adam optimizer using connection aware methods | official |  |
| CAdam | arXiv 2024 | CAdam: Confidence-Based Optimization for Online Learning |  |  |
| Cautious Optimizers | arXiv 2024 | Cautious Optimizers: Improving Training with One Line of Code | official |  |
| EXAdam | arXiv 2024 | EXAdam: The Power of Adaptive Cross-Moments | official | EXAdam |
| FAdam | arXiv 2024 | FAdam: Adam is a natural gradient optimizer using diagonal empirical Fisher information | community | FAdam |
| GrokAdamW |  | AdamW variant with Grokfast-style gradient amplification | official | GrokAdamW |
| Grokfast | arXiv 2024 | Grokfast: Accelerated Grokking by Amplifying Slow Gradients | official |  |
| INNAprop | arXiv 2024 | A second-order-like optimizer with adaptive gradient scaling for deep learning | official |  |
| KATE | NeurIPS 2024 | Remove that Square Root: A New Efficient Scale-Invariant Version of AdaGrad | official |  |
| MADA | ICML 2024 | MADA: Meta-Adaptive Optimizers through hyper-gradient Descent |  |  |
| RSGDM | CCSB 2024 | Reducing Bias in Deep Learning Optimization: The RSGDM Approach |  |  |
| SET-Adam | ECML PKDD 2024 | On Suppressing Range of Adaptive Stepsizes of Adam to Improve Generalisation Performance |  |  |
| SNGM | Science China Information Sciences 2024 | Stochastic Normalized Gradient Descent with Momentum for Large-Batch Training |  |  |
| SRMM | JMLR 2024 | Stochastic regularized majorization-minimization with weakly convex and multi-convex surrogates | official |  |
| TAM | arXiv 2024 | Torque-Aware Momentum |  |  |
| WarpAdam | arXiv 2024 | WarpAdam: A new Adam optimizer based on Meta-Learning approach |  |  |
| AbsSADMM | arXiv 2025 | Stochastic ADMM with batch size adaptation for nonconvex nonsmooth optimization |  |  |
| AdamC | arXiv 2025 | Why Gradients Rapidly Increase Near the End of Training |  |  |
| AdamNX | arXiv 2025 | AdamNX: An Adam improvement algorithm based on a novel exponential decay mechanism for the second-order moment estimate | official |  |
| AdamS | EMNLP 2025 | AdamS: Momentum Itself Can Be A Normalizer for LLM Pretraining and Post-training |  |  |
| adaNAPG | arXiv 2025 | Boosting Accelerated Proximal Gradient Method with Adaptive Sampling for Stochastic Composite Optimization |  |  |
| Ano | arXiv 2025 | ANO : Faster is Better in Noisy Landscape | official |  |
| BCOS | arXiv 2025 | Stochastic Approximation with Block Coordinate Optimal Stepsizes | official |  |
| Cautious Weight Decay | arXiv 2025 | Cautious Weight Decay | community |  |
| Conda | arXiv 2025 | Conda: Column-Normalized Adam for Training Large Language Models Faster | official |  |
| Coupled Adam | ACL 2025 | Better Embeddings with Coupled Adam |  |  |
| DecGD | Machine Learning 2025 | A New Adaptive Gradient Method with Gradient Decomposition |  |  |
| DEO | arXiv 2025 | Dimer-Enhanced Optimization: A First-Order Approach to Escaping Saddle Points in Neural Network Training | official |  |
| EmoNavi |  | An emotion-driven optimizer that feels loss and navigates accordingly | official |  |
| MARS | ICML 2025 | MARS: Unleashing the Power of Variance Reduction for Training Large Models | official | MARS |
| FOCUS | arXiv 2025 | FOCUS: First Order Concentrated Updating Scheme | official | FOCUS |
| FSGDM | ICLR 2025 | On the Performance Analysis of Momentum Method: A Frequency Domain Perspective |  |  |
| Grams | ICLR Workshop 2025 | Grams: Gradient Descent with Adaptive Momentum Scaling | official | Grams |
| HGM | arXiv 2025 | Hindsight-Guided Momentum (HGM) Optimizer: An Approach to Adaptive Learning Rate |  |  |
| HVAdam | AAAI 2025 | HVAdam: A Full-Dimension Adaptive Optimizer |  |  |
| KO | arXiv 2025 | KO: Kinetics-inspired Neural Optimizer with PDE Simulation Approaches |  |  |
| KOALA++ | NeurIPS 2025 | KOALA++: Efficient Kalman-Based Optimization with Gradient-Covariance Products |  |  |
| Kourkoutas-Beta | arXiv 2025 | Kourkoutas-Beta: A Sunspike-Driven Adam Optimizer with Desert Flair | official | KourkoutasSoftmaxFlex |
| MIAdam | AAAI 2025 | A Method for Enhancing Generalization of Adam by Multiple Integrations | official |  |
| μ²-SGD | ICLR 2025 | Do Stochastic, Feel Noiseless: Stable Stochastic Optimization via a Double Momentum Mechanism |  |  |
| ⊥Grad (OrthoGrad) | ICLR 2025 | Grokking at the Edge of Numerical Stability | official |  |
| Overshoot | arXiv 2025 | Overshoot: Taking advantage of future gradients in momentum-based stochastic optimization | official |  |
| PadamP | arXiv 2025 | Adaptive Moment Estimation Optimization Algorithm Using Projection Gradient for Deep Learning |  |  |
| Simplified-AdEMAMix | arXiv 2025 | Connections between Schedule-Free Optimizers, AdEMAMix, and Accelerated SGD Variants | official |  |
| LyAm | arXiv 2025 | LyAm: Robust Non-Convex Optimization for Stable Learning in Noisy Environments |  |  |
| NIRMAL | arXiv 2025 | Comparative Analysis of Novel NIRMAL Optimizer Against Adam and SGD with Momentum |  |  |
| SCSAdamW | arXiv 2025 | Beyond First-Order: Training LLMs with Stochastic Conjugate Subgradients and AdamW | official |  |
| SKA-SGD | arXiv 2025 | Streaming Krylov-Accelerated Stochastic Gradient Descent |  |  |
| SoftSignSGD (S3) | arXiv 2025 | SoftSignSGD(S3): An Enhanced Optimizer for Practical DNN Training and Loss Spikes Minimization Beyond Adam |  |  |
| SPAM | arXiv 2025 | SPAM: Spike-Aware Adam with Momentum Reset for Stable LLM Training | official |  |
| VSGD | TMLR 2025 | Variational Stochastic Gradient Descent for Deep Neural Networks | official |  |
| ZetA | arXiv 2025 | ZetA: A Riemann Zeta-Scaled Extension of Adam for Deep Learning |  |  |
| AdaGC | ICML 2026 | AdaGC: Enhancing LLM Pretraining Stability via Adaptive Gradient Clipping |  | AdaGC |
| Anon | arXiv 2026 | Anon: Extrapolating Adaptivity Beyond SGD and Adam |  |  |
| C-Adam | arXiv 2026 | A Theoretical and Experimental Study of a Novel Adaptive Learning Algorithm |  |  |
| DualAdam | arXiv 2026 | Combining Adam and its Inverse Counterpart to Enhance Generalization of Deep Learning Optimizers | official |  |
| FANoS | arXiv 2026 | FANoS-v2: Feedback-Controlled Momentum with Thermostat Damping for Lightweight Neural Optimization | official |  |
| GradPower | ICML 2026 | GradPower: Powering Gradients for Faster Language Model Pre-Training |  |  |
| HomeAdam | arXiv 2026 | HomeAdam: Adam and AdamW Algorithms Sometimes Go Home to Obtain Better Provable Generalization |  |  |
| NOVAK | arXiv 2026 | NOVAK: Unified adaptive optimizer for deep neural networks |  |  |
| PS-Clip-SGD | arXiv 2026 | Robust and Fast Training via Per-Sample Clipping |  |  |
| SparseOpt | ICML 2026 | SparseOpt: Addressing Normalization-induced Gradient Skew in Sparse Training |  |  |
| Stable-SPAM / GradientStabilizer | ICML 2026 | GradientStabilizer: Fix the Norm, Not the Gradient | official |  |
| VRAdam | ICLR 2026 | A Physics-Inspired Optimizer: Velocity Regularized Adam | official |  |
| SparseAdam |  | Adam variant for sparse gradients | official | SparseAdam |
| OptMuon | arXiv 2026 | OptMuon: Closed-Loop Orthogonalized Momentum Methods for Stochastic Optimization with Zero-Noise Optimality |  |  |
| FOGO | arXiv 2026 | FOGO: Forgetting-aware Orthogonalization Optimizer |  |  |
| AdamO | ICML 2026 | Preserving Plasticity in Continual Learning via Dynamical Isometry |  |  |
| MAdam | arXiv 2026 | MAdam: Metric-Aware Multi-Objective Adam |  |  |
| MuCon | arXiv 2026 | MuCon: Clipped Muon Updates for LLM Training |  |  |
| NuMuon | arXiv 2026 | NuMuon: Nuclear-Norm-Constrained Muon for Compressible LLM Training |  |  |
| MiMuon | arXiv 2026 | MiMuon: Mixed Muon Optimizer with Improved Generalization for Large Models |  |  |
| Pion | arXiv preprint (cs.LG, stat.ML) 2026 | Pion: A Spectrum-Preserving Optimizer via Orthogonal Equivalence Transformation | official |  |
| iMuon (Intrinsic Muon) | arXiv 2026 | Intrinsic Muon: Spectral Optimization on Riemannian Matrix Manifolds | official |  |
| Muon-OGD | arXiv 2026 | Muon-OGD: Muon-based Spectral Orthogonal Gradient Projection for LLM Continual Learning |  |  |
| Newton-Muon | arXiv 2026 | The Newton-Muon Optimizer | official |  |
| MuonEq | arXiv 2026 | MuonEq: Balancing Before Orthogonalization with Lightweight Equilibration | official |  |
| RMNP | arXiv 2026 | RMNP: Row-Momentum Normalized Preconditioning for Scalable Matrix-Based Optimization | official |  |
| MUD | arXiv preprint 2026 | Beyond Muon: MUD (MomentUm Decorrelation) for Faster Transformer Training |  |  |
| NAMO | arXiv 2026 | Adam Improves Muon: Adaptive Moment Estimation with Orthogonalized Momentum | official |  |
| SpecMuon | arXiv 2026 | Muon with Spectral Guidance: Efficient Optimization for Scientific Machine Learning |  |  |
| ARO | arXiv 2026 | ARO: A New Lens On Matrix Optimization For Large Models |  |  |
| PRISM | arXiv 2026 | PRISM: Structured Optimization via Anisotropic Spectral Shaping |  |  |
| MCSD / SPEL | arXiv 2026 | Manifold constrained steepest descent |  |  |
| Variance-Adaptive Muon (Muon-NSR / Muon-VS) | arXiv 2026 | Variance-Adaptive Muon: Accelerating LLM Pretraining with NSR-Modulated and Variance-Scaled Momentum |  |  |
| MuonAll | arXiv 2025 | MuonAll: Muon Variant for Efficient Finetuning of Large Language Models | official |  |
| Gluon | arXiv 2025 (also accepted at ICML 2025 HiLD workshop) | Gluon: Making Muon & Scion Great Again! (Bridging Theory and Practice of LMO-based Optimizers for LLMs) |  |  |
| LPSGD / LPSGDM | arXiv 2026 | Beyond L2-norm and L-infinity-norm: A Curvature-Inspired $\ell_p$-Norm Scheme for Deep Neural Networks |  |  |
| ABSignSGD | ICLR 2026 | Arbitrary-Order Block SignSGD for Memory-Efficient LLM Fine-Tuning |  |  |
| StoSignSGD | arXiv 2026 | StoSignSGD: Unbiased Structural Stochasticity Fixes SignSGD for Training Large Language Models |  |  |
| Hybrid SignSGD-SGD switching | arXiv 2026 | Enhancing SignSGD: Small-Batch Convergence Analysis and a Hybrid Switching Strategy |  |  |
| SoftSignum / SoftMuon | ICML 2026 | Softsign: Smooth Sign in Your Optimizer For Better Parameter Heterogeneity Handling | official |  |
| Accelerated SignGD | arXiv 2025 | Norm-Constrained Flows and Sign-Based Optimization: Theory and Algorithms |  |  |
| CLion | arXiv 2026 | CLion: Efficient Cautious Lion Optimizer with Enhanced Generalization |  |  |
| OLion | arXiv 2026 | OLion: Approaching the Hadamard Ideal by Intersecting Spectral and $\ell_\infty$ Implicit Biases | official |  |
| MGUP | NeurIPS 2025 | MGUP: A Momentum-Gradient Alignment Update Policy for Stochastic Optimization | official |  |
| Magma | arXiv 2026 | On Surprising Effectiveness of Masking Updates in Adaptive Optimizers |  |  |
| AGGC | ACL 2026 | AGGC: Adaptive Group Gradient Clipping for Stabilizing Large Language Model Training | official |  |
| Clipped Scion | NeurIPS 2025 | Generalized Gradient Norm Clipping & Non-Euclidean $(L_0, L_1)$-Smoothness | official |  |
| SPECTRA | ICML 2026 | Enhancing LLM Training via Spectral Clipping | official |  |
| Spectral Clipping (matrix-valued) | arXiv 2026 | Gradient Clipping Beyond Vector Norms: A Spectral Approach for Matrix-Valued Parameters |  |  |
| SPAMP | ACM Multimedia Asia 2025 (7th ACM International Conference on Multimedia in Asia) | Gradient Shaping Beyond Clipping: A Functional Perspective on Update Magnitude Control |  |  |
| NucGD | arXiv 2026 | Towards The Implicit Bias on Multiclass Separable Data Under Norm Constraints | official |  |
| Batched / Transported Scion | arXiv 2026 | Scale-Invariant Neural Network Optimization: Norm Geometry and Heavy-Tailed Noise |  |  |
| EMA bias-corrected iterate averaging | NeurIPS 2025 Workshop (OPT 2025) | EMA Without the Lag: Bias-Corrected Iterate Averaging Schemes |  |  |
| RGrad-Avg | OPT 2025 (17th Annual Workshop on Optimization for Machine Learning, co-located with NeurIPS 2025) | On Riemannian Gradient Descent Algorithm using gradient averaging |  |  |
| SGD with adaptive preconditioning | ICLR 2026 | SGD with Adaptive Preconditioning: Unified Analysis and Momentum Acceleration |  |  |
| HTMuon | arXiv 2026 | HTMuon: Improving Muon via Heavy-Tailed Spectral Correction | official |  |
| MARS-M | arXiv 2025 | MARS-M: When Variance Reduction Meets Matrices | official |  |
| Drop-Muon | arXiv 2025 | Drop-Muon: Update Less, Converge Faster |  |  |
| Muon+ | arXiv 2026 | MUON+: Towards More Effective Muon via One Additional Normalization Step for LLM Pre-training | official |  |
| TrasMuon | ICLR 2026 Workshop Sci4DL | TrasMuon: Trust-Region Adaptive Scaling for Orthogonalized Momentum Optimizers |  |  |
| Adam-SHANG | arXiv 2026 | Adam-SHANG: A Convergent Adam-Type Method for Stochastic Smooth Convex Optimization |  |  |
| EMA-Nesterov | arXiv 2026 | EMA-Nesterov: Stabilizing Nesterov's Lookahead for Accelerated Deep Learning Optimization |  |  |
| S-Adam | arXiv 2026 | Singularity-aware Optimization via Randomized Geometric Probing: Towards Stable Non-smooth Optimization |  |  |
| IAdaPID-ADG | arXiv 2026 | An Improved Adaptive PID Optimizer with Enhanced Convergence and Stability for Deep Learning |  |  |
| CT-AGD | arXiv 2026 | Accelerated Gradient Descent for Faster Convergence with Minimal Overhead |  |  |
| GPA (Generalized Primal Averaging) | arXiv 2025 | Smoothing DiLoCo with Primal Averaging for Faster Training of LLMs | official |  |
| SNOO | arXiv 2025 | SNOO: Step-K Nesterov Outer Optimizer - The Surprising Effectiveness of Nesterov Momentum Applied to Pseudo-Gradients | official |  |
| Riemannion | ICLR 2026 | LoRA meets Riemannion: Muon Optimizer for Parametrization-independent Low-Rank Adapters |  |  |
| Optimal Projection-Free Adaptive SGD | arXiv 2026 | Optimal Projection-Free Adaptive SGD for Matrix Optimization |  |  |
| AdamCB | ICLR 2025 | ADAM Optimization with Adaptive Batch Selection |  |  |
| Kalman-Adam | Knowledge-Based Systems 2026 | Kalman-Adam: Optimal bayesian moment estimation for memory-Efficient and generalizable deep learning |  |  |
| AdamHD (AdamHuberDecay) | NeurIPS 2025 Workshop (ScaleOpt: GPU-Accelerated and Scalable Optimization) | AdamHD: Decoupled Huber Decay Regularization for Language Model Pre-Training |  |  |
| MVN-Grad | arXiv 2026 | Adaptive Optimization via Momentum on Variance-Normalized Gradients |  |  |
| Compositional Muon (CM) | Tilde Research blog 2026 | Towards Compositional Steepest Descent | official |  |

Memory-Efficient Optimizers

Memory-efficient optimizers reduce the optimizer-state memory that dominates large-model training budgets, where Adam-style methods store two extra full-precision values per parameter. The methods below cover factored second moments, 8-bit and 4-bit state quantization, low-rank gradient projection, block-coordinate updates, zeroth-order gradient estimates, and stateless update rules.
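
The arithmetic behind these savings is easy to check. The sketch below counts optimizer-state values for plain Adam, for a factored second moment, and for a low-rank projected state. It is a rough accounting under stated assumptions, not a measurement of any zij class: the 7B model size, the 4096 x 4096 matrix, the rank of 128, and the helper names are all illustrative, and the low-rank count assumes a single left projector of shape (rows, rank).

```python
def adam_state_bytes(num_params, bytes_per_value=4):
    """Adam keeps two extra values (first and second moment) per parameter."""
    return 2 * num_params * bytes_per_value


def factored_second_moment_values(rows, cols):
    """A factored second moment keeps one row vector and one column vector
    instead of rows * cols entries."""
    return rows + cols


def low_rank_state_values(rows, cols, rank):
    """Two moments stored in a projected (rank, cols) space, plus the
    (rows, rank) projector itself."""
    return 2 * rank * cols + rows * rank


num_params = 7_000_000_000           # illustrative 7B-parameter model
print(adam_state_bytes(num_params) / 2**30)   # GiB of fp32 Adam state

rows, cols, rank = 4096, 4096, 128   # one illustrative weight matrix
print(rows * cols, factored_second_moment_values(rows, cols))
print(2 * rows * cols, low_rank_state_values(rows, cols, rank))
```

Quantized-state methods attack the same budget from the other side: they keep the value count and reduce bytes_per_value.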

Optimizer Venue Paper Code zij
Adafactor ICML 2018 Adafactor: Adaptive Learning Rates with Sublinear Memory Cost official Adafactor
SM3 NeurIPS 2019 Memory-Efficient Adaptive Optimization official SM3
8-bit Optimizers ICLR 2022 8-bit Optimizers via Block-wise Quantization official
tpSGD arXiv 2022 Learning with Local Gradients at the Edge
4-bit Optimizers NeurIPS 2023 Memory Efficient Optimizers with 4-bit States official
Adalite GitHub 2023 Adalite: a custom optimizer based on Adafactor and LAMB official
AdaLomo ACL 2024 Findings AdaLomo: Low-memory Optimization with Adaptive Learning Rate official AdaLomo
CAME ACL 2023 CAME: Confidence-guided Adaptive Memory Efficient Optimization official CAME
Lion NeurIPS 2023 Symbolic Discovery of Optimization Algorithms official
LOMO ACL 2024 Full Parameter Fine-tuning for Large Language Models with Limited Resources official Lomo
MeZO NeurIPS 2023 Fine-Tuning Language Models with Just Forward Passes official
Tiger GitHub 2023 Tiger: A Tight-fisted Optimizer official Tiger
4-bit Shampoo NeurIPS 2024 4-bit Shampoo for Memory-Efficient Network Training official
Adam-mini ICLR 2025 Adam-mini: Use Fewer Learning Rates To Gain More official AdamMini
Adapprox arXiv 2024 Adapprox: Adaptive Approximation in Adam Optimization via Randomized Low-Rank Matrices
AdaRankGrad ICLR 2025 AdaRankGrad: Adaptive Gradient-Rank and Moments for Memory-Efficient LLMs Training and Fine-Tuning
Addax ICLR 2025 Addax: Utilizing Zeroth-Order Gradients to Improve Memory Efficiency and Performance of SGD for Fine-Tuning Language Models official
APOLLO MLSys 2025 APOLLO: SGD-like Memory, AdamW-level Performance official APOLLO
BAdam NeurIPS 2024 BAdam: A Memory Efficient Full Parameter Optimization Method for Large Language Models official BlockOptimizer
COAP CVPR 2025 COAP: Memory-Efficient Training with Correlation-Aware Gradient Projection official
Fira NeurIPS 2025 Fira: Can We Achieve Full-rank Training of LLMs Under Low-rank Constraint? official FiraAdamW
Flora ICML 2024 Flora: Low-Rank Adapters Are Secretly Gradient Compressors official
FRUGAL ICML 2025 FRUGAL: Memory-Efficient Optimization by Reducing State Overhead for Scalable Training official
GaLore ICML 2024 GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection official GaLoreAdamW
GoLore ICML 2025 Subspace Optimization for Large Language Models with Convergence Guarantees official
GRASS EMNLP 2024 Grass: Compute Efficient Low-Memory LLM Training with Structured Sparse Gradients official
LDAdam ICLR 2025 LDAdam: Adaptive Optimization from Low-Dimensional Gradient Statistics official LDAdamW
LoQT NeurIPS 2024 LoQT: Low-Rank Adapters for Quantized Pretraining official
LoRA-RITE ICLR 2025 LoRA Done RITE: Robust Invariant Transformation Equilibration for LoRA Optimization official
MicroAdam NeurIPS 2024 MicroAdam: Accurate Adaptive Optimization with Low Space Overhead and Provable Convergence official
Muon Blog 2024 Muon: An optimizer for hidden layers in neural networks official Muon
Online Subspace Descent NeurIPS 2024 Memory-Efficient LLM Training with Online Subspace Descent official
Q-GaLore CPAL 2025 Q-GaLore: Quantized GaLore with INT4 Projection and Layer-Adaptive Low-Rank Gradients official
SGD-SaI arXiv 2024 No More Adam: Learning Rate Scaling at Initialization is All You Need official SGDSaI
SMMF AAAI 2025 SMMF: Square-Matricized Momentum Factorization for Memory-Efficient Optimization official
SNSM ICML 2025 Lean and Mean Adaptive Optimization via Subset-Norm and Subspace-Momentum with Convergence Guarantees official
SWAN ICML 2025 SWAN: SGD with Normalization and Whitening Enables Stateless LLM Training
AlphaGrad arXiv 2025 AlphaGrad: Non-Linear Gradient Normalization Optimizer
GWT arXiv 2025 GWT: Scalable Optimizer State Compression for Large Language Model Training
MLorc AISTATS 2026 MLorc: Momentum Low-rank Compression for Memory Efficient Large Language Model Adaptation official
MoFaSGD TMLR 2025 Low-rank Momentum Factorization for Memory Efficient Training official
RACS / Alice arXiv 2025 Towards Efficient Optimizer Design for LLM via Structured Fisher Approximation with a Low-Rank Extension community
SinkGD arXiv 2025 Gradient Multi-Normalization for Stateless and Scalable LLM Training
SPAM ICLR 2025 SPAM: Spike-Aware Adam with Momentum Reset for Stable LLM Training official SPAM
SubTrack++ NeurIPS 2025 SubTrack++ : Gradient Subspace Tracking for Scalable LLM Training official
SUMO NeurIPS 2025 SUMO: Subspace-Aware Moment-Orthogonalization for Accelerating Memory-Efficient LLM Training
TensorGRaD arXiv 2025 TensorGRaD: Tensor Gradient Robust Decomposition for Memory-Efficient Neural Operator Training
FlashOptim arXiv 2026 FlashOptim: Optimizers for Memory-Efficient Training official
Rose GitHub 2026 Rose: Range-Of-Slice Equilibration optimizer official
SAGE ACL 2026 Findings SAGE: Sign-Adaptive Gradient for Memory-Efficient LLM Optimization
BlockLLM arXiv 2024 BlockLLM: Memory-Efficient Adaptation of LLMs by Selecting and Optimizing the Right Coordinate Blocks official
Natural GaLore arXiv 2024 Natural GaLore: Accelerating GaLore for memory-efficient LLM Training and Fine-tuning official
SLTrain NeurIPS 2024 SLTrain: a sparse plus low-rank approach for parameter and memory efficient pretraining official
8-bit Muon arXiv 2025 Effective Quantization of Muon Optimizer States
FFT-based Subspace Selection ICLR 2026 FFT-based Dynamic Subspace Selection for Low-Rank Adaptive Optimization of Large Language Models official
FOAM arXiv 2025 FOAM: Blocked State Folding for Memory-Efficient LLM Training official
GaLore 2 arXiv 2025 GaLore 2: Large-Scale LLM Pre-Training by Gradient Low-Rank Projection
GradientStabilizer ICML 2026 GradientStabilizer: Fix the Norm, Not the Gradient official
GUM arXiv 2025 Unbiased Gradient Low-Rank Projection
I3S NeurIPS 2025 Breaking the Frozen Subspace: Importance Sampling for Low-Rank Optimization in LLM Pretraining
LORENZA TMLR 2026 LORENZA: Enhancing Generalization in Low-Rank Gradient LLM Training via Efficient Zeroth-Order Adaptive SAM
ProjFactor (VLoRP) arXiv 2025 Memory-Efficient LLM Training by Various-Grained Low-Rank Projection of Gradients
RSO arXiv 2025 A Memory Efficient Randomized Subspace Optimization Method for Training Large Language Models
SCALE ICML 2026 Memory-Efficient LLM Pretraining via Minimalist Optimizer Design
SlimAdam arXiv 2025 When Can You Get Away with Low Memory Adam? official
LoRA-Pre ICLR 2026 Taming Momentum: Rethinking Optimizer States Through Low-Rank Approximation official
Lotus arXiv 2026 Lotus: Efficient LLM Training by Randomized Low-Rank Gradient Projection with Adaptive Subspace Switching
POET-X ICML 2026 POET-X: Memory-efficient LLM Training by Scaling Orthogonal Transformation official
MuonQ arXiv 2026 MuonQ: Enhancing Low-Bit Muon Quantization via Directional Fidelity Optimization official
4-bit-Muon-GRASP ICLR 2026 Achieving low-bit Muon through subspace preservation and grid quantization official
IO-Adam OpenReview 2026 IO-Adam: Rethinking Memory-Efficient Adaptive Optimizers from Gradient Computation
H-Fac AISTATS 2025 Memory-Efficient Optimization with Factorized Hamiltonian Descent
LiMuon ICML 2026 LiMuon: Light and Fast Muon Optimizer for Large Models
M+Adam OPT 2025 (NeurIPS 2025 Workshop) M+Adam: Stable Low-Precision Training with Combined Adam–Madam Updates
SMET ICML 2026 Memory-Efficient LLM Training with Dynamic Sparsity: From Stability to Practical Scaling official
PowerStep arXiv 2026 PowerStep: Memory-Efficient Adaptive Optimization via ℓ_p-Norm Steepest Descent official
SRON OpenReview 2025 SRON: State-free LLM Training via Row-wise Gradient Normalization
GradLite arXiv 2025 Backward-Friendly Optimization: Training Large Language Models with Approximate Gradients under Memory Constraints
Optimal Low-Rank SGE arXiv 2026 Optimal low-rank stochastic gradient estimation for LLM training
Spectral Compact Training (SCT) arXiv 2026 Spectral Compact Training: Pre-Training Large Language Models via Permanent Truncated SVD and Stiefel QR Retraction official

Trainer integrations

Hugging Face transformers exposes many of these methods through the optim argument of TrainingArguments. Each string value below selects a memory-efficient optimizer. Adafactor is built into transformers; every other entry needs its backing library installed separately.

optim value Backing library
adafactor transformers ships its own Adafactor implementation with relative-step and update-clipping options (Apache-2.0).
adamw_bnb_8bit / adamw_8bit bitsandbytes AdamW with block-wise 8-bit quantized state (MIT).
paged_adamw_8bit / paged_adamw_32bit bitsandbytes paged AdamW; optimizer state is paged between GPU and CPU memory (MIT).
lion_8bit / lion_32bit / paged_lion_8bit / paged_lion_32bit bitsandbytes Lion, single momentum buffer, with 8-bit and paged variants (MIT).
ademamix_8bit / paged_ademamix_8bit / paged_ademamix_32bit bitsandbytes AdEMAMix with 8-bit quantized and paged state (MIT).
rmsprop_bnb_8bit bitsandbytes RMSprop with block-wise 8-bit quantized state (MIT).
adamw_torch_4bit / adamw_torch_8bit torchao pure-PyTorch AdamW with 4-bit or 8-bit optimizer states (BSD-3-Clause).
galore_adamw / galore_adamw_8bit / galore_adafactor and *_layerwise variants galore-torch, the official GaLore release (Apache-2.0).
apollo_adamw / apollo_adamw_layerwise apollo-torch, the official APOLLO release (CC-BY-NC-4.0).
lomo / adalomo lomo-optim, the official LOMO and AdaLomo release (MIT).
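As a rough illustration of the table above, the sketch below records which package a given optim string needs and then passes that string to TrainingArguments. The BACKING_PACKAGE dict and both helper functions are hypothetical conveniences, not part of transformers or zij. Only a few rows are transcribed, with package names taken from the table.

```python
# Hypothetical lookup transcribed from a few rows of the table above.
# None means the optimizer ships with transformers itself.
BACKING_PACKAGE = {
    "adafactor": None,
    "adamw_bnb_8bit": "bitsandbytes",
    "paged_adamw_8bit": "bitsandbytes",
    "lion_8bit": "bitsandbytes",
    "adamw_torch_4bit": "torchao",
    "galore_adamw": "galore-torch",
    "apollo_adamw": "apollo-torch",
    "lomo": "lomo-optim",
}


def install_hint(optim: str) -> str:
    """Return a pip command for the library backing `optim`, or a note if none is needed."""
    package = BACKING_PACKAGE[optim]
    if package is None:
        return "nothing extra to install"
    return f"pip install {package}"


def make_training_args(optim: str, output_dir: str = "out"):
    """Build TrainingArguments that select the optimizer by its string name."""
    # Imported lazily so the lookup above works without transformers installed.
    from transformers import TrainingArguments

    return TrainingArguments(output_dir=output_dir, optim=optim)


print(install_hint("galore_adamw"))
```

Some values take further arguments. GaLore, for example, is configured through optim_target_modules, so check the transformers documentation for the value you pick.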

Fractional-Order Optimizers

Fractional-order optimizers generalize the integer-order gradient step with fractional-calculus operators, most commonly the Caputo, Riemann-Liouville, or Grünwald-Letnikov derivative, which weight past gradient information through power-law memory kernels. The field is young: the first neural-network training results date to 2015, convergence theory is still being settled, and most papers ship no code.
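To make the power-law memory concrete, here is a minimal sketch of the Grünwald-Letnikov coefficients and of the truncated operator applied to a short sequence of past values. It assumes a unit step size and is not the update rule of any specific optimizer listed below. The function names are made up for illustration.

```python
def gl_weights(alpha: float, n_terms: int) -> list[float]:
    """Grünwald-Letnikov coefficients c_k = (-1)^k * binom(alpha, k) for k = 0..n_terms."""
    weights = [1.0]
    for k in range(1, n_terms + 1):
        # Standard recurrence: c_k = c_{k-1} * (k - 1 - alpha) / k
        weights.append(weights[-1] * (k - 1 - alpha) / k)
    return weights


def gl_difference(history: list[float], alpha: float) -> float:
    """Apply the truncated GL operator to a sequence ordered newest value first."""
    weights = gl_weights(alpha, len(history) - 1)
    return sum(w * g for w, g in zip(weights, history))


# alpha = 1 recovers the ordinary first difference: only two nonzero weights.
print(gl_weights(1.0, 3))
# A fractional order keeps slowly decaying weights on every older value.
print(gl_weights(0.5, 3))
```

With alpha = 1 the operator keeps only the two most recent values. With a fractional alpha, every earlier value retains a small, slowly shrinking weight. That long tail is what the memory-kernel description refers to.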

Foundations

Optimizer Venue Paper Code
Fractional Steepest Descent Method (FSDM) IEEE Transactions on Neural Networks and Learning Systems 2015 Fractional Extreme Value Adaptive Training Method: Fractional Steepest Descent Approach
Caputo BP-NN FOGD (ISNN) Advances in Neural Networks - ISNN 2017 (Lecture Notes in Computer Science) A Caputo-Type Fractional-Order Gradient Descent Learning of BP Neural Networks
Caputo CVNN FOGD IEEE Access 2017 Convergence Analysis of Caputo-Type Fractional Order Complex-Valued Neural Networks
Caputo fractional-order gradient descent Neural Networks 2017 Fractional-order gradient descent learning of BP neural networks with Caputo derivative
FBPTT Circuits, Systems, and Signal Processing 2018 A Novel Fractional Gradient-Based Learning Algorithm for Recurrent Neural Networks
FGD-RBF Circuits, Systems, and Signal Processing 2018 A Fractional Gradient Descent-Based RBF Neural Network
Fractional-Order Deep BP NN Computational Intelligence and Neuroscience 2018 Fractional-Order Deep Backpropagation Neural Network official
Caputo-Type FOGD (Deep BP) IEEE IMCEC 2019 A Caputo-Type Fractional-Order Gradient Descent Learning of Deep BP Neural Networks
FSGD Electronic Markets 2019 Fractional stochastic gradient descent for recommender systems
mF-SGD IEEE Access 2019 Design of Momentum Fractional Stochastic Gradient Descent for Recommender Systems
CFEM-LMS Neurocomputing 2020 Combination of fractional FLANN filters for solving the Van der Pol-Duffing oscillator
FSDM Frontiers of Information Technology & Electronic Engineering 2020 Fractional-order global optimal backpropagation machine trained by an improved fractional-order steepest descent method
Fractional Order Gradient Method Neurocomputing 2020 Convolutional neural networks with fractional order gradient method
Normalized Fractional SGD (NFSGD) Neural Computing and Applications 2020 Design of normalized fractional SGD computing paradigm for recommender systems
Fractional Order Gradient Method Journal of the Franklin Institute 2020 Generalization of the gradient method with fractional order gradient direction
Fractional Order Gradient Descent with Momentum (FOGDM) Network: Computation in Neural Systems 2020 Data classification based on fractional order gradient descent with momentum for RBF neural network
CFGD (Caputo) arXiv 2021 A Caputo fractional derivative-based algorithm for optimization
Fractional-Order Momentum (FCM) Neurocomputing 2021 Convolutional neural networks based on fractional-order momentum for parameter training
FOGDM-RBF Soft Computing 2021 Fractional-order gradient descent with momentum for RBF neural network-based AIS trajectory restoration
Caputron Electronics (MDPI) 2022 Exploring the Effects of Caputo Fractional Derivative in Spiking Neural Network Training official
FGD (CNN BP) arXiv 2022 Using a novel fractional-order gradient method for CNN back-propagation
FGNN Mathematics (MDPI) 2022 A Regularized Graph Neural Network Based on Approximate Fractional Order Gradients
FracM Neural Computing and Applications 2022 A fractional-order momentum optimization approach of deep neural networks community
GFSGD Chaos, Solitons & Fractals 2022 Generalized fractional strategy for recommender systems with chaotic ratings behavior
Fractional Derivative Gradient Optimizers (FSGD) Applied Sciences 2022 Fractional Derivative Gradient-Based Optimizers for Neural Networks and Human Activity Recognition
Fractional LMS (FLMS) IEEE Transactions on Signal Processing 2022 Performance Analysis of Fractional Learning Algorithms
Conformable Fractional Gradient Descent Fuzzy Systems and Data Mining VIII 2022 Fractional Gradient Descent Learning of Backpropagation Artificial Neural Networks with Conformable Fractional Calculus
Fractional Order Gradient Descent with variable initial value Neurocomputing 2022 Study on fast speed fractional order gradient descent method and its application in neural networks
TFGD (Time-fractional) Axioms 2022 Training Neural Networks by Time-Fractional Gradient Descent
Variable Order Fractional Gradient Descent Chinese Control and Decision Conference 2022 Variable Order Fractional Gradient Descent Method and Its Application in Neural Networks Optimization
CfGD / CfAdam Neural Networks 2023 Accelerating gradient descent and Adam via fractional gradients
RFGD Neural Networks 2023 A fractional gradient descent algorithm robust to the initial weights of multilayer perceptron
FO-RI-FedAvg arXiv 2026 Fractional Order Federated Learning for Battery Electric Vehicle Energy Consumption Modeling
IHL-Adam Expert Systems with Applications 2024 Parameter training method for convolutional neural networks based on improved Hausdorff-like derivative

Recent advances

Optimizer Venue Paper Code
AFOGD / AFOAGD arXiv 2023 The Novel Adaptive Fractional Order Gradient Decent Algorithms Design via Robust Control
EFSGD / EN-EFSGD Chaos, Solitons & Fractals 2023 Enhanced fractional prediction scheme for effective matrix factorization in chaotic feedback recommender systems
FCGD_G-L Mathematics 2023 A Deep Learning Optimizer Based on Grünwald–Letnikov Fractional Order Definition
FGDAM Applied Mathematics and Computation 2023 Applications of fractional gradient descent method with adaptive momentum in BP neural networks
FracG Chinese Control Conference (CCC) 2023 Optimization Method of Neural Networks via Fractional-Order of Gradients
Fractional Gradient Descent (FSGD) Fractal and Fractional 2023 Fractional Gradient Optimizers for PyTorch: Enhancing GAN and BERT
Improved Stochastic Fractional Order Gradient Descent Fractal and Fractional 2023 The Improved Stochastic Fractional Order Gradient Descent Algorithm
AdaGL Neural Processing Letters 2024 An Adaptive Learning Rate Deep Learning Optimizer Using Long and Short-Term Gradients Based on G–L Fractional-Order Derivative community
GFSGD Heliyon 2024 Fractional gradient optimized explainable convolutional neural network for Alzheimer's disease diagnosis
FOAdam Applied Mathematical Modelling 2024 A novel gradient descent optimizer based on fractional order scheduler and its application in deep neural networks
Adaptive Terminal Caputo Fractional Gradient Descent (AT-CFGD) TMLR 2024 Convergence Analysis of Fractional Gradient Descent
Caputo Fractional-Order Gradient Descent International Journal of Fuzzy Systems 2024 A Novel Neuro-fuzzy Learning Algorithm for First-Order Takagi–Sugeno Fuzzy Model: Caputo Fractional-Order Gradient Descent Method
FNGD IEEE Access 2024 Improving the Accuracy of Neural Network Pattern Recognition by Fractional Gradient Descent
MFFGD Neurocomputing 2024 MFFGD: An adaptive Caputo fractional-order gradient algorithm for DNN
Caputo-based SGD (L1 scheme) OpenReview 2024 Stochastic Fractional Gradient Descent with Caputo L1 Scheme for Deep Neural Networks
C-FOG Fractal and Fractional 2024 Self-Organizing Optimization Based on Caputo's Fractional Order Gradients
CSA-CFGD PeerJ Computer Science 2024 Deep ocular tumor classification model using cuckoo search algorithm and Caputo fractional gradient descent official
FGD-RBFNN (UAV) Computer Modeling in Engineering & Sciences 2024 Fractional Gradient Descent RBFNN for Active Fault-Tolerant Control of Plant Protection UAVs
FOELM Applied Soft Computing 2024 An interval neural network-based Caputo fractional-order extreme learning machine applied to classification
MIF Algorithms 2024 An Integer-Fractional Gradient Algorithm for Back Propagation Neural Networks
Multi-layer NN FOGD Advanced Theory and Simulations 2024 Convergence Analysis and Application for Multi-Layer Neural Network Based on Fractional-Order Gradient Descent Learning
UCAdam Journal of Electrical Systems 2024 Improved Adam: Incorporating Unified Conformable Fractional Derivative for fractional-order Momentum
2SEDFOSGD arXiv 2025 Effective Dimension Aware Fractional-Order Stochastic Gradient Descent for Convex Optimization Problems
2SEDFOSGD arXiv 2025 More Optimal Fractional-Order Stochastic Gradient Descent for Non-Convex Optimization Problems
AFGD (adaptive Caputo FGD for TCN) Neurocomputing 2025 Monotonic convergence of adaptive Caputo fractional gradient descent for temporal convolutional networks
FGDSINN International Journal of Machine Learning and Cybernetics 2025 A smoothing interval neural networks-based Caputo fractional-order gradient learning algorithm
FOSGD / FOSGDM / FOSGDME Neural Networks 2025 Fractional-order stochastic gradient descent method with momentum and energy for deep neural networks
FracGrad Fractal and Fractional 2025 FracGrad: A Discretized Riemann–Liouville Fractional Integral Approach to Gradient Accumulation for Deep Learning
GF-SGD Computers in Biology and Medicine 2025 Generalized fractional optimization-based explainable lightweight CNN model for malaria disease classification
IFOGD Neural Networks 2025 Improved fractional-order gradient descent method based on multilayer perceptron
L2O-CFGD arXiv 2025 Enhancing Fractional Gradient Descent with Learned Optimizers official
MOAOCFGD arXiv 2025 An Adaptive Order Caputo Fractional Gradient Descent Method for Multi-objective Optimization Problems
NCFDD / NFLightGBM Information Fusion 2025 Fractional light gradient boosting machine ensemble learning model: A non-causal fractional difference descent approach
Caputo fractional-order gradient descent for neural network training Chaos, Solitons & Fractals 2025 Fractional-order gradient approach for optimizing neural networks: A theoretical and empirical analysis
Fractional-order SGD (FSGD) arXiv 2025 Fractional-order Jacobian Matrix Differentiation and Its Application in Artificial Neural Networks
Adaptive Parameter Fractional-Order Gradient Descent Learning European Journal of Operational Research 2025 Novel adaptive parameter fractional-order gradient descent learning for stock selection decision support systems
FAdam Chaos, Solitons & Fractals 2025 Parameter training methods for convolutional neural networks with adaptive adjustment method based on Caputo fractional-order differences
SFM Digital Signal Processing 2025 A momentum-based stochastic fractional gradient optimizer with U-net model for brain tumor segmentation in MRI
Caputo Fractional-order Gradient Descent for Ridge Polynomial Neural International Conference on Electronics and Communication, Network and Computer Technology 2025 A Novel Method for Ridge Polynomial Neural Network-based Caputo Fractional-order Gradient Descent Algorithm
AOFGD SSRN 2025 AOFGD: Adaptive order fractional gradient descent method
Frac-Adam Mathematics 2025 Fractional Optimizers for LSTM Networks in Financial Time Series Forecasting
Caputo Fractional Gradient Descent International Conference on Advanced Algorithms and Control Engineering 2025 Fractional Order Gradient Descent with Caputo Derivatives for Product-Unit Neural Networks
FO-STDGD Neurocomputing 2025 Fractional-order spike-timing-dependent gradient descent for multi-layer spiking neural networks
Fractional Order Stochastic Gradient Descent (FOSGD) ASME IDETC-CIE 2025 Tail-Index-Awareness in Fractional Order Stochastic Gradient Descent
λ-FAdaMax Expert Systems with Applications 2025 λ-FAdaMax: A novel fractional-order gradient descent method with decaying second moment for neural network training
CFDNN Scientific Reports 2026 Conformable Fractional Deep Neural Networks (CFDNN) for high-speed cyber-attack detection
CFGD (Compressed) IEEE Transactions on Neural Networks and Learning Systems 2026 Fractional Gradient Descent With Matrix Stepsizes for Non-Convex Optimization official
FAdamWav Fractal and Fractional 2026 FAdamWav: A Fractional Wavelet Gradient Optimizer for Neural Networks
FOFedAvg arXiv 2026 Fractional-Order Federated Learning
Fractional-order FL with adaptive momentum IEEE Transactions on Emerging Topics in Computational Intelligence 2026 Communication-Efficient Federated Learning via Fractional-Order Gradient Descent With Adaptive Momentum Under Non-IID Data
TFGD (Tempered) Neural Networks 2026 Tempered fractional gradient descent: Theory, algorithms, and robust learning applications
FGD-ED Information Processing & Management 2026 Fractional-order gradient descent method based on fractional-order term exponential decay and its application in artificial neural networks
Caputo Fractional-Order Gradient Descent Method (FGDM) Applied Soft Computing 2026 A novel gradient learning algorithm based on zero-order Takagi-Sugeno fuzzy model: the Caputo fractional-order gradient descent
CFGD (Conformable) Journal of Computational and Applied Mathematics 2026 Conformable fractional gradient descent: A local optimizer for neural network training
NGLFGD Knowledge-Based Systems 2026 Fast and accurate fractional order gradient descent algorithm and its application in Extreme Gradient Boosting
FO-Elman Neural Networks 2026 Fractional-order gradient descent learning for Elman neural networks

Surveys

Optimizer Venue Paper Code
Fractional-Order Gradient Descent for Neural Networks The European Physical Journal Special Topics 2022 Artificial neural networks: a practical review of applications involving fractional calculus
Fractional Gradient Descent (FGD) Chaos, Solitons & Fractals 2025 A comprehensive survey of fractional gradient descent methods and their convergence analysis
Fractional Continuous Time Method (FCTM) Journal of Computational and Applied Mathematics 2026 An overview of the fractional-order gradient descent method and its applications

Note: FAdam (arXiv 2405.12807) is a Fisher-information variant of Adam and, despite the name, is unrelated to fractional calculus. It is a different method from the Caputo-based FAdam listed under Recent advances.

Distributed and Communication-Efficient Optimizers

Optimizers in this category target training across many devices or nodes, where memory and inter-worker communication are the main bottlenecks. They shard optimizer state, compress gradient exchange, or synchronize infrequently so that training scales without a proportional increase in bandwidth. Some entries are standalone update rules, while others wrap an inner optimizer with a communication-efficient outer loop.
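As one concrete case of compressed gradient exchange, the sketch below follows the sign-compression idea behind signSGD with a majority vote on the server. It is a single-process toy under simplifying assumptions (plain Python lists, no real communication), and the function names are illustrative only.

```python
def sign(x: float) -> int:
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def compress(grad: list[float]) -> list[int]:
    """Worker side: keep only the sign of each coordinate, about one bit each."""
    return [sign(g) for g in grad]


def majority_vote(worker_signs: list[list[int]]) -> list[int]:
    """Server side: elementwise majority over the workers' sign vectors."""
    return [sign(sum(column)) for column in zip(*worker_signs)]


def apply_update(params: list[float], direction: list[int], lr: float = 0.01) -> list[float]:
    """Every worker applies the same voted direction, so replicas stay in sync."""
    return [p - lr * d for p, d in zip(params, direction)]


# Three workers with different local gradients for a two-parameter model.
local_grads = [[0.5, -1.0], [2.0, -0.1], [-3.0, 0.2]]
voted = majority_vote([compress(g) for g in local_grads])
print(voted)
```

Each worker sends signs rather than full-precision values, and the server returns one voted sign per coordinate. Both directions of the exchange therefore shrink by roughly the width of a float.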

Optimizer Venue Paper Code zij
signSGD ICML 2018 signSGD: Compressed Optimisation for Non-Convex Problems official
LD-SGD arXiv 2019 Communication-Efficient Local Decentralized SGD Methods
Local SGD ICLR 2019 Local SGD Converges Fast and Communicates Little community
PowerSGD NeurIPS 2019 PowerSGD: Practical Low-Rank Gradient Compression for Distributed Optimization
Qsparse-local-SGD NeurIPS 2019 Qsparse-local-SGD: Distributed SGD with Quantization, Sparsification, and Local Computations
signProx ICASSP 2019 signProx: One-Bit Proximal Algorithm for Nonconvex Stochastic Optimization
APMSqueeze arXiv 2020 APMSqueeze: A Communication Efficient Adam-Preconditioned Momentum SGD Algorithm
DEED-GD arXiv 2020 DEED: A General Quantization Scheme for Communication Efficiency in Bits
FedAC NeurIPS 2020 Federated Accelerated Stochastic Gradient Descent
LAGS-SGD ECAI 2020 Layer-wise Adaptive Gradient Sparsification for Distributed Deep Learning with Convergence Guarantees
rTop-k JSAIT 2020 rTop-k: A Statistical Estimation Approach to Distributed SGD
SCAFFOLD ICML 2020 SCAFFOLD: Stochastic Controlled Averaging for Federated Learning
SlowMo ICLR 2020 SlowMo: Improving Communication-Efficient Distributed SGD with Slow Momentum
ZeRO SC 2020 ZeRO: Memory Optimizations Toward Training Trillion Parameter Models official
1-bit Adam ICML 2021 1-bit Adam: Communication Efficient Large-Scale Training with Adam's Convergence Speed official
BVR-L-SGD ICML 2021 Bias-Variance Reduced Local SGD for Less Heterogeneous Federated Learning
SQuARM-SGD JSAIT 2021 SQuARM-SGD: Communication-Efficient Momentum SGD for Decentralized Optimization
SketchedAMSGrad ICDM 2022 Communication-Efficient Adam-Type Algorithms for Distributed Data Mining
0/1 Adam ICLR 2023 Maximizing Communication Efficiency for Large-scale Training via 0/1 Adam official
AdaCGD TMLR 2023 Adaptive Compression for Communication-Efficient Distributed Training
DiLoCo arXiv 2023 DiLoCo: Distributed Low-Communication Training of Language Models community
Distributed Shampoo arXiv 2023 A Distributed Data-Parallel PyTorch Implementation of the Distributed Shampoo Optimizer for Training Neural Networks At-Scale official
SPARQ-SGD TAC 2023 SPARQ-SGD: Event-Triggered and Compressed Communication in Decentralized Stochastic Optimization
AdaFedAdam TMLCN 2024 Accelerating Fair Federated Learning: Adaptive Federated Adam official
DeMo arXiv 2024 DeMo: Decoupled Momentum Optimization official
FADAS ICML 2024 FADAS: Towards Federated Adaptive Asynchronous Optimization official
FAGH arXiv 2024 FAGH: Accelerating Federated Learning with Approximated Global Hessian
Fed-Sophia ICC 2024 Fed-Sophia: A Communication-Efficient Second-Order Federated Learning Algorithm
FedLion ICASSP 2024 FedLion: Faster Adaptive Federated Optimization with Fewer Communication official
FedRepOpt ACCV 2024 FedRepOpt: Gradient Re-parametrized Optimizers in Federated Learning official
FedSTaS arXiv 2024 FedSTaS: Client Stratification and Client Level Sampling for Efficient Federated Learning official
FESS-GDA AISTATS 2024 Stochastic Smoothed Gradient Descent Ascent for Federated Minimax Optimization
FLeNS BigData 2024 FLeNS: Federated Learning with Enhanced Nesterov-Newton Sketch official
MM-PSGD / MC-PSGD MMAsia-W 2024 Distributed Optimization over Block-Cyclic Data
OpenDiLoCo arXiv 2024 OpenDiLoCo: An Open-Source Framework for Globally Distributed Low-Communication Training official
ADEF arXiv 2025 Accelerated Distributed Optimization with Compression and Error Feedback
DAT-SGD ICML 2025 Enhancing Parallelism in Decentralized Stochastic Convex Optimization
DeCo-SGD arXiv 2025 Taming Latency and Bandwidth: A Theoretical Framework and Adaptive Algorithm for Communication-Constrained Training
DES-LOC arXiv 2025 DES-LOC: Desynced Low Communication Adaptive Optimizers for Training Foundation Models
Dion arXiv 2025 Dion: Distributed Orthonormalized Updates official
DLAS-R-FTC CDC 2025 Distributed Optimization and Learning for Automated Stepsize Selection with Finite Time Coordination
FAdamGC arXiv 2025 Gradient Correction in Federated Learning with Adaptive Optimization
FedCET arXiv 2025 Communication Efficient Federated Learning with Linear Convergence on Heterogeneous Data
FedIvon TMLR 2025 Federated Learning with Uncertainty and Personalization via Efficient Second-order Optimization
FedMuon arXiv 2025 FedMuon: Accelerating Federated Learning with Matrix Orthogonalization official
FedOne ICML 2025 FedOne: Query-Efficient Federated Learning for Black-box Discrete Prompt Learning
HybridSGD arXiv 2025 Communication-Efficient, 2D Parallel Stochastic Gradient Descent for Distributed-Memory Optimization
Kuramoto-FedAvg arXiv 2025 Kuramoto-FedAvg: Using Synchronization Dynamics to Improve Federated Learning Optimization under Statistical Heterogeneity official
LQ-SGD arXiv 2025 Trustworthy Efficient Communication for Distributed Learning using LQ-SGD Algorithm
Muon arXiv 2025 Muon is Scalable for LLM Training official Muon
pFedSOP arXiv 2025 pFedSOP: Accelerating Training Of Personalized Federated Learning Using Second-Order Optimization
LT-ADMM TAC 2026 Communication-Efficient Stochastic Distributed Learning
Ringleader ASGD ICLR 2026 Ringleader ASGD: The First Asynchronous SGD with Optimal Time Complexity under Data Heterogeneity
DECA arXiv 2026 DECA: Decentralizing Block-Wise Adam for Efficient LLM Full-Parameter Fine-Tuning on Non-IID Data
Ringmaster LMO arXiv 2026 Ringmaster LMO: Asynchronous Linear Minimization Oracle Momentum Method
SignMuon arXiv 2026 SignMuon: Communication-Efficient Distributed Muon Optimization
Orth-Dion arXiv 2026 Orth-Dion: Eliminating Geometric Mismatch in Distributed Low-Rank Spectral Optimization
EF21-Muon arXiv 2025 Error Feedback for Muon and Friends
MuonBP ICLR 2026 MuonBP: Faster Muon via Block-Periodic Orthogonalization
CurvaDion arXiv 2025 CurvaDion: Curvature-Adaptive Distributed Orthonormalization
Quasi-Newton FL with Error Feedback OPT 2025 (NeurIPS 2025 Workshop) Quasi-Newton Methods for Federated Learning with Error Feedback
DeMuon arXiv 2025 DeMuon: A Decentralized Muon for Matrix Optimization over Graphs
HeLoCo arXiv 2026 HeLoCo: Efficient asynchronous low-communication training under data and device heterogeneity
Decoupled DiLoCo arXiv 2026 Decoupled DiLoCo for Resilient Distributed Pre-training
Partial Parameter Updates arXiv 2025 Partial Parameter Updates for Efficient Distributed Training
SparseLoCo arXiv 2025 Communication Efficient LLM Pre-training with SparseLoCo official
GASLoC arXiv 2026 Unifying Local Communications and Local Updates for LLM Pretraining
MG-ADSGD arXiv 2026 Accelerated Decentralized Stochastic Gradient Descent for Strongly Convex Optimization
Local MixVR arXiv 2026 Local MixVR: Breaking the Communication-Sample Dependence in Distributed Learning
LOSCAR-SGD arXiv 2026 LOSCAR-SGD: Local SGD with Communication-Computation Overlap and Delay-Corrected Sparse Model Averaging
HEW-Local SGD arXiv 2026 Heterogeneous-Horizon Exact-Weight Local SGD
CAPTAIN (C-ALADIN) arXiv 2026 A Global Convergence Analysis of Consensus ALADIN for Convex Optimization
FedPAC arXiv 2026 Taming Preconditioner Drift: Unlocking the Potential of Second-Order Optimizers for Federated Learning on Non-IID Data official
FedAdamW AAAI 2026 FedAdamW: A Communication-Efficient Optimizer with Convergence and Generalization Guarantees for Federated Large Models official
LoRDO arXiv 2026 LoRDO: Distributed Low-Rank Optimization with Infrequent Communication

Second-Order and Orthogonalized Optimizers

Second-order and orthogonalized optimizers exploit curvature information or the matrix structure of gradients rather than purely elementwise first-order statistics. This group spans quasi-Newton and Hessian-diagonal methods (L-BFGS, AdaHessian, Sophia), full-matrix and Kronecker-factored preconditioning (PSGD, Shampoo, SOAP), and orthogonalized-update methods in the Muon family. Venues reflect peer-reviewed acceptance where applicable; otherwise the arXiv year is listed.
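For the orthogonalized-update idea, a minimal sketch is the classical cubic Newton-Schulz iteration, which drives every singular value of a matrix toward 1. This is an illustration only. Muon-family implementations typically use a tuned higher-order polynomial and run on GPU tensors, whereas this version assumes a nonzero, full-rank matrix stored as nested lists.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def newton_schulz(g, steps=30):
    """Approximate the orthogonal polar factor of g by the cubic Newton-Schulz iteration."""
    # Scale by the Frobenius norm so every singular value is at most 1,
    # which keeps the iteration inside its convergence region.
    norm = sum(v * v for row in g for v in row) ** 0.5
    x = [[v / norm for v in row] for row in g]
    for _ in range(steps):
        # X <- 1.5 * X - 0.5 * (X X^T) X maps each singular value s to 1.5*s - 0.5*s**3.
        xxt_x = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * a - 0.5 * b for a, b in zip(r1, r2)] for r1, r2 in zip(x, xxt_x)]
    return x


q = newton_schulz([[2.0, 1.0], [0.0, 1.0]])
print(matmul(q, transpose(q)))  # close to the 2x2 identity
```

The result keeps the singular vectors of the input and discards its singular values, so the update has the same scale in every direction.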

Optimizer Venue Paper Code zij
Gauss-Newton Method Biometrika 1974 Quasi-likelihood functions, generalized linear models, and the Gauss-Newton method
Newton's Method ANL Technical Report 1982 Newton's method (ANL-82-8)
L-BFGS Mathematical Programming 1989 On the limited memory BFGS method for large scale optimization official LBFGS
Natural Gradient Neural Computation 1998 Natural Gradient Works Efficiently in Learning
K-FAC ICML 2015 Optimizing Neural Networks with Kronecker-factored Approximate Curvature
PSGD IEEE TNNLS 2018 Preconditioned Stochastic Gradient Descent official
Shampoo ICML 2018 Shampoo: Preconditioned Stochastic Tensor Optimization official Shampoo
AdaHessian AAAI 2021 ADAHESSIAN: An Adaptive Second Order Optimizer for Machine Learning official Adahessian
Apollo arXiv 2020 Apollo: An Adaptive Parameter-wise Diagonal Quasi-Newton Method for Nonconvex Stochastic Optimization official
K-BFGS / K-BFGS(L) NeurIPS 2020 Practical Quasi-Newton Methods for Training Deep Neural Networks
SGN arXiv 2020 On the Promise of the Stochastic Generalized Gauss-Newton Method for Training DNNs
SpiderSQN IEEE TNNLS 2022 Faster Stochastic Quasi-Newton Methods
TKFAC AAAI 2021 A Trace-restricted Kronecker-Factored Approximation to Natural Gradient
SGDHess NeurIPS 2022 Better SGD using Second-order Momentum
SketchySGD SIMODS 2024 SketchySGD: Reliable Stochastic Optimization via Randomized Curvature Estimates official
Distributed Shampoo arXiv 2023 A Distributed Data-Parallel PyTorch Implementation of the Distributed Shampoo Optimizer for Training Neural Networks At-Scale official
mL-BFGS TMLR 2023 mL-BFGS: A Momentum-based L-BFGS for Distributed Large-Scale Neural Network Optimization
Sophia ICLR 2024 Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training official SophiaG
AdaFisher ICLR 2025 AdaFisher: Adaptive Second Order Optimization via Fisher Information official
CRNAS arXiv 2024 Novel Optimization Techniques for Parameter Estimation
HesScale ICML 2024 Revisiting Scalable Hessian Diagonal Approximations for Applications in Reinforcement Learning official
Muon Blog post 2024 Muon: An optimizer for hidden layers in neural networks official Muon
NysAct IEEE BigData 2024 NysAct: A Scalable Preconditioned Gradient Descent using Nystrom Approximation
OptiQ arXiv 2024 Second-Order Optimization via Quiescence
Q-Newton arXiv 2024 Q-Newton: Hybrid Quantum-Classical Scheduling for Accelerating Neural Network Training with Newton's Gradient Descent official
SOAA arXiv 2024 Efficient Second-Order Neural Network Optimization via Adaptive Trust Region Methods
SOAP ICLR 2025 SOAP: Improving and Stabilizing Shampoo using Adam for Language Modeling official SOAP
AdaDiag arXiv 2025 Improving Adaptive Moment Optimization via Preconditioner Diagonalization
ADAGB2 arXiv 2025 Fast Stochastic Second-Order Adagrad for Nonconvex Bound-Constrained Optimization
AdaGO arXiv 2025 AdaGrad Meets Muon: Adaptive Stepsizes for Orthogonal Updates
AdaMuon arXiv 2025 AdaMuon: Adaptive Muon Optimizer official AdaMuon
ASGO NeurIPS 2025 ASGO: Adaptive Structured Gradient Optimization official
AuON arXiv 2025 AuON: A Linear-time Alternative to Orthogonal Momentum Updates official
COSMOS arXiv 2025 COSMOS: A Hybrid Adaptive Optimizer for Memory-Efficient Training of LLMs official
FUSE IEEE CAI 2025 FUSE: First-Order and Second-Order Unified SynthEsis in Stochastic Optimization
Hessian-aware Scaling arXiv 2025 First-ish Order Methods: Hessian-aware Scalings of Gradient Descent
MAC IEEE ICDM 2025 MAC: An Efficient Gradient Preconditioning using Mean Activation Approximated Curvature
MuonClip arXiv 2025 Kimi K2: Open Agentic Intelligence community
NorMuon ICML 2026 NorMuon: Making Muon more efficient and scalable official NorMuon
OCAR ICML 2025 Online Curvature-Aware Replay: Leveraging 2nd Order Information for Online Continual Learning
PolarGrad arXiv 2025 PolarGrad: A Class of Matrix-Gradient Optimizers from a Unifying Preconditioning Perspective official PolarGrad
ROOT arXiv 2025 ROOT: Robust Orthogonalized Optimizer for Neural Network Training official
S-BFGS arXiv 2025 Efficient Stochastic BFGS methods Inspired by Bayesian Principles
SASSHA ICML 2025 SASSHA: Sharpness-aware Adaptive Second-order Optimization with Stable Hessian Approximation official
Scion ICML 2025 Training Deep Learning Models with Norm-Constrained LMOs official Scion
SPlus arXiv 2025 A Stable Whitening Optimizer for Efficient Neural Network Training official SPlus
Muon^2 arXiv 2026 Muon^2: Boosting Muon via Adaptive Second-Moment Preconditioning
Nora arXiv 2026 Nora: Normalized Orthogonal Row Alignment for Scalable Matrix Optimizer
Pion arXiv 2026 Rethinking Muon Beyond Pretraining: Spectral Failures and High-Pass Remedies for VLA and RLVR
Spectral Sphere Optimizer (SSO) arXiv 2026 Controlled LLM Training on Spectral Sphere official
LoRA-Muon arXiv 2026 LoRA-Muon: Spectral Steepest Descent on the Low-Rank Manifold
FOAM arXiv 2026 FOAM: Frequency and Operator Error-Based Adaptive Damping Method for Reducing Staleness-Oriented Error for Shampoo
Mousse arXiv 2026 Mousse: Rectifying the Geometry of Muon with Curvature-Aware Preconditioning official
FISMO arXiv 2026 FISMO: Fisher-Structured Momentum-Orthogonalized Optimizer
DyKAF arXiv 2025 DyKAF: Dynamical Kronecker Approximation of the Fisher Information Matrix for Gradient Preconditioning
Double Preconditioning (DoPr) arXiv 2026 Double Preconditioning (DoPr): Optimization for Test-Time Performance, not Validation Loss
AdaCubic TMLR 2026 AdaCubic: An Adaptive Cubic Regularization Optimizer for Deep Learning official
IFNSO arXiv 2026 IFNSO: Iteration-Free Newton-Schulz Orthogonalization official
CAO arXiv 2025 CAO: Curvature-Adaptive Optimization via Periodic Low-Rank Hessian Sketching
Turbo-Muon arXiv 2025 Turbo-Muon: Accelerating Orthogonality-Based Optimization with Pre-Conditioning official
SR1 Cubic Quasi-Newton arXiv 2025 Symmetric Rank-One Quasi-Newton Methods for Deep Learning Using Cubic Regularization
KL-Shampoo ICLR 2026 Understanding and Improving Shampoo and SOAP via Kullback-Leibler Minimization official
LLQR arXiv 2026 Layerwise LQR for Geometry-Aware Optimization of Deep Networks official
Freon / Kaon arXiv 2026 Muon is Not That Special: Random or Inverted Spectra Work Just as Well
Mano arXiv 2026 Mano: Restriking Manifold Optimization for LLM Training official
Atlas OPT 2025: 17th Annual Workshop on Optimization for Machine Learning (co-located with NeurIPS 2025) Atlas – Rethinking Optimizer Design for Stability and Speed

Zeroth-Order Optimizers

Zeroth-order (gradient-free) methods train models using only function evaluations, estimating gradients from randomized perturbations of the parameters instead of backpropagation. Because they need no backward pass or activation storage, they run at roughly inference-level memory, which has made them a practical option for fine-tuning large language models on constrained hardware. The lineage runs from SPSA in classical stochastic approximation to recent variance-reduced and low-rank variants built on MeZO.
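
As a rough illustration of the idea, the sketch below estimates a gradient from two loss evaluations along a random sign direction, in the style of SPSA, and uses it for plain descent on a toy quadratic. The names `spsa_gradient` and `quadratic` and every hyperparameter value are assumptions made for this example; this is not the zij API and not any listed paper's exact algorithm.

```python
import random


def spsa_gradient(loss, params, eps=1e-3, rng=random):
    """Two-point simultaneous-perturbation estimate of the gradient of `loss`."""
    # Random sign direction: every coordinate is perturbed at once by +1 or -1.
    z = [rng.choice((-1.0, 1.0)) for _ in params]
    plus = loss([p + eps * zi for p, zi in zip(params, z)])
    minus = loss([p - eps * zi for p, zi in zip(params, z)])
    # Finite-difference directional derivative along z, mapped back onto z.
    scale = (plus - minus) / (2.0 * eps)
    return [scale * zi for zi in z]


def quadratic(x):
    # Toy objective with its minimum at (1, 1, 1, 1).
    return sum((xi - 1.0) ** 2 for xi in x)


rng = random.Random(0)
x = [0.0, 0.0, 0.0, 0.0]
lr = 0.05
for _ in range(500):
    g = spsa_gradient(quadratic, x, rng=rng)
    x = [xi - lr * gi for xi, gi in zip(x, g)]
final_loss = quadratic(x)
```

Only forward evaluations of `quadratic` are needed, so nothing beyond the parameters and one direction vector has to be stored. MeZO-style methods go further and regenerate the direction from a saved random seed instead of keeping it in memory.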

Optimizer Venue Paper Code zij
SPSA IEEE Transactions on Automatic Control 1992 Multivariate stochastic approximation using a simultaneous perturbation gradient approximation official
Evolution Strategies arXiv 2017 Evolution Strategies as a Scalable Alternative to Reinforcement Learning official
ZO-AdaMM NeurIPS 2019 ZO-AdaMM: Zeroth-Order Adaptive Momentum Method for Black-Box Optimization official
MeZO NeurIPS 2023 Fine-Tuning Language Models with Just Forward Passes official
DeepZero ICLR 2024 DeepZero: Scaling up Zeroth-Order Optimization for Deep Model Training official
LeZO arXiv 2024 Simultaneous Computation and Memory Efficient Zeroth-Order Optimizer for Fine-Tuning Large Language Models official
MeZO-SVRG ICML 2024 Variance-reduced Zeroth-Order Methods for Fine-Tuning Language Models official
ZO-AdaMU AAAI 2024 ZO-AdaMU Optimizer: Adapting Perturbation by the Momentum and Uncertainty in Zeroth-order Optimization official
ZoPro CDC 2024 A Zeroth-Order Proximal Algorithm for Consensus Optimization
Addax ICLR 2025 Addax: Utilizing Zeroth-Order Gradients to Improve Memory Efficiency and Performance of SGD for Fine-Tuning Language Models official
DiZO NeurIPS 2025 Harmony in Divergence: Towards Fast, Accurate, and Memory-efficient Zeroth-order LLM Fine-tuning official
ElasticZO arXiv 2025 ElasticZO: A Memory-Efficient On-Device Learning with Combined Zeroth- and First-Order Optimization
HELENE EMNLP 2025 HELENE: Hessian Layer-wise Clipping and Gradient Annealing for Accelerating Fine-tuning LLM with Zeroth-order Optimization
KerZOO arXiv 2025 KerZOO: Kernel Function Informed Zeroth-Order Optimization for Accurate and Accelerated LLM Fine-Tuning
LORENZA arXiv 2025 LORENZA: Enhancing Generalization in Low-Rank Gradient LLM Training via Efficient Zeroth-Order Adaptive SAM
LOZO ICLR 2025 Enhancing Zeroth-order Fine-tuning for Language Models with Low-rank Structures official
MaZO arXiv 2025 MaZO: Masked Zeroth-Order Optimization for Multi-Task Fine-Tuning of Large Language Models
QuZO EMNLP 2025 QuZO: Quantized Zeroth-Order Fine-Tuning for Large Language Models official
R-AdaZO ICML 2025 Refining Adaptive Zeroth-Order Optimization at Ease official
Sparse MeZO NeurIPS 2025 Sparse MeZO: Less Parameters for Better Performance in Zeroth-Order LLM Fine-Tuning official
SubZero ICCV 2025 Zeroth-Order Fine-Tuning of LLMs in Random Subspaces official
TeZO arXiv 2025 TeZO: Empowering the Low-Rankness on the Temporal Dimension in the Zeroth-Order Optimization for Fine-tuning LLMs
VAMO arXiv 2025 VAMO: Efficient Zeroth-Order Variance Reduction for SGD with Faster Convergence
VR-SZD arXiv 2025 A Structured Proximal Stochastic Variance Reduced Zeroth-order Algorithm official
ZO-SAH arXiv 2025 Subspace-based Approximate Hessian Method for Zeroth-Order Optimization
ZO2 COLM 2025 ZO2: Scalable Zeroth-Order Fine-Tuning for Extremely Large Language Models with Limited GPU Memory official
ZOQO ICASSP 2025 ZOQO: Zero-Order Quantized Optimization
AdaMeZO arXiv 2026 AdaMeZO: Adam-style Zeroth-Order Optimizer for LLM Fine-tuning Without Maintaining the Moments official
FZOO ICLR 2026 FZOO: Fast Zeroth-Order Optimizer for Fine-Tuning Large Language Models towards Adam-Scale Speed official
MEAZO arXiv 2026 On Adaptivity in Zeroth-Order Optimization
QZO ICLR 2026 Fine-tuning Quantized Neural Networks with Zeroth-order Optimization official
GRZO arXiv 2026 GRZO: Group-Relative Zeroth-Order Optimization for Large Language Model Fine-Tuning
AGZO ICML 2026 AGZO: Activation-Guided Zeroth-Order Optimization for LLM Fine-Tuning
ZO-MOPI arXiv 2026 Accelerating Zeroth-Order Spectral Optimization with Partial Orthogonalization from Power Iteration official
ZO-Muon arXiv 2026 Powering Up Zeroth-Order Training via Subspace Gradient Orthogonalization official
RLR (Recursive Likelihood Ratio) ICLR 2026 Half-order Fine-Tuning for Diffusion Model: A Recursive Likelihood Ratio Optimizer official
ZO Fine-tuner arXiv 2025 (accepted to ICML 2026) Learning a Zeroth-Order Optimizer for Fine-Tuning LLMs official

Privacy-Preserving Optimizers

Privacy-preserving optimizers train models under differential privacy, typically by clipping per-sample gradients and adding calibrated noise to updates. This section lists differentially private optimization methods and reference libraries, from the original DP-SGD to later variants that reduce clipping bias, correct moment estimates, or filter privacy noise.
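
The clip-then-noise recipe can be sketched in a few lines. The function below is a minimal DP-SGD-style step on plain Python lists; `dp_sgd_step` and its parameter names are hypothetical, and the sketch deliberately omits privacy accounting, so it shows the mechanics of the update and not a privacy guarantee.

```python
import math
import random


def dp_sgd_step(params, per_sample_grads, lr, clip_norm, noise_multiplier, rng):
    """Clip each per-sample gradient, sum, add Gaussian noise, average, then step."""
    summed = [0.0] * len(params)
    for g in per_sample_grads:
        norm = math.sqrt(sum(gi * gi for gi in g))
        # Rescale so no single sample contributes more than clip_norm in L2 norm.
        factor = min(1.0, clip_norm / (norm + 1e-12))
        for i, gi in enumerate(g):
            summed[i] += factor * gi
    # Noise standard deviation is calibrated to the clipping threshold.
    sigma = noise_multiplier * clip_norm
    batch = len(per_sample_grads)
    noisy_mean = [(s + rng.gauss(0.0, sigma)) / batch for s in summed]
    return [p - lr * n for p, n in zip(params, noisy_mean)]


grads = [[3.0, 4.0], [0.3, 0.4]]  # the first gradient has norm 5 and gets clipped
clipped_only = dp_sgd_step([0.0, 0.0], grads, lr=1.0, clip_norm=1.0,
                           noise_multiplier=0.0, rng=random.Random(0))
with_noise = dp_sgd_step([0.0, 0.0], grads, lr=1.0, clip_norm=1.0,
                         noise_multiplier=1.0, rng=random.Random(0))
```

With `noise_multiplier=0.0` the step reduces to clipped SGD, which makes the clipping effect easy to inspect in isolation. Several of the variants listed below target the bias this clipping introduces or the variance the noise adds.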

Optimizer Venue Paper Code
DP-SGD CCS 2016 Deep Learning with Differential Privacy official
DP-LSSGD MSML 2020 DP-LSSGD: A Stochastic Optimization Method to Lift the Utility in Privacy-Preserving ERM official
DP-PASGD arXiv 2020 Differentially Private Federated Learning for Resource-Constrained Internet of Things
DP-SGD-JL NeurIPS 2021 Fast and Memory Efficient Differentially Private-SGD via JL Projections
Opacus arXiv 2021 Opacus: User-Friendly Differential Privacy Library in PyTorch official
A(DP)²SGD TPAMI 2022 A(DP)²SGD: Asynchronous Decentralized Parallel Stochastic Gradient Descent with Differential Privacy
DPIS CCS 2022 DPIS: An Enhanced Mechanism for Differentially Private SGD with Importance Sampling
Top-DP TCSVT 2022 Topology-aware Differential Privacy for Decentralized Image Classification
ANSGD arXiv 2023 Learning across Data Owners with Joint Differential Privacy
DP-FedSAM CVPR 2023 Make Landscape Flatter in Differentially Private Federated Learning official
AClipped-dpSGD Machine Learning 2024 Efficient Private SCO for Heavy-Tailed Data via Averaged Clipping
DiceSGD ICLR 2024 Differentially Private SGD Without Clipping Bias: An Error-Feedback Approach official
DOPPLER NeurIPS 2024 DOPPLER: Differentially Private Optimizers with Low-pass Filter for Privacy Noise Reduction
DP-AdamBC AAAI 2024 DP-AdamBC: Your DP-Adam Is Actually DP-SGD (Unless You Apply Bias Correction) official
FedLAP-DP PoPETs 2024 FedLAP-DP: Federated Learning by Sharing Differentially Private Loss Approximations official
DC-SGD TIFS 2025 DC-SGD: Differentially Private SGD with Dynamic Clipping through Gradient Norm Distribution Estimation
DP-AdamW ICML Workshop 2025 DP-AdamW: Investigating Decoupled Weight Decay and Bias Correction in Private Deep Learning
DP-MicroAdam arXiv 2025 DP-MicroAdam: Private and Frugal Algorithm for Training and Fine-tuning
DPZV arXiv 2025 Communication-Efficient and Differentially Private Vertical Federated Learning with Zeroth-Order Optimization
GeoDP ICDE 2025 Analyzing and Optimizing Perturbation of DP-SGD Geometrically official
Interleaved-ShuffleG arXiv 2025 Improving the Convergence of Private Shuffled Gradient Methods with Public Data
Logit-DP ICLR 2025 Differentially Private Optimization for Non-Decomposable Objective Functions
SPARTA KDD 2025 SPARTA: An Optimization Framework for Differentially Private Sparse Fine-Tuning official
DP-λCGD arXiv 2026 DP-λCGD: Efficient Noise Correlation for Differentially Private Model Training
PINA ICASSP 2026 Differentially Private Clustered Federated Learning with Privacy-Preserving Initialization and Normality-Driven Aggregation
RaCO-DP ICLR 2026 Private Rate-Constrained Optimization with Applications to Fair Learning official
DP-MacAdam arXiv 2026 DP-MacAdam: Differentially Private Mechanism with Adaptive Clipping and Adaptive Momentum
FO-DP-SGD arXiv 2026 Deep Learning under Fractional-Order Differential Privacy
Hyperparameter-free DP optimization (GeN-DP) ICLR 2025 Towards hyperparameter-free optimization with differential privacy
DP-Muon arXiv 2026 DP-Muon: Differentially Private Optimization via Matrix-Orthogonalized Momentum
TP-TopK arXiv 2026 When Do Fewer Coordinates Suffice in DP-SGD?
DPDL arXiv 2026 DPDL: Towards Differential Privacy Preservation in Decentralized Stochastic Learning on Non-IID Data
DP-SGD-RC ICML 2026 Efficient DP-SGD for LLMs with Randomized Clipping
PRISM ICML 2026 PRISM: Gauge-Invariant Tangent-Space Differentially Private LoRA
SMA-DP-SGD arXiv 2026 SMA-DP: Spectral Memory-Aware Differential Privacy for Deep Learning
FiBeR arXiv 2026 FIBER: A Differentially Private Optimizer with Filter-Aware Innovation Bias Correction
DP-KFC ICML 2026 DP-KFC: Data-Free Preconditioning for Privacy-Preserving Deep Learning official
DP-FedAdamW CVPR 2026 DP-FedAdamW: An Efficient Optimizer for Differentially Private Federated Large Models
Lap2 IEEE CSF 2026 Lap2: Revisiting Laplace DP-SGD for High Dimensions via Majorization Theory official
Clip21-SGD2M arXiv 2025 Double Momentum and Error Feedback for Clipping with Fast Rates and Differential Privacy

Sharpness-Aware Optimizers

Sharpness-aware methods seek parameters that lie in neighborhoods with uniformly low loss rather than at isolated minima, which tends to improve generalization. Introduced by SAM (Foret et al., ICLR 2021), these methods wrap a base optimizer such as SGD or AdamW and add a gradient ascent perturbation step before the descent update. Later work makes the perturbation scale-invariant, closes the surrogate gap, reweights the sharpness term, amortizes the extra forward-backward cost, or extends the idea to second-order optimization.
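
The ascent-then-descent structure is easier to see in code. The sketch below applies a SAM-style step with a plain gradient-descent base update on a toy quadratic; `sam_step`, `grad`, and `loss` are illustrative names, and the values of `lr` and `rho` are arbitrary choices for this example, not recommended settings.

```python
import math


def sam_step(w, grad_fn, lr=0.1, rho=0.05):
    """One SAM-style update with a gradient-descent base step."""
    g = grad_fn(w)
    norm = math.sqrt(sum(gi * gi for gi in g)) + 1e-12
    # Ascent: move to the approximate worst point in an L2 ball of radius rho.
    w_adv = [wi + rho * gi / norm for wi, gi in zip(w, g)]
    # Descent: apply the gradient measured at the perturbed point to the original weights.
    g_adv = grad_fn(w_adv)
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]


def loss(w):
    return 0.5 * (w[0] ** 2 + 10.0 * w[1] ** 2)


def grad(w):
    # Gradient of the toy loss above.
    return [w[0], 10.0 * w[1]]


w = [1.0, 1.0]
for _ in range(200):
    w = sam_step(w, grad, lr=0.05, rho=0.05)
final_loss = loss(w)
```

Each step calls `grad_fn` twice, which is the extra forward-backward cost that the efficiency-oriented variants in the table try to amortize. On this toy problem the iterate ends up hovering near the minimum instead of converging to it exactly, because the perturbation has fixed radius `rho`.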

Optimizer Venue Paper Code zij
SAM ICLR 2021 Sharpness-Aware Minimization for Efficiently Improving Generalization community SAM
ASAM ICML 2021 ASAM: Adaptive Sharpness-Aware Minimization for Scale-Invariant Learning of Deep Neural Networks community ASAM
ESAM ICLR 2022 Efficient Sharpness-aware Minimization for Improved Training of Neural Networks
GSAM ICLR 2022 Surrogate Gap Minimization Improves Sharpness-Aware Training official GSAM
LookSAM CVPR 2022 Towards Efficient and Scalable Sharpness-Aware Minimization community LookSAM
AE-SAM ICLR 2023 An Adaptive Policy to Employ Sharpness-Aware Minimization
bSAM ICLR 2023 SAM as an Optimal Relaxation of Bayes official
GAM CVPR 2023 Gradient Norm Aware Minimization Seeks First-Order Flatness and Improves Generalization
WSAM KDD 2023 Sharpness-Aware Minimization Revisited: Weighted Sharpness as a Regularization Term official WSAM
AdaSAM Neural Networks 2024 AdaSAM: Boosting Sharpness-Aware Minimization with Adaptive Learning Rate and Momentum for Training Deep Neural Networks
F-SAM CVPR 2024 Friendly Sharpness-Aware Minimization official
FGSAM NeurIPS 2024 Fast Graph Sharpness-Aware Minimization for Enhancing and Accelerating Few-Shot Node Classification
Lookbehind-SAM ICML 2024 Lookbehind-SAM: k steps back, 1 step forward
MSAM arXiv 2024 Momentum-SAM: Sharpness Aware Minimization without Computational Overhead official
SAMPa NeurIPS 2024 SAMPa: Sharpness-aware Minimization Parallelized
AsyncSAM arXiv 2025 Asynchronous Sharpness-Aware Minimization For Fast and Accurate Deep Learning
GCSAM arXiv 2025 GCSAM: Gradient Centralized Sharpness Aware Minimization official
LightSAM arXiv 2025 LightSAM: Parameter-Agnostic Sharpness-Aware Minimization
SASSHA ICML 2025 SASSHA: Sharpness-aware Adaptive Second-order Optimization with Stable Hessian Approximation official
SSAM JMLR 2025 Stabilizing Sharpness-aware Minimization Through A Simple Renormalization Strategy
SAM-Polyak (Adaptive SAM with Polyak step size) ICML 2026 Adaptive Sharpness-Aware Minimization with a Polyak-type Step size: A Theory-Grounded Scheduler official
X-SAM arXiv 2026 X-SAM: Boosting Sharpness-Aware Minimization with Dominant-Eigenvector Gradient Correction
M-SAM (Modality-Aware SAM) NeurIPS 2025 Modality-Aware SAM: Sharpness-Aware-Minimization Driven Gradient Modulation for Harmonized Multimodal Learning
ZSharp (SAM with Z-Score Gradient Filtering) NeurIPS 2025 OPT Workshop (also accepted to ICASSP 2026) Sharpness-Aware Minimization with Z-Score Gradient Filtering official
Focal-SAM ICML 2025 Focal-SAM: Focal Sharpness-Aware Minimization for Long-Tailed Classification official
Functional SAM ICML 2025 Avoiding spurious sharpness minimization broadens applicability of SAM
FedGMT ICML 2025 One Arrow, Two Hawks: Sharpness-aware Minimization for Federated Learning via Global Model Trajectory official
LE-SAM ICML 2026 Fix the Loss, Not the Radius: Rethinking the Adversarial Perturbation of Sharpness-Aware Minimization

Quantum and Quantum-Inspired Optimizers

This section collects optimizers from two adjacent settings. The first is the optimization of variational quantum circuits, where shot noise and the quantum geometry of the parameter space drive the design of measurement-frugal, gradient-free, and natural-gradient methods. The second is quantum-inspired and quantum-hardware optimization of classical neural networks, where quantum fluctuations, adiabatic evolution, or annealer sampling replace or augment the classical training loop.
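
To make the role of shot noise concrete, here is a small classical simulation of a one-parameter circuit: an RY(theta) rotation applied to |0>, followed by a Z measurement, whose exact expectation is cos(theta). The gradient is estimated with the parameter-shift rule, once exactly and once from a finite number of simulated shots. All function names are assumptions for this sketch, and no quantum library is involved.

```python
import math
import random


def expectation_z(theta, shots=None, rng=None):
    """<Z> after RY(theta) on |0>. Exact if shots is None, otherwise sampled."""
    p0 = math.cos(theta / 2.0) ** 2  # probability of outcome 0 (eigenvalue +1)
    if shots is None:
        return 2.0 * p0 - 1.0
    # Each shot yields +1 or -1; count the -1 outcomes.
    ones = sum(1 for _ in range(shots) if rng.random() >= p0)
    return 1.0 - 2.0 * ones / shots


def parameter_shift_grad(theta, shots=None, rng=None):
    """Gradient of <Z> from two evaluations shifted by +pi/2 and -pi/2."""
    shift = math.pi / 2.0
    plus = expectation_z(theta + shift, shots, rng)
    minus = expectation_z(theta - shift, shots, rng)
    return 0.5 * (plus - minus)


theta = 0.7
exact_grad = parameter_shift_grad(theta)  # equals -sin(theta)
noisy_grad = parameter_shift_grad(theta, shots=20000, rng=random.Random(1))
```

The sampled estimate fluctuates around the exact value with a spread that shrinks roughly as one over the square root of the shot count. That trade-off between measurement budget and gradient quality is what measurement-frugal optimizers manage.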

Optimizers for variational quantum circuits

Optimizer Venue Paper Code
SPSA IEEE Transactions on Automatic Control 1992 Multivariate stochastic approximation using a simultaneous perturbation gradient approximation official
iCANS Quantum 2020 An Adaptive Optimizer for Measurement-Frugal Variational Algorithms community
NFT Physical Review Research 2020 Sequential minimal optimization for quantum-classical hybrid algorithms official
Quantum Natural Gradient Quantum 2020 Quantum Natural Gradient official
Rosalin arXiv 2020 Operator Sampling for Shot-frugal Optimization in Variational Algorithms community
QN-SPSA Quantum 2021 Simultaneous Perturbation Stochastic Approximation of the Quantum Fisher Information official
Rotosolve / Rotoselect Quantum 2021 Structure optimization for parameterized quantum circuits community
Quantum Analytic Descent Physical Review Research 2022 Quantum Analytic Descent official
SGLBO npj Quantum Information 2022 Stochastic gradient line Bayesian optimization for efficient noise-robust optimization of parameterized quantum circuits community
SantaQlaus arXiv 2023 SantaQlaus: A resource-efficient method to leverage quantum shot-noise for optimization of variational quantum algorithms
ExcitationSolve Communications Physics 2025 Fast gradient-free optimization of excitations in variational quantum eigensolvers official
Kernel Descent Scientific Reports 2025 Introducing the kernel descent optimizer for variational quantum algorithms
QUIVER arXiv 2026 Adaptive directional gradients for parameterised quantum circuits
WSBD AISTATS 2026 WSBD: Freezing-Based Optimizer for Quantum Neural Networks official
H-QNG arXiv 2025 Efficient Hamiltonian-aware Quantum Natural Gradient Descent for Variational Quantum Eigensolvers
WA-QNG Quantum Science and Technology 2026 Weighted Approximate Quantum Natural Gradient for Variational Quantum Eigensolver
CQNG EPJ Quantum Technology 2025 Modified Conjugate Quantum Natural Gradient
Momentum-QNG Physica A 2024 Application of Langevin Dynamics to Advance the Quantum Natural Gradient Optimization Algorithm official
qBang Quantum 2024 Optimizing Variational Quantum Algorithms with qBang: Efficiently Interweaving Metric and Momentum to Navigate Flat Energy Landscapes official
EGT (Exact Geodesic Transport) arXiv 2025 Quantum optimization with exact geodesic transport
TGF / TGFQS arXiv 2026 Two-Gate Extensions of Free Axis and Free Quaternion Selection for Sequential Optimization of Parameterized Quantum Circuits
SGD (Superpositional Gradient Descent) IEEE QAI 2025 Superpositional Gradient Descent: Harnessing Quantum Principles for Model Training
Scalable On-Hardware QNN training (parallelised parameter-shift rule) arXiv 2026 Scalable On-Hardware Training of Quantum Neural Networks and Application to Clinical Data Imputation
QM-quantization optimizer (Schrodinger gradient-flow) arXiv 2026 Quantum mechanical framework for quantization-based optimization: from Gradient flow to Schroedinger equation

Quantum-inspired and quantum-hardware methods

Optimizer Venue Paper Code
Quantum Adam Scientific Reports 2018 Optimization of neural networks via finite-value quantum fluctuations
RBM training on a D-Wave annealer Frontiers in Physics 2021 Training Restricted Boltzmann Machines With a D-Wave Quantum Annealer
Quantum Hamiltonian Descent (QHD) arXiv 2023 Quantum Hamiltonian Descent official
Universal AQC neural-network training Frontiers in Artificial Intelligence 2024 Training neural networks with universal adiabatic quantum computing
QHDOPT INFORMS Journal on Computing 2025 QHDOPT: A Software for Nonlinear Optimization with Quantum Hamiltonian Descent official
Stochastic Quantum Hamiltonian Descent (SQHD) arXiv 2025 Stochastic Quantum Hamiltonian Descent
QIASO AIMS Mathematics 2026 The quantum-inspired adaptive superposition optimization for neural network training

Learning-Rate-Free Optimizers

Learning-rate-free (also called parameter-free or tuning-free) optimizers select their step size automatically during training instead of requiring a manually tuned learning rate. Most methods in this family estimate a quantity such as the distance from the initial point to the solution and set the effective step size from observed gradients, while others wrap an existing base optimizer and tune its global scale factor online. The goal is to match the performance of a well-tuned baseline without a learning-rate search.
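
A distance-based step size can be sketched as follows, in the spirit of DoG: the step is the largest distance travelled from the starting point divided by the root of the accumulated squared gradient norms. `dog_minimize` and `r_eps` are names chosen for this illustration; the published method and the zij classes include details (such as stochastic gradients and iterate averaging) that this sketch leaves out.

```python
import math


def dog_minimize(grad_fn, x0, steps=500, r_eps=1e-4):
    """Gradient descent with a distance-over-gradients step size and no learning rate."""
    x = list(x0)
    # Small initial movement guess, scaled by the size of the starting point.
    max_dist = r_eps * (1.0 + math.sqrt(sum(v * v for v in x0)))
    grad_sq_sum = 0.0
    for _ in range(steps):
        g = grad_fn(x)
        grad_sq_sum += sum(gi * gi for gi in g)
        if grad_sq_sum == 0.0:
            break  # started at a stationary point
        step = max_dist / math.sqrt(grad_sq_sum)
        x = [xi - step * gi for xi, gi in zip(x, g)]
        # Track the largest distance from the initial point seen so far.
        dist = math.sqrt(sum((xi - x0i) ** 2 for xi, x0i in zip(x, x0)))
        max_dist = max(max_dist, dist)
    return x


target = [3.0, -2.0]


def grad(p):
    # Gradient of 0.5 * ||p - target||^2
    return [pi - ti for pi, ti in zip(p, target)]


solution = dog_minimize(grad, [0.0, 0.0])
```

The only tunable quantity is `r_eps`, and it mainly affects how many early steps are spent growing the distance estimate, not where the iterate ends up.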

Optimizer Venue Paper Code zij
AdGD ICML 2020 Adaptive Gradient Descent without Descent official
ALI-G ICML 2020 Training Neural Networks for and by Interpolation official
AdaBFE arXiv 2022 BFE and AdaBFE: A New Approach in Learning Rate Automation for Stochastic Optimization
D-Adaptation ICML 2023 Learning-Rate-Free Learning by D-Adaptation official DAdaptSGD, DAdaptAdam
DoG ICML 2023 DoG is SGD's Best Friend: A Parameter-Free Dynamic Step Size Schedule official DoG, LDoG
Mechanic NeurIPS 2023 Mechanic: A Learning Rate Tuner official mechanize
Adam++ arXiv 2024 Towards Simple and Provable Parameter-Free Adaptive Gradient Methods
MoMo ICML 2024 MoMo: Momentum Models for Adaptive Learning Rates official Momo, MomoAdam
Prodigy ICML 2024 Prodigy: An Expeditiously Adaptive Parameter-Free Learner official Prodigy
AdamG arXiv 2024 Towards Stability of Parameter-free Optimization community AdamG
TRAC NeurIPS 2024 Fast TRAC: A Parameter-Free Optimizer for Lifelong Reinforcement Learning official TRAC
Accelerated GRAAL arXiv 2025 Nesterov Finds GRAAL: Optimal and Adaptive Gradient Method for Convex Optimization
AutoSGD arXiv 2025 AutoSGD: Automatic Learning Rate Selection for Stochastic Gradient Descent
EAGLE arXiv 2025 eagle: early approximated gradient based learning rate estimator
ScheduleFree+ arXiv 2026 ScheduleFree+: Scaling Learning-Rate-Free & Schedule-Free Learning to Large Language Models official
AMUSE arXiv 2026 AMUSE: Anytime Muon with Stable Gradient Evaluation
Adaptive Polyak Steps (SF-SGD / SF-Adam) arXiv 2025 Taking the Road Less Scheduled with Adaptive Polyak Steps
GGD (Geodesic Gradient Descent) arXiv 2026 Geodesic Gradient Descent: A Generic and Learning-rate-free Optimizer on Objective Function-induced Manifolds
Accelerated Distance-adaptive Method (DoG-lineage) NeurIPS 2025 Accelerated Distance-adaptive Methods for Hölder Smooth and Convex Optimization
GeN ICLR 2025 Gradient descent with generalized Newton's method official
DoWG NeurIPS 2023 DoWG Unleashed: An Efficient Universal Parameter-Free Gradient Descent Method official
U-DoG COLT 2024 Accelerated Parameter-Free Stochastic Optimization
Sign-SGD via Parameter-Free Optimization ICLR 2026 Sign-SGD via Parameter-Free Optimization
OptEMA arXiv 2026 OptEMA: Adaptive Exponential Moving Average for Stochastic Optimization with Zero-Noise Optimality

Learning Rate Schedulers

zij.core.lr_scheduler vendors the PyTorch core learning rate schedulers under their original class names. The first table lists every vendored class, including the LRScheduler base class, with the published work it derives from where one exists. The second table covers notable schedules from the literature that zij does not yet implement.

In zij

Scheduler Origin
ChainedScheduler
ConstantLR
CosineAnnealingLR Loshchilov & Hutter ICLR 2017 (SGDR)
CosineAnnealingWarmRestarts Loshchilov & Hutter ICLR 2017 (SGDR)
CyclicLR Smith WACV 2017 (cyclical learning rates)
ExponentialLR
LambdaLR
LinearLR
LRScheduler
MultiplicativeLR
MultiStepLR
OneCycleLR Smith & Topin 2019 (super-convergence)
PolynomialLR
ReduceLROnPlateau
SequentialLR
StepLR
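
For orientation, the closed form behind a cosine annealing schedule (without restarts) can be written out directly. The helper below is a standalone sketch with assumed names, not the vendored class; it computes the learning rate at a given step from the initial rate, the horizon `t_max`, and the floor `eta_min`.

```python
import math


def cosine_annealing(step, base_lr, t_max, eta_min=0.0):
    """Learning rate at `step` on a half-cosine from base_lr down to eta_min."""
    return eta_min + 0.5 * (base_lr - eta_min) * (1.0 + math.cos(math.pi * step / t_max))


# Rates at the start, midpoint, and end of a 100-step horizon.
schedule = [cosine_annealing(s, base_lr=0.1, t_max=100, eta_min=0.001) for s in (0, 50, 100)]
```

The schedule starts at `base_lr`, passes through the midpoint between `base_lr` and `eta_min` at half the horizon, and ends at `eta_min`.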

Notable schedules elsewhere

Scheduler Venue Paper Code zij
Inverse square root NeurIPS 2017 Attention Is All You Need official
AdaS arXiv 2020 AdaS: Adaptive Scheduling of Stochastic Gradients official
Untuned Warmup AAAI 2021 On the adequacy of untuned warmup for adaptive optimization
AutoDrop UAI 2024 AutoDrop: Training Deep Learning Models with Automatic Learning Rate Drop
Schedule-Free NeurIPS 2024 The Road Less Scheduled official SGDScheduleFree, AdamWScheduleFree, RAdamScheduleFree, ScheduleFreeWrapper
WSD (Warmup-Stable-Decay) COLM 2024 MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies official
GreedyLR arXiv 2025 Dynamic Learning Rate Scheduling based on Loss Changes Leads to Faster Convergence
Refined SF-AdamW NeurIPS 2025 Through the River: Understanding the Benefit of Schedule-Free Methods for Language Model Training
SF-NorMuon arXiv 2026 Anytime Training with Schedule-Free Spectral Optimization
WSM ICLR 2026 WSM: Decay-Free Learning Rate Schedule via Checkpoint Merging for LLM Pre-training
Power Decay / Warmup-Stable-Decay (WSD) arXiv 2026 Optimal Learning-Rate Schedules under Functional Scaling Laws: Power Decay and Warmup-Stable-Decay
Anytime (Horizon-Free WA schedule) arXiv 2026 Anytime Pretraining: Horizon-Free Learning-Rate Schedules with Weight Averaging

Schedule-Free is not a schedule layered on top of an optimizer. It replaces scheduling with online iterate averaging inside the optimizer itself; see the Learning-Rate-Free Optimizers section for related methods.
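
The averaging can be sketched with three sequences, following the SGD form of the update in The Road Less Scheduled: a base sequence z that takes gradient steps, an averaged sequence x that is evaluated, and an interpolation y where gradients are computed. The function name, the toy objective, and the hyperparameter values are assumptions for this example; the zij classes may differ in details such as warmup and weighting.

```python
import math


def schedule_free_sgd(grad_fn, x0, lr=1.0, beta=0.9, steps=300):
    """Schedule-Free SGD sketch: returns the averaged iterate x."""
    z = list(x0)  # base sequence, updated by plain SGD steps
    x = list(x0)  # running average of z, the point to evaluate
    for t in range(1, steps + 1):
        # Gradients are taken at y, an interpolation between z and x.
        y = [(1.0 - beta) * zi + beta * xi for zi, xi in zip(z, x)]
        g = grad_fn(y)
        z = [zi - lr * gi for zi, gi in zip(z, g)]
        c = 1.0 / (t + 1)  # uniform averaging weight, no decay schedule
        x = [(1.0 - c) * xi + c * zi for xi, zi in zip(x, z)]
    return x


target = [3.0, -2.0]


def grad(p):
    # Gradient of 0.5 * ||p - target||^2
    return [pi - ti for pi, ti in zip(p, target)]


start = [0.0, 0.0]
x_avg = schedule_free_sgd(grad, start, lr=1.0, beta=0.9, steps=300)
initial_error = math.dist(start, target)
final_error = math.dist(x_avg, target)
```

Because gradients are taken at y while the model is evaluated at x, an implementation has to know which point the parameters currently hold. With `beta=0.0` the sketch reduces to SGD with a uniform running average of its iterates.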

Weight averaging is available separately in zij.core.swa_utils, which provides stochastic weight averaging (SWA) and exponential moving average (EMA) utilities: AveragedModel, SWALR, update_bn, and the SWA/EMA averaging functions. The module follows "Averaging Weights Leads to Wider Optima and Better Generalization" (Izmailov et al., UAI 2018).
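
The two averaging rules differ only in how they weight history. The sketch below writes both out on plain lists; `swa_update` and `ema_update` are hypothetical helper names used for illustration and are not the functions exported by zij.core.swa_utils.

```python
def swa_update(avg, new, num_averaged):
    """Equal-weight running mean over all checkpoints seen so far."""
    return [a + (n - a) / (num_averaged + 1) for a, n in zip(avg, new)]


def ema_update(avg, new, decay=0.999):
    """Exponential moving average: recent checkpoints weigh more than old ones."""
    return [decay * a + (1.0 - decay) * n for a, n in zip(avg, new)]


checkpoints = [[1.0, 0.0], [2.0, 2.0], [3.0, 4.0], [6.0, 6.0]]

# SWA: start from the first checkpoint, then fold in the rest one at a time.
swa = list(checkpoints[0])
for count, weights in enumerate(checkpoints[1:], start=1):
    swa = swa_update(swa, weights, count)

# EMA over the same checkpoints, with a short memory to make the effect visible.
ema = list(checkpoints[0])
for weights in checkpoints[1:]:
    ema = ema_update(ema, weights, decay=0.5)
```

Here `swa` ends at the plain mean of the four checkpoints, while `ema` sits closer to the most recent one. For networks with batch normalization, the averaged weights also need their running statistics recomputed, which is the job of update_bn.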

How zij compares

Two kinds of project cover this ground: curated awesome-lists, and installable optimizer collections. zij (زِيج) is both.

Capability | Awesome-lists | Library collections | zij
Curated reference of the whole field | Yes | - | Yes
Installable, tested implementations | - | Yes | Yes
Paper-only methods included | Yes | - | Yes
Update rule in standard notation | - | - | Yes
Per-file provenance (upstream, commit, license) | - | Partial | Yes
Dedicated fractional-order coverage | - | - | Yes
Dedicated quantum / quantum-inspired coverage | - | - | Yes

Engineering standards

  • The Canon and the code are one project. Every Canon row links the paper and, where it exists, the implementation. Every implementation links back to its source and paper.
  • Provenance is explicit. Vendored files record their upstream repository, pinned commit, and license; THIRD_PARTY_NOTICES.md aggregates the attributions. Sources under GPL, non-commercial, or no license are not vendored and remain listed only.
  • Mathematics is explicit. Each update rule is written in standard notation. Where an official implementation diverges from its own paper, the docstring records what the code computes.
  • Everything is tested. Every registered optimizer has convergence and state-dict round-trip tests.

Contributing

New implementations, Canon entries, and corrections are welcome. See CONTRIBUTING.md. A Canon correction counts as much as a code change.

Acknowledgments

zij (زِيج) builds on the projects it learns from:

  • APRIL-AIGC/Awesome-Optimizer: an awesome-list whose breadth helped inform this project's scope.
  • kozistr/pytorch_optimizer: a comprehensive, maintained PyTorch optimizer collection, and a reference for several vendored implementations.
  • jettify/pytorch-optimizer: an early community optimizer collection and the source of several classic implementations.
  • timm: tested optimizer implementations and packaging conventions.
  • PyTorch: the torch.optim core that zij.core mirrors.
  • The optimizer authors: each method is someone's research. The canonical paper is cited in every Canon row and class docstring, and the original repository is credited per file in THIRD_PARTY_NOTICES.md.

Citation

If you use an optimizer from this library, cite two works: the original paper of the algorithm (linked in its Canon row and docstring), and zij as the software you ran. The paper credits the method; the software credits the implementation.

@software{raja_zij,
  author = {Raja, Muhammad Junaid Ali Asif},
  title  = {zij: A Canon and Library of Deep Learning Optimizers},
  year   = {2026},
  url    = {https://github.com/junaidaliop/zij}
}

Machine-readable metadata is in CITATION.cff.

License

Apache-2.0. Vendored components retain their original licenses; see THIRD_PARTY_NOTICES.md.

Contact

Muhammad Junaid Ali Asif Raja — muhammadjunaidaliasifraja@gmail.com