Conversation

@jyork03 (Contributor) commented Nov 1, 2025

Summary:

  • Step-like arguments are now divided by grad_accumulation_steps so they stay consistent with the configured training steps.
    • e.g. a warmup of 100 with gradient accumulation of 32 now lasts roughly 100 training steps instead of 100*32 steps, which should be more intuitive for users.
  • A missing scheduler name now falls back to constant when warmup > 0, making warmup-only configs simpler (see the sketch after this list).
  • If both name and warmup are missing, we raise KeyError("name") to surface misconfigurations early.
  • Centralized learning rate validation and safer handling of variable argument lengths across schedulers.
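
A minimal sketch of the name-resolution rules above (the helper name and structure are illustrative, not the actual implementation):

def resolve_schedule_name(lr_schedule: dict) -> str:
    # An explicit name wins; otherwise a positive warmup implies a constant
    # schedule; otherwise the config is invalid and KeyError("name") is raised.
    name = lr_schedule.get("name")
    if name:
        return name
    if lr_schedule.get("warmup", 0) > 0:
        return "constant"
    raise KeyError("name")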

Grad accumulation normalization:

  • warmup: steps
  • linear_schedule: steps
  • cosine_decay: decay_steps
  • step_decay: step_size
  • Conversion: effective_steps = ceil(config_steps / max(1, grad_accumulation_steps)), with a minimum of 1.
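
In code, that conversion amounts to something like the following (the helper name is hypothetical):

import math

def to_effective_steps(config_steps: int, grad_accumulation_steps: int) -> int:
    # Scheduler steps advance once per optimizer update, so configured step
    # counts are scaled down by the accumulation factor (never below 1).
    return max(1, math.ceil(config_steps / max(1, grad_accumulation_steps)))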

Tests:

  • Warmup-only config now validated in tests.
  • Added tests for linear/cosine/step variants and grad-accum conversions; linear schedule comparisons use MLX reference.

Behavioral notes:

  • No change for valid, named schedulers.
  • Schedules configured without arguments fall back to sane defaults.
  • Unknown names still raise ValueError as before.

Minor improvements:

  • Added muon to --optimizer help text.
  • Changed --learning-rate help text from "Adam learning rate." to "Optimizer learning rate." since it applies to more than just Adam.
  • Improved lr_schedule configuration discoverability and documentation.

From:

"lr_schedule": None

To:

# name options: "cosine_decay", "linear_schedule", "exponential_decay", or "step_decay"
# arguments match positional values for the corresponding mlx scheduler:
# See: https://ml-explore.github.io/mlx/build/html/python/optimizers/schedulers.html
"lr_schedule": {"name": None, "arguments": [], "warmup": 0, "warmup_init": 0.0},

Added reference to config and docs where the scheduler is built:

# Initialize the selected optimizer
lr = args.learning_rate
if args.lr_schedule.get("name", None) or args.lr_schedule.get("warmup", 0) > 0:
    # See CONFIG_DEFAULTS["lr_schedule"] for the format
    # and https://ml-explore.github.io/mlx/build/html/python/optimizers/schedulers.html
    # for the available schedulers
    lr = build_schedule(
        args.lr_schedule,
        args.learning_rate,
        args.iters,
        args.grad_accumulation_steps,
    )
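
For context, MLX exposes these schedulers in mlx.optimizers and composes warmup with decay via join_schedules; roughly, a warmup + cosine_decay config resolves to something like the sketch below after the grad-accumulation conversion (values and composition are illustrative, not the actual build_schedule internals):

import math
import mlx.optimizers as optim

# Hypothetical config: 100-step warmup, cosine decay over 1000 steps,
# learning_rate = 1e-4, grad_accumulation_steps = 32.
accum = 32
warmup_steps = max(1, math.ceil(100 / accum))
decay_steps = max(1, math.ceil(1000 / accum))

warmup = optim.linear_schedule(0.0, 1e-4, steps=warmup_steps)  # warmup_init -> lr
decay = optim.cosine_decay(1e-4, decay_steps)                  # lr -> 0 over decay_steps
lr = optim.join_schedules([warmup, decay], [warmup_steps])     # switch at the warmup boundary

optimizer = optim.AdamW(learning_rate=lr)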

…ests

* Allow warmup-only configs by defaulting missing name to "constant" when warmup > 0

* Raise KeyError when both name and warmup are missing; keep ValueError for unknown names

* Centralize learning-rate presence check; robust arg handling for linear/cosine/exponential/step schedulers

* Convert step-like args by grad_accumulation_steps consistently

* Update tests: warmup-only behavior validated; add schedule argument and grad-accum conversion tests; align linear schedule checks with MLX reference