Conversation

@oleksost
Contributor

@oleksost oleksost commented Dec 16, 2025

✨ Description

Refactors loss definition and logging in head.py:

  • makes logging more explicit
  • implements forward KL
  • logs only unscaled losses
  • defines a separate loss config for each loss in the lm_head
  • allows logging losses without training on them

TODO:

  • Update tests

Config example:

    head:
      lr_scale: 0.0
      losses:
        lm_loss:
          type: cross_entropy
          factor: 1.0
        reverse_kl:
          type: reverse_kl_distillation
          factor: 1.0
        forward_kl:
          type: forward_kl_distillation
          factor: 0.0 # logged but not trained on

This will train using lm_loss (a cross_entropy LM loss) as well as reverse_kl, both weighted at 1.0, and will also log forward_kl without training on it.
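As a reference for the two distillation loss types above, here is a minimal PyTorch sketch of forward and reverse KL between student and teacher logits (illustrative only, not the actual head.py implementation, which operates on fused and possibly vocab-parallel logits). The total training loss is presumably the factor-weighted sum of these terms, while zero-factor losses are only logged.

    import torch
    import torch.nn.functional as F

    def forward_kl(student_logits: torch.Tensor, teacher_logits: torch.Tensor) -> torch.Tensor:
        # Forward KL: KL(teacher || student), mean-seeking; the student is pushed
        # to cover all modes of the teacher distribution.
        teacher_log_probs = F.log_softmax(teacher_logits, dim=-1)
        student_log_probs = F.log_softmax(student_logits, dim=-1)
        return (teacher_log_probs.exp() * (teacher_log_probs - student_log_probs)).sum(-1).mean()

    def reverse_kl(student_logits: torch.Tensor, teacher_logits: torch.Tensor) -> torch.Tensor:
        # Reverse KL: KL(student || teacher), mode-seeking; the student concentrates
        # on high-probability regions of the teacher distribution.
        teacher_log_probs = F.log_softmax(teacher_logits, dim=-1)
        student_log_probs = F.log_softmax(student_logits, dim=-1)
        return (student_log_probs.exp() * (student_log_probs - teacher_log_probs)).sum(-1).mean()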

🔍 Type of change

Select all that apply:

  • 🐛 Bug fix (non-breaking change that addresses a specific issue)
  • 🚀 New feature (non-breaking change that adds functionality)
  • ⚠️ Breaking change (a change that could affect existing functionality)
  • 📈 Performance improvement/optimization (improves speed, memory usage, or efficiency)
  • 🛠️ Code refactor (non-functional changes that improve code readability, structure, etc.)
  • 📦 Dependency bump (updates dependencies, including Dockerfile or package changes)
  • 📝 Documentation change (updates documentation, including new content or typo fixes)
  • 🔧 Infrastructure/Build change (affects build process, CI/CD, or dependencies)

📝 Changes

List the key changes introduced in this PR:

  1. Change A
  2. Change B

✅ Checklist

Make sure the following tasks are completed before submitting the PR:

General

  • 📜 I have read and followed the contributing guidelines.
  • 🏷️ I am using a clear and descriptive PR title that summarizes the key change or feature introduced.
  • 🎉 The functionality is complete, and I have tested the changes.
  • 📝 I have updated the documentation if needed.
  • ⚠️ The change does not introduce any new issues (e.g., runtime warnings, type checker errors, linting problems, unhandled edge cases).
  • 🧩 I have commented my code, especially in hard-to-understand areas.

Dependencies and Configuration

  • 🐋 I have updated the Docker configuration or dependencies, if applicable.
  • 🔄 I have ensured compatibility with the existing setup after dependency changes.

Testing

  • 🧪 I have added or updated tests to cover my changes.
  • ✔️ New and existing tests pass locally with my changes.
  • 🚦 I have tested these changes on GPUs and verified training stability.
  • 🏋️ I have tested the changes on realistic training workloads, if applicable.

Performance Impact

  • 📊 I have run benchmarks where applicable to evaluate the performance impact.
  • ✅ The benchmarks show no performance regression.
  • 🚀 The benchmarks indicate a potential performance improvement.
  • ⚠️ The benchmarks indicate a potential performance degradation.
  • 📈 I have provided benchmark results and detailed any performance impact below, if applicable.

📊 Performance Impact Details

If there is any impact on performance, describe it and provide benchmark results, if applicable:


🗒️ Additional Notes

Include any additional context, information, or considerations here, such as known issues, follow-up tasks, or backward compatibility concerns.

@oleksost oleksost marked this pull request as draft December 16, 2025 13:31
@oleksost oleksost marked this pull request as ready for review December 16, 2025 14:20
Collaborator

@jlamypoirier jlamypoirier left a comment

Not sure I understand this PR. If we only have layer distillation, doesn't that mean we don't train the model head at all?

@oleksost oleksost changed the title Train with only layer distillation losses Train with only layer distillation losses + explicit logging Dec 17, 2025
@oleksost oleksost changed the title Train with only layer distillation losses + explicit logging Refactor lm head losses Dec 22, 2025
@oleksost oleksost changed the title Refactor lm head losses Refactor lm_head losses Dec 22, 2025
Collaborator

@jlamypoirier jlamypoirier left a comment

Thanks for looking into this, it's badly needed.

self.language_model_loss_factor = 0.0
for loss_config in self.losses.values():
    if "dist" in loss_config.type:
        assert self.distillation_model is not None, "Distillation loss requires a distillation model."
Collaborator

Shouldn't the distillation model go with the loss?

Contributor Author

Hm, this raises an error when there is no distillation model, which is correct, no?

Collaborator

The distillation_model parameter is not needed in the language model head itself, only the losses use it, so it should be moved with the losses that use it along with these checks.

desc="Configuration for the LM output layer (weight). Ignored for tied embeddings",
hint=FieldHint.architecture,
)
cross_entropy_implementation: CrossEntropyImpl = Field(
Collaborator

These removals are likely to cause backward compatibility issues when loading existing models. Please make sure it doesn't disrupt ongoing work, and if needed add backward compatibility in _validate

Contributor Author

I tested training with checkpoints created on the main branch in both distributed and apriel2 format. Training starts with no issues.

Removed the targets class, moved targets processing into the losses, and made the loss masks more explicit.
else:
    self.language_model_loss_factor = 0.0
if not self.losses:
    if "losses" not in self._explicit_fields:
Collaborator

Not sure this is needed; it doesn't make sense to have a head without a loss.

Can simplify to:

    self.losses = {"lm_loss": CrossEntropyLMLossConfig()}

    sequence_parallel=self._sequence_parallel and self._vocab_parallel,
)

# TODO: also move to lm_head_losses?
Collaborator

Why not make it an independent loss like the others?

Contributor Author

Addressed.

Actually, it looks like z_loss is useless here: it is implemented using gradient injection, but backward on the logits is never called, and if it is, it does not backprop into the model due to the detach() in the _logits_loss_forward_backward call.
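For context, the usual z-loss formulation (PaLM-style) penalizes the squared log-partition of the logits; a minimal sketch over plain logits, not Fast-LLM's gradient-injection implementation:

    import torch

    def z_loss(logits: torch.Tensor, factor: float = 1e-4) -> torch.Tensor:
        # Penalizes the squared log-partition function to keep logit magnitudes in check;
        # it only has an effect if its gradient actually reaches the logits.
        return factor * torch.logsumexp(logits, dim=-1).pow(2).mean()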

@@ -0,0 +1,23 @@
from fast_llm.layers.block.config import BlockKwargs
Collaborator

Not sure about moving this here; the convention is to leave it in config.py.

Contributor Author

It's easier to have it here to avoid circular imports.

@@ -0,0 +1,344 @@
import abc
Collaborator

By convention, configs are expected to go in a file named config.py. I recommend moving this back to the config file. (The other option would be to create a loss subdirectory, but that's not really justified at this stage.)

loss = per_sample_loss.mean()
if target_format != TargetFormat.labels and group is not None:
    all_reduce(loss, op=ReduceOp.AVG, group=group)
if return_target_entropy and target_format == TargetFormat.logits:
Collaborator

Incorrect for other target formats?

Contributor Author

should be fine for other formats as well

    all_reduce(loss, op=ReduceOp.AVG, group=group)
if return_target_entropy and target_format == TargetFormat.logits:
    # Compute teacher entropy
    teacher_log_prob = torch.log(target + 1e-20)
Collaborator

For TargetFormat.logits we should be using log_softmax(logits), which is numerically stable. It's simply target_logits - log(sum_exp_target_logits), and we already computed sum_exp_target_logits in _fused_softmax_base.
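A minimal sketch of the numerically stable variant suggested here, assuming raw teacher logits are available (names are illustrative, not the actual Fast-LLM code):

    import torch

    def teacher_entropy(target_logits: torch.Tensor) -> torch.Tensor:
        # log_softmax(x) = x - logsumexp(x): avoids log(softmax(x) + eps), which loses
        # precision for near-zero probabilities and biases the entropy estimate.
        teacher_log_prob = target_logits - torch.logsumexp(target_logits, dim=-1, keepdim=True)
        return -(teacher_log_prob.exp() * teacher_log_prob).sum(dim=-1).mean()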

"head": {"output_weight": init_1},
"head": {
"output_weight": init_1,
"losses": {
Collaborator

We can drop this default value; it will make updating easier.

Assert.eq(len(rank_breakdowns), world_size)


if __name__ == "__main__":
Collaborator

Please remove

},
"num_blocks": 12,
},
"head": {"losses": {"lm_loss": {"type": "cross_entropy", "weight": 1.0}}},
Collaborator

"weight" should not be explicit (will go away if removed from the default in validation)
