Fix non-deterministic manifest generation#2369
Open
lzap wants to merge 4 commits into
Open
Conversation
dbab380 to
ebf25a8
Compare
supakeen
reviewed
May 26, 2026
Comment on lines
+92
to
+114
| // Deep copy Flatpaks (this is the critical one for the race condition fix) | ||
| if c.Flatpaks != nil { | ||
| copy.Flatpaks = make([]*struct { | ||
| Registry *struct { | ||
| RemoteName string `yaml:"remote_name,omitempty"` | ||
| URL string `yaml:"url,omitempty"` | ||
| } `yaml:"registry,omitempty"` | ||
| References []string `yaml:"references,omitempty"` | ||
| }, len(c.Flatpaks)) | ||
| for i, fp := range c.Flatpaks { | ||
| if fp != nil { | ||
| fpCopy := *fp | ||
| if fp.Registry != nil { | ||
| regCopy := *fp.Registry | ||
| fpCopy.Registry = ®Copy | ||
| } | ||
| if fp.References != nil { | ||
| fpCopy.References = append([]string(nil), fp.References...) | ||
| } | ||
| copy.Flatpaks[i] = &fpCopy | ||
| } | ||
| } | ||
| } |
Member
There was a problem hiding this comment.
The comment isn't very nice and perhaps we should move the types from inline to public so they don't get duplicated here.
Member
|
Nice. I had indeed found that the flatpak/installerconfig templating was the culprit a few days ago but hadn't had time to address it. Could you update this PR? The title doesn't really clearly reflect the contents anymore. Also left one comment. This also needs to attach updated manifest checksums in case they were wrongly generated initially. Probably amended to the deepcopy commit. |
Go's map iteration is intentionally randomized, causing non-deterministic behavior when iterating over maps during manifest generation. This manifested as different container checksums across runs, even with fixed RNG seeds. Two sources of non-determinism fixed: 1. manifestmock resolver functions (ResolveContainers, ResolveCommits, ResolveFlatpaks, Depsolve): Added manual key sorting before iteration to ensure consistent mock manifest generation. 2. distro/defs conditional config application: Created OrderedMap[V] generic type that sorts keys once during YAML unmarshaling. Replaced all Conditions maps with OrderedMap to ensure deterministic application order during shallow merges. The OrderedMap approach avoids repeated sorting on every method call - keys are sorted once when the YAML is loaded, then iteration order is guaranteed consistent. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
The startMessagePrinter goroutine was reading activeWorkers directly while other goroutines were modifying it with atomic operations, causing a data race. Fix by using atomic.LoadInt32() to read the activeWorkers count instead of accessing it directly. This race condition was detected and fixed by adding comprehensive race detector tests for the workerQueue concurrent execution patterns.
The InstallerConfig.ExpandTemplates() method modifies Flatpaks.References
in place, expanding templates like {{.Arch}} and {{.Distro}}. When manifests
for multiple architectures (e.g., x86_64 and aarch64) are generated in parallel,
they share the same InstallerConfig instance loaded from YAML. This causes a
race condition where whichever goroutine runs last overwrites the templated
values, leading to non-deterministic manifests.
The CICD failure showed this clearly - flatpak container refs were randomly
switching between x86_64 and aarch64:
- "name": "runtime/org.fedoraproject.Platform.GL.default/x86_64/f44"
+ "name": "runtime/org.fedoraproject.Platform.GL.default/aarch64/f44"
Solution: Add a global mutex map keyed by ImageTypeYAML pointer to serialize
InstallerConfig() calls per image type. This ensures template expansion happens
sequentially, preventing concurrent modification of the shared config.
The previous mutex (installerCustomizationsMu) was in the wrong location
(pkg/distro/generic/images.go) - it protected installerCustomizations() which
is called AFTER InstallerConfig returns. The new mutex is in the correct
location (pkg/distro/defs/loader.go InstallerConfig method) where template
expansion actually occurs.
Trade-offs:
- Simpler than deep copying: no 63-line DeepCopy() to maintain
- Slower: serializes manifest generation per image type (no parallelism)
- Acceptable: correctness over speed, user explicitly prefers this approach
AMI checksum changes are expected: proper isolation now allows deterministic
template expansion, fixing the race that caused non-deterministic manifests.
Verified deterministic across 5 consecutive runs.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
ebf25a8 to
19f3e14
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Show diff on validate checksum error to immediately see what is wrong. Also added
Then I took stab on the flakyness of manifest generation with Claude. Some code smells were small patches, but then the real meat is concurrent access. Map ordering is a big and ugly change. And deep copy as well.
We might actually give up on concurrency since the future (CLI) will not take advantage of it. My measurements show that gen-manifests take 10 seconds without any concurrency, and then 6-12 concurrent goroutines on my 12C CPU takes 5 seconds. It does not scale up at all. Opinions?
We can maybe have slow (non-concurrent) to be the default and if someone is in a hurry and feel risky, there could be a "fast" option.
FTR the diff is here: https://github.com/osbuild/images/actions/runs/26418482265/job/77767951928?pr=2369