Skip to content

Fix non-deterministic manifest generation#2369

Open
lzap wants to merge 4 commits into
osbuild:mainfrom
lzap:checksum-validator-diff1
Open

Fix non-deterministic manifest generation#2369
lzap wants to merge 4 commits into
osbuild:mainfrom
lzap:checksum-validator-diff1

Conversation

@lzap
Copy link
Copy Markdown
Contributor

@lzap lzap commented May 25, 2026

Show diff on validate checksum error to immediately see what is wrong. Also added

Then I took stab on the flakyness of manifest generation with Claude. Some code smells were small patches, but then the real meat is concurrent access. Map ordering is a big and ugly change. And deep copy as well.

We might actually give up on concurrency since the future (CLI) will not take advantage of it. My measurements show that gen-manifests take 10 seconds without any concurrency, and then 6-12 concurrent goroutines on my 12C CPU takes 5 seconds. It does not scale up at all. Opinions?

We can maybe have slow (non-concurrent) to be the default and if someone is in a hurry and feel risky, there could be a "fast" option.


FTR the diff is here: https://github.com/osbuild/images/actions/runs/26418482265/job/77767951928?pr=2369

@lzap lzap requested a review from a team as a code owner May 25, 2026 19:52
@lzap lzap requested review from avitova, brlane-rht and thozza May 25, 2026 19:52
@lzap lzap force-pushed the checksum-validator-diff1 branch from dbab380 to ebf25a8 Compare May 25, 2026 21:36
Comment thread pkg/distro/installer_config.go Outdated
Comment on lines +92 to +114
// Deep copy Flatpaks (this is the critical one for the race condition fix)
if c.Flatpaks != nil {
copy.Flatpaks = make([]*struct {
Registry *struct {
RemoteName string `yaml:"remote_name,omitempty"`
URL string `yaml:"url,omitempty"`
} `yaml:"registry,omitempty"`
References []string `yaml:"references,omitempty"`
}, len(c.Flatpaks))
for i, fp := range c.Flatpaks {
if fp != nil {
fpCopy := *fp
if fp.Registry != nil {
regCopy := *fp.Registry
fpCopy.Registry = &regCopy
}
if fp.References != nil {
fpCopy.References = append([]string(nil), fp.References...)
}
copy.Flatpaks[i] = &fpCopy
}
}
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment isn't very nice and perhaps we should move the types from inline to public so they don't get duplicated here.

@supakeen
Copy link
Copy Markdown
Member

supakeen commented May 26, 2026

Nice. I had indeed found that the flatpak/installerconfig templating was the culprit a few days ago but hadn't had time to address it.

Could you update this PR? The title doesn't really clearly reflect the contents anymore. Also left one comment.

This also needs to attach updated manifest checksums in case they were wrongly generated initially. Probably amended to the deepcopy commit.

lzap and others added 3 commits May 26, 2026 07:58
Go's map iteration is intentionally randomized, causing non-deterministic
behavior when iterating over maps during manifest generation. This manifested
as different container checksums across runs, even with fixed RNG seeds.

Two sources of non-determinism fixed:

1. manifestmock resolver functions (ResolveContainers, ResolveCommits,
   ResolveFlatpaks, Depsolve): Added manual key sorting before iteration
   to ensure consistent mock manifest generation.

2. distro/defs conditional config application: Created OrderedMap[V] generic
   type that sorts keys once during YAML unmarshaling. Replaced all
   Conditions maps with OrderedMap to ensure deterministic application
   order during shallow merges.

The OrderedMap approach avoids repeated sorting on every method call -
keys are sorted once when the YAML is loaded, then iteration order is
guaranteed consistent.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
The startMessagePrinter goroutine was reading activeWorkers directly
while other goroutines were modifying it with atomic operations,
causing a data race.

Fix by using atomic.LoadInt32() to read the activeWorkers count
instead of accessing it directly.

This race condition was detected and fixed by adding comprehensive
race detector tests for the workerQueue concurrent execution patterns.
The InstallerConfig.ExpandTemplates() method modifies Flatpaks.References
in place, expanding templates like {{.Arch}} and {{.Distro}}. When manifests
for multiple architectures (e.g., x86_64 and aarch64) are generated in parallel,
they share the same InstallerConfig instance loaded from YAML. This causes a
race condition where whichever goroutine runs last overwrites the templated
values, leading to non-deterministic manifests.

The CICD failure showed this clearly - flatpak container refs were randomly
switching between x86_64 and aarch64:
  -  "name": "runtime/org.fedoraproject.Platform.GL.default/x86_64/f44"
  +  "name": "runtime/org.fedoraproject.Platform.GL.default/aarch64/f44"

Solution: Add a global mutex map keyed by ImageTypeYAML pointer to serialize
InstallerConfig() calls per image type. This ensures template expansion happens
sequentially, preventing concurrent modification of the shared config.

The previous mutex (installerCustomizationsMu) was in the wrong location
(pkg/distro/generic/images.go) - it protected installerCustomizations() which
is called AFTER InstallerConfig returns. The new mutex is in the correct
location (pkg/distro/defs/loader.go InstallerConfig method) where template
expansion actually occurs.

Trade-offs:
- Simpler than deep copying: no 63-line DeepCopy() to maintain
- Slower: serializes manifest generation per image type (no parallelism)
- Acceptable: correctness over speed, user explicitly prefers this approach

AMI checksum changes are expected: proper isolation now allows deterministic
template expansion, fixing the race that caused non-deterministic manifests.

Verified deterministic across 5 consecutive runs.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@lzap lzap force-pushed the checksum-validator-diff1 branch from ebf25a8 to 19f3e14 Compare May 26, 2026 06:24
@lzap lzap changed the title cicd: show diff on validate checksum error Fix non-deterministic manifest generation May 26, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants