
Fix CI: apt update on runner, file lock race condition#341

Open
mikekryjak wants to merge 6 commits into master from ci-apt-update

Conversation

Collaborator

@mikekryjak mikekryjak commented Mar 17, 2026

The first error is:

Ign:4 https://security.ubuntu.com/ubuntu noble-updates/main amd64 libcurl4-openssl-dev amd64 8.5.0-2ubuntu10.7
Err:4 mirror+file:/etc/apt/apt-mirrors.txt noble-updates/main amd64 libcurl4-openssl-dev amd64 8.5.0-2ubuntu10.7
  404  Not Found [IP: 52.161.185.214 80]
E: Failed to fetch mirror+file:/etc/apt/apt-mirrors.txt/pool/main/c/curl/libcurl4-openssl-dev_8.5.0-2ubuntu10.7_amd64.deb  404  Not Found [IP: 52.161.185.214 80]
Fetched 4720 kB in 2s (2450 kB/s)
E: Unable to fetch some archives, maybe run apt-get update or try with --fix-missing?

I followed the error's advice and added sudo apt-get update, which seems to resolve the same issue in xHermes. This PR adds it to the xBOUT CI.
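The change amounts to running apt-get update before any install step. A minimal sketch of what the workflow step could look like (step name and package list are assumptions, not the actual workflow file):

```yaml
# Hypothetical excerpt of the GitHub Actions job; only the apt-get update
# line is the substance of this PR.
- name: Install system dependencies
  run: |
    sudo apt-get update          # refresh package indexes so pinned versions resolve
    sudo apt-get install -y libcurl4-openssl-dev
```

Without the update, the runner's stale package index can point at package versions (like libcurl4-openssl-dev_8.5.0-2ubuntu10.7 above) that the mirror has since replaced, producing the 404.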

The second issue is that the tests hang on test_boutdataset.py::TestSaveRestart::test_to_restart. I recreated the CI environment locally and reproduced the hang. Thanks to @dschwoerer's stack-trace debugging and the help of an LLM, I was able to narrow this down to an unsafely loaded dataset in that test. This resolves the issue on my end, at least.
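The underlying pattern is "fully load the data and release the file handle before saving back to the same file". A minimal stdlib stand-in for that pattern (hypothetical, for illustration; xBOUT itself goes through xarray, where the analogous fix is materialising the lazily opened dataset before writing):

```python
import json
import os
import tempfile

# Create a small file to stand in for the dataset on disk.
path = os.path.join(tempfile.mkdtemp(), "data.json")
with open(path, "w") as f:
    json.dump({"n": 1}, f)

# Safe pattern: read everything into memory and let the read handle close
# before the save touches the same file. The unsafe variant keeps a lazy
# reader (and its lock) alive while writing to the same path.
with open(path) as f:
    data = json.load(f)  # fully materialised; no lingering handle after the block

data["n"] += 1
with open(path, "w") as f:
    json.dump(data, f)
```

With xarray the same idea is ds.load() (or ds.compute()) followed by ds.close() before any to_netcdf targeting the files the dataset was opened from.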

mikekryjak and others added 3 commits March 17, 2026 18:39
Was erroring out on installing libcurl4-openssl-dev before.
@mikekryjak mikekryjak changed the title Fix CI: add sudo apt-get update to actions Fix CI: apt update on runner, file lock race condition Apr 13, 2026
dschwoerer and others added 3 commits April 13, 2026 14:37
This was the cause of the next hang locally. It is a different file than last time, which suggests that many tests may need the same fix. For now, I am pushing this to see if it is enough.
@mikekryjak
Collaborator Author

The tests were still failing locally, but only intermittently. I made a bash script that loops the tests until they fail and then prints a stack trace. It found another cause in another test inside test_boutdataset, where the dataset is opened and then saved shortly afterward. My LLM reckons the lazy opening still kept a lock on the data files, which clashed with the save operation. I made the fix here: 689613b
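The actual script is not in the PR, but a loop-until-failure harness for flaky tests can be sketched like this (the pytest invocation at the bottom is an assumption):

```shell
#!/usr/bin/env bash
# Hypothetical reconstruction of the loop-until-failure harness; the real
# script and the exact pytest invocation are not shown in the PR.
run_until_failure() {
    local i=0
    while "$@"; do
        i=$((i + 1))
        echo "run $i passed"
    done
    echo "failed after $i passing runs"
}

# Example invocation (assumed test path; timeout guards against hangs):
# run_until_failure timeout 300 pytest xbout/tests/test_boutdataset.py -x
```

Wrapping the command in timeout turns a hang into a failure, so the loop terminates and the failing run can be inspected.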

I also cherry-picked @dschwoerer's timeout and stack-trace debugging from 62ab549.
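For getting a stack trace out of a hung test process, one stdlib option is the faulthandler module (a sketch; the actual debug hook cherry-picked here may differ):

```python
import faulthandler
import tempfile

def capture_traceback() -> str:
    """Dump the current thread's stack to a temp file and return it as text."""
    with tempfile.TemporaryFile(mode="w+") as f:
        faulthandler.dump_traceback(file=f)
        f.seek(0)
        return f.read()

trace = capture_traceback()
```

Registered at test-session start, faulthandler.dump_traceback_later(timeout, exit=True) prints the same information for every thread when the timeout fires, which is how a hang can be turned into a stack trace on CI.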

While the CI continues, I will keep looping the tests locally to see if I can find more. If I do, I will make all file loads safe in this test file.

There is still the mystery of why it fails every time on CI but only intermittently locally. My LLM thinks the slower runners may make the timing and file-locking issues worse.

