Skip to content

fix: poll deviceconnect websocket instead of fixed sleeps#2881

Open
pasinskim wants to merge 14 commits into
masterfrom
docker-fix
Open

fix: poll deviceconnect websocket instead of fixed sleeps#2881
pasinskim wants to merge 14 commits into
masterfrom
docker-fix

Conversation

@pasinskim
Copy link
Copy Markdown
Member

The docker-client fixture is function-scoped (since 59f3afc), so test_in_poor_network_environment starts each run with a freshly bootstrapped client and races the initial deviceconnect session handshake. The fixed 30s heal window after the iptables drop has the same problem. Replace both with a bounded retry loop around get_websocket() so the test waits for actual readiness.

Related to QA-1527 and QA-1563.

@mender-test-bot
Copy link
Copy Markdown
Contributor

@pasinskim, start a full integration test pipeline with:

  • mentioning me and start integration pipeline

my commands and options

You can prevent me from automatically starting CI pipelines:

  • if your pull request title starts with "[NoCI] ..."

You can trigger a client pipeline on multiple prs with:

  • mentioning me and start client pipeline --pr mender/127 --pr mender-connect/255

You can trigger a client pipeline for a specific Mender Client release with:

  • mentioning me and start client pipeline --release 6.0.x (can be given multiple times)
  • by default, a pipeline is triggered for each supported release the component is a part of

You can trigger GitHub->GitLab branch sync with:

  • mentioning me and sync

You can print PR statistics for a repository with:

  • mentioning me and print fast pr stats (Team stats only)
  • mentioning me and print full pr stats (Detailed report)
  • options: --repo <repo>, --team <name>, --all-repos, --exclude-drafts, --exclude-user <user>
  • mentioning me and print full pr stats --repo mender --all-repos --exclude-drafts

You can cherry pick to a given branch or branches with:

  • mentioning me and:
 cherry-pick to:
 * 1.0.x
 * 2.0.x

@pasinskim
Copy link
Copy Markdown
Member Author

@mender-test-bot start integration pipeline

@mender-test-bot
Copy link
Copy Markdown
Contributor

Hello 😺 I created a pipeline for you here: Pipeline-2455059455

Build Configuration Matrix

Key Value
INTEGRATION_REV pull/2881/head
RUN_TESTS_FULL_INTEGRATION true

@pasinskim pasinskim force-pushed the docker-fix branch 2 times, most recently from 8924c9c to 23fae41 Compare April 15, 2026 14:01
Copy link
Copy Markdown
Contributor

@merlin-northern merlin-northern left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am ok with that change, two comments:

  • there was a reason I mentioned the HEALTHCHECK on the Mender Docker Client: let's try to consider introducing it with the dbus cli to make the container healthy when the mender-auth responds over dbus (as was the conclusion of the discussion).
  • the pipeline failed: https://gitlab.com/Northern.tech/Mender/integration/-/jobs/13944335034 why was it not reported?

Comment on lines +215 to +229
def open_websocket_with_retry(timeout_s=180, interval_s=5):
# The docker-client fixture is function-scoped, so each test starts
# with a freshly-bootstrapped client whose deviceconnect session may
# not yet be established. Poll get_websocket() until it succeeds.
deadline = time.monotonic() + timeout_s
last_err = None
while time.monotonic() < deadline:
try:
return docker_env.devconnect.get_websocket()
except Exception as e:
last_err = e
time.sleep(interval_s)
raise AssertionError(
f"deviceconnect websocket never became available: {last_err}"
)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sweet. usually we used a lib for that. but it is ok.

Copy link
Copy Markdown
Contributor

@elkoniu elkoniu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

while time.monotonic() < deadline:
try:
return docker_env.devconnect.get_websocket()
except Exception as e:
Copy link
Copy Markdown
Contributor

@p-targowicz p-targowicz Apr 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM but one thing --> do we want to catch all exceptions ? perhaps if there's some specific error idk we want to kill it instantly ? if not that's fine 🥇

@pasinskim
Copy link
Copy Markdown
Member Author

@mender-test-bot start integration pipeline

@mender-test-bot
Copy link
Copy Markdown
Contributor

Hello 😺 I created a pipeline for you here: Pipeline-2457447851

Build Configuration Matrix

Key Value
INTEGRATION_REV pull/2881/head
RUN_TESTS_FULL_INTEGRATION true

Comment thread tests/tests/test_filetransfer.py Outdated
Comment on lines +110 to +114
mender_device.run(
"kill -TERM `pidof %s` 2>/dev/null; "
"while pidof %s >/dev/null 2>&1; do sleep 0.1; done"
% (connect_service_name, connect_service_name)
)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this potentially can hang forever, I would introduce some way to bail out.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@merlin-northern thanks. That makes sense. Fixed!

Copy link
Copy Markdown
Contributor

@merlin-northern merlin-northern Apr 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, and there is one more thing: touch /stop-respawn, remember?

Copy link
Copy Markdown
Member Author

@pasinskim pasinskim Apr 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup. Actually how about

# Kill mender-connect and let the entrypoint's supervise loop restart
 # it automatically with the updated config.
 mender_device.run("kill `pidof %s` 2>/dev/null" % connect_service_name)

@pasinskim
Copy link
Copy Markdown
Member Author

@mender-test-bot start integration pipeline but please don't fail it this time!

@mender-test-bot
Copy link
Copy Markdown
Contributor

Hello 😺 I created a pipeline for you here: Pipeline-2458219452

Build Configuration Matrix

Key Value
INTEGRATION_REV pull/2881/head
RUN_TESTS_FULL_INTEGRATION true

@pasinskim
Copy link
Copy Markdown
Member Author

@mender-test-bot start integration pipeline but please don't fail it this time!

@mender-test-bot
Copy link
Copy Markdown
Contributor

Hello 😺 I created a pipeline for you here: Pipeline-2458246630

Build Configuration Matrix

Key Value
INTEGRATION_REV pull/2881/head
RUN_TESTS_FULL_INTEGRATION true

@pasinskim
Copy link
Copy Markdown
Member Author

@mender-test-bot start integration pipeline

@mender-test-bot
Copy link
Copy Markdown
Contributor

Hello 😺 I created a pipeline for you here: Pipeline-2458414545

Build Configuration Matrix

Key Value
INTEGRATION_REV pull/2881/head
RUN_TESTS_FULL_INTEGRATION true

@pasinskim
Copy link
Copy Markdown
Member Author

@mender-test-bot start integration pipeline

@mender-test-bot
Copy link
Copy Markdown
Contributor

Hello 😺 I created a pipeline for you here: Pipeline-2484811465

Build Configuration Matrix

Key Value
INTEGRATION_REV pull/2881/head
RUN_TESTS_FULL_INTEGRATION true

@pasinskim pasinskim force-pushed the docker-fix branch 3 times, most recently from 1052bf5 to b35e9c8 Compare May 8, 2026 10:48
The docker-client fixture is function-scoped (since 59f3afc), so
test_in_poor_network_environment starts each run with a freshly
bootstrapped client and races the initial deviceconnect session
handshake. The fixed 30s heal window after the iptables drop has
the same problem. Replace both with a bounded retry loop around
get_websocket() so the test waits for actual readiness.

Related to QA-1527 and QA-1563.

Changelog: Add retry loop around get_websocket() so the test
waits for actual readiness.

Signed-off-by: pasinskim <marcin.pasinski@northern.tech>
pasinskim added 10 commits May 8, 2026 15:35
Remove the non-existent PreserveGroup config key. Mender-connect's
PreserveOwner already covers both owner and group via os.Chown(uid, gid).
The unknown JSON field was silently ignored.

Also move the status code assertion before the file check so that
upload failures (e.g. 408 timeout) are diagnosed directly instead of
producing a confusing "assert '' == '101 102'" error.

Signed-off-by: pasinskim <marcin.pasinski@northern.tech>
…r_network_environment

iptables is now pre-installed in the mender-client-docker-addons image
(mender-client-subcomponents). Remove the runtime apt-get
update/install which was the primary cause of flaky failures like exit
code 100 whenever package repos were unreachable from the CI container.

Signed-off-by: pasinskim <marcin.pasinski@northern.tech>
These 6 download-error limit tests (x2 for OS/Enterprise = 12 total)
have been xfailed since April 2021. They spin up the full Docker stack
each time only to hit the expected xfail, burning ~24 minutes of CI
time per run. Mark them as skip until MEN-4659 is actually fixed.

Signed-off-by: pasinskim <marcin.pasinski@northern.tech>
The _run() SSH retry loop had three problems:
- Slept before the first attempt (1s wasted if already connected)
- Exponential backoff with no cap (1,2,4,8,16,32,64,128s...)
- Extra time.sleep(60) on ConnectionError on top of the backoff

After 7 failures the gap between retries was 128s; with the 60s
penalty a single transient SSH drop could stall for minutes.

Fix: try first then sleep, cap backoff at 30s, remove the redundant
60s ConnectionError penalty. Retry sequence is now 1,2,4,8,16,30,30...

Signed-off-by: pasinskim <marcin.pasinski@northern.tech>
get_devices_status() slept 5s before the first API call and used
unbounded linear backoff (5,10,15,20,25...). If devices were already
registered, every caller wasted at least 5s for nothing.

check_expected_status() also slept before its first check.

Fix: check first then sleep on miss, use exponential backoff with
cap (2,4,8,15,15...) instead of linear. Saves ~5s per device
bootstrap across ~7 accept_devices calls in the suite.

Signed-off-by: pasinskim <marcin.pasinski@northern.tech>
Four issues in deployments.py:

1. check_expected_statistics: slept 200ms before the first check on
   every iteration. With 59 call sites this adds up.

2. check_not_in_status: had a for/else logic bug where 'continue'
   targeted the inner for-loop instead of the outer while-loop,
   making the function silently return success in some edge cases.
   Replaced with straightforward any() check.

3. abort() and abort_finished_deployment(): both had a blind sleep(5)
   between the PUT response and the assertion on that same response's
   status code. This was completely pointless since r.status_code is   already set. Saves 5s per call (4 call sites = 20s).

Signed-off-by: pasinskim <marcin.pasinski@northern.tech>
_check_log_for_message: used a fixed 10s polling interval with
sleep-before-check. If the log message appeared within 1s, the test
still waited 10s. Replace with check-first and exponential backoff
(2,4,8,10,10...) saving ~8s per call across 7 call sites.

requests_retry: only retried on 500/502 but not 503 (Service
Unavailable) or 504 (Gateway Timeout), which are common during
container startup when the gateway is up but backends are still
loading. Adding these prevents false test failures during setup.

Signed-off-by: pasinskim <marcin.pasinski@northern.tech>
Convert all 4 filetransfer test classes from function-scoped to
class-scoped fixtures so each class spins up the ~21-container Docker
Compose stack once instead of per-test. This saves ~80% of the wall
time for the filetransfer suite (verified: 6 tests in 97s vs ~580s).

Follows the existing _impl + class_persistent_ pattern already used
for QEMU client fixtures.

Signed-off-by: Maciej Pasinski <maciej.pasinski@northern.tech>
Signed-off-by: pasinskim <marcin.pasinski@northern.tech>
These tests have been xfailed since 2021 and were converted to
@pytest.mark.skip to save CI time. They test deviceconnect 403 error
handling that was never fixed (MEN-4659). Remove the tests and the
assert_forbidden helper they relied on.

Signed-off-by: Maciej Pasinski <maciej.pasinski@northern.tech>
Signed-off-by: pasinskim <marcin.pasinski@northern.tech>
Remove test_db_migration.py and its v1 legacy client fixtures from
common_setup.py. These tests verified upgrading from Mender v1.7.0
(released 2017) to the current version. This migration path that is no
longer supported. The fixtures were only used by this test file.

Signed-off-by: Maciej Pasinski <maciej.pasinski@northern.tech>
Signed-off-by: pasinskim <marcin.pasinski@northern.tech>
@pasinskim
Copy link
Copy Markdown
Member Author

@mender-test-bot start integration pipeline

@mender-test-bot
Copy link
Copy Markdown
Contributor

Hello 😺 I created a pipeline for you here: Pipeline-2510897543

Build Configuration Matrix

Key Value
INTEGRATION_REV pull/2881/head
RUN_TESTS_FULL_INTEGRATION true

pasinskim added 3 commits May 8, 2026 20:15
This reverts commit 9e0b67f58423e74b73774a5d27c415835ac63062.

Signed-off-by: pasinskim <marcin.pasinski@northern.tech>
The iptables binary is not pre-installed in the Docker client image,
so removing the apt-get install broke test_in_poor_network_environment
on both OS and Enterprise.

Restore the install until iptables is pre-baked into the base image.

Signed-off-by: pasinskim <marcin.pasinski@northern.tech>
Signed-off-by: pasinskim <marcin.pasinski@northern.tech>
@mender-test-bot
Copy link
Copy Markdown
Contributor

mender-test-bot commented May 8, 2026

Merging these commits will result in the following changelog entries:

Changelogs

integration (docker-fix)

New changes in integration since master:

  • Add retry loop around get_websocket() so the test
    waits for actual readiness.
    (QA-1527, QA-1563)

@pasinskim
Copy link
Copy Markdown
Member Author

@mender-test-bot start integration pipeline

@mender-test-bot
Copy link
Copy Markdown
Contributor

Hello 😺 I created a pipeline for you here: Pipeline-2511095238

Build Configuration Matrix

Key Value
INTEGRATION_REV pull/2881/head
RUN_TESTS_FULL_INTEGRATION true

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants