fix: poll deviceconnect websocket instead of fixed sleeps#2881
fix: poll deviceconnect websocket instead of fixed sleeps#2881pasinskim wants to merge 14 commits into
Conversation
|
@pasinskim, start a full integration test pipeline with:
my commands and optionsYou can prevent me from automatically starting CI pipelines:
You can trigger a client pipeline on multiple prs with:
You can trigger a client pipeline for a specific Mender Client release with:
You can trigger GitHub->GitLab branch sync with:
You can print PR statistics for a repository with:
You can cherry pick to a given branch or branches with:
|
|
@mender-test-bot start integration pipeline |
|
Hello 😺 I created a pipeline for you here: Pipeline-2455059455 Build Configuration Matrix
|
8924c9c to
23fae41
Compare
There was a problem hiding this comment.
I am ok with that change, two comments:
- there was a reason I mentioned the HEALTHCHECK on the Mender Docker Client: let's try to consider introducing it with the dbus cli to make the container healthy when the mender-auth responds over dbus (as was the conclusion of the discussion).
- the pipeline failed: https://gitlab.com/Northern.tech/Mender/integration/-/jobs/13944335034 why was it not reported?
| def open_websocket_with_retry(timeout_s=180, interval_s=5): | ||
| # The docker-client fixture is function-scoped, so each test starts | ||
| # with a freshly-bootstrapped client whose deviceconnect session may | ||
| # not yet be established. Poll get_websocket() until it succeeds. | ||
| deadline = time.monotonic() + timeout_s | ||
| last_err = None | ||
| while time.monotonic() < deadline: | ||
| try: | ||
| return docker_env.devconnect.get_websocket() | ||
| except Exception as e: | ||
| last_err = e | ||
| time.sleep(interval_s) | ||
| raise AssertionError( | ||
| f"deviceconnect websocket never became available: {last_err}" | ||
| ) |
There was a problem hiding this comment.
sweet. usually we used a lib for that. but it is ok.
| while time.monotonic() < deadline: | ||
| try: | ||
| return docker_env.devconnect.get_websocket() | ||
| except Exception as e: |
There was a problem hiding this comment.
LGTM but one thing --> do we want to catch all exceptions ? perhaps if there's some specific error idk we want to kill it instantly ? if not that's fine 🥇
|
@mender-test-bot start integration pipeline |
|
Hello 😺 I created a pipeline for you here: Pipeline-2457447851 Build Configuration Matrix
|
| mender_device.run( | ||
| "kill -TERM `pidof %s` 2>/dev/null; " | ||
| "while pidof %s >/dev/null 2>&1; do sleep 0.1; done" | ||
| % (connect_service_name, connect_service_name) | ||
| ) |
There was a problem hiding this comment.
this potentially can hang forever, I would introduce some way to bail out.
There was a problem hiding this comment.
@merlin-northern thanks. That makes sense. Fixed!
There was a problem hiding this comment.
oh, and there is one more thing: touch /stop-respawn, remember?
There was a problem hiding this comment.
Yup. Actually how about
# Kill mender-connect and let the entrypoint's supervise loop restart
# it automatically with the updated config.
mender_device.run("kill `pidof %s` 2>/dev/null" % connect_service_name)
|
@mender-test-bot start integration pipeline but please don't fail it this time! |
|
Hello 😺 I created a pipeline for you here: Pipeline-2458219452 Build Configuration Matrix
|
|
@mender-test-bot start integration pipeline but please don't fail it this time! |
|
Hello 😺 I created a pipeline for you here: Pipeline-2458246630 Build Configuration Matrix
|
|
@mender-test-bot start integration pipeline |
|
Hello 😺 I created a pipeline for you here: Pipeline-2458414545 Build Configuration Matrix
|
|
@mender-test-bot start integration pipeline |
|
Hello 😺 I created a pipeline for you here: Pipeline-2484811465 Build Configuration Matrix
|
1052bf5 to
b35e9c8
Compare
The docker-client fixture is function-scoped (since 59f3afc), so test_in_poor_network_environment starts each run with a freshly bootstrapped client and races the initial deviceconnect session handshake. The fixed 30s heal window after the iptables drop has the same problem. Replace both with a bounded retry loop around get_websocket() so the test waits for actual readiness. Related to QA-1527 and QA-1563. Changelog: Add retry loop around get_websocket() so the test waits for actual readiness. Signed-off-by: pasinskim <marcin.pasinski@northern.tech>
Remove the non-existent PreserveGroup config key. Mender-connect's PreserveOwner already covers both owner and group via os.Chown(uid, gid). The unknown JSON field was silently ignored. Also move the status code assertion before the file check so that upload failures (e.g. 408 timeout) are diagnosed directly instead of producing a confusing "assert '' == '101 102'" error. Signed-off-by: pasinskim <marcin.pasinski@northern.tech>
…r_network_environment iptables is now pre-installed in the mender-client-docker-addons image (mender-client-subcomponents). Remove the runtime apt-get update/install which was the primary cause of flaky failures like exit code 100 whenever package repos were unreachable from the CI container. Signed-off-by: pasinskim <marcin.pasinski@northern.tech>
These 6 download-error limit tests (x2 for OS/Enterprise = 12 total) have been xfailed since April 2021. They spin up the full Docker stack each time only to hit the expected xfail, burning ~24 minutes of CI time per run. Mark them as skip until MEN-4659 is actually fixed. Signed-off-by: pasinskim <marcin.pasinski@northern.tech>
The _run() SSH retry loop had three problems: - Slept before the first attempt (1s wasted if already connected) - Exponential backoff with no cap (1,2,4,8,16,32,64,128s...) - Extra time.sleep(60) on ConnectionError on top of the backoff After 7 failures the gap between retries was 128s; with the 60s penalty a single transient SSH drop could stall for minutes. Fix: try first then sleep, cap backoff at 30s, remove the redundant 60s ConnectionError penalty. Retry sequence is now 1,2,4,8,16,30,30... Signed-off-by: pasinskim <marcin.pasinski@northern.tech>
get_devices_status() slept 5s before the first API call and used unbounded linear backoff (5,10,15,20,25...). If devices were already registered, every caller wasted at least 5s for nothing. check_expected_status() also slept before its first check. Fix: check first then sleep on miss, use exponential backoff with cap (2,4,8,15,15...) instead of linear. Saves ~5s per device bootstrap across ~7 accept_devices calls in the suite. Signed-off-by: pasinskim <marcin.pasinski@northern.tech>
Four issues in deployments.py: 1. check_expected_statistics: slept 200ms before the first check on every iteration. With 59 call sites this adds up. 2. check_not_in_status: had a for/else logic bug where 'continue' targeted the inner for-loop instead of the outer while-loop, making the function silently return success in some edge cases. Replaced with straightforward any() check. 3. abort() and abort_finished_deployment(): both had a blind sleep(5) between the PUT response and the assertion on that same response's status code. This was completely pointless since r.status_code is already set. Saves 5s per call (4 call sites = 20s). Signed-off-by: pasinskim <marcin.pasinski@northern.tech>
_check_log_for_message: used a fixed 10s polling interval with sleep-before-check. If the log message appeared within 1s, the test still waited 10s. Replace with check-first and exponential backoff (2,4,8,10,10...) saving ~8s per call across 7 call sites. requests_retry: only retried on 500/502 but not 503 (Service Unavailable) or 504 (Gateway Timeout), which are common during container startup when the gateway is up but backends are still loading. Adding these prevents false test failures during setup. Signed-off-by: pasinskim <marcin.pasinski@northern.tech>
Convert all 4 filetransfer test classes from function-scoped to class-scoped fixtures so each class spins up the ~21-container Docker Compose stack once instead of per-test. This saves ~80% of the wall time for the filetransfer suite (verified: 6 tests in 97s vs ~580s). Follows the existing _impl + class_persistent_ pattern already used for QEMU client fixtures. Signed-off-by: Maciej Pasinski <maciej.pasinski@northern.tech> Signed-off-by: pasinskim <marcin.pasinski@northern.tech>
These tests have been xfailed since 2021 and were converted to @pytest.mark.skip to save CI time. They test deviceconnect 403 error handling that was never fixed (MEN-4659). Remove the tests and the assert_forbidden helper they relied on. Signed-off-by: Maciej Pasinski <maciej.pasinski@northern.tech> Signed-off-by: pasinskim <marcin.pasinski@northern.tech>
Remove test_db_migration.py and its v1 legacy client fixtures from common_setup.py. These tests verified upgrading from Mender v1.7.0 (released 2017) to the current version. This migration path that is no longer supported. The fixtures were only used by this test file. Signed-off-by: Maciej Pasinski <maciej.pasinski@northern.tech> Signed-off-by: pasinskim <marcin.pasinski@northern.tech>
|
@mender-test-bot start integration pipeline |
|
Hello 😺 I created a pipeline for you here: Pipeline-2510897543 Build Configuration Matrix
|
This reverts commit 9e0b67f58423e74b73774a5d27c415835ac63062. Signed-off-by: pasinskim <marcin.pasinski@northern.tech>
The iptables binary is not pre-installed in the Docker client image, so removing the apt-get install broke test_in_poor_network_environment on both OS and Enterprise. Restore the install until iptables is pre-baked into the base image. Signed-off-by: pasinskim <marcin.pasinski@northern.tech>
Signed-off-by: pasinskim <marcin.pasinski@northern.tech>
|
@mender-test-bot start integration pipeline |
|
Hello 😺 I created a pipeline for you here: Pipeline-2511095238 Build Configuration Matrix
|
The docker-client fixture is function-scoped (since 59f3afc), so test_in_poor_network_environment starts each run with a freshly bootstrapped client and races the initial deviceconnect session handshake. The fixed 30s heal window after the iptables drop has the same problem. Replace both with a bounded retry loop around get_websocket() so the test waits for actual readiness.
Related to QA-1527 and QA-1563.