Add option to wait for domain shutdown#469
Conversation
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #469 +/- ##
==========================================
- Coverage 76.83% 76.81% -0.03%
==========================================
Files 53 60 +7
Lines 9365 9423 +58
==========================================
+ Hits 7196 7238 +42
- Misses 2169 2185 +16 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
It works, half of it, at least. This will make some tests fails because it wasn't updated to have The |
|
I tried simper things and had the same problem. loop = asyncio.get_event_loop()
with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
for qube in domains:
future = loop.run_in_executor(
executor, partial(qube.shutdown, force=force, wait=args.wait)
)
tasks.append(future)
loop.run_until_complete(asyncio.wait_for(asyncio.gather(*tasks), args.timeout))async def main(...):
...
loop = asyncio.get_running_loop()
tasks = []
with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
for qube in domains:
future = loop.run_in_executor(executor, qube.shutdown)
tasks.append(future)
async with asyncio.timeout(args.timeout):
await asyncio.gather(*tasks)
returnAnd change to Traceback (most recent call last):
File "/usr/lib/python3.13/site-packages/qubesadmin/tools/qvm_shutdown.py", line 111, in main
await asyncio.gather(*tasks)
asyncio.exceptions.CancelledError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/lib/python3.13/site-packages/qubesadmin/tools/qvm_shutdown.py", line 110, in main
async with asyncio.timeout(args.timeout):
~~~~~~~~~~~~~~~^^^^^^^^^^^^^^
File "/usr/lib64/python3.13/asyncio/timeouts.py", line 116, in __aexit__
raise TimeoutError from exc_val
TimeoutError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/bin/qvm-shutdown", line 7, in <module>
sys.exit(asyncio.run(main()))
~~~~~~~~~~~^^^^^^^^
File "/usr/lib64/python3.13/asyncio/runners.py", line 195, in run
return runner.run(main)
~~~~~~~~~~^^^^^^
File "/usr/lib64/python3.13/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^
File "/usr/lib64/python3.13/asyncio/base_events.py", line 712, in run_until_complete
self.run_forever()
~~~~~~~~~~~~~~~~^^
File "/usr/lib64/python3.13/asyncio/base_events.py", line 683, in run_forever
self._run_once()
~~~~~~~~~~~~~~^^
File "/usr/lib64/python3.13/asyncio/base_events.py", line 2050, in _run_once
handle._run()
~~~~~~~~~~~^^
File "/usr/lib64/python3.13/asyncio/events.py", line 89, in _run
self._context.run(self._callback, *self._args)
~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.13/site-packages/qubesadmin/tools/qvm_shutdown.py", line 102, in main
with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^
File "/usr/lib64/python3.13/concurrent/futures/_base.py", line 647, in __exit__
self.shutdown(wait=True)
~~~~~~~~~~~~~^^^^^^^^^^^
File "/usr/lib64/python3.13/concurrent/futures/thread.py", line 239, in shutdown
t.join()
~~~~~~^^
File "/usr/lib64/python3.13/threading.py", line 1094, in join
self._handle.join(timeout)
~~~~~~~~~~~~~~~~~^^^^^^^^^
File "/usr/lib64/python3.13/asyncio/runners.py", line 157, in _on_sigint
raise KeyboardInterrupt()
KeyboardInterrupt
^CException ignored on threading shutdown:
Traceback (most recent call last):
File "/usr/lib64/python3.13/threading.py", line 1536, in _shutdown
atexit_call()
File "/usr/lib64/python3.13/threading.py", line 1507, in <lambda>
_threading_atexits.append(lambda: func(*arg, **kwargs))
File "/usr/lib64/python3.13/concurrent/futures/thread.py", line 31, in _python_exit
t.join()
File "/usr/lib64/python3.13/threading.py", line 1094, in join
self._handle.join(timeout)
KeyboardInterrupt: |
|
Maybe better to really do QubesOS/qubes-issues#4719 first? |
Yes... I read some alternatives, but they are dirty: https://superfastpython.com/threadpoolexecutor-kill-running-tasks/ |
Technically, this is also the current behavior. With the current code (without this PR), |
The merge request to core-admin already avoids the main problem, of a hanging shutdown blocking qubesd. This change avoids a hanging shutdown blocking the interaction with the qube manager For: QubesOS/qubes-issues#10648 Fixes: QubesOS/qubes-issues#10074 Requires: QubesOS/qubes-core-admin#807 Requires: QubesOS/qubes-core-admin-client#469
Waiting for the server to send the result of the API calls, allows dealing with failures/exceptions: - QubesVM.shutdown -> .shutdown(wait=True) - QubesVM.run_service -> .run_service_for_stdio It also does not block the widget from being opened again even if an action is still running. This allows, for example: - To create disposable qube and attach device to it (which takes 10s), but not block the widget from opening - To detach and shutdown disposable that is hanging, without hanging the widget For qui/tray/domains.py, the "widget.destroy()" is called earlier, cause it's not needed after the response is received. Else it hangs until the "react_to_question" finishes. For: QubesOS/qubes-issues#10648 For: QubesOS/qubes-issues#10651 For: QubesOS/qubes-issues#10835 Requires: QubesOS/qubes-core-admin#807 Requires: QubesOS/qubes-core-admin-client#469
609befd to
b544cb0
Compare
|
The last commit, b544cb0, probably requires adapting the file created in I will try again to not require the |
|
Besides the above, pylint complains, mostly about misplaced docstrings in tests. |
|
Sorry for the delay. I thought that rebooting a qube with libvirt was going to be easy. |
| >>> from qubesadmin.utils import async_thread | ||
| >>> await async_thread(vm.shutdown, force=True, wait=True) | ||
| """ | ||
| return await asyncio.to_thread(functools.partial(func, *args, **kwargs)) |
There was a problem hiding this comment.
Does this really need a helper function for a single line?
There was a problem hiding this comment.
Not anymore, it was different then... Thanks for pointing it out.
return await asyncio.get_running_loop().run_in_executor(
None, functools.partial(func, *args, **kwargs)
)But checking asyncio.to_thread docs again, I don't need the functools.partial that was needed for run_in_executor, because the former accepts kwargs while the latter doesn't.
| for qube in domains | ||
| ] | ||
| results = await asyncio.gather(*tasks, return_exceptions=True) | ||
| for qube, res in zip(domains, results): |
There was a problem hiding this comment.
This zip() relies on the same iteration order in domains as in the above call. Type hint says it's list (which should be fine), but actually you give it as set, which doesn't guarantee the same order (even if it works this way most of the time).
There was a problem hiding this comment.
It seems that conversion to set is unecessary. app.domains only show unique items. But, it doesn't keep order anyway, because qubesadmin.tools.QubesArgumentParser.parse_qubes_app makes a set() conversion... will fix that.
| async def main_async(args=None, app=None): | ||
| # pylint: disable=missing-docstring | ||
| args, domains = parse_common(args=args, app=app) | ||
| if args.dry_run: | ||
| return | ||
| await run_async(args=args, domains=domains) |
There was a problem hiding this comment.
Is this function used anywhere? Looks to be mostly in tests, and in one place in qvm_template_postprocess.py (at which I'll comment separately).
There was a problem hiding this comment.
Also, I plan to use qvm_shutdown.main_async in other repositoryes that handled each vm.shutdown by their own way.
There was a problem hiding this comment.
in other repositoryes
Don't, the qubesadmin.tools is supposed to be just CLI wrappers around relevant functions. If you want to reuse some of the logic here elsewhere, move it to proper functions in other place (utils, if no other idea) use those.
| loop = asyncio.new_event_loop() | ||
| asyncio.set_event_loop(loop) | ||
| remaining_domains = set(args.domains) | ||
| for _ in range(len(args.domains)): | ||
| if not remaining_domains: | ||
| break | ||
| shutdown_failed = set() | ||
| if not args.dry_run: | ||
| for vm in remaining_domains: | ||
| try: | ||
| vm.shutdown(force=force) | ||
| except qubesadmin.exc.QubesVMNotStartedError: | ||
| pass | ||
| except qubesadmin.exc.QubesException as e: | ||
| if not args.wait: | ||
| vm.log.error('Shutdown error: {}'.format(e)) | ||
| shutdown_failed.add(vm) | ||
| if not args.wait: | ||
| if shutdown_failed: | ||
| parser.error_runtime( | ||
| 'Failed to shut down: ' + | ||
| ', '.join(vm.name for vm in shutdown_failed), | ||
| len(shutdown_failed)) | ||
| return | ||
| awaiting = remaining_domains - shutdown_failed | ||
| remaining_domains = shutdown_failed | ||
| if not awaiting: | ||
| # no VM shutdown request succeeded, no sense to try again | ||
| break | ||
| loop.run_until_complete(run_async(args=args, domains=domains)) |
| await qubesadmin.tools.qvm_shutdown.main_async( | ||
| args=["--timeout={}".format(timeout), "--", vm.name], | ||
| app=vm.app, | ||
| ) |
There was a problem hiding this comment.
This looks like overkill, you want to shutdown a single VM here. Just use vm.shutdown(wait=True), possibly via asyncio.
There was a problem hiding this comment.
But the old code is not just vm.shutdown(wait=True), it replicates qvm_shutdown, it waits for the domain-shutdown event and kills if the qube was not shutdown.
There was a problem hiding this comment.
Then do try/except with kill here? I still think calling the whole tool (via cmdline entry point) is an overkill.
| tasks = [ | ||
| asyncio.wait_for( | ||
| async_thread(qube.shutdown, force=args.force, wait=args.wait), | ||
| timeout=args.timeout, | ||
| ) | ||
| for qube in domains | ||
| ] |
There was a problem hiding this comment.
This approach significantly differs from what qvm-shutdown did before. Previously, shutdown was done in a retry loop, where shutdown was retried after previous round domains completed. This way, if any failed because of dependencies, it would retry shutdown when others are off.
This new version will fail at that. For example, assuming only sys-net, sys-firewall and sys-whonix are running, qvm-shutdown --wait sys-net sys-firewall sys-whonix (without --force) should:
- First try to shutdown all, at which point sys-net and sys-firewall will refuse to shutdown because it's used by others. But sys-whonix shutdown will get initiated.
- Wait for successfully initiated shutdown to complete.
- Try again, this time sys-firewall will shutdown cleanly, but sys-net will still fail.
- Wait for successfully initiated shutdown to complete.
- Finally, sys-net shutdown will succeed cleanly.
| self.app.expected_calls[ | ||
| ('other-vm', 'admin.vm.Shutdown', None, None)] = [ | ||
| b'2\x00QubesException\x00\x00Shutdown refused\x00', | ||
| ('other-vm', 'admin.vm.Shutdown', 'wait', None)] = [ | ||
| b'2\x00QubesVMShutdownTimeoutError\x00\x00Shutdown timed out\x00', | ||
| b'0\x00', | ||
| ] |
There was a problem hiding this comment.
This was the test for the behavior of retry-loop I described in the other comment. Your new version do that only for timeout (and changed test now does only timeout...), and only once.
| loop = asyncio.new_event_loop() | ||
| asyncio.set_event_loop(loop) | ||
| return loop.run_until_complete(run_async(domains=args.domains)) |
Waiting for the server to send the result of the API calls, allows dealing with failures/exceptions: - QubesVM.shutdown -> .shutdown(wait=True) - QubesVM.run_service -> .run_service_for_stdio It also does not block the widget from being opened again even if an action is still running. This allows, for example: - To create disposable qube and attach device to it (which takes 10s), but not block the widget from opening - To detach and shutdown disposable that is hanging, without hanging the widget For qui/tray/domains.py, the "widget.destroy()" is called earlier, cause it's not needed after the response is received. Else it hangs until the "react_to_question" finishes. For: QubesOS/qubes-issues#10648 For: QubesOS/qubes-issues#10651 For: QubesOS/qubes-issues#10835 Requires: QubesOS/qubes-core-admin#807 Requires: QubesOS/qubes-core-admin-client#469
| ", ".join(qube.name for qube in used) | ||
| ) | ||
| ) | ||
| unhandled_retry, used, timedout_retry = await shutdown( |
There was a problem hiding this comment.
Two remarks:
- exit the loop if there is nothing to shutdown anymore
- abort if no qube succeeded - if there is no progress, there is no point in retrying
There was a problem hiding this comment.
Added tests for these, it wasn't tested...
| pass | ||
| try: | ||
| await asyncio.wait_for( | ||
| asyncio.to_thread(vm.shutdown, wait=True), |
There was a problem hiding this comment.
wait=True already enforces the timeout, no need to wrap it in wait_for. And then adjust handled exception below.
There was a problem hiding this comment.
Kinda... this is a bit confusing...
Timeout is known to the client only and never passed in the API call, which means that for the freezed qube that will timeout only after 60s (either by shutdown_timeout property or libvirt timeout, depending when the freeze happens), and will raise QubesVMShutdownTimeoutError, and the call is ceased.
But when setting the timeout on the client lower than 60s, and the qube is freezed, this timeout will raise the exception asyncio.TimeoutError, but still, the server has not ended the connection as it doesn't know of the client shutdown timeout, and the client can't end the connection because python can't cancel process running in a different thread, the error message is printed, but on SystemExit, it hangs until the connection is terminated by the server.
This could be solved either by
- having asynchronous caller that can cancel requests (still, the server would continue waiting for the qube to shutdown independently of client disconnect)
- sending the timeout in the api call as an argument, so the server ends the connection
There was a problem hiding this comment.
No need for any of that. Existence of this client-side timeout is only because shutdown_timeout property didn't existed before. Let the server side enforce the timeout, that's enough.
|
Is cab1a91 intentionally in this PR? Looks unrelated to shutdown... |
Waiting for the server to send the result of the API calls, allows dealing with failures/exceptions: - QubesVM.shutdown -> .shutdown(wait=True) - QubesVM.run_service -> .run_service_for_stdio It also does not block the widget from being opened again even if an action is still running. This allows, for example: - To create disposable qube and attach device to it (which takes 10s), but not block the widget from opening - To detach and shutdown disposable that is hanging, without hanging the widget For qui/tray/domains.py, the "widget.destroy()" is called earlier, cause it's not needed after the response is received. Else it hangs until the "react_to_question" finishes. For: QubesOS/qubes-issues#10648 For: QubesOS/qubes-issues#10651 For: QubesOS/qubes-issues#10835 Requires: QubesOS/qubes-core-admin#807 Requires: QubesOS/qubes-core-admin-client#469
Waiting for the server to send the result of the API calls, allows dealing with failures/exceptions: - QubesVM.shutdown -> .shutdown(wait=True) - QubesVM.run_service -> .run_service_for_stdio It also does not block the widget from being opened again even if an action is still running. This allows, for example: - To create disposable qube and attach device to it (which takes 10s), but not block the widget from opening - To detach and shutdown disposable that is hanging, without hanging the widget For: QubesOS/qubes-issues#10648 For: QubesOS/qubes-issues#10651 For: QubesOS/qubes-issues#10835 Requires: QubesOS/qubes-core-admin#807 Requires: QubesOS/qubes-core-admin-client#469
|
This is causing a RuntimeError for the event loop on Qube Manager, I will see about moving part of the code to utils and modify qube manager to not use main(), or adapting the main() to support those requests. |
|
Marek, |
That's yet another reason why calling other's tool main function is a bad idea... Yes, better make the drive assignment reusable. |
|
CI error: https://gitlab.com/QubesOS/qubes-core-admin-client/-/jobs/14581983314
|
|
PipelineRetryFailed |
It was necessary to make alternative main async functions, as Python is picky about running async code when mixing async and sync functions.
Qube Manager is calling qvm_start.main on 3 separate locations...
By waiting for the full shutdown, it becomes possible to receive appropriate exception, that would allow adapting the client to react differently. A timeout can be interpreted and the method adjusted to kill the domain.
For: QubesOS/qubes-issues#10835
Fixes: QubesOS/qubes-issues#6132
Requires: QubesOS/qubes-core-admin#807
Adding wait parameter to
qubesadmin.tools.qvm_shutdowndoesn't work well when specifying multiple domains. I'm thinking of a better way to handle it. I thought of looping the domains to create tasks for each of them with run_in_executor and asyncio.gather at last, but that will require changing theqvm_shutdowna bit.