Skip to content

Add option to wait for domain shutdown#469

Open
ben-grande wants to merge 12 commits into
QubesOS:mainfrom
ben-grande:long-shutdown
Open

Add option to wait for domain shutdown#469
ben-grande wants to merge 12 commits into
QubesOS:mainfrom
ben-grande:long-shutdown

Conversation

@ben-grande
Copy link
Copy Markdown
Contributor

@ben-grande ben-grande commented May 6, 2026

By waiting for the full shutdown, it becomes possible to receive appropriate exception, that would allow adapting the client to react differently. A timeout can be interpreted and the method adjusted to kill the domain.

For: QubesOS/qubes-issues#10835
Fixes: QubesOS/qubes-issues#6132
Requires: QubesOS/qubes-core-admin#807


Adding wait parameter to qubesadmin.tools.qvm_shutdown doesn't work well when specifying multiple domains. I'm thinking of a better way to handle it. I thought of looping the domains to create tasks for each of them with run_in_executor and asyncio.gather at last, but that will require changing the qvm_shutdown a bit.

@codecov
Copy link
Copy Markdown

codecov Bot commented May 6, 2026

Codecov Report

❌ Patch coverage is 90.87838% with 27 lines in your changes missing coverage. Please review.
✅ Project coverage is 76.81%. Comparing base (64cd714) to head (1cc5984).

Files with missing lines Patch % Lines
qubesadmin/utils/start.py 80.95% 16 Missing ⚠️
qubesadmin/utils/__init__.py 89.36% 5 Missing ⚠️
qubesadmin/utils/resume.py 0.00% 4 Missing ⚠️
qubesadmin/tools/qvm_template_postprocess.py 50.00% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #469      +/-   ##
==========================================
- Coverage   76.83%   76.81%   -0.03%     
==========================================
  Files          53       60       +7     
  Lines        9365     9423      +58     
==========================================
+ Hits         7196     7238      +42     
- Misses       2169     2185      +16     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@ben-grande
Copy link
Copy Markdown
Contributor Author

ben-grande commented May 8, 2026

6ebacfb

It works, half of it, at least.

This will make some tests fails because it wasn't updated to have wait in the argument. But I am not sure if I will continue with the current method, since it currently blocks exiting the interpreter until the api call ceases (unblocks the thread). It is not "dangerous", but an inconvenience and might break expectations. For more information, please read the big "TODO" on top of "ThreadPoolexecutor".

The future.cancel() is False, as well as future.cancelled(). The future.running() is True. All this means is that it cannot be cancelled.

@ben-grande
Copy link
Copy Markdown
Contributor Author

I tried simper things and had the same problem.

    loop = asyncio.get_event_loop()
    with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
        for qube in domains:
            future = loop.run_in_executor(
               executor, partial(qube.shutdown, force=force, wait=args.wait)
            )
            tasks.append(future)
        loop.run_until_complete(asyncio.wait_for(asyncio.gather(*tasks), args.timeout))
async def main(...):

    ...

    loop = asyncio.get_running_loop()
    tasks = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
        for qube in domains:
            future = loop.run_in_executor(executor, qube.shutdown)
            tasks.append(future)
        async with asyncio.timeout(args.timeout):
            await asyncio.gather(*tasks)
    return

And change to asyncio.run(main()) on /usr/bin/qvm-shutdown.

Traceback (most recent call last):
  File "/usr/lib/python3.13/site-packages/qubesadmin/tools/qvm_shutdown.py", line 111, in main
    await asyncio.gather(*tasks)
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/lib/python3.13/site-packages/qubesadmin/tools/qvm_shutdown.py", line 110, in main
    async with asyncio.timeout(args.timeout):
               ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^
  File "/usr/lib64/python3.13/asyncio/timeouts.py", line 116, in __aexit__
    raise TimeoutError from exc_val
TimeoutError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/bin/qvm-shutdown", line 7, in <module>
    sys.exit(asyncio.run(main()))
             ~~~~~~~~~~~^^^^^^^^
  File "/usr/lib64/python3.13/asyncio/runners.py", line 195, in run
    return runner.run(main)
           ~~~~~~~~~~^^^^^^
  File "/usr/lib64/python3.13/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^
  File "/usr/lib64/python3.13/asyncio/base_events.py", line 712, in run_until_complete
    self.run_forever()
    ~~~~~~~~~~~~~~~~^^
  File "/usr/lib64/python3.13/asyncio/base_events.py", line 683, in run_forever
    self._run_once()
    ~~~~~~~~~~~~~~^^
  File "/usr/lib64/python3.13/asyncio/base_events.py", line 2050, in _run_once
    handle._run()
    ~~~~~~~~~~~^^
  File "/usr/lib64/python3.13/asyncio/events.py", line 89, in _run
    self._context.run(self._callback, *self._args)
    ~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.13/site-packages/qubesadmin/tools/qvm_shutdown.py", line 102, in main
    with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
         ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.13/concurrent/futures/_base.py", line 647, in __exit__
    self.shutdown(wait=True)
    ~~~~~~~~~~~~~^^^^^^^^^^^
  File "/usr/lib64/python3.13/concurrent/futures/thread.py", line 239, in shutdown
    t.join()
    ~~~~~~^^
  File "/usr/lib64/python3.13/threading.py", line 1094, in join
    self._handle.join(timeout)
    ~~~~~~~~~~~~~~~~~^^^^^^^^^
  File "/usr/lib64/python3.13/asyncio/runners.py", line 157, in _on_sigint
    raise KeyboardInterrupt()
KeyboardInterrupt

^CException ignored on threading shutdown:
Traceback (most recent call last):
  File "/usr/lib64/python3.13/threading.py", line 1536, in _shutdown
    atexit_call()
  File "/usr/lib64/python3.13/threading.py", line 1507, in <lambda>
    _threading_atexits.append(lambda: func(*arg, **kwargs))
  File "/usr/lib64/python3.13/concurrent/futures/thread.py", line 31, in _python_exit
    t.join()
  File "/usr/lib64/python3.13/threading.py", line 1094, in join
    self._handle.join(timeout)
KeyboardInterrupt:

@ben-grande
Copy link
Copy Markdown
Contributor Author

@marmarek
Copy link
Copy Markdown
Member

Maybe better to really do QubesOS/qubes-issues#4719 first?

@ben-grande
Copy link
Copy Markdown
Contributor Author

Maybe better to really do QubesOS/qubes-issues#4719 first?

Yes... I read some alternatives, but they are dirty: https://superfastpython.com/threadpoolexecutor-kill-running-tasks/

@ben-grande
Copy link
Copy Markdown
Contributor Author

But I am not sure if I will continue with the current method, since it currently blocks exiting the interpreter until the api call ceases (unblocks the thread).

Technically, this is also the current behavior. With the current code (without this PR), qvm-shutdown --timeout=3 HANGING_QUBE will not cease the connection.

ben-grande added a commit to ben-grande/qubes-manager that referenced this pull request May 12, 2026
The merge request to core-admin already avoids the main problem, of a hanging
shutdown blocking qubesd. This change avoids a hanging shutdown blocking the
interaction with the qube manager

For: QubesOS/qubes-issues#10648
Fixes: QubesOS/qubes-issues#10074
Requires: QubesOS/qubes-core-admin#807
Requires: QubesOS/qubes-core-admin-client#469
ben-grande added a commit to ben-grande/qubes-desktop-linux-manager that referenced this pull request May 13, 2026
Waiting for the server to send the result of the API calls, allows
dealing with failures/exceptions:

- QubesVM.shutdown -> .shutdown(wait=True)
- QubesVM.run_service -> .run_service_for_stdio

It also does not block the widget from being opened again even if an
action is still running. This allows, for example:

- To create disposable qube and attach device to it (which takes 10s),
  but not block the widget from opening
- To detach and shutdown disposable that is hanging, without hanging the
  widget

For qui/tray/domains.py, the "widget.destroy()" is called earlier, cause
it's not needed after the response is received. Else it hangs until the
"react_to_question" finishes.

For: QubesOS/qubes-issues#10648
For: QubesOS/qubes-issues#10651
For: QubesOS/qubes-issues#10835
Requires: QubesOS/qubes-core-admin#807
Requires: QubesOS/qubes-core-admin-client#469
@ben-grande ben-grande force-pushed the long-shutdown branch 2 times, most recently from 609befd to b544cb0 Compare May 13, 2026 16:28
@ben-grande
Copy link
Copy Markdown
Contributor Author

ben-grande commented May 13, 2026

The last commit, b544cb0, probably requires adapting the file created in qvm-shutdown to call asyncio.run(main()).

I will try again to not require the main() to be an async function, and pass the loop object... another day.

@marmarek
Copy link
Copy Markdown
Member

Besides the above, pylint complains, mostly about misplaced docstrings in tests.

@ben-grande
Copy link
Copy Markdown
Contributor Author

Sorry for the delay. I thought that rebooting a qube with libvirt was going to be easy.

@ben-grande ben-grande marked this pull request as ready for review May 20, 2026 19:12
Comment thread qubesadmin/utils.py Outdated
>>> from qubesadmin.utils import async_thread
>>> await async_thread(vm.shutdown, force=True, wait=True)
"""
return await asyncio.to_thread(functools.partial(func, *args, **kwargs))
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this really need a helper function for a single line?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not anymore, it was different then... Thanks for pointing it out.

return await asyncio.get_running_loop().run_in_executor(
    None, functools.partial(func, *args, **kwargs)
)

But checking asyncio.to_thread docs again, I don't need the functools.partial that was needed for run_in_executor, because the former accepts kwargs while the latter doesn't.

Comment thread qubesadmin/tools/qvm_shutdown.py Outdated
for qube in domains
]
results = await asyncio.gather(*tasks, return_exceptions=True)
for qube, res in zip(domains, results):
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This zip() relies on the same iteration order in domains as in the above call. Type hint says it's list (which should be fine), but actually you give it as set, which doesn't guarantee the same order (even if it works this way most of the time).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that conversion to set is unecessary. app.domains only show unique items. But, it doesn't keep order anyway, because qubesadmin.tools.QubesArgumentParser.parse_qubes_app makes a set() conversion... will fix that.

Comment thread qubesadmin/tools/qvm_shutdown.py Outdated
Comment on lines +175 to +180
async def main_async(args=None, app=None):
# pylint: disable=missing-docstring
args, domains = parse_common(args=args, app=app)
if args.dry_run:
return
await run_async(args=args, domains=domains)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this function used anywhere? Looks to be mostly in tests, and in one place in qvm_template_postprocess.py (at which I'll comment separately).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, I plan to use qvm_shutdown.main_async in other repositoryes that handled each vm.shutdown by their own way.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in other repositoryes

Don't, the qubesadmin.tools is supposed to be just CLI wrappers around relevant functions. If you want to reuse some of the logic here elsewhere, move it to proper functions in other place (utils, if no other idea) use those.

Comment thread qubesadmin/tools/qvm_shutdown.py Outdated
Comment on lines +188 to +190
loop = asyncio.new_event_loop()
asyncio.set_event_loop(loop)
remaining_domains = set(args.domains)
for _ in range(len(args.domains)):
if not remaining_domains:
break
shutdown_failed = set()
if not args.dry_run:
for vm in remaining_domains:
try:
vm.shutdown(force=force)
except qubesadmin.exc.QubesVMNotStartedError:
pass
except qubesadmin.exc.QubesException as e:
if not args.wait:
vm.log.error('Shutdown error: {}'.format(e))
shutdown_failed.add(vm)
if not args.wait:
if shutdown_failed:
parser.error_runtime(
'Failed to shut down: ' +
', '.join(vm.name for vm in shutdown_failed),
len(shutdown_failed))
return
awaiting = remaining_domains - shutdown_failed
remaining_domains = shutdown_failed
if not awaiting:
# no VM shutdown request succeeded, no sense to try again
break
loop.run_until_complete(run_async(args=args, domains=domains))
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

asyncio.run(...) instead?

Comment on lines +242 to +245
await qubesadmin.tools.qvm_shutdown.main_async(
args=["--timeout={}".format(timeout), "--", vm.name],
app=vm.app,
)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like overkill, you want to shutdown a single VM here. Just use vm.shutdown(wait=True), possibly via asyncio.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But the old code is not just vm.shutdown(wait=True), it replicates qvm_shutdown, it waits for the domain-shutdown event and kills if the qube was not shutdown.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then do try/except with kill here? I still think calling the whole tool (via cmdline entry point) is an overkill.

Comment thread qubesadmin/tools/qvm_shutdown.py Outdated
Comment on lines +80 to +86
tasks = [
asyncio.wait_for(
async_thread(qube.shutdown, force=args.force, wait=args.wait),
timeout=args.timeout,
)
for qube in domains
]
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This approach significantly differs from what qvm-shutdown did before. Previously, shutdown was done in a retry loop, where shutdown was retried after previous round domains completed. This way, if any failed because of dependencies, it would retry shutdown when others are off.
This new version will fail at that. For example, assuming only sys-net, sys-firewall and sys-whonix are running, qvm-shutdown --wait sys-net sys-firewall sys-whonix (without --force) should:

  1. First try to shutdown all, at which point sys-net and sys-firewall will refuse to shutdown because it's used by others. But sys-whonix shutdown will get initiated.
  2. Wait for successfully initiated shutdown to complete.
  3. Try again, this time sys-firewall will shutdown cleanly, but sys-net will still fail.
  4. Wait for successfully initiated shutdown to complete.
  5. Finally, sys-net shutdown will succeed cleanly.

Comment on lines 248 to 252
self.app.expected_calls[
('other-vm', 'admin.vm.Shutdown', None, None)] = [
b'2\x00QubesException\x00\x00Shutdown refused\x00',
('other-vm', 'admin.vm.Shutdown', 'wait', None)] = [
b'2\x00QubesVMShutdownTimeoutError\x00\x00Shutdown timed out\x00',
b'0\x00',
]
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was the test for the behavior of retry-loop I described in the other comment. Your new version do that only for timeout (and changed test now does only timeout...), and only once.

Comment thread qubesadmin/tools/qvm_kill.py Outdated
Comment on lines +59 to +61
loop = asyncio.new_event_loop()
asyncio.set_event_loop(loop)
return loop.run_until_complete(run_async(domains=args.domains))
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

asyncio.run() ?

ben-grande added a commit to ben-grande/qubes-desktop-linux-manager that referenced this pull request May 21, 2026
Waiting for the server to send the result of the API calls, allows
dealing with failures/exceptions:

- QubesVM.shutdown -> .shutdown(wait=True)
- QubesVM.run_service -> .run_service_for_stdio

It also does not block the widget from being opened again even if an
action is still running. This allows, for example:

- To create disposable qube and attach device to it (which takes 10s),
  but not block the widget from opening
- To detach and shutdown disposable that is hanging, without hanging the
  widget

For qui/tray/domains.py, the "widget.destroy()" is called earlier, cause
it's not needed after the response is received. Else it hangs until the
"react_to_question" finishes.

For: QubesOS/qubes-issues#10648
For: QubesOS/qubes-issues#10651
For: QubesOS/qubes-issues#10835
Requires: QubesOS/qubes-core-admin#807
Requires: QubesOS/qubes-core-admin-client#469
Comment thread qubesadmin/tools/qvm_shutdown.py Outdated
", ".join(qube.name for qube in used)
)
)
unhandled_retry, used, timedout_retry = await shutdown(
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two remarks:

  • exit the loop if there is nothing to shutdown anymore
  • abort if no qube succeeded - if there is no progress, there is no point in retrying

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added tests for these, it wasn't tested...

pass
try:
await asyncio.wait_for(
asyncio.to_thread(vm.shutdown, wait=True),
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wait=True already enforces the timeout, no need to wrap it in wait_for. And then adjust handled exception below.

Copy link
Copy Markdown
Contributor Author

@ben-grande ben-grande May 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Kinda... this is a bit confusing...

Timeout is known to the client only and never passed in the API call, which means that for the freezed qube that will timeout only after 60s (either by shutdown_timeout property or libvirt timeout, depending when the freeze happens), and will raise QubesVMShutdownTimeoutError, and the call is ceased.

But when setting the timeout on the client lower than 60s, and the qube is freezed, this timeout will raise the exception asyncio.TimeoutError, but still, the server has not ended the connection as it doesn't know of the client shutdown timeout, and the client can't end the connection because python can't cancel process running in a different thread, the error message is printed, but on SystemExit, it hangs until the connection is terminated by the server.

This could be solved either by

  • having asynchronous caller that can cancel requests (still, the server would continue waiting for the qube to shutdown independently of client disconnect)
  • sending the timeout in the api call as an argument, so the server ends the connection

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need for any of that. Existence of this client-side timeout is only because shutdown_timeout property didn't existed before. Let the server side enforce the timeout, that's enough.

@marmarek
Copy link
Copy Markdown
Member

Is cab1a91 intentionally in this PR? Looks unrelated to shutdown...

@ben-grande
Copy link
Copy Markdown
Contributor Author

Is cab1a91 intentionally in this PR? Looks unrelated to shutdown...

Unrelated, yes, made a commit first on the wrong branch, cherry picked to the correct one: #471, and forgot to remove from this one.

ben-grande added a commit to ben-grande/qubes-desktop-linux-manager that referenced this pull request May 27, 2026
Waiting for the server to send the result of the API calls, allows
dealing with failures/exceptions:

- QubesVM.shutdown -> .shutdown(wait=True)
- QubesVM.run_service -> .run_service_for_stdio

It also does not block the widget from being opened again even if an
action is still running. This allows, for example:

- To create disposable qube and attach device to it (which takes 10s),
  but not block the widget from opening
- To detach and shutdown disposable that is hanging, without hanging the
  widget

For qui/tray/domains.py, the "widget.destroy()" is called earlier, cause
it's not needed after the response is received. Else it hangs until the
"react_to_question" finishes.

For: QubesOS/qubes-issues#10648
For: QubesOS/qubes-issues#10651
For: QubesOS/qubes-issues#10835
Requires: QubesOS/qubes-core-admin#807
Requires: QubesOS/qubes-core-admin-client#469
ben-grande added a commit to ben-grande/qubes-desktop-linux-manager that referenced this pull request May 27, 2026
Waiting for the server to send the result of the API calls, allows
dealing with failures/exceptions:

- QubesVM.shutdown -> .shutdown(wait=True)
- QubesVM.run_service -> .run_service_for_stdio

It also does not block the widget from being opened again even if an
action is still running. This allows, for example:

- To create disposable qube and attach device to it (which takes 10s),
  but not block the widget from opening
- To detach and shutdown disposable that is hanging, without hanging the
  widget

For: QubesOS/qubes-issues#10648
For: QubesOS/qubes-issues#10651
For: QubesOS/qubes-issues#10835
Requires: QubesOS/qubes-core-admin#807
Requires: QubesOS/qubes-core-admin-client#469
@ben-grande
Copy link
Copy Markdown
Contributor Author

This is causing a RuntimeError for the event loop on Qube Manager, qvm_start.main() is being used there:

https://github.com/search?q=org%3AQubesOS+%22from+qubesadmin.tools%22+NOT+repo%3AQubesOS%2Fqubes-core-admin-client&type=code

I will see about moving part of the code to utils and modify qube manager to not use main(), or adapting the main() to support those requests.

@ben-grande
Copy link
Copy Markdown
Contributor Author

Marek, qubesmanager is using qvm_start.main. I can make it use qvm_start.run_async, or do you prefer that I make qvm_start with drive assignment reusable?

@marmarek
Copy link
Copy Markdown
Member

Marek, qubesmanager is using qvm_start.main. I can make it use qvm_start.run_async, or do you prefer that I make qvm_start with drive assignment reusable?

That's yet another reason why calling other's tool main function is a bad idea... Yes, better make the drive assignment reusable.

@ben-grande
Copy link
Copy Markdown
Contributor Author

CI error: https://gitlab.com/QubesOS/qubes-core-admin-client/-/jobs/14581983314

$ sudo dnf install -y python3-rpm python3-xlib
sudo: PAM account management error: Authentication service cannot retrieve authentication info
sudo: a password is required

@ben-grande
Copy link
Copy Markdown
Contributor Author

PipelineRetryFailed

It was necessary to make alternative main async functions, as Python is
picky about running async code when mixing async and sync functions.
Qube Manager is calling qvm_start.main on 3 separate locations...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Don't retry shutdown on qube that was already shutdown in the same qvm-shutdown invocation using --all

3 participants