feat: async run_experiment via RunHandle + cancellation + status widget #10
hinderling wants to merge 10 commits into
Move the MDA feed loop onto a worker thread, expose live status through a
RunHandle + psygnal Signal, and add a minimal napari widget that mirrors
the current run.
Breaking change:
ctrl.run_experiment(events, ...) and ctrl.continue_experiment(...) now
return a RunHandle immediately instead of blocking until the run is
done. Existing notebooks that did `ctrl.run_experiment(events, ...)`
must be updated to either `handle = ctrl.run_experiment(events, ...);
handle.wait()` for the old blocking semantics, or to use the new
non-blocking flow (poll handle.status(), subscribe to
handle.statusChanged, call handle.cancel() to stop early).
What's in this commit:
- faro/core/run_status.py (new):
* RunStatus -- immutable snapshot dataclass with state, event/FOV
indices, frame count, lag_ms, error info.
* RunHandle -- owns the worker thread + cooperative cancel event,
exposes status()/wait()/cancel()/is_running() + a psygnal
statusChanged signal that emits the latest RunStatus on each update.
Subscribers on the main thread see queued-connection delivery via
psygnal's Qt integration.
- faro/core/controller.py:
* Controller exposes a class-level runStarted = Signal(object). Fires
on every new run/continue so widgets can re-bind.
* run_experiment / continue_experiment spawn a worker thread, return
the handle, emit runStarted. Validation still happens synchronously
so a bad event list raises on the calling thread.
* _run_worker centralises pre-flight setup (writer init -- including
the potentially-slow zarr rmtree on overwrite -- and Analyzer
construction) and wraps the feed loop in try/except so worker-side
failures land in handle.fatal_error rather than crashing the user's session.
* _run_mda_with_events accepts the handle, checks handle.cancel_event
at each loop iteration and in the backpressure throttle, asks the
engine to cancel the in-flight event when set, and emits status
updates on each RTMEvent dequeue.
* _on_frame_ready (and ControllerSimulated._on_frame_ready) call a
shared _bump_status_for_frame helper that increments
n_frames_received and computes lag_ms vs event.min_start_time.
* With the feed loop now off the main thread, the prior Qt-pumping helpers
(_pump_qt_and_sleep, _qt_join, _wait_for_frame_pumping_qt) and the
superqt ensure_main_thread import are obsolete and removed. The
preview-layer machinery (viewer=, _on_preview_frame, _apply_preview,
PREVIEW_LAYER_NAME) is also removed -- napari-micromanager's own
_NapariMDAHandler already routes generator events into the preview
layer.
* finish_experiment now waits for the current handle before shutting
down the Analyzer.
* _pending_sentinels is now guarded by a Lock, since extend_experiment
runs on the calling thread while the feed loop runs on the worker.
- faro/widgets/experiment_status.py (new):
* ExperimentStatusWidget -- read-out of state, FOV, event index,
frame count, lag, elapsed time, error count. Has a Stop button
that calls handle.cancel(). Subscribes to controller.runStarted
so it automatically re-binds when a new run begins; cleans up the
previous handle's signal subscription on each rebind.
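To make the handle contract above concrete, here is a minimal, stdlib-only sketch of the same shape. It uses a frozen `RunStatus` snapshot that is replaced on every update, and a handle that owns a worker thread, a cooperative cancel flag, and `status()` / `wait()` / `cancel()` / `is_running()`. It is not faro's implementation. A plain callback list stands in for the psygnal `statusChanged` signal, `toy_feed_loop` is a hypothetical worker body, and the `cancelling` → `done` transition is an assumption.

```python
import threading
import time
from dataclasses import dataclass, replace
from typing import Callable, List, Optional


@dataclass(frozen=True)
class RunStatus:
    """Immutable snapshot; every update publishes a fresh instance."""
    state: str = "pending"
    current_event_index: int = -1
    n_events_total: int = 0
    n_frames_received: int = 0


class RunHandle:
    """Toy handle: owns the worker thread and a cooperative cancel flag."""

    def __init__(self, n_events: int) -> None:
        self.cancel_event = threading.Event()
        self._lock = threading.Lock()
        self._status = RunStatus(n_events_total=n_events)
        # Plain callback list standing in for the psygnal statusChanged signal.
        self.listeners: List[Callable[[RunStatus], None]] = []
        self._thread: Optional[threading.Thread] = None

    def status(self) -> RunStatus:
        with self._lock:
            return self._status

    def update(self, **changes) -> None:
        with self._lock:
            self._status = replace(self._status, **changes)
            snapshot = self._status
        for callback in list(self.listeners):
            callback(snapshot)  # runs on whichever thread called update()

    def start(self, target: Callable[["RunHandle"], None]) -> "RunHandle":
        self._thread = threading.Thread(target=target, args=(self,), daemon=True)
        self._thread.start()
        return self  # returned immediately; the caller is not blocked

    def cancel(self) -> None:
        self.cancel_event.set()  # only a request; the worker decides when to stop

    def is_running(self) -> bool:
        return self._thread is not None and self._thread.is_alive()

    def wait(self, timeout: Optional[float] = None) -> RunStatus:
        if self._thread is not None:
            self._thread.join(timeout)
        return self.status()


def toy_feed_loop(handle: RunHandle) -> None:
    """Hypothetical worker body: checks the cancel flag once per event."""
    handle.update(state="running")
    for i in range(handle.status().n_events_total):
        if handle.cancel_event.is_set():
            handle.update(state="cancelling")
            break
        time.sleep(0.01)  # pretend to acquire one frame
        handle.update(current_event_index=i, n_frames_received=i + 1)
    handle.update(state="done")


# Old blocking semantics: start the run, then wait for it.
final = RunHandle(n_events=5).start(toy_feed_loop).wait()
```

Because `start()` returns immediately, `wait()` reproduces the old blocking call. `cancel()` only sets a flag, which the worker checks at its next event boundary.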
Verified end-to-end via a Qt smoke test:
- Live updates flow from the worker thread to the widget on the main
thread (psygnal+Qt queued delivery).
- Stop button triggers handle.cancel(); the worker's cancel-check
fires within one iteration and the run exits at the next event
boundary.
- Starting a new run re-binds the widget to the new handle and resets
the progress bar / counters.
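The cancel-check behaviour verified above could be sketched as follows: the worker notices the flag within one iteration, including while it is held in the backpressure throttle. `feed_loop`, `frames_done`, `on_cancel`, `max_ahead` and `STOP_EVENT` are illustrative stand-ins for the controller's feed loop and engine-cancel call, not faro's API. The default window of 3 events is an assumption.

```python
import queue
import threading
from typing import Callable, Iterable

STOP_EVENT = object()  # hypothetical stand-in for the controller's stop sentinel


def feed_loop(
    events: Iterable,
    engine_queue: "queue.Queue",
    frames_done: Callable[[], int],
    cancel_event: threading.Event,
    on_cancel: Callable[[], None],
    max_ahead: int = 3,
    poll_s: float = 0.005,
) -> int:
    """Feed events to the engine, staying at most max_ahead ahead of acquired frames.

    Returns the number of events fed before finishing or being cancelled.
    """
    fed = 0
    try:
        for event in events:
            # Backpressure throttle: wait for frames to catch up, but wake on cancel.
            while not cancel_event.is_set() and fed - frames_done() >= max_ahead:
                cancel_event.wait(poll_s)
            # Per-iteration cancel check: stop at the next event boundary.
            if cancel_event.is_set():
                on_cancel()  # e.g. ask the engine to abort the in-flight event
                break
            engine_queue.put(event)
            fed += 1
    finally:
        engine_queue.put(STOP_EVENT)  # always tell the engine to stop
    return fed
```

Inside the throttle, `cancel_event.wait(poll_s)` returns as soon as the flag is set, so a cancel wakes the loop immediately rather than after a full sleep.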
The OmeZarrWriter init in _run_worker still pulled image height/width via self._mic.mmc.getImageHeight/Width -- a pymmcore-plus-specific call that breaks any non-pymmcore microscope. Use the AbstractMicroscope-level convention: subclasses populate self.image_height / self.image_width on the microscope instance (Moench already does this in init_scope). Fall back to mmc if the attributes aren't present but mmc is, so existing pymmcore-only microscopes keep working without code changes. Raise a clear error when neither path is available.
Three independent bugs surfaced when running the new async run_experiment + ExperimentStatusWidget against a napari viewer (reproduced with the optogenetic virtual_microscope backend):
1. pymmcore-plus's signals_backend() auto-selects the *qt* backend whenever a QApplication is loaded. core.mda.events.frameReady then becomes a QtCore.SignalInstance, and cross-thread emits land in a Qt.QueuedConnection, where they are delivered only when the main thread pumps events. With Controller.run_experiment now spawning a worker and RunHandle.wait() joining on it, the main thread is typically idle-blocked exactly when the engine is firing frames. As a result the controller's _on_frame_ready never ran, the engine completed "successfully" with zero frames received, and the pipeline never saw any data. Force PYMM_SIGNALS_BACKEND=psygnal in faro/microscope/base.py so the data path stays direct/synchronous on the engine thread regardless of whether Qt is loaded. The widget-side path (RunHandle.statusChanged) still uses psygnal's own queued delivery; see fix #2.
2. ExperimentStatusWidget connected handle.statusChanged with the default (direct) connection. Status updates emitted from the worker thread therefore ran the widget's _refresh slot synchronously off the main thread, calling QLabel.setText / QProgressBar.setValue from a non-GUI thread. Under napari that lands in vispy's OpenGL compositor and aborts with "Cannot make QOpenGLContext current in a different thread" -> SIGABRT (a hard kernel crash in VSCode Jupyter). Switch to connect(..., thread="main") so psygnal queues the call into its main-thread queue.
3. psygnal's queued callbacks live in QueuedCallback._GLOBAL_QUEUE, which nothing drains by default. The widget would be invoked on the main thread, but only when something explicitly calls psygnal.emit_queued(). RunHandle's docstring claims automatic Qt delivery; that's not how psygnal actually works. Call psygnal.qt.start_emitting_from_queue() in the widget's __init__, which installs a main-thread QTimer that calls emit_queued() on every Qt event-loop tick. It is idempotent and global, so multiple widgets / multiple runs are safe.
Lockfile: bump pymmcore-widgets (8c8f76e -> 48ff414) to pick up the fix for an unrelated upstream crash in pymmcore_widgets._presets_widget._on_property_changed when it is handed an empty device label (virtual_microscope's shutter). Without that bump, the MDA engine itself aborts on the first setShutterOpen() once frames actually start flowing.
Verified end-to-end against virtual_microscope's optogenetic backend:
- headless async run: 5/5 frames (regression check, unchanged)
- napari.Viewer() + handle.wait(): 5/5 frames (was 0/5)
- napari + napari-micromanager + widget: 5/5 frames, no crash, exit 0
- widget visibly updates progress / frames / state mid-experiment (sampled QLabel.text() while pumping Qt events)
- 87 unit tests still pass
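Bugs #2 and #3 come down to which thread a slot runs on and what drains the queue. The stdlib model below is not psygnal's code. `ToySignal` and `drain_pending` are hypothetical stand-ins that mimic the behaviour described above. A slot connected with `thread="main"` is queued when the emit happens off the main thread, and it only runs once something drains that queue. The PR gives that draining job to `psygnal.qt.start_emitting_from_queue()`.

```python
import queue
import threading
from typing import Any, Callable, List, Optional, Tuple

# Process-wide queue of callbacks that must run on the main thread.
_PENDING: "queue.Queue[Tuple[Callable[..., Any], tuple]]" = queue.Queue()


class ToySignal:
    """Models thread="main" delivery: off-main emits are queued, not called."""

    def __init__(self) -> None:
        self._slots: List[Tuple[Callable[..., Any], bool]] = []

    def connect(self, slot: Callable[..., Any], thread: Optional[str] = None) -> None:
        self._slots.append((slot, thread == "main"))

    def emit(self, *args: Any) -> None:
        on_main = threading.current_thread() is threading.main_thread()
        for slot, wants_main in self._slots:
            if wants_main and not on_main:
                _PENDING.put((slot, args))  # parked until someone drains the queue
            else:
                slot(*args)  # direct call on whichever thread emitted


def drain_pending() -> int:
    """Run every queued callback on the calling thread; return how many ran.

    In the widget, a Qt timer on the GUI thread has to do this repeatedly.
    """
    ran = 0
    while True:
        try:
            slot, args = _PENDING.get_nowait()
        except queue.Empty:
            return ran
        slot(*args)
        ran += 1
```

Draining must be repeated, not done once. A single call delivers the backlog and then goes silent again, which is why a recurring main-thread timer is the natural driver.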
Sibling of demo_sim_optogenetic.ipynb that exercises the new async run_experiment + RunHandle + ExperimentStatusWidget end-to-end against virtual_microscope's optogenetic backend, with the status widget docked in a live napari viewer. Walks through:
- handle = ctrl.run_experiment(...) is non-blocking, so the kernel stays free
- poll handle.status() while it runs
- subscribe to handle.statusChanged from the kernel side
- cancel via the widget's Stop button or handle.cancel()
- handle.wait() blocks if you want the old synchronous semantics
- continue_experiment() re-binds the widget automatically via runStarted
Phases are concatenated with combine(..., axis="t") per the new RTMSequence API.
Backend changes that make an async run inspectable and steerable --
the data the new ExperimentStatusWidget renders, plus two bug fixes
surfaced while building it.
run_status.py
- RunHandle.events: optional snapshot of the (sorted) RTMEvents the
handle is driving, so widgets can render per-event visualisations
(event strip, FOV map) that need the full plan up front.
- Pause/resume: RunState gains "pausing"/"paused"; RunHandle gains
pause()/resume()/is_paused() and a pause_event the feed loop polls.
cancel() now also clears the pause event so a cancel while paused
still releases the feed loop.
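Here is a minimal model of the pause flags above. It assumes that a set `pause_event` means "paused", as the cancel behaviour implies. `PausableHandle` and `wait_if_paused` are hypothetical names. The sketch shows why `cancel()` also clears the pause flag: the idle loop only watches that flag, so clearing it is what lets a cancel reach a paused feed loop.

```python
import threading
import time


class PausableHandle:
    """Toy model of the pause/resume/cancel flags."""

    def __init__(self) -> None:
        self.cancel_event = threading.Event()
        self.pause_event = threading.Event()  # set => a pause was requested
        self.state = "running"

    def pause(self) -> None:
        self.state = "pausing"  # the feed loop flips this to "paused"
        self.pause_event.set()

    def resume(self) -> None:
        self.pause_event.clear()

    def is_paused(self) -> bool:
        return self.pause_event.is_set()

    def cancel(self) -> None:
        self.cancel_event.set()
        self.pause_event.clear()  # release a feed loop idling in the pause wait


def wait_if_paused(handle: PausableHandle, poll_s: float = 0.01) -> bool:
    """Called by the feed loop before pulling the next event.

    Returns True to keep feeding, False if the run was cancelled.
    """
    if handle.pause_event.is_set():
        handle.state = "paused"
        while handle.pause_event.is_set():  # cancel() clears this flag too
            time.sleep(poll_s)
        if not handle.cancel_event.is_set():
            handle.state = "running"
    return not handle.cancel_event.is_set()
```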
controller.py
- run_experiment / continue_experiment sort events once (by
min_start_time, then position) and stash the sorted list on the
handle, so the order the worker processes matches what the widget
displays.
- Feed loop honors pause_event: before pulling the next RTMEvent it
checks the flag, flips state to "paused", and idles until resume()
-- the MDA engine drains whatever is already queued, then waits.
- fix: the engine queue (self._queue) is recreated per run. The
finally-block feeds a STOP_EVENT sentinel to stop the engine; on a
*cancelled* run cancel_mda() aborts the engine, which may stop
without draining the queue, leaving stale events + the sentinel
behind. Reusing that queue made the next run's engine consume the
stale sentinel and exit after a few events ("stuck at 3/80"). A
fresh queue per run fixes it.
- fix: _bump_status_for_frame skips IMG_STIM frames. A stim emission
is the SLM-illuminated snap paired with its imaging frame; counting
it double-updated the status (lag/elapsed refreshing twice per stim
event) and made n_frames_received drift away from the RTMEvent
count. Imaging + ref frames are the meaningful data frames.
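The stale-sentinel failure above is easy to reproduce with a plain `queue.Queue`. All names in this sketch are hypothetical, and the real engine is pymmcore-plus's MDA runner, not `engine_consume`. The aborted run leaves two undrained events plus its `STOP_EVENT` behind. Reusing that queue makes the next run's engine stop before reaching any new event, while a fresh queue runs all 80.

```python
import queue
from typing import List

STOP_EVENT = object()  # hypothetical stand-in for the controller's sentinel


def engine_consume(q: "queue.Queue") -> List[str]:
    """Toy engine: executes items until it dequeues STOP_EVENT."""
    executed = []
    while True:
        item = q.get()
        if item is STOP_EVENT:
            return executed
        executed.append(item)


def leftover_from_cancelled_run() -> "queue.Queue":
    """Queue state after an aborted engine stopped without draining it."""
    q = queue.Queue()
    q.put("old-3")  # events the aborted engine never executed
    q.put("old-4")
    q.put(STOP_EVENT)  # the cancelled run's finally-block sentinel
    return q


def run(events: List[str], q: "queue.Queue") -> List[str]:
    """Feed a run's events plus its own STOP_EVENT, then let the engine drain."""
    for event in events:
        q.put(event)
    q.put(STOP_EVENT)
    return engine_consume(q)


new_events = [f"new-{i}" for i in range(80)]
buggy = run(new_events, leftover_from_cancelled_run())  # queue reused across runs
fixed = run(new_events, queue.Queue())                  # fresh queue per run
```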
Verified end-to-end against the optogenetic virtual-microscope backend:
cancel mid-run then restart reaches steady state (no stall); pause
halts feeding after the backpressure window drains and resume continues
to completion; frame count tracks RTMEvents 1:1 for single-channel plans.
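The 1:1 frame/RTMEvent accounting above follows from the `IMG_STIM` rule. This is a minimal sketch, assuming each stim event emits one imaging frame plus one stim snap. Only the `IMG_STIM` label comes from the PR. `FrameCounts`, `bump_status_for_frame` and the "img" / "ref" labels are made up for illustration.

```python
from dataclasses import dataclass, replace
from typing import Iterable, Tuple

IMG_STIM = "IMG_STIM"  # the one frame-type label taken from the PR


@dataclass(frozen=True)
class FrameCounts:
    n_frames_received: int = 0
    last_event_index: int = -1  # what a progress cursor would follow


def bump_status_for_frame(status: FrameCounts, event_index: int, frame_type: str) -> FrameCounts:
    """Count data frames only; a stim snap rides along with its imaging frame."""
    if frame_type == IMG_STIM:
        return status  # no bump, so no second status update for the same event
    return replace(
        status,
        n_frames_received=status.n_frames_received + 1,
        last_event_index=event_index,
    )


def replay(emissions: Iterable[Tuple[int, str]]) -> FrameCounts:
    status = FrameCounts()
    for event_index, frame_type in emissions:
        status = bump_status_for_frame(status, event_index, frame_type)
    return status


# Four RTMEvents; event 1 is a stim event and emits an imaging frame plus a stim snap.
emissions = [(0, "img"), (1, "img"), (1, IMG_STIM), (2, "ref"), (3, "img")]
counts = replay(emissions)
```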
Rework the minimal status widget into a full run dashboard, driven by
the RunHandle data exposed in the previous commit.
Components (top to bottom):
- State chip -- RUNNING / PAUSED / DONE / ... as plain text in a
translucent-neutral rounded chip (no per-state fill: a colored
banner competed with the imaging/stim/ref legend colors).
- Legend chips -- imaging / stim / ref; the chip matching the current
event type is fully opaque, the others dimmed.
- EventStrip -- one cell per RTMEvent, color-coded by type. Past +
current cells opaque (progress fill), future cells dimmed. Same-type
runs are coalesced into single fills so thousands of events render
with correct alpha instead of over-stacking at sub-pixel widths.
Empty state draws a "(no events loaded)" placeholder.
- FovMap -- one dot per unique FOV position, equal-aspect (a straight
line of FOVs stays a line), grey visit-order path, active dot
recolored to the current event type. Pinned square via resizeEvent.
Paints its own rounded panel background; "FOV X/Y" counter in the
corner.
- Stats form -- event N/M, elapsed, scheduled, lag, remaining, errors.
Times formatted hh:mm:ss with the leading unit suffixed and dropped
when zero; lag turns red past 5 s. Wrapped in a shaded panel echoing
napari's layer-controls boxes.
- Pause/Resume + Stop buttons.
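The EventStrip's run coalescing can be expressed as a pure function from event types to rectangles. This is a sketch under assumptions: `strip_fills` is a hypothetical helper. Splitting a run at the progress cursor, so that past + current cells stay opaque while the rest of the run is dimmed, is one way to combine coalescing with the progress fill, not necessarily what the widget does.

```python
from itertools import groupby
from typing import List, Sequence, Tuple

Fill = Tuple[str, float, float, bool]  # (event type, x0, x1, opaque?)


def strip_fills(kinds: Sequence[str], current: int, width_px: float) -> List[Fill]:
    """One rectangle per same-type run, split at the progress cursor.

    Cells 0..current are drawn opaque (past + current) and later cells dimmed.
    Drawing runs instead of single cells keeps the number of translucent
    fills small, so alpha does not stack up when cells are sub-pixel wide.
    """
    n = len(kinds)
    if n == 0:
        return []  # the widget draws a "(no events loaded)" placeholder instead
    px = width_px / n
    fills: List[Fill] = []
    start = 0
    for event_type, run in groupby(kinds):
        stop = start + sum(1 for _ in run)
        cut = min(max(current + 1, start), stop)  # first dimmed cell in this run
        if cut > start:
            fills.append((event_type, start * px, cut * px, True))
        if stop > cut:
            fills.append((event_type, cut * px, stop * px, False))
        start = stop
    return fills
```

For 3000 events of one type, this produces two fills instead of 3000.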
Threading / theming details:
- statusChanged is connected with thread="main" and the widget calls
psygnal.qt.start_emitting_from_queue() so worker-thread emits are
delivered on the GUI thread (drives QWidgets safely under napari).
- A 250 ms QTimer ticks the elapsed/remaining clocks between status
emissions so time fields don't freeze between frames.
- The strip cursor tracks n_frames_received (actual snaps), not
n_events_consumed (the feed loop runs 3-4 ahead via backpressure,
which made the strip jump several cells at run start).
- Colors/fonts derive from the Qt palette so the widget adapts to
napari's light/dark theme; corner radii match napari widgets.
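The Stats form's time fields are re-rendered by the 250 ms timer between emissions, so they need a formatter for the "hh:mm:ss, leading unit suffixed and dropped when zero" rule. The exact output format below is one reading of that rule and is an assumption, as are the helper names `format_duration` and `lag_is_late`. The 5 s red threshold is from the PR.

```python
def format_duration(seconds: float) -> str:
    """hh:mm:ss with the leading unit as a suffix; leading zero units dropped.

    One reading of the Stats-form rule: 3723 s -> "1:02:03 h",
    303 s -> "5:03 m", 45 s -> "45 s".
    """
    total = max(0, int(round(seconds)))
    hours, rem = divmod(total, 3600)
    minutes, secs = divmod(rem, 60)
    if hours:
        return f"{hours}:{minutes:02d}:{secs:02d} h"
    if minutes:
        return f"{minutes}:{secs:02d} m"
    return f"{secs} s"


LAG_WARN_MS = 5000.0  # "lag turns red past 5 s"


def lag_is_late(lag_ms: float) -> bool:
    return lag_ms > LAG_WARN_MS
```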
Add a second stage position (20, 20, 0) to the baseline / stim / recovery sequences so the demo exercises a 2-FOV acquisition -- the ExperimentStatusWidget's FOV map then shows both positions and the visit-order path between them. Reduce the frame interval from 1.5 s to 1 s.
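With two stage positions, the FovMap needs the unique positions in visit order and a single scale for both axes, so that a straight line of FOVs stays straight. This is a sketch that assumes, for illustration, that the existing position sits at the origin. `unique_fovs`, `fit_equal_aspect`, the rounding to 3 decimals and the 10 px margin are all hypothetical choices.

```python
from typing import Iterable, List, Sequence, Tuple

Point = Tuple[float, float]


def unique_fovs(positions: Iterable[Sequence[float]], ndigits: int = 3) -> List[Point]:
    """Unique (x, y) stage positions in first-visit order; z is ignored."""
    seen = set()
    ordered: List[Point] = []
    for pos in positions:
        key = (round(pos[0], ndigits), round(pos[1], ndigits))
        if key not in seen:
            seen.add(key)
            ordered.append(key)
    return ordered


def fit_equal_aspect(points: List[Point], width: float, height: float,
                     margin: float = 10.0) -> List[Point]:
    """Map stage coordinates to widget pixels with one scale for both axes (no y-flip)."""
    x0, y0 = min(p[0] for p in points), min(p[1] for p in points)
    span_x = max(p[0] for p in points) - x0
    span_y = max(p[1] for p in points) - y0
    scales = []
    if span_x:
        scales.append((width - 2 * margin) / span_x)
    if span_y:
        scales.append((height - 2 * margin) / span_y)
    scale = min(scales) if scales else 1.0  # a single FOV needs no scaling
    return [(margin + (x - x0) * scale, margin + (y - y0) * scale) for x, y in points]


# Demo plan: origin position (assumed) plus (20, 20, 0), visited alternately.
visits = [(0, 0, 0), (20, 20, 0), (0, 0, 0), (20, 20, 0)]
fovs = unique_fovs(visits)
pixels = fit_equal_aspect(fovs, width=120, height=120)
```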
d473b9b to 3c0e798
@alandolt can you have a look and see if there are any general issues with this architecture change? There are still a few open TODOs before merging, but I think the main idea is there! It would be great to have your input before I start migrating the other notebooks etc. I think this will also be useful longer-term, e.g. for running experiments on different microscopes simultaneously with BO, in combo with
Add FrameDispenser.cancel() and the FrameWaitCancelled exception so a thread blocked in wait_for_frame / get_predecessor is woken immediately instead of sitting out the full timeout. This lets an experiment abort promptly: a feed loop parked in an up-to-80s stim-mask wait is released the instant the run is cancelled.
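The prompt-wakeup behaviour described above is the classic condition-variable pattern. Waiters sleep on a `threading.Condition` with a timeout, and `cancel()` flips a flag and calls `notify_all()`. `MiniFrameDispenser` is a stdlib stand-in, not faro's `FrameDispenser`; only the `FrameWaitCancelled` name is borrowed from the PR.

```python
import threading
import time
from typing import Any, Dict, Hashable


class FrameWaitCancelled(Exception):
    """Raised in a waiting thread when the dispenser is cancelled."""


class MiniFrameDispenser:
    def __init__(self) -> None:
        self._cond = threading.Condition()
        self._frames: Dict[Hashable, Any] = {}
        self._cancelled = False

    def put(self, key: Hashable, frame: Any) -> None:
        with self._cond:
            self._frames[key] = frame
            self._cond.notify_all()

    def cancel(self) -> None:
        """Wake every waiter now instead of letting it sit out its timeout."""
        with self._cond:
            self._cancelled = True
            self._cond.notify_all()

    def wait_for_frame(self, key: Hashable, timeout: float) -> Any:
        with self._cond:
            ready = self._cond.wait_for(
                lambda: self._cancelled or key in self._frames, timeout
            )
            if self._cancelled:
                raise FrameWaitCancelled(key)
            if not ready:
                raise TimeoutError(f"no frame for {key!r} after {timeout}s")
            return self._frames[key]
```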
Cancellation: RunHandle gains an on_cancel hook, invoked synchronously from cancel(), that wakes a feed loop blocked in a stim-mask wait via Analyzer.cancel_pending_waits(). Previously a cancel issued during that wait took up to the stim-mask timeout (~80s) to take effect, leaving the frame handler connected in the meantime.
Queue stats: Analyzer.queue_stats() / Controller.queue_stats() expose storage, pipeline and deferred queue depths for the status widget.
finish_experiment runs its teardown (run wait + Analyzer drain) on a worker thread and pumps Qt, so napari stays responsive during the drain.
Lag is anchored to the first frame's acquisition start rather than the worker's start time, so worker/engine startup (~1s) is no longer charged to every lag reading.
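Anchoring lag on the first frame can be sketched as a tiny tracker. The first frame defines acquisition time zero, and each later frame's lag is how far it trails its scheduled `min_start_time`. `LagTracker` is hypothetical, and treating `min_start_time` as seconds from acquisition start is an assumption. With ~1 s of startup before the first frame, anchoring on the worker start instead would add about 1000 ms to every reading.

```python
from typing import Optional


class LagTracker:
    """Lag of each frame against its scheduled offset (min_start_time, in seconds).

    The clock is anchored on the first frame that arrives, so the worker's and
    engine's startup time is never charged to the lag readings.
    """

    def __init__(self) -> None:
        self._anchor: Optional[float] = None

    def lag_ms(self, min_start_time: float, t_frame: float) -> float:
        if self._anchor is None:
            # The first frame defines "time zero" of the acquisition.
            self._anchor = t_frame - min_start_time
        return ((t_frame - self._anchor) - min_start_time) * 1000.0
```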
Stop now cancels the run and then runs finish_experiment(), so the next run starts clean instead of leaking the old Analyzer; the state banner shows STOPPING... while the drain runs. Stats are split into three panels (timing / queues / errors). The storage and pipeline queue depths render as grayscale fill bars that turn red past 80% of capacity; deferred shows as a plain count. The FovMap is freely resizable instead of pinned square.
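Each fill bar reduces to a fraction and a color per bounded queue. This sketch uses a hypothetical `queue_bar` helper. The 80% threshold is from the PR; clamping overfull queues to a full bar is an assumption. The deferred depth is shown as a plain count and needs no helper.

```python
from typing import Tuple

WARN_FRACTION = 0.8  # bars turn red past 80% of capacity


def queue_bar(depth: int, capacity: int) -> Tuple[float, str]:
    """Fill fraction in [0, 1] and bar color for one bounded queue."""
    if capacity <= 0:
        return 0.0, "gray"  # unknown capacity: nothing to fill
    fraction = min(max(depth / capacity, 0.0), 1.0)
    return fraction, ("red" if fraction > WARN_FRACTION else "gray")
```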
Looks super cool and well executed. Thanks.
Summary
Move the MDA feed loop onto a worker thread, expose live status through a
`RunHandle` (psygnal `Signal`), and add a napari dock widget that mirrors + steers the current run. Replaces the synchronous-blocking `run_experiment` / `continue_experiment` API.
Why
The controller's feed loop ran on the main thread, so:
- `run_experiment` blocked the calling cell: no interactive monitoring / cancellation without Ctrl-C (which sometimes left device state half-set).
Moving the loop onto its own thread fixes all of these: napari is responsive by construction, the cell returns immediately, and cancellation / pause / live status become natural.
What changed
New:
`faro/core/run_status.py`
- `RunStatus`: immutable snapshot dataclass with `state`, `current_event_index`, `current_fov`, `n_events_total`, `n_events_consumed`, `n_frames_received`, `started_at` / `finished_at`, `lag_ms`, `background_errors`, `fatal_error`, …
- `RunHandle`: owns the worker thread + cooperative cancel/pause events and carries the run's (sorted) event list. Methods: `status()`, `wait()`, `cancel()`, `pause()`, `resume()`, `is_running()`, `is_paused()`. Signal: `statusChanged` (psygnal), emitting the latest `RunStatus`.
- `RunState`: `pending → running ⇄ pausing/paused → done/error` (`cancelling` on cancel).
`faro/core/controller.py`
- `Controller.runStarted = Signal(object)` fires on each new run/continue, carrying the fresh `RunHandle`.
- `run_experiment` / `continue_experiment` spawn a worker thread and return the handle immediately; validation still runs synchronously on the caller. Events are sorted once and stashed on the handle so the widget renders them in execution order.
- `_run_worker` centralises pre-flight setup and wraps the feed loop so failures land in `handle.fatal_error` instead of crashing the user.
- `_run_mda_with_events` polls `cancel_event` and `pause_event` each iteration: pause halts feeding after the in-flight backpressure window drains; resume continues.
- … `STOP_EVENT` behind; reusing the queue made the next run's engine consume that sentinel and stall after a few events ("stuck at 3/80").
- `_bump_status_for_frame` skips `IMG_STIM` snaps: a stim emission is the SLM-illuminated snap paired with its imaging frame; counting it double-updated lag/elapsed and drifted the frame count off the RTMEvent count.
- `_NapariMDAHandler` keeps routing frames into the `preview` layer throughout the run; the controller just stops continuous sequence acquisition once at MDA start to avoid a snap-buffer race. Notebooks can drop the old "break the `CoreViewerLink` before running" dance.
New:
`faro/widgets/experiment_status.py`
- `ExperimentStatusWidget`: a napari dock panel that mirrors and controls the current run: … `runStarted`.
Async/Qt fixes folded in
- `PYMM_SIGNALS_BACKEND=psygnal` forced in `faro/microscope/base.py`: with a `QApplication` loaded, pymmcore-plus otherwise picks the Qt signal backend and queues `frameReady` to the main thread; if the main thread is blocked (`handle.wait()`), frames never reach the controller. Forcing psygnal keeps the data path direct/synchronous on the engine thread.
- … `statusChanged` with `thread="main"` + drives `psygnal.qt.start_emitting_from_queue()` so worker-thread emits reach QWidgets safely.
- `uv.lock`: bumped `pymmcore-widgets` past an upstream fix (`_presets_widget` crashing on an empty device label during MDA events).
BREAKING: notebook updates required
Before
After — choose one:
(a) Blocking equivalent (smallest diff):
(b) Non-blocking, with status / cancel / pause:
Optional napari widget:
Demo notebook (test artifact — remove before merge)
`experiments/02_demo_sim_optogenetic/demo_sim_optogenetic_napari_async.ipynb` is included only to exercise this PR against the virtual-microscope optogenetic backend (async run, pause/resume, cancel/restart, the status widget, multi-FOV). It doubles as a worked example of what the migrated notebooks could look like. It should be deleted before this PR merges: the real deliverable is the API + widget, not this notebook.
What to check / test before merging
- … `experiments/*` that calls `run_experiment` / `continue_experiment`: migrate to `.wait()` or the non-blocking flow. Confirm none rely on the old blocking return.
- … `CoreViewerLink` before a run: that workaround is no longer needed; verify removing it and that the preview layer keeps updating during the run.
- `tests/hardware/*`: update for the new `RunHandle` return type; run on the Moench rig.
- … `n_frames_received` outpaces the RTMEvent count: verify the strip/stats still read sensibly or gate the assumption.
- `continue_experiment` + the widget: confirm the strip/map rebuild correctly for the appended events and the FOV map merges positions.
- … `import faro` stays Qt-free; `.wait()` path works without a `QApplication`.
- `virtual-microscope` lockfile pin: `uv lock --upgrade-package virtual-microscope` to pick up the fixes now on its default branch (JIT pre-warm; `SimCameraDevice` digital ROI / MDA-teardown fix). Without this the demo notebook's first ~4 s of frames stall and the napari Snap preview freezes after a run. Commit the `uv.lock` change separately (it is not async/widget code).
Related (separate repo)
Two `virtual-microscope` fixes were needed for the demo notebook and have already landed on its default branch (virtual-env):
- … `RealtimeEngine` starts; otherwise the first ~4 s of snaps stall behind a compile holding the sim lock, so frames arrive in a burst instead of paced.
- `SimCameraDevice` digital ROI: implements real ROI cropping. It also fixes an MDA-teardown bug: the camera previously raised `NotImplementedError` from `set_roi`, which aborted `MDARunner._finish_run` before it emitted `sequenceFinished`; napari-micromanager then never cleared `_mda_running`, so the Snap preview silently stopped updating after a run.
These are not part of this PR: faro just needs the lockfile bump above to pick them up.
Verification
Exercised end-to-end against the virtual-microscope optogenetic backend (napari + napari-micromanager + the widget):
Compatibility notes
- … `import faro.widgets`); `import faro` / `import faro.core` stay Qt-free.
- … `AbstractMicroscope`.