Skip to content

Queued offline updates#1430

Open
rwalker777 wants to merge 39 commits into
esphome:mainfrom
rwalker777:queued-offline-updates
Open

Queued offline updates#1430
rwalker777 wants to merge 39 commits into
esphome:mainfrom
rwalker777:queued-offline-updates

Conversation

@rwalker777

@rwalker777 rwalker777 commented Jun 13, 2026

Copy link
Copy Markdown

What does this implement/fix?

Add detection when device is offline to local build and add a queue flag that triggers an OTA when the device wakes.

Related issue or feature (if applicable):

  • fixes

Types of changes

  • Bugfix (non-breaking change which fixes an issue) — bugfix
  • New feature (non-breaking change which adds functionality) — new-feature
  • Enhancement to an existing feature — enhancement
  • Breaking change (fix or feature that would cause existing functionality to not work as expected) — breaking-change
  • Refactor (no behaviour change) — refactor
  • Documentation only — docs
  • Maintenance / chore — maintenance
  • CI / workflow change — ci
  • Dependencies bump — dependencies

Frontend coordination

  • No frontend change needed
  • [X ] Companion frontend PR: esphome/device-builder-frontend#

Checklist

  • The code change is tested and works locally.
  • Pre-commit hooks pass (ruff, codespell, yaml/json/python checks).
  • Tests have been added or updated under tests/ where applicable.
  • components.index.json / definitions/components/*.json have not been hand-edited (regenerate via script/sync_components.py if a sync is needed).
  • Architecture-level changes are reflected in docs/ARCHITECTURE.md and/or docs/API.md.

rwalker777 and others added 5 commits June 12, 2026 14:38
…e devices. When a firmware job completes successfully for a device that is currently offline, we set a new `queued_update` flag on the device. This indicates that the compiled firmware is ready and waiting to be flashed via OTA the next time the device checks in via mDNS.
@rwalker777

Copy link
Copy Markdown
Author

Will submit frontend to show queued status and option to cancel. Feedback appreciated.

@codspeed-hq

codspeed-hq Bot commented Jun 13, 2026

Copy link
Copy Markdown

Merging this PR will degrade performance by 10.17%

⚠️ Different runtime environments detected

Some benchmarks with significant performance changes were compared across different runtime environments,
which may affect the accuracy of the results.

Open the report in CodSpeed to investigate

❌ 1 regressed benchmark
✅ 26 untouched benchmarks

Warning

Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

Benchmark BASE HEAD Efficiency
test_device_list_to_dict_fleet[50] 637.8 µs 710 µs -10.17%

Tip

Investigate this regression by commenting @codspeedbot fix this regression on this PR, or directly use the CodSpeed MCP with your agent.


Comparing rwalker777:queued-offline-updates (d06801c) with main (7af7fd4)

Open in CodSpeed

@esphbot

esphbot commented Jun 13, 2026

Copy link
Copy Markdown
Contributor

@rwalker777: Will submit frontend to show queued status and option to cancel. Feedback appreciated.

Core gap: on_queued_update_ready unwired. No caller passes it. Wake never flashes. Also queued_update never reset — flapping device re-installs forever. Persist flag via state_callbacks path, not direct device.queued_update = True in runner (lost on restart). Fire DeviceEventData, not raw {"device": device}. Two tests reimplement monitor logic inline — drive real apply() instead.

@esphbot

esphbot commented Jun 13, 2026

Copy link
Copy Markdown
Contributor

Previous review — superseded by a newer review below.

@esphbot esphbot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Blocking issues found — see the review comment above.

@github-actions github-actions Bot added the new-feature New feature label Jun 15, 2026
@codecov

codecov Bot commented Jun 15, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 99.54%. Comparing base (00a6f51) to head (d06801c).
⚠️ Report is 18 commits behind head on main.

Additional details and impacted files

Impacted file tree graph

@@           Coverage Diff            @@
##             main    #1430    +/-   ##
========================================
  Coverage   99.54%   99.54%            
========================================
  Files         223      224     +1     
  Lines       17423    17569   +146     
========================================
+ Hits        17343    17489   +146     
  Misses         80       80            
Flag Coverage Δ
py3.12 99.51% <100.00%> (+<0.01%) ⬆️
py3.14 99.54% <100.00%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
...home_device_builder/controllers/_device_scanner.py 99.40% <100.00%> (+<0.01%) ⬆️
...er/controllers/_device_state_monitor/controller.py 98.24% <100.00%> (+0.09%) ⬆️
...ice_builder/controllers/devices/_metadata_store.py 100.00% <ø> (ø)
...e_device_builder/controllers/devices/controller.py 99.71% <100.00%> (+<0.01%) ⬆️
...ome_device_builder/controllers/devices/metadata.py 100.00% <100.00%> (ø)
..._device_builder/controllers/firmware/controller.py 100.00% <100.00%> (ø)
esphome_device_builder/models/devices.py 100.00% <100.00%> (ø)
esphome_device_builder/models/firmware.py 100.00% <ø> (ø)

... and 15 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@rwalker777 rwalker777 requested a review from esphbot June 16, 2026 16:15
@esphbot

esphbot commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

Previous review — superseded by a newer review below.

@esphbot esphbot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Blocking issues found — see the review comment above.

@rwalker777 rwalker777 requested a review from esphbot June 16, 2026 20:29
@rwalker777

Copy link
Copy Markdown
Author

This is dependent on this ESPHome PR - esphome/esphome#16358

@esphbot

esphbot commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

@rwalker777: This is dependent on this ESPHome PR - esphome/esphome#16358

Noted — feature can't function until esphome/esphome#16358 lands. Mark this PR draft until then. Two blockers independent of that dependency: queued_update persists to the store but _resolve_device_metadata never reads it back (lost on restart/rescan), and _execute_job arms the flag for any offline COMPILE, so plain firmware/compile triggers an unwanted wake-flash. Both need a fix here regardless of upstream.

@esphbot

esphbot commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

Previous review — superseded by a newer review below.

@esphbot esphbot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Blocking issues found — see the review comment above.

rwalker777 and others added 5 commits June 16, 2026 19:58
…gate offline update queuing logic, ensuring that devices are only queued once per offline state. This prevents redundant compile-only jobs for devices that are already marked as having a queued update, while still allowing new offline devices to be queued appropriately. Additionally, the test suite is updated to reflect the new flag and its behavior in the offline queue processing.
@esphbot

esphbot commented Jun 17, 2026

Copy link
Copy Markdown
Contributor

PR Review — Queued offline updates

Solid persistence wiring, but the arming logic has a state-window gap that can silently swallow an install, plus reflective type access and unfilled frontend coordination.

What's done well:

  • Persistence is correct: queued_update is added to STORE_FIELDS, written through the tri-state update() (truthy=write, falsy=clear), and read back in _resolve_device_metadata, so a deferred install survives a restart. is_deferred_install likewise persists on the queued FirmwareJob.
  • The flag follows the established "Device is source of truth" pattern — apply_queued_update dedupes via _any_matching_device_differs, mutation flows through a callback that persists + fires DEVICE_UPDATED.
  • _handle_device_wake reads the correct flat {configuration, state} payload and clears the flag before kicking the upload, avoiding re-trigger; API.md is updated for the new command.

What needs attention:

  • Warning: a device coming ONLINE during the deferred compile lands in a gap — queued_update is armed only if still OFFLINE at compile-end, and the wake handler only fires on the ONLINE edge while already armed — so the install compiles but never flashes, silently.
  • Warning: hasattr(self, "_state_monitor") guards an impossible state (always set in __init__); the accompanying delattr test pins dead code.
  • Warning: getattr(job, "is_deferred_install", ...) / getattr(device, "queued_update", ...) / Any | None defeat type checking on flags that gate flashing.
  • Warning: new Device.queued_update field + firmware/clear_queued_update command need the companion frontend PR (still a placeholder) — without UI the queued state is invisible.
  • Suggestion: tests assert wiring rather than the end-to-end deferred flow; wake-flash may push a binary stale vs. edited YAML.

🟡 Important

1. Device coming online during the deferred compile silently drops the install (`esphome_device_builder/controllers/firmware/controller.py`, L258-272)

The deferred-install arming is gated on device.state == DeviceState.OFFLINE at two points that don't cover the full window:

  • install queues a compile-only job when the device is OFFLINE.
  • _execute_job sets queued_update=True after the compile only if the device is still OFFLINE.
  • _handle_device_wake triggers the OTA only on the ONLINE transition while queued_update is already True.

Consider the realistic sequence: offline device → install → compile starts → device wakes mid-compile (compiles take minutes) → _handle_device_wake runs but queued_update is still False, so nothing → compile finishes, device is now ONLINE so _execute_job skips arming.

Result: a binary is compiled but never flashed, with no error surfaced. The user clicked Install on an offline device, it came online, and nothing happened — which is exactly the scenario this feature exists to handle.

Fix: when a deferred compile completes successfully, arm queued_update unconditionally, then proactively check current reachability and kick the upload immediately if the device is already online (don't rely solely on a future ONLINE edge that may have already passed).

if is_comp and is_done and is_deferred:
    device = self._device_for_configuration(job.configuration)
    if device and device.state == DeviceState.OFFLINE and self._db.devices:
        self._db.devices.set_queued_update(device.name, is_queued=True)
2. `hasattr(self, "_state_monitor")` guards a state that cannot occur (`esphome_device_builder/controllers/devices/controller.py`, L321-325)

_state_monitor is set unconditionally in DevicesController.__init__ (controller.py:183) and accessed directly without guards everywhere else (start, stop, zeroconf, _build_address_cache_args, …). The hasattr branch here can never be False in production — it's dead defensive code.

Why it matters: the guard masks real wiring bugs (a genuine None/missing monitor would silently return False instead of failing loudly), and it spawned test_set_queued_update_handles_missing_attribute, which delattrs the attribute to exercise a branch that production can't reach — a test pinning the existence of dead code rather than observable behavior.

Simplify to a direct call: return self._state_monitor.apply_queued_update(name, is_queued=is_queued). Drop the testing-the-test case.

if hasattr(self, "_state_monitor") and self._state_monitor:
    return self._state_monitor.apply_queued_update(name, is_queued=is_queued)
return False
3. `getattr(...)`/`Any` reflective access on typed dataclass fields (`esphome_device_builder/controllers/firmware/controller.py`, L86-112)

Several new sites reach for fields through getattr with defaults even though they are declared dataclass attributes:

  • getattr(job, "is_deferred_install", False)is_deferred_install is a FirmwareJob field (default False).
  • getattr(device, "queued_update", False)Device.queued_update is a declared field.
  • getattr(d, "configuration", None) and the Any | None return on _device_for_configuration.

Why it matters: defensive getattr on known types defeats mypy/IDE checking — a typo'd attribute name silently returns the default instead of erroring, which is the opposite of what you want for a flag that decides whether firmware flashes. It also signals to the next reader that the type is uncertain when it isn't.

Use direct attribute access (job.is_deferred_install, device.queued_update, d.configuration) and type _device_for_configuration as Device | None. The conftest stub comment ("Test stubs must implement get_devices()") already commits to typed stubs, so the Any hedge isn't buying anything.

is_deferred = getattr(job, "is_deferred_install", False)
...
if device and getattr(device, "queued_update", False):
4. New model field + WS command need the companion frontend PR (`esphome_device_builder/models/devices.py`, L73-75)

Device.queued_update is a new model-shape field the frontend must render (the "queued / waiting to flash" indicator), and firmware/clear_queued_update is a new WS command that only makes sense if the UI exposes a way to see and clear the staged state.

The API.md row is added (good), but per CLAUDE.md, model-shape changes the frontend consumes require a coordinated companion PR, and the PR body still has the placeholder esphome/device-builder-frontend#<number> plus an unticked architecture-docs checkbox.

Why it matters: without the UI, a deferred install is invisible — the user clicks Install, nothing flashes immediately, and there is no surfaced "waiting for device to wake" state or a way to cancel it. Link the real frontend PR, or confirm explicitly that no frontend change is needed and explain how the user observes/clears the queued state.

queued_update: bool = False

🟢 Suggestions

1. Tests pin wiring, not the deferred-install behavior (`tests/controllers/_device_state_monitor/test_controller.py`, L24-73)

The monitor tests mock _any_matching_device_differs and then assert that apply_queued_update forwards to the callback — i.e. they re-assert the three-line body of the method under test rather than an end-to-end outcome. They'll pass even if the actual offline→compile→arm→wake→upload chain is broken.

What's missing is a test of the feature's real contract: install on an OFFLINE device enqueues a compile-only job (is_deferred_install=True, no upload), a successful deferred compile arms queued_update, and a subsequent ONLINE transition fires exactly one OTA upload (and clears the flag). That's where the race in _execute_job/_handle_device_wake would surface.

Consider replacing the redundant wiring tests with one behavioral test over the firmware controller's deferred path, including the "device comes online mid-compile" case.


Checklist

  • Feature functional end-to-end (offline→compile→wake→flash) — warning #1
  • State persisted across restart
  • No fragile reflective type discovery — warning #2, warning #3
  • Tests cover new branches without testing-the-test — suggestion #5
  • Docs + frontend coordination for new field/command — warning #4
  • No security/injection issues

To rebase specific severity levels, mention me: @esphbot rebase critical (fixes 🔴 only), @esphbot rebase important (fixes 🔴 + 🟡), or just @esphbot rebase for all.


Automated review by Kōan (Claude) HEAD=d06801c 5 min 29s

@esphbot esphbot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Blocking issues found — see the review comment above.

@rwalker777

Copy link
Copy Markdown
Author

@codspeedbot fix this regression

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

new-feature New feature

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants