mcp: work on several updates in parallel (workspaces, background jobs, lock wait) #216

Open: plusky wants to merge 9 commits into openSUSE:main from plusky:feat/mcp-parallelism
@plusky plusky commented Jun 22, 2026

Contributor

Lets one client — or several agents — keep more than one update moving at once, without each load_template tearing the previous update's hosts down. Five commits.

1. Named workspaces

Every tool gains an optional workspace selector (default "default"). Each name resolves, via the existing SessionRegistry, to its own isolated McpSession (own loaded template, targets, per-session lock). Calls in different workspaces run concurrently, so a single stdio client can advance update B while update A's slow host op runs. stdio now uses the registry too (idle sweeper disabled). New tools: list_workspaces, close_workspace. The default workspace reproduces today's behaviour.
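
To make the isolation concrete, here is a minimal sketch of the idea, not the code in this PR: `Session` and `Registry` are hypothetical stand-ins for McpSession and SessionRegistry, and the workspace names are invented. It only shows that each name gets its own session and lock, so holding one workspace's lock does not block another.

```python
import threading
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Session:
    """Stand-in for McpSession: its own template, targets and lock."""
    template: Optional[str] = None
    targets: list = field(default_factory=list)
    lock: threading.Lock = field(default_factory=threading.Lock)


class Registry:
    """Hypothetical registry mapping a workspace name to its own Session."""

    def __init__(self) -> None:
        self._sessions = {}
        self._guard = threading.Lock()

    def resolve(self, workspace: str = "default") -> Session:
        # Sessions are minted lazily, so callers that never name a
        # workspace only ever touch "default".
        with self._guard:
            if workspace not in self._sessions:
                self._sessions[workspace] = Session()
            return self._sessions[workspace]

    def names(self) -> list:
        with self._guard:
            return sorted(self._sessions)


registry = Registry()
a = registry.resolve("update-a")
b = registry.resolve("update-b")
a.template = "template for update A"  # invented value

# While A's lock is held (think: a slow host op), B's lock is still free.
with a.lock:
    b_was_free = b.lock.acquire(blocking=False)
    if b_was_free:
        b.lock.release()
```

Because the lock lives on the session rather than on the registry, only calls within the same workspace serialise against each other.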

2. Async background jobs

Slow host commands (run, update, downgrade, prepare, install, uninstall, set_repo, reboot) accept background=true. Instead of holding the request open, the call returns a job id immediately and the command runs in the background, still under the workspace lock. Poll with job_status and fetch the output with job_result; job_list and job_cancel are also available.

3. Wait for a busy refhost

Separate agents are separate processes, so a refhost lock genuinely excludes them. [lock] wait (seconds, default 0 = fail-fast) makes lock() queue on a busy host, polling every [lock] wait_poll seconds until the lock is released, reaped as stale, or becomes ours, instead of raising an error. The connect-time warning is untouched; a warning is logged when the wait starts and on timeout, so a REPL user still sees that the host is busy.
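
The wait policy can be sketched roughly as below. This is an illustration under stated assumptions, not the PR's `_wait_for_release`: the function signature, the injectable `clock`/`sleep` parameters and `FakeClock` are hypothetical, and only the `wait` / `wait_poll` semantics (0 = fail-fast, poll until free or timeout) come from the description above.

```python
import time
from typing import Callable


class TargetLockedError(Exception):
    """Raised when the host is still locked by someone else."""


def wait_for_release(
    is_foreign_locked: Callable[[], bool],
    wait: float = 0,
    wait_poll: float = 15,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Return once the host is free; raise TargetLockedError on timeout."""
    if not is_foreign_locked():
        return
    if wait <= 0:
        # Default: unchanged fail-fast behaviour.
        raise TargetLockedError("host is locked")
    deadline = clock() + wait
    while clock() < deadline:
        # Never sleep past the deadline.
        sleep(min(wait_poll, deadline - clock()))
        if not is_foreign_locked():
            return
    raise TargetLockedError(f"host still locked after {wait}s")


class FakeClock:
    """Test clock: sleeping just advances the fake time."""

    def __init__(self) -> None:
        self.now = 0.0

    def __call__(self) -> float:
        return self.now

    def sleep(self, seconds: float) -> None:
        self.now += seconds


fake = FakeClock()
# The foreign lock disappears after 30 fake seconds; two polls are enough.
wait_for_release(lambda: fake.now < 30, wait=60, wait_poll=15,
                 clock=fake, sleep=fake.sleep)
```

Injecting the clock keeps the wait-then-succeed and wait-then-timeout cases testable without real sleeps.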

4. Refhost pool selection

When refhosts.yml lists several interchangeable hosts for the same test target and [refhosts] pool_select is on, add_host connects just one free host per target instead of the whole matrix: tries candidates in turn, skips any locked by another agent, and claims (locks) the one it takes — so parallel agents draw distinct hosts. The target is the full product + version + arch + addons the update asks for, not just arch — so an update spanning all arches of SLE15-SP5 and SP7 still gets a host per (service-pack, arch); only genuine duplicates collapse to one. Searched across all locations (location ignored). Off by default.

Together: a pool gives several agents distinct hosts; lock wait makes them queue when the pool is exhausted; workspaces + background jobs keep several updates moving per agent/client.
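
A rough sketch of the per-slot selection described in section 4, assuming candidates arrive as (host, slot) pairs and that a `try_claim` callback locks a host atomically and reports success. `pick_one_per_slot`, the host names and the slot strings are all hypothetical; the real logic lives in TestReport and may differ.

```python
from collections import defaultdict
from typing import Callable, Iterable


def pick_one_per_slot(
    candidates: Iterable[tuple],
    try_claim: Callable[[str], bool],
) -> dict:
    """Keep one host per slot: the first free one, else the first candidate."""
    by_slot = defaultdict(list)
    for host, slot in candidates:
        by_slot[slot].append(host)

    chosen = {}
    for slot, hosts in by_slot.items():
        if len(hosts) == 1:
            # Nothing to choose between; the normal lock path applies.
            chosen[slot] = hosts[0]
            continue
        # If every candidate is busy, fall back to the first one and let
        # the [lock] wait policy decide whether to queue or fail.
        chosen[slot] = next((h for h in hosts if try_claim(h)), hosts[0])
    return chosen


locked = {"host-a"}  # pretend another agent already holds host-a


def try_claim(host: str) -> bool:
    if host in locked:
        return False
    locked.add(host)
    return True


candidates = [
    ("host-a", "SLE15-SP5/x86_64"),
    ("host-b", "SLE15-SP5/x86_64"),
    ("host-c", "SLE15-SP7/x86_64"),
]
chosen = pick_one_per_slot(candidates, try_claim)
```

Here the SP5 slot collapses to host-b (host-a is taken), while the SP7 slot keeps its only host, so distinct targets never collapse into each other.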

Notes

  • Workspaces inside one process share the OS pid, so the (user+pid) refhost lock treats them as one owner — they never block each other on a host; coordinate same-process workspaces so they don't drive the same host destructively at once. [lock] wait and pool selection matter across separate processes/agents.
  • location is optional: with none configured/specified it defaults to the default bucket of refhosts.yml; pool selection ignores location entirely.

Tests

Full suite green (1312 passed locally). New coverage for workspaces, background jobs, lock-wait, and pool search/selection (incl. the SP5-vs-SP7 distinct-slot case). Docs: new "Working on several updates in parallel" section in Documentation/mcp.rst.

@codecov

codecov Bot commented Jun 22, 2026


Codecov Report

❌ Patch coverage is 92.64706% with 30 lines in your changes missing coverage. Please review.
✅ Project coverage is 86.85%. Comparing base (15761b8) to head (9a8dd8a).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| mtui/mcp/session.py | 83.33% | 13 Missing ⚠️ |
| mtui/test_reports/testreport.py | 89.28% | 12 Missing ⚠️ |
| mtui/mcp/tools.py | 94.38% | 5 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #216      +/-   ##
==========================================
+ Coverage   86.52%   86.85%   +0.32%     
==========================================
  Files         160      161       +1     
  Lines        8980     9369     +389     
==========================================
+ Hits         7770     8137     +367     
- Misses       1210     1232      +22     


@plusky plusky force-pushed the feat/mcp-parallelism branch from f667d83 to baf77ff Compare June 22, 2026 16:11
plusky added a commit to plusky/mtui that referenced this pull request Jun 22, 2026
Make the refhost pool usable from a single mtui-mcp client: run several
workspaces' host phases in parallel on distinct pool hosts, with no
refhosts.yml change (the pool already lists many hosts per arch).

The problem: the remote /var/lock/mtui.lock is keyed on user+pid, so two
workspaces in one process (same pid) both see a lock as "mine" — the lock
cannot keep them off the same host. Pool selection (openSUSE#216) therefore wasn't
safe within one client.

- host_arbiter.HostArbiter: a process-global, thread-safe map of refhost ->
  owning workspace with a wait queue. One instance per SessionRegistry,
  shared by every session it mints; each session is bound to it under its
  registry key (the owner) via McpSession.bind_arbiter.
- TestReport pool selection is now arbiter-aware: a candidate held by another
  workspace in this process is skipped, and if every candidate is held the
  claim queues up to [lock] wait seconds for one to be released
  (acquire_any) — "multiple queues per refhost". Falls back to the prior
  remote-lock-only path when no arbiter is bound (REPL / single session).
- Cross-process / manual visibility uses the existing lock mechanism: a claim
  takes the remote lock with an identifying comment ("mtui-mcp pool <RRID>
  [<owner>]"), so other mtui-mcp servers and manual `mtui` users see the host
  busy and by what; release_pool_claims() removes that remote lock on
  workspace close (McpSession.close, also fired by the idle sweeper) so hosts
  do not leak as locked.
- Stale mtui-mcp pool locks are reaped by the normal reap_if_stale path
  (commented locks are not exempt), recovering a crashed server's claims.

Tests: HostArbiter (claim/release/release_owner/acquire_any incl. the
queue-until-released wait); arbiter-aware selection (skip other-workspace
host, none-free, remote-locked-retry); release_pool_claims unlocks remote +
drops ownership; reap_if_stale reaps an mtui-mcp pool-commented lock.
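
For readers who want the shape of the arbiter without opening the diff, a minimal sketch built on `threading.Condition` follows. The method names mirror the ones listed above (claim, release, release_owner, acquire_any), but the signatures and internals are assumptions, not the code in host_arbiter.

```python
import threading
import time
from typing import Iterable, Optional


class HostArbiter:
    """Hypothetical in-process map of refhost -> owning workspace."""

    def __init__(self) -> None:
        self._owners = {}
        self._cond = threading.Condition()

    def claim(self, host: str, owner: str) -> bool:
        with self._cond:
            if self._owners.get(host, owner) != owner:
                return False  # held by another workspace
            self._owners[host] = owner
            return True

    def release(self, host: str, owner: str) -> None:
        with self._cond:
            if self._owners.get(host) == owner:
                del self._owners[host]
                self._cond.notify_all()  # wake queued claimants

    def release_owner(self, owner: str) -> None:
        with self._cond:
            for host in [h for h, o in self._owners.items() if o == owner]:
                del self._owners[host]
            self._cond.notify_all()

    def acquire_any(self, hosts: Iterable[str], owner: str,
                    wait: float = 0) -> Optional[str]:
        """Claim the first free host, queueing up to `wait` seconds."""
        hosts = list(hosts)
        deadline = time.monotonic() + wait
        with self._cond:
            while True:
                for host in hosts:
                    if self._owners.get(host, owner) == owner:
                        self._owners[host] = owner
                        return host
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    return None
                self._cond.wait(remaining)


arbiter = HostArbiter()
first_claim = arbiter.claim("host-a", "ws-1")
second_claim = arbiter.claim("host-a", "ws-2")  # same pid, different workspace

# ws-1 releases shortly; ws-2 queues until then instead of failing.
threading.Timer(0.1, arbiter.release, args=("host-a", "ws-1")).start()
got = arbiter.acquire_any(["host-a"], "ws-2", wait=5)
```

This is the piece the user+pid remote lock cannot provide: ownership is keyed on the workspace, so two workspaces in one process are told apart.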
@plusky plusky force-pushed the feat/mcp-parallelism branch from 88d54d9 to c8614c3 Compare June 23, 2026 08:47
plusky added a commit to plusky/mtui that referenced this pull request Jun 23, 2026
…earch_pool/locked_by

Compute pool slots in-command via query() instead of Refhosts.search_pool, and
use the public Target.is_locked() for --free instead of Target.locked_by (both
of which are added by openSUSE#216 and absent on main). Keeps this PR independent.
mimi1vx pushed a commit that referenced this pull request Jun 23, 2026
…ol/locked_by

Compute pool slots in-command via query() instead of Refhosts.search_pool, and
use the public Target.is_locked() for --free instead of Target.locked_by (both
of which are added by #216 and absent on main). Keeps this PR independent.
plusky added 9 commits June 23, 2026 12:33
…one client

Every tool now takes an optional `workspace` selector (default "default").
Each distinct name resolves, via the existing SessionRegistry, to its own
isolated McpSession: own loaded template, own `targets`, own per-session
lock. Because the lock is per-session, calls in different workspaces run
concurrently (each blocking body in its own thread), so one stdio client
(Claude Code) can advance update B while update A's slow host op runs —
instead of load_template tearing A's hosts down to touch B.

- registry: `workspace_key`/`split_workspace_key` compose/parse the
  per-client base key + workspace name; `resolve_session` grows a
  `workspace` arg (default reproduces today's one-session-per-client);
  `live_sessions()` snapshot for listing.
- main: stdio now uses a SessionRegistry too (idle sweeper disabled — a
  workspace left quiet while you work another must keep its hosts). The
  default workspace is minted lazily, so callers that never name one are
  unaffected.
- tools / testreport_tools: surface the `workspace` parameter and thread
  it through; it is popped before argv encoding so it never leaks to the CLI.
- new tools: list_workspaces (this client's workspaces + their loaded
  template and hosts) and close_workspace (disconnect a workspace's hosts
  and drop it). Both are scoped to the calling client.
- tests: workspace key round-trip, per-workspace and cross-client isolation,
  default-workspace equivalence, live_sessions snapshot.
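
The key round-trip tested above could look like the sketch below. The `::` separator and the choice to always append the workspace name are assumptions made for illustration; the commit text does not show the real key format.

```python
SEP = "::"  # assumed separator, not taken from the PR
DEFAULT_WORKSPACE = "default"


def workspace_key(base_key: str, workspace: str = DEFAULT_WORKSPACE) -> str:
    """Compose the per-client base key and a workspace name."""
    if not workspace or SEP in workspace:
        raise ValueError(f"invalid workspace name: {workspace!r}")
    return f"{base_key}{SEP}{workspace}"


def split_workspace_key(key: str) -> tuple:
    """Parse a composed key back into (base_key, workspace)."""
    # rpartition: the workspace never contains SEP, so the last SEP is ours
    # even if the base key happens to contain one.
    base, sep, workspace = key.rpartition(SEP)
    if not sep:
        raise ValueError(f"not a workspace key: {key!r}")
    return base, workspace


key = workspace_key("stdio", "update-b")
round_trip = split_workspace_key(key)
```

Rejecting the separator inside workspace names is what keeps the parse unambiguous.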
Slow host commands (run/update/downgrade/prepare/install/uninstall/
set_repo/reboot) gain a `background=true` flag. Instead of holding the
request open for the minutes the op takes, it returns a job id at once and
runs the command in an asyncio task that still acquires the session lock
for its duration (so it serialises against the workspace's other mutating
calls exactly like a foreground call). The client polls and meanwhile drives
other workspaces — the practical "don't block the desk on one slow host op".

- session: per-session job table + start_job (background runner), job_list,
  job_status, job_result (returns stdout when done, surfaces the command's
  failure envelope when failed, tells the caller to poll while running),
  job_cancel (with the documented mid-SSH detach caveat).
- tools: SLOW_COMMANDS gain the `background` parameter; new job_list /
  job_status / job_result / job_cancel tools, all workspace-scoped.
- tests: done / failed / still-running / unknown-id / list / cancel paths.
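
A minimal asyncio sketch of the job table, assuming an `asyncio.Lock` stands in for the session lock and that job ids are simple counters. `JobTable`, its status strings and `slow_update` are hypothetical; the real start_job / job_status / job_result return richer envelopes.

```python
import asyncio
import itertools
from typing import Awaitable, Callable


class JobTable:
    """Hypothetical per-session job table."""

    def __init__(self) -> None:
        self._lock = asyncio.Lock()  # stands in for the session lock
        self._jobs = {}
        self._ids = itertools.count(1)

    def start_job(self, run: Callable[[], Awaitable[str]]) -> str:
        job_id = f"job-{next(self._ids)}"

        async def runner() -> str:
            # Hold the lock for the whole command, like a foreground call.
            async with self._lock:
                return await run()

        self._jobs[job_id] = asyncio.create_task(runner())
        return job_id  # returned at once; the command keeps running

    def job_status(self, job_id: str) -> str:
        task = self._jobs.get(job_id)
        if task is None:
            return "unknown"
        if not task.done():
            return "running"
        if task.cancelled():
            return "cancelled"
        return "failed" if task.exception() else "done"

    def job_result(self, job_id: str) -> str:
        status = self.job_status(job_id)
        if status == "done":
            return self._jobs[job_id].result()
        return f"<{status}>"  # caller should keep polling or inspect the failure


async def main() -> tuple:
    table = JobTable()

    async def slow_update() -> str:
        await asyncio.sleep(0.05)  # placeholder for a slow host command
        return "update ok"

    job_id = table.start_job(slow_update)
    first = table.job_status(job_id)
    while table.job_status(job_id) == "running":
        await asyncio.sleep(0.01)  # the client would drive other workspaces here
    return first, table.job_status(job_id), table.job_result(job_id)


statuses = asyncio.run(main())
```

The table is created inside the running loop so the lock and tasks belong to the same event loop.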
…nt sharing)

When a refhost is locked by another session/agent, TargetLock.lock() can
now queue on it — poll until the foreign lock is released (or reaped as
stale, or becomes ours) — instead of raising TargetLockedError immediately.
This lets several agents share one refhost pool: a host in use is waited
for, not errored on. Matters because separate agents are separate processes
(distinct pid) so the lock genuinely excludes them (workspaces inside one
mtui-mcp process share a pid and never contended).

- config: [lock] wait (seconds, default 0 = unchanged fail-fast) and
  [lock] wait_poll (seconds, default 15).
- locks: _wait_for_release polls up to `wait`, logging a warning when it
  starts waiting (so a REPL user still sees the host is locked and that
  mtui is now waiting — the connect-time warning is untouched) and on
  timeout; on timeout the caller raises TargetLockedError as before.
  _int_cfg reads the options defensively.
- tests: fail-fast default, wait-then-succeed (released mid-poll, fake
  clock), wait-then-timeout.

Note: this is the lock-handling half of the refhost-pool work; auto-
selecting a *free* candidate per arch from a multi-host pool (so agents
pick different hosts rather than queue on one) is the remaining infra step.
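
Reading the two options defensively might look like this. It is a sketch that assumes a configparser-style config object; `int_cfg` is a hypothetical stand-in for `_int_cfg`, and falling back to the default on a malformed value is an assumption about what "defensively" means here.

```python
import configparser


def int_cfg(config: configparser.ConfigParser, section: str,
            option: str, default: int) -> int:
    """Read an integer option, falling back to `default` when unusable."""
    try:
        # fallback= covers a missing section or option...
        return config.getint(section, option, fallback=default)
    except ValueError:
        # ...but a non-integer value still raises, so catch it here.
        return default


config = configparser.ConfigParser()
config.read_string("[lock]\nwait = 600\nwait_poll = soon\n")

wait = int_cfg(config, "lock", "wait", 0)
wait_poll = int_cfg(config, "lock", "wait_poll", 15)    # malformed value
pool = int_cfg(config, "refhosts", "missing_option", 7)  # missing section
```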
…jobs, lock wait)

New "Working on several updates in parallel" section covering the workspace
argument + list/close_workspace tools, the background=true slow-op flag +
job_status/job_result/job_list/job_cancel, and the [lock] wait/wait_poll
options, with the same-pid caveat for workspaces inside one process. Synopsis
updated to mention named workspaces.
… (parallel agents)

Completes the refhost-pool half of the parallelism work (the lock-wait
commit handles queueing; this picks the host). When refhosts.yml lists
several interchangeable hosts for the same test target and [refhosts]
pool_select is on, add_host connects just one *free* host per target instead
of the whole matching matrix: it tries candidates in turn, skips any locked
by another agent, and claims (locks) the one it takes — so parallel agents
drawing from the same pool end up on different hosts. If all candidates are
busy it falls back to the first and the [lock] wait policy governs the wait.

The selection slot is the full test-target identity the update asks for —
product + version + arch + addons — NOT just arch. So an update spanning all
arches of e.g. SLE15-SP5 and SP7 still gets a host for every (service-pack,
arch) pair; only genuine duplicates (several hosts for the very same target)
collapse to one. The slot is keyed on the matched query attribute, so a host
carrying an extra addon still pools with a plainer host that satisfies the
same target. The pool is searched across all locations (location ignored).

- config: [refhosts] pool_select (bool, default false -> unchanged behaviour).
- store: search_pool() returns (host, slot) pairs, slot = str(matched
  attribute); with all_locations=True aggregates across every location,
  de-duplicated by name (a host binds to the first slot it matches).
- testreport: refhosts_from_tp records each candidate's slot under pool mode;
  connect_targets first runs _claim_pool_candidates, which groups pending
  candidates by slot and, per multi-candidate slot, connect+claims the first
  free host (_claim_first_free) and drops the rest. Gates use `is True` so a
  MagicMock test config can't trip the path.
- docs: pool_select + the "location is optional / defaults to default" note.
- tests: search_pool all-locations / same-target-one-slot / distinct-SP-
  distinct-slot / fallback / dedupe; selection first-free / all-busy /
  lock-race / reaped-stale / reduces-only-within-a-slot.
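
To illustrate the slot keying and the cross-location de-duplication, here is a simplified model. `Attrs`, `slot_of` and this `search_pool` signature are hypothetical, and the host and location names are invented; only the rules (slot = matched query, extra addons still match, first slot wins, de-duplicate by name) follow the commit message.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attrs:
    product: str
    version: str
    arch: str
    addons: frozenset = frozenset()

    def satisfies(self, query: "Attrs") -> bool:
        # A host with extra addons still satisfies a plainer query.
        same_base = (self.product, self.version, self.arch) == (
            query.product, query.version, query.arch)
        return same_base and query.addons <= self.addons


def slot_of(query: Attrs) -> str:
    addons = "+".join(sorted(query.addons))
    base = f"{query.product}-{query.version}/{query.arch}"
    return f"{base}[{addons}]" if addons else base


def search_pool(locations: dict, queries: list) -> list:
    """Return (host, slot) pairs across all locations, de-duplicated by name."""
    pairs, seen = [], set()
    for hosts in locations.values():
        for name, attrs in hosts.items():
            if name in seen:
                continue
            for query in queries:
                if attrs.satisfies(query):
                    # The slot is the matched query, not the host's own attributes.
                    seen.add(name)
                    pairs.append((name, slot_of(query)))
                    break  # a host binds to the first slot it matches
    return pairs


sp5 = Attrs("SLES", "15-SP5", "x86_64")
sp7 = Attrs("SLES", "15-SP7", "x86_64")
locations = {
    "default": {"h1": sp5,
                "h2": Attrs("SLES", "15-SP5", "x86_64", frozenset({"sdk"}))},
    "lab2": {"h2": Attrs("SLES", "15-SP5", "x86_64", frozenset({"sdk"})),
             "h3": sp7},
}
pairs = search_pool(locations, [sp5, sp7])
```

h1 and h2 share the SP5 slot even though h2 carries an extra addon, while h3 lands in its own SP7 slot.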
Fixups for the lint/format/typecheck CI on this branch:

- tools.py: register_workspace_tools / register_job_tools now
  globals().setdefault("Context", Context) — not just to clear F401 but
  because FastMCP's find_context_parameter runs get_type_hints against the
  module (with `from __future__ import annotations`), so the closures' string
  `Context | None` annotation must resolve in module globals for ctx to be
  injected. Mirrors testreport_tools.
- target.py: add Target.try_claim() and Target.locked_by(), encapsulating the
  pool probe+claim so refhost-pool selection no longer reaches into the
  private Target._lock (clears SLF001) and drops the bare-except probe.
- testreport.py: _claim_first_free uses target.try_claim()/locked_by().
- locks.py: collapse the lock-wait guard into one `if` (SIM102).
- ruff format (locks/registry/tools/testreport); test cast() so the unbound
  TestReport methods accept the duck-typed fake under ty.

ruff check/format clean; ty clean for the touched files; tests green.
…laim, pool helpers

Raise patch coverage on the parallelism code codecov/patch flagged:
- test_mcp_tool_layer: register the workspace/job/testreport tools on a
  capturing fake server and drive the registered coroutines through a real
  SessionRegistry (list/close workspace incl. the no-provider-support path,
  job_list/status/result/cancel lifecycle, and the resolve_session hop in
  each testreport tool).
- test_target_try_claim: Target.try_claim branch matrix (free / locked-by-
  other / stale-reaped / own-lock / lost-race) + locked_by.
- test_pool_helpers_extra: real _pool_lock_comment, _int_cfg fallback,
  release_pool_claims skip/unlock-error branches, _disconnect_candidate
  teardown error.

Cuts the PR's uncovered new lines ~104 -> ~31; full suite 1476 passing.
@plusky plusky force-pushed the feat/mcp-parallelism branch from b6da297 to 9a8dd8a Compare June 23, 2026 10:34