Skip to content

feat: parallel FMSR dispatch strategies for N×M LLM call bottleneck#414

Open
yassinejebbouri wants to merge 9 commits into
IBM:mainfrom
yassinejebbouri:parallelization-results
Open

feat: parallel FMSR dispatch strategies for N×M LLM call bottleneck#414
yassinejebbouri wants to merge 9 commits into
IBM:mainfrom
yassinejebbouri:parallelization-results

Conversation

@yassinejebbouri

Copy link
Copy Markdown

Summary

Implements four execution strategies for the FMSR N×M bottleneck, where each
(failure mode, sensor) pair requires one sequential LLM call by default.

  • Sequential (baseline): one call at a time
  • Parallel dispatch: fixed thread pool with configurable concurrency cap
  • Adaptive ceiling-start (AIMD): starts at max concurrency, halves on rate-limit errors, increments on consecutive successes
  • Hedged execution: fires a duplicate request after 8 s of silence, uses whichever response arrives first and cancels the other — caps p95 latency at ~16 s

Results

  • Hedged achieves up to 36× speedup on worst-case scenarios (559 s → 15.5 s)
  • Adaptive ceiling is the best cost-efficient strategy (~20×, no doubled API usage)
  • Strategy selected at runtime via the FMSR_STRATEGY environment variable

Files changed

  • src/servers/fmsr/main.py

Test plan

  • Run an FMSR scenario with FMSR_STRATEGY=hedged
  • Run an FMSR scenario with FMSR_STRATEGY=adaptive_ceiling
  • Run src/benchmarking/bench_fmsr.py to reproduce the full comparison

DariefMaes and others added 9 commits April 6, 2026 17:50
The previous implementation retried failures with a simple loop inside _call_relevancy.
Moved retry/backoff logic to LiteLLM Router (exponential backoff, circuit breaker).

Per-request timeout was also missing: the Router constructor timeout is not
forwarded automatically to individual .completion() calls, so hung WatsonX
requests could block indefinitely. Now passed explicitly on every call.
Two new parallel execution strategies for the N×M FM↔sensor mapping:
- _mapping_adaptive_ceiling: fires all pairs concurrently from t=0 and
  halves the semaphore limit immediately on any 500 error. Avoids the
  AIMD ramp-up penalty for small N where additive increase finishes
  only after all work is already done.
- _mapping_hedged: ceiling-start combined with speculative duplicate
  requests. If any call stalls past FMSR_HEDGE_AFTER_S (default 8s),
  a rescue copy is fired on a background thread. Whichever copy responds
  first wins, capping tail latency at ~2×hedge_after_s instead of the
  full 90s Router timeout.

Also removed unused intermediate implementations (_mapping_batched_parallel,
_mapping_async, _call_relevancy_async) that were never wired into the
benchmark, and cleaned up asyncio import left behind after their removal.
psutil: hardware sampling (CPU%, memory RSS, thread count) during runs.
filelock: safe append-only writes to the shared JSONL results file when
  multiple processes run the benchmark concurrently or resume after crash.
matplotlib: benchmark visualization plots (wall time, speedup, per-call
  latency distribution, hardware utilization).
…nv var

The benchmark controls which parallelization path runs by passing
FMSR_STRATEGY in the subprocess environment when spawning the server.
The tool interface (inputs, outputs) is unchanged — only the internal
N×M dispatch is selected at startup from the env var:

  sequential        — one LLM call at a time (baseline)
  parallel          — fixed thread pool (FMSR_PARALLEL_WORKERS workers)
  adaptive_ceiling  — ceiling-start semaphore, halves on 500 error
  hedged            — ceiling-start + speculative duplicate on stall

Default remains parallel with 2 workers, matching the original behaviour
when FMSR_STRATEGY is not set.
…ntation

bench_hardware.py  — samples CPU%, memory RSS, and thread count via psutil
                     at a configurable interval during each run
bench_stats.py     — t-distribution confidence intervals, per-call stat
                     aggregation, and the build_summary() roll-up used by
                     the main benchmark and plot generator
bench_instrumentation.py — timing hooks that patch fmsr._call_relevancy to
                     capture per-call latency and phase boundaries; used by
                     the in-process debug runner (test_scenario.py), not by
                     the MCP benchmark where the server runs out-of-process
Covers wall time grouped bars, speedup line charts, per-call latency
box plots, hardware utilization, phase breakdowns, and scenario scaling.
Strategies and colors are configurable; defaults match the 4-strategy set.
Calls get_failure_mode_sensor_mapping through the FMSR MCP server
(stdio subprocess) for each of 4 strategies across 15 scenarios × 3 runs.
This matches the real agent execution path exactly — same subprocess
spawn, same stdio protocol, same tool interface as workflow/executor.py.

Key design points:
- Sensors fetched live from iot-mcp-server (CouchDB), not hardcoded
- Failure modes fetched live from fmsr-mcp-server (YAML), not hardcoded
- Per-scenario sensor/FM slices derived from the real fetched lists using
  keyword filtering that mirrors what the agent query implies
- Strategy selected by passing FMSR_STRATEGY to the server subprocess env
- Resume support: skips (run, scenario, strategy) triples already in JSONL
- Results written to results_mcp/ to preserve existing results/
test_scenario.py  — debug tool that imports the FMSR server in-process
                    and patches _call_relevancy to print every (sensor, FM)
                    pair live as it executes. Sensors and failure modes are
                    fetched from the live MCP servers at startup. Useful for
                    tracing individual LLM calls without running the full
                    benchmark.
                    Usage: uv run python -m src.benchmarking.test_scenario
                           --scenario 109 --strategy hedged

eval_fmsr.py      — original sequential-vs-parallel evaluation script that
                    predates the multi-run benchmark; kept as a reference
                    baseline and for quick one-off comparisons.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants