yassinejebbouri/AssetOpsBench

AssetOpsBench: HPML Performance Optimization Study

Columbia University · HPML Spring 2026 · Course Project

Yassine Jebbouri · Darief Rida Maes · Shriya Aishani Rachakonda · Vivek G. Iyer
Advisor: Dr. Dhaval C. Patel · Instructor: Dr. Kaoutar El Maghraoui


TL;DR: We optimized a plan-execute MCP agent for industrial asset diagnostics. The dominant bottleneck is the FMSR server's N×M LLM call matrix (up to 63 sequential calls, 559–894 s wall time). Hedged parallelization cuts worst-case time by 36× (559 s → 15.5 s). INT4-quantized Llama 3.2 3B matches the accuracy of the default WatsonX 70B cloud baseline while using 2.0 GB of memory (−97%).


Optimization Summary

| Strategy | Best Result | Status |
| --- | --- | --- |
| Hedged parallelization | 36× (559 s → 15.5 s) | ✅ Recommended for tail-bound scenarios |
| Adaptive ceiling-start | 20× | ✅ Best for short scenarios (<15 calls) |
| Parallel dispatch | ~2× | ✅ Reliable baseline parallelization |
| DB context prefetching | 5.7× | ⚠️ Conditional: hurts fast/cross-asset scenarios |
| LRU caching | 40.3% tail improvement | ✅ Reduces repeated sensor metadata retrieval |
| INT4 quantization (Llama 3.2 3B) | 0.675 acc, 2.0 GB | ✅ Pareto-optimal, matches 70B cloud baseline |

Set the dispatch strategy via FMSR_STRATEGY=hedged (or parallel, adaptive_ceiling, sequential).


The Bottleneck: FMSR N×M LLM Calls

The FMSRAgent's get_failure_mode_sensor_mapping tool executes one LLM API call per (asset, failure_mode, sensor) triple. For Chiller 6 at site MAIN this produces 45–63 calls per scenario. Under sequential execution a single stalled WatsonX call (30–90 s) blocks all subsequent calls, making the worst-case pipeline take nearly 15 minutes.

FMSR call_relevancy flowchart


Optimization Details

1 · Parallelization Strategies (src/servers/fmsr/main.py)

Four strategies are implemented, selectable via FMSR_STRATEGY:

Sequential: one call at a time. Any stall blocks everything. Baseline only.

Parallel: dispatches all calls with asyncio.gather, bounded by Semaphore(8). HTTP 429 (rate-limit) errors are retried with exponential backoff.
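
A minimal sketch of bounded parallel dispatch with backoff, under stated assumptions: `RateLimitError`, `call_with_backoff`, and `run_parallel` are hypothetical names that do not come from the repository, and the retry count and delays are illustrative. It only shows how a semaphore and exponential backoff can be combined around `asyncio.gather`.

```python
import asyncio
import random


class RateLimitError(Exception):
    """Hypothetical stand-in for an HTTP 429 response from the LLM backend."""


async def call_with_backoff(call, max_retries=5, base_delay=0.5):
    """Await call(), retrying on RateLimitError with exponential backoff and jitter."""
    for attempt in range(max_retries + 1):
        try:
            return await call()
        except RateLimitError:
            if attempt == max_retries:
                raise  # retries exhausted, surface the error to the caller
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            await asyncio.sleep(delay)


async def run_parallel(calls, concurrency=8, base_delay=0.5):
    """Run each zero-argument coroutine function in calls, at most `concurrency` at once."""
    semaphore = asyncio.Semaphore(concurrency)

    async def bounded(call):
        # The slot is held across retries, so backing off also throttles new dispatches.
        async with semaphore:
            return await call_with_backoff(call, base_delay=base_delay)

    # gather returns results in the same order as the input calls.
    return await asyncio.gather(*(bounded(call) for call in calls))
```

Holding the semaphore slot during backoff is one possible design choice; releasing it while sleeping would keep throughput higher but puts more pressure on a rate-limited backend.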

Adaptive ceiling-start: inverts AIMD's starting point. It starts at maximum concurrency, halves the limit on a 429, and increments it by 1 on success. This avoids the ramp-up penalty of standard AIMD.
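
The update rule above can be sketched as a small state machine. `CeilingStartLimiter` and its method names are hypothetical, as are the default ceiling and floor; the repository's controller may track more state than this.

```python
class CeilingStartLimiter:
    """Concurrency limit that starts at the ceiling and adapts AIMD-style."""

    def __init__(self, ceiling=8, floor=1):
        self.ceiling = ceiling
        self.floor = floor
        self.limit = ceiling  # start at the ceiling instead of ramping up from 1

    def on_success(self):
        # Additive increase: one more concurrent call, never above the ceiling.
        self.limit = min(self.ceiling, self.limit + 1)

    def on_rate_limit(self):
        # Multiplicative decrease: halve on a 429, never below the floor.
        self.limit = max(self.floor, self.limit // 2)


limiter = CeilingStartLimiter(ceiling=8)
limiter.on_rate_limit()   # limit is now 4
limiter.on_success()      # limit is now 5
```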

Hedged: fires a duplicate request after 8 s of silence (the 99th-percentile non-stalled latency); whichever request finishes first wins, and the other is cancelled. Cost: ~5–15% extra tokens. Benefit: caps p95 latency at ~16 s.
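
A minimal sketch of a hedged request, assuming `make_request` is a zero-argument coroutine function that issues one LLM call. `hedged_call` is a hypothetical name, and this version fires at most one duplicate; it illustrates the technique rather than the repository's implementation.

```python
import asyncio


async def hedged_call(make_request, hedge_after=8.0):
    """Return the first result from a primary request and, if it is slow, one duplicate."""
    primary = asyncio.create_task(make_request())
    done, _ = await asyncio.wait({primary}, timeout=hedge_after)
    if done:
        return primary.result()  # answered in time, no duplicate needed

    # The primary has been silent for hedge_after seconds: fire one duplicate.
    backup = asyncio.create_task(make_request())
    done, pending = await asyncio.wait(
        {primary, backup}, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()  # the slower request is cancelled
    await asyncio.gather(*pending, return_exceptions=True)  # let cancellation settle
    return done.pop().result()
```

With `hedge_after=8.0`, a request whose duplicate answers within the usual latency completes in roughly twice the hedge delay, which is consistent with the ~16 s cap quoted above.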

export FMSR_STRATEGY=hedged        # best for tail-bound scenarios
export FMSR_STRATEGY=adaptive_ceiling   # best for short scenarios
export FMSR_STRATEGY=parallel      # reliable general-purpose

Results (mean wall time in seconds):

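| Scenario | Sequential | Parallel | Adaptive | Hedged |
| --- | --- | --- | --- | --- |
| 106 | 92.1 | 17.3 | 17.0 | 11.0 |
| 108 | 217.1 | 91.4 | 67.9 | 13.8 |
| 109 | 566.0 | 218.8 | 84.6 | 43.0 |
| 110 | 444.6 | 190.7 | 54.0 | 18.8 |
| 112 | 75.5 | 43.6 | 15.7 | 11.0 |
| 114 | 559.4 | 193.4 | 64.2 | 15.5 |
| 120 | 362.8 | 226.3 | 92.4 | 44.6 |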
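
As a worked check of the headline figure, the snippet below recomputes per-scenario speedups from the wall times in the results table. The numbers are copied from the table; the helper name `speedups` is only illustrative.

```python
# Mean wall times in seconds, copied from the results table above.
results = {
    106: {"sequential": 92.1, "parallel": 17.3, "adaptive": 17.0, "hedged": 11.0},
    108: {"sequential": 217.1, "parallel": 91.4, "adaptive": 67.9, "hedged": 13.8},
    109: {"sequential": 566.0, "parallel": 218.8, "adaptive": 84.6, "hedged": 43.0},
    110: {"sequential": 444.6, "parallel": 190.7, "adaptive": 54.0, "hedged": 18.8},
    112: {"sequential": 75.5, "parallel": 43.6, "adaptive": 15.7, "hedged": 11.0},
    114: {"sequential": 559.4, "parallel": 193.4, "adaptive": 64.2, "hedged": 15.5},
    120: {"sequential": 362.8, "parallel": 226.3, "adaptive": 92.4, "hedged": 44.6},
}


def speedups(row):
    """Speedup of each strategy relative to sequential execution."""
    base = row["sequential"]
    return {name: base / t for name, t in row.items() if name != "sequential"}


# Scenario with the largest hedged speedup among the rows listed here.
best = max(results, key=lambda s: speedups(results[s])["hedged"])
print(best, round(speedups(results[best])["hedged"], 1))
```

Among these seven scenarios, scenario 114 gives the largest hedged speedup (559.4 s / 15.5 s, about 36×).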

2 · No-Tool Inference (src/workflow/executor.py)

When the planner generates a step with no tool call, the executor previously returned a static expected_output string. The new _answer_no_tool_step() function instead:

  1. Tries to deterministically extract the answer from prior step JSON (e.g., parse an assets response for the asset ID, with zero LLM calls)
  2. Falls back to a minimal one-line LLM call using only the dependency context

This eliminates hallucinated intermediate values (e.g., "Chiller_6_id") that would corrupt downstream tool arguments, improving accuracy across all other optimization tracks.
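
The two-stage fallback can be sketched as follows. This is an illustration only: the JSON shape (`assets` list with `name` and `id` fields), the function names, and the `llm_fallback` callable are assumptions, not the actual `_answer_no_tool_step()` code.

```python
import json


def extract_asset_id(prior_outputs, asset_name):
    """Look for a matching asset in prior step JSON; return its id or None."""
    for raw in prior_outputs:
        try:
            payload = json.loads(raw)
        except (TypeError, ValueError):
            continue  # this step output is not JSON, skip it
        if not isinstance(payload, dict):
            continue
        assets = payload.get("assets")  # assumed response shape
        if not isinstance(assets, list):
            continue
        for asset in assets:
            if isinstance(asset, dict) and asset.get("name") == asset_name:
                return asset.get("id")
    return None


def answer_no_tool_step(prior_outputs, asset_name, llm_fallback):
    """Prefer deterministic extraction; call the LLM only when extraction fails."""
    value = extract_asset_id(prior_outputs, asset_name)
    if value is not None:
        return value  # zero LLM calls
    context = "\n".join(str(output) for output in prior_outputs)
    return llm_fallback(context)  # minimal call using only dependency context
```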


3 · Deterministic Argument Resolution (src/workflow/executor.py)

The planner sometimes hallucinates tool argument values instead of using {step_N} placeholders. Two correction layers run before every tool call:

  • _infer_param(): extracts values deterministically from prior step JSON, trying an exact key match first and then an alias match (_ARG_ALIASES). Falls back to the LLM only if extraction fails.
  • _fix_hardcoded_args(): corrects hardcoded-looking IDs (e.g., "Chiller_6_id") even when no placeholder was used, by scanning all prior step responses.
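
A minimal sketch of the exact-then-alias lookup, under stated assumptions: the alias table below is hypothetical (the project's real `_ARG_ALIASES` may differ), and `find_key` and `infer_param` are illustrative names rather than the repository's functions.

```python
# Hypothetical alias table; the project's real _ARG_ALIASES may differ.
ARG_ALIASES = {
    "asset_id": ("id", "assetnum"),
    "site_name": ("site", "siteid"),
}


def find_key(obj, key):
    """Depth-first search of nested dicts and lists; return the first value for key."""
    if isinstance(obj, dict):
        if key in obj:
            return obj[key]
        children = obj.values()
    elif isinstance(obj, list):
        children = obj
    else:
        return None
    for child in children:
        found = find_key(child, key)
        if found is not None:
            return found
    return None


def infer_param(param, prior_json):
    """Try the exact parameter name first, then each alias, across the whole payload."""
    for candidate in (param, *ARG_ALIASES.get(param, ())):
        value = find_key(prior_json, candidate)
        if value is not None:
            return value
    return None  # the caller would fall back to an LLM here
```

Searching the whole payload for the exact key before trying any alias means a nested exact match still wins over a top-level alias match.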

The extended placeholder regex also handles LLM-generated variants such as {step_1[0]} and {step_1[?].field}:

_PLACEHOLDER_RE = re.compile(r"\{step_(\d+)(?:\[[^\]]*\])?(?:\.\w+)?\}")
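
For illustration, the pattern can be exercised on a few argument strings. The helper `referenced_steps` is a hypothetical name added here only to show which step numbers the regex captures.

```python
import re

_PLACEHOLDER_RE = re.compile(r"\{step_(\d+)(?:\[[^\]]*\])?(?:\.\w+)?\}")


def referenced_steps(arg):
    """Return the step numbers referenced by placeholders in an argument string."""
    return [int(number) for number in _PLACEHOLDER_RE.findall(arg)]


print(referenced_steps("{step_1}"))                 # [1]
print(referenced_steps("{step_1[0]}"))              # [1]
print(referenced_steps("{step_2[?].field}"))        # [2]
print(referenced_steps("Chiller_6_id"))             # []
```

A hardcoded value such as "Chiller_6_id" produces no match, which is the case the second correction layer is described as handling.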

4 · DB Context Prefetching / LRU Caching

Prefetching (src/workflow/runner.py, fetch_db_context()): queries sites, assets, sensors, and failure modes upfront and injects them into the planner prompt so it can skip redundant discovery steps.

runner = PlanExecuteRunner(llm=llm, prefetch=True)

When it helps vs. hurts:

| Scenario | No Cache | Prefetch | Δ |
| --- | --- | --- | --- |
| 120 | 317.4 s | 55.5 s | 5.7× faster |
| 118 | 182.8 s | 61.1 s | 3.0× faster |
| 107 | 42.1 s | 239.8 s | 5.7× slower |
| 117 | 38.8 s | 436.5 s | 11.2× slower |

Prefetching only fetches Chiller data. Wind Turbine scenarios therefore receive the wrong context, which causes the planner to skip correct tool calls and drops accuracy from 0.60 → 0.43. Use it only when scenarios are known to be slow (Execute phase >200 s) and the asset scope matches.

LRU caching (src/servers/iot/cache.json, src/servers/fmsr/cache.json): caches recently accessed sensor/asset metadata at the tool-server layer. LRU is preferred over LFU because industrial workloads are dynamic: recency predicts reuse better than long-term frequency. Result: 40.3% wall-time reduction on tail cases (≥200 s scenarios).
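
To make the eviction policy concrete, here is a minimal in-memory LRU sketch built on `collections.OrderedDict`. It is an assumption-laden illustration: the repository persists its caches to cache.json files, and the `LRUCache` class, its capacity, and the example keys are hypothetical.

```python
from collections import OrderedDict


class LRUCache:
    """Small least-recently-used cache: a hit refreshes recency, overflow evicts the oldest."""

    def __init__(self, capacity=128):
        self.capacity = capacity
        self._items = OrderedDict()

    def get(self, key, default=None):
        if key not in self._items:
            return default
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # drop the least recently used entry


cache = LRUCache(capacity=2)
cache.put(("MAIN", "Chiller 6"), ["Supply Temperature"])
cache.put(("MAIN", "Chiller 9"), ["Condenser Water Flow"])
cache.get(("MAIN", "Chiller 6"))           # refreshes Chiller 6
cache.put(("MAIN", "Chiller 3"), [])       # evicts Chiller 9, the least recently used
```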

Cache comparison


5 · Opt 2: Query-Driven Cell Pruning (src/workflow/pruner.py)

Before dispatching the N×M relevancy matrix, the pruner scores each (failure_mode, sensor) pair against the user query using an overlap coefficient:

score = |query_tokens ∩ name_tokens| / min(|query_tokens|, |name_tokens|)

Pairs below PRUNE_THRESHOLD (default 0.30) are discarded, reducing the number of LLM calls proportionally.

export PRUNE_THRESHOLD=0.30
runner = PlanExecuteRunner(llm=llm, prune_fmsr=True, prune_threshold=0.30)
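
A minimal sketch of the overlap-coefficient score and the threshold filter. The tokenizer (lowercased alphanumeric runs) and the choice to join the failure mode and sensor names into one token set are assumptions made for this example; `tokens`, `overlap_coefficient`, and `prune_pairs` are hypothetical names.

```python
import re


def tokens(text):
    """Lowercased alphanumeric tokens as a set (assumed tokenization)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_coefficient(query, name):
    """|Q ∩ N| / min(|Q|, |N|), or 0.0 when either token set is empty."""
    q, n = tokens(query), tokens(name)
    if not q or not n:
        return 0.0
    return len(q & n) / min(len(q), len(n))


def prune_pairs(query, pairs, threshold=0.30):
    """Keep (failure_mode, sensor) pairs scoring at least the threshold."""
    return [
        (failure_mode, sensor)
        for failure_mode, sensor in pairs
        if overlap_coefficient(query, f"{failure_mode} {sensor}") >= threshold
    ]


query = "Which sensors detect compressor overheating on Chiller 6?"
pairs = [
    ("Compressor Overheating", "Supply Temperature"),  # 2 of 4 name tokens match: 0.5
    ("Fan Imbalance", "Vibration"),                    # no shared tokens: 0.0
]
kept = prune_pairs(query, pairs)
```

Because the denominator is the smaller token set, short names that are fully contained in the query score 1.0 regardless of query length.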

6 · Quantization-Aware Model Substitution (profiling/)

The _call_relevancy decision is binary (Yes/No on line 1 of a 3-line response). We tested whether a 70B model is necessary.
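
Since only the first line carries the decision, parsing can be kept simple. The sketch below is hypothetical: the content of the second and third lines is not specified here, so the example response text is invented for illustration, and `parse_relevancy` is not the repository's function.

```python
import re


def parse_relevancy(response):
    """Map the first line of a response to True (Yes), False (No), or None if unclear."""
    lines = response.strip().splitlines()
    if not lines:
        return None
    first_word = re.search(r"[a-z]+", lines[0].lower())
    if first_word is None:
        return None
    if first_word.group() == "yes":
        return True
    if first_word.group() == "no":
        return False
    return None


# Illustrative 3-line response; the wording of lines 2 and 3 is an assumption.
example = "Yes\nThe sensor tracks the affected quantity.\nConfidence: high"
print(parse_relevancy(example))
```

Returning None for anything other than a clear Yes or No lets the caller decide whether to retry or to count the cell as not relevant.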

Experiment: 5 model families × 3 precision levels × 20 FMSR scenarios = 300 total runs via Ollama (CPU-only, Apple Silicon) routed through LiteLLM.

| Model | Precision | Accuracy | Latency | Memory |
| --- | --- | --- | --- | --- |
| WatsonX llama-3-3-70b | cloud | 0.660 | 7.78 s | N/A |
| Llama 3.2 3B | INT4 | 0.675 | 8.09 s | 2.0 GB |
| Llama 3.2 3B | FP16 | 0.650 | 7.79 s | 6.0 GB |
| Llama 3.2 3B | INT8 | 0.617 | 7.99 s | 3.4 GB |
| Qwen2.5 7B | INT4 | 0.642 | 8.04 s | 4.7 GB |
| Granite 3.2 8B | INT4 | 0.633 | 7.93 s | 4.9 GB |
| DeepSeek-R1 7B | INT4 | 0.633 | 7.75 s | 4.7 GB |
| Gemma2 9B | INT4 | 0.633 | 7.84 s | 5.4 GB |

All models vs API baseline

Key findings:

  • INT4 is Pareto-optimal: best or equal accuracy at lowest memory, within 0.30 s of FP16 latency, across all five families
  • Llama 3.2 3B INT4 matches the 70B cloud baseline: 0.675 vs. 0.660 (+1.5 pp) at 97% less memory
  • Parameter count does not predict accuracy: Gemma2 9B (largest) scores the same 0.633 as DeepSeek-R1 7B
  • Reasoning models degrade at FP16: DeepSeek-R1 7B FP16 (0.617) is worse than its INT4 variant; longer chain-of-thought traces can contradict the binary answer
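
As a worked check of the Pareto claim, the snippet below computes the accuracy-versus-memory Pareto front over the local-model rows published in the table above (the cloud baseline is excluded because its memory is N/A). It covers only those rows, not the full 300-run sweep, and `pareto_front` is an illustrative helper.

```python
# (model, precision, accuracy, memory_gb), copied from the local-model rows of the table.
rows = [
    ("Llama 3.2 3B", "INT4", 0.675, 2.0),
    ("Llama 3.2 3B", "FP16", 0.650, 6.0),
    ("Llama 3.2 3B", "INT8", 0.617, 3.4),
    ("Qwen2.5 7B", "INT4", 0.642, 4.7),
    ("Granite 3.2 8B", "INT4", 0.633, 4.9),
    ("DeepSeek-R1 7B", "INT4", 0.633, 4.7),
    ("Gemma2 9B", "INT4", 0.633, 5.4),
]


def pareto_front(rows):
    """Rows not dominated by another row (higher-or-equal accuracy, lower-or-equal memory)."""
    front = []
    for row in rows:
        dominated = any(
            other[2] >= row[2] and other[3] <= row[3]
            and (other[2] > row[2] or other[3] < row[3])
            for other in rows
        )
        if not dominated:
            front.append(row)
    return front


front = pareto_front(rows)
print(front)
```

On these rows the front contains a single entry, Llama 3.2 3B INT4, because it has both the highest accuracy and the lowest memory footprint.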

Per-scenario accuracy: Llama 3.2 3B

Quantization experiments use 1 run per cell; the INT4 vs. FP16 gap is not statistically significant at this sample size. Results indicate that quantization is well-suited for binary FMSR classification, not that INT4 is inherently more accurate.


Quick Start (Optimizations)

uv sync

# Run with hedged parallelization (best for FMSR-heavy scenarios)
FMSR_STRATEGY=hedged uv run plan-execute "Which sensors detect Chiller 6 failure modes?"

# Run with DB prefetch + pruning enabled
uv run python -c "
import asyncio
from llm import LiteLLMBackend
from workflow.runner import PlanExecuteRunner

async def main():
    runner = PlanExecuteRunner(
        llm=LiteLLMBackend('openai/llama-3.3-70b-versatile'),
        prefetch=True,
        prune_fmsr=True,
        prune_threshold=0.30,
    )
    result = await runner.run('What sensors detect Chiller 6 failure modes?')
    print(result.answer)

asyncio.run(main())
"

# Full benchmark suite (139 scenarios, 3 runs each, results → benchmarking_mcp.jsonl)
uv run python src/benchmarking/run_mcp.py --runs 3 --warmup 1

# Quantization benchmark (requires Ollama)
uv run python profiling/benchmark_runner.py

Environment Variables

| Variable | Default | Description |
| --- | --- | --- |
| FMSR_STRATEGY | sequential | sequential / parallel / adaptive_ceiling / hedged |
| FMSR_CONCURRENCY | 8 | Max concurrent calls for parallel/hedged |
| FMSR_MODEL_ID | watsonx/meta-llama/llama-3-3-70b-instruct | LLM backend for relevancy calls |
| PRUNE_THRESHOLD | 0.30 | Overlap coefficient threshold for cell pruning |
| LITELLM_API_KEY | (none) | API key for LiteLLM proxy |
| LITELLM_BASE_URL | (none) | Base URL for LiteLLM proxy |
| WATSONX_APIKEY | (none) | IBM WatsonX API key |
| WATSONX_PROJECT_ID | (none) | IBM WatsonX project ID |
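
One way these variables could be read with their documented defaults is sketched below. The variable names and defaults come from the table above; the `FmsrSettings` dataclass and `load_settings` function are hypothetical and are not how the repository necessarily loads its configuration.

```python
import os
from dataclasses import dataclass

VALID_STRATEGIES = ("sequential", "parallel", "adaptive_ceiling", "hedged")


@dataclass(frozen=True)
class FmsrSettings:
    strategy: str
    concurrency: int
    model_id: str
    prune_threshold: float


def load_settings(env=os.environ):
    """Read FMSR settings from a mapping, falling back to the documented defaults."""
    strategy = env.get("FMSR_STRATEGY", "sequential")
    if strategy not in VALID_STRATEGIES:
        raise ValueError(f"unknown FMSR_STRATEGY: {strategy!r}")
    return FmsrSettings(
        strategy=strategy,
        concurrency=int(env.get("FMSR_CONCURRENCY", "8")),
        model_id=env.get("FMSR_MODEL_ID", "watsonx/meta-llama/llama-3-3-70b-instruct"),
        prune_threshold=float(env.get("PRUNE_THRESHOLD", "0.30")),
    )
```

Passing a plain dict instead of `os.environ` makes the loader easy to exercise in tests, for example `load_settings({"FMSR_STRATEGY": "hedged"})`.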

Repository Structure (Optimizations)

src/workflow/
├── executor.py      # Step execution, arg resolution, hardware profiling, no-tool inference
├── runner.py        # PlanExecuteRunner: prefetch, prune, plan, execute, summarize
├── planner.py       # LLM plan generation, topology injection, DB context injection
├── models.py        # PlanStep, StepResult, HardwareMetrics dataclasses
├── profiler.py      # HardwareProfiler: CPU%, RAM, IO per tool call
├── pruner.py        # Overlap-coefficient FMSR cell pruner (Opt 2)
└── timing.py        # HardwareMonitor, caching timing benchmarks

src/servers/fmsr/main.py     # FMSR server with 4 parallelization strategies
src/benchmarking/run_mcp.py  # Full 139-scenario benchmark harness (crash-safe JSONL)
src/evaluation/              # LLM judge, tool-call accuracy metrics, topology loader

profiling/
├── benchmark_runner.py  # (model, precision, scenario) quantization sweep
├── charts.py            # Chart generation
└── charts/              # Accuracy vs memory scatter, per-model heatmaps, ...

artifacts/timing/        # Caching comparison SVGs and raw timing JSON
eval_results/            # Baseline and topology-v1 evaluation run JSONs

Below: original IBM Research AssetOpsBench documentation.


AI Agents for Industrial Asset Operations & Maintenance

AssetOps MultiAgentBench EMNLP 2025 NeurIPS 2025 AAAI 2026

📘 Tutorials: Learn more from our detailed guides:
ReActXen IoT Agent (EMNLP 2025) | FailureSensorIQ (NeurIPS 2025) | AssetOpsBench Lab (AAAI 2026) | Spiral (AAAI 2026) | AssetOpsBench Technical Material

📄 Paper | 🤗 HF-Dataset | 📢 IBM Blog | 🤗 HF Blog | Contributors

Kaggle Hugging Face Open In Colab


📢 Call for Scenario Contribution

We are expanding AssetOpsBench to cover a broader range of industrial challenges. We invite researchers and practitioners to contribute new scenarios, particularly in the following areas:

  • Asset Classes: Turbines, HVAC Systems, Pumps, Transformers, CNC Machines, Robotics, Engines, and so on.
  • Task Domains: Prognostics and Health Management, Remaining Useful Life (RUL) estimation, Root Cause Analysis (RCA), Diagnostic Analysis, and Predictive Maintenance.

How to contribute:

  1. Define your scenario following our Utterance Guideline and Ground Truth Guideline.

  2. Explore the Hugging Face dataset for examples.

  3. Submit a Pull Request or open an Issue with the tag new-scenario.

  4. Contact us via email if you have any questions:


Resources


📑 Table of Contents

  1. Announcements
  2. Introduction
  3. Datasets
  4. AI Agents
  5. Multi-Agent Frameworks
  6. System Diagram
  7. Leaderboards
  8. Docker Setup
  9. Talks & Events
  10. External Resources
  11. Contributors

Announcements (Papers, Invited Talks, etc.)

  • 📊 Dataset Update: AssetOpsBench has expanded to cover a wider variety of 9 asset classes (Chiller, AHU, Pump, Motor, Bearing, Engine, Rotors, Boilers, Turbine, etc.) and various tasks (Remaining Useful Life, Fault Classification, Rule Monitoring, etc.)
    Hugging Face Dataset
    Special thanks to the primary contributors: 👥 @DeveloperMindset123, @ChathurangiShyalika, @Fabio-Lorenzi1

  • 📰 AAAI-2026: SPIRAL: Symbolic LLM Planning via Grounded and Reflective Search Authors
    Code

  • 🎯 AAAI-2026 Lab: From Inception to Productization: Hands-on Lab for the Lifecycle of Multimodal Agentic AI in Industry 4.0
    Website Authors AAAI 2026 Slides

  • 📰 AABA4ET/AAAI-2026: Agentic Code Generation for Heuristic Rules in Equipment Monitoring Authors

  • 📰 IAAI/AAAI-2026: Diversity Meets Relevancy: Multi-Agent Knowledge Probing for Industry 4.0 Applications Authors

  • 📰 IAAI/AAAI-2026: Deployed AI Agents for Industrial Asset Management: CodeReAct Framework for Event Analysis and Work Order Automation Authors

  • 📰 AAAI-2026 Demo: AssetOpsBench-Live: Privacy-Aware Online Evaluation of Multi-Agent Performance in Industrial Operations
    Authors Demo Video

  • 📰 NeurIPS-2025 Social: Evaluating Agentic Systems
    Talk: Building Reliable Agentic Benchmarks: Insights from AssetOpsBench Total Registered Users: 2000+ Conference
    Speaker
    Attend on Luma

  • 🕓 Past Event: 2025-10-03 – 2-Hour Workshop: AI Agents and Their Role in Industry 4.0 Applications
    Event Host

  • 🏆 Accepted Papers: Parts of papers are accepted at NeurIPS 2025, EMNLP 2025 Research Track, and EMNLP 2025 Industry Track.

  • 🚀 2025-09-01: CODS 2025 Competition launched – Access AI Agentic Challenge AssetOpsBench-Live.

  • 📦 2025-06-01: AssetOpsBench v1.0 released with 141 industrial Scenarios.

✨ Stay tuned for new tracks, competitions, and community events.


Introduction

AssetOpsBench is a unified framework for developing, orchestrating, and evaluating domain-specific AI agents in industrial asset operations and maintenance.

It provides:

  • 4 domain-specific agents
  • 2 multi-agent orchestration frameworks

Designed for maintenance engineers, reliability specialists, and facility planners, it allows reproducible evaluation of multi-step workflows in simulated industrial environments.


Datasets: 141 Scenarios

AssetOpsBench scenarios span multiple domains:

| Domain | Example Task |
| --- | --- |
| IoT | "List all sensors of Chiller 6 in MAIN site" |
| FMSR | "Identify failure modes detected by Chiller 6 Supply Temperature" |
| TSFM | "Forecast 'Chiller 9 Condenser Water Flow' for the week of 2020-04-27" |
| WO | "Generate a work order for Chiller 6 anomaly detection" |

Some tasks focus on a single domain, others are multi-step end-to-end workflows.
Explore all scenarios HF-Dataset.


AI Agents

Domain-Specific Agents (Important tools)

  • IoT Agent: get_sites, get_history, get_assets, get_sensors
  • FMSR Agent: get_sensors, get_failure_modes, get_failure_sensor_mapping
  • TSFM Agent: forecasting, timeseries_anomaly_detection
  • WO Agent: generate_work_order

Multi-Agent Frameworks (Blue Prints)

  • MetaAgent: reAct-based single-agent-as-tool orchestration
  • AgentHive: plan-and-execute sequential workflow

MCP Environment

The src/ directory contains MCP servers and a plan-execute runner built on the Model Context Protocol. See INSTRUCTIONS.md for setup, usage, and testing.


Leaderboards

  • Evaluated with 7 Large Language Models
  • Trajectories scored using LLM Judge (Llama-4-Maverick-17B)
  • 6-dimensional criteria measure reasoning, execution, and data handling

Example: MetaAgent leaderboard

meta_agent_leaderboard


Run AssetOpsBench in Docker

  • Please Refer to the
  • Pre-built Docker Images: assetopsbench-basic (minimal) & assetopsbench-extra (full)
  • Conda environment: assetopsbench
  • Full setup guide
cd /path/to/AssetOpsBench
chmod +x benchmark/entrypoint.sh
docker-compose -f benchmark/docker-compose.yml build
docker-compose -f benchmark/docker-compose.yml up

External Resources


Star History Chart


Contributors

Thanks goes to these wonderful people ✨

  • DhavalRepo18 💻 📖
  • ShuxinLin 💻 📖
  • jtrayfield 💻 📖
  • nianjunz 💻 📖
  • ChathurangiShyalika 💻 📖
  • PUSHPAK-JAISWAL 💻 📖
  • bradleyjeck 💻 📖
  • florenzi002 💻 📖
  • kushwaha001 💻
  • Mohit Gupta 📖
  • Ayan Das 📖 💻
