Performance analysis: EventEngine serial processing is the primary bottleneck, not DRAM clocking

## Labels

`performance`, `analysis`

---

## Body

Hi @Halifuda, great work on Xerxes! I've been studying the codebase for CXL interconnect simulation and did some performance profiling. Sharing my findings in case they're useful.

### Summary

The main simulation loop bottleneck is **`EventEngine::step()`** (serial event processing), **not** `clock_all_mems_to_tick()` (DRAM clocking). DRAM clocking accounts for roughly 5% or less of total wall-clock time in typical AE workloads.

### Evidence

**Setup:** 8 hosts, 8 memories, `sample-bus.py --scale 8 --ratio 0.25`, built with MinGW GCC 6.3.0, Release mode.

**Test 1 — clock_granu has almost no impact on runtime:**

| clock_granu | Baseline runtime (avg of 3) |
|:-----------:|:---------------------------:|
| 1           | ~1,485 ms                   |
| 50          | ~1,485 ms                   |

If DRAM clocking were the bottleneck, increasing `clock_granu` from 1 to 50 (50x more `ClockTick()` calls per memory per iteration) should significantly increase runtime. It doesn't, which suggests that either most `clock()` calls hit the early-return path (`issued.size() == 0`) or `ClockTick()` itself is very fast relative to `step()`.

**Test 2 — Parallelizing DRAM clocking yields no speedup:**

I implemented parallel DRAM clocking with persistent worker threads (one per memory) and deferred callbacks to avoid races on `PktStatsTable` and `EventEngine`. Results:

| Config               | Baseline | Parallel (8 threads) | Speedup |
|:--------------------:|:--------:|:--------------------:|:-------:|
| clock_granu=1, N=8   | 1,485 ms | 1,421 ms             | 1.04x   |
| clock_granu=50, N=8  | 1,485 ms | 1,558 ms             | 0.95x   |

**Correctness:** CSV output is **bit-identical** between baseline and parallel versions (all 4,008 packets, 15 latency columns).
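
For reference, the deferred-callback idea can be sketched roughly as below. This is a simplified illustration, not the code I benchmarked: `MemModel`, `clock_all_parallel`, and `completed_log` are hypothetical names, and it spawns threads per call instead of keeping persistent workers. Each worker only records its callbacks, and the main thread replays them in memory order after the join, so shared state is never touched concurrently.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical stand-in for one DRAM model; not the Xerxes or DRAMsim3 API.
struct MemModel {
    int id = 0;
    int pending = 0;  // requests that complete on the next clock

    // Advance one tick. Completions are recorded in `deferred` instead of
    // touching shared state from the worker thread.
    void clock(std::vector<std::function<void()>>& deferred,
               std::vector<int>& completed_log) {
        if (pending == 0) return;  // early-return path: nothing issued
        --pending;
        const int mem_id = id;
        deferred.push_back([mem_id, &completed_log] {
            completed_log.push_back(mem_id);  // executed on the main thread
        });
    }
};

// Clock every memory on its own thread, then replay the deferred callbacks
// serially in memory order so the result does not depend on thread timing.
inline void clock_all_parallel(std::vector<MemModel>& mems,
                               std::vector<int>& completed_log) {
    std::vector<std::vector<std::function<void()>>> deferred(mems.size());
    std::vector<std::thread> workers;
    workers.reserve(mems.size());
    for (std::size_t i = 0; i < mems.size(); ++i) {
        workers.emplace_back([&mems, &deferred, &completed_log, i] {
            mems[i].clock(deferred[i], completed_log);
        });
    }
    for (auto& w : workers) w.join();
    for (auto& per_mem : deferred)
        for (auto& cb : per_mem) cb();
}
```

Replaying callbacks in a fixed order is what keeps the output identical to the serial version; the cost is one synchronization point per call, which is consistent with the lack of speedup when most `clock()` calls return early.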

### Root cause analysis

Per main-loop iteration:

```cpp
while (clock_cnt < max_clock) {
    step();                          // ← ~95% of time (serial, 1 event at a time)
    clock_all_mems_to_tick(curt);    // ← ~5% of time (mostly early-return)
}
```
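
A split like the one annotated above can be checked by timing the two phases separately. The following is a minimal sketch under my own assumptions: `PhaseTimer` and `profile_loop` are hypothetical helpers, and the two callables stand in for `step()` and `clock_all_mems_to_tick()`.

```cpp
#include <cassert>
#include <chrono>

// Accumulates wall-clock time for the two phases of the main loop.
struct PhaseTimer {
    using Clock = std::chrono::steady_clock;
    Clock::duration step_time = Clock::duration::zero();
    Clock::duration mem_time = Clock::duration::zero();

    // Runs f once and returns how long it took.
    template <typename F>
    static Clock::duration measure(F&& f) {
        const auto t0 = Clock::now();
        f();
        return Clock::now() - t0;
    }

    // Fraction of the measured time spent in the step phase, in [0, 1].
    double step_fraction() const {
        const auto total = step_time + mem_time;
        if (total.count() == 0) return 0.0;
        return static_cast<double>(step_time.count()) /
               static_cast<double>(total.count());
    }
};

// Runs `iterations` loop iterations, timing both phases separately.
// `step` and `clock_mems` are placeholders for the real calls.
template <typename Step, typename ClockMems>
PhaseTimer profile_loop(int iterations, Step&& step, ClockMems&& clock_mems) {
    PhaseTimer t;
    for (int i = 0; i < iterations; ++i) {
        t.step_time += PhaseTimer::measure(step);
        t.mem_time += PhaseTimer::measure(clock_mems);
    }
    return t;
}
```

Reading the clock twice per call adds overhead of its own, so for very short calls the resulting fraction is only a rough guide; a sampling profiler would be the more reliable cross-check.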

The global mutable state that enforces serial execution:

| Global state                 | Location                 | Why it blocks parallelism |
|------------------------------|--------------------------|--------------------|
| `EventEngine::events`        | `xerxes_standalone.cc`   | All device events in one `multimap` |
| `PktStatsTable::get()`       | `def.hh:117`             | All packets write to same `unordered_map` |
| `PktBuilder::id` (static)    | `def.hh:228`             | Non-atomic auto-increment |
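
Of the three, the id counter is the simplest to make thread-safe. A minimal sketch, assuming the id is a plain integer counter (`next_packet_id` is a hypothetical name, not the existing `PktBuilder` interface):

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical thread-safe replacement for a static auto-increment id.
// fetch_add returns the previous value, so every caller gets a distinct id
// even when several threads call this at the same time.
inline std::uint64_t next_packet_id() {
    static std::atomic<std::uint64_t> counter{0};
    return counter.fetch_add(1, std::memory_order_relaxed);
}
```

This only guarantees uniqueness. Which packet receives which id would still depend on thread timing, so output keyed by packet id would no longer be reproducible run to run unless ids are assigned in a deterministic order.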

### Potential optimization path

True speedup requires parallelizing `EventEngine::step()` itself (what I call "Level 3"):

- Events at the **same tick** targeting **different devices** could theoretically run concurrently
- Requires: per-device mutex, concurrent event queue, thread-safe `TopoNode::buffer`
- Estimated effort: 500+ lines of refactoring
- Risk: event ordering differences may cause non-deterministic tie-breaking in `Timeline::transfer_time()`, leading to small numerical differences
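
To make the first bullet concrete, here is a rough sketch of running same-tick events for different devices concurrently. Everything in it is an assumption for illustration: the `Event` struct, the `step_parallel` function, and the idea that each handler touches only its own device's state. It also assumes handlers do not schedule new events, which the real engine would need a concurrent queue to support.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <map>
#include <thread>
#include <utility>
#include <vector>

using Tick = std::uint64_t;

// Hypothetical event record; the real EventEngine layout may differ.
struct Event {
    int device;                 // index of the device the event targets
    std::function<void()> run;  // must touch only that device's state
};

// Executes every event scheduled for the earliest tick. Events for different
// devices run concurrently; events for one device keep their queue order.
// Returns the number of events executed.
inline std::size_t step_parallel(std::multimap<Tick, Event>& events) {
    if (events.empty()) return 0;
    const Tick now = events.begin()->first;
    const auto range = events.equal_range(now);

    // Bucket the same-tick events by target device.
    std::map<int, std::vector<std::function<void()>>> per_device;
    std::size_t count = 0;
    for (auto it = range.first; it != range.second; ++it, ++count)
        per_device[it->second.device].push_back(std::move(it->second.run));
    events.erase(range.first, range.second);

    // One worker per device that has work at this tick.
    std::vector<std::thread> workers;
    workers.reserve(per_device.size());
    for (auto& bucket : per_device) {
        auto* fns = &bucket.second;
        workers.emplace_back([fns] { for (auto& f : *fns) f(); });
    }
    for (auto& w : workers) w.join();
    return count;
}
```

Spawning threads on every tick would likely cost more than it saves at this event granularity, so a real implementation would want a persistent pool and a threshold below which it falls back to serial execution.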

I have working code for the DRAM parallelization (Level 1) with deferred callbacks. It doesn't help for the current workloads but could be useful as infrastructure if the DRAM-to-event ratio changes in future configurations.

### Separate issue: `ext/toml.hpp` fails to build on MinGW

The vendored toml11 (`ext/toml.hpp`) fails to build on MinGW: it takes the POSIX branch (`localtime_r`/`gmtime_r`) because one of the feature-test macros in the condition (such as `_POSIX_SOURCE`) is defined, but this MinGW toolchain doesn't provide those functions. One-line fix at line 972:

```diff
-#elif (defined(_POSIX_C_SOURCE) && _POSIX_C_SOURCE >= 1) || defined(_XOPEN_SOURCE) || defined(_BSD_SOURCE) || defined(_SVID_SOURCE) || defined(_POSIX_SOURCE)
+#elif !defined(__MINGW32__) && ((defined(_POSIX_C_SOURCE) && _POSIX_C_SOURCE >= 1) || defined(_XOPEN_SOURCE) || defined(_BSD_SOURCE) || defined(_SVID_SOURCE) || defined(_POSIX_SOURCE))
```

This should probably be reported to [toml11 upstream](https://github.com/ToruNiina/toml11) as well.

### My setup

- Windows 10, MinGW GCC 6.3.0 (win32 threads)
- DRAMsim3 submodule at `2981759`
- Xerxes at `d0a5d0f`

Happy to share the parallel DRAM clocking code or discuss Level 3 approaches if you're interested. Thanks for open-sourcing this work!

