Labels
performance, analysis
Body
Hi @Halifuda, great work on Xerxes! I've been studying the codebase for CXL interconnect simulation and did some performance profiling. Sharing my findings in case they're useful.
Summary
The main simulation loop bottleneck is EventEngine::step() (serial event processing), not clock_all_mems_to_tick() (DRAM clocking). DRAM clocking accounts for roughly 5% or less of total wall-clock time in typical AE workloads.
Evidence
Setup: 8 hosts, 8 memories, sample-bus.py --scale 8 --ratio 0.25, built with MinGW GCC 6.3.0, Release mode.
Test 1 — clock_granu has almost no impact on runtime:
| clock_granu | Baseline runtime (avg of 3) |
|---|---|
| 1 | ~1,485 ms |
| 50 | ~1,485 ms |
If DRAM clocking were the bottleneck, increasing clock_granu from 1 to 50 (50x more ClockTick() calls per memory per iteration) should significantly increase runtime. It doesn't, which suggests that either most clock() calls hit the early-return path (issued.size() == 0), or ClockTick() itself is very cheap relative to step().
Test 2 — Parallelizing DRAM clocking yields no speedup:
I implemented parallel DRAM clocking with persistent worker threads (one per memory) and deferred callbacks to avoid races on PktStatsTable and EventEngine. Results:
| Config | Baseline | Parallel (8 threads) | Speedup |
|---|---|---|---|
| clock_granu=1, N=8 | 1,485 ms | 1,421 ms | 1.04x |
| clock_granu=50, N=8 | 1,485 ms | 1,558 ms | 0.95x |
Correctness: CSV output is bit-identical between baseline and parallel versions (all 4,008 packets, 15 latency columns).
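For readers unfamiliar with the deferred-callback pattern mentioned above, here is a minimal sketch of the idea. It is not the actual patch: `FakeMemory`, `clock_all_parallel`, and `completed_log` are hypothetical stand-ins, and it spawns fresh threads per call for brevity, whereas the real change uses persistent workers. The point it illustrates is that workers only append to their own private queue, and the main thread later runs those callbacks serially in a fixed order, which is one way to keep output independent of thread timing.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical stand-in for one DRAM model; not the Xerxes or DRAMsim3 API.
struct FakeMemory {
    int id = 0;
    int pending = 0;  // requests that complete on the next clock

    // Clock once. Completions are queued in `deferred` instead of touching
    // shared state from the worker thread.
    void clock(std::vector<std::function<void()>>& deferred,
               std::vector<int>& completed_log) {
        assert(pending >= 0);
        while (pending > 0) {
            --pending;
            const int mem_id = id;
            deferred.push_back([mem_id, &completed_log] {
                completed_log.push_back(mem_id);  // runs later, on the main thread
            });
        }
    }
};

// Clock all memories in parallel, then run the deferred callbacks serially
// in memory-index order so the result does not depend on thread timing.
inline void clock_all_parallel(std::vector<FakeMemory>& mems,
                               std::vector<int>& completed_log) {
    std::vector<std::vector<std::function<void()>>> deferred(mems.size());
    std::vector<std::thread> workers;
    workers.reserve(mems.size());
    for (std::size_t i = 0; i < mems.size(); ++i) {
        // Each worker touches only mems[i] and deferred[i]: no shared writes.
        workers.emplace_back([&mems, &deferred, &completed_log, i] {
            mems[i].clock(deferred[i], completed_log);
        });
    }
    for (auto& w : workers) w.join();

    // Serial drain: the only place shared state is modified.
    for (auto& queue : deferred)
        for (auto& cb : queue) cb();
}
```

The join acts as the synchronization point, so the callbacks need no locking. The cost is that every completion pays for a `std::function` allocation and a second pass, which may be part of why the parallel version is not faster when most clock calls do no work.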
Root cause analysis
Per main-loop iteration:
while (clock_cnt < max_clock) {
    step();                        // ← ~95% of time (serial, 1 event at a time)
    clock_all_mems_to_tick(curt);  // ← ~5% of time (mostly early-return)
}
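A split like the one annotated above can be measured directly by timing the two phases separately. The sketch below shows one way to do that with `std::chrono::steady_clock`. `PhaseTimes` and `profile_loop` are hypothetical names, and the two callables are placeholders for step() and clock_all_mems_to_tick(), not the real Xerxes functions.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Accumulated wall-clock time per loop phase.
struct PhaseTimes {
    std::chrono::nanoseconds step{0};
    std::chrono::nanoseconds clocking{0};

    // Fraction of measured time spent in the clocking phase (0 if nothing measured).
    double clocking_share() const {
        const auto total = step + clocking;
        return total.count() == 0
                   ? 0.0
                   : static_cast<double>(clocking.count()) /
                         static_cast<double>(total.count());
    }
};

// Run `iterations` loop iterations, timing each phase separately.
// `step_fn` and `clock_fn` stand in for step() and clock_all_mems_to_tick().
inline PhaseTimes profile_loop(int iterations,
                               const std::function<void()>& step_fn,
                               const std::function<void()>& clock_fn) {
    assert(iterations >= 0);
    using clock = std::chrono::steady_clock;
    PhaseTimes t;
    for (int i = 0; i < iterations; ++i) {
        const auto a = clock::now();
        step_fn();
        const auto b = clock::now();
        clock_fn();
        const auto c = clock::now();
        t.step += std::chrono::duration_cast<std::chrono::nanoseconds>(b - a);
        t.clocking += std::chrono::duration_cast<std::chrono::nanoseconds>(c - b);
    }
    return t;
}
```

When a phase is very short, the cost of `clock::now()` itself can dominate the measurement, so per-iteration timing like this is best treated as a rough cross-check against a sampling profiler.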
The global mutable state that enforces serial execution:
| Global state | Location | Blocks parallelism |
|---|---|---|
| EventEngine::events | xerxes_standalone.cc | All device events in one multimap |
| PktStatsTable::get() | def.hh:117 | All packets write to same unordered_map |
| PktBuilder::id (static) | def.hh:228 | Non-atomic auto-increment |
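Of the three entries above, the ID counter is the simplest to make thread-safe. The following sketch shows the general technique with `std::atomic`. `PacketIdGenerator` and `count_distinct_ids` are hypothetical names for illustration and do not reflect how PktBuilder is actually written.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>
#include <thread>
#include <vector>

// Hypothetical stand-in for a packet ID source; not the Xerxes PktBuilder.
class PacketIdGenerator {
public:
    // fetch_add returns the previous value atomically, so no two callers
    // can receive the same ID even when called from different threads.
    static std::uint64_t next() {
        return counter_.fetch_add(1, std::memory_order_relaxed);
    }

private:
    static std::atomic<std::uint64_t> counter_;
};

std::atomic<std::uint64_t> PacketIdGenerator::counter_{0};

// Draw `per_thread` IDs on each of `threads` threads and return how many
// distinct IDs were observed.
inline std::size_t count_distinct_ids(int threads, int per_thread) {
    assert(threads > 0 && per_thread > 0);
    std::vector<std::vector<std::uint64_t>> ids(static_cast<std::size_t>(threads));
    std::vector<std::thread> workers;
    for (int t = 0; t < threads; ++t) {
        workers.emplace_back([&ids, t, per_thread] {
            for (int i = 0; i < per_thread; ++i)
                ids[static_cast<std::size_t>(t)].push_back(PacketIdGenerator::next());
        });
    }
    for (auto& w : workers) w.join();

    std::set<std::uint64_t> distinct;
    for (const auto& v : ids) distinct.insert(v.begin(), v.end());
    return distinct.size();
}
```

An atomic counter guarantees uniqueness but not a reproducible assignment order: which packet gets which ID would depend on thread scheduling. If IDs appear in the CSV output, that alone could break bit-identical results, so a per-device ID range might be a better fit.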
Potential optimization path
True speedup requires parallelizing EventEngine::step() itself (what I call "Level 3"):
- Events at the same tick targeting different devices could theoretically run concurrently
- Requires: per-device mutex, concurrent event queue, thread-safe TopoNode::buffer
- Estimated effort: ~500+ lines of refactoring
- Risk: event ordering differences may cause non-deterministic tie-breaking in Timeline::transfer_time(), leading to small numerical differences
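As a starting point for discussion, the first step of such a scheme (pulling all events at the earliest tick out of the queue and grouping them by target device) could look roughly like this. The `Event` struct, the `Tick` alias, and `pop_earliest_batch` are assumptions made for illustration. The real event type in Xerxes may carry different fields.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

using Tick = std::uint64_t;

// Hypothetical event record; the real Xerxes event type may differ.
struct Event {
    int device_id;
    int payload;
};

// Remove all events at the earliest tick and group them by target device.
// Events for one device keep their insertion order, because std::multimap
// preserves insertion order among equal keys (guaranteed since C++11).
inline std::map<int, std::vector<Event>> pop_earliest_batch(
    std::multimap<Tick, Event>& queue) {
    std::map<int, std::vector<Event>> by_device;
    if (queue.empty()) return by_device;

    const Tick now = queue.begin()->first;
    const auto range = queue.equal_range(now);
    for (auto it = range.first; it != range.second; ++it)
        by_device[it->second.device_id].push_back(it->second);
    queue.erase(range.first, range.second);

    // Everything left in the queue is strictly later than the batch.
    assert(queue.empty() || queue.begin()->first > now);
    return by_device;
}
```

Each per-device vector could then be handed to a worker. The hard part is what this sketch leaves out: a handler may schedule a new event at the same tick for another device, so the batch would need to be re-extracted until the tick is exhausted, and the order in which those follow-up events are inserted is where the tie-breaking risk comes from.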
I have working code for the DRAM parallelization (Level 1) with deferred callbacks. It doesn't help for the current workloads but could be useful as infrastructure if the DRAM-to-event ratio changes in future configurations.
Separate issue: ext/toml.hpp fails to build on MinGW
The vendored toml11 (ext/toml.hpp) fails on MinGW because it enters the POSIX branch (localtime_r/gmtime_r) when _POSIX_SOURCE is defined, but MinGW doesn't provide these functions. One-line fix at line 972:
-#elif (defined(_POSIX_C_SOURCE) && _POSIX_C_SOURCE >= 1) || defined(_XOPEN_SOURCE) || defined(_BSD_SOURCE) || defined(_SVID_SOURCE) || defined(_POSIX_SOURCE)
+#elif !defined(__MINGW32__) && ((defined(_POSIX_C_SOURCE) && _POSIX_C_SOURCE >= 1) || defined(_XOPEN_SOURCE) || defined(_BSD_SOURCE) || defined(_SVID_SOURCE) || defined(_POSIX_SOURCE))
This should probably be reported to toml11 upstream as well.
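To show what the guard is selecting between, here is a standalone sketch of the same three-way dispatch for a UTC conversion. `gmtime_portable` is a hypothetical helper, not code from toml11, and it assumes the final fallback is the standard `std::gmtime`. With the `__MINGW32__` exclusion, MinGW builds take that fallback instead of the POSIX branch.

```cpp
#include <cassert>
#include <ctime>

// Convert a time_t to broken-down UTC time, using a reentrant function
// where the platform provides one.
inline std::tm gmtime_portable(const std::time_t* src) {
    assert(src != nullptr);
    std::tm out{};
#if defined(_MSC_VER)
    gmtime_s(&out, src);  // MSVC: arguments are (result, source)
#elif !defined(__MINGW32__) && \
    ((defined(_POSIX_C_SOURCE) && _POSIX_C_SOURCE >= 1) || \
     defined(_XOPEN_SOURCE) || defined(_BSD_SOURCE) || \
     defined(_SVID_SOURCE) || defined(_POSIX_SOURCE))
    gmtime_r(src, &out);  // POSIX: arguments are (source, result)
#else
    // Fallback: std::gmtime uses shared static storage, so it is not
    // thread-safe; copy the result out immediately.
    if (const std::tm* p = std::gmtime(src)) out = *p;
#endif
    return out;
}
```

The fallback branch is not reentrant, which matters only if timestamps are parsed from multiple threads at once.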
My setup
- Windows 10, MinGW GCC 6.3.0 (win32 threads)
- DRAMsim3 submodule at 2981759
- Xerxes at d0a5d0f
Happy to share the parallel DRAM clocking code or discuss Level 3 approaches if you're interested. Thanks for open-sourcing this work!