
Performance analysis: EventEngine serial processing is the primary bottleneck, not DRAM clocking #3

@KueipoH

Labels

performance, analysis


Body

Hi @Halifuda, great work on Xerxes! I've been studying the codebase for CXL interconnect simulation and did some performance profiling. Sharing my findings in case they're useful.

Summary

The main simulation loop bottleneck is EventEngine::step() (serial event processing), not clock_all_mems_to_tick() (DRAM clocking). DRAM clocking accounts for no more than about 5% of total wall-clock time in typical AE workloads.

Evidence

Setup: 8 hosts, 8 memories, sample-bus.py --scale 8 --ratio 0.25, built with MinGW GCC 6.3.0, Release mode.

Test 1 — clock_granu has almost no impact on runtime:

| clock_granu | Baseline runtime (avg of 3) |
|---|---|
| 1 | ~1,485 ms |
| 50 | ~1,485 ms |

If DRAM clocking were the bottleneck, increasing clock_granu from 1 to 50 (50x more ClockTick() calls per memory per iteration) should significantly increase runtime. It doesn't, which suggests that most clock() calls hit the early-return path (issued.size() == 0), or that ClockTick() itself is very fast relative to step().
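To illustrate the early-return argument, here is a minimal sketch. `MemorySketch` and `run_iterations` are hypothetical stand-ins, not Xerxes classes, and the "one request completes per tick" behavior is a deliberate simplification. The point is only that the number of expensive ticks depends on outstanding work, not on how many times clock() is called.

```cpp
#include <cstdint>
#include <deque>

// Hypothetical stand-in for a DRAM-backed memory device (not Xerxes' class).
struct MemorySketch {
    std::deque<std::uint64_t> issued;   // ids of outstanding requests
    std::uint64_t expensive_ticks = 0;  // how often the costly path ran

    // Called once per clock step. With nothing outstanding it returns
    // immediately, so extra calls cost almost nothing.
    void clock() {
        if (issued.empty()) return;     // early-return path
        ++expensive_ticks;              // stands in for the DRAM model's tick
        issued.pop_front();             // simplification: one request done per tick
    }
};

// Issue `granu` clock calls per main-loop iteration and report how many of
// them took the expensive path.
inline std::uint64_t run_iterations(MemorySketch& mem, int iterations, int granu) {
    for (int i = 0; i < iterations; ++i)
        for (int g = 0; g < granu; ++g)
            mem.clock();
    return mem.expensive_ticks;
}
```

With three outstanding requests, 10 iterations at granularity 1 and 10 iterations at granularity 50 both take the expensive path exactly three times in this toy model; the remaining calls are early returns.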

Test 2 — Parallelizing DRAM clocking yields no speedup:

I implemented parallel DRAM clocking with persistent worker threads (one per memory) and deferred callbacks to avoid races on PktStatsTable and EventEngine. Results:

| Config | Baseline | Parallel (8 threads) | Speedup |
|---|---|---|---|
| clock_granu=1, N=8 | 1,485 ms | 1,421 ms | 1.04x |
| clock_granu=50, N=8 | 1,485 ms | 1,558 ms | 0.95x |

Correctness: CSV output is bit-identical between baseline and parallel versions (all 4,008 packets, 15 latency columns).
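For reference, the general shape of "persistent workers plus deferred callbacks" can be sketched as follows. This is not my actual patch: `ParallelClockSketch`, `Work`, and `Callback` are hypothetical names, and the sketch assumes a toolchain with working std::thread support. Each worker clocks one memory and only records callbacks; the main thread then runs them serially in memory-index order, which is what keeps shared state single-threaded and the output deterministic.

```cpp
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

class ParallelClockSketch {
public:
    using Callback = std::function<void()>;
    // work(i, deferred): clock memory i; push callbacks instead of running them.
    using Work = std::function<void(std::size_t, std::vector<Callback>&)>;

    ParallelClockSketch(std::size_t n, Work work)
        : work_(std::move(work)), deferred_(n) {
        for (std::size_t i = 0; i < n; ++i)
            workers_.emplace_back([this, i] { loop(i); });
    }

    ~ParallelClockSketch() {
        { std::lock_guard<std::mutex> lk(m_); stop_ = true; }
        cv_start_.notify_all();
        for (auto& t : workers_) t.join();
    }

    // Clock every memory once in parallel, then run the deferred callbacks
    // serially, in memory-index order, on the calling thread.
    void clock_all() {
        {
            std::lock_guard<std::mutex> lk(m_);
            pending_ = workers_.size();
            ++generation_;                       // wakes each worker exactly once
        }
        cv_start_.notify_all();
        {
            std::unique_lock<std::mutex> lk(m_);
            cv_done_.wait(lk, [this] { return pending_ == 0; });
        }
        for (auto& list : deferred_) {
            for (auto& cb : list) cb();
            list.clear();
        }
    }

private:
    void loop(std::size_t i) {
        std::size_t seen = 0;                    // last generation this worker handled
        for (;;) {
            {
                std::unique_lock<std::mutex> lk(m_);
                cv_start_.wait(lk, [&] { return stop_ || generation_ != seen; });
                if (stop_) return;
                seen = generation_;
            }
            work_(i, deferred_[i]);              // only worker i touches deferred_[i] here
            {
                std::lock_guard<std::mutex> lk(m_);
                if (--pending_ == 0) cv_done_.notify_one();
            }
        }
    }

    Work work_;
    std::vector<std::vector<Callback>> deferred_;
    std::vector<std::thread> workers_;
    std::mutex m_;
    std::condition_variable cv_start_, cv_done_;
    std::size_t pending_ = 0;
    std::size_t generation_ = 0;
    bool stop_ = false;
};
```

The barrier in clock_all() is paid on every main-loop iteration, so when the per-memory work is mostly early returns, the synchronization cost can easily outweigh the parallel gain.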

Root cause analysis

Per main-loop iteration:

while (clock_cnt < max_clock) {
    step();                          // ← ~95% of time (serial, 1 event at a time)
    clock_all_mems_to_tick(curt);    // ← ~5% of time (mostly early-return)
}
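As a rough consistency check, Amdahl's law bounds what parallelizing only the DRAM-clocking share can deliver. Taking the ~5% figure as the parallelizable fraction p and assuming perfect scaling over N = 8 threads with zero synchronization cost (both idealizations):

```latex
S(N) = \frac{1}{(1 - p) + \frac{p}{N}}, \qquad
S(8) = \frac{1}{0.95 + \frac{0.05}{8}} = \frac{1}{0.95625} \approx 1.046, \qquad
\lim_{N \to \infty} S(N) = \frac{1}{0.95} \approx 1.053
```

That ceiling is in line with the 1.04x measured in Test 2 at clock_granu=1. The formula ignores thread synchronization overhead, which may be why the clock_granu=50 run came out slightly slower than baseline.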

The global mutable state that enforces serial execution:

| Global state | Location | Why it blocks parallelism |
|---|---|---|
| EventEngine::events | xerxes_standalone.cc | All device events in one multimap |
| PktStatsTable::get() | def.hh:117 | All packets write to same unordered_map |
| PktBuilder::id (static) | def.hh:228 | Non-atomic auto-increment |
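Of the three, the id counter is the simplest to make thread-safe. A minimal sketch, assuming the id is a plain integer counter; `PktIdSketch` is a hypothetical name and not the existing PktBuilder interface:

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical replacement for a static, non-atomic packet id counter.
class PktIdSketch {
public:
    // Returns a process-wide unique id. fetch_add returns the previous value;
    // relaxed ordering is sufficient because only uniqueness is required.
    static std::uint64_t next() {
        return counter().fetch_add(1, std::memory_order_relaxed);
    }

private:
    // Function-local static: initialized once, thread-safely, since C++11.
    static std::atomic<std::uint64_t>& counter() {
        static std::atomic<std::uint64_t> c{0};
        return c;
    }
};
```

Note that this only guarantees uniqueness. Which packet receives which id would depend on thread scheduling, so output keyed by packet id may no longer be bit-identical across runs.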

Potential optimization path

True speedup requires parallelizing EventEngine::step() itself (what I call "Level 3"):

  • Events at the same tick targeting different devices could theoretically run concurrently
  • Requires: per-device mutex, concurrent event queue, thread-safe TopoNode::buffer
  • Estimated effort: ~500+ lines of refactoring
  • Risk: event ordering differences may cause non-deterministic tie-breaking in Timeline::transfer_time(), leading to small numerical differences
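A first step toward Level 3 could look like the sketch below: pull every event at the earliest tick out of the multimap and bucket it by target device, so that each bucket could later be handed to a separate worker. `EventSketch`, `DeviceId`, and `pop_earliest_by_device` are hypothetical; the real event type in Xerxes carries more than a target and a payload.

```cpp
#include <cstdint>
#include <map>
#include <vector>

using Tick = std::uint64_t;
using DeviceId = int;

// Hypothetical, stripped-down event: which device handles it, plus a payload.
struct EventSketch {
    DeviceId target;
    int payload;
};

// Remove all events scheduled at the earliest tick and group them by target
// device. Within one device the original queue order is preserved, because
// std::multimap keeps equal keys in insertion order (guaranteed since C++11).
inline std::map<DeviceId, std::vector<EventSketch>>
pop_earliest_by_device(std::multimap<Tick, EventSketch>& events) {
    std::map<DeviceId, std::vector<EventSketch>> buckets;
    if (events.empty()) return buckets;

    const Tick now = events.begin()->first;
    auto range = events.equal_range(now);
    for (auto it = range.first; it != range.second; ++it)
        buckets[it->second.target].push_back(it->second);
    events.erase(range.first, range.second);
    return buckets;
}
```

Buckets for different devices are independent only if handlers do not touch shared state and do not schedule new events at the same tick for another device; both assumptions would need checking against the actual handlers.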

I have working code for the DRAM parallelization (Level 1) with deferred callbacks. It doesn't help for the current workloads but could be useful as infrastructure if the DRAM-to-event ratio changes in future configurations.

Separate issue: ext/toml.hpp fails to build on MinGW

The vendored toml11 (ext/toml.hpp) fails to build on MinGW because it enters the POSIX branch (localtime_r/gmtime_r) when _POSIX_SOURCE or one of the related feature-test macros is defined, but MinGW doesn't provide these functions. One-line fix at line 972:

-#elif (defined(_POSIX_C_SOURCE) && _POSIX_C_SOURCE >= 1) || defined(_XOPEN_SOURCE) || defined(_BSD_SOURCE) || defined(_SVID_SOURCE) || defined(_POSIX_SOURCE)
+#elif !defined(__MINGW32__) && ((defined(_POSIX_C_SOURCE) && _POSIX_C_SOURCE >= 1) || defined(_XOPEN_SOURCE) || defined(_BSD_SOURCE) || defined(_SVID_SOURCE) || defined(_POSIX_SOURCE))

This should probably be reported to toml11 upstream as well.
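For context, the kind of guard involved can be sketched in isolation. This is not toml11's code: `localtime_portable` is a hypothetical helper, and the mutex-protected fallback is my own choice for the illustration. The preprocessor condition is the one from the patch above.

```cpp
#include <ctime>
#include <mutex>

// Hypothetical helper: thread-safe local-time conversion that avoids
// localtime_r on MinGW, where the feature-test macros may be defined even
// though the function is not available.
inline std::tm localtime_portable(const std::time_t* t) {
    std::tm out{};
#if !defined(__MINGW32__) && ((defined(_POSIX_C_SOURCE) && _POSIX_C_SOURCE >= 1) || \
    defined(_XOPEN_SOURCE) || defined(_BSD_SOURCE) || defined(_SVID_SOURCE) || \
    defined(_POSIX_SOURCE))
    localtime_r(t, &out);                   // POSIX reentrant variant
#else
    // Fallback: std::localtime returns a pointer to shared static storage,
    // so serialize access and copy the result out.
    static std::mutex m;
    std::lock_guard<std::mutex> lk(m);
    if (const std::tm* p = std::localtime(t)) out = *p;
#endif
    return out;
}
```

__MINGW32__ is defined by both 32-bit and 64-bit MinGW toolchains, so the single check covers both.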

My setup

  • Windows 10, MinGW GCC 6.3.0 (win32 threads)
  • DRAMsim3 submodule at 2981759
  • Xerxes at d0a5d0f

Happy to share the parallel DRAM clocking code or discuss Level 3 approaches if you're interested. Thanks for open-sourcing this work!
