4 changes: 3 additions & 1 deletion .gitignore
Original file line number Diff line number Diff line change
@@ -8,4 +8,6 @@ __pycache__/
*.dot
.pyre
*et_def.pb.cc
*et_def.pb.h
*et_def.pb.h
/mlsys26/traces
chakra_env/
139 changes: 139 additions & 0 deletions mlsys26/README.md
@@ -0,0 +1,139 @@
# MLSys 2026 MLCommons Chakra Artifact Evaluation

## Install/Set up Chakra

### Create a Python virtual environment for Chakra
```bash
# Create a virtual environment in the Chakra repository root (path/to/chakra/)
python3 -m venv chakra_env

# Activate the virtual environment
source chakra_env/bin/activate
```

## Install Chakra and Convert NeMo Traces to Chakra .et

### Install Chakra
```bash
source chakra_env/bin/activate
pip install .
```

### Pin protobuf version
> **Critical:** The protobuf version used to **generate** the `.et` traces must match the
> version compiled into the ASTRA-sim Docker image. Once patched by `astra-sim-patch.sh`,
> the Dockerfile builds **protobuf 6.33.0**. Pin your Chakra environment to the same
> version before converting traces.
```bash
pip install protobuf==6.33.0
```
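
Because a silent version mismatch only surfaces later, it can help to check the pin before converting traces. The following is a minimal sketch, not part of the artifact: the helper names `versions_match` and `installed_protobuf_version` are hypothetical, and the expected version is simply the one stated above.

```python
from importlib import metadata

# Version stated above for the patched ASTRA-sim Docker image.
EXPECTED_PROTOBUF = "6.33.0"


def versions_match(installed, expected=EXPECTED_PROTOBUF):
    """Return True when both version strings are identical after trimming."""
    return installed.strip() == expected.strip()


def installed_protobuf_version():
    """Return the installed protobuf version, or None if it is not installed."""
    try:
        return metadata.version("protobuf")
    except metadata.PackageNotFoundError:
        return None


found = installed_protobuf_version()
if found is None:
    print("protobuf is not installed in this environment")
elif versions_match(found):
    print(f"OK: protobuf {found}")
else:
    print(f"Mismatch: found protobuf {found}, expected {EXPECTED_PROTOBUF}")
```

Run it with `chakra_env` activated so that it inspects the same environment that will generate the traces.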

### Install PARAM (required by `chakra_trace_link`)
`chakra_trace_link` depends on `et_replay` from the [PARAM](https://github.com/facebookresearch/param) project.
```bash
git clone https://github.com/facebookresearch/param.git
cd param/et_replay
git checkout 7b19f586dd8b267333114992833a0d7e0d601630
pip install .
cd ../..
```
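
A quick way to confirm that the dependency is visible to the active environment is to ask Python whether the module can be located. This is a sketch under assumptions: `et_replay` is the module named above, while `chakra` as the import name of the Chakra package is an assumption, and `missing_modules` is a hypothetical helper.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be located."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# "et_replay" comes from the text above; "chakra" is an assumed import name.
print(missing_modules(["et_replay", "chakra"]))
```

An empty list means both modules were found. Any name that is printed still needs to be installed in `chakra_env`.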

### Download traces
```bash
cd mlsys26
bash download_nemo_chakra_traces.sh
```

### Convert traces (trace link + converter in one step)
```bash
bash convert_traces.sh
```

Outputs are written to:
- `mlsys26/traces/linked/` — linked JSON (host + device merged per rank)
- `mlsys26/traces/et/` — protobuf `.et` files ready for ASTRA-sim
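
Before moving on to ASTRA-sim, it may be worth checking that one `.et` file exists per rank. The sketch below assumes the `chakra_trace.<rank>.et` naming that `convert_traces.sh` reports. The function name `missing_et_files` is hypothetical, and the demo uses a temporary directory rather than the real trace directory.

```python
from pathlib import Path
import tempfile


def missing_et_files(et_dir, num_ranks):
    """List the expected chakra_trace.<rank>.et files that are absent."""
    et_dir = Path(et_dir)
    expected = [f"chakra_trace.{rank}.et" for rank in range(num_ranks)]
    return [name for name in expected if not (et_dir / name).is_file()]


# Demo on a throwaway directory that holds only the rank-0 trace.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "chakra_trace.0.et").touch()
    print(missing_et_files(tmp, 2))  # ['chakra_trace.1.et']
```

For the real traces, call `missing_et_files("mlsys26/traces/et", num_ranks)` with the rank count that the conversion script printed.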

## Using ASTRA-sim for Chakra-Based Simulation of Diverse Networked Systems

ASTRA-sim uses Chakra's execution trace (ET) feeder in place of its original custom workload format. This integration has enabled co-design studies on emerging platforms, in particular studies that explore and optimize networking infrastructure.

### ASTRA-sim Installation
> [!WARNING]
> Run the commands below from the `${CHAKRA_REPO_ROOT}/mlsys26` directory.

```bash
# Clone ASTRA-sim (SSH URL; requires a GitHub SSH key.
# https://github.com/astra-sim/astra-sim.git works over HTTPS).
git clone git@github.com:astra-sim/astra-sim.git

cd ./astra-sim
# Check out the branch validated for this artifact
git checkout changhai/chakra_main_paper
git submodule update --init --recursive
cd ..
```

> [!NOTE]
> Building the docker container can take several minutes.
```bash
# Align the protobuf versions by patching the ASTRA-sim Dockerfile
cd "${CHAKRA_REPO_ROOT}/mlsys26"
bash astra-sim-patch.sh ./astra-sim/Dockerfile

# Remove any old container and image first, if any (full clean rebuild)
docker rm -f astra-sim-mlsys26 2>/dev/null || true
docker rmi -f astra-sim:mlsys26 2>/dev/null || true

# Build Docker image
docker build -t astra-sim:mlsys26 -f ./astra-sim/Dockerfile ./astra-sim

# Run container with bind mounts:
# /app/astra-sim <- astra-sim source + build output
# /app/astra-sim/mlsys26/plots <- run scripts and configs
# /traces <- .et trace files
docker run -it --name astra-sim-mlsys26 --shm-size=8g \
-v "$(pwd)/astra-sim:/app/astra-sim" \
-v "$(pwd)/plots:/app/astra-sim/mlsys26/plots" \
-v "$(pwd)/traces/et:/traces" \
astra-sim:mlsys26 bash
```
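
The three bind mounts mean that every path used inside the container has a counterpart on the host. The sketch below spells out that mapping for the mounts listed in the `docker run` command above. The function name `container_path` and the example host root `/work/chakra/mlsys26` are hypothetical.

```python
from pathlib import PurePosixPath


def container_path(host_path, mlsys_root):
    """Map a host path under mlsys_root to its path inside the container.

    Returns None when the path is not covered by any of the bind mounts.
    """
    root = PurePosixPath(mlsys_root)
    mounts = [
        (root / "astra-sim", PurePosixPath("/app/astra-sim")),
        (root / "plots", PurePosixPath("/app/astra-sim/mlsys26/plots")),
        (root / "traces" / "et", PurePosixPath("/traces")),
    ]
    path = PurePosixPath(host_path)
    for source, target in mounts:
        try:
            relative = path.relative_to(source)
        except ValueError:
            continue  # not under this mount, try the next one
        return str(target / relative)
    return None


# Hypothetical host root, used for illustration only.
print(container_path("/work/chakra/mlsys26/traces/et/chakra_trace.0.et",
                     "/work/chakra/mlsys26"))
```

Only `traces/et` is mounted, so the linked JSON files under `traces/linked` are not visible inside the container.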

### Build ASTRA-sim inside the container
```bash
# Inside the container:
./build/astra_analytical/build.sh
```


### Run the simulation (with ASTRA-sim and Chakra in place)
```bash
# Inside the container (after building):
bash /app/astra-sim/mlsys26/plots/m8x7/mixtral_8x7b.sh
```

### Draw the plots (Figs. 6, 7, 8, and 12)
```bash
# Return to path/to/chakra/mlsys26 on the host, with chakra_env activated,
# then go to the plots directory
cd plots

# Install matplotlib for plotting
pip install matplotlib

# Figure 6
python chakra_kineto_reconstruct.py

# Figure 7
python plot_coll_ib.py

# Figure 8
bash run_plot_memory.sh

# Figure 12
cd ./m8x7/
python plot_astra-sim_bw_analysis.py

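# (Optional) Clean up the generated result directories.
# The path below is the container path; on the host, the same directory is
# mlsys26/plots/m8x7 (bind-mounted).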
cd /app/astra-sim/mlsys26/plots/m8x7/
find . -maxdepth 1 -type d ! -name . -exec rm -rf {} +
```
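
The `find` command in the optional cleanup step deletes every subdirectory of the current directory, so it is worth previewing what would be removed. The sketch below does the same thing with a dry-run default. The function names `result_dirs` and `remove_result_dirs` are hypothetical, and the demo runs on a temporary directory, not on the real results.

```python
from pathlib import Path
import shutil
import tempfile


def result_dirs(plots_dir):
    """Return the immediate subdirectories of plots_dir (symlinks excluded)."""
    return sorted(
        p for p in Path(plots_dir).iterdir()
        if p.is_dir() and not p.is_symlink()
    )


def remove_result_dirs(plots_dir, dry_run=True):
    """Delete the subdirectories unless dry_run is set; return their names."""
    targets = result_dirs(plots_dir)
    for target in targets:
        if not dry_run:
            shutil.rmtree(target)
    return [target.name for target in targets]


# Demo on a throwaway directory: preview first, then delete.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "run_a").mkdir()
    (Path(tmp) / "plot.py").touch()
    print(remove_result_dirs(tmp))                 # preview only
    print(remove_result_dirs(tmp, dry_run=False))  # actually removes run_a
```

Regular files such as the run scripts and plotting scripts are left in place, which matches the `-type d` filter of the `find` command.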

67 changes: 67 additions & 0 deletions mlsys26/astra-sim-patch.sh
@@ -0,0 +1,67 @@
#!/usr/bin/env bash
set -euo pipefail

DOCKERFILE="${1:-Dockerfile}"

if [[ ! -f "$DOCKERFILE" ]]; then
    echo "Error: $DOCKERFILE not found" >&2
    exit 1
fi

# cp "$DOCKERFILE" "${DOCKERFILE}.bak"

# Note: `sed -i` without a suffix argument is GNU sed syntax.
sed -i \
    -e 's/^ARG ABSL_VER=20240722\.0$/ARG ABSL_VER=20250814.1/' \
    -e 's/^## Download Abseil 20240722\.0.*/## Download Abseil 20250814.1/' \
    -e 's/^ARG PROTOBUF_VER=29\.0$/ARG PROTOBUF_VER=33.0/' \
    -e 's/^## Download Protobuf 29\.0.*/## Download Protobuf 33.0 (=v6.33.0)/' \
    -e 's/protobuf==5\.\${PROTOBUF_VER}/protobuf==6.${PROTOBUF_VER}/' \
    "$DOCKERFILE"

python3 - "$DOCKERFILE" <<'PY'
from pathlib import Path
import re
import sys

path = Path(sys.argv[1])
text = path.read_text()

# Update all C++ standard settings from 14 -> 17
text = re.sub(r'(-DCMAKE_CXX_STANDARD=)14\b', r'\g<1>17', text)

path.write_text(text)
PY

# Patch the CMakeLists.txt that lives alongside the Dockerfile
CMAKEFILE="$(dirname "$DOCKERFILE")/CMakeLists.txt"

if [[ ! -f "$CMAKEFILE" ]]; then
    echo "Warning: $CMAKEFILE not found, skipping CMakeLists.txt patch"
else
    python3 - "$CMAKEFILE" <<'PY'
from pathlib import Path
import re
import sys

path = Path(sys.argv[1])
text = path.read_text()

# Remove hardcoded abseil .so linker lines that are baked into the repo
# but break builds when the abseil version changes.
cleaned, n = re.subn(
    r'\ntarget_link_libraries\(AstraSim PRIVATE /usr/local/lib/libabsl_log_internal[^\n]+\)',
    '',
    text,
)

if n == 0:
    print(f"No abseil link libraries found in {sys.argv[1]}, nothing to remove")
else:
    path.write_text(cleaned)
    print(f"Removed {n} abseil link librar{'y' if n == 1 else 'ies'} from {sys.argv[1]}")
PY
    echo "Patched $CMAKEFILE"
fi

echo "Patched $DOCKERFILE"
# echo "Backup saved as ${DOCKERFILE}.bak"
80 changes: 80 additions & 0 deletions mlsys26/convert_traces.sh
@@ -0,0 +1,80 @@
#!/usr/bin/env bash
# convert_traces.sh
# Links Chakra host+device traces and converts them to protobuf (.et) format
# for all ranks in the Mixtral-8x7B NeMo trace set.
#
# Usage:
# source <chakra-env>/bin/activate
# bash mlsys26/convert_traces.sh
#

set -euo pipefail

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
TRACE_DIR="${SCRIPT_DIR}/traces/nemo-chakra-mixtral-8x7B-traces"
LINKED_DIR="${SCRIPT_DIR}/traces/linked"
ET_DIR="${SCRIPT_DIR}/traces/et"

# ---------------------------------------------------------------------------
# Validate inputs
# ---------------------------------------------------------------------------
if [[ ! -d "${TRACE_DIR}" ]]; then
    echo "[ERROR] Trace directory not found: ${TRACE_DIR}"
    echo "        Run download_nemo_chakra_traces.sh first."
    exit 1
fi

mkdir -p "${LINKED_DIR}" "${ET_DIR}"

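# Automatically detect the number of ranks from the host_*.json files.
# A glob array is used instead of `ls | wc -l`: under `set -euo pipefail`, a
# failing `ls` would abort the script before the error below could be printed.
shopt -s nullglob
HOST_TRACES=("${TRACE_DIR}"/host_*.json)
shopt -u nullglob
NUM_RANKS=${#HOST_TRACES[@]}
if [[ "${NUM_RANKS}" -eq 0 ]]; then
    echo "[ERROR] No host_*.json files found in ${TRACE_DIR}"
    exit 1
fi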
echo "[INFO] Found ${NUM_RANKS} rank(s) in ${TRACE_DIR}"

# ---------------------------------------------------------------------------
# Step 1: chakra_trace_link (host + device → linked JSON)
# ---------------------------------------------------------------------------
echo ""
echo "=== Step 1: chakra_trace_link ==="
for ((rank=0; rank<NUM_RANKS; rank++)); do
    HOST_TRACE="${TRACE_DIR}/host_${rank}.json"
    DEVICE_TRACE="${TRACE_DIR}/device_${rank}.json"
    LINKED_OUT="${LINKED_DIR}/rank${rank}_linked.json"

    echo "[rank ${rank}] Linking ${HOST_TRACE} + ${DEVICE_TRACE} -> ${LINKED_OUT}"
    chakra_trace_link \
        --chakra-host-trace "${HOST_TRACE}" \
        --chakra-device-trace "${DEVICE_TRACE}" \
        --rank "${rank}" \
        --output-file "${LINKED_OUT}"
done
echo "[INFO] All ranks linked."

# ---------------------------------------------------------------------------
# Step 2: chakra_converter (linked JSON → protobuf .et)
# ASTRA-sim expects files named {prefix}.{npu_id}.et
# e.g. chakra_trace.0.et, chakra_trace.1.et, ...
# so we pass --output <ET_DIR>/chakra_trace.<rank>.et
# ---------------------------------------------------------------------------
echo ""
echo "=== Step 2: chakra_converter ==="
for ((rank=0; rank<NUM_RANKS; rank++)); do
    LINKED_IN="${LINKED_DIR}/rank${rank}_linked.json"
    ET_OUT="${ET_DIR}/chakra_trace.${rank}.et"

    echo "[rank ${rank}] Converting ${LINKED_IN} -> ${ET_OUT}"
    chakra_converter PyTorch \
        --input "${LINKED_IN}" \
        --output "${ET_OUT}"
done
echo "[INFO] All ranks converted."

echo ""
echo "=== Done ==="
echo "Linked JSON traces : ${LINKED_DIR}/"
echo "Protobuf .et traces: ${ET_DIR}/"
echo " Files: chakra_trace.0.et ... chakra_trace.$((NUM_RANKS-1)).et"
echo " ASTRA-sim workload prefix: /traces/chakra_trace"
8 changes: 8 additions & 0 deletions mlsys26/download_nemo_chakra_traces.sh
@@ -0,0 +1,8 @@
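#!/usr/bin/env bash
# Download and unpack the Mixtral-8x7B NeMo Chakra traces into ./traces
set -euo pipefail

echo 'Running dataset download script'

mkdir -p traces
cd traces

pip3 install gdown charset_normalizer chardet
# Pass the file ID positionally: the `--id` flag is deprecated in gdown and
# rejected by recent releases.
gdown 1lz6VCqQ-n5lSyshH0XKSqdynKOVRqGZs -O nemo-chakra-mixtral-8x7B-traces.zip
# The archive is unpacked with tar despite its .zip name. If tar rejects it
# because the file is a real zip archive, extract it with `unzip` instead.
tar -xzvf nemo-chakra-mixtral-8x7B-traces.zip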