From ad88809f21618a318d6684dceb5f94517635f6fe Mon Sep 17 00:00:00 2001 From: Sayali Bhavsar Date: Tue, 21 Apr 2026 20:37:49 +0530 Subject: [PATCH] docs: update README --- README.md | 341 +++++++++++++++++++++++++++++++++++++++++++++--------- 1 file changed, 288 insertions(+), 53 deletions(-) diff --git a/README.md b/README.md index be763d4..9a809a9 100644 --- a/README.md +++ b/README.md @@ -1,58 +1,293 @@ -Automation wrapper for streams +# STREAM Memory Bandwidth Benchmark Wrapper -Description: - This wrapper runs the streams program written by - John D. McCalpin - Joe R. Zagar - The program being executed, measures memory transfer - rates in MB/s and provides a rough idea of memory rates - of the machine. However it is a bit outdated - and does not deal with numa. - -Location of underlying workload: part of the github kit +## Description -Packages required: gcc,bc +This wrapper facilitates the automated execution of the STREAM memory bandwidth benchmark. STREAM is a standard metric for assessing a system's sustainable memory bandwidth, measuring throughput in MB/s across four vector operations (Copy, Scale, Add, Triad). + +The wrapper provides: +- Automated STREAM compilation and execution with OpenMP support. +- Automatic array sizing based on system cache topology. +- Scaling tests across NUMA nodes and CPU sockets. +- Support for multiple GCC optimization levels (O2 and O3). +- Support for x86_64 (AMD/Intel) and aarch64 (ARM) architectures. +- Result collection, processing, and verification. +- CSV and JSON output formats. +- System configuration metadata capture. +- Integration with test_tools framework. +- Optional Performance Co-Pilot (PCP) integration. + +## Command-Line Options + +``` +Streams Options: + --cache_multiply : Multiply cache sizes by . Default is 2. + --cache_start_factor : Start the cache size at base cache * . Default is 1. + --cache_cap_size : Caps the size of cache to this value. Default is no cap (0). 
+ --nsizes : Maximum number of cache sizes to test. Default is 4. + --opt2 : If value is not 0, run with O2 optimization level. Default is 1. + --opt3 : If value is not 0, run with O3 optimization level. Default is 1. + --result_dir : Directory to place results into. Default is + results_streams_tuned__. + --size_list : Explicit comma-separated list of array sizes in bytes. + Overrides automatic cache-based sizing. + --threads_multiple : Multiply number of threads by . Default is 2. + --tools_git : Git repo to retrieve the required tools from. + Default: https://github.com/redhat-performance/test_tools-wrappers + +General test_tools options: + --home_parent : Parent home directory. If not set, defaults to current working directory. + --host_config : Host configuration name, defaults to current hostname. + --iterations : Number of times to run the test, defaults to 5. + --run_user: User that is actually running the test on the test system. Defaults to current user. + --sys_type: Type of system working with (aws, azure, hostname). Defaults to hostname. + --sysname: Name of the system running, used in determining config files. Defaults to hostname. + --tuned_setting: Used in naming the results directory. For RHEL, defaults to current active tuned profile. + For non-RHEL systems, defaults to 'none'. + --use_pcp: Enable Performance Co-Pilot monitoring during test execution. + --tools_git : Git repo to retrieve the required tools from. + Default: https://github.com/redhat-performance/test_tools-wrappers + --usage: Display this usage message. +``` + +## What the Script Does + +The `streams_run` script performs the following workflow: + +1. **Environment Setup**: + - Clones the test_tools-wrappers repository if not present (default: ~/test_tools). + - Sources error codes and general setup utilities. + - Detects NUMA topology and cache hierarchy. + +2. **Package Installation**: + - Installs required dependencies via package_tool (gcc, bc, numactl, etc.). 
+ - Dependencies are defined in streams.json for different OS variants (RHEL, Ubuntu, SLES, Amazon Linux). + +3. **Cache Detection and Array Sizing**: + - Reads L3 cache size from `/sys/devices/system/cpu/cpu0/cache/index*/size`. + - For ARM Neoverse systems without exposed L3 cache, uses hardcoded 32 MB SLC per NUMA node. + - Calculates progressively larger array sizes based on cache size and `cache_multiply`. + - Validates each size against available free memory (skips sizes requiring >90%). + +4. **STREAM Compilation**: + - Compiles STREAM 5.10 source (stream_omp_5_10.c) with OpenMP support. + - Generates separate binaries for each array size. + - Supports both O2 and O3 optimization levels. + - Uses architecture-specific compiler flags (`-m64` for x86_64, `-mno-outline-atomics` for aarch64). + +5. **Test Execution**: + - Scales across socket configurations (1 socket through all sockets). + - Sets `OMP_NUM_THREADS` based on CPUs per NUMA node and socket count. + - Pins threads to specific CPUs via `GOMP_CPU_AFFINITY`. + - Runs each (socket count, array size) combination for the specified number of iterations. + - Captures system information (lscpu) alongside each run. + +6. **Data Collection**: + - Captures system configuration (CPU model, memory, NUMA topology, kernel version). + - Records STREAM configuration parameters (array sizes, optimization level). + - Logs timestamps for each test run. + - Optionally records PCP performance data. + +7. **Result Processing**: + - Extracts performance metrics (Copy, Scale, Add, Triad rates in MB/s) from STREAM output. + - Averages results across iterations for each configuration. + - Sorts results by array size and socket count. + - Generates CSV files with configuration and performance data. + - Creates JSON output for verification. + - Validates results against Pydantic schema. + +8. **Verification**: + - Validates results against Pydantic schema (results_schema.py). 
+ - Ensures all required fields are present and valid (all rates > 0). + - Uses csv_to_json and verify_results from test_tools. + +9. **Output**: + - Creates timestamped results directory in `results_streams__`. + - Saves all raw output files, processed CSV/JSON, and system metadata. + - Optionally saves PCP performance data. + - Archives results to configured storage location. + +## Dependencies + +Location of underlying workload: included in the git repository (stream_omp_5_10.c, STREAM version 5.10). + +**Required packages by platform**: +- **RHEL**: gcc, bc, perf, zip, unzip, numactl. +- **Ubuntu**: bc, zip, unzip, numactl, libnuma-dev. +- **SLES**: gcc, make, bc, perf, git, unzip, zip, libnuma1, numactl. +- **Amazon Linux**: gcc, bc, git, zip, unzip, numactl. To run: +```bash +git clone https://github.com/redhat-performance/streams-wrapper +cd streams-wrapper/streams +./streams_run +``` + +The script will automatically detect your cache topology and set buffer sizes based on the hardware it is being executed on. + +## The STREAM Benchmark + +STREAM is a benchmark that measures sustainable memory bandwidth by performing simple vector operations on large arrays of double-precision floating-point numbers. + +### STREAM Kernels + +STREAM measures four operations: + +| Kernel | Operation | Bytes per Element | +|--------|-----------|-------------------| +| **Copy** | `c[i] = a[i]` | 16 (1 read + 1 write) | +| **Scale** | `b[i] = scalar * c[i]` | 16 (1 read + 1 write) | +| **Add** | `c[i] = a[i] + b[i]` | 24 (2 reads + 1 write) | +| **Triad** | `a[i] = b[i] + scalar * c[i]` | 24 (2 reads + 1 write) | + +### Key Parameters + +1. **STREAM_ARRAY_SIZE**: The number of double-precision elements in each array. Larger arrays ensure the working set exceeds cache, forcing memory accesses. The wrapper automatically sizes arrays based on the system's cache hierarchy. + +2. **Optimization Level**: GCC optimization flags (-O2 or -O3). 
Both levels are run by default to compare compiler optimization effects on memory bandwidth. + +3. **OMP_NUM_THREADS**: Number of OpenMP threads used. The wrapper scales from a single NUMA node's worth of CPUs up to all CPUs across all sockets. + +4. **Performance Metric**: STREAM reports bandwidth in **MB/s** (megabytes per second). Higher values indicate better memory bandwidth. The benchmark runs each kernel 10 times internally and reports the best rate (excluding the first iteration). + +## Output Files + +The results directory contains: + +- **results_streams_opt_\.csv**: CSV file with STREAM configuration and performance metrics per optimization level. +- **stream.\k.out.threads_\.numb_sockets_\_iter_\**: Raw output files from individual STREAM runs including lscpu data, CPU affinity, and performance metrics. +- **results_streams.wrkr**: Aggregated worker results in `buffer_size:#threads:#sockets:Copy:Scale:Add:Triad` format. +- **streams_build_options**: Log of all gcc compilation commands executed. +- **meta_data\*.yml**: System metadata (CPU info, memory, NUMA topology, kernel version). +- **PCP data** (if --use_pcp option used): Performance Co-Pilot monitoring data. + +## Examples + +### Basic run with defaults +```bash +./streams_run +``` +This runs with: +- Both O2 and O3 optimization levels. +- Automatic cache-based array sizing (4 sizes, starting at L3 cache size, doubling each step). +- 5 iterations per configuration. +- Automatic scaling across all socket configurations. + +### Run with custom cache multiplier +```bash +./streams_run --cache_multiply 4 +``` +Uses a 4x multiplier between successive array sizes instead of the default 2x. + +### Run with specific array sizes +```bash +./streams_run --size_list 1048576,4194304,16777216 +``` +Tests with explicit array sizes (in bytes) instead of automatic cache-based sizing. 
+ +### Run with more cache sizes +```bash +./streams_run --nsizes 8 +``` +Tests up to 8 progressively larger array sizes instead of the default 4. + +### Run only O3 optimization +```bash +./streams_run --opt2 0 +``` +Disables O2 optimization, running only with O3. + +### Run multiple iterations +```bash +./streams_run --iterations 10 +``` +Runs each configuration 10 times instead of the default 5. + +### Run with capped cache size +```bash +./streams_run --cache_cap_size 65536 +``` +Caps array sizes at 65536 KB, useful for limiting test duration. + +### Run with PCP monitoring +```bash +./streams_run --use_pcp ``` -[root@hawkeye ~]# git clone https://github.com/redhat-performance/streams-wrapper -[root@hawkeye ~]# streams-wrapper/streams/streams_run -``` - -The script will set the buffer sizes based on the hardware it is being executed on. - -``` -Options ---cache_multiply : Multiply cache sizes by . Default is 2 ---cache_start_factor : Start the cache size at base cache * - Default is 1 ---cache_cap_size : Caps the size of cache to this value. Default is no cap. ---nsizes : Maximum number of cache sizes to do. Default is 4 ---opt2 : If value is not 0, then we will run with optimization level - 2. Default value is 1 ---opt3 : If value is not 0, then we will run with optimization level - 3. Default value is 1 ---result_dir : Directory to place results into. Default is - results_streams_tuned__ ---size_list : List of array sizes in byte ---threads_multiple : Multiply number threads by . Default is 2 ---tools_git : git repo to retrieve the required tools from, default is https://github.com/redhat-performance/test_tools-wrappers - -General options - --home_parent : Our parent home directory. If not set, defaults to current working directory. - --host_config : default is the current host name. - --iterations : Number of times to run the test, defaults to 1. - --pbench: use pbench-user-benchmark and place information into pbench, defaults to do not use. 
- --pbench_user : user who started everything. Defaults to the current user. - --pbench_copy: Copy the pbench data, not move it. - --pbench_stats: What stats to gather. Defaults to all stats. - --run_label: the label to associate with the pbench run. No default setting. - --run_user: user that is actually running the test on the test system. Defaults to user running wrapper. - --sys_type: Type of system working with, aws, azure, hostname. Defaults to hostname. - --sysname: name of the system running, used in determining config files. Defaults to hostname. - --tuned_setting: used in naming the tar file, default for RHEL is the current active tuned. For non - RHEL systems, default is none. - --usage: this usage message. -``` - -Note: The script does not install pbench for you. You need to do that manually. +Collects Performance Co-Pilot data during the run. + +### Combination example +```bash +./streams_run --cache_multiply 4 --nsizes 6 --iterations 10 --use_pcp +``` +Uses 4x cache multiplier, tests 6 sizes, runs 10 iterations, and collects PCP data. + +## How Array Sizing Works + +The script automatically calculates STREAM array sizes based on system cache topology: + +### Cache Detection +1. Reads the highest-level cache size from `/sys/devices/system/cpu/cpu0/cache/index*/size`. +2. For ARM Neoverse systems where L3 cache is not exposed in sysfs, uses a hardcoded 32 MB System-Level Cache (SLC) per NUMA node. +3. Multiplies the base cache size by `cache_start_factor` (default: 1). + +### Array Size Progression +1. Starts at the base cache size (in KB). +2. Each subsequent size is multiplied by `cache_multiply` (default: 2). +3. Generates up to `nsizes` (default: 4) different sizes. +4. Sizes exceeding `cache_cap_size` are skipped (if set). +5. Sizes requiring more than 90% of free memory are skipped. 
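The progression steps above can be sketched in a few lines of Python. This is only an illustration, not the code in `streams_run`: the function name `array_sizes_kb` and the `free_kb` parameter are hypothetical, the cap is assumed to be expressed in KB, and the free-memory check is simplified to compare the candidate size itself against 90% of free memory.

```python
def array_sizes_kb(base_cache_kb, cache_start_factor=1, cache_multiply=2,
                   nsizes=4, cache_cap_size=0, free_kb=None):
    """Return candidate sizes in KB, following the documented progression.

    Hypothetical helper: names mirror the documented options, but the
    logic is an assumption, not taken from the wrapper's source.
    """
    sizes = []
    size = base_cache_kb * cache_start_factor   # starting point
    for _ in range(nsizes):                     # at most nsizes candidates
        over_cap = cache_cap_size > 0 and size > cache_cap_size
        # Simplifying assumption: the size itself is the memory required.
        over_mem = free_kb is not None and size > 0.9 * free_kb
        if not over_cap and not over_mem:
            sizes.append(size)
        size *= cache_multiply                  # next, larger candidate
    return sizes


# 32 MB base cache with default settings.
print(array_sizes_kb(32 * 1024))  # [32768, 65536, 131072, 262144]
```

With a cap set, candidates above the cap are dropped while smaller ones are kept, which is the "skipped" behavior described in step 4.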
+ +For example, with a 32 MB L3 cache and default settings: +- Size 1: 32 MB (1x cache) +- Size 2: 64 MB (2x cache) +- Size 3: 128 MB (4x cache) +- Size 4: 256 MB (8x cache) + +### Thread and Socket Scaling +1. Detects the number of NUMA nodes. +2. Calculates CPUs per NUMA node: `total_cpus / numa_nodes`. +3. Scales from 1 socket to all sockets: + - 1 socket: `OMP_NUM_THREADS = cpus_per_node` + - 2 sockets: `OMP_NUM_THREADS = cpus_per_node * 2` + - N sockets: `OMP_NUM_THREADS = cpus_per_node * N` +4. Sets `GOMP_CPU_AFFINITY` to pin threads to the CPUs in the active NUMA nodes. + +## Return Codes + +The script uses standardized error codes from test_tools error_codes: +- **0**: Success. +- **101**: Git clone failure. +- **E_GENERAL**: General execution errors (compilation failures, memory sizing issues, execution failures, validation failures). +- **E_USAGE**: Invalid usage/arguments. + +Exit codes indicate specific failure points for automated testing workflows. + +## Notes + +### Architecture Support +- **x86_64**: Full support for AMD and Intel CPUs. Uses `-m64` compiler flag. +- **aarch64**: Full support for ARM CPUs. Uses `-mno-outline-atomics` compiler flag. Special cache handling for Neoverse systems with unexposed L3 cache. + +### Memory Considerations +- STREAM uses three arrays of double-precision elements (8 bytes each), so total memory per array size is `STREAM_ARRAY_SIZE * 24` bytes. +- The wrapper skips array sizes that would require more than 90% of free memory. +- For systems with very large caches, use `--cache_cap_size` to limit the largest array size tested. + +### Special Cases +- **ARM Neoverse**: L3/SLC cache is not exposed in sysfs. The wrapper uses a hardcoded 32 MB per NUMA node as the base cache size. +- **Thread pinning**: Threads are pinned to specific CPUs via `GOMP_CPU_AFFINITY` based on NUMA node membership, ensuring accurate per-socket bandwidth measurements. 
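The thread and socket scaling rules, together with the thread pinning note above, can be sketched as follows. This is a rough illustration rather than the wrapper's logic: the function name `scaling_plan` is hypothetical, one socket is assumed to correspond to one NUMA node, and CPUs are assumed to be numbered contiguously per node (real topologies often interleave CPU IDs, in which case the affinity list must come from the actual node membership).

```python
def scaling_plan(total_cpus, numa_nodes):
    """List the environment settings for each socket-count step.

    Hypothetical helper under the assumptions stated above.
    """
    cpus_per_node = total_cpus // numa_nodes
    plan = []
    for sockets in range(1, numa_nodes + 1):
        threads = cpus_per_node * sockets
        plan.append({
            "sockets": sockets,
            "OMP_NUM_THREADS": threads,
            # libgomp accepts CPU ranges such as "0-31" in GOMP_CPU_AFFINITY.
            "GOMP_CPU_AFFINITY": f"0-{threads - 1}",
        })
    return plan


for step in scaling_plan(total_cpus=64, numa_nodes=2):
    print(step)
```

For a 64-CPU, 2-node system this yields two steps: 32 threads pinned to CPUs 0-31, then 64 threads pinned to CPUs 0-63.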
+ +### Performance Tips +- Run multiple iterations (default is 5) to verify consistency. +- Ensure the system is idle (no other workloads) for best results. +- Disable CPU frequency scaling (use performance governor) for reproducible results. +- Consider the active tuned profile on RHEL systems. +- Compare O2 and O3 results to understand compiler optimization effects on memory bandwidth. +- For production benchmarking, allow the system to warm up with a test run first. + +### Troubleshooting +- If STREAM fails to compile, verify that gcc is installed and supports OpenMP (`-fopenmp`). +- If array sizes are being skipped, check available free memory with `free -k`. +- If performance is unexpectedly low, check CPU frequency, system load, and NUMA binding. +- Use `--use_pcp` to collect detailed performance counters for analysis. +- Check the `streams_build_options` file to verify compilation flags are correct.
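When investigating skipped array sizes, the footprint rule from Memory Considerations (three arrays of 8-byte doubles) can be checked by hand against the free memory reported by `free -k`. The sketch below is an assumption about how such a check could look; the function names are hypothetical and the wrapper's actual comparison may differ.

```python
BYTES_PER_DOUBLE = 8
NUM_ARRAYS = 3  # STREAM allocates arrays a, b, and c


def stream_footprint_bytes(array_elements):
    """Total bytes for the three STREAM arrays (elements * 24)."""
    return array_elements * NUM_ARRAYS * BYTES_PER_DOUBLE


def would_be_skipped(array_elements, free_kb, limit=0.90):
    """True if the footprint exceeds `limit` of free memory.

    `free_kb` is the free-memory figure in KB, as printed by `free -k`.
    """
    return stream_footprint_bytes(array_elements) > limit * free_kb * 1024


print(stream_footprint_bytes(10_000_000))       # 240000000
print(would_be_skipped(10_000_000, 200_000))    # True
print(would_be_skipped(10_000_000, 1_000_000))  # False
```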