110 changes: 77 additions & 33 deletions docs/computing/running/batch-job-partitions.md
@@ -49,61 +49,105 @@ and resource requirements. These are explained in the table below.
| Allocation type | Resource request |
|:---------------:|---------------------------------------------------------------------------|
| R | Memory and CPU resources can be changed independently |
| N | Full node requests only |
| C | Share of memory resources fixed based on requested number of CPU cores |
| G | Share of CPU and memory resources fixed based on requested number of GPUs |
| N | Full-node requests only |
| C | Memory allocation is fixed based on the requested number of CPU cores |
| G | CPU and memory allocation is fixed based on the requested number of GPUs |

### Roihu CPU partitions

Roihu features the following partitions for submitting jobs to CPU nodes:

| Partition | Allocation type | Time limit | Min CPU cores | Max CPU cores | Max nodes | [Node types](../systems-roihu.md#nodes) | Max memory | Requirements |
|-------------------|-----------------|------------|---------------|---------------|-----------|-----------------------------------------|------------------|--------------------|
| `test` | R | 15 minutes | 1 | 768 | 2 | M | 744 GiB per job | |
| `interactive` | R | 36 hours | 1 | 32 | 1 | M | 64 GiB per job | |
| `longrun` | R | 10 days | 1 | 192 | 1 | M, L | 744 GiB per job | |
| `small` | R | 72 hours | 1 | 384 | 1 | M, L | 1500 GiB per job | |
| `medium` | N | 36 hours | 384 | 2304 | 6 | M | 744 GiB per node | |
| `large` | N | 36 hours | 2304 | 23040 | 60 | M | 744 GiB per node | [scalability test] |
| `hugemem` | C | 36 hours | 16 | 128 | 1 | XL | 6037 GiB per job | |
| `hugemem_longrun` | C | 10 days | 16 | 128 | 1 | XL | 6037 GiB per job | |
Roihu provides the following partitions for submitting jobs to CPU nodes:

| Partition | Allocation type | Time limit | Nodes | Max CPUs | [Node types](../systems-roihu.md#nodes) | Max memory | Requirements |
|-------------------|-----------------|------------|--------|---------------|-----------------------------------------|-----------------------|--------------------|
| `test` | R | 15 minutes | 1 - 2 | 384 per node | M | 744 GiB per node | |
| `small` | R | 72 hours | 1 | 384 per job | M, L | 1500 GiB per job | |
| `medium` | N | 36 hours | 1 - 6 | 384 per node | M | 744 GiB per node | |
| `large` | N | 36 hours | 6 - 60 | 384 per node | M | 744 GiB per node | [scalability test] |
| `longrun` | R | 10 days | 1 | 192 per job | M, L | 1500 GiB per job | |
| `hugemem` | C | 36 hours | 1 | 128 per job | XL | 6037 GiB per job | |
| `hugemem_longrun` | C | 10 days | 1 | 128 per job | XL | 6037 GiB per job | |
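
As a rough illustration of how the limits in the updated table combine, the sketch below checks a CPU job request against them. The `CpuPartition` class and `check_request` function are hypothetical names introduced only for this example; they are not part of Slurm or any Roihu tooling. The sketch covers node count, CPU count, and time limit only, and does not model memory limits or the full-node requirement of allocation type N.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CpuPartition:
    min_nodes: int
    max_nodes: int
    max_cpus: int            # value from the "Max CPUs" column
    cpus_are_per_node: bool  # True for "per node", False for "per job"
    time_limit_hours: float


# Values transcribed from the updated CPU partition table.
CPU_PARTITIONS = {
    "test": CpuPartition(1, 2, 384, True, 0.25),
    "small": CpuPartition(1, 1, 384, False, 72),
    "medium": CpuPartition(1, 6, 384, True, 36),
    "large": CpuPartition(6, 60, 384, True, 36),
    "longrun": CpuPartition(1, 1, 192, False, 240),
    "hugemem": CpuPartition(1, 1, 128, False, 36),
    "hugemem_longrun": CpuPartition(1, 1, 128, False, 240),
}


def check_request(partition, nodes, cpus_per_node, hours):
    """Return a list of problems; an empty list means the request fits."""
    limits = CPU_PARTITIONS[partition]
    problems = []
    if not limits.min_nodes <= nodes <= limits.max_nodes:
        problems.append(
            f"{partition}: {nodes} nodes outside "
            f"{limits.min_nodes}-{limits.max_nodes}"
        )
    # Compare against the per-node or per-job limit, as the table states.
    requested = cpus_per_node if limits.cpus_are_per_node else nodes * cpus_per_node
    if requested > limits.max_cpus:
        problems.append(f"{partition}: {requested} CPUs exceeds {limits.max_cpus}")
    if hours > limits.time_limit_hours:
        problems.append(
            f"{partition}: {hours} h exceeds {limits.time_limit_hours} h"
        )
    return problems
```

For example, `check_request("medium", 4, 384, 24)` returns an empty list, while a two-node request to `large` is flagged because that partition starts at 6 nodes.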

### Roihu GPU partitions

Roihu features the following partitions for submitting jobs to GPU nodes:
Roihu provides the following partitions for submitting jobs to GPU nodes:

| Partition | Allocation type | Time limit | Nodes | Max GPUs | [Node types](../systems-roihu.md#nodes) | Requirements |
|------------------|-----------------|------------|--------|---------------|-----------------------------------------|--------------------|
| `gputest` | G | 15 minutes | 1 - 2 | 4 per node | GPU | |
| `gpuinteractive` | G | 12 hours | 1 | 1 per job | GPU ([slice](#roihu-gpu-slices)) | |
| `gpumedium` | G | 36 hours | 1 | 4 per job | GPU | |
| `gpularge` | G | 36 hours | 1 - 10 | 4 per node | GPU | [scalability test] |

Each full GPU node has 4 GH200 GPUs. On full GPU nodes, each reserved GPU grants access to up to 72 CPU cores,
95 GiB of HBM3 memory, and 116 GiB of LPDDR5 memory.

The memory amounts listed here are the allocatable amounts available to jobs;
some memory is reserved for system use.
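
To make the per-GPU shares concrete, the following minimal sketch computes the upper bounds that a job on full GPU nodes would have access to, using only the figures stated above. The function name `gpu_job_share` and the returned keys are hypothetical and chosen for illustration; GPU slices in `gpuinteractive` are not covered.

```python
# Per-GPU figures for full GPU nodes, as stated in the text above.
GPUS_PER_NODE = 4
MAX_CORES_PER_GPU = 72
HBM3_GIB_PER_GPU = 95
LPDDR5_GIB_PER_GPU = 116


def gpu_job_share(gpus_per_node, nodes=1):
    """Return the upper bounds on resources for a full-GPU-node job."""
    if not 1 <= gpus_per_node <= GPUS_PER_NODE:
        raise ValueError(f"gpus_per_node must be between 1 and {GPUS_PER_NODE}")
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    total_gpus = gpus_per_node * nodes
    return {
        "gpus": total_gpus,
        "max_cpu_cores": total_gpus * MAX_CORES_PER_GPU,
        "hbm3_gib": total_gpus * HBM3_GIB_PER_GPU,
        "lpddr5_gib": total_gpus * LPDDR5_GIB_PER_GPU,
    }
```

Reserving all four GPUs on one node gives up to 288 CPU cores, 380 GiB of HBM3, and 464 GiB of LPDDR5, which is consistent with the 288 CPUs per node listed for `gpupilot`.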

### Roihu interactive partitions

Roihu has several partitions reserved for interactive use and for data visualization.

#### Roihu-CPU interactive use

The `interactive` partition on Roihu allows running
[interactive jobs](./interactive-usage.md) on CPU nodes, through the `sinteractive` command.

| Partition | Allocation type | Time limit | Min GPUs | Max GPUs | Max nodes | [Node types](../systems-roihu.md#nodes) | Memory per GPU | Requirements |
|------------------|-----------------|------------|----------|----------|-----------|-----------------------------------------|------------------|--------------------|
| `gputest` | G | 15 minutes | 1 | 8 | 2 | GPU | 116 GiB + 95 GiB | |
| `gpuinteractive` | G | 12 hours | 1 | 1 | 1 | GPU ([slice](#roihu-gpu-slices)) | TBA | |
| `gpumedium` | G | 36 hours | 1 | 4 | 1 | GPU | 116 GiB + 95 GiB | |
| `gpularge` | G | 36 hours | 4 | 40 | 10 | GPU | 116 GiB + 95 GiB | [scalability test] |
| `vizinteractive` | G | 12 hours | 1 | 1 | 1 | V | 183 GiB + 44 GiB | |
| Partition | Allocation type | Time limit | Nodes | Max CPUs | [Node types](../systems-roihu.md#nodes) | Max memory |
|-------------------|-----------------|------------|--------|---------------|-----------------------------------------|-----------------------|
| `interactive` | R | 36 hours | 1 | 32 per job | M | 64 GiB per job |

#### Roihu-GPU interactive use

`sinteractive` selects the correct partition based on your resource request
and automatically provides a GPU slice when run from a Roihu-GPU login node.

| Partition | Allocation type | Time limit | Nodes | Max CPUs | [Node types](../systems-roihu.md#nodes) |
|-------------------|-----------------|------------|--------|---------------|-----------------------------------------|
| `gpuinteractive` | G | 12 hours | 1 | 1 per job | GPU ([slice](#roihu-gpu-slices)) |

#### Roihu GPU slices

Roihu `gpuinteractive` partition features GH200 superchips that are divided
into a total of 48 smaller slices that have one-seventh of the compute capacity
and one-eighth of the GPU memory capacity (12 GiB) of a full GH200 superchip.
The Roihu `gpuinteractive` partition uses GH200 superchips divided into 48 smaller slices.
Each slice has one-seventh of the compute capacity and one-eighth of the GPU memory capacity (12 GiB) of a full GH200 superchip.

#### Vizinteractive

Roihu also features the following partition for interactive use and data visualization with specialized hardware:

| Partition | Allocation type | Time limit | Nodes | Max GPUs | [Node types](../systems-roihu.md#nodes) |
|------------------|-----------------|------------|-------|-----------|-----------------------------------------|
| `vizinteractive` | G | 12 hours | 1 | 2 per job | V |

Each node in the partition has 2 Nvidia L40 GPUs with 44 GiB of memory and a 64-core AMD EPYC 9335 CPU.
Each reserved GPU grants access to up to 32 CPU cores and 183 GiB of CPU memory.

### Roihu pilot partitions

In addition to the regular partitions, the following partitions are also
In addition to the regular partitions, the following partitions are
available during the Roihu pilot phase:

| Partition | Allocation type | Time limit | Min nodes | Max nodes | [Node types](../systems-roihu.md#nodes) |
|------------|-----------------|------------|-----------|-----------|-----------------------------------------|
| `pilot` | N | 24 hours | 1 | 200 | M |
| `gpupilot` | N | 48 hours | 1 | 60 | GPU |
| Partition | Allocation type | Time limit | Nodes | Max CPUs | Max GPUs | [Node types](../systems-roihu.md#nodes) |
|------------|-----------------|------------|---------|---------------|---------------|-----------------------------------------|
| `pilot` | N | 24 hours | 1 - 200 | 384 per node | 0 | M |
| `gpupilot` | G | 48 hours | 1 - 60 | 288 per node | 4 per node | GPU |


### Local storage on Roihu nodes

Local storage on Roihu M, L and GPU nodes is meant for storing temporary files only, not high-performance I/O.
Local storage on Roihu M, L, and GPU nodes is meant for storing temporary files only, not high-performance I/O.

High-performance local storage is available on Roihu XL and V nodes, which is ideal for I/O-intensive jobs.

The amount of local storage available to a single user depends on the [partition](#roihu-partitions) used:

High-performance local storage is available on Roihu XL and V nodes. Ideal for I/O-intensive jobs.
| Allocation type | Quota per user | Read / Write speeds |
|:-------------------|---------------:|---------------------|
| R (shared nodes) | 20 GiB | 5000 / 1400 MB/s |
| N (full nodes) | 600 GiB | 5000 / 1400 MB/s |
| G (GPU nodes) | 150 GiB | 5000 / 1400 MB/s |
| Hugemem (XL) nodes | 1.6 TiB | 6700 / 4000 MB/s |
| VIZ nodes | 6.5 TiB | 6700 / 4000 MB/s |

Read more about: [Local storage on Roihu nodes](../disk.md#temporary-local-disk-areas)
