# Copyright 2025 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

apiVersion: v2
name: a4_jobset_workload
description: a4_jobset_workload
type: application
version: 0.1.0
appVersion: "1.16.0"
<!-- mdformat global-off -->
# Pretrain deepseek_v3-bf16-gbs2048-gpus256 workloads on a4 GKE Node pools with Megatron-Bridge

This recipe outlines the steps for running a deepseek_v3 pretraining
workload on [A4 GKE node pools](https://cloud.google.com/kubernetes-engine) by using the
[NVIDIA Megatron-Bridge framework](https://github.com/NVIDIA-NeMo/Megatron-Bridge).

## Orchestration and deployment tools

For this recipe, the following setup is used:

- Orchestration - [Google Kubernetes Engine (GKE)](https://cloud.google.com/kubernetes-engine)
- Pretraining job configuration and deployment - A Helm chart is used to configure and deploy the Kubernetes JobSet resource, which manages the execution of the [Megatron-Bridge pretraining workload](https://github.com/NVIDIA-NeMo/Megatron-Bridge).

## Test environment

This recipe has been optimized for and tested with the following configuration:

- GKE cluster: Follow the Cluster Toolkit [instructions](https://github.com/GoogleCloudPlatform/cluster-toolkit/tree/main/examples/gke-a4) to create your A4 GKE cluster.
- Node configuration: 32 nodes (8 GPUs per node, 256 GPUs total).
- GPU Architecture: NVIDIA Blackwell (B200).

## Training dataset

This recipe uses a mock pretraining dataset provided by the [Megatron-Bridge framework's dataset utilities](https://github.com/NVIDIA-NeMo/Megatron-Bridge/blob/main/scripts/performance/utils/datasets.py).

## Docker container image

This recipe uses the following Docker images:

- `nvcr.io/nvidia/nemo:25.11.01`
- `us-docker.pkg.dev/gce-ai-infra/gpudirect-gib/nccl-plugin-gib:v1.1.1`

## Run the recipe

From your client workstation, complete the following steps:

### Configure environment settings

Set the environment variables to match your environment:

```bash
export PROJECT_ID=<PROJECT_ID>
export CLUSTER_REGION=<CLUSTER_REGION>
export CLUSTER_NAME=<CLUSTER_NAME>
export GCS_BUCKET=<GCS_BUCKET> # Note: path should not be prefixed with gs://
export KUEUE_NAME=<KUEUE_NAME>
```

Replace the following values:

- `<PROJECT_ID>`: your Google Cloud project ID.
- `<CLUSTER_REGION>`: the region where your cluster is located.
- `<CLUSTER_NAME>`: the name of your GKE cluster.
- `<GCS_BUCKET>`: the name of your Cloud Storage bucket. Don't include the `gs://` prefix.
- `<KUEUE_NAME>`: the name of the Kueue local queue. The default queue created by the Cluster Toolkit is `a4`.
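As an optional pre-flight check (not part of the recipe), the constraints above can be verified before continuing. The helper name below is made up for illustration:

```shell
# Hypothetical helper: verify that all required variables are set and
# that GCS_BUCKET does not carry the gs:// prefix.
check_env() {
  local var ok=0
  for var in PROJECT_ID CLUSTER_REGION CLUSTER_NAME GCS_BUCKET KUEUE_NAME; do
    # ${!var} is bash indirect expansion: the value of the variable named by $var.
    if [ -z "${!var}" ]; then
      echo "ERROR: $var is not set" >&2
      ok=1
    fi
  done
  case "$GCS_BUCKET" in
    gs://*) echo "ERROR: GCS_BUCKET must not include the gs:// prefix" >&2; ok=1 ;;
  esac
  return $ok
}
```

Running `check_env` after exporting the variables surfaces misconfigurations before any `gcloud` or `helm` command is issued.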

Set the default project:

```bash
gcloud config set project $PROJECT_ID
```

### Get cluster credentials

```bash
gcloud container clusters get-credentials $CLUSTER_NAME --region $CLUSTER_REGION
```

### Get the recipe

Clone the `gpu-recipes` repository and set a reference to the recipe folder.

```bash
git clone https://github.com/ai-hypercomputer/gpu-recipes.git
cd gpu-recipes
export REPO_ROOT=`git rev-parse --show-toplevel`
export RECIPE_ROOT=$REPO_ROOT/training/a4/deepseek_v3/megatron-bridge-pretraining-gke/32node-BF16-SEQ4096-GBS2048-NEMO25.11/recipe
cd $RECIPE_ROOT
```

### Configure and submit a pretraining job

#### Using 32 nodes (256 GPUs) with BF16 precision

To execute the job with the default settings, run the following command from your client:

```bash
cd $RECIPE_ROOT
export WORKLOAD_NAME=$USER-deepseek-v3-32node-bf16-seq4096-gbs2048
helm install $WORKLOAD_NAME . -f values.yaml \
--set-file workload_launcher=launcher.sh \
--set-file workload_config=custom_setup_experiment.py \
--set workload.image=nvcr.io/nvidia/nemo:25.11.01 \
--set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \
--set volumes.gcsMounts[0].mountPath=/job-logs \
--set workload.envs[0].value=/job-logs/$WORKLOAD_NAME \
--set queue=${KUEUE_NAME}
```
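For context, the `workload.envs[0].value` setting above points the workload at a log directory under the GCS mount. The composition can be sketched as follows; the release name is illustrative:

```shell
# Illustration only: the experiment directory passed to the workload is
# the GCS mount path joined with the Helm release name.
WORKLOAD_NAME=alice-deepseek-v3-32node-bf16-seq4096-gbs2048
MOUNT_PATH=/job-logs
EXPERIMENT_DIR="${MOUNT_PATH}/${WORKLOAD_NAME}"
echo "$EXPERIMENT_DIR"
```

Because the directory name embeds the release name, logs from repeated runs of the chart land in separate folders in the bucket.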

**Examples**

- To set the number of training steps to 100, run the following command from
your client:

```bash
cd $RECIPE_ROOT
export WORKLOAD_NAME=$USER-deepseek-v3-32node-bf16-seq4096-gbs2048
helm install $WORKLOAD_NAME . -f values.yaml \
--set-file workload_launcher=launcher.sh \
--set-file workload_config=custom_setup_experiment.py \
--set workload.image=nvcr.io/nvidia/nemo:25.11.01 \
--set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \
--set volumes.gcsMounts[0].mountPath=/job-logs \
--set workload.envs[0].value=/job-logs/$WORKLOAD_NAME \
--set queue=${KUEUE_NAME} \
--set workload.arguments[0]="trainer.max_steps=100"
```

### Monitor the job

To check the status of pods in your job, run the following command:

```bash
kubectl get pods | grep JOB_NAME_PREFIX
```

Replace the following:

- `JOB_NAME_PREFIX`: your job name prefix. For example, `$USER-deepseek-v3-32node-bf16-seq4096-gbs2048`.

To get the logs for one of the pods, run the following command:

```bash
kubectl logs POD_NAME
```

Replace `POD_NAME` with the name of the pod that you want to inspect.

Information about the training job's progress, including crucial details such as
loss, step count, and step time, is generated by the rank 0 process.
This process runs on the pod whose name begins with
`JOB_NAME_PREFIX-workload-0-0`.
For example: `$USER-deepseek-v3-32node-bf16-seq4096-gbs2048-workload-0-0-s9zrv`.
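The rank 0 pod can be picked out of the pod listing with a simple filter. The pod names below are made up; in practice, pipe `kubectl get pods` output into the same `grep`:

```shell
# Illustration only: select the rank 0 pod from a pod listing by its
# `-workload-0-0-` name component.
pods='alice-deepseek-v3-32node-bf16-seq4096-gbs2048-workload-0-0-s9zrv
alice-deepseek-v3-32node-bf16-seq4096-gbs2048-workload-0-1-q7wkd'
rank0=$(echo "$pods" | grep -- '-workload-0-0-')
echo "$rank0"
```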

### Uninstall the Helm release

You can delete the job and other resources created by the Helm chart. To
uninstall the release, run the following command from your client:

```bash
helm uninstall $USER-deepseek-v3-32node-bf16-seq4096-gbs2048
```