# Copyright 2025 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

apiVersion: v2
name: a4_jobset_workload
description: a4_jobset_workload
type: application
version: 0.1.0
appVersion: "1.16.0"
<!-- mdformat global-off -->
# Pretrain deepseek_v3-bf16-gbs2048-gpus256 workloads on a4 GKE Node pools with Megatron-Bridge

This recipe outlines the steps for running a deepseek_v3 pretraining
workload on [A4 GKE node pools](https://cloud.google.com/kubernetes-engine) by using the
[NVIDIA Megatron-Bridge framework](https://github.com/NVIDIA-NeMo/Megatron-Bridge).

## Orchestration and deployment tools

For this recipe, the following setup is used:

- Orchestration - [Google Kubernetes Engine (GKE)](https://cloud.google.com/kubernetes-engine)
- Pretraining job configuration and deployment - A Helm chart is used to configure and deploy the Kubernetes JobSet resource, which manages the execution of the [Megatron-Bridge pretraining workload](https://github.com/NVIDIA-NeMo/Megatron-Bridge).

## Test environment

This recipe has been optimized for and tested with the following configuration:

- GKE cluster: Follow the Cluster Toolkit [instructions](https://github.com/GoogleCloudPlatform/cluster-toolkit/tree/main/examples/gke-a4) to create your A4 GKE cluster.
- Node configuration: 32 nodes (8 GPUs per node, 256 GPUs total).
- GPU Architecture: NVIDIA Blackwell (B200).

## Training dataset

This recipe uses a mock pretraining dataset provided by the [Megatron-Bridge framework's dataset utilities](https://github.com/NVIDIA-NeMo/Megatron-Bridge/blob/main/scripts/performance/utils/datasets.py).

## Docker container image

This recipe uses the following Docker images:

- `nvcr.io/nvidia/nemo:25.11.01`
- `us-docker.pkg.dev/gce-ai-infra/gpudirect-gib/nccl-plugin-gib:v1.1.1`

## Run the recipe

From your client workstation, complete the following steps:

### Configure environment settings

Set the environment variables to match your environment:

```bash
export PROJECT_ID=<PROJECT_ID>
export CLUSTER_REGION=<CLUSTER_REGION>
export CLUSTER_NAME=<CLUSTER_NAME>
export GCS_BUCKET=<GCS_BUCKET> # Note: path should not be prefixed with gs://
export KUEUE_NAME=<KUEUE_NAME>
```

Replace the following values:

- `<PROJECT_ID>`: your Google Cloud project ID.
- `<CLUSTER_REGION>`: the region where your cluster is located.
- `<CLUSTER_NAME>`: the name of your GKE cluster.
- `<GCS_BUCKET>`: the name of your Cloud Storage bucket. Don't include the `gs://` prefix.
- `<KUEUE_NAME>`: the name of the Kueue local queue. The default queue created by the Cluster Toolkit is `a4`.
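As an optional pre-flight check (not part of the recipe), the constraints above can be verified before continuing. The helper name below is made up for illustration:

```shell
# Hypothetical helper: verify that all required variables are set and
# that GCS_BUCKET does not carry the gs:// prefix.
check_env() {
  local var ok=0
  for var in PROJECT_ID CLUSTER_REGION CLUSTER_NAME GCS_BUCKET KUEUE_NAME; do
    # ${!var} is bash indirect expansion: the value of the variable named by $var.
    if [ -z "${!var}" ]; then
      echo "ERROR: $var is not set" >&2
      ok=1
    fi
  done
  case "$GCS_BUCKET" in
    gs://*) echo "ERROR: GCS_BUCKET must not include the gs:// prefix" >&2; ok=1 ;;
  esac
  return $ok
}
```

Running `check_env` after exporting the variables surfaces misconfigurations before any `gcloud` or `helm` command is issued.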

Set the default project:

```bash
gcloud config set project $PROJECT_ID
```

### Get cluster credentials

```bash
gcloud container clusters get-credentials $CLUSTER_NAME --region $CLUSTER_REGION
```

### Get the recipe

Clone the `gpu-recipes` repository and set a reference to the recipe folder.

```bash
git clone https://github.com/ai-hypercomputer/gpu-recipes.git
cd gpu-recipes
export REPO_ROOT=`git rev-parse --show-toplevel`
export RECIPE_ROOT=$REPO_ROOT/training/a4/deepseek_v3/megatron-bridge-pretraining-gke/32node-BF16-SEQ4096-GBS2048-NEMO25.11/recipe
cd $RECIPE_ROOT
```

### Configure and submit a pretraining job

#### Using 32 nodes (256 GPUs) with BF16 precision

To execute the job with the default settings, run the following command from your client:

```bash
cd $RECIPE_ROOT
export WORKLOAD_NAME=$USER-deepseek-v3-32node-bf16-seq4096-gbs2048
helm install $WORKLOAD_NAME . -f values.yaml \
--set-file workload_launcher=launcher.sh \
--set-file workload_config=custom_setup_experiment.py \
--set workload.image=nvcr.io/nvidia/nemo:25.11.01 \
--set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \
--set volumes.gcsMounts[0].mountPath=/job-logs \
--set workload.envs[0].value=/job-logs/$WORKLOAD_NAME \
--set queue=${KUEUE_NAME}
```
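For context, the `workload.envs[0].value` setting above points the workload at a log directory under the GCS mount. The composition can be sketched as follows; the release name is illustrative:

```shell
# Illustration only: the experiment directory passed to the workload is
# the GCS mount path joined with the Helm release name.
WORKLOAD_NAME=alice-deepseek-v3-32node-bf16-seq4096-gbs2048
MOUNT_PATH=/job-logs
EXPERIMENT_DIR="${MOUNT_PATH}/${WORKLOAD_NAME}"
echo "$EXPERIMENT_DIR"
```

Because the directory name embeds the release name, logs from repeated runs of the chart land in separate folders in the bucket.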

**Examples**

- To set the number of training steps to 100, run the following command from
your client:

```bash
cd $RECIPE_ROOT
export WORKLOAD_NAME=$USER-deepseek-v3-32node-bf16-seq4096-gbs2048
helm install $WORKLOAD_NAME . -f values.yaml \
--set-file workload_launcher=launcher.sh \
--set-file workload_config=custom_setup_experiment.py \
--set workload.image=nvcr.io/nvidia/nemo:25.11.01 \
--set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \
--set volumes.gcsMounts[0].mountPath=/job-logs \
--set workload.envs[0].value=/job-logs/$WORKLOAD_NAME \
--set queue=${KUEUE_NAME} \
--set workload.arguments[0]="trainer.max_steps=100"
```

### Monitor the job

To check the status of pods in your job, run the following command:

```bash
kubectl get pods | grep JOB_NAME_PREFIX
```

Replace the following:

- `JOB_NAME_PREFIX`: your job name prefix. For example, `$USER-deepseek-v3-32node-bf16-seq4096-gbs2048`.

To get the logs for one of the pods, run the following command:

```bash
kubectl logs POD_NAME
```

Replace `POD_NAME` with the name of the pod that you want to inspect.

Information about the training job's progress, including crucial details such as
loss, step count, and step time, is generated by the rank 0 process.
This process runs on the pod whose name begins with
`JOB_NAME_PREFIX-workload-0-0`.
For example: `$USER-deepseek-v3-32node-bf16-seq4096-gbs2048-workload-0-0-s9zrv`.
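The rank 0 pod can be picked out of the pod listing with a simple filter. The pod names below are made up; in practice, pipe `kubectl get pods` output into the same `grep`:

```shell
# Illustration only: select the rank 0 pod from a pod listing by its
# `-workload-0-0-` name component.
pods='alice-deepseek-v3-32node-bf16-seq4096-gbs2048-workload-0-0-s9zrv
alice-deepseek-v3-32node-bf16-seq4096-gbs2048-workload-0-1-q7wkd'
rank0=$(echo "$pods" | grep -- '-workload-0-0-')
echo "$rank0"
```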

### Uninstall the Helm release

You can delete the job and other resources created by the Helm chart. To
uninstall the release, run the following command from your client:

```bash
helm uninstall $USER-deepseek-v3-32node-bf16-seq4096-gbs2048
```