starter task: move model ckpt cache to /raid on mi300 cluster #977

@functionstackx

Description

currently model ckpts live on the cluster-level /home NFS. most of /home is filled with AMD files (not inferencex files/model ckpts), and this commonly leads to out-of-storage errors every couple of weeks.

instead of model ckpts being on cluster-level storage, they should be on compute-node-level storage (each compute node has its own local disks).

lsblk shows a bunch of unmounted disks, with only nvme1n1 & nvme6n1 in use (see output below).

task breakdown

  1. learn about slurm, specifically srun & sinfo, & general storage commands like df -h & lsblk
  2. srun into each compute node (srun --gres=gpu:8 --nodelist=<nodename> --pty bash; your nodename can be found via sinfo) and then use claude to create a RAID0 array combining all the unmounted drives and mount it at /raid/. ensure that the gharunner linux user (the linux user that CI runs as) has write/delete/read perms on this mount on each compute node. also ensure the RAID is auto-mounted on node restart (don't restart the node, but make sure the array is in whatever file / run whatever command is needed so that happens).
  3. testing: create a PR that changes HF_HUB_CACHE_MOUNT from its current value
    export HF_HUB_CACHE_MOUNT="/home/gharunner/gharunners/hf-hub-cache/"
    to the new dir. then in perfchange-log.yaml add a test entry for mi300 gptoss, add the sweep-enabled label to the PR, and see if the job succeeds. if it does succeed, you have done it correctly; then delete your perfchange-log.yaml diff and merge only the diff on launch_mi300x-amds.sh
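
once the mount exists, the launch_mi300x-amds.sh diff would be a one-line change along these lines (the exact directory under /raid is a hypothetical placeholder here, not specified in this issue):

```shell
# sketch of the launch_mi300x-amds.sh change: point the HF hub cache at the
# node-local RAID instead of the shared /home NFS.
# NOTE: /raid/hf-hub-cache/ is a hypothetical path; pick the real one in the PR.
export HF_HUB_CACHE_MOUNT="/raid/hf-hub-cache/"
```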

**note that each node might have a different lsblk layout, so double-check your RAID create and mount commands per node
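
step 2 could be sketched roughly as below. this is a sketch, not a vetted procedure: it assumes the eight unused 3.5T drives are nvme0n1, nvme2n1-nvme5n1 and nvme7n1-nvme9n1 (as in the lsblk output below), that /dev/md1 is free (md0 already backs the root filesystem), and a Debian/Ubuntu-style node (mdadm.conf path, update-initramfs). verify every device name against that node's own lsblk first.

```shell
# DESTRUCTIVE: wipes the listed drives. The device list below is an
# assumption -- it can differ per node, so re-derive it from lsblk first.
DISKS="/dev/nvme0n1 /dev/nvme2n1 /dev/nvme3n1 /dev/nvme4n1 \
       /dev/nvme5n1 /dev/nvme7n1 /dev/nvme8n1 /dev/nvme9n1"

# build a RAID0 array across the unused drives and format it
mdadm --create /dev/md1 --level=0 --raid-devices=8 $DISKS
mkfs.ext4 /dev/md1

# mount at /raid and give the CI user (gharunner) read/write/delete access
mkdir -p /raid
mount /dev/md1 /raid
chown gharunner:gharunner /raid
chmod 775 /raid

# persist across reboots WITHOUT restarting the node: record the array in
# mdadm.conf and add an fstab entry (by UUID, so device renaming is harmless)
mdadm --detail --scan >> /etc/mdadm/mdadm.conf
update-initramfs -u
echo "UUID=$(blkid -s UUID -o value /dev/md1) /raid ext4 defaults,nofail 0 2" >> /etc/fstab

# sanity check: the CI user can actually write and delete here
sudo -u gharunner sh -c 'touch /raid/.ci-write-test && rm /raid/.ci-write-test'
```

the nofail mount option keeps the node bootable even if the array ever fails to assemble (RAID0 has no redundancy, which is fine for a rebuildable model cache).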

NAME        MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINTS
nvme0n1     259:1    0   3.5T  0 disk  
├─nvme0n1p1 259:2    0   3.5T  0 part  
└─nvme0n1p9 259:3    0     8M  0 part  
nvme2n1     259:5    0   3.5T  0 disk  
├─nvme2n1p1 259:6    0   3.5T  0 part  
└─nvme2n1p9 259:7    0     8M  0 part  
nvme1n1     259:8    0 894.3G  0 disk  
├─nvme1n1p1 259:36   0   512M  0 part  
└─nvme1n1p2 259:37   0 893.8G  0 part  
  └─md0       9:0    0 893.7G  0 raid1 /
nvme3n1     259:10   0   3.5T  0 disk  
├─nvme3n1p1 259:11   0   3.5T  0 part  
└─nvme3n1p9 259:12   0     8M  0 part  
nvme4n1     259:14   0   3.5T  0 disk  
├─nvme4n1p1 259:15   0   3.5T  0 part  
└─nvme4n1p9 259:16   0     8M  0 part  
nvme7n1     259:19   0   3.5T  0 disk  
├─nvme7n1p1 259:27   0   3.5T  0 part  
└─nvme7n1p9 259:28   0     8M  0 part  
nvme9n1     259:21   0   3.5T  0 disk  
├─nvme9n1p1 259:22   0   3.5T  0 part  
└─nvme9n1p9 259:23   0     8M  0 part  
nvme8n1     259:24   0   3.5T  0 disk  
├─nvme8n1p1 259:25   0   3.5T  0 part  
└─nvme8n1p9 259:26   0     8M  0 part  
nvme5n1     259:30   0   3.5T  0 disk  
├─nvme5n1p1 259:31   0   3.5T  0 part  
└─nvme5n1p9 259:32   0     8M  0 part  
nvme6n1     259:33   0 894.3G  0 disk  
├─nvme6n1p1 259:34   0   512M  0 part  /boot/efi
└─nvme6n1p2 259:35   0 893.8G  0 part  
  └─md0       9:0    0 893.7G  0 raid1 /
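
since the layout differs per node, a small helper like this (a sketch, not something from the repo) can enumerate the RAID candidates on each node: whole disks with no mounted partition under them, which automatically skips the two drives backing the root RAID1 and /boot/efi. sample data is inlined so the logic can be sanity-checked offline.

```shell
#!/bin/sh
# Print NVMe whole disks that have no mounted child anywhere under them,
# i.e. safe candidates for the new RAID0 array. Parses the output format of
# `lsblk -rno NAME,TYPE,MOUNTPOINT` (name, type, optional mountpoint).
list_candidates() {
  awk '
    $2 == "disk" { cur = $1; used[cur] = 0; order[n++] = cur }
    $3 != ""     { used[cur] = 1 }   # a mounted child marks its disk as used
    END { for (i = 0; i < n; i++) if (!used[order[i]]) print "/dev/" order[i] }
  '
}

# offline sanity check with a trimmed version of the lsblk output above;
# on a real node: lsblk -rno NAME,TYPE,MOUNTPOINT | list_candidates
list_candidates <<'EOF'
nvme0n1 disk
nvme0n1p1 part
nvme1n1 disk
nvme1n1p2 part
md0 raid1 /
nvme2n1 disk
nvme2n1p1 part
EOF
```

on the sample input this prints /dev/nvme0n1 and /dev/nvme2n1, skipping nvme1n1 because its partition backs the mounted md0 root array.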
