starter task: move model ckpt cache to /raid on mi300 cluster #977
currently the model ckpt cache lives on the cluster-level /home NFS. most of /home is filled with AMD files (not inferencex files/model ckpts), and it commonly runs out of storage every couple of weeks, causing errors.
instead of the model ckpt cache being on cluster-level storage, it should be on compute-node-level storage (each compute node has its own disks).
lsblk shows a bunch of unmounted disks, with only nvme1n1 & nvme6n1 being used.
task breakdown
- learn about slurm, specifically `srun` & `sinfo`, and general storage commands like `df -h` & `lsblk`
- `srun` into each compute node (`srun --gres=gpu:8 --nodelist=<nodename> --pty bash`; your nodename can be found via `sinfo`), then use claude to create a RAID0 disk combining all the unused drives together and mount it at `/raid/`. ensure that the `gharunner` linux user (the linux user that CI uses) has write/delete/read perms on this mount on each compute node. also ensure that this RAID is auto-mounted on node restart (don't restart the node, but edit whatever file or run whatever command is needed to ensure that happens).
- testing: create a PR that points `HF_HUB_CACHE_MOUNT` (currently `export HF_HUB_CACHE_MOUNT="/home/gharunner/gharunners/hf-hub-cache/"`) at the new dir, add a test in `perfchange-log.yaml` that adds mi300 gptoss, and add the `sweep-enabled` label to the PR. then see if the jobs succeed. if they do succeed, you have done it correctly; then delete your `perfchange-log.yaml` diff and merge only the diff on `launch_mi300x-amds.sh`

**note that each node might have a different `lsblk` layout, so double-check your RAID-create and mount commands per node.**
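The RAID0 / mount / auto-mount steps above can be sketched as below. This is a dry-run sketch, not the definitive procedure: the device list, the array name `/dev/md1`, and the ext4 filesystem choice are all assumptions taken from one node's `lsblk` and WILL differ per node. The script only echoes the commands by default, so nothing destructive happens until you change the `RUN` guard.

```shell
#!/bin/sh
# Dry-run sketch of the per-node RAID0 setup. Device names are assumptions
# from one node's lsblk output -- verify them on each node before running.
RUN="echo sudo"   # prints the commands. Set RUN="sudo" to actually execute.

# The eight unmounted 3.5T data disks on the example node (hypothetical).
DISKS="/dev/nvme0n1 /dev/nvme2n1 /dev/nvme3n1 /dev/nvme4n1 \
/dev/nvme5n1 /dev/nvme7n1 /dev/nvme8n1 /dev/nvme9n1"

# 1. Build a RAID0 array across all data disks.
$RUN mdadm --create /dev/md1 --level=0 --raid-devices=8 $DISKS

# 2. Put a filesystem on it and mount it at /raid.
$RUN mkfs.ext4 /dev/md1
$RUN mkdir -p /raid
$RUN mount /dev/md1 /raid

# 3. Give the CI user (gharunner) read/write/delete access.
$RUN chown gharunner:gharunner /raid

# 4. Persist across reboots: record the array config and add an fstab entry
#    (nofail so a degraded array does not block boot).
$RUN sh -c 'mdadm --detail --scan >> /etc/mdadm/mdadm.conf'
$RUN sh -c 'echo "/dev/md1 /raid ext4 defaults,nofail 0 0" >> /etc/fstab'
```

With `RUN="echo sudo"` this prints each command for review; checking the printed `mdadm`/`mount` lines against the node's actual `lsblk` before flipping the guard is the point of the dry run.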
```
nvme0n1 259:1 0 3.5T 0 disk
├─nvme0n1p1 259:2 0 3.5T 0 part
└─nvme0n1p9 259:3 0 8M 0 part
nvme2n1 259:5 0 3.5T 0 disk
├─nvme2n1p1 259:6 0 3.5T 0 part
└─nvme2n1p9 259:7 0 8M 0 part
nvme1n1 259:8 0 894.3G 0 disk
├─nvme1n1p1 259:36 0 512M 0 part
└─nvme1n1p2 259:37 0 893.8G 0 part
  └─md0 9:0 0 893.7G 0 raid1 /
nvme3n1 259:10 0 3.5T 0 disk
├─nvme3n1p1 259:11 0 3.5T 0 part
└─nvme3n1p9 259:12 0 8M 0 part
nvme4n1 259:14 0 3.5T 0 disk
├─nvme4n1p1 259:15 0 3.5T 0 part
└─nvme4n1p9 259:16 0 8M 0 part
nvme7n1 259:19 0 3.5T 0 disk
├─nvme7n1p1 259:27 0 3.5T 0 part
└─nvme7n1p9 259:28 0 8M 0 part
nvme9n1 259:21 0 3.5T 0 disk
├─nvme9n1p1 259:22 0 3.5T 0 part
└─nvme9n1p9 259:23 0 8M 0 part
nvme8n1 259:24 0 3.5T 0 disk
├─nvme8n1p1 259:25 0 3.5T 0 part
└─nvme8n1p9 259:26 0 8M 0 part
nvme5n1 259:30 0 3.5T 0 disk
├─nvme5n1p1 259:31 0 3.5T 0 part
└─nvme5n1p9 259:32 0 8M 0 part
nvme6n1 259:33 0 894.3G 0 disk
├─nvme6n1p1 259:34 0 512M 0 part /boot/efi
└─nvme6n1p2 259:35 0 893.8G 0 part
  └─md0 9:0 0 893.7G 0 raid1 /
```
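Since each node's layout differs, it is safer to generate the per-node device list mechanically than by eye. A minimal sketch, assuming the pattern above holds cluster-wide (OS on the two ~894G disks backing the `md0` RAID1 root, data on the 3.5T disks) and assuming `lsblk -dnro NAME,SIZE,TYPE` output; the function name and the trimmed sample are hypothetical:

```shell
#!/bin/sh
# Select the 3.5T data disks from `lsblk -dnro NAME,SIZE,TYPE` output.
# Assumes fields: NAME SIZE TYPE, one whole disk per line (-d = no partitions).
pick_data_disks() {
  awk '$3 == "disk" && $2 == "3.5T" { print "/dev/" $1 }'
}

# Demo on a trimmed sample of the paste above; on a real node you would run:
#   lsblk -dnro NAME,SIZE,TYPE | pick_data_disks
pick_data_disks <<'EOF'
nvme0n1 3.5T disk
nvme1n1 894.3G disk
nvme2n1 3.5T disk
nvme6n1 894.3G disk
EOF
```

On the sample this prints `/dev/nvme0n1` and `/dev/nvme2n1`, skipping the two ~894G OS disks. Filtering by size is a deliberate shortcut for this cluster; eyeball the result against the node's full `lsblk` tree before feeding it to `mdadm`.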