# batch_size = 2 fails on two GPUs
(OSGNet3) lihua@cs-02:/media/lihua/hu/OSGs/OSGNet (main)$ bash tools/train.sh configs/tacos/my_tacos.yaml False test2 3,4 train
......
Parameter Count: all 124,031,986; trainable 124,031,986
Start training model MultiTaskArch ...
ego4d_data/tacos/test_lemma.jsonl
Start training model MultiTaskArch ...
ego4d_data/tacos/test_lemma.jsonl
[Train]: [GPU1] Epoch 0 started
[Train]: [GPU0] Epoch 0 started
training one epoch: 0it [00:04, ?it/s]
Traceback (most recent call last):
File "/media/lihua/hu/OSGs/OSGNet/train.py", line 273, in <module>
main(args)
File "/media/lihua/hu/OSGs/OSGNet/train.py", line 161, in main
best_avgiou=train_one_epoch(
File "/media/lihua/hu/OSGs/OSGNet/libs/utils/train_utils.py", line 414, in train_one_epoch
losses = model(video_list,t=t)
File "/home/lihua/.conda/envs/OSGNet3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/lihua/.conda/envs/OSGNet3/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1040, in forward
output = self._run_ddp_forward(*inputs, **kwargs)
File "/home/lihua/.conda/envs/OSGNet3/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1000, in _run_ddp_forward
return module_to_run(*inputs[0], **kwargs[0])
File "/home/lihua/.conda/envs/OSGNet3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/media/lihua/hu/OSGs/OSGNet/libs/modeling/my_archs.py", line 252, in forward
sim_matrix,shot_query,shot_query_mask=self.predict_VTM(bs_shots,enc_vid_pure,enc_video_txt,src_video_txt_mask)#[txt_nums,shots_num]
File "/media/lihua/hu/OSGs/OSGNet/libs/modeling/my_archs.py", line 512, in predict_VTM
txt_query,txt_query_mask=self.txt_aggregator(enc_video_txt, src_video_txt_mask)#[bs,c,Q]
File "/home/lihua/.conda/envs/OSGNet3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/media/lihua/hu/OSGs/OSGNet/libs/modeling/my_layers.py", line 177, in forward
query,query_mask=qformer(query,query_mask,x,mask)
File "/home/lihua/.conda/envs/OSGNet3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/media/lihua/hu/OSGs/OSGNet/libs/modeling/my_layers.py", line 137, in forward
cross_out, cross_out_mask = self.cross_mixer(self.ln3(out), out_mask_float, self.ln3(cross_y), cross_y_mask)
File "/home/lihua/.conda/envs/OSGNet3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/media/lihua/hu/OSGs/OSGNet/libs/modeling/blocks.py", line 124, in forward
sigma = torch.mean(res_x ** 2, dim=1, keepdim=True)
File "/home/lihua/.conda/envs/OSGNet3/lib/python3.10/site-packages/torch/_tensor.py", line 39, in wrapped
return f(*args, **kwargs)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 34.00 MiB (GPU 0; 23.53 GiB total capacity; 22.94 GiB already allocated; 23.06 MiB free; 22.98 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1218435 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 1218434) of binary: /home/lihua/.conda/envs/OSGNet3/bin/python
Traceback (most recent call last):
File "/home/lihua/.conda/envs/OSGNet3/bin/torchrun", line 33, in <module>
sys.exit(load_entry_point('torch==1.13.1', 'console_scripts', 'torchrun')())
File "/home/lihua/.conda/envs/OSGNet3/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/home/lihua/.conda/envs/OSGNet3/lib/python3.10/site-packages/torch/distributed/run.py", line 762, in main
run(args)
File "/home/lihua/.conda/envs/OSGNet3/lib/python3.10/site-packages/torch/distributed/run.py", line 753, in run
elastic_launch(
File "/home/lihua/.conda/envs/OSGNet3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/lihua/.conda/envs/OSGNet3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2025-08-01_16:17:35
host : cs-02
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 1218434)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
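The OOM message in the log above ends with a hint about fragmentation and `max_split_size_mb`. A minimal sketch for checking whether that hint applies: it parses the numbers out of the message and compares reserved against allocated memory. The helper name `parse_oom` is made up for illustration and is not part of OSGNet or PyTorch.

```python
import re

# Error text copied from the traceback above (only the part that carries numbers).
OOM_MESSAGE = (
    "CUDA out of memory. Tried to allocate 34.00 MiB (GPU 0; 23.53 GiB total capacity; "
    "22.94 GiB already allocated; 23.06 MiB free; 22.98 GiB reserved in total by PyTorch)"
)

UNIT_TO_MIB = {"MiB": 1.0, "GiB": 1024.0}


def parse_oom(message):
    """Return the sizes named in a PyTorch 1.13-style OOM message, in MiB."""
    patterns = {
        "requested": r"Tried to allocate ([\d.]+) (MiB|GiB)",
        "total": r"([\d.]+) (MiB|GiB) total capacity",
        "allocated": r"([\d.]+) (MiB|GiB) already allocated",
        "free": r"([\d.]+) (MiB|GiB) free",
        "reserved": r"([\d.]+) (MiB|GiB) reserved",
    }
    sizes = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, message)
        if match is None:
            raise ValueError(f"could not find '{name}' in the message")
        sizes[name] = float(match.group(1)) * UNIT_TO_MIB[match.group(2)]
    return sizes


sizes = parse_oom(OOM_MESSAGE)
# Memory the caching allocator holds but has not handed out to tensors.
slack_mib = sizes["reserved"] - sizes["allocated"]
# Share of the card already occupied by live tensors of this process.
allocated_fraction = sizes["allocated"] / sizes["total"]

print(f"reserved - allocated = {slack_mib:.2f} MiB")
print(f"allocated / total    = {allocated_fraction:.1%}")
```

Reserved memory exceeds allocated memory by only about 41 MiB here, so the "reserved >> allocated" condition does not hold, and tuning `max_split_size_mb` looks unlikely to help by itself: rank 0 appears to have filled GPU 0 with live tensors during the very first iteration.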
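In addition, my nvcc and transformer versions differ slightly from the environment in the README. (I use CUDA 11.8 because I did not want to download 11.7 as well. The transformer version differs because the environment set up with the default method reported a compatibility error.)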
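Hello, and thank you very much for releasing the complete code. However, the code seems to have a serious memory management problem in multi-GPU training.
For example: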
# batch_size = 1 runs on a single GPU
(OSGNet3) lihua@cs-02:/media/lihua/hu/OSGs/OSGNet (main)$ bash tools/train.sh configs/tacos/my_tacos.yaml False test2 3 train

At the same time:
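After the error, the memory of the second GPU specified for multi-GPU training is completely full.
I also found that a process was still running on the 4th GPU:
At that point the only option is to kill the process by force: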
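A minimal sketch of how leftover training processes could be listed and terminated from Python instead of by hand. It assumes `nvidia-smi` is on the `PATH` and supports `--query-compute-apps`; the function names and the sample CSV text are hypothetical and only illustrate the parsing.

```python
import os
import signal
import subprocess


def parse_compute_apps(csv_text):
    """Parse 'pid, used_memory' CSV rows into (pid, used_mib) tuples.

    used_mib is None when the driver reports a non-numeric value such as [N/A].
    """
    rows = []
    for line in csv_text.strip().splitlines():
        pid_text, mem_text = [field.strip() for field in line.split(",")]
        used_mib = int(mem_text) if mem_text.isdigit() else None
        rows.append((int(pid_text), used_mib))
    return rows


def list_gpu_processes():
    """Ask nvidia-smi for every compute process currently holding GPU memory."""
    result = subprocess.run(
        ["nvidia-smi", "--query-compute-apps=pid,used_memory",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_compute_apps(result.stdout)


def terminate(pids, sig=signal.SIGTERM):
    """Send a signal to each PID; PIDs that have already exited are skipped."""
    for pid in pids:
        try:
            os.kill(pid, sig)
        except ProcessLookupError:
            pass


# Made-up sample text, used only to show the parsed shape.
sample = "12345, 1000\n67890, [N/A]\n"
print(parse_compute_apps(sample))
```

On the affected machine, `list_gpu_processes()` would show whether the worker that received SIGTERM in the log (pid 1218435) is still alive, and `terminate([...])` could then be called with the PIDs that belong to the failed run. Check each PID's owner and command line first, since other users' jobs may share the same GPUs.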
I changed some paths while editing this post, but that should not affect anything.