
Memory management #2

@cosxsinxds


Hi, and thank you for releasing the complete code. However, the code appears to have a serious memory-management problem during multi-GPU training.
For example:

# batch_size = 1 runs fine on a single GPU
(OSGNet3) lihua@cs-02:/media/lihua/hu/OSGs/OSGNet (main)$ bash tools/train.sh configs/tacos/my_tacos.yaml False test2 3 train
# batch_size = 2 fails on two GPUs
(OSGNet3) lihua@cs-02:/media/lihua/hu/OSGs/OSGNet (main)$ bash tools/train.sh configs/tacos/my_tacos.yaml False test2 3,4 train
......
Parameter Count: all 124,031,986; trainable 124,031,986

Start training model MultiTaskArch ...
ego4d_data/tacos/test_lemma.jsonl

Start training model MultiTaskArch ...
ego4d_data/tacos/test_lemma.jsonl

[Train]: [GPU1] Epoch 0 started

[Train]: [GPU0] Epoch 0 started
training one epoch: 0it [00:04, ?it/s]
Traceback (most recent call last):
  File "/media/lihua/hu/OSGs/OSGNet/train.py", line 273, in <module>
    main(args)
  File "/media/lihua/hu/OSGs/OSGNet/train.py", line 161, in main
    best_avgiou=train_one_epoch(
  File "/media/lihua/hu/OSGs/OSGNet/libs/utils/train_utils.py", line 414, in train_one_epoch
    losses = model(video_list,t=t)
  File "/home/lihua/.conda/envs/OSGNet3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/lihua/.conda/envs/OSGNet3/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1040, in forward
    output = self._run_ddp_forward(*inputs, **kwargs)
  File "/home/lihua/.conda/envs/OSGNet3/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1000, in _run_ddp_forward
    return module_to_run(*inputs[0], **kwargs[0])
  File "/home/lihua/.conda/envs/OSGNet3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/media/lihua/hu/OSGs/OSGNet/libs/modeling/my_archs.py", line 252, in forward
    sim_matrix,shot_query,shot_query_mask=self.predict_VTM(bs_shots,enc_vid_pure,enc_video_txt,src_video_txt_mask)#[txt_nums,shots_num]
  File "/media/lihua/hu/OSGs/OSGNet/libs/modeling/my_archs.py", line 512, in predict_VTM
    txt_query,txt_query_mask=self.txt_aggregator(enc_video_txt, src_video_txt_mask)#[bs,c,Q]
  File "/home/lihua/.conda/envs/OSGNet3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/media/lihua/hu/OSGs/OSGNet/libs/modeling/my_layers.py", line 177, in forward
    query,query_mask=qformer(query,query_mask,x,mask)
  File "/home/lihua/.conda/envs/OSGNet3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/media/lihua/hu/OSGs/OSGNet/libs/modeling/my_layers.py", line 137, in forward
    cross_out, cross_out_mask = self.cross_mixer(self.ln3(out), out_mask_float, self.ln3(cross_y), cross_y_mask)
  File "/home/lihua/.conda/envs/OSGNet3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/media/lihua/hu/OSGs/OSGNet/libs/modeling/blocks.py", line 124, in forward
    sigma = torch.mean(res_x ** 2, dim=1, keepdim=True)
  File "/home/lihua/.conda/envs/OSGNet3/lib/python3.10/site-packages/torch/_tensor.py", line 39, in wrapped
    return f(*args, **kwargs)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 34.00 MiB (GPU 0; 23.53 GiB total capacity; 22.94 GiB already allocated; 23.06 MiB free; 22.98 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1218435 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 1218434) of binary: /home/lihua/.conda/envs/OSGNet3/bin/python
Traceback (most recent call last):
  File "/home/lihua/.conda/envs/OSGNet3/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==1.13.1', 'console_scripts', 'torchrun')())
  File "/home/lihua/.conda/envs/OSGNet3/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/lihua/.conda/envs/OSGNet3/lib/python3.10/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/home/lihua/.conda/envs/OSGNet3/lib/python3.10/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/home/lihua/.conda/envs/OSGNet3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/lihua/.conda/envs/OSGNet3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-08-01_16:17:35
  host      : cs-02
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 1218434)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
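Incidentally, the OOM message itself suggests trying `max_split_size_mb` to reduce allocator fragmentation. As a stopgap only, not a fix for whatever is actually leaking memory, the caching allocator can be configured before launching; the value 128 below is purely illustrative:

```shell
# Caching-allocator knob named in the OOM message; 128 MiB is an illustrative value
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
bash tools/train.sh configs/tacos/my_tacos.yaml False test2 3,4 train
```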

Meanwhile:

(OSGNet) lihua@cs-02:/media/lihua/hu/OSGs/OSGNet (main)$ nvidia-smi
Fri Aug  1 16:18:17 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.133.20             Driver Version: 570.133.20     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4090 D      Off |   00000000:34:00.0 Off |                  Off |
| 32%   30C    P8             19W /  425W |      15MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 4090 D      Off |   00000000:35:00.0 Off |                  Off |
| 32%   31C    P8             21W /  425W |      15MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA GeForce RTX 4090 D      Off |   00000000:36:00.0 Off |                  Off |
| 32%   30C    P8             20W /  425W |      15MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA GeForce RTX 4090 D      Off |   00000000:37:00.0 Off |                  Off |
| 32%   29C    P8             16W /  425W |      18MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA GeForce RTX 4090 D      Off |   00000000:9B:00.0 Off |                  Off |
| 31%   36C    P0             74W /  425W |   17575MiB /  24564MiB |    100%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA GeForce RTX 4090 D      Off |   00000000:9C:00.0 Off |                  Off |
| 32%   30C    P8             16W /  425W |      15MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA GeForce RTX 4090 D      Off |   00000000:9D:00.0 Off |                  Off |
| 31%   31C    P8             12W /  425W |      15MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA GeForce RTX 4090 D      Off |   00000000:9E:00.0 Off |                  Off |
| 31%   29C    P8             12W /  425W |      15MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A         1203313      G   /usr/lib/xorg/Xorg                        4MiB |
|    1   N/A  N/A         1203313      G   /usr/lib/xorg/Xorg                        4MiB |
|    2   N/A  N/A         1203313      G   /usr/lib/xorg/Xorg                        4MiB |
|    3   N/A  N/A         1203313      G   /usr/lib/xorg/Xorg                        4MiB |
|    4   N/A  N/A         1203313      G   /usr/lib/xorg/Xorg                        4MiB |
|    5   N/A  N/A         1203313      G   /usr/lib/xorg/Xorg                        4MiB |
|    6   N/A  N/A         1203313      G   /usr/lib/xorg/Xorg                        4MiB |
|    7   N/A  N/A         1203313      G   /usr/lib/xorg/Xorg                        4MiB |
+-----------------------------------------------------------------------------------------+

After the crash, the second of the GPUs specified for multi-GPU training still has its memory maxed out.
Meanwhile, there are still processes running on GPU 4:

(OSGNet3) lihua@cs-02:/media/lihua/hu/OSGs/OSGNet (main)$ ps aux | grep python
root        2014  0.0  0.0  44284 18432 ?        Ss   7月24   0:00 /usr/bin/python3 /usr/bin/networkd-dispatcher --run-startup-triggers
root        2175  0.0  0.0 121280 22528 ?        Ssl  7月24   0:00 /usr/bin/python3 /usr/share/unattended-upgrades/unattended-upgrade-shutdown --wait-for-signal
lihua 1210352  0.0  0.0 135344 48128 ?        Sl   16:08   0:00 /home/lihua/.conda/envs/BE/bin/python /home/lihua/.vscode-server/extensions/ms-python.black-formatter-2025.2.0/bundled/tool/lsp_server.py --stdio
lihua 1210353  0.0  0.0 135344 48128 ?        Sl   16:08   0:00 /home/lihua/.conda/envs/BE/bin/python /home/lihua/.vscode-server/extensions/ms-python.black-formatter-2025.2.0/bundled/tool/lsp_server.py --stdio
lihua 1210355  0.3  0.0 1152064 199660 ?      Sl   16:08   0:02 /home/lihua/.vscode-server/cli/servers/Stable-488a1f239235055e34e673291fb8d8c810886f81/server/node /home/lihua/.vscode-server/extensions/ms-python.vscode-pylance-2025.7.1/dist/server.bundle.js --cancellationReceive=file:cf8cc755f12c8ce0f96b2bd8c003539d705be39d4b --node-ipc --clientProcessId=1146095
lihua 1210364  0.3  0.0 1152064 200488 ?      Sl   16:08   0:02 /home/lihua/.vscode-server/cli/servers/Stable-488a1f239235055e34e673291fb8d8c810886f81/server/node /home/lihua/.vscode-server/extensions/ms-python.vscode-pylance-2025.7.1/dist/server.bundle.js --cancellationReceive=file:cc39cfbd97b0f3d384550f2a946bcf16cc3f509230 --node-ipc --clientProcessId=1093670
lihua 1218622  0.9  0.8 18890512 2147904 pts/50 Sl 16:17   0:01 /home/lihua/.conda/envs/OSGNet3/bin/python -u train.py configs/tacos/my_tacos.yaml --output test2 --resume False --mode=train
lihua 1218696  1.0  0.8 18892220 2148024 pts/50 Sl 16:17   0:01 /home/lihua/.conda/envs/OSGNet3/bin/python -u train.py configs/tacos/my_tacos.yaml --output test2 --resume False --mode=train
lihua 1220754  0.0  0.0  12304  2048 pts/50   S+   16:19   0:00 grep --color=auto python
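For reference, the orphaned workers can also be cleaned up with a guarded pattern kill. This is a generic sketch; the pattern is based on the `ps` output above and may need adjusting for other runs:

```shell
# Find leftover trainer PIDs; the [.] trick stops pgrep from matching this
# command line itself. pgrep exits non-zero when nothing matches.
pids=$(pgrep -f "train[.]py configs/tacos" || true)
if [ -n "$pids" ]; then
  # Force-kill the orphaned DDP workers so they release their GPU memory
  kill -9 $pids
fi
echo "killed: ${pids:-none}"
```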

At this point the only option is to forcibly kill the processes:

(OSGNet3) lihua@cs-02:/media/lihua/hu/OSGs/OSGNet (main)$ nvidia-smi
Fri Aug  1 16:20:46 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.133.20             Driver Version: 570.133.20     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4090 D      Off |   00000000:34:00.0 Off |                  Off |
| 32%   30C    P8             18W /  425W |      15MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 4090 D      Off |   00000000:35:00.0 Off |                  Off |
| 32%   31C    P8             21W /  425W |      15MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA GeForce RTX 4090 D      Off |   00000000:36:00.0 Off |                  Off |
| 32%   30C    P8             20W /  425W |      15MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA GeForce RTX 4090 D      Off |   00000000:37:00.0 Off |                  Off |
| 32%   29C    P8             16W /  425W |      15MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA GeForce RTX 4090 D      Off |   00000000:9B:00.0 Off |                  Off |
| 31%   37C    P8             21W /  425W |      15MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA GeForce RTX 4090 D      Off |   00000000:9C:00.0 Off |                  Off |
| 32%   30C    P8             16W /  425W |      15MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA GeForce RTX 4090 D      Off |   00000000:9D:00.0 Off |                  Off |
| 31%   31C    P8             12W /  425W |      15MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA GeForce RTX 4090 D      Off |   00000000:9E:00.0 Off |                  Off |
| 31%   29C    P8             12W /  425W |      15MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A         1203313      G   /usr/lib/xorg/Xorg                        4MiB |
|    1   N/A  N/A         1203313      G   /usr/lib/xorg/Xorg                        4MiB |
|    2   N/A  N/A         1203313      G   /usr/lib/xorg/Xorg                        4MiB |
|    3   N/A  N/A         1203313      G   /usr/lib/xorg/Xorg                        4MiB |
|    4   N/A  N/A         1203313      G   /usr/lib/xorg/Xorg                        4MiB |
|    5   N/A  N/A         1203313      G   /usr/lib/xorg/Xorg                        4MiB |
|    6   N/A  N/A         1203313      G   /usr/lib/xorg/Xorg                        4MiB |
|    7   N/A  N/A         1203313      G   /usr/lib/xorg/Xorg                        4MiB |
+-----------------------------------------------------------------------------------------+

In addition, my nvcc and transformers versions differ slightly from the environment in the README. (I used CUDA 11.8 because I didn't feel like downloading 11.7; transformers differs because the environment set up the default way reported compatibility errors.)

(OSGNet3) lihua@cs-02:/media/lihua/hu/OSGs/OSGNet (main)$ pip list
Package                 Version     Editable project location
----------------------- ----------- ---------------------------------------------------------------------
absl-py                 2.3.1
brotlicffi              1.0.9.2
causal_conv1d           1.0.0       /media/lihua/hu/OSGs/OSGNet/third_party/VideoMamba/causal-conv1d
certifi                 2025.7.14
cffi                    1.17.1
charset-normalizer      3.4.2
einops                  0.8.1
filelock                3.18.0
fsspec                  2025.7.0
grpcio                  1.74.0
hf-xet                  1.1.5
huggingface-hub         0.34.3
idna                    3.10
line_profiler           5.0.0
lmdb                    1.7.3
mamba_ssm               1.0.1       /media/lihua/hu/OSGs/OSGNet/third_party/VideoMamba/mamba
Markdown                3.8.2
MarkupSafe              3.0.2
mkl_fft                 1.3.11
mkl_random              1.2.8
mkl-service             2.4.0
ninja                   1.11.1.4
nms_1d_cpu              0.0.0
numpy                   1.26.4
packaging               25.0
pandas                  2.3.1
pillow                  11.3.0
pip                     25.1
prettytable             3.16.0
protobuf                6.31.1
pycparser               2.21
PySocks                 1.7.1
python-dateutil         2.9.0.post0
pytz                    2025.2
PyYAML                  6.0.2
regex                   2025.7.34
requests                2.32.4
safetensors             0.5.3
setuptools              78.1.1
six                     1.17.0
tensorboard             2.20.0
tensorboard-data-server 0.7.2
terminaltables          3.1.10
tokenizers              0.13.3
tomli                   2.2.1
torch                   1.13.1
torch-kmeans            0.2.0
torchaudio              0.13.1
torchvision             0.14.1
tqdm                    4.67.1
transformers            4.28.0
triton                  3.4.0
typing_extensions       4.14.1
tzdata                  2025.2
urllib3                 2.5.0
wcwidth                 0.2.13
Werkzeug                3.1.3
wheel                   0.45.1

I edited some paths while writing up this issue, but that should not affect anything.
