Memory management

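Hello, and thank you for releasing the complete code. However, the code seems to have a serious memory management problem in multi-GPU training.
For example: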
```bash
# batch_size = 1 runs fine on a single GPU
(OSGNet3) lihua@cs-02:/media/lihua/hu/OSGs/OSGNet (main)$ bash tools/train.sh configs/tacos/my_tacos.yaml False test2 3 train
```
```bash
# batch_size = 2 fails on two GPUs
(OSGNet3) lihua@cs-02:/media/lihua/hu/OSGs/OSGNet (main)$ bash tools/train.sh configs/tacos/my_tacos.yaml False test2 3,4 train
......
Parameter Count: all 124,031,986; trainable 124,031,986

Start training model MultiTaskArch ...
ego4d_data/tacos/test_lemma.jsonl

Start training model MultiTaskArch ...
ego4d_data/tacos/test_lemma.jsonl

[Train]: [GPU1] Epoch 0 started

[Train]: [GPU0] Epoch 0 started
training one epoch: 0it [00:04, ?it/s]
Traceback (most recent call last):
  File "/media/lihua/hu/OSGs/OSGNet/train.py", line 273, in <module>
    main(args)
  File "/media/lihua/hu/OSGs/OSGNet/train.py", line 161, in main
    best_avgiou=train_one_epoch(
  File "/media/lihua/hu/OSGs/OSGNet/libs/utils/train_utils.py", line 414, in train_one_epoch
    losses = model(video_list,t=t)
  File "/home/lihua/.conda/envs/OSGNet3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/lihua/.conda/envs/OSGNet3/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1040, in forward
    output = self._run_ddp_forward(*inputs, **kwargs)
  File "/home/lihua/.conda/envs/OSGNet3/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1000, in _run_ddp_forward
    return module_to_run(*inputs[0], **kwargs[0])
  File "/home/lihua/.conda/envs/OSGNet3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/media/lihua/hu/OSGs/OSGNet/libs/modeling/my_archs.py", line 252, in forward
    sim_matrix,shot_query,shot_query_mask=self.predict_VTM(bs_shots,enc_vid_pure,enc_video_txt,src_video_txt_mask)#[txt_nums,shots_num]
  File "/media/lihua/hu/OSGs/OSGNet/libs/modeling/my_archs.py", line 512, in predict_VTM
    txt_query,txt_query_mask=self.txt_aggregator(enc_video_txt, src_video_txt_mask)#[bs,c,Q]
  File "/home/lihua/.conda/envs/OSGNet3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/media/lihua/hu/OSGs/OSGNet/libs/modeling/my_layers.py", line 177, in forward
    query,query_mask=qformer(query,query_mask,x,mask)
  File "/home/lihua/.conda/envs/OSGNet3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/media/lihua/hu/OSGs/OSGNet/libs/modeling/my_layers.py", line 137, in forward
    cross_out, cross_out_mask = self.cross_mixer(self.ln3(out), out_mask_float, self.ln3(cross_y), cross_y_mask)
  File "/home/lihua/.conda/envs/OSGNet3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/media/lihua/hu/OSGs/OSGNet/libs/modeling/blocks.py", line 124, in forward
    sigma = torch.mean(res_x ** 2, dim=1, keepdim=True)
  File "/home/lihua/.conda/envs/OSGNet3/lib/python3.10/site-packages/torch/_tensor.py", line 39, in wrapped
    return f(*args, **kwargs)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 34.00 MiB (GPU 0; 23.53 GiB total capacity; 22.94 GiB already allocated; 23.06 MiB free; 22.98 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1218435 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 1218434) of binary: /home/lihua/.conda/envs/OSGNet3/bin/python
Traceback (most recent call last):
  File "/home/lihua/.conda/envs/OSGNet3/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==1.13.1', 'console_scripts', 'torchrun')())
  File "/home/lihua/.conda/envs/OSGNet3/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/lihua/.conda/envs/OSGNet3/lib/python3.10/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/home/lihua/.conda/envs/OSGNet3/lib/python3.10/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/home/lihua/.conda/envs/OSGNet3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/lihua/.conda/envs/OSGNet3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-08-01_16:17:35
  host      : cs-02
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 1218434)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
```
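To read the numbers in the OOM line more easily, here is a minimal sketch that parses the message and converts every size to MiB. The regular expression targets the wording printed by torch 1.13.1 in the log above; other PyTorch versions may phrase the message differently, and `parse_oom` and `cached_unused_mib` are names made up for this illustration.

```python
import re

# One "<number> <unit>" size as it appears in the message above.
_SIZE = r"([\d.]+) ([KMG]iB)"
OOM_RE = re.compile(
    rf"Tried to allocate {_SIZE} \(GPU (\d+); {_SIZE} total capacity; "
    rf"{_SIZE} already allocated; {_SIZE} free; {_SIZE} reserved"
)
_MIB_PER_UNIT = {"KiB": 1 / 1024, "MiB": 1.0, "GiB": 1024.0}

# The message copied from the traceback in this report.
MESSAGE = (
    "torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 34.00 MiB "
    "(GPU 0; 23.53 GiB total capacity; 22.94 GiB already allocated; 23.06 MiB free; "
    "22.98 GiB reserved in total by PyTorch)"
)


def parse_oom(message):
    """Return the sizes from a CUDA OOM message in MiB, or None if it does not match."""
    match = OOM_RE.search(message)
    if match is None:
        return None
    g = match.groups()

    def mib(i):
        # g[i] is the number, g[i + 1] is its unit.
        return float(g[i]) * _MIB_PER_UNIT[g[i + 1]]

    info = {
        "gpu": int(g[2]),
        "request_mib": mib(0),
        "total_mib": mib(3),
        "allocated_mib": mib(5),
        "free_mib": mib(7),
        "reserved_mib": mib(9),
    }
    # Memory held by the caching allocator but not backing any live tensor.
    info["cached_unused_mib"] = info["reserved_mib"] - info["allocated_mib"]
    return info


if __name__ == "__main__":
    for key, value in parse_oom(MESSAGE).items():
        print(f"{key}: {value}")
```

In this message, reserved memory exceeds allocated memory by only about 0.04 GiB, so the "reserved >> allocated" condition under which the message suggests `max_split_size_mb` does not seem to apply here. The device appears to be filled by live allocations rather than by fragmentation.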
Meanwhile:
```bash
(OSGNet) lihua@cs-02:/media/lihua/hu/OSGs/OSGNet (main)$ nvidia-smi
Fri Aug  1 16:18:17 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.133.20             Driver Version: 570.133.20     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4090 D      Off |   00000000:34:00.0 Off |                  Off |
| 32%   30C    P8             19W /  425W |      15MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 4090 D      Off |   00000000:35:00.0 Off |                  Off |
| 32%   31C    P8             21W /  425W |      15MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA GeForce RTX 4090 D      Off |   00000000:36:00.0 Off |                  Off |
| 32%   30C    P8             20W /  425W |      15MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA GeForce RTX 4090 D      Off |   00000000:37:00.0 Off |                  Off |
| 32%   29C    P8             16W /  425W |      18MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA GeForce RTX 4090 D      Off |   00000000:9B:00.0 Off |                  Off |
| 31%   36C    P0             74W /  425W |   17575MiB /  24564MiB |    100%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA GeForce RTX 4090 D      Off |   00000000:9C:00.0 Off |                  Off |
| 32%   30C    P8             16W /  425W |      15MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA GeForce RTX 4090 D      Off |   00000000:9D:00.0 Off |                  Off |
| 31%   31C    P8             12W /  425W |      15MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA GeForce RTX 4090 D      Off |   00000000:9E:00.0 Off |                  Off |
| 31%   29C    P8             12W /  425W |      15MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A         1203313      G   /usr/lib/xorg/Xorg                        4MiB |
|    1   N/A  N/A         1203313      G   /usr/lib/xorg/Xorg                        4MiB |
|    2   N/A  N/A         1203313      G   /usr/lib/xorg/Xorg                        4MiB |
|    3   N/A  N/A         1203313      G   /usr/lib/xorg/Xorg                        4MiB |
|    4   N/A  N/A         1203313      G   /usr/lib/xorg/Xorg                        4MiB |
|    5   N/A  N/A         1203313      G   /usr/lib/xorg/Xorg                        4MiB |
|    6   N/A  N/A         1203313      G   /usr/lib/xorg/Xorg                        4MiB |
|    7   N/A  N/A         1203313      G   /usr/lib/xorg/Xorg                        4MiB |
+-----------------------------------------------------------------------------------------+
```
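After the error, the second GPU specified for multi-GPU training (GPU 4) stays heavily occupied: 17575MiB of 24564MiB at 100% utilization in the output above.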
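The traceback and the training log use process-local indices (`GPU 0`, `[GPU0]`, `[GPU1]`), while `nvidia-smi` lists physical indices. A minimal sketch of the mapping follows, under the assumption that `tools/train.sh` passes its GPU argument (`3,4`) through `CUDA_VISIBLE_DEVICES`; this report does not confirm that. `physical_gpu` is a hypothetical helper, and it handles only integer device IDs, not UUIDs.

```python
def physical_gpu(local_index, visible_devices):
    """Map a process-local CUDA index to the ID listed in CUDA_VISIBLE_DEVICES.

    Assumes `visible_devices` is a comma-separated list of integer IDs
    such as "3,4" (the GPU argument used in this report).
    """
    ids = [int(token) for token in visible_devices.split(",") if token.strip()]
    if not 0 <= local_index < len(ids):
        raise IndexError(f"local index {local_index} is outside {ids}")
    return ids[local_index]


if __name__ == "__main__":
    for local_index in range(2):
        print(f"local GPU {local_index} -> physical GPU {physical_gpu(local_index, '3,4')}")
```

Under that assumption, the OOM on `GPU 0` would have occurred on physical GPU 3, and the memory left on physical GPU 4 would belong to the other rank. The CUDA enumeration order can also differ from the `nvidia-smi` order unless `CUDA_DEVICE_ORDER=PCI_BUS_ID` is set, so this mapping is worth verifying on the machine itself.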
I also found that training processes were still running on GPU 4:
```bash
(OSGNet3) lihua@cs-02:/media/lihua/hu/OSGs/OSGNet (main)$ ps aux | grep python
root        2014  0.0  0.0  44284 18432 ?        Ss   7月24   0:00 /usr/bin/python3 /usr/bin/networkd-dispatcher --run-startup-triggers
root        2175  0.0  0.0 121280 22528 ?        Ssl  7月24   0:00 /usr/bin/python3 /usr/share/unattended-upgrades/unattended-upgrade-shutdown --wait-for-signal
lihua 1210352  0.0  0.0 135344 48128 ?        Sl   16:08   0:00 /home/lihua/.conda/envs/BE/bin/python /home/lihua/.vscode-server/extensions/ms-python.black-formatter-2025.2.0/bundled/tool/lsp_server.py --stdio
lihua 1210353  0.0  0.0 135344 48128 ?        Sl   16:08   0:00 /home/lihua/.conda/envs/BE/bin/python /home/lihua/.vscode-server/extensions/ms-python.black-formatter-2025.2.0/bundled/tool/lsp_server.py --stdio
lihua 1210355  0.3  0.0 1152064 199660 ?      Sl   16:08   0:02 /home/lihua/.vscode-server/cli/servers/Stable-488a1f239235055e34e673291fb8d8c810886f81/server/node /home/lihua/.vscode-server/extensions/ms-python.vscode-pylance-2025.7.1/dist/server.bundle.js --cancellationReceive=file:cf8cc755f12c8ce0f96b2bd8c003539d705be39d4b --node-ipc --clientProcessId=1146095
lihua 1210364  0.3  0.0 1152064 200488 ?      Sl   16:08   0:02 /home/lihua/.vscode-server/cli/servers/Stable-488a1f239235055e34e673291fb8d8c810886f81/server/node /home/lihua/.vscode-server/extensions/ms-python.vscode-pylance-2025.7.1/dist/server.bundle.js --cancellationReceive=file:cc39cfbd97b0f3d384550f2a946bcf16cc3f509230 --node-ipc --clientProcessId=1093670
lihua 1218622  0.9  0.8 18890512 2147904 pts/50 Sl 16:17   0:01 /home/lihua/.conda/envs/OSGNet3/bin/python -u train.py configs/tacos/my_tacos.yaml --output test2 --resume False --mode=train
lihua 1218696  1.0  0.8 18892220 2148024 pts/50 Sl 16:17   0:01 /home/lihua/.conda/envs/OSGNet3/bin/python -u train.py configs/tacos/my_tacos.yaml --output test2 --resume False --mode=train
lihua 1220754  0.0  0.0  12304  2048 pts/50   S+   16:19   0:00 grep --color=auto python
```
At this point the only option is to kill the processes by force, after which the memory is released:
```bash
(OSGNet3) lihua@cs-02:/media/lihua/hu/OSGs/OSGNet (main)$ nvidia-smi
Fri Aug  1 16:20:46 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.133.20             Driver Version: 570.133.20     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4090 D      Off |   00000000:34:00.0 Off |                  Off |
| 32%   30C    P8             18W /  425W |      15MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 4090 D      Off |   00000000:35:00.0 Off |                  Off |
| 32%   31C    P8             21W /  425W |      15MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA GeForce RTX 4090 D      Off |   00000000:36:00.0 Off |                  Off |
| 32%   30C    P8             20W /  425W |      15MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA GeForce RTX 4090 D      Off |   00000000:37:00.0 Off |                  Off |
| 32%   29C    P8             16W /  425W |      15MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA GeForce RTX 4090 D      Off |   00000000:9B:00.0 Off |                  Off |
| 31%   37C    P8             21W /  425W |      15MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA GeForce RTX 4090 D      Off |   00000000:9C:00.0 Off |                  Off |
| 32%   30C    P8             16W /  425W |      15MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA GeForce RTX 4090 D      Off |   00000000:9D:00.0 Off |                  Off |
| 31%   31C    P8             12W /  425W |      15MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA GeForce RTX 4090 D      Off |   00000000:9E:00.0 Off |                  Off |
| 31%   29C    P8             12W /  425W |      15MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A         1203313      G   /usr/lib/xorg/Xorg                        4MiB |
|    1   N/A  N/A         1203313      G   /usr/lib/xorg/Xorg                        4MiB |
|    2   N/A  N/A         1203313      G   /usr/lib/xorg/Xorg                        4MiB |
|    3   N/A  N/A         1203313      G   /usr/lib/xorg/Xorg                        4MiB |
|    4   N/A  N/A         1203313      G   /usr/lib/xorg/Xorg                        4MiB |
|    5   N/A  N/A         1203313      G   /usr/lib/xorg/Xorg                        4MiB |
|    6   N/A  N/A         1203313      G   /usr/lib/xorg/Xorg                        4MiB |
|    7   N/A  N/A         1203313      G   /usr/lib/xorg/Xorg                        4MiB |
+-----------------------------------------------------------------------------------------+
```
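The report does not show the command used to kill the leftover processes. As one possible approach, the sketch below extracts the PIDs of `train.py` processes from `ps aux` text, sends `SIGTERM` first, and escalates to `SIGKILL` only for processes that are still alive after a grace period. It is POSIX-only, and `find_pids`, `is_alive`, and `terminate` are hypothetical helper names.

```python
import os
import signal
import time

# Three lines taken from the `ps aux` output in this report.
SAMPLE = """\
lihua 1218622  0.9  0.8 18890512 2147904 pts/50 Sl 16:17   0:01 /home/lihua/.conda/envs/OSGNet3/bin/python -u train.py configs/tacos/my_tacos.yaml --output test2 --resume False --mode=train
lihua 1218696  1.0  0.8 18892220 2148024 pts/50 Sl 16:17   0:01 /home/lihua/.conda/envs/OSGNet3/bin/python -u train.py configs/tacos/my_tacos.yaml --output test2 --resume False --mode=train
lihua 1220754  0.0  0.0  12304  2048 pts/50   S+   16:19   0:00 grep --color=auto python
"""


def find_pids(ps_output, needle):
    """Return PIDs from `ps aux` text whose command line contains `needle`."""
    pids = []
    for line in ps_output.splitlines():
        # `ps aux` prints 10 columns before the command; keep the command whole.
        fields = line.split(None, 10)
        if len(fields) < 11 or not fields[1].isdigit():
            continue
        command = fields[10]
        if needle in command and not command.startswith("grep"):
            pids.append(int(fields[1]))
    return pids


def is_alive(pid):
    """Check whether a process exists without affecting it (signal 0)."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # The process exists but belongs to another user.
    return True


def terminate(pids, grace_seconds=5.0):
    """SIGTERM every PID, wait, then SIGKILL survivors. Returns the PIDs that needed SIGKILL."""
    for pid in pids:
        try:
            os.kill(pid, signal.SIGTERM)
        except ProcessLookupError:
            pass
    deadline = time.monotonic() + grace_seconds
    survivors = [pid for pid in pids if is_alive(pid)]
    while survivors and time.monotonic() < deadline:
        time.sleep(0.1)
        survivors = [pid for pid in survivors if is_alive(pid)]
    for pid in survivors:
        try:
            os.kill(pid, signal.SIGKILL)
        except ProcessLookupError:
            pass
    return survivors


if __name__ == "__main__":
    print(find_pids(SAMPLE, "train.py"))
```

On the machine in this report, the input would come from a live `ps aux` call rather than from `SAMPLE`, and the resulting list would be passed to `terminate`.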
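In addition, my nvcc and transformers versions differ slightly from the environment described in the README. (I use CUDA 11.8 because I did not bother to download 11.7; the transformers version differs because the environment set up with the default method raised a compatibility error.)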
```bash
(OSGNet3) lihua@cs-02:/media/lihua/hu/OSGs/OSGNet (main)$ pip list
Package                 Version     Editable project location
----------------------- ----------- ---------------------------------------------------------------------
absl-py                 2.3.1
brotlicffi              1.0.9.2
causal_conv1d           1.0.0       /media/lihua/hu/OSGs/OSGNet/third_party/VideoMamba/causal-conv1d
certifi                 2025.7.14
cffi                    1.17.1
charset-normalizer      3.4.2
einops                  0.8.1
filelock                3.18.0
fsspec                  2025.7.0
grpcio                  1.74.0
hf-xet                  1.1.5
huggingface-hub         0.34.3
idna                    3.10
line_profiler           5.0.0
lmdb                    1.7.3
mamba_ssm               1.0.1       /media/lihua/hu/OSGs/OSGNet/third_party/VideoMamba/mamba
Markdown                3.8.2
MarkupSafe              3.0.2
mkl_fft                 1.3.11
mkl_random              1.2.8
mkl-service             2.4.0
ninja                   1.11.1.4
nms_1d_cpu              0.0.0
numpy                   1.26.4
packaging               25.0
pandas                  2.3.1
pillow                  11.3.0
pip                     25.1
prettytable             3.16.0
protobuf                6.31.1
pycparser               2.21
PySocks                 1.7.1
python-dateutil         2.9.0.post0
pytz                    2025.2
PyYAML                  6.0.2
regex                   2025.7.34
requests                2.32.4
safetensors             0.5.3
setuptools              78.1.1
six                     1.17.0
tensorboard             2.20.0
tensorboard-data-server 0.7.2
terminaltables          3.1.10
tokenizers              0.13.3
tomli                   2.2.1
torch                   1.13.1
torch-kmeans            0.2.0
torchaudio              0.13.1
torchvision             0.14.1
tqdm                    4.67.1
transformers            4.28.0
triton                  3.4.0
typing_extensions       4.14.1
tzdata                  2025.2
urllib3                 2.5.0
wcwidth                 0.2.13
Werkzeug                3.1.3
wheel                   0.45.1
```
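To make the differences from the README explicit, a small sketch like the following could compare a pasted `pip list` table against a set of expected versions. The `expected` values used here are placeholders, since this report does not reproduce the README's pins, and `parse_pip_list` and `mismatches` are hypothetical helper names.

```python
# A few rows copied from the `pip list` output above.
SAMPLE = """\
Package                 Version     Editable project location
----------------------- ----------- ---------------------------------
tokenizers              0.13.3
torch                   1.13.1
transformers            4.28.0
"""


def parse_pip_list(text):
    """Map lower-cased package names to versions from `pip list` text."""
    versions = {}
    for line in text.splitlines():
        parts = line.split()
        # Skip the header, the dashed rule, shell prompts, and blank lines:
        # a data row has a second column that starts with a digit.
        if len(parts) < 2 or not parts[1][0].isdigit():
            continue
        versions[parts[0].lower()] = parts[1]
    return versions


def mismatches(installed, expected):
    """Return {name: (installed_version, expected_version)} for every difference."""
    result = {}
    for name, wanted in expected.items():
        found = installed.get(name.lower())
        if found != wanted:
            result[name] = (found, wanted)
    return result


if __name__ == "__main__":
    installed = parse_pip_list(SAMPLE)
    # Placeholder pins: replace them with the versions listed in the README.
    expected = {"torch": "1.13.1", "transformers": "0.0.0"}
    print(mismatches(installed, expected))
```

Package names are only lower-cased here; pip treats `-` and `_` as equivalent in names, which this sketch does not normalize.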
I changed the paths while editing this report, but that should not affect anything.

