continuously growing memory #602

anonymoussss · 2023-08-28T08:42:12Z

Hi, I am training DETR on the COCO dataset with the default training script:
python -m torch.distributed.launch --nproc_per_node=8 --use_env main.py --coco_path /path/to/coco
After a few epochs, training always fails with the following error:

ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).  
... ...
RuntimeError: DataLoader worker (pid 8686) is killed by signal: Bus error. It is possible that dataloader's workers are out of shared memory. Please try to raise your shared memory limit.
... ...
RuntimeError: DataLoader worker (pid(s) 8686) exited unexpectedly

I checked memory usage with free -h and found that it keeps increasing during training until the crash. How can I solve this problem?
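To make the growth easier to track than with repeated manual free -h calls, something like the following could be logged once per epoch. This is only a rough sketch: the function names are hypothetical, it assumes a Linux host where /proc/meminfo exposes MemTotal, MemAvailable and Shmem, and "used" here is approximated as MemTotal minus MemAvailable, which may differ slightly from what free reports.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integers.

    Most fields are reported in KiB; a few (e.g. HugePages_Total) are counts.
    """
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            info[key.strip()] = int(parts[0])
    return info


def summarize(info):
    """Return approximate used and shared memory in GiB (hypothetical helper)."""
    kib_per_gib = 1024 ** 2
    return {
        "used_gib": (info["MemTotal"] - info["MemAvailable"]) / kib_per_gib,
        "shmem_gib": info.get("Shmem", 0) / kib_per_gib,
    }


def memory_snapshot(path="/proc/meminfo"):
    """Read the given meminfo file and summarize it."""
    with open(path) as f:
        return summarize(parse_meminfo(f.read()))
```

Calling memory_snapshot() at the end of every epoch and printing the result would show whether overall used memory, shared memory, or both are growing.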

My machine has 256 GB of memory and 8 T4 GPUs. I run the training script in a Docker container with '--shm 256G', CUDA 11.7, Python 3.8.5, torch 2.0.1, torchvision 0.15.2.
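Since the error points at shared memory, it may also be worth confirming from inside the container how large the shared-memory mount actually is and how full it gets. A minimal sketch, assuming shared memory is mounted at /dev/shm (the usual location on Linux); shm_usage is a hypothetical helper name:

```python
import shutil


def shm_usage(path="/dev/shm"):
    """Return total, used and free space of the given mount in GiB."""
    usage = shutil.disk_usage(path)
    gib = 1024 ** 3
    return {
        "total_gib": usage.total / gib,
        "used_gib": usage.used / gib,
        "free_gib": usage.free / gib,
    }
```

If total_gib is much smaller than the size requested when starting the container, the shared-memory setting probably did not take effect.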
