
continuously growing memory #602

Open
anonymoussss opened this issue Aug 28, 2023 · 0 comments

Hi, I am training DETR on the COCO dataset with the default training script, as follows:
python -m torch.distributed.launch --nproc_per_node=8 --use_env main.py --coco_path /path/to/coco
But every time, after training for a few epochs, it fails with an error like the following:

ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).  
... ...
RuntimeError: DataLoader worker (pid 8686) is killed by signal: Bus error. It is possible that dataloader's workers are out of shared memory. Please try to raise your shared memory limit.
... ...
RuntimeError: DataLoader worker (pid(s) 8686) exited unexpectedly
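
To check whether the data-loading path alone already fills up shared memory, here is a minimal standalone sketch (not DETR code: the paths are placeholders and it uses torchvision's CocoDetection directly instead of DETR's own dataset wrapper). Tensors returned by the workers travel through /dev/shm, so running this while watching free -h or df -h /dev/shm should show whether data loading on its own is enough to trigger the problem.

from torch.utils.data import DataLoader
from torchvision.datasets import CocoDetection
from torchvision.transforms import ToTensor

# Placeholder paths -- adjust to the COCO layout passed via --coco_path.
dataset = CocoDetection(root="/path/to/coco/train2017",
                        annFile="/path/to/coco/annotations/instances_train2017.json",
                        transform=ToTensor())

# Multi-worker loading as in training; the collate_fn keeps the (image, target)
# pairs as a plain list so variable-sized images need no stacking.
loader = DataLoader(dataset, batch_size=2, shuffle=True, num_workers=2,
                    collate_fn=lambda batch: batch)

for i, _batch in enumerate(loader):
    if i % 500 == 0:
        print(f"iter {i}")  # watch memory / shared-memory usage while this runs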

I checked memory usage with free -h and found that it kept increasing throughout training until the crash. How can I solve this problem?
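
For reference, this is the kind of small helper I would use to see where the growth happens (a sketch assuming psutil is installed; it is not part of the DETR requirements). It logs the resident memory of the main process and of its children, i.e. the DataLoader workers, so the increase can be attributed to one side or the other.

import os
import psutil

def log_memory(tag):
    # RSS of the main training process and of all child processes
    # (the DataLoader workers), reported in GiB.
    proc = psutil.Process(os.getpid())
    main_gib = proc.memory_info().rss / 2**30
    workers_gib = sum(c.memory_info().rss for c in proc.children(recursive=True)) / 2**30
    print(f"[{tag}] main={main_gib:.2f} GiB, workers={workers_gib:.2f} GiB", flush=True)

Calling log_memory(f"epoch {epoch}") once per epoch in the training loop of main.py should show whether the resident set keeps climbing and on which side.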

My machine has 256 GB of memory and 8 T4 GPUs. I run the training script in a Docker container started with '--shm 256G', with CUDA 11.7, Python 3.8.5, torch 2.0.1, and torchvision 0.15.2.
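
To double-check that the shared-memory option actually took effect inside the container, a quick standard-library check of /dev/shm (the tmpfs that the DataLoader workers use on Linux):

import shutil

total, used, free = shutil.disk_usage("/dev/shm")
print(f"/dev/shm: total={total / 2**30:.1f} GiB, "
      f"used={used / 2**30:.1f} GiB, free={free / 2**30:.1f} GiB")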
