This repository has been archived by the owner on Mar 12, 2024. It is now read-only.
Hi, I am training DETR on the COCO dataset with the default training script:
python -m torch.distributed.launch --nproc_per_node=8 --use_env main.py --coco_path /path/to/coco
Every time, after a few epochs of training, it fails with the following error:
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
... ...
RuntimeError: DataLoader worker (pid 8686) is killed by signal: Bus error. It is possible that dataloader's workers are out of shared memory. Please try to raise your shared memory limit.
... ...
RuntimeError: DataLoader worker (pid(s) 8686) exited unexpectedly
I checked memory usage with free -h and found that it kept increasing during training until the crash. How can I solve this problem?
My machine has 256 GB of memory and 8 T4 GPUs. I run the training script in a Docker container with '--shm 256G', CUDA 11.7, Python 3.8.5, torch 2.0.1, and torchvision 0.15.2.
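To make the growth seen with free -h easier to track per epoch, something like the sketch below could be called from the training loop. It is only a rough illustration, not part of the DETR code: the helper names (usage_gib, peak_rss_gib, memory_report) are hypothetical, it assumes a Linux container where shared memory is mounted at /dev/shm, and it only reports the calling process, not the DataLoader workers.

```python
import os
import resource
import shutil

GIB = 1024 ** 3


def usage_gib(path="/dev/shm"):
    """Return total/used/free space of the filesystem mounted at `path`, in GiB."""
    total, used, free = shutil.disk_usage(path)
    return {"total": total / GIB, "used": used / GIB, "free": free / GIB}


def peak_rss_gib():
    """Peak resident set size of the current process in GiB.

    Assumes Linux, where ru_maxrss is reported in KiB.
    """
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss * 1024 / GIB


def memory_report(epoch, shm_path="/dev/shm"):
    """Format a one-line summary of shared-memory usage and peak RSS."""
    shm = usage_gib(shm_path)
    return (
        f"epoch {epoch}: shm used {shm['used']:.2f}/{shm['total']:.2f} GiB, "
        f"peak RSS (pid {os.getpid()}) {peak_rss_gib():.2f} GiB"
    )


if __name__ == "__main__":
    # Fall back to "/" when /dev/shm is not present (assumption: non-Linux host).
    path = "/dev/shm" if os.path.isdir("/dev/shm") else "/"
    print(memory_report(epoch=0, shm_path=path))
```

Printing memory_report(epoch) at the end of each epoch would show whether the shared-memory mount or the main process's memory is the one that keeps growing before the bus error appears.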