increase CPU memory requirement for test_nll_loss_large (pytorch#110963)
Running `python test_nn.py -v -k test_nll_loss_large_tensor` on a machine with a small host RAM availability (e.g. ~50GB) fails with a `SIGKILL`, even though the currently specified memory requirements for CPU (and GPU) are set to 48GB and are thus met.

Profiling the peak memory usage via:
```
\time -v python test_nn.py -v -k test_nll_loss_large_tensor
```
and adding `print(torch.cuda.memory_summary())` at the end of the test shows a peak host RAM usage of >100GB and a device memory usage of ~32GB.
```
Command being timed: "python test_nn.py -v -k test_nll_loss_large_tensor"
User time (seconds): 81.66
System time (seconds): 229.02
Percent of CPU this job got: 671%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:46.30
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 118150096
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 90280839
Voluntary context switches: 1669
Involuntary context switches: 1214548
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
```
```
|               PyTorch CUDA memory summary, device ID 0                     |
|-----------------------------------------------------------------------------|
|  CUDA OOMs: 0                          |  cudaMalloc retries: 0             |
|=============================================================================|
|        Metric           | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|-----------------------------------------------------------------------------|
| Allocated memory        |  32769 MiB |  32769 MiB |  81923 MiB |  49154 MiB |
|   from large pool       |  32768 MiB |  32768 MiB |  81921 MiB |  49152 MiB |
|   from small pool       |      0 MiB |      0 MiB |      1 MiB |      1 MiB |
|-----------------------------------------------------------------------------|
| Active memory           |  32769 MiB |  32769 MiB |  81923 MiB |  49154 MiB |
|   from large pool       |  32768 MiB |  32768 MiB |  81921 MiB |  49152 MiB |
|   from small pool       |      0 MiB |      0 MiB |      1 MiB |      1 MiB |
|-----------------------------------------------------------------------------|
| Requested memory        |  32769 MiB |  32769 MiB |  81923 MiB |  49154 MiB |
|   from large pool       |  32768 MiB |  32768 MiB |  81921 MiB |  49152 MiB |
|   from small pool       |      0 MiB |      0 MiB |      1 MiB |      1 MiB |
|-----------------------------------------------------------------------------|
| GPU reserved memory     |  32774 MiB |  32774 MiB |  81938 MiB |  49164 MiB |
|   from large pool       |  32772 MiB |  32772 MiB |  81930 MiB |  49158 MiB |
|   from small pool       |      2 MiB |      2 MiB |      8 MiB |      6 MiB |
|-----------------------------------------------------------------------------|
...
```

We haven't seen this issue before as the majority of our runners have sufficient host RAM, and I just ran into it by chance.

CC @atalman @malfet @crcrpar

Pull Request resolved: pytorch#110963
Approved by: https://github.com/mikaylagawarecki, https://github.com/eqy, https://github.com/malfet
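For reference, below is a minimal standalone sketch of the profiling approach described above: it reports the peak host RSS (the same number `\time -v` prints as "Maximum resident set size") together with `torch.cuda.memory_summary()`. The `nll_loss_workload` function and its shapes are illustrative placeholders, not the actual `test_nll_loss_large_tensor` workload from `test_nn.py`.

```python
# Minimal profiling sketch (Linux/Unix only, assumes a CUDA device).
# Reports peak host RSS via resource.getrusage and the CUDA allocator summary.
import resource

import torch
import torch.nn.functional as F


def report_memory(fn):
    """Run fn(), then print peak host RSS and the CUDA caching-allocator summary."""
    fn()
    torch.cuda.synchronize()
    # On Linux, ru_maxrss is reported in kilobytes.
    peak_rss_gib = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / (1024 ** 2)
    print(f"peak host RSS: {peak_rss_gib:.1f} GiB")
    print(torch.cuda.memory_summary())


def nll_loss_workload():
    # Placeholder shapes; the real large-tensor shapes live in test_nn.py.
    logits = torch.randn(8192, 8192, device="cuda", requires_grad=True)
    target = torch.randint(0, 8192, (8192,), device="cuda")
    F.nll_loss(F.log_softmax(logits, dim=1), target).backward()


if __name__ == "__main__":
    report_memory(nll_loss_workload)
```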