Skip to content

Commit

Permalink
increase CPU memory requirement for test_nll_loss_large (pytorch#110963)
Browse files Browse the repository at this point in the history
Running `python test_nn.py -v -k test_nll_loss_large_tensor` on a machine with a small host RAM availability (e.g. ~50GB) fails with a `SIGKILL` even though the currently specified memory requirements for CPU (and GPU) are set to 48GB and are thus met.

Profiling the peak memory usage via:
```
\time -v python test_nn.py -v -k test_nll_loss_large_tensor
```
and adding `print(torch.cuda.memory_summaryu())` at the end of the test shows a higher host RAM usage of >100GB and a device memory usage of ~32GB.
```
	Command being timed: "python test_nn.py -v -k test_nll_loss_large_tensor"
	User time (seconds): 81.66
	System time (seconds): 229.02
	Percent of CPU this job got: 671%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:46.30
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 118150096
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 0
	Minor (reclaiming a frame) page faults: 90280839
	Voluntary context switches: 1669
	Involuntary context switches: 1214548
	Swaps: 0
	File system inputs: 0
	File system outputs: 0
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0
```
```
|                  PyTorch CUDA memory summary, device ID 0                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 0            |        cudaMalloc retries: 0         |
|===========================================================================|
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |  32769 MiB |  32769 MiB |  81923 MiB |  49154 MiB |
|       from large pool |  32768 MiB |  32768 MiB |  81921 MiB |  49152 MiB |
|       from small pool |      0 MiB |      0 MiB |      1 MiB |      1 MiB |
|---------------------------------------------------------------------------|
| Active memory         |  32769 MiB |  32769 MiB |  81923 MiB |  49154 MiB |
|       from large pool |  32768 MiB |  32768 MiB |  81921 MiB |  49152 MiB |
|       from small pool |      0 MiB |      0 MiB |      1 MiB |      1 MiB |
|---------------------------------------------------------------------------|
| Requested memory      |  32769 MiB |  32769 MiB |  81923 MiB |  49154 MiB |
|       from large pool |  32768 MiB |  32768 MiB |  81921 MiB |  49152 MiB |
|       from small pool |      0 MiB |      0 MiB |      1 MiB |      1 MiB |
|---------------------------------------------------------------------------|
| GPU reserved memory   |  32774 MiB |  32774 MiB |  81938 MiB |  49164 MiB |
|       from large pool |  32772 MiB |  32772 MiB |  81930 MiB |  49158 MiB |
|       from small pool |      2 MiB |      2 MiB |      8 MiB |      6 MiB |
|---------------------------------------------------------------------------|
...
```

We haven't seen this issue before as the majority of our runners have sufficient host RAM and I just ran into it by chance.

CC @atalman @malfet @crcrpar
Pull Request resolved: pytorch#110963
Approved by: https://github.com/mikaylagawarecki, https://github.com/eqy, https://github.com/malfet
  • Loading branch information
ptrblck authored and pytorchmergebot committed Oct 25, 2023
1 parent 8516b4d commit 17b732e
Showing 1 changed file with 1 addition and 1 deletion.
2 changes: 1 addition & 1 deletion test/test_nn.py
Original file line number Diff line number Diff line change
Expand Up @@ -11492,7 +11492,7 @@ def test_nll_loss_invalid_weights(self, device):

# Ref: https://github.com/pytorch/pytorch/issue/85005
@onlyCUDA
@largeTensorTest("45GB", "cpu")
@largeTensorTest("120GB", "cpu")
@largeTensorTest("45GB", "cuda")
@parametrize_test("reduction", ("none", "mean", "sum"))
def test_nll_loss_large_tensor(self, device, reduction):
Expand Down

0 comments on commit 17b732e

Please sign in to comment.