increase CPU memory requirement for test_nll_loss_large (#110963) · heysaeed/pytorchw@17b732e

increase CPU memory requirement for test_nll_loss_large (pytorch#110963)

Running `python test_nn.py -v -k test_nll_loss_large_tensor` on a machine with limited host RAM (e.g. ~50GB) fails with a `SIGKILL`, even though the currently specified memory requirements for CPU and GPU are both set to 45GB and are thus met.

Profiling the peak memory usage via:
```
\time -v python test_nn.py -v -k test_nll_loss_large_tensor
```
and adding `print(torch.cuda.memory_summary())` at the end of the test shows a peak host RAM usage of >100GB and a device memory usage of ~32GB:
```
	Command being timed: "python test_nn.py -v -k test_nll_loss_large_tensor"
	User time (seconds): 81.66
	System time (seconds): 229.02
	Percent of CPU this job got: 671%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:46.30
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 118150096
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 0
	Minor (reclaiming a frame) page faults: 90280839
	Voluntary context switches: 1669
	Involuntary context switches: 1214548
	Swaps: 0
	File system inputs: 0
	File system outputs: 0
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0
```
```
|                  PyTorch CUDA memory summary, device ID 0                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 0            |        cudaMalloc retries: 0         |
|===========================================================================|
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |  32769 MiB |  32769 MiB |  81923 MiB |  49154 MiB |
|       from large pool |  32768 MiB |  32768 MiB |  81921 MiB |  49152 MiB |
|       from small pool |      0 MiB |      0 MiB |      1 MiB |      1 MiB |
|---------------------------------------------------------------------------|
| Active memory         |  32769 MiB |  32769 MiB |  81923 MiB |  49154 MiB |
|       from large pool |  32768 MiB |  32768 MiB |  81921 MiB |  49152 MiB |
|       from small pool |      0 MiB |      0 MiB |      1 MiB |      1 MiB |
|---------------------------------------------------------------------------|
| Requested memory      |  32769 MiB |  32769 MiB |  81923 MiB |  49154 MiB |
|       from large pool |  32768 MiB |  32768 MiB |  81921 MiB |  49152 MiB |
|       from small pool |      0 MiB |      0 MiB |      1 MiB |      1 MiB |
|---------------------------------------------------------------------------|
| GPU reserved memory   |  32774 MiB |  32774 MiB |  81938 MiB |  49164 MiB |
|       from large pool |  32772 MiB |  32772 MiB |  81930 MiB |  49158 MiB |
|       from small pool |      2 MiB |      2 MiB |      8 MiB |      6 MiB |
|---------------------------------------------------------------------------|
...
```
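
To compare the two outputs above with the decorator thresholds, the reported figures can be converted to a common unit. The following is a minimal sketch using only the numbers shown above; the helper names are illustrative, and it assumes that the `kbytes` reported by `time -v` are units of 1024 bytes.

```python
# Convert the figures reported above into GiB for easier comparison.
# Assumption: "kbytes" in the `time -v` output means 1024-byte units.
KIB_PER_GIB = 1024 ** 2
MIB_PER_GIB = 1024


def kib_to_gib(kib: int) -> float:
    """Convert KiB to GiB."""
    return kib / KIB_PER_GIB


def mib_to_gib(mib: int) -> float:
    """Convert MiB to GiB."""
    return mib / MIB_PER_GIB


# "Maximum resident set size (kbytes)" from the `time -v` output.
peak_host_gib = kib_to_gib(118150096)
# "Allocated memory", Peak Usage column, from the CUDA memory summary.
peak_device_gib = mib_to_gib(32769)

print(f"peak host RSS:      {peak_host_gib:.1f} GiB")    # 112.7 GiB
print(f"peak device memory: {peak_device_gib:.1f} GiB")  # 32.0 GiB
```

The host peak is more than twice the previous CPU requirement, while the device peak stays below the existing CUDA requirement.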

We haven't seen this issue before because the majority of our runners have sufficient host RAM; I just ran into it by chance.

CC @atalman @malfet @crcrpar
Pull Request resolved: pytorch#110963
Approved by: https://github.com/mikaylagawarecki, https://github.com/eqy, https://github.com/malfet

ptrblck authored and pytorchmergebot committed Oct 25, 2023

1 parent 8516b4d commit 17b732e

test/test_nn.py

@@ -11492,7 +11492,7 @@ def test_nll_loss_invalid_weights(self, device):
         # Ref: https://github.com/pytorch/pytorch/issue/85005
         @onlyCUDA
-        @largeTensorTest("45GB", "cpu")
+        @largeTensorTest("120GB", "cpu")
         @largeTensorTest("45GB", "cuda")
         @parametrize_test("reduction", ("none", "mean", "sum"))
         def test_nll_loss_large_tensor(self, device, reduction):
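
Why the old threshold let the test start on a ~50GB host can be illustrated with a small memory gate. This is a hedged sketch, not PyTorch's `largeTensorTest` implementation: `require_host_memory`, `parse_size`, and `available_host_bytes` are hypothetical names, "GB" is treated as GiB by assumption, and the `sysconf` query is Linux-specific.

```python
import functools
import os
import unittest

GIB = 1024 ** 3


def parse_size(size: str) -> int:
    """Turn a string such as "120GB" into bytes (GB is treated as GiB here)."""
    if not size.upper().endswith("GB"):
        raise ValueError(f"expected a size like '45GB', got {size!r}")
    return int(size[:-2]) * GIB


def available_host_bytes() -> int:
    """Currently available physical memory, queried via sysconf (Linux only)."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_AVAIL_PHYS_PAGES")


def require_host_memory(size: str, available=available_host_bytes):
    """Hypothetical decorator: skip the test unless `size` of host RAM is free."""
    needed = parse_size(size)

    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            if available() < needed:
                raise unittest.SkipTest(f"needs {size} of host RAM")
            return fn(*args, **kwargs)
        return wrapper
    return decorator


def fifty_gib() -> int:
    """Stand-in for a host with roughly 50 GiB of available RAM."""
    return 50 * GIB


@require_host_memory("45GB", available=fifty_gib)
def gated_at_45gb():
    return "ran"


@require_host_memory("120GB", available=fifty_gib)
def gated_at_120gb():
    return "ran"
```

With the 45GB gate the function body runs on the simulated 50 GiB host, which is the situation where the real test was later killed. With the 120GB gate the call raises `unittest.SkipTest` before any allocation happens.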