LocalCUDACluster doesn't come up cleanly after client.restart #1075

Open
randerzander opened this issue Jan 6, 2023 · 6 comments

@randerzander (Contributor) commented Jan 6, 2023

I'm seeing some strange behavior w/ client.restart() on a LocalCUDACluster.

Usually the cluster comes back up fine and I can execute DAGs as expected. But sometimes, I immediately OOM (or hang).

Note: this also happens for protocol='tcp'.

from dask.distributed import Client
from dask_cuda import LocalCUDACluster

if __name__ == "__main__":
    cluster = LocalCUDACluster(
        protocol="ucx",
        jit_unspill=True,
        rmm_pool_size="75GiB",
        device_memory_limit="78GiB",
        local_directory="/raid/dask-local-dir",
    )
    client = Client(cluster)
    # Restart the cluster repeatedly; the failure shows up after a few iterations.
    for n in range(5):
        client.restart()

Result after a couple of successful restarts:

2023-01-06 11:57:19,191 - distributed.core - ERROR - Waited for 8 worker(s) to reconnect after restarting, but after 120s, only 3 have returned. Consider a longer timeout, or `wait_for_workers=False`.
Traceback (most recent call last):
  File "/raid/rgelhausen/conda/envs/pynds/lib/python3.9/site-packages/distributed-2022.12.1+27.g24ba207a-py3.9.egg/distributed/utils.py", line 742, in wrapper
    return await func(*args, **kwargs)
  File "/raid/rgelhausen/conda/envs/pynds/lib/python3.9/site-packages/distributed-2022.12.1+27.g24ba207a-py3.9.egg/distributed/scheduler.py", line 5821, in restart
    raise TimeoutError(msg) from None
asyncio.exceptions.TimeoutError: Waited for 8 worker(s) to reconnect after restarting, but after 120s, only 3 have returned. Consider a longer timeout, or `wait_for_workers=False`.
2023-01-06 11:57:19,192 - distributed.core - ERROR - Exception while handling op restart
Traceback (most recent call last):
  File "/raid/rgelhausen/conda/envs/pynds/lib/python3.9/site-packages/distributed-2022.12.1+27.g24ba207a-py3.9.egg/distributed/core.py", line 820, in _handle_comm
    result = await result
  File "/raid/rgelhausen/conda/envs/pynds/lib/python3.9/site-packages/distributed-2022.12.1+27.g24ba207a-py3.9.egg/distributed/utils.py", line 742, in wrapper
    return await func(*args, **kwargs)
  File "/raid/rgelhausen/conda/envs/pynds/lib/python3.9/site-packages/distributed-2022.12.1+27.g24ba207a-py3.9.egg/distributed/scheduler.py", line 5821, in restart
    raise TimeoutError(msg) from None
asyncio.exceptions.TimeoutError: Waited for 8 worker(s) to reconnect after restarting, but after 120s, only 3 have returned. Consider a longer timeout, or `wait_for_workers=False`.
Traceback (most recent call last):
  File "/home/nfs/rgelhausen/projects/tpc-ds/test.py", line 15, in <module>
    client.restart()
  File "/raid/rgelhausen/conda/envs/pynds/lib/python3.9/site-packages/distributed-2022.12.1+27.g24ba207a-py3.9.egg/distributed/client.py", line 3489, in restart
    return self.sync(
  File "/raid/rgelhausen/conda/envs/pynds/lib/python3.9/site-packages/distributed-2022.12.1+27.g24ba207a-py3.9.egg/distributed/utils.py", line 339, in sync
    return sync(
  File "/raid/rgelhausen/conda/envs/pynds/lib/python3.9/site-packages/distributed-2022.12.1+27.g24ba207a-py3.9.egg/distributed/utils.py", line 406, in sync
    raise exc.with_traceback(tb)
  File "/raid/rgelhausen/conda/envs/pynds/lib/python3.9/site-packages/distributed-2022.12.1+27.g24ba207a-py3.9.egg/distributed/utils.py", line 379, in f
    result = yield future
  File "/raid/rgelhausen/conda/envs/pynds/lib/python3.9/site-packages/tornado/gen.py", line 762, in run
    value = future.result()
  File "/raid/rgelhausen/conda/envs/pynds/lib/python3.9/site-packages/distributed-2022.12.1+27.g24ba207a-py3.9.egg/distributed/client.py", line 3456, in _restart
    await self.scheduler.restart(timeout=timeout, wait_for_workers=wait_for_workers)
  File "/raid/rgelhausen/conda/envs/pynds/lib/python3.9/site-packages/distributed-2022.12.1+27.g24ba207a-py3.9.egg/distributed/core.py", line 1221, in send_recv_from_rpc
    return await send_recv(comm=comm, op=key, **kwargs)
  File "/raid/rgelhausen/conda/envs/pynds/lib/python3.9/site-packages/distributed-2022.12.1+27.g24ba207a-py3.9.egg/distributed/core.py", line 1011, in send_recv
    raise exc.with_traceback(tb)
  File "/raid/rgelhausen/conda/envs/pynds/lib/python3.9/site-packages/distributed-2022.12.1+27.g24ba207a-py3.9.egg/distributed/core.py", line 820, in _handle_comm
    result = await result
  File "/raid/rgelhausen/conda/envs/pynds/lib/python3.9/site-packages/distributed-2022.12.1+27.g24ba207a-py3.9.egg/distributed/utils.py", line 742, in wrapper
    return await func(*args, **kwargs)
  File "/raid/rgelhausen/conda/envs/pynds/lib/python3.9/site-packages/distributed-2022.12.1+27.g24ba207a-py3.9.egg/distributed/scheduler.py", line 5821, in restart
    raise TimeoutError(msg) from None
asyncio.exceptions.TimeoutError: Waited for 8 worker(s) to reconnect after restarting, but after 120s, only 3 have returned. Consider a longer timeout, or `wait_for_workers=False`

It seems like processes are not exiting cleanly, sometimes causing a hang, and sometimes causing OOMs:

std::bad_alloc: out_of_memory: RMM failure at:/raid/rgelhausen/conda/envs/pynds/include/rmm/mr/device/pool_memory_resource.hpp:193: Maximum pool size exceeded
conda list | grep dask
dask                      2022.12.1+17.gf463fcf28          pypi_0    pypi
dask-cuda                 22.12.0            pyhd8ed1ab_0    conda-forge
distributed               2022.12.1+27.g24ba207a          pypi_0    pypi
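
For reference, the two mitigations named in the timeout message map directly onto client.restart arguments (both are visible in the traceback above). A minimal sketch; these only change how long the scheduler waits and do not address why the workers fail to come back:

# Assumes `client` from the reproducer above.
client.restart(timeout=600)             # wait longer than the 120s seen above
# or
client.restart(wait_for_workers=False)  # don't block until all workers have reconnected
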
@randerzander (Contributor, Author) commented:

As a workaround for my lengthy job, I've broken it up into a series of bash commands that does something like:

python script.py --arg part_1
python script.py --arg part_2

Unfortunately, it seems that workers do not terminate cleanly, leaving jobs intermittently frozen at completion:

INFO:distributed.nanny:Closing Nanny at 'ucx://127.0.0.1:56111'. Reason: nanny-close
INFO:distributed.nanny:Nanny asking worker to close. Reason: nanny-close
INFO:distributed.nanny:Closing Nanny at 'ucx://127.0.0.1:51293'. Reason: nanny-close
INFO:distributed.nanny:Nanny asking worker to close. Reason: nanny-close
INFO:distributed.nanny:Closing Nanny at 'ucx://127.0.0.1:37857'. Reason: nanny-close
INFO:distributed.nanny:Nanny asking worker to close. Reason: nanny-close
INFO:distributed.nanny:Closing Nanny at 'ucx://127.0.0.1:37675'. Reason: nanny-close
INFO:distributed.nanny:Nanny asking worker to close. Reason: nanny-close
INFO:distributed.nanny:Closing Nanny at 'ucx://127.0.0.1:60263'. Reason: nanny-close
INFO:distributed.nanny:Nanny asking worker to close. Reason: nanny-close
INFO:distributed.nanny:Closing Nanny at 'ucx://127.0.0.1:33279'. Reason: nanny-close
INFO:distributed.nanny:Nanny asking worker to close. Reason: nanny-close
INFO:distributed.nanny:Closing Nanny at 'ucx://127.0.0.1:58533'. Reason: nanny-close
INFO:distributed.nanny:Nanny asking worker to close. Reason: nanny-close
INFO:distributed.nanny:Closing Nanny at 'ucx://127.0.0.1:44161'. Reason: nanny-close
INFO:distributed.nanny:Nanny asking worker to close. Reason: nanny-close
INFO:distributed.core:Received 'close-stream' from ucx://127.0.0.1:51819; closing.
INFO:distributed.core:Received 'close-stream' from ucx://127.0.0.1:51819; closing.
INFO:distributed.core:Received 'close-stream' from ucx://127.0.0.1:51819; closing.
INFO:distributed.core:Received 'close-stream' from ucx://127.0.0.1:51819; closing.
INFO:distributed.core:Received 'close-stream' from ucx://127.0.0.1:51819; closing.
INFO:distributed.scheduler:Remove worker <WorkerState 'ucx://127.0.0.1:39339', name: 0, status: closing, memory: 0, processing: 0>
INFO:distributed.core:Removing comms to ucx://127.0.0.1:39339
INFO:distributed.scheduler:Remove worker <WorkerState 'ucx://127.0.0.1:54669', name: 1, status: closing, memory: 3, processing: 0>
INFO:distributed.core:Removing comms to ucx://127.0.0.1:54669
INFO:distributed.scheduler:Remove worker <WorkerState 'ucx://127.0.0.1:32911', name: 2, status: closing, memory: 4, processing: 0>
INFO:distributed.core:Removing comms to ucx://127.0.0.1:32911
INFO:distributed.scheduler:Remove worker <WorkerState 'ucx://127.0.0.1:46503', name: 3, status: closing, memory: 2, processing: 0>
INFO:distributed.core:Removing comms to ucx://127.0.0.1:46503
INFO:distributed.scheduler:Remove worker <WorkerState 'ucx://127.0.0.1:40081', name: 4, status: closing, memory: 3, processing: 0>
INFO:distributed.core:Removing comms to ucx://127.0.0.1:40081
INFO:distributed.core:Received 'close-stream' from ucx://127.0.0.1:51819; closing.
INFO:distributed.core:Received 'close-stream' from ucx://127.0.0.1:51819; closing.
INFO:distributed.core:Received 'close-stream' from ucx://127.0.0.1:51819; closing.
INFO:distributed.scheduler:Remove worker <WorkerState 'ucx://127.0.0.1:60727', name: 7, status: closing, memory: 1, processing: 4>
INFO:distributed.core:Removing comms to ucx://127.0.0.1:60727
INFO:distributed.scheduler:Remove worker <WorkerState 'ucx://127.0.0.1:36305', name: 5, status: closing, memory: 2, processing: 4>
INFO:distributed.core:Removing comms to ucx://127.0.0.1:36305
INFO:distributed.scheduler:Remove worker <WorkerState 'ucx://127.0.0.1:56345', name: 6, status: closing, memory: 2, processing: 4>
INFO:distributed.core:Removing comms to ucx://127.0.0.1:56345
INFO:distributed.scheduler:Lost all workers
INFO:distributed.scheduler:Scheduler closing...
INFO:distributed.scheduler:Scheduler closing all comms
INFO:distributed.core:Connection to ucx://127.0.0.1:51819 has been closed.

But nvidia-smi shows memory still allocated to workers:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.60.13    Driver Version: 525.60.13    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100 80G...  On   | 00000000:01:00.0 Off |                    0 |
| N/A   49C    P0    76W / 300W |    770MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100 80G...  On   | 00000000:23:00.0 Off |                    0 |
| N/A   41C    P0    55W / 300W |      2MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100 80G...  On   | 00000000:41:00.0 Off |                    0 |
| N/A   42C    P0    53W / 300W |      2MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A100 80G...  On   | 00000000:61:00.0 Off |                    0 |
| N/A   41C    P0    53W / 300W |      2MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA A100 80G...  On   | 00000000:81:00.0 Off |                    0 |
| N/A   42C    P0    52W / 300W |      2MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   5  NVIDIA A100 80G...  On   | 00000000:A1:00.0 Off |                    0 |
| N/A   41C    P0    54W / 300W |      2MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   6  NVIDIA A100 80G...  On   | 00000000:C1:00.0 Off |                    0 |
| N/A   43C    P0    57W / 300W |      2MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   7  NVIDIA A100 80G...  On   | 00000000:E1:00.0 Off |                    0 |
| N/A   42C    P0    55W / 300W |      2MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A   3917295      C   python                            768MiB |
+-----------------------------------------------------------------------------+
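
One thing that might be worth trying in the per-part scripts (my suggestion, not something verified against this issue) is tearing the cluster down explicitly instead of relying on interpreter-exit cleanup. A sketch:

from dask.distributed import Client
from dask_cuda import LocalCUDACluster

if __name__ == "__main__":
    cluster = LocalCUDACluster(protocol="ucx", jit_unspill=True)
    client = Client(cluster)
    try:
        ...  # run the workload for this part here
    finally:
        # Explicit teardown; may or may not avoid the hang at completion,
        # but it takes implicit interpreter-exit cleanup out of the picture.
        client.close()
        cluster.close()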

@wence- (Contributor) commented Jan 9, 2023

> 0 N/A N/A 3917295 C python 768MiB

Are you sure this is a worker process? (And not, say, the client?)

@randerzander (Contributor, Author) commented:

> 0 N/A N/A 3917295 C python 768MiB
>
> Are you sure this is a worker process? (And not, say, the client?)

I think you're right that the pid shown under nvidia-smi is the client process. However, I suspect some aspect of worker exit/restart is not functioning properly given all GPUs still have some small allocation.
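
A quick way to confirm which PIDs belong to workers is to ask the workers themselves via client.run (a sketch, run against the live cluster):

import os

# Maps each worker address to the PID of its process; compare these
# against the PIDs reported by nvidia-smi.
worker_pids = client.run(os.getpid)
print("workers:", worker_pids)
print("client: ", os.getpid())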

@wence- (Contributor) commented Jan 9, 2023

> given all GPUs still have some small allocation.

A CUDA context on the system I have access to is around 300MiB, so 2MiB is a bit small, I think.

@randerzander (Contributor, Author) commented Jan 9, 2023

> > given all GPUs still have some small allocation.
>
> A CUDA context on the system I have access to is around 300MiB, so 2MiB is a bit small, I think.

Agreed, 2MiB is small, but there's no other work happening on this machine, so it seems it's still related to Dask-CUDA workers.

That said, I'm most concerned here with workers not successfully exiting or restarting in the reproducer. The nvidia-smi report showing small allocations on all GPUs just seems like a side-effect of that.

@wence- (Contributor) commented Jan 9, 2023

> > > given all GPUs still have some small allocation.
> >
> > A CUDA context on the system I have access to is around 300MiB, so 2MiB is a bit small, I think.
>
> Agreed, 2MiB is small. That said, there's no other work happening on this machine, so it seems it's still related to Dask-CUDA workers.
>
> That said, I'm most concerned here with workers not successfully exiting or restarting in the reproducer. The nvidia-smi report showing small allocations on all GPUs just seems like a side-effect of that.

Sure. I'll see if I can replicate it.

wence- self-assigned this Feb 8, 2023