LocalCUDACluster doesn't come up cleanly after client.restart #1075

Open
randerzander opened this issue Jan 6, 2023 · 6 comments

@randerzander (Contributor) commented Jan 6, 2023

I'm seeing some strange behavior w/ client.restart() on a LocalCUDACluster.

Usually the cluster comes back up fine and I can execute DAGs as expected. But sometimes, I immediately OOM (or hang).

Note: this also happens for protocol='tcp'.

from dask.distributed import Client
from dask_cuda import LocalCUDACluster

if __name__ == "__main__":
    cluster = LocalCUDACluster(
        protocol="ucx",
        jit_unspill=True,
        rmm_pool_size="75GiB",
        device_memory_limit="78GiB",
        local_directory="/raid/dask-local-dir",
    )
    client = Client(cluster)
    # Restart the cluster repeatedly; the failure shows up after a few iterations.
    for n in range(5):
        client.restart()

Result after a couple of successful restarts:

2023-01-06 11:57:19,191 - distributed.core - ERROR - Waited for 8 worker(s) to reconnect after restarting, but after 120s, only 3 have returned. Consider a longer timeout, or `wait_for_workers=False`.
Traceback (most recent call last):
  File "/raid/rgelhausen/conda/envs/pynds/lib/python3.9/site-packages/distributed-2022.12.1+27.g24ba207a-py3.9.egg/distributed/utils.py", line 742, in wrapper
    return await func(*args, **kwargs)
  File "/raid/rgelhausen/conda/envs/pynds/lib/python3.9/site-packages/distributed-2022.12.1+27.g24ba207a-py3.9.egg/distributed/scheduler.py", line 5821, in restart
    raise TimeoutError(msg) from None
asyncio.exceptions.TimeoutError: Waited for 8 worker(s) to reconnect after restarting, but after 120s, only 3 have returned. Consider a longer timeout, or `wait_for_workers=False`.
2023-01-06 11:57:19,192 - distributed.core - ERROR - Exception while handling op restart
Traceback (most recent call last):
  File "/raid/rgelhausen/conda/envs/pynds/lib/python3.9/site-packages/distributed-2022.12.1+27.g24ba207a-py3.9.egg/distributed/core.py", line 820, in _handle_comm
    result = await result
  File "/raid/rgelhausen/conda/envs/pynds/lib/python3.9/site-packages/distributed-2022.12.1+27.g24ba207a-py3.9.egg/distributed/utils.py", line 742, in wrapper
    return await func(*args, **kwargs)
  File "/raid/rgelhausen/conda/envs/pynds/lib/python3.9/site-packages/distributed-2022.12.1+27.g24ba207a-py3.9.egg/distributed/scheduler.py", line 5821, in restart
    raise TimeoutError(msg) from None
asyncio.exceptions.TimeoutError: Waited for 8 worker(s) to reconnect after restarting, but after 120s, only 3 have returned. Consider a longer timeout, or `wait_for_workers=False`.
Traceback (most recent call last):
  File "/home/nfs/rgelhausen/projects/tpc-ds/test.py", line 15, in <module>
    client.restart()
  File "/raid/rgelhausen/conda/envs/pynds/lib/python3.9/site-packages/distributed-2022.12.1+27.g24ba207a-py3.9.egg/distributed/client.py", line 3489, in restart
    return self.sync(
  File "/raid/rgelhausen/conda/envs/pynds/lib/python3.9/site-packages/distributed-2022.12.1+27.g24ba207a-py3.9.egg/distributed/utils.py", line 339, in sync
    return sync(
  File "/raid/rgelhausen/conda/envs/pynds/lib/python3.9/site-packages/distributed-2022.12.1+27.g24ba207a-py3.9.egg/distributed/utils.py", line 406, in sync
    raise exc.with_traceback(tb)
  File "/raid/rgelhausen/conda/envs/pynds/lib/python3.9/site-packages/distributed-2022.12.1+27.g24ba207a-py3.9.egg/distributed/utils.py", line 379, in f
    result = yield future
  File "/raid/rgelhausen/conda/envs/pynds/lib/python3.9/site-packages/tornado/gen.py", line 762, in run
    value = future.result()
  File "/raid/rgelhausen/conda/envs/pynds/lib/python3.9/site-packages/distributed-2022.12.1+27.g24ba207a-py3.9.egg/distributed/client.py", line 3456, in _restart
    await self.scheduler.restart(timeout=timeout, wait_for_workers=wait_for_workers)
  File "/raid/rgelhausen/conda/envs/pynds/lib/python3.9/site-packages/distributed-2022.12.1+27.g24ba207a-py3.9.egg/distributed/core.py", line 1221, in send_recv_from_rpc
    return await send_recv(comm=comm, op=key, **kwargs)
  File "/raid/rgelhausen/conda/envs/pynds/lib/python3.9/site-packages/distributed-2022.12.1+27.g24ba207a-py3.9.egg/distributed/core.py", line 1011, in send_recv
    raise exc.with_traceback(tb)
  File "/raid/rgelhausen/conda/envs/pynds/lib/python3.9/site-packages/distributed-2022.12.1+27.g24ba207a-py3.9.egg/distributed/core.py", line 820, in _handle_comm
    result = await result
  File "/raid/rgelhausen/conda/envs/pynds/lib/python3.9/site-packages/distributed-2022.12.1+27.g24ba207a-py3.9.egg/distributed/utils.py", line 742, in wrapper
    return await func(*args, **kwargs)
  File "/raid/rgelhausen/conda/envs/pynds/lib/python3.9/site-packages/distributed-2022.12.1+27.g24ba207a-py3.9.egg/distributed/scheduler.py", line 5821, in restart
    raise TimeoutError(msg) from None
asyncio.exceptions.TimeoutError: Waited for 8 worker(s) to reconnect after restarting, but after 120s, only 3 have returned. Consider a longer timeout, or `wait_for_workers=False`

It seems like processes are not exiting cleanly, sometimes causing a hang, and sometimes causing OOMs:

std::bad_alloc: out_of_memory: RMM failure at:/raid/rgelhausen/conda/envs/pynds/include/rmm/mr/device/pool_memory_resource.hpp:193: Maximum pool size exceeded
conda list | grep dask
dask                      2022.12.1+17.gf463fcf28          pypi_0    pypi
dask-cuda                 22.12.0            pyhd8ed1ab_0    conda-forge
distributed               2022.12.1+27.g24ba207a          pypi_0    pypi
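
For reference, the two mitigations named in the timeout message map directly onto client.restart arguments (both are visible in the traceback above). A minimal sketch; these only change how long the scheduler waits and do not address why the workers fail to come back:

# Assumes `client` from the reproducer above.
client.restart(timeout=600)             # wait longer than the 120s seen above
# or
client.restart(wait_for_workers=False)  # don't block until all workers have reconnected
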
@randerzander (Contributor, Author) commented:

As a workaround for my lengthy job, I've broken it up into a series of bash commands that does something like:

python script.py --arg part_1
python script.py --arg part_2

Unfortunately, it seems that workers do not terminate cleanly, leaving jobs intermittently frozen at completion:

INFO:distributed.nanny:Closing Nanny at 'ucx://127.0.0.1:56111'. Reason: nanny-close
INFO:distributed.nanny:Nanny asking worker to close. Reason: nanny-close
INFO:distributed.nanny:Closing Nanny at 'ucx://127.0.0.1:51293'. Reason: nanny-close
INFO:distributed.nanny:Nanny asking worker to close. Reason: nanny-close
INFO:distributed.nanny:Closing Nanny at 'ucx://127.0.0.1:37857'. Reason: nanny-close
INFO:distributed.nanny:Nanny asking worker to close. Reason: nanny-close
INFO:distributed.nanny:Closing Nanny at 'ucx://127.0.0.1:37675'. Reason: nanny-close
INFO:distributed.nanny:Nanny asking worker to close. Reason: nanny-close
INFO:distributed.nanny:Closing Nanny at 'ucx://127.0.0.1:60263'. Reason: nanny-close
INFO:distributed.nanny:Nanny asking worker to close. Reason: nanny-close
INFO:distributed.nanny:Closing Nanny at 'ucx://127.0.0.1:33279'. Reason: nanny-close
INFO:distributed.nanny:Nanny asking worker to close. Reason: nanny-close
INFO:distributed.nanny:Closing Nanny at 'ucx://127.0.0.1:58533'. Reason: nanny-close
INFO:distributed.nanny:Nanny asking worker to close. Reason: nanny-close
INFO:distributed.nanny:Closing Nanny at 'ucx://127.0.0.1:44161'. Reason: nanny-close
INFO:distributed.nanny:Nanny asking worker to close. Reason: nanny-close
INFO:distributed.core:Received 'close-stream' from ucx://127.0.0.1:51819; closing.
INFO:distributed.core:Received 'close-stream' from ucx://127.0.0.1:51819; closing.
INFO:distributed.core:Received 'close-stream' from ucx://127.0.0.1:51819; closing.
INFO:distributed.core:Received 'close-stream' from ucx://127.0.0.1:51819; closing.
INFO:distributed.core:Received 'close-stream' from ucx://127.0.0.1:51819; closing.
INFO:distributed.scheduler:Remove worker <WorkerState 'ucx://127.0.0.1:39339', name: 0, status: closing, memory: 0, processing: 0>
INFO:distributed.core:Removing comms to ucx://127.0.0.1:39339
INFO:distributed.scheduler:Remove worker <WorkerState 'ucx://127.0.0.1:54669', name: 1, status: closing, memory: 3, processing: 0>
INFO:distributed.core:Removing comms to ucx://127.0.0.1:54669
INFO:distributed.scheduler:Remove worker <WorkerState 'ucx://127.0.0.1:32911', name: 2, status: closing, memory: 4, processing: 0>
INFO:distributed.core:Removing comms to ucx://127.0.0.1:32911
INFO:distributed.scheduler:Remove worker <WorkerState 'ucx://127.0.0.1:46503', name: 3, status: closing, memory: 2, processing: 0>
INFO:distributed.core:Removing comms to ucx://127.0.0.1:46503
INFO:distributed.scheduler:Remove worker <WorkerState 'ucx://127.0.0.1:40081', name: 4, status: closing, memory: 3, processing: 0>
INFO:distributed.core:Removing comms to ucx://127.0.0.1:40081
INFO:distributed.core:Received 'close-stream' from ucx://127.0.0.1:51819; closing.
INFO:distributed.core:Received 'close-stream' from ucx://127.0.0.1:51819; closing.
INFO:distributed.core:Received 'close-stream' from ucx://127.0.0.1:51819; closing.
INFO:distributed.scheduler:Remove worker <WorkerState 'ucx://127.0.0.1:60727', name: 7, status: closing, memory: 1, processing: 4>
INFO:distributed.core:Removing comms to ucx://127.0.0.1:60727
INFO:distributed.scheduler:Remove worker <WorkerState 'ucx://127.0.0.1:36305', name: 5, status: closing, memory: 2, processing: 4>
INFO:distributed.core:Removing comms to ucx://127.0.0.1:36305
INFO:distributed.scheduler:Remove worker <WorkerState 'ucx://127.0.0.1:56345', name: 6, status: closing, memory: 2, processing: 4>
INFO:distributed.core:Removing comms to ucx://127.0.0.1:56345
INFO:distributed.scheduler:Lost all workers
INFO:distributed.scheduler:Scheduler closing...
INFO:distributed.scheduler:Scheduler closing all comms
INFO:distributed.core:Connection to ucx://127.0.0.1:51819 has been closed.

But nvidia-smi shows memory still allocated to workers:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.60.13    Driver Version: 525.60.13    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100 80G...  On   | 00000000:01:00.0 Off |                    0 |
| N/A   49C    P0    76W / 300W |    770MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100 80G...  On   | 00000000:23:00.0 Off |                    0 |
| N/A   41C    P0    55W / 300W |      2MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100 80G...  On   | 00000000:41:00.0 Off |                    0 |
| N/A   42C    P0    53W / 300W |      2MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A100 80G...  On   | 00000000:61:00.0 Off |                    0 |
| N/A   41C    P0    53W / 300W |      2MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA A100 80G...  On   | 00000000:81:00.0 Off |                    0 |
| N/A   42C    P0    52W / 300W |      2MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   5  NVIDIA A100 80G...  On   | 00000000:A1:00.0 Off |                    0 |
| N/A   41C    P0    54W / 300W |      2MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   6  NVIDIA A100 80G...  On   | 00000000:C1:00.0 Off |                    0 |
| N/A   43C    P0    57W / 300W |      2MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   7  NVIDIA A100 80G...  On   | 00000000:E1:00.0 Off |                    0 |
| N/A   42C    P0    55W / 300W |      2MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A   3917295      C   python                            768MiB |
+-----------------------------------------------------------------------------+
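
One thing that might be worth trying in the per-part scripts (my suggestion, not something verified against this issue) is tearing the cluster down explicitly instead of relying on interpreter-exit cleanup. A sketch:

from dask.distributed import Client
from dask_cuda import LocalCUDACluster

if __name__ == "__main__":
    cluster = LocalCUDACluster(protocol="ucx", jit_unspill=True)
    client = Client(cluster)
    try:
        ...  # run the workload for this part here
    finally:
        # Explicit teardown; may or may not avoid the hang at completion,
        # but it takes implicit interpreter-exit cleanup out of the picture.
        client.close()
        cluster.close()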

@wence- (Contributor) commented Jan 9, 2023

> 0 N/A N/A 3917295 C python 768MiB

Are you sure this is a worker process? (And not, say, the client?)

@randerzander (Contributor, Author) commented:

> 0 N/A N/A 3917295 C python 768MiB
>
> Are you sure this is a worker process? (And not, say, the client?)

I think you're right that the pid shown under nvidia-smi is the client process. However, I suspect some aspect of worker exit/restart is not functioning properly given all GPUs still have some small allocation.
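
A quick way to confirm which PIDs belong to workers is to ask the workers themselves via client.run (a sketch, run against the live cluster):

import os

# Maps each worker address to the PID of its process; compare these
# against the PIDs reported by nvidia-smi.
worker_pids = client.run(os.getpid)
print("workers:", worker_pids)
print("client: ", os.getpid())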

@wence- (Contributor) commented Jan 9, 2023

> given all GPUs still have some small allocation.

A CUDA context on the system I have access to is around 300MiB, so 2MiB is a bit small, I think.

@randerzander (Contributor, Author) commented Jan 9, 2023

> > given all GPUs still have some small allocation.
>
> A CUDA context on the system I have access to is around 300MiB, so 2MiB is a bit small, I think.

Agreed, 2MiB is small, but there's no other work happening on this machine, so it seems it's still related to Dask-CUDA workers.

That said, I'm most concerned here with workers not successfully exiting or restarting in the reproducer. The nvidia-smi report showing small allocations on all GPUs just seems like a side-effect of that.

@wence- (Contributor) commented Jan 9, 2023

> > > given all GPUs still have some small allocation.
> >
> > A CUDA context on the system I have access to is around 300MiB, so 2MiB is a bit small, I think.
>
> Agreed, 2MiB is small. That said, there's no other work happening on this machine, so it seems it's still related to Dask-CUDA workers.
>
> That said, I'm most concerned here with workers not successfully exiting or restarting in the reproducer. The nvidia-smi report showing small allocations on all GPUs just seems like a side-effect of that.

Sure. I'll see if I can replicate it.

wence- self-assigned this Feb 8, 2023