LocalCUDACluster doesn't come up cleanly after client.restart #1075
Comments
As a workaround for my lengthy jobs, I've broken them up into a series of bash commands that do something like:
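(The exact commands aren't shown above; as a purely hypothetical illustration of the idea, each bash step could invoke a short script like the one below, so every chunk of work gets its own fresh cluster and nothing has to survive a `client.restart()`.)

```python
# Hypothetical sketch of one per-chunk script (not the original commands):
# each invocation spins up and tears down its own LocalCUDACluster.
import sys

from dask_cuda import LocalCUDACluster
from distributed import Client


def run_chunk(chunk_id):
    with LocalCUDACluster() as cluster, Client(cluster) as client:
        # Placeholder for one chunk of the lengthy job.
        result = client.submit(lambda x: x * 2, chunk_id).result()
        print(f"chunk {chunk_id} -> {result}")


if __name__ == "__main__":
    run_chunk(int(sys.argv[1]))
```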
Unfortunately, it seems that workers do not terminate cleanly, leaving jobs intermittently frozen at completion:
But nvidia-smi shows memory still allocated to the workers:
Are you sure this is a worker process? (And not, say, the client?)
I think you're right that the PID shown under nvidia-smi is the client process. However, I suspect some aspect of worker exit/restart is not functioning properly, given that all GPUs still have some small allocation.
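One way to confirm which process owns a PID shown by nvidia-smi is to compare it against the client's own PID and the PIDs reported by the workers themselves; a minimal illustrative sketch, assuming a default LocalCUDACluster:

```python
# Minimal sketch: print the client PID and the worker PIDs so that any PID
# reported by nvidia-smi can be attributed to one or the other.
import os

from dask_cuda import LocalCUDACluster
from distributed import Client

cluster = LocalCUDACluster()
client = Client(cluster)

print("client PID:", os.getpid())
# Client.run executes the function on every worker and returns
# a dict of {worker_address: result}.
print("worker PIDs:", client.run(os.getpid))
```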
A CUDA context on the system I have access to is around 300 MiB, so 2 MiB seems a bit small.
Agreed, 2 MiB is small, but there's no other work happening on this machine, so it still seems related to the Dask-CUDA workers. That said, I'm most concerned here with the workers not successfully exiting or restarting in the reproducer; the nvidia-smi report showing small allocations on all GPUs just seems like a side effect of that.
Sure, I'll try to see if I can replicate.
I'm seeing some strange behavior with client.restart() on a LocalCUDACluster.
Usually the cluster comes back up fine and I can execute DAGs as expected, but sometimes it immediately OOMs (or hangs).
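A minimal sketch of the kind of restart loop that triggers this (a hypothetical reproducer, not the original script, assuming default LocalCUDACluster settings):

```python
# Hypothetical minimal reproducer sketch: restart a LocalCUDACluster
# repeatedly and run a trivial task in between; the hang/OOM shows up
# after a few successful restarts.
from dask_cuda import LocalCUDACluster
from distributed import Client

if __name__ == "__main__":
    cluster = LocalCUDACluster()      # one worker per visible GPU by default
    client = Client(cluster)

    for i in range(10):
        print(client.submit(lambda x: x + 1, i).result())
        client.restart()              # occasionally hangs or OOMs here

    client.close()
    cluster.close()
```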
Note: this also happens for `protocol='tcp'`.
Result after a couple of successful restarts:
It seems like processes are not exiting cleanly, sometimes causing a hang, and sometimes causing OOMs: