Workers keep running regardless of retire all workers and close the client #1235
Comments
The following log is from a case where the job exited successfully but still produced an error log.
Could you include some more information to help us understand the problem a little better? See below.
The above is in the order I'd try things out, so if, for example, the problem doesn't happen with 5, we could skip 6. I also understand that 5 and 6 in particular may be hard for you to try out depending on how the problem manifests, but unfortunately there are so many moving pieces that it is often difficult to get to the source of the problem without first narrowing down the issue. Also note that we've been battling issues with shutting down processes for quite a while. It is a very time-consuming, entangled task, which is why it's been hard to resolve issues like this one. Just for reference, one example of such an issue is dask/distributed#7726.
This issue has occurred since I started using rapids-22.08, so I added the following code at the end of my program.
This worked up to rapids-23.06, failing in only about 10% of runs. After installing rapids-23.08, however, the issue occurs much more often even with the above code, in roughly 90% of jobs, so I have to send the delete-job command manually to force the jobs to exit.
The log shows that even after I retire all workers, close the client, and exit the program manually, the worker is still reset and waits to connect to the client.
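For readers following along, here is a rough sketch of the kind of shutdown sequence described above (retire all workers, then close the client). It is not the snippet from my program. The helper name `shutdown_cluster` and the `timeout` value are made up for illustration, and it assumes `client` is a `dask.distributed.Client` (with `cluster` optionally being something like a `LocalCUDACluster`).

```python
def shutdown_cluster(client, cluster=None, timeout=30):
    """Retire every worker the scheduler knows about, then close the client.

    Hypothetical helper for illustration only; `client` is assumed to be a
    dask.distributed.Client and `cluster` an optional cluster object.
    """
    # Client.nthreads() returns a dict keyed by worker address, so its keys
    # give the addresses of all workers currently known to the scheduler.
    workers = list(client.nthreads())
    if workers:
        # Pass the addresses explicitly so that every worker is targeted.
        client.retire_workers(workers=workers, close_workers=True)
    # Disconnect the client from the scheduler.
    client.close(timeout=timeout)
    # If this process also owns the cluster object, close it as well.
    if cluster is not None:
        cluster.close()
    return workers


# Usage at the end of a program (assumes dask_cuda is installed):
#   from dask_cuda import LocalCUDACluster
#   from dask.distributed import Client
#   cluster = LocalCUDACluster()
#   client = Client(cluster)
#   ...
#   shutdown_cluster(client, cluster)
```

Whether a sequence like this is enough to stop the workers from restarting is exactly what this issue is about, so treat it as a description of the attempted approach rather than a fix.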
The system is Ubuntu 20.04 LTS (kernel 5.4.0-155-generic), running on a supercomputer cluster.
log