Failure to schedule task: the container name is already in use by container #24940
Comments
Thanks for the report @gscho. The allocation ID is part of the container name. I'm looking at this section of the driver code: https://github.com/hashicorp/nomad/blob/v1.9.5/drivers/docker/driver.go#L391-L415. At a high level, I think this can happen if:
After that, if your job group is configured to reschedule, then Nomad will place a new allocation, with a new allocation ID, and probably (since you say this is very rare) succeed. To narrow this down, are you able to check the Nomad client agent logs during one of these occurrences? If the sequence of events I describe above is what's happening, then I would expect to see these logs:
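For reference, the rescheduling behavior described above is controlled by the job group's `reschedule` stanza. A minimal sketch of such a configuration for a batch job (all values illustrative, not a recommendation):

```hcl
job "batch-example" {
  type = "batch"

  group "work" {
    # Illustrative reschedule settings: if an allocation fails, place a
    # new allocation (with a new allocation ID) up to 3 times per hour,
    # backing off exponentially between attempts.
    reschedule {
      attempts       = 3
      interval       = "1h"
      delay          = "30s"
      delay_function = "exponential"
      max_delay      = "10m"
      unlimited      = false
    }

    # task definitions omitted
  }
}
```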
I mentioned in step 4 that we don't log the container removal attempt (line 410). We can add logging for that, which may yield new helpful logs. And if you can get agent logs to us, it may help narrow down whether this is something worth retrying, or whether something else entirely is going on.
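To make the create/remove-on-conflict flow above concrete, here is a small self-contained Go sketch. It is not Nomad's actual driver code: the function names (`createContainer`, `createOrRecover`) and the in-memory `containers` map standing in for the Docker daemon are all hypothetical, and only the shape of the logic (create, detect a name conflict, remove the stale container, retry once) follows the linked driver section.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// ErrNameConflict mimics the Docker daemon's "name already in use" error.
var ErrNameConflict = errors.New("the container name is already in use")

// containers is a stand-in for the Docker daemon's registry of
// containers, keyed by container name.
var containers = map[string]string{} // name -> state

// createContainer simulates a ContainerCreate call: it fails if a
// container with the same name already exists.
func createContainer(name string) error {
	if _, exists := containers[name]; exists {
		return fmt.Errorf("%w by container %q", ErrNameConflict, name)
	}
	containers[name] = "created"
	return nil
}

// removeContainer simulates removing a leftover container by name.
func removeContainer(name string) { delete(containers, name) }

// createOrRecover sketches the recovery path: try to create the
// container; on a name conflict, assume it is a stale leftover (e.g. a
// create that appeared to fail client-side but succeeded daemon-side),
// remove it, and retry once.
func createOrRecover(name string) error {
	err := createContainer(name)
	if err == nil {
		return nil
	}
	if strings.Contains(err.Error(), "already in use") {
		removeContainer(name)
		return createContainer(name)
	}
	return err
}

func main() {
	// Container name derived from the allocation ID (illustrative).
	name := "task-abc123"
	containers[name] = "created" // simulate a leftover container
	if err := createOrRecover(name); err != nil {
		fmt.Println("failed:", err)
		return
	}
	fmt.Println("recovered and created container", name)
}
```

If the removal step itself fails silently (the unlogged attempt at line 410 mentioned above), the retry would hit the same conflict, which is consistent with the intermittent failures reported here.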
@gulducat thanks for the reply. We set up our logging system to collect the Nomad client logs and are now watching for the next occurrence. We'll report back then.
@gulducat we had another occurrence of the issue. Looking at the client logs, it seems the client briefly left and rejoined the cluster while the image for the job was downloading. Any guidance on where to look next would be helpful.
Nomad version
v1.9.5, but it was also happening with v1.9.1. This never happens on Windows, where the hosts are running v1.7.7.
Docker version
Docker version 27.5.1, build 9f9e405
Operating system and Environment details
Ubuntu 22.04.5 LTS (Jammy Jellyfish)
Issue
Jobs fail intermittently with a "the container name is already in use by container" error (as in the issue title).
Seems to be the same as an older issue: #2084
Reproduction steps
Unclear. We run hundreds to thousands of batch jobs per day, and an unknown percentage of them fail with this error.