Failure to schedule task: the container name is already in use by container #24940
Comments
Thanks for the report @gscho. The allocation ID is part of the container name. I'm looking at this section of the driver code: https://github.com/hashicorp/nomad/blob/v1.9.5/drivers/docker/driver.go#L391-L415. At a high level, I think this can happen if:
After that, if your job group is configured to reschedule, then Nomad will place a new allocation, with a new allocation ID, and probably (since you say this is very rare) succeed. To narrow this down, are you able to check the Nomad client agent logs during one of these occurrences? If the sequence of events I describe above is what's happening, then I would expect to see these logs:
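For reference, the rescheduling behavior described above is controlled by the job group's `reschedule` stanza. A minimal sketch of such a configuration for a batch job (all values illustrative, not a recommendation):

```hcl
job "batch-example" {
  type = "batch"

  group "work" {
    # Illustrative reschedule settings: if an allocation fails, place a
    # new allocation (with a new allocation ID) up to 3 times per hour,
    # backing off exponentially between attempts.
    reschedule {
      attempts       = 3
      interval       = "1h"
      delay          = "30s"
      delay_function = "exponential"
      max_delay      = "10m"
      unlimited      = false
    }

    # task definitions omitted
  }
}
```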
I mentioned in step 4 that we don't log the container removal attempt (line 410). We can add logging for that, which may yield new helpful logs. And if you can get agent logs to us, it may help narrow down whether this is something worth retrying, or whether something else entirely is going on.
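To make the create/remove-on-conflict flow above concrete, here is a small self-contained Go sketch. It is not Nomad's actual driver code: the function names (`createContainer`, `createOrRecover`) and the in-memory `containers` map standing in for the Docker daemon are all hypothetical, and only the shape of the logic (create, detect a name conflict, remove the stale container, retry once) follows the linked driver section.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// ErrNameConflict mimics the Docker daemon's "name already in use" error.
var ErrNameConflict = errors.New("the container name is already in use")

// containers is a stand-in for the Docker daemon's registry of
// containers, keyed by container name.
var containers = map[string]string{} // name -> state

// createContainer simulates a ContainerCreate call: it fails if a
// container with the same name already exists.
func createContainer(name string) error {
	if _, exists := containers[name]; exists {
		return fmt.Errorf("%w by container %q", ErrNameConflict, name)
	}
	containers[name] = "created"
	return nil
}

// removeContainer simulates removing a leftover container by name.
func removeContainer(name string) { delete(containers, name) }

// createOrRecover sketches the recovery path: try to create the
// container; on a name conflict, assume it is a stale leftover (e.g. a
// create that appeared to fail client-side but succeeded daemon-side),
// remove it, and retry once.
func createOrRecover(name string) error {
	err := createContainer(name)
	if err == nil {
		return nil
	}
	if strings.Contains(err.Error(), "already in use") {
		removeContainer(name)
		return createContainer(name)
	}
	return err
}

func main() {
	// Container name derived from the allocation ID (illustrative).
	name := "task-abc123"
	containers[name] = "created" // simulate a leftover container
	if err := createOrRecover(name); err != nil {
		fmt.Println("failed:", err)
		return
	}
	fmt.Println("recovered and created container", name)
}
```

If the removal step itself fails silently (the unlogged attempt at line 410 mentioned above), the retry would hit the same conflict, which is consistent with the intermittent failures reported here.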
@gulducat thanks for the reply. We set up our logging system to collect the Nomad client logs and are now watching for the next occurrence. We'll report back then.
@gulducat we had another occurrence of the issue. Looking at the client logs, it seems the client briefly left and rejoined the cluster while the image for the job was downloading. Any guidance on where to look next would be helpful.
Nomad version
v1.9.5, but it was also happening with v1.9.1. This never happens on Windows, where the hosts are running v1.7.7.
Docker version
Docker version 27.5.1, build 9f9e405
Operating system and Environment details
Ubuntu 22.04.5 LTS (Jammy Jellyfish)
Issue
Jobs fail intermittently with a "the container name is already in use by container" error (as in the issue title).
Seems to be the same as an older issue: #2084
Reproduction steps
Unclear. We run hundreds to thousands of batch jobs per day, and an unknown percentage of them fail with this error.