Nomad allocation stuck in pending state after docker "error reading image pull progress" #24955

Open
sijo13 opened this issue Jan 27, 2025 · 2 comments · May be fixed by #24981
Assignees: gulducat
Labels: hcc/jira, stage/accepted (Confirmed, and intend to work on. No timeline commitment though.), theme/driver/docker, type/bug

Comments


sijo13 commented Jan 27, 2025

Nomad allocations have been stuck in the pending state for over 2 days on multiple Nomad clients.

Nomad version

Nomad v1.9.3
BuildDate 2024-11-11T16:35:41Z
Revision d92bf1014886c0ff9f882f4a2691d5ae8ad8131c

Docker version:

$docker info
Client: Docker Engine - Community
 Version:    27.2.0
 Context:    default
 Debug Mode: false
Plugins:
  buildx: Docker Buildx (Docker Inc.)
    Version:  v0.16.2
    Path:     /usr/libexec/docker/cli-plugins/docker-buildx
  compose: Docker Compose (Docker Inc.)
    Version:  v2.29.2
    Path:     /usr/libexec/docker/cli-plugins/docker-compose
 Server Version: 26.1.2
 Storage Driver: overlay2

During the deployment of a job, one of the allocations went into a pending state after it failed to pull the image from Artifactory.

Docker logs:

Jan 25 01:17:58 bvrprdnia201.iggroup.local dockerd[1388]: time="2025-01-25T01:17:58.339685393Z" level=info msg="ignoring event" container=92fc5f2131d0232d1e5742bc3bc64786b565390bce8ad28f493b3af3236b9622 module>
Jan 25 02:14:34 bvrprdnia201.iggroup.local dockerd[1388]: time="2025-01-25T02:14:34.649799719Z" level=error msg="Download failed after 1 attempts: unknown blob"
Jan 25 02:14:34 bvrprdnia201.iggroup.local dockerd[1388]: time="2025-01-25T02:14:34.776031716Z" level=error msg="Download failed after 1 attempts: unknown blob"
Jan 25 02:16:00 bvrprdnia201.iggroup.local dockerd[1388]: time="2025-01-25T02:16:00.439802618Z" level=info msg="Download failed, retrying (1/5): received unexpected HTTP status: 504 Gateway Time-out"
Jan 25 02:17:16 bvrprdnia201.iggroup.local dockerd[1388]: time="2025-01-25T02:17:16.225877795Z" level=info msg="Attempting next endpoint for pull after error: unknown blob" spanID=3bf6c0b72df894d3 traceID=e0dc>
Jan 25 02:17:16 bvrprdnia201.iggroup.local dockerd[1388]: time="2025-01-25T02:17:16.232949682Z" level=info msg="Layer sha256:6ba15ab16ceb7de70bd1c619de0fd612ea4a5fb6204355fddde49e7e03ef2847 cleaned up"
Jan 25 02:17:16 bvrprdnia201.iggroup.local dockerd[1388]: time="2025-01-25T02:17:16.247618126Z" level=info msg="Layer sha256:f98070a3b84d357ad0cf97b5b45269e39fed6066e193fb2800d3d276144fbff7 cleaned up"
Jan 25 02:17:16 bvrprdnia201.iggroup.local dockerd[1388]: time="2025-01-25T02:17:16.506213640Z" level=info msg="Layer sha256:aeaf0cb6e1d27b9e0fc411df025c94bef10c0e38c82ab6e4d093737524834535 cleaned up"
Jan 25 02:17:16 bvrprdnia201.iggroup.local dockerd[1388]: time="2025-01-25T02:17:16.506243175Z" level=info msg="Layer 

Nomad Logs:

{"@level":"info","@message":"(runner) creating new runner (dry: false, once: false)","@module":"agent","@timestamp":"2025-01-25T02:12:26.732710Z"}
{"@level":"info","@message":"(runner) creating watcher","@module":"agent","@timestamp":"2025-01-25T02:12:26.733240Z"}
{"@level":"info","@message":"(runner) starting","@module":"agent","@timestamp":"2025-01-25T02:12:26.733475Z"}
{"@level":"info","@message":"(runner) rendered \"(dynamic)\" =\u003e \"/data/nomad/alloc/5ea2d54c-4500-9a93-4608-f8bf3e2b984a/kafka011-consumer-monitor/secrets/kafka011-consumer-monitor/environment.env\"","@module":"agent","@timestamp":"2025-01-25T02:12:26.747186Z"}
{"@level":"info","@message":"(runner) rendered \"(dynamic)\" =\u003e \"/data/nomad/alloc/5ea2d54c-4500-9a93-4608-f8bf3e2b984a/kafka011-consumer-monitor/secrets/kafka011-consumer-monitor/kafka.properties\"","@module":"agent","@timestamp":"2025-01-25T02:12:26.848208Z"}
{"@level":"info","@message":"(runner) rendered \"(dynamic)\" =\u003e \"/data/nomad/alloc/5ea2d54c-4500-9a93-4608-f8bf3e2b984a/kafka011-consumer-monitor/secrets/kafka011-consumer-monitor/influx.properties\"","@module":"agent","@timestamp":"2025-01-25T02:12:26.882476Z"}
{"@level":"info","@message":"Task event","@module":"client.alloc_runner.task_runner","@timestamp":"2025-01-25T02:12:26.894835Z","alloc_id":"5ea2d54c-4500-9a93-4608-f8bf3e2b984a","failed":false,"msg":"Downloading image","task":"kafka011-consumer-monitor","type":"Driver"}
{"@level":"info","@message":"Task event","@module":"client.alloc_runner.task_runner","@timestamp":"2025-01-25T02:14:26.895459Z","alloc_id":"5ea2d54c-4500-9a93-4608-f8bf3e2b984a","failed":false,"msg":"Docker image pull progress: Pulled 18/21 (353.8MiB/353.8MiB) layers: 3 waiting/0 pulling","task":"kafka011-consumer-monitor","type":"Driver"}
{"@level":"info","@message":"Task event","@module":"client.alloc_runner.task_runner","@timestamp":"2025-01-25T02:16:26.895756Z","alloc_id":"5ea2d54c-4500-9a93-4608-f8bf3e2b984a","failed":false,"msg":"Docker image pull progress: Pulled 18/21 (353.8MiB/353.8MiB) layers: 0 waiting/3 pulling","task":"kafka011-consumer-monitor","type":"Driver"}
{"@level":"error","@message":"error reading image pull progress","@module":"client.driver_mgr.docker","@timestamp":"2025-01-25T02:17:16.229379Z","driver":"docker","error":"unknown blob"}
{"@level":"info","@message":"Task event","@module":"client.alloc_runner.task_runner","@timestamp":"2025-01-25T02:17:26.396716Z","alloc_id":"5ea2d54c-4500-9a93-4608-f8bf3e2b984a","failed":false,"msg":"Task not running by healthy_deadline of 5m0s","task":"kafka011-consumer-monitor","type":"Alloc Unhealthy"}
{"@level":"info","@message":"garbage collecting allocation","@module":"client.gc","@timestamp":"2025-01-25T02:21:13.896466Z","alloc_id":"797bf822-2c75-061f-0725-523c649e3f12","reason":"forced collection"}
{"@level":"info","@message":"Task event","@module":"client.alloc_runner.task_runner","@timestamp":"2025-01-25T02:21:13.898730Z","alloc_id":"797bf822-2c75-061f-0725-523c649e3f12","failed":false,"msg":"Sent interrupt. Waiting 5s before force killing","task":"kafka011-consumer-monitor","type":"Killing"}

The allocation was in a pending state for 2 days; the issue was resolved only after restarting the Nomad client service.

Please note that the same issue was observed in another cluster running Nomad v1.7.3; there, all the pending allocations self-healed, which was not the case with the Nomad clients running v1.9.3.

Nomad job file:

variable "app_image_id" {
  type = string
}
variable "version" {
  type = string
}
variable "namespace" {
  type    = string
  default = "messaging-development"
}
variable "app_name" {
  default = "kafka011-consumer-monitor"
  type    = string
}
variable "vault_root_path" {
  default = "messaging-development-kv"
  type    = string
}
variable "site" {
  type = string
}
variable "extra_java_opts" {
  type = string
}
variable "property_path" {
  type = string
}
variable "cpu" {
  type = string
}
variable "memory" {
  type = string
}

job "kafka011-consumer-monitor" {
  namespace = var.namespace
  node_pool = var.namespace

  meta = {
    version             = "${var.version}"
    ig_datacenter_flip  = "non-core"
  }

  group "kafka011-consumer-monitor" {
    count = 2

    update {
      max_parallel     = 1
      min_healthy_time = "30s"
      auto_revert      = true
    }

    network {
      port "http" {
        to = 8080
      }
    }

    vault {
      policies = ["pol-messaging-development-ldap"]
    }

    task "kafka011-consumer-monitor" {
      driver = "docker"

      config {
        image = var.app_image_id
        ports = ["http"]
      }

      resources {
        cpu    = var.cpu
        memory = var.memory
      }

      restart {
        attempts = 3
        interval = "10m"
        delay    = "30s"
        mode     = "fail"
      }

      service {
        name     = "${var.namespace}-${var.app_name}"
        port     = "http"
        provider = "consul"
        tags = [
          "http",
          "haproxy.router.name=data-and-spectrum",
          "haproxy.service.name=${var.app_name}",
          "haproxy.service.domain=data-platforms"
        ]

        meta  {
          team = "Messaging Development"
        }

        check {
          name      = "Check version"
          type      = "http"
          port      = "http"
          path      = "/${var.app_name}/monitor/version"
          interval  = "15s"
          timeout   = "2s"
        }
      }
    }
  }
}

sijo13 commented Jan 28, 2025

I was able to reproduce the problem as explained below.

Nomad version: v1.9.3
Docker version: Docker version 26.1.2, build 211e74b

Deployed the job below, which uses an image over 7 GB in size.

job "nginx" {
  namespace = "platforms"
  node_pool = "platforms"
  group "nginx-group1" {
    count = 1
    network {
      port "http" {
        to = "80"
      }
    }
    task "nginx-task" {
     driver = "docker"
     config {
        image = "download-artifacts.inaut.io/ig-docker-dtu/pytorch/pytorch"
        ports = ["http"]

    }

      service {
        name     = "platforms-nginx-service"
        port     = "http"
        provider = "consul"

        check {
          type     = "http"
          port     = "http"
          path     = "/"
          interval = "2s"
          timeout  = "2s"
        }
      }
    }
  }
}

Deployed the job and, after some time, updated /etc/hosts on the server to simulate the failure by pointing the Artifactory DNS name at a wrong IP address.

The allocation status is stuck in the pending state. Please refer to the screenshot and logs below.

[Screenshot: allocation stuck in the pending state]

Nomad logs:

k":"nginx-task","timestamp":"2025-01-28T06:31:55.708Z"}
{"@level":"info","@message":"plugin process exited","@module":"client.alloc_runner.task_runner.task_hook.logmon","@timestamp":"2025-01-28T06:31:55.711982Z","alloc_id":"e44e8109-7ad6-5918-ae0d-21204a3d15c6","id":"1088651","plugin":"/usr/bin/nomad","task":"nginx-task"}
{"@level":"info","@message":"Task event","@module":"client.alloc_runner.task_runner","@timestamp":"2025-01-28T06:32:41.859024Z","alloc_id":"a3c54419-959a-5c5c-55bb-fb69c2ff6c06","failed":false,"msg":"Task received by client","task":"nginx-task","type":"Received"}
{"@level":"info","@message":"Task event","@module":"client.alloc_runner.task_runner","@timestamp":"2025-01-28T06:32:41.864046Z","alloc_id":"a3c54419-959a-5c5c-55bb-fb69c2ff6c06","failed":false,"msg":"Building Task Directory","task":"nginx-task","type":"Task Setup"}
{"@level":"info","@message":"Task event","@module":"client.alloc_runner.task_runner","@timestamp":"2025-01-28T06:32:41.920077Z","alloc_id":"a3c54419-959a-5c5c-55bb-fb69c2ff6c06","failed":false,"msg":"Downloading image","task":"nginx-task","type":"Driver"}
{"@level":"info","@message":"Task event","@module":"client.alloc_runner.task_runner","@timestamp":"2025-01-28T06:34:41.920418Z","alloc_id":"a3c54419-959a-5c5c-55bb-fb69c2ff6c06","failed":false,"msg":"Docker image pull progress: Pulled 4/5 (35.94MiB/35.94MiB) layers: 0 waiting/1 pulling","task":"nginx-task","type":"Driver"}
{"@level":"info","@message":"Task event","@module":"client.alloc_runner.task_runner","@timestamp":"2025-01-28T06:36:41.920647Z","alloc_id":"a3c54419-959a-5c5c-55bb-fb69c2ff6c06","failed":false,"msg":"Docker image pull progress: Pulled 4/5 (35.94MiB/35.94MiB) layers: 0 waiting/1 pulling","task":"nginx-task","type":"Driver"}
{"@level":"error","@message":"error reading image pull progress","@module":"client.driver_mgr.docker","@timestamp":"2025-01-28T06:36:48.327447Z","driver":"docker","error":"net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)"}
{"@level":"info","@message":"Task event","@module":"client.alloc_runner.task_runner","@timestamp":"2025-01-28T06:37:41.863186Z","alloc_id":"a3c54419-959a-5c5c-55bb-fb69c2ff6c06","failed":false,"msg":"Task not running by healthy_deadline of 5m0s","task":"nginx-task","type":"Alloc Unhealthy"}

Docker logs:

Jan 28 06:33:58 vrtstnia354.iggroup.local dockerd[1401]: time="2025-01-28T06:33:58.311041092Z" level=info msg="Download failed, retrying (1/5): net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)"
Jan 28 06:34:33 vrtstnia354.iggroup.local dockerd[1401]: time="2025-01-28T06:34:33.315329727Z" level=info msg="Download failed, retrying (2/5): net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)"
Jan 28 06:35:13 vrtstnia354.iggroup.local dockerd[1401]: time="2025-01-28T06:35:13.317304789Z" level=info msg="Download failed, retrying (3/5): net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)"
Jan 28 06:35:58 vrtstnia354.iggroup.local dockerd[1401]: time="2025-01-28T06:35:58.319391213Z" level=info msg="Download failed, retrying (4/5): net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)"
Jan 28 06:36:48 vrtstnia354.iggroup.local dockerd[1401]: time="2025-01-28T06:36:48.323672082Z" level=error msg="Download failed after 5 attempts: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)"
Jan 28 06:36:48 vrtstnia354.iggroup.local dockerd[1401]: time="2025-01-28T06:36:48.324387492Z" level=info msg="Attempting next endpoint for pull after error: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)" spanID=622e5e13d20b6c54 traceID=cf3af4f17a83cf41af27e58cead79893
Jan 28 06:36:48 vrtstnia354.iggroup.local dockerd[1401]: time="2025-01-28T06:36:48.367041194Z" level=info msg="Layer sha256:87d9bdf07ad1a0012391974d139b3a7b709cd16dfaeca768631120e0cc408609 cleaned up"
Jan 28 06:36:48 vrtstnia354.iggroup.local dockerd[1401]: time="2025-01-28T06:36:48.436061532Z" level=info msg="Layer sha256:d101c9453715a978a2a520f553588e77dfb4236762175eba61c5c264a449c75d cleaned up"


gulducat commented Jan 28, 2025

Heya @sijo13, thanks for another report!

I am able to reproduce this one by inducing an error in the image pull code. We did a big refactor of the Docker driver in Nomad 1.9.0, and it looks like this error mode slipped through the cracks, specifically: "error reading image pull progress". We're catching and logging the error, but not also reporting it as a recoverable error as we should.

I think we can patch this one up to go out in Nomad 1.9.6. Thanks for letting us know!
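
For anyone curious what that looks like in practice, here is a minimal, self-contained Go sketch. It is not Nomad's actual driver code: readPullProgress, pullImageBroken, pullImageFixed, and the done channel are hypothetical names for illustration. It shows why catch-and-log-only leaves the caller waiting forever, while also reporting the error lets the task fail and be retried.

package main

import (
    "errors"
    "fmt"
    "log"
)

// readPullProgress stands in for the loop that decodes the Docker image pull
// progress stream; it fails immediately here to simulate the "unknown blob"
// error from the logs above. (Hypothetical name, not a Nomad function.)
func readPullProgress() error {
    return errors.New("unknown blob")
}

// pullImageBroken mirrors the reported failure mode: the error is logged but
// never sent back, so a caller waiting on done blocks forever and the
// allocation looks stuck in "pending".
func pullImageBroken(done chan<- error) {
    if err := readPullProgress(); err != nil {
        log.Printf("error reading image pull progress: %v", err)
        return // caller never hears about the failure
    }
    done <- nil
}

// pullImageFixed logs the error and also reports it, so the caller can treat
// the failed pull as a task failure and apply the restart policy.
func pullImageFixed(done chan<- error) {
    err := readPullProgress()
    if err != nil {
        log.Printf("error reading image pull progress: %v", err)
    }
    done <- err // always report the outcome, error or nil
}

func main() {
    done := make(chan error, 1)
    go pullImageFixed(done) // swap in pullImageBroken to reproduce the hang
    if err := <-done; err != nil {
        fmt.Println("pull failed, the task can now fail and restart:", err)
    }
}

In the real driver the reported error would be surfaced as a recoverable error, per the comment above, so the allocation fails and can be retried instead of staying pending.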

@gulducat gulducat self-assigned this Jan 28, 2025
@gulducat gulducat added the stage/accepted and theme/driver/docker labels Jan 28, 2025
@gulducat gulducat moved this from Needs Triage to In Progress in Nomad - Community Issues Triage Jan 28, 2025
@gulducat gulducat changed the title Nomad allocation stuck in pending state Nomad allocation stuck in pending state after docker "error reading image pull progress" Jan 28, 2025
@gulducat gulducat linked a pull request Jan 30, 2025 that will close this issue