Nomad allocation stuck in pending state after docker "error reading image pull progress" #24955

Open
sijo13 opened this issue Jan 27, 2025 · 2 comments · May be fixed by #24981
Assignees: gulducat
Labels: hcc/jira, stage/accepted (Confirmed, and intend to work on. No timeline commitment though.), theme/driver/docker, type/bug

Comments


sijo13 commented Jan 27, 2025

Nomad allocations have been stuck in the pending state for over 2 days on multiple Nomad clients.

Nomad version

Nomad v1.9.3
BuildDate 2024-11-11T16:35:41Z
Revision d92bf1014886c0ff9f882f4a2691d5ae8ad8131c

Docker version:

$docker info
Client: Docker Engine - Community
 Version:    27.2.0
 Context:    default
 Debug Mode: false
Plugins:
  buildx: Docker Buildx (Docker Inc.)
    Version:  v0.16.2
    Path:     /usr/libexec/docker/cli-plugins/docker-buildx
  compose: Docker Compose (Docker Inc.)
    Version:  v2.29.2
    Path:     /usr/libexec/docker/cli-plugins/docker-compose
 Server Version: 26.1.2
 Storage Driver: overlay2

During the deployment of a job, one of the allocations went into a pending state after it failed to pull the image from Artifactory.

Docker logs:

Jan 25 01:17:58 bvrprdnia201.iggroup.local dockerd[1388]: time="2025-01-25T01:17:58.339685393Z" level=info msg="ignoring event" container=92fc5f2131d0232d1e5742bc3bc64786b565390bce8ad28f493b3af3236b9622 module>
Jan 25 02:14:34 bvrprdnia201.iggroup.local dockerd[1388]: time="2025-01-25T02:14:34.649799719Z" level=error msg="Download failed after 1 attempts: unknown blob"
Jan 25 02:14:34 bvrprdnia201.iggroup.local dockerd[1388]: time="2025-01-25T02:14:34.776031716Z" level=error msg="Download failed after 1 attempts: unknown blob"
Jan 25 02:16:00 bvrprdnia201.iggroup.local dockerd[1388]: time="2025-01-25T02:16:00.439802618Z" level=info msg="Download failed, retrying (1/5): received unexpected HTTP status: 504 Gateway Time-out"
Jan 25 02:17:16 bvrprdnia201.iggroup.local dockerd[1388]: time="2025-01-25T02:17:16.225877795Z" level=info msg="Attempting next endpoint for pull after error: unknown blob" spanID=3bf6c0b72df894d3 traceID=e0dc>
Jan 25 02:17:16 bvrprdnia201.iggroup.local dockerd[1388]: time="2025-01-25T02:17:16.232949682Z" level=info msg="Layer sha256:6ba15ab16ceb7de70bd1c619de0fd612ea4a5fb6204355fddde49e7e03ef2847 cleaned up"
Jan 25 02:17:16 bvrprdnia201.iggroup.local dockerd[1388]: time="2025-01-25T02:17:16.247618126Z" level=info msg="Layer sha256:f98070a3b84d357ad0cf97b5b45269e39fed6066e193fb2800d3d276144fbff7 cleaned up"
Jan 25 02:17:16 bvrprdnia201.iggroup.local dockerd[1388]: time="2025-01-25T02:17:16.506213640Z" level=info msg="Layer sha256:aeaf0cb6e1d27b9e0fc411df025c94bef10c0e38c82ab6e4d093737524834535 cleaned up"
Jan 25 02:17:16 bvrprdnia201.iggroup.local dockerd[1388]: time="2025-01-25T02:17:16.506243175Z" level=info msg="Layer 

Nomad Logs:

{"@level":"info","@message":"(runner) creating new runner (dry: false, once: false)","@module":"agent","@timestamp":"2025-01-25T02:12:26.732710Z"}
{"@level":"info","@message":"(runner) creating watcher","@module":"agent","@timestamp":"2025-01-25T02:12:26.733240Z"}
{"@level":"info","@message":"(runner) starting","@module":"agent","@timestamp":"2025-01-25T02:12:26.733475Z"}
{"@level":"info","@message":"(runner) rendered \"(dynamic)\" =\u003e \"/data/nomad/alloc/5ea2d54c-4500-9a93-4608-f8bf3e2b984a/kafka011-consumer-monitor/secrets/kafka011-consumer-monitor/environment.env\"","@module":"agent","@timestamp":"2025-01-25T02:12:26.747186Z"}
{"@level":"info","@message":"(runner) rendered \"(dynamic)\" =\u003e \"/data/nomad/alloc/5ea2d54c-4500-9a93-4608-f8bf3e2b984a/kafka011-consumer-monitor/secrets/kafka011-consumer-monitor/kafka.properties\"","@module":"agent","@timestamp":"2025-01-25T02:12:26.848208Z"}
{"@level":"info","@message":"(runner) rendered \"(dynamic)\" =\u003e \"/data/nomad/alloc/5ea2d54c-4500-9a93-4608-f8bf3e2b984a/kafka011-consumer-monitor/secrets/kafka011-consumer-monitor/influx.properties\"","@module":"agent","@timestamp":"2025-01-25T02:12:26.882476Z"}
{"@level":"info","@message":"Task event","@module":"client.alloc_runner.task_runner","@timestamp":"2025-01-25T02:12:26.894835Z","alloc_id":"5ea2d54c-4500-9a93-4608-f8bf3e2b984a","failed":false,"msg":"Downloading image","task":"kafka011-consumer-monitor","type":"Driver"}
{"@level":"info","@message":"Task event","@module":"client.alloc_runner.task_runner","@timestamp":"2025-01-25T02:14:26.895459Z","alloc_id":"5ea2d54c-4500-9a93-4608-f8bf3e2b984a","failed":false,"msg":"Docker image pull progress: Pulled 18/21 (353.8MiB/353.8MiB) layers: 3 waiting/0 pulling","task":"kafka011-consumer-monitor","type":"Driver"}
{"@level":"info","@message":"Task event","@module":"client.alloc_runner.task_runner","@timestamp":"2025-01-25T02:16:26.895756Z","alloc_id":"5ea2d54c-4500-9a93-4608-f8bf3e2b984a","failed":false,"msg":"Docker image pull progress: Pulled 18/21 (353.8MiB/353.8MiB) layers: 0 waiting/3 pulling","task":"kafka011-consumer-monitor","type":"Driver"}
{"@level":"error","@message":"error reading image pull progress","@module":"client.driver_mgr.docker","@timestamp":"2025-01-25T02:17:16.229379Z","driver":"docker","error":"unknown blob"}
{"@level":"info","@message":"Task event","@module":"client.alloc_runner.task_runner","@timestamp":"2025-01-25T02:17:26.396716Z","alloc_id":"5ea2d54c-4500-9a93-4608-f8bf3e2b984a","failed":false,"msg":"Task not running by healthy_deadline of 5m0s","task":"kafka011-consumer-monitor","type":"Alloc Unhealthy"}
{"@level":"info","@message":"garbage collecting allocation","@module":"client.gc","@timestamp":"2025-01-25T02:21:13.896466Z","alloc_id":"797bf822-2c75-061f-0725-523c649e3f12","reason":"forced collection"}
{"@level":"info","@message":"Task event","@module":"client.alloc_runner.task_runner","@timestamp":"2025-01-25T02:21:13.898730Z","alloc_id":"797bf822-2c75-061f-0725-523c649e3f12","failed":false,"msg":"Sent interrupt. Waiting 5s before force killing","task":"kafka011-consumer-monitor","type":"Killing"}

The allocation was in a pending state for 2 days; the issue was resolved only after restarting the Nomad client service.

Please note that the same issue was observed in another cluster running Nomad v1.7.3; there, all the pending allocations self-healed, which was not the case with the Nomad clients running v1.9.3.

Nomad job file:

variable "app_image_id" {
  type = string
}
variable "version" {
  type = string
}
variable "namespace" {
  type    = string
  default = "messaging-development"
}
variable "app_name" {
  default = "kafka011-consumer-monitor"
  type    = string
}
variable "vault_root_path" {
  default = "messaging-development-kv"
  type    = string
}
variable "site" {
  type = string
}
variable "extra_java_opts" {
  type = string
}
variable "property_path" {
  type = string
}
variable "cpu" {
  type = string
}
variable "memory" {
  type = string
}

job "kafka011-consumer-monitor" {
  namespace = var.namespace
  node_pool = var.namespace

  meta = {
    version             = "${var.version}"
    ig_datacenter_flip  = "non-core"
  }

  group "kafka011-consumer-monitor" {
    count = 2

    update {
      max_parallel     = 1
      min_healthy_time = "30s"
      auto_revert      = true
    }

    network {
      port "http" {
        to = 8080
      }
    }

    vault {
      policies = ["pol-messaging-development-ldap"]
    }

    task "kafka011-consumer-monitor" {
      driver = "docker"

      config {
        image = var.app_image_id
        ports = ["http"]
      }

      resources {
        cpu    = var.cpu
        memory = var.memory
      }

      restart {
        attempts = 3
        interval = "10m"
        delay    = "30s"
        mode     = "fail"
      }

      service {
        name     = "${var.namespace}-${var.app_name}"
        port     = "http"
        provider = "consul"
        tags = [
          "http",
          "haproxy.router.name=data-and-spectrum",
          "haproxy.service.name=${var.app_name}",
          "haproxy.service.domain=data-platforms"
        ]

        meta  {
          team = "Messaging Development"
        }

        check {
          name      = "Check version"
          type      = "http"
          port      = "http"
          path      = "/${var.app_name}/monitor/version"
          interval  = "15s"
          timeout   = "2s"
        }
      }
    }
  }
}

sijo13 commented Jan 28, 2025

I was able to reproduce the problem as explained below.

Nomad version: v1.9.3
Docker version: Docker version 26.1.2, build 211e74b

Deployed the job below, which uses an image over 7 GB in size.

job "nginx" {
  namespace = "platforms"
  node_pool = "platforms"
  group "nginx-group1" {
    count = 1
    network {
      port "http" {
        to = "80"
      }
    }
    task "nginx-task" {
     driver = "docker"
     config {
        image = "download-artifacts.inaut.io/ig-docker-dtu/pytorch/pytorch"
        ports = ["http"]

    }

      service {
        name     = "platforms-nginx-service"
        port     = "http"
        provider = "consul"

        check {
          type     = "http"
          port     = "http"
          path     = "/"
          interval = "2s"
          timeout  = "2s"
        }
      }
    }
  }
}

Deployed the job and, after some time, updated /etc/hosts on the server to simulate the failure by pointing the Artifactory DNS name at a wrong IP address.

The allocation status is stuck in the pending state. Please refer to the screenshot and logs below.

[Screenshot: allocation stuck in the pending state]

Nomad logs:

k":"nginx-task","timestamp":"2025-01-28T06:31:55.708Z"}
{"@level":"info","@message":"plugin process exited","@module":"client.alloc_runner.task_runner.task_hook.logmon","@timestamp":"2025-01-28T06:31:55.711982Z","alloc_id":"e44e8109-7ad6-5918-ae0d-21204a3d15c6","id":"1088651","plugin":"/usr/bin/nomad","task":"nginx-task"}
{"@level":"info","@message":"Task event","@module":"client.alloc_runner.task_runner","@timestamp":"2025-01-28T06:32:41.859024Z","alloc_id":"a3c54419-959a-5c5c-55bb-fb69c2ff6c06","failed":false,"msg":"Task received by client","task":"nginx-task","type":"Received"}
{"@level":"info","@message":"Task event","@module":"client.alloc_runner.task_runner","@timestamp":"2025-01-28T06:32:41.864046Z","alloc_id":"a3c54419-959a-5c5c-55bb-fb69c2ff6c06","failed":false,"msg":"Building Task Directory","task":"nginx-task","type":"Task Setup"}
{"@level":"info","@message":"Task event","@module":"client.alloc_runner.task_runner","@timestamp":"2025-01-28T06:32:41.920077Z","alloc_id":"a3c54419-959a-5c5c-55bb-fb69c2ff6c06","failed":false,"msg":"Downloading image","task":"nginx-task","type":"Driver"}
{"@level":"info","@message":"Task event","@module":"client.alloc_runner.task_runner","@timestamp":"2025-01-28T06:34:41.920418Z","alloc_id":"a3c54419-959a-5c5c-55bb-fb69c2ff6c06","failed":false,"msg":"Docker image pull progress: Pulled 4/5 (35.94MiB/35.94MiB) layers: 0 waiting/1 pulling","task":"nginx-task","type":"Driver"}
{"@level":"info","@message":"Task event","@module":"client.alloc_runner.task_runner","@timestamp":"2025-01-28T06:36:41.920647Z","alloc_id":"a3c54419-959a-5c5c-55bb-fb69c2ff6c06","failed":false,"msg":"Docker image pull progress: Pulled 4/5 (35.94MiB/35.94MiB) layers: 0 waiting/1 pulling","task":"nginx-task","type":"Driver"}
{"@level":"error","@message":"error reading image pull progress","@module":"client.driver_mgr.docker","@timestamp":"2025-01-28T06:36:48.327447Z","driver":"docker","error":"net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)"}
{"@level":"info","@message":"Task event","@module":"client.alloc_runner.task_runner","@timestamp":"2025-01-28T06:37:41.863186Z","alloc_id":"a3c54419-959a-5c5c-55bb-fb69c2ff6c06","failed":false,"msg":"Task not running by healthy_deadline of 5m0s","task":"nginx-task","type":"Alloc Unhealthy"}

Docker logs:

Jan 28 06:33:58 vrtstnia354.iggroup.local dockerd[1401]: time="2025-01-28T06:33:58.311041092Z" level=info msg="Download failed, retrying (1/5): net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)"
Jan 28 06:34:33 vrtstnia354.iggroup.local dockerd[1401]: time="2025-01-28T06:34:33.315329727Z" level=info msg="Download failed, retrying (2/5): net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)"
Jan 28 06:35:13 vrtstnia354.iggroup.local dockerd[1401]: time="2025-01-28T06:35:13.317304789Z" level=info msg="Download failed, retrying (3/5): net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)"
Jan 28 06:35:58 vrtstnia354.iggroup.local dockerd[1401]: time="2025-01-28T06:35:58.319391213Z" level=info msg="Download failed, retrying (4/5): net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)"
Jan 28 06:36:48 vrtstnia354.iggroup.local dockerd[1401]: time="2025-01-28T06:36:48.323672082Z" level=error msg="Download failed after 5 attempts: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)"
Jan 28 06:36:48 vrtstnia354.iggroup.local dockerd[1401]: time="2025-01-28T06:36:48.324387492Z" level=info msg="Attempting next endpoint for pull after error: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)" spanID=622e5e13d20b6c54 traceID=cf3af4f17a83cf41af27e58cead79893
Jan 28 06:36:48 vrtstnia354.iggroup.local dockerd[1401]: time="2025-01-28T06:36:48.367041194Z" level=info msg="Layer sha256:87d9bdf07ad1a0012391974d139b3a7b709cd16dfaeca768631120e0cc408609 cleaned up"
Jan 28 06:36:48 vrtstnia354.iggroup.local dockerd[1401]: time="2025-01-28T06:36:48.436061532Z" level=info msg="Layer sha256:d101c9453715a978a2a520f553588e77dfb4236762175eba61c5c264a449c75d cleaned up"


gulducat commented Jan 28, 2025

Heya @sijo13, thanks for another report!

I am able to reproduce this one by inducing an error in the image pull code. We did a big refactor of the Docker driver in Nomad 1.9.0, and it looks like this error mode slipped through the cracks, specifically: "error reading image pull progress". We're catching and logging the error, but not also reporting it as a recoverable error as we should.

I think we can patch this one up to go out in Nomad 1.9.6. Thanks for letting us know!
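
For anyone curious what that looks like in practice, here is a minimal, self-contained Go sketch. It is not Nomad's actual driver code: readPullProgress, pullImageBroken, pullImageFixed, and the done channel are hypothetical names for illustration. It shows why catch-and-log-only leaves the caller waiting forever, while also reporting the error lets the task fail and be retried.

package main

import (
    "errors"
    "fmt"
    "log"
)

// readPullProgress stands in for the loop that decodes the Docker image pull
// progress stream; it fails immediately here to simulate the "unknown blob"
// error from the logs above. (Hypothetical name, not a Nomad function.)
func readPullProgress() error {
    return errors.New("unknown blob")
}

// pullImageBroken mirrors the reported failure mode: the error is logged but
// never sent back, so a caller waiting on done blocks forever and the
// allocation looks stuck in "pending".
func pullImageBroken(done chan<- error) {
    if err := readPullProgress(); err != nil {
        log.Printf("error reading image pull progress: %v", err)
        return // caller never hears about the failure
    }
    done <- nil
}

// pullImageFixed logs the error and also reports it, so the caller can treat
// the failed pull as a task failure and apply the restart policy.
func pullImageFixed(done chan<- error) {
    err := readPullProgress()
    if err != nil {
        log.Printf("error reading image pull progress: %v", err)
    }
    done <- err // always report the outcome, error or nil
}

func main() {
    done := make(chan error, 1)
    go pullImageFixed(done) // swap in pullImageBroken to reproduce the hang
    if err := <-done; err != nil {
        fmt.Println("pull failed, the task can now fail and restart:", err)
    }
}

In the real driver the reported error would be surfaced as a recoverable error, per the comment above, so the allocation fails and can be retried instead of staying pending.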

@gulducat gulducat self-assigned this Jan 28, 2025
@gulducat gulducat added the stage/accepted and theme/driver/docker labels Jan 28, 2025
@gulducat gulducat moved this from Needs Triage to In Progress in Nomad - Community Issues Triage Jan 28, 2025
@gulducat gulducat changed the title Nomad allocation stuck in pending state Nomad allocation stuck in pending state after docker "error reading image pull progress" Jan 28, 2025
@gulducat gulducat linked a pull request Jan 30, 2025 that will close this issue