This repository has been archived by the owner on Sep 26, 2022. It is now read-only.

Loads-broker stalls with message: No instances running, collection done #44

Open
rpappalax opened this issue Jun 2, 2016 · 24 comments

Comments

@rpappalax

rpappalax commented Jun 2, 2016

When starting a new test run with loads-broker, the attack cluster often spins up, but then loads-broker returns the message:

No instances running, collection done.

as if it can no longer see the attack cluster.

This also occasionally happens in the middle of a run.

@rpappalax rpappalax added the bug label Jun 2, 2016
@rpappalax
Author

File "/Users/rpappalardo/git/loads-broker4/lib/python3.5/site-packages/docker_py-1.6.0-py3.5.egg/docker/client.py", line 110, in _get
return self.get(url, **self._set_request_timeout(kwargs))
File "/Users/rpappalardo/git/loads-broker4/lib/python3.5/site-packages/requests-2.10.0-py3.5.egg/requests/sessions.py", line 487, in get
return self.request('GET', url, **kwargs)
File "/Users/rpappalardo/git/loads-broker4/lib/python3.5/site-packages/requests-2.10.0-py3.5.egg/requests/sessions.py", line 475, in request
resp = self.send(prep, **send_kwargs)
File "/Users/rpappalardo/git/loads-broker4/lib/python3.5/site-packages/requests-2.10.0-py3.5.egg/requests/sessions.py", line 585, in send
r = adapter.send(request, **kwargs)
File "/Users/rpappalardo/git/loads-broker4/lib/python3.5/site-packages/requests-2.10.0-py3.5.egg/requests/adapters.py", line 479, in send
raise ReadTimeout(e, request=request)
requests.exceptions.ReadTimeout: HTTPConnectionPool(host='54.152.105.72', port=2375): Read timed out. (read timeout=5)
[2016-06-02 14:32:12,010][6035] No instances running, collection done.
[2016-06-02 14:32:19,210][6035] Returning collections
[2016-06-02 14:32:23,902][6035] Got response of: True

@rpappalax
Author

rpappalax commented Jun 21, 2016

This log message usually appears after 10 mins of waiting for the stack to spin up.

@rpappalax
Author

Since @pjenvey fixed ap-loadtester, I don't see this issue as much anymore. It now seems to happen mostly when there are too many attack nodes attached to one broker. I assume all the E.T.-phone-home activity causes a timeout, so the broker thinks there are no instances running. It would be useful if the default timeout could be raised, parameterized, and documented.
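
For illustration, a minimal sketch of raising the Docker client timeout, assuming docker-py 1.x (which the earlier traceback shows in use); the DOCKER_TIMEOUT name is a placeholder, not an actual loads-broker setting:

# Hypothetical sketch: raise the docker-py client timeout above the 5s default
from docker import Client

DOCKER_TIMEOUT = 30  # seconds; the traceback above shows the default of 5

client = Client(base_url="tcp://54.152.105.72:2375",  # attack node's Docker daemon
                version="1.21",
                timeout=DOCKER_TIMEOUT)

# A slow-but-alive daemon on a busy attack node now has more time to answer
containers = client.containers(all=False)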

@rpappalax
Author

rpappalax commented Oct 3, 2016

happens approx. 1/3 of the time?

@rpappalax rpappalax modified the milestone: p1 Oct 3, 2016
@rpappalax rpappalax added p1 and removed p1 labels Oct 3, 2016
@rpappalax
Author




[2016-09-29 21:14:56,149][6525] Creating directory /home/core/heka...
[2016-09-29 21:14:56,150][6525] Creating directory /home/core/heka...
[2016-09-29 21:14:56,202][6525] Creating directory /home/core/heka...
[2016-09-29 21:14:56,202][6525] Creating directory /home/core/heka...
[2016-09-29 21:14:56,418][6525] Creating directory /home/core/heka...
[2016-09-29 21:14:56,427][6525] Launching Heka...
[2016-09-29 21:14:57,464][6525] Starting step: 04a103e2-e6f3-4d96-9831-ae1b4d55bee2
^C^Cubuntu@ip-172-31-11-105:~/git/loads-broker$ ^C
ubuntu@ip-172-31-11-105:~/git/loads-broker$ vim loads.json
ubuntu@ip-172-31-11-105:~/git/loads-broker$ ./run-broker.sh
[2016-10-04 00:28:40,965][15509] Pulling CoreOS AMI info...
[2016-10-04 00:28:50,359][15509] Verifying database setup.
[2016-10-04 00:28:50,421][15509] Project ID: 65e9cdd9-841d-4f0c-81b0-3246f2d2ce09
[2016-10-04 00:28:50,433][15509] Added plan: Loadtest, UUID: 5db25162-b9e1-44e7-b434-c1596de30100
[2016-10-04 00:28:50,433][15509] Finished database setup.
[2016-10-04 00:28:50,433][15509] Listening on port 8080...
[2016-10-04 00:28:51,576][15509] Found 1 instances to recover.
[2016-10-04 00:28:51,576][15509] Finished initializing: None.
[2016-10-04 00:30:56,591][15509] Msg: Running strategy: 5db25162-b9e1-44e7-b434-c1596de30100, ThreadID: 140541631178560
[2016-10-04 00:30:56,608][15509] Msg: Committed new session., ThreadID: 140541631178560
[2016-10-04 00:30:56,609][15509] Getting steps
[2016-10-04 00:30:57,540][15509] Allocated instances: [Instance:i-3723e021]
[2016-10-04 00:30:57,690][15509] Waiting for running instances.
[2016-10-04 00:30:57,690][15509] Check pending
[2016-10-04 00:31:13,302][15509] Removing 0 dead instances that wouldn't run.
[2016-10-04 00:31:13,303][15509] Waiting for docker
[2016-10-04 00:31:18,312][15509] Got exception: HTTPConnectionPool(host='54.226.12.195', port=2375): Max retries exceeded with url: /v1.21/containers/json?limit=-1&size=0&trunc_cmd=0&all=0 (Caused by ConnectTimeoutError(<requests.packages.urllib3.connection.HTTPConnection object at 0x7fd25c1b05f8>, 'Connection to 54.226.12.195 timed out. (connect timeout=5)'))
[2016-10-04 00:31:28,325][15509] Got exception: HTTPConnectionPool(host='54.226.12.195', port=2375): Max retries exceeded with url: /v1.21/containers/json?limit=-1&size=0&trunc_cmd=0&all=0 (Caused by ConnectTimeoutError(<requests.packages.urllib3.connection.HTTPConnection object at 0x7fd25c1aabe0>, 'Connection to 54.226.12.195 timed out. (connect timeout=5)'))
[2016-10-04 00:31:33,333][15509] Got exception: HTTPConnectionPool(host='54.226.12.195', port=2375): Max retries exceeded with url: /v1.21/containers/json?limit=-1&size=0&trunc_cmd=0&all=0 (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x7fd25c1a7470>: Failed to establish a new connection: [Errno 111] Connection refused',))
[2016-10-04 00:31:43,346][15509] Got exception: HTTPConnectionPool(host='54.226.12.195', port=2375): Max retries exceeded with url: /v1.21/containers/json?limit=-1&size=0&trunc_cmd=0&all=0 (Caused by ConnectTimeoutError(<requests.packages.urllib3.connection.HTTPConnection object at 0x7fd25c1bb0f0>, 'Connection to 54.226.12.195 timed out. (connect timeout=5)'))
[2016-10-04 00:31:53,358][15509] Got exception: HTTPConnectionPool(host='54.226.12.195', port=2375): Max retries exceeded with url: /v1.21/containers/json?limit=-1&size=0&trunc_cmd=0&all=0 (Caused by ConnectTimeoutError(<requests.packages.urllib3.connection.HTTPConnection object at 0x7fd25c1bb6d8>, 'Connection to 54.226.12.195 timed out. (connect timeout=5)'))
[2016-10-04 00:32:03,372][15509] Got exception: HTTPConnectionPool(host='54.226.12.195', port=2375): Max retries exceeded with url: /v1.21/containers/json?limit=-1&size=0&trunc_cmd=0&all=0 (Caused by ConnectTimeoutError(<requests.packages.urllib3.connection.HTTPConnection object at 0x7fd25c1bb390>, 'Connection to 54.226.12.195 timed out. (connect timeout=5)'))
[2016-10-04 00:32:13,385][15509] Got exception: HTTPConnectionPool(host='54.226.12.195', port=2375): Max retries exceeded with url: /v1.21/containers/json?limit=-1&size=0&trunc_cmd=0&all=0 (Caused by ConnectTimeoutError(<requests.packages.urllib3.connection.HTTPConnection object at 0x7fd25c1bb1d0>, 'Connection to 54.226.12.195 timed out. (connect timeout=5)'))
[2016-10-04 00:32:23,398][15509] Got exception: HTTPConnectionPool(host='54.226.12.195', port=2375): Max retries exceeded with url: /v1.21/containers/json?limit=-1&size=0&trunc_cmd=0&all=0 (Caused by ConnectTimeoutError(<requests.packages.urllib3.connection.HTTPConnection object at 0x7fd25c1aabe0>, 'Connection to 54.226.12.195 timed out. (connect timeout=5)'))
[2016-10-04 00:32:33,411][15509] Got exception: HTTPConnectionPool(host='54.226.12.195', port=2375): Read timed out. (read timeout=5)
[2016-10-04 00:32:38,439][15509] Pruning 0 non-responding instances.
[2016-10-04 00:32:38,439][15509] Pulling base containers: heka
[2016-10-04 00:32:38,439][15509] Pulling base container images
[2016-10-04 00:35:14,855][15509] Pulling containers for this step.
[2016-10-04 00:35:14,855][15509] Pulling step images
[2016-10-04 00:36:51,448][15509] Msg: Now running., ThreadID: 140541631178560
[2016-10-04 00:36:51,968][15509] Creating directory /home/core/heka...
[2016-10-04 00:36:51,972][15509] Launching Heka...
[2016-10-04 00:36:53,232][15509] Starting step: 0bfc4023-ca94-4d8e-a478-a7a95fa1c0bf
[2016-10-04 00:36:59,954][15509] No instances running, collection done.
[2016-10-04 00:37:01,756][15509] Returning collections
[2016-10-04 00:37:01,862][15509] Got response of: True
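
The "Waiting for docker" phase in the log above is effectively a poll of each node's Docker daemon until it responds. A rough sketch of that pattern (not the actual loads-broker code; docker-py 1.x assumed, constants are hypothetical):

import time
from docker import Client
from requests.exceptions import ConnectionError, ReadTimeout

def wait_for_docker(host, max_wait=600, retry_interval=10, timeout=5):
    # Poll the daemon until it answers or we give up and prune the instance
    client = Client(base_url="tcp://%s:2375" % host, timeout=timeout)
    deadline = time.time() + max_wait
    while time.time() < deadline:
        try:
            client.containers(all=False)  # any successful call means the daemon is up
            return True
        except (ConnectionError, ReadTimeout) as exc:
            print("Got exception: %s" % exc)
            time.sleep(retry_interval)
    return False  # caller would treat this instance as non-responding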

@rpappalax
Author

rpappalax commented Oct 4, 2016

@pjenvey OK, I ran a long-running test last night, and after a few hours I got the same error below (normally I only see this when trying to initiate a new attack cluster, not in the middle of a test).

NOTE:
Where it says "Starting step..." is usually when all the attack nodes are fully initialized and the attack begins. Seeing "No instances running..." after that suggests to me that loads-broker at some point can no longer see the attack cluster and is giving up on it.

[2016-10-04 00:52:53,394][16566] Pulling base containers: heka
[2016-10-04 00:52:53,394][16566] Pulling base container images
[2016-10-04 00:53:13,651][16566] Pulling containers for this step.
[2016-10-04 00:53:13,651][16566] Pulling step images
[2016-10-04 00:53:14,004][16566] Msg: Now running., ThreadID: 140450120337216
[2016-10-04 00:53:14,337][16566] Launching Heka...
[2016-10-04 00:53:14,886][16566] Starting step: 9658688f-aef2-4de6-a0fd-4f1f2d6e3f0f
[2016-10-04 00:53:20,020][16566] No instances running, collection done.
[2016-10-04 00:53:21,855][16566] Returning collections
[2016-10-04 00:53:22,461][16566] Got response of: True

@pjenvey
Member

pjenvey commented Oct 4, 2016

Here's a full log of the previous error happening after hours of running:

no_instances_fail_after_hours.txt

@rpappalax rpappalax changed the title Exception with run_container: HTTPConnectionPool: Read timed out. Loads-broker stalls with message: No instances running, collection done Oct 4, 2016
@pjenvey pjenvey self-assigned this Oct 17, 2016
pjenvey added a commit that referenced this issue Oct 20, 2016
dumps a verbose docker ps like output to help debug
intermittent/unexpected "No instances running"

issue #44
@rpappalax
Author

@pjenvey Running loads-broker w/ the latest PRs (including the updated debug logging).
Hopefully this is useful:
https://gist.github.com/rpappalax/9aee79e6bbdae6aa6f1b46306d7e3def#file-loads-broker-log-2016-10-24-L655-L684

@rpappalax
Author

rpappalax commented Oct 24, 2016

TEST RUN 2:
Shows the new verbose debugging on collections returned:
https://gist.github.com/rpappalax/466e2c0fdd72f7b857e7e823e4ffb261

Looks like the run generated some traffic for a bit, though I hopped on a few attack nodes and saw no docker activity.
NOTE: this may have been because my connection count was too low and activity stopped.

core@ip-172-31-22-23 ~ $ docker ps
CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES
core@ip-172-31-22-23 ~ $ docker images
REPOSITORY             TAG                 IMAGE ID            CREATED             VIRTUAL SIZE
loadswatch             latest              f1b0468c37ee        4 days ago          731.7 MB
<none>                 <none>              767584930cea        12 weeks ago        169.4 MB
kitcambridge/heka      0.8.1               9587b048f324        22 months ago       326.2 MB
kitcambridge/dnsmasq   latest              13638311c471        24 months ago       97.12 MB

push_test_1

@rpappalax
Author

rpappalax commented Oct 24, 2016

TEST RUNS 3, 4, 5:
Looks like we stall on these runs with a tornado.access 404 before we get out of the gate (the attack cluster never launches):
https://gist.github.com/rpappalax/310dedd415c04ba19e2dbc3b1397f2e2

@rpappalax
Author

rpappalax commented Oct 24, 2016

@tarekziade @pjenvey
In loads.json, I'm specifying us-east-1 as the region for my attack cluster. When the attack cluster successfully launches, I do see the correct number of attack nodes, and in us-east-1 only.

Is there a reason why the loads-broker log says "Working"/"populated" for 3 other regions (us-west-1, us-west-2, eu-west-1) as well?

[2016-10-24 23:06:12,385][22821] Working in eu-west-1
[2016-10-24 23:06:12,388][22821] Working in us-east-1
[2016-10-24 23:06:12,388][22821] Working in us-west-1
[2016-10-24 23:06:12,388][22821] Working in us-west-2
[2016-10-24 23:06:13,802][22821] us-east-1 populated
[2016-10-24 23:06:13,862][22821] us-west-1 populated
[2016-10-24 23:06:14,103][22821] us-west-2 populated
[2016-10-24 23:06:14,598][22821] eu-west-1 populated

@pjenvey
Member

pjenvey commented Oct 25, 2016

It's populating AMI information for all regions. It's actually always done this, it's just more verbose about it now.

AFAIK it's pointless to load it for the other regions in this case, but #55 will improve how that's being done anyway

@pjenvey
Member

pjenvey commented Oct 25, 2016

I'm wondering if the issue w/ test run #3 is due to the attempt to recover 10 instances?

@rpappalax
Author

It's possible. I have made a habit of killing all the instances manually on each run. Perhaps the broker assumes those resources would never be terminated that way (or would always be left behind for re-use)?

@tarekziade
Contributor

@rpappalax says: It's possible. I have made a habit of killing all the instances manually on each run.

By the way, you can now do this with loads and have their status updated in the broker:

$ loads terminate_all

@rpappalax
Author

@tarekziade $ loads terminate_all works like a charm

@rpappalax
Author

rpappalax commented Oct 27, 2016

@pjenvey Is the new debugging output helpful in debugging this? Let me know if I can do some further runs here.

@rpappalax
Author

Here's what a working run looks like now with the new changes:
https://gist.github.com/rpappalax/2d798271bd9c09ce1c1056105e28b4bb

@rpappalax
Author

@pjenvey: OK, interesting. When @chartjes was missing a container on Docker Hub (for a Kinto load test), loads-broker threw the same error. Could it be that Docker Hub is occasionally slow to respond and loads-broker times out?
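
If that's the case, a rough docker-py sketch of what a guarded pull might look like (not the broker's actual code; the host name is a placeholder, the image is the one from the docker images output above):

from docker import Client
from docker.errors import APIError
from requests.exceptions import ReadTimeout

client = Client(base_url="tcp://attack-node:2375", timeout=60)
try:
    # stream=False blocks until the pull finishes or the HTTP read times out
    output = client.pull("kitcambridge/heka", tag="0.8.1", stream=False)
    print(output)  # registry-side errors (e.g. a missing image) show up here
except (APIError, ReadTimeout) as exc:
    print("Pull failed or Docker Hub was slow to respond: %s" % exc)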

@rpappalax
Author

We made sure the Docker image is now in place on Docker Hub and even tried pulling down a fresh image and running it locally (it works OK). However, when we run it with loads-broker, we get this returning-collections error.
kinto-loadtest-with-broker.txt

@pjenvey
Member

pjenvey commented Nov 1, 2016

What I can tell from the extra debug output is that ailoads died immediately and returned an error code 127.

If you can get into these instances, you can run docker logs on the container and hopefully see some output.

Possibly the ailoads command isn't configured correctly? It attempted to run:

/bin/sh -c 'venv/bin/ailoads -v -d $KINTO_DURATION -u $KINTO_NB_USERS'
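
For reference, a docker-py equivalent of checking the exit code and recent logs of the dead container (host is a placeholder; assumes the exited container is still present on the node):

from docker import Client

client = Client(base_url="tcp://attack-node:2375", timeout=30)
for container in client.containers(all=True):  # include exited containers
    info = client.inspect_container(container["Id"])
    print(container["Image"], info["State"]["ExitCode"])  # 127 = command not found
    print(client.logs(container["Id"], tail=50))  # last lines of container output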

@rpappalax
Author

@chartjes Those vars need to match what's being passed in from loads.json. I believe we changed them to $TEST_DURATION and $CONNECTIONS, and perhaps also in the Kinto Dockerfile. It could be that the image didn't get properly updated?

@rpappalax
Author

@pjenvey Installed the latest PR and was able to repro. Log here:
https://gist.github.com/rpappalax/cb9225ba57902fb2faaf9c7856e7dd5c

@rpappalax
Author

@pjenvey: I was able to trigger the error when trying to run a large connection test:
https://s3.amazonaws.com/fx-test-loadtest/run-broker-20161122013208.out
