This repository has been archived by the owner on Sep 26, 2022. It is now read-only.

Loads-broker stalls with message: No instances running, collection done #44

Open
rpappalax opened this issue Jun 2, 2016 · 24 comments

Comments

@rpappalax

rpappalax commented Jun 2, 2016

When starting a new test run with loads-broker, the attack cluster often spins up, but then loads-broker returns the message:

No instances running, collection done.

as if it can no longer see the attack cluster.

This also occasionally happens in the middle of a run.

@rpappalax rpappalax added the bug label Jun 2, 2016
@rpappalax
Author

File "/Users/rpappalardo/git/loads-broker4/lib/python3.5/site-packages/docker_py-1.6.0-py3.5.egg/docker/client.py", line 110, in _get
return self.get(url, **self._set_request_timeout(kwargs))
File "/Users/rpappalardo/git/loads-broker4/lib/python3.5/site-packages/requests-2.10.0-py3.5.egg/requests/sessions.py", line 487, in get
return self.request('GET', url, **kwargs)
File "/Users/rpappalardo/git/loads-broker4/lib/python3.5/site-packages/requests-2.10.0-py3.5.egg/requests/sessions.py", line 475, in request
resp = self.send(prep, **send_kwargs)
File "/Users/rpappalardo/git/loads-broker4/lib/python3.5/site-packages/requests-2.10.0-py3.5.egg/requests/sessions.py", line 585, in send
r = adapter.send(request, **kwargs)
File "/Users/rpappalardo/git/loads-broker4/lib/python3.5/site-packages/requests-2.10.0-py3.5.egg/requests/adapters.py", line 479, in send
raise ReadTimeout(e, request=request)
requests.exceptions.ReadTimeout: HTTPConnectionPool(host='54.152.105.72', port=2375): Read timed out. (read timeout=5)
[2016-06-02 14:32:12,010][6035] No instances running, collection done.
[2016-06-02 14:32:19,210][6035] Returning collections
[2016-06-02 14:32:23,902][6035] Got response of: True

@rpappalax
Author

rpappalax commented Jun 21, 2016

This log message usually appears after 10 mins of waiting for the stack to spin up.

@rpappalax
Author

Since @pjenvey fixed ap-loadtester, I don't see this issue as much anymore. It now seems to happen mostly when there are too many attack nodes attached to one broker. I assume all the E.T.-phone-home activity causes a timeout, so the broker thinks there are no instances running. It would be useful if the default timeout could be raised, parameterized, and documented.
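
For illustration, a minimal sketch of raising the Docker client timeout, assuming docker-py 1.x (which the earlier traceback shows in use); the DOCKER_TIMEOUT name is a placeholder, not an actual loads-broker setting:

# Hypothetical sketch: raise the docker-py client timeout above the 5s default
from docker import Client

DOCKER_TIMEOUT = 30  # seconds; the traceback above shows the default of 5

client = Client(base_url="tcp://54.152.105.72:2375",  # attack node's Docker daemon
                version="1.21",
                timeout=DOCKER_TIMEOUT)

# A slow-but-alive daemon on a busy attack node now has more time to answer
containers = client.containers(all=False)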

@rpappalax
Author

rpappalax commented Oct 3, 2016

happens approx. 1/3 of the time?

@rpappalax rpappalax modified the milestone: p1 Oct 3, 2016
@rpappalax rpappalax added p1 and removed p1 labels Oct 3, 2016
@rpappalax
Author




[2016-09-29 21:14:56,149][6525] Creating directory /home/core/heka...
[2016-09-29 21:14:56,150][6525] Creating directory /home/core/heka...
[2016-09-29 21:14:56,202][6525] Creating directory /home/core/heka...
[2016-09-29 21:14:56,202][6525] Creating directory /home/core/heka...
[2016-09-29 21:14:56,418][6525] Creating directory /home/core/heka...
[2016-09-29 21:14:56,427][6525] Launching Heka...
[2016-09-29 21:14:57,464][6525] Starting step: 04a103e2-e6f3-4d96-9831-ae1b4d55bee2
^C^Cubuntu@ip-172-31-11-105:~/git/loads-broker$ ^C
ubuntu@ip-172-31-11-105:~/git/loads-broker$ vim loads.json
ubuntu@ip-172-31-11-105:~/git/loads-broker$ ./run-broker.sh
[2016-10-04 00:28:40,965][15509] Pulling CoreOS AMI info...
[2016-10-04 00:28:50,359][15509] Verifying database setup.
[2016-10-04 00:28:50,421][15509] Project ID: 65e9cdd9-841d-4f0c-81b0-3246f2d2ce09
[2016-10-04 00:28:50,433][15509] Added plan: Loadtest, UUID: 5db25162-b9e1-44e7-b434-c1596de30100
[2016-10-04 00:28:50,433][15509] Finished database setup.
[2016-10-04 00:28:50,433][15509] Listening on port 8080...
[2016-10-04 00:28:51,576][15509] Found 1 instances to recover.
[2016-10-04 00:28:51,576][15509] Finished initializing: None.
[2016-10-04 00:30:56,591][15509] Msg: Running strategy: 5db25162-b9e1-44e7-b434-c1596de30100, ThreadID: 140541631178560
[2016-10-04 00:30:56,608][15509] Msg: Committed new session., ThreadID: 140541631178560
[2016-10-04 00:30:56,609][15509] Getting steps
[2016-10-04 00:30:57,540][15509] Allocated instances: [Instance:i-3723e021]
[2016-10-04 00:30:57,690][15509] Waiting for running instances.
[2016-10-04 00:30:57,690][15509] Check pending
[2016-10-04 00:31:13,302][15509] Removing 0 dead instances that wouldn't run.
[2016-10-04 00:31:13,303][15509] Waiting for docker
[2016-10-04 00:31:18,312][15509] Got exception: HTTPConnectionPool(host='54.226.12.195', port=2375): Max retries exceeded with url: /v1.21/containers/json?limit=-1&size=0&trunc_cmd=0&all=0 (Caused by ConnectTimeoutError(<requests.packages.urllib3.connection.HTTPConnection object at 0x7fd25c1b05f8>, 'Connection to 54.226.12.195 timed out. (connect timeout=5)'))
[2016-10-04 00:31:28,325][15509] Got exception: HTTPConnectionPool(host='54.226.12.195', port=2375): Max retries exceeded with url: /v1.21/containers/json?limit=-1&size=0&trunc_cmd=0&all=0 (Caused by ConnectTimeoutError(<requests.packages.urllib3.connection.HTTPConnection object at 0x7fd25c1aabe0>, 'Connection to 54.226.12.195 timed out. (connect timeout=5)'))
[2016-10-04 00:31:33,333][15509] Got exception: HTTPConnectionPool(host='54.226.12.195', port=2375): Max retries exceeded with url: /v1.21/containers/json?limit=-1&size=0&trunc_cmd=0&all=0 (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x7fd25c1a7470>: Failed to establish a new connection: [Errno 111] Connection refused',))
[2016-10-04 00:31:43,346][15509] Got exception: HTTPConnectionPool(host='54.226.12.195', port=2375): Max retries exceeded with url: /v1.21/containers/json?limit=-1&size=0&trunc_cmd=0&all=0 (Caused by ConnectTimeoutError(<requests.packages.urllib3.connection.HTTPConnection object at 0x7fd25c1bb0f0>, 'Connection to 54.226.12.195 timed out. (connect timeout=5)'))
[2016-10-04 00:31:53,358][15509] Got exception: HTTPConnectionPool(host='54.226.12.195', port=2375): Max retries exceeded with url: /v1.21/containers/json?limit=-1&size=0&trunc_cmd=0&all=0 (Caused by ConnectTimeoutError(<requests.packages.urllib3.connection.HTTPConnection object at 0x7fd25c1bb6d8>, 'Connection to 54.226.12.195 timed out. (connect timeout=5)'))
[2016-10-04 00:32:03,372][15509] Got exception: HTTPConnectionPool(host='54.226.12.195', port=2375): Max retries exceeded with url: /v1.21/containers/json?limit=-1&size=0&trunc_cmd=0&all=0 (Caused by ConnectTimeoutError(<requests.packages.urllib3.connection.HTTPConnection object at 0x7fd25c1bb390>, 'Connection to 54.226.12.195 timed out. (connect timeout=5)'))
[2016-10-04 00:32:13,385][15509] Got exception: HTTPConnectionPool(host='54.226.12.195', port=2375): Max retries exceeded with url: /v1.21/containers/json?limit=-1&size=0&trunc_cmd=0&all=0 (Caused by ConnectTimeoutError(<requests.packages.urllib3.connection.HTTPConnection object at 0x7fd25c1bb1d0>, 'Connection to 54.226.12.195 timed out. (connect timeout=5)'))
[2016-10-04 00:32:23,398][15509] Got exception: HTTPConnectionPool(host='54.226.12.195', port=2375): Max retries exceeded with url: /v1.21/containers/json?limit=-1&size=0&trunc_cmd=0&all=0 (Caused by ConnectTimeoutError(<requests.packages.urllib3.connection.HTTPConnection object at 0x7fd25c1aabe0>, 'Connection to 54.226.12.195 timed out. (connect timeout=5)'))
[2016-10-04 00:32:33,411][15509] Got exception: HTTPConnectionPool(host='54.226.12.195', port=2375): Read timed out. (read timeout=5)
[2016-10-04 00:32:38,439][15509] Pruning 0 non-responding instances.
[2016-10-04 00:32:38,439][15509] Pulling base containers: heka
[2016-10-04 00:32:38,439][15509] Pulling base container images
[2016-10-04 00:35:14,855][15509] Pulling containers for this step.
[2016-10-04 00:35:14,855][15509] Pulling step images
[2016-10-04 00:36:51,448][15509] Msg: Now running., ThreadID: 140541631178560
[2016-10-04 00:36:51,968][15509] Creating directory /home/core/heka...
[2016-10-04 00:36:51,972][15509] Launching Heka...
[2016-10-04 00:36:53,232][15509] Starting step: 0bfc4023-ca94-4d8e-a478-a7a95fa1c0bf
[2016-10-04 00:36:59,954][15509] No instances running, collection done.
[2016-10-04 00:37:01,756][15509] Returning collections
[2016-10-04 00:37:01,862][15509] Got response of: True
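
The "Waiting for docker" phase in the log above is effectively a poll of each node's Docker daemon until it responds. A rough sketch of that pattern (not the actual loads-broker code; docker-py 1.x assumed, constants are hypothetical):

import time
from docker import Client
from requests.exceptions import ConnectionError, ReadTimeout

def wait_for_docker(host, max_wait=600, retry_interval=10, timeout=5):
    # Poll the daemon until it answers or we give up and prune the instance
    client = Client(base_url="tcp://%s:2375" % host, timeout=timeout)
    deadline = time.time() + max_wait
    while time.time() < deadline:
        try:
            client.containers(all=False)  # any successful call means the daemon is up
            return True
        except (ConnectionError, ReadTimeout) as exc:
            print("Got exception: %s" % exc)
            time.sleep(retry_interval)
    return False  # caller would treat this instance as non-responding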

@rpappalax
Author

rpappalax commented Oct 4, 2016

@pjenvey OK, I ran a long-running test last night, and after a few hours I got the same error below (normally I only see this when trying to initiate a new attack cluster, not in the middle of a test).

NOTE:
Where it says "Starting step..." is usually when all the attack nodes are fully initialized and the attack begins. Seeing "No instances running..." after that suggests to me that loads-broker at some point can no longer see the attack cluster and is giving up on it.

[2016-10-04 00:52:53,394][16566] Pulling base containers: heka
[2016-10-04 00:52:53,394][16566] Pulling base container images
[2016-10-04 00:53:13,651][16566] Pulling containers for this step.
[2016-10-04 00:53:13,651][16566] Pulling step images
[2016-10-04 00:53:14,004][16566] Msg: Now running., ThreadID: 140450120337216
[2016-10-04 00:53:14,337][16566] Launching Heka...
[2016-10-04 00:53:14,886][16566] Starting step: 9658688f-aef2-4de6-a0fd-4f1f2d6e3f0f
[2016-10-04 00:53:20,020][16566] No instances running, collection done.
[2016-10-04 00:53:21,855][16566] Returning collections
[2016-10-04 00:53:22,461][16566] Got response of: True

@pjenvey
Member

pjenvey commented Oct 4, 2016

Here's a full log of the previous error happening after hours of running:

no_instances_fail_after_hours.txt

@rpappalax rpappalax changed the title Exception with run_container: HTTPConnectionPool: Read timed out. Loads-broker stalls with message: No instances running, collection done Oct 4, 2016
@pjenvey pjenvey self-assigned this Oct 17, 2016
pjenvey added a commit that referenced this issue Oct 20, 2016
dumps a verbose docker ps like output to help debug
intermittent/unexpected "No instances running"

issue #44
@rpappalax
Author

@pjenvey Running loads-broker w/ the latest PRs (including the updated debug logging).
Hopefully this is useful:
https://gist.github.com/rpappalax/9aee79e6bbdae6aa6f1b46306d7e3def#file-loads-broker-log-2016-10-24-L655-L684

@rpappalax
Author

rpappalax commented Oct 24, 2016

TEST RUN 2:
Shows the new verbose debugging on collections returned:
https://gist.github.com/rpappalax/466e2c0fdd72f7b857e7e823e4ffb261

Looks like the run generated some traffic for a bit, though I hopped on a few attack nodes and saw no docker activity.
NOTE: this may have been because my connection count was too low and activity stopped.

core@ip-172-31-22-23 ~ $ docker ps
CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES
core@ip-172-31-22-23 ~ $ docker images
REPOSITORY             TAG                 IMAGE ID            CREATED             VIRTUAL SIZE
loadswatch             latest              f1b0468c37ee        4 days ago          731.7 MB
<none>                 <none>              767584930cea        12 weeks ago        169.4 MB
kitcambridge/heka      0.8.1               9587b048f324        22 months ago       326.2 MB
kitcambridge/dnsmasq   latest              13638311c471        24 months ago       97.12 MB

push_test_1

@rpappalax
Author

rpappalax commented Oct 24, 2016

TEST RUNS 3, 4, 5:
Looks like we stall on these runs with a tornado.access 404 before we get out of the gate (the attack cluster never launches):
https://gist.github.com/rpappalax/310dedd415c04ba19e2dbc3b1397f2e2

@rpappalax
Author

rpappalax commented Oct 24, 2016

@tarekziade @pjenvey
In loads.json, I'm specifying us-east-1 as the region for my attack cluster. When the attack cluster successfully launches, I do see the correct number of attack nodes, and in us-east-1 only.

Is there a reason why the loads-broker log says "Working"/"populated" for 3 other regions (us-west-1, us-west-2, eu-west-1) as well?

[2016-10-24 23:06:12,385][22821] Working in eu-west-1
[2016-10-24 23:06:12,388][22821] Working in us-east-1
[2016-10-24 23:06:12,388][22821] Working in us-west-1
[2016-10-24 23:06:12,388][22821] Working in us-west-2
[2016-10-24 23:06:13,802][22821] us-east-1 populated
[2016-10-24 23:06:13,862][22821] us-west-1 populated
[2016-10-24 23:06:14,103][22821] us-west-2 populated
[2016-10-24 23:06:14,598][22821] eu-west-1 populated

@pjenvey
Member

pjenvey commented Oct 25, 2016

It's populating AMI information for all regions. It's actually always done this, it's just more verbose about it now.

AFAIK it's pointless to load it for the other regions in this case, but #55 will improve how that's being done anyway

@pjenvey
Member

pjenvey commented Oct 25, 2016

I'm wondering if the issue w/ test run #3 is due to the attempt to recover 10 instances?

@rpappalax
Author

It's possible. I have made a habit of killing all the instances manually on each run. Perhaps the broker assumes those resources would never be terminated that way (or would always be left behind for re-use)?

@tarekziade
Contributor

@rpappalax says: It's possible. I have made a habit of killing all the instances manually on each run.

By the way, you can now do this with loads and have their status updated in the broker:

$ loads terminate_all

@rpappalax
Author

@tarekziade $ loads terminate_all works like a charm

@rpappalax
Author

rpappalax commented Oct 27, 2016

@pjenvey Is the new debugging output helpful in debugging this? Let me know if I can do some further runs here.

@rpappalax
Author

Here's what a working run looks like now with the new changes:
https://gist.github.com/rpappalax/2d798271bd9c09ce1c1056105e28b4bb

@rpappalax
Author

@pjenvey: OK, interesting. When @chartjes was missing a container on Docker Hub (for a Kinto load test), loads-broker threw the same error. Could it be that Docker Hub is occasionally slow to respond and loads-broker times out?
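
If that's the case, a rough docker-py sketch of what a guarded pull might look like (not the broker's actual code; the host name is a placeholder, the image is the one from the docker images output above):

from docker import Client
from docker.errors import APIError
from requests.exceptions import ReadTimeout

client = Client(base_url="tcp://attack-node:2375", timeout=60)
try:
    # stream=False blocks until the pull finishes or the HTTP read times out
    output = client.pull("kitcambridge/heka", tag="0.8.1", stream=False)
    print(output)  # registry-side errors (e.g. a missing image) show up here
except (APIError, ReadTimeout) as exc:
    print("Pull failed or Docker Hub was slow to respond: %s" % exc)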

@rpappalax
Author

We made sure the Docker image is now in place on Docker Hub and even tried pulling down a fresh image and running it locally (it works OK). However, when we run it with loads-broker, we get this returning-collections error.
kinto-loadtest-with-broker.txt

@pjenvey
Member

pjenvey commented Nov 1, 2016

What I can tell from the extra debug output is that ailoads died immediately and returned an error code 127.

If you can get into these instances, you can run docker logs on the container and hopefully see some output.

Possibly the ailoads command isn't configured correctly? It attempted to run:

/bin/sh -c 'venv/bin/ailoads -v -d $KINTO_DURATION -u $KINTO_NB_USERS'
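
For reference, a docker-py equivalent of checking the exit code and recent logs of the dead container (host is a placeholder; assumes the exited container is still present on the node):

from docker import Client

client = Client(base_url="tcp://attack-node:2375", timeout=30)
for container in client.containers(all=True):  # include exited containers
    info = client.inspect_container(container["Id"])
    print(container["Image"], info["State"]["ExitCode"])  # 127 = command not found
    print(client.logs(container["Id"], tail=50))  # last lines of container output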

@rpappalax
Author

@chartjes Those vars need to match what's being passed in from loads.json. I believe we changed them to $TEST_DURATION and $CONNECTIONS, and perhaps also in the Kinto Dockerfile. It could be that the image didn't get properly updated?

@rpappalax
Author

@pjenvey Installed the latest PR and was able to repro. Log here:
https://gist.github.com/rpappalax/cb9225ba57902fb2faaf9c7856e7dd5c

@rpappalax
Author

@pjenvey: I was able to trigger the error when trying to run a large connection test:
https://s3.amazonaws.com/fx-test-loadtest/run-broker-20161122013208.out
