Run instance reaper #21

Open
bbangert opened this issue Jan 29, 2015 · 16 comments

@bbangert
Member

AWSPool's reaper method currently doesn't properly terminate instances that have been idle for more than an hour. It should determine whether an instance hasn't been used for an hour and terminate it if it has been idle too long.

The reap method should also run automatically every minute or so, so it will need to be scheduled into the event loop on startup.
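
For illustration, here is a minimal sketch of what such a scheduled reaper could look like on an asyncio event loop; the pool API used here (`pool.instances()`, `instance.last_used`, `pool.terminate()`) is hypothetical, standing in for whatever AWSPool actually exposes:

```python
import asyncio
import time

IDLE_LIMIT = 60 * 60   # terminate instances idle for more than an hour
REAP_INTERVAL = 60     # run the check roughly every minute

async def reap_idle_instances(pool):
    """Periodically terminate pool instances that have sat idle too long."""
    while True:
        now = time.time()
        for instance in pool.instances():
            if now - instance.last_used > IDLE_LIMIT:
                await pool.terminate(instance)
        await asyncio.sleep(REAP_INTERVAL)

# On broker startup, something along the lines of:
#   asyncio.get_event_loop().create_task(reap_idle_instances(aws_pool))
```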

@rpappalax rpappalax added the bug label Jun 2, 2016
@rpappalax rpappalax modified the milestone: p1 Oct 3, 2016
@rpappalax rpappalax added the p1 label Oct 3, 2016
@tarekziade tarekziade self-assigned this Oct 10, 2016
@tarekziade
Contributor

How do we want to determine that an instance has been idling for too long? It sounds like we want to detect an instance that's no longer really producing any test results.

One option would be to parse the logs produced by the tests to see if the box is still producing something. Another, more empirical option is to connect to the box and see if something's happening on it.

I am not 100% sure which is the best option. I'd lean toward the second.

@rpappalax do you know what the usual behavior is when a slave box gets stuck?

@tarekziade
Contributor

I think I have a better idea: we could have the instance self-terminate via a script that runs on it.

see http://stackoverflow.com/questions/10541363/self-terminating-aws-ec2-instance

The script would periodically check whether the docker container is busy; if not, it would start a countdown before terminating the instance.
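
A rough sketch of that idea (not the eventual awskiller code), assuming the instance was launched with its shutdown behavior set to "terminate" as in the linked Stack Overflow answer, so a plain shutdown is enough to reap the box:

```python
import subprocess
import time

IDLE_LIMIT = 60 * 60    # self-terminate after an hour of idleness
CHECK_EVERY = 5 * 60    # poll the docker daemon every five minutes

idle_for = 0
while True:
    # `docker ps -q` prints the ids of running containers; empty output
    # means nothing is running on the box.
    running = subprocess.check_output(['docker', 'ps', '-q']).strip()
    if running:
        idle_for = 0
    else:
        idle_for += CHECK_EVERY
        if idle_for >= IDLE_LIMIT:
            subprocess.call(['shutdown', '-h', 'now'])
            break
    time.sleep(CHECK_EVERY)
```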

@rpappalax

@tarekziade Yea, I think this is a great idea. I wonder if we could also automatically spin up a fresh node if one is lost due to self-termination.

@rpappalax

Also, in regard to the attack nodes idling too long, I'm not sure how/if these are related, but I know that @pjenvey is also investigating:
1. Issue #44: Loads-broker stalls with the message "No instances running, collection done"
2. A connection timeout with the following message:

[2016-10-04 00:24:09,993][19217] Got exception: HTTPConnectionPool(host='54.162.133.99', port=2375): Max retries exceeded with url: /v1.21/containers/json?trunc_cmd=0&size=0&all=0&limit=-1 (Caused by ConnectTimeoutError(<requests.packages.urllib3.connection.HTTPConnection object at 0x7f10e17b9e80>, 'Connection to 54.162.133.99 timed out. (connect timeout=5)'))

@tarekziade
Contributor

> I wonder if we could also automatically spin up a fresh node if one is lost due to self-termination.

If the broker is doing its job correctly, it should do that once it detects that the node was terminated. I don't know whether that happens live during a test, though. I doubt it.

@tarekziade
Contributor

Here's the approach: https://github.com/loads/awskiller/blob/master/awskiller/killer.py

That could be deployed on the CoreOS box alongside docker to monitor the docker process. If the docker process has been idling for an hour (given a CPU threshold), the script can trigger termination.

Another possibility is to have the script send the docker mean CPU usage per hour to statsd; that would let the broker get the info and decide whether to reap the instance or not.
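
A sketch of that second option, assuming psutil and a statsd client are available on the node; the statsd host and metric names are placeholders:

```python
import time

import psutil                   # assumption: installed on the node
from statsd import StatsClient  # assumption: the `statsd` package

statsd = StatsClient('statsd.example.internal', 8125, prefix='loads.node')

def docker_cpu_percent():
    """Sum the CPU usage of every docker-related process on the box."""
    total = 0.0
    for proc in psutil.process_iter(['name']):
        try:
            if proc.info['name'] and 'docker' in proc.info['name']:
                total += proc.cpu_percent(interval=None)
        except psutil.NoSuchProcess:
            continue
    return total

samples = []
while True:
    samples.append(docker_cpu_percent())
    samples = samples[-60:]  # rolling one-hour window of one-minute samples
    statsd.gauge('docker.cpu.mean_1h', sum(samples) / len(samples))
    time.sleep(60)
```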

I think this loosely coupled approach avoids any extra coordination between the nodes and the broker to reap idling instances.

@bbangert
Member Author

I really like this approach. It should probably check once every 5 minutes, and if the instance goes a whole hour without being used, it should terminate itself.

The only problem I see is that it means the loads-broker will have to refresh node data more frequently from AWS to ensure it doesn't try to use a node that has removed itself from the pool.

@tarekziade
Contributor

> the loads-broker will have to refresh node data more frequently from AWS

I guess we can look at the hot spots. In theory, checking on node health should be doable asynchronously in the event loop, and even with hundreds of nodes I suspect that should be possible to optimize.

I propose we try that.

@bbangert
Member Author

@pjenvey points out that a node's alive status can already be queried directly via docker, so we shouldn't worry about that.
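
For reference, that check can go through the same Docker remote API the broker already talks to over port 2375 (see the connection error above), e.g. with the Python docker SDK; the node address below is a placeholder:

```python
import docker  # assumption: the `docker` Python SDK is available

# Ask the node's docker daemon for its running containers; an empty list
# means the node is alive but idle, a connection error means it's gone.
client = docker.DockerClient(base_url='tcp://<node-ip>:2375', timeout=5)
busy = bool(client.containers.list())  # lists running containers only
```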

@tarekziade
Contributor

tarekziade commented Oct 12, 2016

I've finished a first version of "awskiller", a small command-line tool that will run a system command when the process it watches goes idle:

https://github.com/loads/awskiller

$ .tox/py35/bin/awskiller --help
usage: awskiller [-h] [--pid PID] [--name NAME] [--killer KILLER]
                 [--threshold THRESHOLD] [--duration DURATION]
                 [--interval INTERVAL] [--verbose]

AWS Killer

optional arguments:
  -h, --help            show this help message and exit
  --pid PID             Process PID
  --name NAME           Process Name
  --killer KILLER       Command to run
  --threshold THRESHOLD
                        CPU Threshold
  --duration DURATION   Watch Duration
  --interval INTERVAL   Watch Interval
  --verbose             Display more info

$ .tox/py35/bin/awskiller --name firefox --verbose
Watching firefox 13%  ^C

Will now see how to hook it into CoreOS so it runs alongside docker and watches it.

@tarekziade
Contributor

So. CoreOS does not have Python installed or a package manager for installing it, which means that I need to create a binary distribution for this awskiller project in order to make it a "unit" we can add into cloud-config.

That's going to be a huge binary dist just to ship a small 100-line script. It's not worth it.

I will re-write it in Rust and create a single binary file we can use as a CoreOS unit in loads-broker.

See https://github.com/loads/awskiller/issues/1

@bbangert
Member Author

Right, the CoreOS package manager is docker; everything is supposed to run in containers. I'd suggest having this run in a container as well: you can run containers in a special mode that gives them access to query the docker daemon on the main host. That should let it determine whether it's active or not.
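
For reference, one common way to do that is to bind-mount the host's docker socket into the container when starting it; the image name here is just a placeholder:

$ docker run -v /var/run/docker.sock:/var/run/docker.sock watcher-image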

@tarekziade
Contributor

You're absolutely right. After poking at things, I have a working prototype:

https://gist.github.com/tarekziade/649f9ea7bb514ec0cf88369c8c5c4c48

Will change the script accordingly

@tarekziade
Contributor

The docker image is ready and located at https://s3.amazonaws.com/loads-docker-images/loadswatch.tar.bz2

source code: https://github.com/loads/loadswatch

It will terminate the box if there are no containers running for one hour.
I will extend the broker so it uses this image when running a CoreOS box.

https://github.com/loads/loads-broker/blob/master/loadsbroker/broker.py#L73
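
For the CoreOS side, the watcher could be wired in with a systemd unit in the generated cloud-config along these lines; this is only a sketch, and the unit name, image tag, and load step are illustrative rather than the actual broker code:

```yaml
#cloud-config
coreos:
  units:
    - name: loadswatch.service
      command: start
      content: |
        [Unit]
        Description=loadswatch idle watcher
        After=docker.service
        Requires=docker.service

        [Service]
        # Load the published image, then run it with access to the host docker socket.
        ExecStartPre=/bin/sh -c "curl -sL https://s3.amazonaws.com/loads-docker-images/loadswatch.tar.bz2 | docker load"
        ExecStart=/usr/bin/docker run --name loadswatch \
          -v /var/run/docker.sock:/var/run/docker.sock loadswatch
        Restart=always
```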

tarekziade added a commit that referenced this issue Oct 19, 2016
adding the watcher extension - related to #21
@tarekziade
Contributor

The Watcher is now running in each CoreOS node.

@rpappalax rpappalax reopened this Mar 9, 2017
@rpappalax

@tarekziade @pjenvey I'm not sure this is working as expected. I noticed several folks left a broker running for a day or more. Long after the tests finished, the watcher should have terminated the attack nodes, I believe, but that doesn't seem to be happening.
