Run instance reaper #21

Open
bbangert opened this issue Jan 29, 2015 · 16 comments

@bbangert
Member

AWSPool's reaper method currently doesn't properly terminate instances that have been idle for more than an hour. It should determine whether an instance hasn't been used for an hour and terminate it if it has been idle too long.

The reap method should also run automatically every minute or so, so it will need to be scheduled into the event loop on startup.
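
For illustration, here is a minimal sketch of what such a scheduled reaper could look like on an asyncio event loop; the pool API used here (`pool.instances()`, `instance.last_used`, `pool.terminate()`) is hypothetical, standing in for whatever AWSPool actually exposes:

```python
import asyncio
import time

IDLE_LIMIT = 60 * 60   # terminate instances idle for more than an hour
REAP_INTERVAL = 60     # run the check roughly every minute

async def reap_idle_instances(pool):
    """Periodically terminate pool instances that have sat idle too long."""
    while True:
        now = time.time()
        for instance in pool.instances():
            if now - instance.last_used > IDLE_LIMIT:
                await pool.terminate(instance)
        await asyncio.sleep(REAP_INTERVAL)

# On broker startup, something along the lines of:
#   asyncio.get_event_loop().create_task(reap_idle_instances(aws_pool))
```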

@rpappalax rpappalax added the bug label Jun 2, 2016
@rpappalax rpappalax modified the milestone: p1 Oct 3, 2016
@rpappalax rpappalax added the p1 label Oct 3, 2016
@tarekziade tarekziade self-assigned this Oct 10, 2016
@tarekziade
Contributor

How do we want to determine that an instance has been idling for too long? It sounds like we want to detect an instance that's no longer really producing any test results.

One option would be to parse the logs produced by the tests to see if the box is still producing something. Another, more empirical option is to connect to the box and see if something's happening on it.

I am not 100% sure which is the best option. I'd lean toward the second.

@rpappalax do you know what the usual behavior is when a slave box gets stuck?

@tarekziade
Contributor

I think I have a better idea: we could have the instance self-terminate via a script that runs on it.

see http://stackoverflow.com/questions/10541363/self-terminating-aws-ec2-instance

The script would periodically check whether the docker container is busy; if not, it would start a countdown before terminating the instance.
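
A rough sketch of that idea (not the eventual awskiller code), assuming the instance was launched with its shutdown behavior set to "terminate" as in the linked Stack Overflow answer, so a plain shutdown is enough to reap the box:

```python
import subprocess
import time

IDLE_LIMIT = 60 * 60    # self-terminate after an hour of idleness
CHECK_EVERY = 5 * 60    # poll the docker daemon every five minutes

idle_for = 0
while True:
    # `docker ps -q` prints the ids of running containers; empty output
    # means nothing is running on the box.
    running = subprocess.check_output(['docker', 'ps', '-q']).strip()
    if running:
        idle_for = 0
    else:
        idle_for += CHECK_EVERY
        if idle_for >= IDLE_LIMIT:
            subprocess.call(['shutdown', '-h', 'now'])
            break
    time.sleep(CHECK_EVERY)
```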

@rpappalax

@tarekziade Yea, I think this is a great idea. I wonder if we could also automatically spin up a fresh node if one is lost due to self-termination.

@rpappalax

Also, in regard to the attack nodes idling too long, I'm not sure how/if these are related, but I know that @pjenvey is also investigating:
1. Issue #44: Loads-broker stalls with the message "No instances running, collection done"
2. A connection timeout with the following message:

[2016-10-04 00:24:09,993][19217] Got exception: HTTPConnectionPool(host='54.162.133.99', port=2375): Max retries exceeded with url: /v1.21/containers/json?trunc_cmd=0&size=0&all=0&limit=-1 (Caused by ConnectTimeoutError(<requests.packages.urllib3.connection.HTTPConnection object at 0x7f10e17b9e80>, 'Connection to 54.162.133.99 timed out. (connect timeout=5)'))

@tarekziade
Contributor

> I wonder if we could also automatically spin up a fresh node if one is lost due to self-termination.

If the broker is doing its job correctly, it should do that once it detects that the node was terminated. I don't know whether that happens live during a test, though. I doubt it.

@tarekziade
Contributor

Here's the approach: https://github.com/loads/awskiller/blob/master/awskiller/killer.py

That could be deployed on the CoreOS box alongside docker to monitor the docker process. If the docker process has been idling for an hour (given a CPU threshold), the script can trigger termination.

Another possibility is to have the script send the docker mean CPU usage per hour to statsd; that would let the broker get the info and decide whether to reap the instance or not.
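
A sketch of that second option, assuming psutil and a statsd client are available on the node; the statsd host and metric names are placeholders:

```python
import time

import psutil                   # assumption: installed on the node
from statsd import StatsClient  # assumption: the `statsd` package

statsd = StatsClient('statsd.example.internal', 8125, prefix='loads.node')

def docker_cpu_percent():
    """Sum the CPU usage of every docker-related process on the box."""
    total = 0.0
    for proc in psutil.process_iter(['name']):
        try:
            if proc.info['name'] and 'docker' in proc.info['name']:
                total += proc.cpu_percent(interval=None)
        except psutil.NoSuchProcess:
            continue
    return total

samples = []
while True:
    samples.append(docker_cpu_percent())
    samples = samples[-60:]  # rolling one-hour window of one-minute samples
    statsd.gauge('docker.cpu.mean_1h', sum(samples) / len(samples))
    time.sleep(60)
```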

I think this loosely coupled approach avoids any extra coordination between the nodes and the broker to reap idling instances.

@bbangert
Member Author

I really like this approach. It should probably check once every 5 minutes, and if the instance goes a whole hour without being used, it should terminate itself.

The only problem I see is that it means the loads-broker will have to refresh node data more frequently from AWS to ensure it doesn't try to use a node that has removed itself from the pool.

@tarekziade
Contributor

> the loads-broker will have to refresh node data more frequently from AWS

I guess we can look at the hot spots. In theory, checking on node health should be doable asynchronously in the event loop, and even with hundreds of nodes I suspect that should be possible to optimize.

I propose we try that.

@bbangert
Member Author

@pjenvey points out that a node's alive status can already be queried directly via docker, so we shouldn't worry about that.
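
For reference, that check can go through the same Docker remote API the broker already talks to over port 2375 (see the connection error above), e.g. with the Python docker SDK; the node address below is a placeholder:

```python
import docker  # assumption: the `docker` Python SDK is available

# Ask the node's docker daemon for its running containers; an empty list
# means the node is alive but idle, a connection error means it's gone.
client = docker.DockerClient(base_url='tcp://<node-ip>:2375', timeout=5)
busy = bool(client.containers.list())  # lists running containers only
```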

@tarekziade
Contributor

tarekziade commented Oct 12, 2016

I've finished a first version of "awskiller", a small command-line tool that will run a system command when the process it watches goes idle:

https://github.com/loads/awskiller

$ .tox/py35/bin/awskiller --help
usage: awskiller [-h] [--pid PID] [--name NAME] [--killer KILLER]
                 [--threshold THRESHOLD] [--duration DURATION]
                 [--interval INTERVAL] [--verbose]

AWS Killer

optional arguments:
  -h, --help            show this help message and exit
  --pid PID             Process PID
  --name NAME           Process Name
  --killer KILLER       Command to run
  --threshold THRESHOLD
                        CPU Threshold
  --duration DURATION   Watch Duration
  --interval INTERVAL   Watch Interval
  --verbose             Display more info

$ .tox/py35/bin/awskiller --name firefox --verbose
Watching firefox 13%  ^C

Will now see how to hook it into CoreOS so it runs alongside docker and watches it.

@tarekziade
Contributor

So. CoreOS does not have Python installed or a package manager for installing it, which means that I need to create a binary distribution for this awskiller project in order to make it a "unit" we can add into cloud-config.

That's going to be a huge binary dist just to ship a small 100-line script. It's not worth it.

I will re-write it in Rust and create a single binary file we can use as a CoreOS unit in loads-broker.

See https://github.com/loads/awskiller/issues/1

@bbangert
Member Author

Right, the CoreOS package manager is docker; everything is supposed to run in containers. I'd suggest having this run in a container as well: you can run containers in a special mode that gives them access to query the docker daemon on the main host. That should let it determine whether it's active or not.
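
For reference, one common way to do that is to bind-mount the host's docker socket into the container when starting it; the image name here is just a placeholder:

$ docker run -v /var/run/docker.sock:/var/run/docker.sock watcher-image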

@tarekziade
Contributor

You're absolutely right. After poking at things, I have a working prototype:

https://gist.github.com/tarekziade/649f9ea7bb514ec0cf88369c8c5c4c48

Will change the script accordingly

@tarekziade
Contributor

The docker image is ready and located at https://s3.amazonaws.com/loads-docker-images/loadswatch.tar.bz2

source code: https://github.com/loads/loadswatch

It will terminate the box if there are no containers running for one hour.
I will extend the broker so it uses this image when running a CoreOS box.

https://github.com/loads/loads-broker/blob/master/loadsbroker/broker.py#L73
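
For the CoreOS side, the watcher could be wired in with a systemd unit in the generated cloud-config along these lines; this is only a sketch, and the unit name, image tag, and load step are illustrative rather than the actual broker code:

```yaml
#cloud-config
coreos:
  units:
    - name: loadswatch.service
      command: start
      content: |
        [Unit]
        Description=loadswatch idle watcher
        After=docker.service
        Requires=docker.service

        [Service]
        # Load the published image, then run it with access to the host docker socket.
        ExecStartPre=/bin/sh -c "curl -sL https://s3.amazonaws.com/loads-docker-images/loadswatch.tar.bz2 | docker load"
        ExecStart=/usr/bin/docker run --name loadswatch \
          -v /var/run/docker.sock:/var/run/docker.sock loadswatch
        Restart=always
```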

tarekziade added a commit that referenced this issue Oct 19, 2016
adding the watcher extension - related to #21
@tarekziade
Contributor

The Watcher is now running in each CoreOS node.

@rpappalax rpappalax reopened this Mar 9, 2017
@rpappalax

@tarekziade @pjenvey I'm not sure this is working as expected. I noticed several folks left a broker running for a day or more. Long after the tests finished, the watcher should have terminated the attack nodes, I believe, but that doesn't seem to be happening.
