Run instance reaper #21
Comments
How do we want to determine that an instance has been idling for too long? It sounds like we want to detect an instance that's no longer producing any test results. One option would be to parse the logs produced by the tests, to see if the box is still producing something. Another, empirical option is to connect to the box and see if something's happening in it. I am not 100% sure which is the best option; I lean toward option 2. @rpappalax, do you know what the usual behavior is when a slave box gets stuck?
I think I have a better idea: we could self-terminate the instance with a script that runs in it (see http://stackoverflow.com/questions/10541363/self-terminating-aws-ec2-instance). The script would periodically check if the docker container is busy and, if not, start a countdown before terminating the instance.
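A minimal sketch of that countdown loop, assuming `boto3`, credentials that allow the instance to terminate itself, and the EC2 metadata service; `is_docker_busy()` is a hypothetical placeholder for whatever busy-check we settle on:

```python
# Sketch only: self-terminating watcher. is_docker_busy() is a hypothetical
# helper standing in for the real check (CPU usage, running containers, ...).
import time
import urllib.request
import boto3

IDLE_LIMIT = 3600   # terminate after an hour of inactivity
CHECK_EVERY = 300   # check once every 5 minutes

def instance_id():
    # The EC2 metadata service tells an instance its own ID.
    url = "http://169.254.169.254/latest/meta-data/instance-id"
    return urllib.request.urlopen(url).read().decode()

def is_docker_busy():
    raise NotImplementedError

idle_since = None
while True:
    if is_docker_busy():
        idle_since = None
    elif idle_since is None:
        idle_since = time.monotonic()
    elif time.monotonic() - idle_since > IDLE_LIMIT:
        boto3.client("ec2").terminate_instances(InstanceIds=[instance_id()])
        break
    time.sleep(CHECK_EVERY)
```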
@tarekziade Yea, I think this is a great idea. I wonder if we could also automatically spin up a fresh node if one is lost due to self-termination.
Also, regarding the attack nodes idling too long, I'm not sure how or if these are related, but I know that @pjenvey is also investigating.
If the broker is doing its job correctly, it should handle that once it detects the node was terminated. I don't know if that happens live during a test, though; I doubt it.
Here's the approach: https://github.com/loads/awskiller/blob/master/awskiller/killer.py That could be deployed on CoreOS alongside docker, to monitor the docker process. Another possibility is to have the script send docker's mean CPU usage per hour to statsd; that would let the broker get the info and decide whether to reap the instance or not. I think this loose approach avoids any extra coordination between the nodes and the broker to reap idling instances.
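A minimal sketch of the statsd variant, assuming the `psutil` and `statsd` Python packages; the statsd host and metric name are made up:

```python
# Sketch only: periodically sample docker's CPU usage and push it to statsd
# as a gauge, leaving the reap decision to the broker.
import time
import psutil
import statsd

client = statsd.StatsClient("statsd.example.com", 8125)

def docker_proc():
    for proc in psutil.process_iter(["name"]):
        if proc.info["name"] == "docker":
            return proc
    return None

while True:
    proc = docker_proc()
    # cpu_percent(interval=...) blocks while it samples the process.
    usage = proc.cpu_percent(interval=60) if proc else 0.0
    client.gauge("loads.node.docker_cpu", usage)
    time.sleep(240)  # roughly one sample every 5 minutes
```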
I really like this approach. It should probably check once every 5 minutes, and if it goes a whole hour without being used, terminate itself. The only problem I see is that the loads-broker will have to refresh node data from AWS more frequently, to ensure it doesn't try to use a node that has removed itself from the pool.
I guess we can look at the hot spots. In theory, checking on node health should be doable asynchronously in the event loop, and even with hundreds of nodes I suspect it can be kept efficient. I propose we try that.
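A minimal sketch of concurrent health checks in the event loop; `check_node()` is a hypothetical coroutine standing in for the real per-node probe:

```python
# Sketch only: probe many nodes concurrently so a sweep over hundreds of
# nodes takes roughly one probe's latency, not hundreds of them.
import asyncio

async def check_node(node):
    # Hypothetical probe: docker query, EC2 describe call, etc.
    raise NotImplementedError

async def check_all(nodes):
    results = await asyncio.gather(
        *(check_node(n) for n in nodes), return_exceptions=True
    )
    return dict(zip(nodes, results))
```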
@pjenvey points out that a node's alive status can already be queried directly via docker, so we shouldn't worry about that.
I've finished a first version of "awskiller", a small command that watches a process and runs a system command when its CPU usage stays below a threshold: https://github.com/loads/awskiller

```
$ .tox/py35/bin/awskiller --help
usage: awskiller [-h] [--pid PID] [--name NAME] [--killer KILLER]
                 [--threshold THRESHOLD] [--duration DURATION]
                 [--interval INTERVAL] [--verbose]

AWS Killer

optional arguments:
  -h, --help            show this help message and exit
  --pid PID             Process PID
  --name NAME           Process Name
  --killer KILLER       Command to run
  --threshold THRESHOLD
                        CPU Threshold
  --duration DURATION   Watch Duration
  --interval INTERVAL   Watch Interval
  --verbose             Display more info

$ .tox/py35/bin/awskiller --name firefox --verbose
Watching firefox 13% ^C
```

Will now see how to hook it into CoreOS so it runs alongside docker and watches it.
So. CoreOS does not have Python installed, nor a package manager for installing it, which means I would need to create a binary distribution of the awskiller project in order to make it a "unit" we can add into cloud-config. That's a huge binary dist just to ship a small, 100-line script; it's not worth it. I will rewrite it in Rust and create a single binary we can use as a CoreOS unit in loads-broker.
Right, the CoreOS package manager is docker; everything is supposed to run in containers. I'd suggest having this run in a container as well. You can run containers in a special mode that gives them access to query the docker daemon on the main host, which should let the watcher determine whether the box is active or not.
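A minimal sketch of that query, assuming the watcher container is started with the host's docker socket mounted (e.g. `-v /var/run/docker.sock:/var/run/docker.sock`) and uses the `docker` Python SDK:

```python
# Sketch only: query the host docker from inside a container via the
# mounted unix socket, and treat any running container as activity.
import docker

client = docker.from_env()  # picks up the mounted /var/run/docker.sock

def host_is_busy():
    return len(client.containers.list()) > 0
```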
You're absolutely right. After poking at things, I have a working prototype: https://gist.github.com/tarekziade/649f9ea7bb514ec0cf88369c8c5c4c48 Will change the script accordingly.
The docker image is ready and located at https://s3.amazonaws.com/loads-docker-images/loadswatch.tar.bz2 (source code: https://github.com/loads/loadswatch). It will terminate the box if no containers have been running for one hour. https://github.com/loads/loads-broker/blob/master/loadsbroker/broker.py#L73
adding the watcher extension - related to #21
The Watcher is now running on each CoreOS node.
@tarekziade @pjenvey I'm not sure this is working as expected. I noticed several folks left a broker running for a day or more. Long after the test finished, the watcher should have terminated the attack nodes, I believe, but that doesn't seem to be happening.
AWSPool's reaper method currently doesn't properly terminate instances that have been idle for more than an hour. It should determine whether an instance hasn't been used for an hour, and terminate it if it's been idle too long.
The reap method should also run automatically every minute or so, so it will need to be scheduled into the event loop on startup; see the sketch below.
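A minimal sketch of that scheduling with asyncio; `pool.instances`, `instance.last_used`, and `pool.terminate()` are hypothetical names standing in for whatever AWSPool actually exposes:

```python
# Sketch only: reap idle instances, rescheduled every minute on the event loop.
import asyncio
import time

IDLE_LIMIT = 3600  # an hour of inactivity

async def reap_loop(pool):
    while True:
        now = time.time()
        for instance in pool.instances:                 # hypothetical
            if now - instance.last_used > IDLE_LIMIT:   # hypothetical
                await pool.terminate(instance)          # hypothetical
        await asyncio.sleep(60)

# On broker startup:
#     asyncio.ensure_future(reap_loop(pool))
```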