Checkpoint upon AWS spot instance termination notice #29

Open
garkenyon opened this issue Jan 24, 2016 · 2 comments

Comments

@garkenyon
Contributor

http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/spot-interruptions.html#spot-instance-termination-notices

The link above states that AWS provides a two-minute warning before terminating a spot instance. Can we use this warning the same way we use

$: killall -SIGUSR1 <name_of_executable>

to write a final checkpoint before termination? In fact, we almost don't even have to formally checkpoint with the above mechanism.

@peteschultz
Contributor

In HyPerCol::advanceTime(), at the same point where we call sigpending to check for SIGUSR1, we could run the curl statement (or maybe it would be wget), and if there is a termination warning we could set checkpointSignal to 2 (sending SIGUSR1 sets checkpointSignal to 1). We'd want to make sure we don't fetch the URL more often than the Amazon-recommended 5 seconds, but it should be pretty straightforward to add.
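A minimal sketch of that decision logic, not the actual HyPerCol code. The names `isTerminationNotice` and `pollCheckpointSignal` are hypothetical, and the HTTP fetch is passed in as a callback so the sketch does not depend on any particular HTTP client. It assumes, per the AWS page linked above, that the `spot/termination-time` metadata item holds a UTC timestamp once the instance is marked for termination, and otherwise is missing (HTTP 404) or holds a non-time value.

```cpp
#include <functional>
#include <regex>
#include <string>

// Values taken from the comment above: 1 = SIGUSR1, 2 = spot termination notice.
enum CheckpointSignal { kNoSignal = 0, kUserSignal = 1, kSpotTermination = 2 };

// Hypothetical helper: decides whether a metadata response is a termination
// notice. Assumes the body is a UTC timestamp such as 2015-01-05T18:02:00Z
// when the instance is marked for termination.
inline bool isTerminationNotice(int httpStatus, const std::string &body) {
    if (httpStatus != 200) {
        return false; // e.g. 404: the item is not present
    }
    static const std::regex timestamp(
        R"(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z\s*)");
    return std::regex_match(body, timestamp);
}

// Hypothetical helper: the check that would sit next to the sigpending call.
// Both conditions are injected as callbacks; the user signal takes priority.
inline int pollCheckpointSignal(
    const std::function<bool()> &userSignalPending,
    const std::function<bool()> &terminationNoticePosted) {
    if (userSignalPending()) {
        return kUserSignal;
    }
    if (terminationNoticePosted()) {
        return kSpotTermination;
    }
    return kNoSignal;
}
```

Checking the user signal first means the comparatively slow HTTP request is skipped whenever SIGUSR1 is already pending.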

Alternatively, we could have PV_Init launch a simple script that runs the curl statement every 5 seconds and sends SIGUSR1 to the PetaVision process when necessary. One caveat is that we might want to be able to see in the log file whether the job ended because Amazon killed the instance or because the user ran killall -SIGUSR1.
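For the log-file concern, here is a sketch of how distinct checkpointSignal values could be turned into distinct log lines. The function name `checkpointReason` and the message wording are hypothetical; only the values 1 and 2 come from the discussion above.

```cpp
#include <string>

// Hypothetical helper: maps a checkpointSignal value to a log message so the
// log shows why the final checkpoint was written.
inline std::string checkpointReason(int checkpointSignal) {
    switch (checkpointSignal) {
        case 1: return "Checkpointing: received SIGUSR1";
        case 2: return "Checkpointing: AWS spot termination notice detected";
        default: return "No checkpoint signal";
    }
}
```

With the external-script approach both causes arrive as SIGUSR1, so this distinction would only be available if the check runs inside the PetaVision process.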

@garkenyon
Contributor Author

The first approach seems easier to implement. Maybe we could keep track of the time of the last wget/curl AWS termination check to make sure we don't check too often. 2 minutes is a long time. Just ask Peyton Manning! Since we would at most be checking as often as we check for SIGUSR1, there's no reason to check the termination condition more often than that.
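Keeping track of the last check could look roughly like the sketch below. The class name `PollThrottle` is hypothetical, and the 5-second interval in the usage note is the figure mentioned earlier in this thread. The current time is a parameter so the logic can be exercised without sleeping.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper: allows a poll only if at least minInterval has elapsed
// since the last poll that was allowed.
class PollThrottle {
  public:
    using Clock = std::chrono::steady_clock;

    explicit PollThrottle(std::chrono::seconds minInterval)
        : mMinInterval(minInterval) {
        assert(minInterval.count() >= 0);
    }

    // Returns true and records the time if a poll is allowed now.
    // Rejected calls do not reset the timer.
    bool ready(Clock::time_point now = Clock::now()) {
        if (mHasPolled && now - mLastPoll < mMinInterval) {
            return false;
        }
        mLastPoll  = now;
        mHasPolled = true;
        return true;
    }

  private:
    std::chrono::seconds mMinInterval;
    Clock::time_point mLastPoll{};
    bool mHasPolled = false;
};
```

In advanceTime() this would be used as something like `if (throttle.ready()) { /* fetch the termination URL */ }` with a `PollThrottle throttle(std::chrono::seconds(5))` member, so the URL is fetched at most once per interval no matter how short a timestep is.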
