Instead of terminating the EC2 instances, it might be better to shut them down.
For single-instance experiments (where the playbook does not need to run for the next job to start), one idea is to append a task to the end of the task-spooler (tsp) queue that shuts down the machine.
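A minimal sketch of appending such a task, assuming the experiment's jobs were queued with `tsp` on the instance, that the queue uses the default single slot (so jobs run one after another), and that the user has passwordless `sudo`. The function name `enqueue_shutdown` is hypothetical; the command runner is injectable so the logic can be checked without touching a real machine.

```python
import subprocess
from typing import Callable, List, Sequence


def enqueue_shutdown(
    shutdown_cmd: Sequence[str] = ("sudo", "shutdown", "-h", "now"),
    run: Callable[..., subprocess.CompletedProcess] = subprocess.run,
) -> List[str]:
    """Append a shutdown job to the end of the task-spooler queue (hypothetical helper).

    With a single tsp slot, this job starts only after every job queued
    before it has finished.
    """
    # No `-d` flag: the shutdown should run even if the previous job failed.
    cmd = ["tsp", *shutdown_cmd]
    run(cmd, check=True)
    return cmd
```

Omitting `-d` is a design choice: a failed experiment job should not leave the instance running indefinitely.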
When creating an EC2 instance, we can set `instance_initiated_shutdown_behavior` to `stop`. Then, running `shutdown -h now` as root stops the instance instead of terminating it. Later, we would have to make sure that the machine can be started again to fetch the results.
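A sketch of the same idea with boto3, where the parameter is called `InstanceInitiatedShutdownBehavior`. It assumes an EBS-backed AMI (instance-store data does not survive a stop) and a caller-supplied AMI ID and instance type; the helper names are hypothetical. The client is passed in so the functions can be used with `boto3.client("ec2")` or a test double.

```python
from typing import Any


def launch_stop_on_shutdown(ec2: Any, image_id: str, instance_type: str) -> str:
    """Launch one instance that stops (instead of terminating) on an OS-level shutdown."""
    response = ec2.run_instances(
        ImageId=image_id,
        InstanceType=instance_type,
        MinCount=1,
        MaxCount=1,
        # 'stop' keeps the EBS root volume, so results written there survive.
        InstanceInitiatedShutdownBehavior="stop",
    )
    return response["Instances"][0]["InstanceId"]


def restart_for_results(ec2: Any, instance_id: str) -> None:
    """Start a stopped instance again and block until EC2 reports it as running."""
    ec2.start_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])
```

Usage would be along the lines of `ec2 = boto3.client("ec2")` followed by `restart_for_results(ec2, instance_id)` before the playbook fetches results. Note that "running" does not yet mean SSH is reachable, so the playbook would still need to wait for the connection.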
One open question is when to insert this task, and how to prevent the machine from shutting down while the playbook is still interacting with the EC2 instance, for example to download results. One option is to delay the shutdown instead of running it immediately, so that a playbook that is still connected can fetch the results and cancel the pending shutdown.
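The simplest variant uses the standard `shutdown` command itself: `shutdown -h +10` schedules a halt in ten minutes and `shutdown -c` cancels a pending one. The sketch below shows a more explicit alternative under a hypothetical convention: the last tsp job runs this function, and the playbook cancels the shutdown by creating a marker file over SSH before downloading results. The marker path, delay and function name are all assumptions; clock, sleep and command runner are injectable for testing.

```python
import subprocess
import time
from pathlib import Path
from typing import Callable, Sequence


def delayed_shutdown(
    cancel_marker: Path,
    delay_s: float,
    poll_s: float = 5.0,
    shutdown_cmd: Sequence[str] = ("shutdown", "-h", "now"),
    run: Callable[..., object] = subprocess.run,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Wait up to delay_s seconds, then shut down unless cancel_marker exists.

    Returns True if the shutdown command was issued, False if it was cancelled.
    """
    deadline = clock() + delay_s
    while clock() < deadline:
        if cancel_marker.exists():
            return False  # a playbook has claimed the machine
        sleep(poll_s)
    # Final check in case the marker appeared during the last sleep.
    if cancel_marker.exists():
        return False
    run(list(shutdown_cmd), check=True)
    return True
```

Once the playbook has fetched the results, it can remove the marker and shut the machine down itself.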
A second open question is whether we can configure or change this behaviour while a suite is running. For example, the tsp task could just trigger a service; the service could be deactivated, and the shutdown would only happen while it is active.
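One way to read the service idea, sketched under assumptions: a hypothetical systemd unit `auto-shutdown.service` (for example a oneshot unit with `RemainAfterExit=yes`) acts as a switch that the playbook can flip at any time with `systemctl start` or `systemctl stop`. The final tsp job checks it with `systemctl is-active --quiet`, which exits with status 0 only when the unit is active. Unit and function names are illustrative.

```python
import subprocess
from typing import Callable

Runner = Callable[..., subprocess.CompletedProcess]


def shutdown_enabled(unit: str = "auto-shutdown.service",
                     run: Runner = subprocess.run) -> bool:
    """True iff the (hypothetical) toggle unit is currently active."""
    # `is-active --quiet` prints nothing; exit code 0 means "active".
    result = run(["systemctl", "is-active", "--quiet", unit], check=False)
    return result.returncode == 0


def maybe_shutdown(run: Runner = subprocess.run) -> bool:
    """Final tsp job: shut down only if the toggle unit is active."""
    if not shutdown_enabled(run=run):
        return False
    run(["shutdown", "-h", "now"], check=True)
    return True
```

Because the check happens only when the last job runs, the decision can be changed at any point while the suite is in progress.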
For multi-instance experiments, the shutdown is more challenging and needs to be controlled by the playbook. After all jobs are done, we could initiate the shutdown on all instances that belong to this experiment.
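A sketch of the playbook-side step for multi-instance experiments, assuming every instance carries a tag that identifies its experiment. The tag key `exp_name` and the helper name are hypothetical; `ec2` is a boto3 EC2 client (or a test double). It stops rather than terminates, so the volumes with results are kept.

```python
from typing import Any, List


def stop_experiment_instances(ec2: Any, exp_name: str,
                              tag_key: str = "exp_name") -> List[str]:
    """Stop all running instances tagged with exp_name and return their IDs."""
    paginator = ec2.get_paginator("describe_instances")
    instance_ids: List[str] = []
    for page in paginator.paginate(
        Filters=[
            {"Name": f"tag:{tag_key}", "Values": [exp_name]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    ):
        for reservation in page["Reservations"]:
            instance_ids.extend(i["InstanceId"] for i in reservation["Instances"])
    if instance_ids:
        # Stop, not terminate: EBS volumes (and the results on them) survive.
        ec2.stop_instances(InstanceIds=instance_ids)
    return instance_ids
```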
Note that the shutdown behaviour must be cloud-specific to some degree; for example, we cannot shut down Leonhard.
More generally, it would make sense to offer a choice between shutting down and terminating instances after the experiment is over, perhaps by extending the awsclean command-line argument.
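A sketch of how such a choice could be combined with the cloud-specific rule. The option name `--cleanup-mode` and its values are hypothetical and not the existing awsclean flag; the default `terminate` mirrors the current behaviour. Clusters we cannot shut down, such as Leonhard, always resolve to `keep`.

```python
import argparse
from typing import List, Optional

MODES = ("terminate", "stop", "keep")
# Clouds whose machines we must never shut down or terminate ourselves.
UNMANAGED_CLOUDS = {"leonhard"}


def parse_cleanup_mode(argv: Optional[List[str]] = None) -> str:
    """Parse the (hypothetical) --cleanup-mode option."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--cleanup-mode", choices=MODES, default="terminate",
                        help="what to do with cloud instances after the experiment")
    return parser.parse_args(argv).cleanup_mode


def end_of_suite_action(cloud: str, mode: str) -> str:
    """Return the action actually applied for this cloud given the requested mode."""
    if cloud.lower() in UNMANAGED_CLOUDS:
        return "keep"
    return mode
```

Keeping the per-cloud override separate from the CLI option means a user cannot accidentally request a shutdown on a shared cluster.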
At the moment, EC2 instances are terminated at the end of the suite.
The problem is that when experiments in the same suite take very different amounts of time, machines sit idle until the last experiment is over.
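A small worked example of the cost, assuming all experiments start at the same time and each runs on its own instance. If instances are only terminated at the end of the suite, each one idles from its own finish until the slowest experiment finishes; stopping each instance when it finishes avoids that compute time (stopped instances still incur EBS storage charges). The durations are made-up illustration values.

```python
from typing import Sequence


def idle_hours_until_suite_end(durations_h: Sequence[float]) -> float:
    """Total idle instance-hours if every instance waits for the slowest experiment."""
    if not durations_h:
        return 0.0
    suite_end = max(durations_h)
    return float(sum(suite_end - d for d in durations_h))


# Three experiments taking 2 h, 5 h and 8 h: idle = 6 + 3 + 0 instance-hours.
print(idle_hours_until_suite_end([2, 5, 8]))  # → 9.0
```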