Terminating hosts per experiment instead of per suite (i.e., when all experiments are done) #42

Open
nicolas-kuechler opened this issue Jun 1, 2022 · 1 comment

Comments

@nicolas-kuechler
Owner

At the moment, EC2 instances are terminated at the end of the suite.

The problem is that when experiments in the same suite take different amounts of time, the machines of the faster experiments sit idle until the last experiment is over.

@nicolas-kuechler
Owner Author

Instead of terminating these hosts, it might be better to shut them down.

For single-instance experiments (where the playbook does not need to run to progress to the next job), one idea would be to append a task at the end of the task-spooler queue that shuts down the machine.

When creating an EC2 instance, it is possible to set instance_initiated_shutdown_behavior to stop. Afterward, running shutdown -h now as root stops the EC2 instance instead of terminating it. Later, we would have to ensure that the machine can be started again to fetch the results.
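As a rough sketch of the same idea at the AWS API level (not how the suite currently creates hosts), the functions below assume a boto3 EC2 client is passed in as `ec2_client`. The function names and arguments are made up for illustration; `InstanceInitiatedShutdownBehavior`, `start_instances`, and the `instance_running` waiter are the boto3 counterparts of the setting described above.

```python
def launch_stoppable_instance(ec2_client, image_id, instance_type):
    """Launch one instance that stops (instead of terminating) on OS shutdown."""
    response = ec2_client.run_instances(
        ImageId=image_id,
        InstanceType=instance_type,
        MinCount=1,
        MaxCount=1,
        # With "stop", running `shutdown -h now` on the host stops the instance.
        InstanceInitiatedShutdownBehavior="stop",
    )
    return response["Instances"][0]["InstanceId"]


def restart_for_results(ec2_client, instance_ids):
    """Start stopped instances again so that results can be fetched."""
    ids = list(instance_ids)
    ec2_client.start_instances(InstanceIds=ids)
    # Block until the instances are running before trying to connect.
    ec2_client.get_waiter("instance_running").wait(InstanceIds=ids)
```

Note that a restarted instance usually gets a new public IP address unless an Elastic IP is attached, so the inventory would have to be refreshed before fetching results.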

  • The open question is when to insert this task, and how to prevent the machine from shutting down while the playbook is still interacting with the EC2 instance and wants to download results. One option is to shut down with a delay rather than immediately, so that the playbook can still fetch the results and cancel the pending shutdown.
  • A second open question is whether we can configure or change the behaviour while a suite is running. For example, the tsp task could just run a service that can be deactivated, so that the shutdown only happens while the service is active.
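Both open points could be combined roughly as follows. This is only a sketch under assumptions: the marker file path and the function names are hypothetical, and a plain marker file stands in for the "service that can be deactivated". The script would run as the last tsp task and needs root for the actual `shutdown` call.

```python
import subprocess
from pathlib import Path

# Hypothetical marker file: the shutdown only happens while it exists.
ENABLE_FLAG = Path("/tmp/auto_shutdown_enabled")


def schedule_shutdown(delay_minutes=10, flag=ENABLE_FLAG, run=subprocess.run):
    """Schedule a delayed halt if auto-shutdown is enabled; return True if scheduled."""
    if not flag.exists():
        return False
    # `shutdown -h +N` halts the machine in N minutes instead of immediately.
    run(["shutdown", "-h", f"+{delay_minutes}"], check=True)
    return True


def cancel_shutdown(run=subprocess.run):
    """Cancel a pending shutdown, e.g. when the playbook connects to fetch results."""
    run(["shutdown", "-c"], check=True)
```

The playbook could remove the marker file to disable the behaviour while a suite is running, and call the equivalent of `cancel_shutdown` as its first step whenever it connects to a host.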

For multi-instance experiments, the shutdown is more challenging and needs to be controlled by the playbook. After all jobs are done, we could initiate the shutdown on all instances that belong to this experiment.
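A minimal sketch of what the playbook-side check could look like, assuming the playbook knows the job states and the instance ids per experiment. The data layout, the state name "done", and the function names are assumptions for illustration; `stop_instances` is the boto3 EC2 client call.

```python
def instances_ready_to_stop(job_states, experiment_hosts):
    """Return the instance ids of experiments whose jobs have all finished.

    job_states: {experiment name: list of job states, e.g. "done" or "running"}
    experiment_hosts: {experiment name: list of instance ids}
    """
    ready = []
    for experiment, states in job_states.items():
        # An experiment without any jobs is not treated as finished.
        if states and all(state == "done" for state in states):
            ready.extend(experiment_hosts.get(experiment, []))
    return ready


def stop_finished_experiments(ec2_client, job_states, experiment_hosts):
    """Stop all instances that belong to fully finished experiments."""
    instance_ids = instances_ready_to_stop(job_states, experiment_hosts)
    if instance_ids:
        ec2_client.stop_instances(InstanceIds=instance_ids)
    return instance_ids
```

The results would need to be downloaded before the stop call, or the instances restarted later, since a stopped instance cannot be reached by the playbook.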

Note that the shutdown behaviour must be cloud-specific to some degree because, for example, we cannot shut down Leonhard.

More generally, it would make sense to offer a choice between shutdown and terminate once an experiment is over, perhaps by changing the awsclean command-line argument.
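To illustrate what such a choice could look like, here is a small argparse sketch. The `--on-finish` flag, its values, and the `cleanup_action` helper are hypothetical and not the existing awsclean interface; the default mirrors the current behaviour of terminating, and the fallback for non-AWS clouds is an assumption based on the note that hosts like Leonhard cannot be shut down.

```python
import argparse


def build_parser():
    """Build a parser with a hypothetical per-experiment cleanup option."""
    parser = argparse.ArgumentParser(description="Run an experiment suite (sketch).")
    parser.add_argument(
        "--on-finish",
        choices=["keep", "stop", "terminate"],
        default="terminate",
        help="what to do with an experiment's hosts once it is over",
    )
    return parser


def cleanup_action(on_finish, cloud):
    """Map the option to an action; only AWS hosts are assumed to be stoppable."""
    if cloud != "aws" and on_finish in ("stop", "terminate"):
        return "keep"
    return on_finish
```

For example, `build_parser().parse_args(["--on-finish", "stop"])` would select the shutdown variant for AWS hosts.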
