New Release #40
Merged
Potentially Breaking Changes (Migration Guide)
If you want to migrate an old project to the most recent version, there are a few breaking changes that you have to consider.
In the future, I would like to prevent breaking changes as much as possible but this PR changes the overall structure significantly and so I did not want to introduce a lot of overhead to also support the previous version.
A change in the design extension (separation from Ansible) means that some filters are no longer available. If you used the json_query filter to find another machine's DNS in a multi-instance experiment, check example04-multi.yml for the correct usage of quotation marks.
The folders does_config and does_results are renamed to doe-suite-config and doe-suite-results.
The config folder structure changed: generate the folder using the new process, then migrate old roles into it.
The def get_output_dir(self, etl_info) function of the Loader changed its signature.
General Workflow (Usability)
We renamed the folder does_config to doe-suite-config to match the repo's name, and likewise does_results to doe-suite-results.
Makefile
A big usability feature is the introduction of a Makefile.
All interaction with the doe-suite should go through make.
Previously, commands were getting more and more complex to remember, and you most likely relied on the bash history to start experiments etc.
Now all the important commands are available as make targets, and you can simply run make or make help in the root of the doe-suite repo to see an overview of all commands.
Multi-User Setup
We improve the usability for multiple people working on the same project.
Previously, prj_id, ssh_key_name, and the Euler username were variables set in group_vars/all. As a result, when multiple people wanted to work on the same project, they each needed a different version of the group_vars file.
We moved these variables to environment variables.
Everything that needs to be different for two people working on the same project should now be in environment variables.
For example:
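A minimal Python sketch of this idea follows. The variable names DOES_PROJECT_ID_SUFFIX, DOES_SSH_KEY_NAME, and DOES_EULER_USER and the helper load_user_config are hypothetical and only illustrate how user-specific values could be read from the environment instead of a shared group_vars file.

```python
import os

# Hypothetical variable names, chosen for illustration only.
USER_SPECIFIC_VARS = ("DOES_PROJECT_ID_SUFFIX", "DOES_SSH_KEY_NAME", "DOES_EULER_USER")


def load_user_config(environ=None):
    """Collect the user-specific settings from environment variables."""
    environ = os.environ if environ is None else environ
    missing = [name for name in USER_SPECIFIC_VARS if name not in environ]
    if missing:
        # Fail early with a message that names every missing variable.
        raise KeyError(f"missing environment variables: {', '.join(missing)}")
    return {name: environ[name] for name in USER_SPECIFIC_VARS}
```

Two people working on the same project would then export different values in their own shell, while the committed project files stay identical.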
Extracting these user-specific configs to environment variables also allowed us to commit the group_vars of the demo_project and to simplify the overall "Getting Started Process".
Getting Started Process
We remove the repotemplate.py functionality and instead rely on cookiecutter for initializing a new doe-suite project (does_config folder).
The template of the does_config folder can be found in cookiecutter-does_config.
The cookiecutter template process can be started with make new; it checks the DOES_PROJECT_DIR environment variable to see whether a does_config folder already exists.
Among other things, cookiecutter provides hooks that allow executing arbitrary Python code before and after creating the files. We use these hooks to replace the previous feature that set up-to-date EC2 images in the config.
GitHub Repo
Instead of a single repo, now it's possible to define a list of repositories.
For each repo, it's now also possible to set a specific branch or commit for checkout.
Testing
After changes in the doe-suite, it was very tedious to check whether the core functionality still works.
Basically, we had to run all examples and then manually check that the results and etl_results were as expected.
We replace this manual workflow with a simple command:
For each example experiment, we now keep a results folder demo_project/does_results/example01-minimal_$expected in the repository that shows the results we expect from this example.
When we run the simple command above, all experiments are run sequentially and, after completion, the produced results are compared with the expected ones.
If the two result files have differences (except for the suite id), then an error is raised.
This is the first step toward providing CI functionality for the doe-suite.
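The comparison step could look roughly like the sketch below. It is not the doe-suite's implementation: it assumes that results are text files and that the suite id appears literally in file paths and file contents, and the helper names are hypothetical.

```python
from pathlib import Path


def _normalized_files(root, suite_id):
    """Map each relative file path under root to its content, with the suite id masked."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix().replace(suite_id, "<SUITE_ID>"):
            path.read_text().replace(suite_id, "<SUITE_ID>")
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def assert_results_match(actual_dir, actual_id, expected_dir, expected_id):
    """Raise AssertionError if the two result folders differ apart from the suite id."""
    actual = _normalized_files(actual_dir, actual_id)
    expected = _normalized_files(expected_dir, expected_id)
    if actual.keys() != expected.keys():
        raise AssertionError(f"file sets differ: {sorted(actual.keys() ^ expected.keys())}")
    differing = [name for name in expected if actual[name] != expected[name]]
    if differing:
        raise AssertionError(f"results differ in: {differing}")
```

Masking the suite id before comparing keeps the check strict for everything else, so any other difference still raises an error.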
Design Enhancements
Filter Experiments
It's now possible to run only a subset of the experiments defined in a suite.
You can filter experiment names with a regex provided when running a new suite:
make run suite=<SUITE> id=new expfilter=<REGEX>
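As a rough illustration of what such a filter does, the sketch below selects experiment names with Python's re module. Whether the doe-suite matches anywhere in the name (as here, via search) or requires a full match is an assumption, and the function filter_experiments is hypothetical.

```python
import re


def filter_experiments(names, pattern):
    """Keep only the experiment names in which the regex matches somewhere."""
    regex = re.compile(pattern)
    return [name for name in names if regex.search(name)]
```

With this semantics, a pattern such as ^exp_ keeps only experiments whose name starts with exp_, while a plain substring keeps every experiment containing it.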
Developing Designs
The process of developing designs has become easier.
Two make targets take a design and convert it into the list of jobs that it defines.
For example, running make design suite=example01-minimal prints the resulting list of commands on stdout.
Self-Referencing Variables
In the design, it should be possible to use arbitrary nested, self-referencing variables.
For example:
a refers to b
b refers to c
-> after resolution, a uses c
This allows for writing repetitive parts of a design more concisely.
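The resolution of such chains can be sketched in plain Python as follows. This is not how the doe-suite resolves designs (which relies on Jinja2 templating); it is a simplified stand-in that only expands {{ name }} placeholders in string values, and the function resolve is hypothetical.

```python
import re

# Matches placeholders of the form {{ name }}.
_REF = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def resolve(variables):
    """Return a copy of variables with all {{ name }} references expanded."""
    resolved = {}

    def expand(name, stack):
        if name in resolved:
            return resolved[name]
        if name in stack:
            raise ValueError(f"circular reference: {' -> '.join(stack + (name,))}")
        value = variables[name]
        if isinstance(value, str):
            # Recursively expand every variable that this value refers to.
            value = _REF.sub(lambda m: str(expand(m.group(1), stack + (name,))), value)
        resolved[name] = value
        return value

    for name in variables:
        expand(name, ())
    return resolved
```

For the a, b, c example above, resolving a first expands b, which in turn expands c, so a ends up using the value of c. A cycle such as a referring to b and b referring to a is reported as an error instead of looping forever.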
Various
We had introduced a custom range syntax, but removed it again because the default Jinja2 syntax already covers this:
{{ range(10) | list }} produces [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ETL Enhancements
Running ETL Locally
The Makefile includes targets for running the ETL pipeline on existing results, independently of the doe-suite.
For convenience, we also provide the id=last feature known from continuing to run experiment suites.
Package for your own Steps
In the does_config folder, there is now a Python package called does_config. In this package, you can define custom ETL steps, which should then be available to the designs of this project.
The advantage over the previous solution for providing custom ETL steps is that, since you now have your own poetry package, you can also introduce custom dependencies that are not present in the doe-suite.
does_config/etl #17
Error Handling
A failure in the ETL pipeline does not stop an experiment.
However, we should still notify the user that an error occurred and provide information on the error.
We will not implement complex error handling in this PR. The current error handling is not elegant but sufficient for debugging.
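The intended behavior, where a failure is reported but does not abort the run, can be sketched as below. This is an illustration under assumptions, not the error handling in this PR: pipelines are modeled as plain callables and run_pipelines_safely is a hypothetical name.

```python
import logging
import traceback

logger = logging.getLogger("etl")


def run_pipelines_safely(pipelines):
    """Run each pipeline; record and log failures instead of propagating them."""
    errors = {}
    for name, run in pipelines.items():
        try:
            run()
        except Exception:
            # Keep the full traceback so the user can debug the failing step.
            errors[name] = traceback.format_exc()
            logger.error("ETL pipeline %r failed:\n%s", name, errors[name])
    return errors
```

The returned dictionary lets the caller notify the user about every failed pipeline once the experiment itself has continued.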
Include ETL Pipelines and ETL Stages
Sometimes we want to reuse complete ETL Pipelines or at least ETL Stages (e.g., the extract stage is always the same) in different suites or for different experiments.
Previously, we had to duplicate the definition of each ETL pipeline. With this feature, we can INCLUDE a pipeline from another design or from a folder dedicated to ETL templates: does_config/designs/etl_templates
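Conceptually, an include replaces a pipeline entry with a definition looked up elsewhere. The sketch below models this with plain dictionaries; the "INCLUDE" key, the "source" and "pipeline" fields, and the function resolve_includes are hypothetical and do not reflect the actual design syntax.

```python
import copy


def resolve_includes(pipelines, sources):
    """Replace pipeline entries that only contain an include with the referenced definition."""
    resolved = {}
    for name, definition in pipelines.items():
        include = definition.get("INCLUDE")
        if include is None:
            resolved[name] = definition
        else:
            # Copy so that later edits do not modify the shared template.
            template = sources[include["source"]][include["pipeline"]]
            resolved[name] = copy.deepcopy(template)
    return resolved
```

Keeping shared definitions in one place means a change to an extract stage, for instance, only has to be made once.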
"All" Experiments for ETL Pipeline
Previously, all experiments had to be listed by name to indicate that their results should be used in an ETL pipeline.
You can still do this, but for pipelines that should use all experiments, we can simply use * instead of the experiment name list.
ETL and Super ETL Examples
In the current designs, there are not that many examples of ETL pipelines and no example for a super ETL that allows combining results from different suites.
The super ETL example should show how this can be used for a paper.
$expected results for this example
Various
when running ETL locally, we should be able to specify that the ETL config of the current design is used instead of the old one present when the experiment was initially run.
TODO: not present in Makefile
for existing ETL pipelines, provide a way to re-run all ETL pipelines on the results folder. The idea is to check that everything still works after changing, e.g., a transformer or loader.
support id=last in manual ETL call
by default, do not require a GPU for running the design examples