New Release #40

Merged: 47 commits merged into main from etl on Aug 17, 2022
Conversation

@nicolas-kuechler (Owner) commented on May 6, 2022

Potentially Breaking Changes (Migration Guide)

If you want to migrate an old project to the most recent version, there are a few breaking changes that you have to consider.
In the future, I would like to avoid breaking changes as much as possible, but this PR changes the overall structure significantly, and I did not want to introduce a lot of overhead to also support the previous version.

  • A change in the design extension (separation from Ansible) means that some filters are no longer available. If you used the json_query filter to find the DNS name of another machine in a multi-instance experiment, check example04-multi.yml for the correct usage of quotation marks.

  • The folders does_config and does_results were renamed to doe-suite-config and doe-suite-results.

  • The config folder structure changed -> the folder should be generated using the new process, and then you can migrate your old roles.

  • The Loader's get_output_dir(self, etl_info) function changed its signature (see the sketch below the list).
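As a minimal sketch of what a custom Loader with the new signature could look like: the base class, the "suite_dir" key, and the output layout are assumptions, so check the built-in loaders in the doe-suite for the authoritative interface.

import os

class MyPlotLoader:  # in a real project this would subclass the doe-suite Loader base class
    def get_output_dir(self, etl_info):
        # etl_info is assumed to carry metadata about the suite run (the "suite_dir" key is an assumption)
        out = os.path.join(etl_info["suite_dir"], "plots")
        os.makedirs(out, exist_ok=True)
        return out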

General Workflow (Usability)

We renamed the folder does_config to doe-suite-config to match the repo's name, and likewise does_results to doe-suite-results.

Makefile

A big usability feature is the introduction of a Makefile.
All interaction with the doe-suite should now go through make.
Previously, the commands were becoming more and more complex to remember, and you most likely relied on your bash history to start experiments.
Now all the important commands are available as make targets, and you can simply run make or make help in the root of the doe-suite repo to see an overview of all commands:

Running Experiments
  make run suite=<SUITE> id=new                       - run the experiments in the suite
  make run suite=<SUITE> id=<ID>                      - continue with the experiments in the suite with <ID> (often id=last)
  make run suite=<SUITE> id=<ID> cloud=<CLOUD>        - run suite on non-default cloud ([aws], euler)
  make run suite=<SUITE> id=<ID> expfilter=<REGEX>    - run only subset of experiments in suite where name matches the <REGEX>
Clean
  make clean                                          - terminate running cloud instances belonging to the project and local cleanup
  make clean-result                                   - delete all results in doe-suite-results except for the last (complete) suite run per suite
Running ETL Locally
  make etl suite=<SUITE> id=<ID>                      - run the etl pipeline of the suite (locally) to process results (often id=last)
  make etl-design suite=<SUITE> id=<ID>               - same as `make etl ...` but uses the pipeline from the suite design instead of results
  make etl-all                                        - run etl pipelines of all results
  make etl-super config=<CONFIG> out=<PATH>           - run the super etl to combine results of multiple suites  (for <CONFIG> e.g., demo_plots)
Clean ETL
  make etl-clean suite=<SUITE> id=<ID>                - delete etl results from specific suite (can be regenerated with make etl ...)
  make etl-clean-all                                  - delete etl results from all suites (can be regenerated with make etl-all)
Gather Information
  make info                                           - list available suite designs
  make status suite=<SUITE> id=<ID>                   - show the status of a specific suite run (often id=last)
Design of Experiment Suites
  make design suite=<SUITE>                           - list all the run commands defined by the suite
  make design-validate suite=<SUITE>                  - validate suite design and show with default values
Setting up a Suite
  make new                                            - initialize doe-suite-config from a template
Running Tests
  make test                                           - running all suites (seq) and comparing results to expected (on aws)
  make euler-test cloud=euler                         - running all single instance suites on euler and compare results to expected
  make etl-test-all                                   - re-run all etl pipelines and compare results to current state (useful after update of etl step)

Multi-User Setup

We improved the usability for multiple people working on the same project.
Previously, prj_id, ssh_key_name, and the Euler username were variables set in group_vars/all. As a result, when multiple people wanted to work on the same project, they had to maintain different versions of the group_vars file.

We extracted these variables into environment variables.
Everything that needs to differ between two people working on the same project should now be an environment variable.
For example:

export DOES_PROJECT_DIR=/home/kuenico/dev/doe-suite/demo_project
export DOES_SSH_KEY_NAME=id_rsa_zeph
export DOES_EULER_USER=kunicola
export DOES_PROJECT_ID_SUFFIX=nku

Extracting these user-specific configs into environment variables also made it possible to commit the group_vars from the demo_project and, overall, simplifies the "Getting Started" process.

Getting Started Process

We removed the repotemplate.py functionality and instead rely on cookiecutter for initializing a new doe-suite project (does_config folder).

The template of the does_config folder can be found in cookiecutter-does_config.

The cookiecutter template process can be started with make new; it considers the DOES_PROJECT_DIR environment variable to check whether a does_config folder already exists.

Among other things, cookiecutter provides hooks that allow executing arbitrary Python code before and after the files are created. We use this to replace the previous feature of setting up-to-date EC2 images in the config.
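As a sketch of the idea (not necessarily how the actual hook in cookiecutter-does_config is implemented), a post-generation hook could look up a current AMI like this; hooks/post_gen_project.py is the standard cookiecutter hook location, and the region and image name filter are placeholders.

# hooks/post_gen_project.py
import boto3

def latest_amazon_linux_ami(region="eu-central-1"):
    # query AWS for Amazon Linux 2 images and pick the most recently created one
    ec2 = boto3.client("ec2", region_name=region)
    images = ec2.describe_images(
        Owners=["amazon"],
        Filters=[{"Name": "name", "Values": ["amzn2-ami-hvm-*-x86_64-gp2"]}],
    )["Images"]
    return max(images, key=lambda img: img["CreationDate"])["ImageId"]

if __name__ == "__main__":
    print(latest_amazon_linux_ami())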

GitHub Repo

Instead of a single repo, it is now possible to define a list of repositories.
For each repo, you can also pin a specific branch or commit to check out.

Testing

After changes to the doe-suite, it was very tedious to check whether the core functionality still works.
Basically, we had to run all examples and then manually check that the results and etl_results match our expectations.

We replace this manual workflow with a simple command:

Running Tests
  make test                                           - running all suites sequentially and comparing results to expected

For each example experiment, we now keep a results folder demo_project/does_results/example01-minimal_$expected in the repository that shows the results we expect from this example.
When we run the simple command above, all experiments run sequentially, and after completion the produced results are compared with the expected ones.
If the two result trees differ (except for the suite id), an error is raised.
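Conceptually, the comparison works roughly like the following Python sketch (not the actual test implementation): walk the expected result tree and report any file that is missing or differs in the freshly produced tree. Comparing relative paths keeps the check independent of the suite id in the top-level folder name.

import filecmp
from pathlib import Path

def compare_results(expected_dir, actual_dir):
    """Return the relative paths of all files that differ between the two result trees."""
    expected, actual = Path(expected_dir), Path(actual_dir)
    mismatches = []
    for exp_file in expected.rglob("*"):
        if exp_file.is_dir():
            continue
        rel = exp_file.relative_to(expected)  # relative paths do not contain the suite id
        act_file = actual / rel
        if not act_file.exists() or not filecmp.cmp(exp_file, act_file, shallow=False):
            mismatches.append(str(rel))
    return mismatches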

This is the first step toward providing CI functionality for the doe-suite.

Design Enhancements

Filter Experiments

It's now possible to run only a subset of the experiments defined in a suite.
You can filter experiment names with a regex provided when running a new suite:
make run suite=<SUITE> id=new expfilter=<REGEX>

Developing Designs

The process of developing designs has become easier.
Two make targets take a design and convert it into the list of jobs that it defines:

Design of Experiment Suites
  make design suite=<SUITE>                           - list all the run commands defined by the suite
  make design-validate suite=<SUITE>                  - validate suite design and show with default values

For example, running make design suite=example01-minimal results in the following list of commands on stdout:

Experiment=minimal
  run=000 host=small-0: echo "hello world. "
  run=001 host=small-0: echo "hello world! "
  run=002 host=small-0: echo "hello universe. "
  run=003 host=small-0: echo "hello universe! "

Self-Referencing Variables

In the design, it should be possible to use arbitrarily nested, self-referencing variables.
For example: a refers to b, and b refers to c -> after resolution, a uses the value of c.
This allows writing repetitive parts of a design more concisely.
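The resolution semantics can be illustrated with a small Python sketch (for illustration only, not the doe-suite implementation): every value is rendered against the variable dictionary repeatedly until a fixpoint is reached, so a reference to a reference eventually resolves to the final value.

from jinja2 import Template

def resolve(variables, max_passes=10):
    # repeatedly render each value against the whole dict until nothing changes anymore
    resolved = dict(variables)
    for _ in range(max_passes):
        rendered = {k: Template(str(v)).render(**resolved) for k, v in resolved.items()}
        if rendered == resolved:
            return rendered
        resolved = rendered
    raise ValueError("could not resolve variables (circular reference?)")

print(resolve({"a": "{{ b }}", "b": "{{ c }}", "c": "hello"}))
# -> {'a': 'hello', 'b': 'hello', 'c': 'hello'}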

Various

We introduced the possibility of a custom range syntax but removed it again because we noticed that the default jinja2 syntax already covers this: {{ range(10) | list }} produces [0, 1, 2, 3, 4, 5, 6, 7, 8, 9].

ETL Enhancements

Running ETL Locally

The Makefile includes targets for running the ETL pipeline on existing results, independent of the doe-suite.

Running ETL
  make etl suite=<SUITE> id=<ID>                      - run the etl pipeline of the suite (locally) to process results (often id=last)

For convenience, we also provide the id=last feature known from continuing experiment suite runs.

Package for your own Steps

In the does_config folder, there is now a Python package called does_config. In this package, you can define custom ETL steps, which are then available to the designs of this project.
The advantage over the previous solution for providing custom ETL steps is that, since you now have your own poetry package, you can also introduce custom dependencies that are not present in the doe-suite.
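For illustration, a project-specific step could look roughly like the sketch below. The module path, the base class, and the column layout of the extracted results are assumptions; in a real project the step would subclass the doe-suite ETL base classes and be referenced by name in the design.

# does_config/does_config/steps.py  (hypothetical module path)
import pandas as pd

class RepetitionAggTransformer:  # would subclass the doe-suite Transformer base class
    def transform(self, df: pd.DataFrame, options: dict) -> pd.DataFrame:
        # aggregate repetitions: mean per run configuration
        # ("exp_name" and "run" as grouping columns are an assumption about the result layout)
        group_cols = options.get("groupby_columns", ["exp_name", "run"])
        return df.groupby(group_cols).mean(numeric_only=True).reset_index()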

Error Handling

A failure in the ETL pipeline does not stop an experiment.
However, we should still notify the user that an error occurred and provide information on the error.

  • Option to investigate: write a module that allows raising Ansible warnings; each "output progress info" should carry a flag indicating whether the ETL step failed or succeeded.

We will not implement complex error handling in this PR. The current error handling is not elegant, but it is sufficient for debugging.

Include ETL Pipelines and ETL Stages

Sometimes we want to reuse complete ETL Pipelines or at least ETL Stages (e.g., the extract stage is always the same) in different suites or for different experiments.

Before, we had to duplicate the definition of each ETL pipeline. With this feature, we can now INCLUDE a pipeline from another design or from a folder dedicated to ETL templates: does_config/designs/etl_templates.

"All" Experiments for ETL Pipeline

Previously, all experiments had to be listed by name to indicate that their results should be used in an ETL pipeline.
You can still do this, but for pipelines that should simply use all experiments, you can now use * instead of the list of experiment names.

ETL and Super ETL Examples

In the current designs, there are not that many examples of ETL pipelines and no example of a super ETL that combines results from different suites.

The super ETL example should show how this can be used for a paper.

Various

  • When running ETL locally, we should be able to specify that the ETL config of the current design is used instead of the old one saved when the experiment was initially run.
    TODO: not present in the Makefile

  • For existing ETL pipelines, provide a way to re-run all of them on the results folder. The idea is to be able to check that everything still works after changing, e.g., a transformer or a loader.

  • support id=last in manual etl call

  • by default do not require GPU for running the design examples

nicolas-kuechler changed the title from [WIP] ETL Extension to ETL Extension on Jul 22, 2022
nicolas-kuechler changed the title from ETL Extension to New Release on Jul 22, 2022
nicolas-kuechler merged commit f2a06e1 into main on Aug 17, 2022
nicolas-kuechler deleted the etl branch on Mar 13, 2023