Add benchmarks #491
base: main
Conversation
@Hrovatin
""" | ||
# TODO change path | ||
data_dir = ( | ||
"/Users/karinhrovatin/Documents/code/" + "BayBE_benchmark/domains/ArylHalides/" |
Where to read data from?
This data is from BayBE_benchmark.
- Should I instead use the data in the example/Backtesting folder in lookup.xlsx and copy the missing files of the other datasets there?
Yes, I think it would be consistent to move all datasets into a new subfolder in the benchmark module and change the paths in the corresponding example(s).
One other thought: we used .xlsx for that particular example, but it doesn't render nicely on GitHub, and afaik the size doesn't really require that compression. So we might want to consider simply turning it into a plain .csv.
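A minimal sketch of such a conversion (paths are illustrative, and openpyxl is assumed to be installed for reading .xlsx):

import pandas as pd

# One-off conversion of the Excel lookup into a plain CSV (paths illustrative).
pd.read_excel("examples/Backtesting/lookup.xlsx").to_csv(
    "examples/Backtesting/lookup.csv", index=False
)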
I added the data to benchmarks/data but didn't adjust the file format and examples yet. So let me know if the above consideration was meant as something that should be done or just as a discussion idea.
I would do it, but let's wait for the others' opinions.
Given that there is now the corresponding utility, is this comment here still relevant?
Now we have two copies of the arylhalides dataset in the repo, which certainly makes no sense. So I guess the one from the examples folder should vanish now. This means we need a clean way to access benchmarking data from all places in the repo.
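One possible shape for such an accessor, sketched under the assumption that all datasets end up under benchmarks/data/<Dataset>/ (the function and file names are illustrative, not the PR's actual code):

from pathlib import Path

import pandas as pd

# Hypothetical helper, e.g. in benchmarks/data/utils.py; resolves paths relative
# to the module file so it works regardless of the caller's working directory.
DATA_PATH = Path(__file__).resolve().parent


def load_benchmark_data(dataset: str, filename: str = "data.csv") -> pd.DataFrame:
    """Load a benchmark dataset by its folder name, e.g. 'ArylHalides'."""
    return pd.read_csv(DATA_PATH / dataset / filename)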
Should this be done in this PR already, or can we consider this a clean-up in a follow-up PR?
    objective=objective,
)

results = []
How should TL benchmarks be set up? E.g., should we always include a comparison to no TL and different source data proportions?
So far, the TL benchmarks have shown the effect of different amounts of source data. I don't think there is a general recipe for how many points to use in the general case; we simply have to decide it case by case, and I would fully leave this to you. To me, e.g. for the chemistry cases, using 1, 2, 5, 10 and 20 % of source data seemed reasonable. We can discuss whether it needs that many variants or if just 3 would also suffice.
Baseline 1: No TL
Then this should be compared to some baseline, and I also have a new thought for that. We used to compare this simply to 0 % source data used, and I still think this is reasonable. When we add 0 data, the task parameter is not even required. This allows for 2 variants: 0 data added and no task parameter, and 0 data added with a task parameter present. Perhaps one could skip one of the two variants, but I still think that if the performance of these two variants differs strongly, it might indicate fit problems we need to solve. So I would vote for having both of these variants in the plot as well.
Baseline 2: Naive TL
And here comes the additional idea, although I don't really know yet whether we can simply integrate that into a plot together with the one above. Thinking about what the baseline for TL is: no TL is one possible choice, as discussed above. But there is another option: naive TL. Naive TL would just mean: do not use a task parameter and completely discard the parameter that corresponds to the task; just add the source data, acting as if there were no difference between the tasks. This comes with complications:
- Here we could also vary the amount of source data used, so it's not just 1 line added to the above plot.
- We could copy the settings from above though and produce a stacked plot consisting of actual TL on top and naive TL on bottom, thoughts?
- You might have to be careful with the allow_recommending_* flags, since in a TL dataset that discards the task parameter, several data points are degenerate; the recommender should be allowed to recommend previously recommended and measured points again (see the sketch at the end of this comment).
Baseline 3: Explicitly Modelled TL
This last thought can even be extended: there is also the variant of doing explicit TL, by modelling the task parameter as a numerical parameter or a substance parameter with restricting active_values, whatever corresponds to the task (of course this is not possible for a complex example where we don't know the parameters differentiating the tasks). This would then even correspond to a third task... Not sure if we really want that, but conceptually it would make sense to me. Please share your thoughts @AdrianSosic @AVHopp @Hrovatin
Note on cost
From my experience so far, baselines 2 and 3 are much cheaper compared to the campaigns with task parameters present; I think they would maybe make up 10-20 % of the runtime / cost even if they fully replicate the settings, like the number of source points used etc.
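For concreteness, a minimal sketch of how such a naive-TL scenario could be assembled, assuming the names from the benchmark code in this PR (data, source_tasks, lookup, settings, searchspace_nontl, objective) and simply dropping the task column from the source data; the 5 % fraction is just a placeholder:

# Sketch only: "naive TL" = no task parameter; source data is added as if it
# belonged to the target task.
naive_source = data[data["aryl_halide"].isin(source_tasks)].drop(
    columns=["aryl_halide"]
)

results.append(
    simulate_scenarios(
        {"naive-TL": Campaign(searchspace=searchspace_nontl, objective=objective)},
        lookup,
        initial_data=[
            naive_source.sample(frac=0.05) for _ in range(settings.n_mc_iterations)
        ],
        batch_size=settings.batch_size,
        n_doe_iterations=settings.n_doe_iterations,
        impute_mode="error",
    )
)
# Note: the allow_recommending_* settings may need to permit re-recommending
# already measured points, since dropping the task column makes points degenerate.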
Baseline 1: Both are already added
We could copy the settings from above though and produce a stacked plot consisting of actual TL on top and naive TL on bottom, thoughts?
Not sure what you mean here
I think Baseline 3 could be useful if such data were generally available. But since it isn't, I don't find such a benchmark practically useful.
Is this thread still relevant or can it be resolved?
Have any changes been made to the baselines?
Did not check - I was just going through all of the open comments and saw that there had been no update on this for three weeks, so I thought I'd simply ask :D Let me digest this in more detail.
Ok, now I get what you mean here. My opinion: these benchmarks should only contain baselines of type 1.
Reason: this makes the scope of these benchmarks very clear: compare the influence of transfer learning with respect to the TaskParameter, that is, investigate the influence of this parameter and how actually using it by adding more data influences the behavior.
I think the other baselines are also interesting, but they have a different scope, and if we decide to add them, I would rather add them in a separate benchmark which might then e.g. contain a more focused test - like "adding {10,20,30} points naively vs. adding {10,20,30} points properly". This would then essentially test the "proper" TL with the "naive" variants serving as the baselines.
Opinions?
I think the actual baseline for the general TL setting is variant 2 and not any form of variant 1. However, I can see that it is beyond this current PR to add it for all TL benchmarks, and we can postpone that.
Variant 3 is imo optional and also benchmark-dependent, if even possible.
One downside if we don't change it now, though: the plots won't be 100% comparable after the change.
Also open to changing this to variant 2 - we just need to align on a definition of "baseline", and I think multiple are feasible. What is @AdrianSosic's opinion here?
test_task = "1-iodo-4-methoxybenzene"
source_task = [
    # Dissimilar source task
Not saying we should run every possible combo here, but how was this particular choice selected? Why not several source tasks? Is there any chance this is a lucky/unlucky example?
Is this thread still relevant or can it be resolved?
I somehow have a bad feeling about just picking one combination, but I also don't know a solution other than expanding the benchmarks to more combinations.
We could rename the benchmark accordingly - this could be the ArylHalides_1I4M-1C4(T)B-Benchmark (or similar). This would make it clear what is being tested, we could easily extend this, and it would actually open up the way to more benchmarks here that could be added later and would actually reuse the utilities. Opinions @Scienfitz and also @AdrianSosic ?
I can't parse that string as apparently I forgot the rule already, but in general sub-naming the benchmark(s) makes sense.
So, as a compromise: can we select 3 interesting variants based on the cluster plot above and just have three ArylHalides benchmarks?
Will propose something
    ConvergenceBenchmarkSettings,
)

DIMENSION = 4  # input dimensionality of the function
@AdrianSosic , @Scienfitz : There was the wish to also have a continuous TL example, and Michalewicz might be the best choice for that.
My question: how do we set this example up? Given that this is continuous, we cannot do something like "X % of the available points". My suggestion: take a given number of source points (tbd which numbers are reasonable) and have "no TaskParameter; TaskParameter with {0, 5, 15, 20, 50} points". Gucci?
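For illustration, one way the source data could be generated for a fixed number of points (a sketch only; the shifted "source" variant of Michalewicz, the column names, and the task value are assumptions, not the PR's code):

import numpy as np
import pandas as pd

DIMENSION = 4  # matches the benchmark's input dimensionality


def michalewicz(x: np.ndarray, m: int = 10) -> float:
    """Standard Michalewicz function on [0, pi]^d."""
    i = np.arange(1, len(x) + 1)
    return -np.sum(np.sin(x) * np.sin(i * x**2 / np.pi) ** (2 * m))


def make_source_data(n_points: int, shift: float = 0.1, seed: int = 0) -> pd.DataFrame:
    """Evaluate a slightly shifted Michalewicz variant at random inputs."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.0, np.pi, size=(n_points, DIMENSION))
    y = np.array([michalewicz(row + shift) for row in x])
    df = pd.DataFrame(x, columns=[f"x_{i}" for i in range(DIMENSION)])
    df["Target"] = y
    df["Function"] = "source"  # value of the hypothetical TaskParameter
    return df


# one candidate source set per size under discussion
source_sets = {n: make_source_data(n) for n in (0, 5, 15, 20, 50)}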
15 and 20 seem like a strange stride. I suggest {0, 10, 25, 50, 100} if it doesn't converge fast, or {0, 5, 10, 20, 50} if it is already converged and there's no difference between 50 and 100.
Specific numbers are still tbd, will test a bit
First batch of comments; they only regard one benchmark file but probably apply to the others as well.
@@ -36,6 +36,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- `SubstanceParameter`, `CustomDiscreteParameter` and `CategoricalParameter` now also
  support restricting the search space via `active_values`, while `values` continue to
  identify allowed measurement inputs
- Additional benchmarks
Can we also list the names after Additional benchmarks:
benchmarks/data/utils.py
from pathlib import Path

DATA_PATH = Path(*Path(__file__).parts[:-1])
This is a very weird way to construct the path. There's, for example, the `parent` method.
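For reference, the equivalent but more readable construction would be:

from pathlib import Path

# Same directory as Path(*Path(__file__).parts[:-1]), expressed via .parent
DATA_PATH = Path(__file__).parent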
Note that it might be necessary to have a somewhat weird construction, since this needs to run both locally and in the pipeline - I know that I got strange paths as a result of that previously. But I will try to come up with something more reasonable.
Will add a version that works locally but keep this open until it has been confirmed to also work in the pipeline.
Why is this file called `data_raw.csv` but the other `data.csv`?
Will rename files properly
    Data for benchmark.
    """
    data_path = DATA_PATH / "ArylHalides"
    data = pd.read_table(data_path / "data_raw.csv", sep=",").dropna(
Aren't there many unused columns? They could be dropped right at data loading.
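For example, something along these lines (the column names are placeholders, not necessarily those of the dataset):

# Only read the columns the benchmark actually uses (names are illustrative).
USED_COLUMNS = ["aryl_halide", "base", "ligand", "additive", "yield"]
data = pd.read_csv(data_path / "data_raw.csv", usecols=USED_COLUMNS).dropna()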
They will be changed in the upcoming suggestion.
data = get_data()

target_tasks = ["1-iodo-4-methoxybenzene"]
source_tasks = [
    # Dissimilar source task
    "1-chloro-4-(trifluoromethyl)benzene"
]
Executing these lines directly in the module scope is suboptimal
Will be gone in the upcoming suggestion
]


def space_data() -> (
This is a typical example of the "Now that my prototype is done, I wrap my spaghetti-style script in a function and just return everything that is needed"-style of function 🙃 ... which pretty much defeats the purpose of having it as a function in the first place. Can we structure this a bit more elegantly?
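Purely as an illustration of one possible restructuring (names and signatures are assumptions, since the full function is not visible in this diff):

import pandas as pd


def make_lookup(data: pd.DataFrame, target_tasks: list[str]) -> pd.DataFrame:
    """Target-task measurements serving as the lookup table."""
    return data[data["aryl_halide"].isin(target_tasks)]


def make_initial_data(data: pd.DataFrame, source_tasks: list[str]) -> pd.DataFrame:
    """Source-task measurements used as transfer-learning initial data."""
    return data[data["aryl_halide"].isin(source_tasks)]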
lookup = data.query("aryl_halide.isin(@target_tasks)").copy(deep=True)
initial_data = data.query("aryl_halide.isin(@source_tasks)", engine="python").copy(
    deep=True
)
I think I don't get the purpose of these copies here
I also think that it won't be necessary, will investigate.
for p in [0.01, 0.05, 0.1]:
    results.append(
        simulate_scenarios(
            {f"{int(100 * p)}": campaign},
            lookup,
            initial_data=[
                initial_data.sample(frac=p) for _ in range(settings.n_mc_iterations)
            ],
            batch_size=settings.batch_size,
            n_doe_iterations=settings.n_doe_iterations,
            impute_mode="error",
        )
    )
# No training data
results.append(
    simulate_scenarios(
        {"0": campaign},
        lookup,
        batch_size=settings.batch_size,
        n_doe_iterations=settings.n_doe_iterations,
        n_mc_iterations=settings.n_mc_iterations,
        impute_mode="error",
    )
)
# Non-TL campaign
results.append(
    simulate_scenarios(
        {"non-TL": Campaign(searchspace=searchspace_nontl, objective=objective)},
        lookup,
        batch_size=settings.batch_size,
        n_doe_iterations=settings.n_doe_iterations,
        n_mc_iterations=settings.n_mc_iterations,
        impute_mode="error",
    )
I guess this boilerplate code can be condensed by a lot using partials and some nicer list constructions
Already on it :)
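One possible condensed version, sketched from the snippet above with functools.partial (not necessarily the final code; all other names are taken from the snippet):

from functools import partial

# Fix the keyword arguments shared by all scenario calls once.
simulate = partial(
    simulate_scenarios,
    batch_size=settings.batch_size,
    n_doe_iterations=settings.n_doe_iterations,
    impute_mode="error",
)

results = [
    simulate(
        {f"{int(100 * p)}": campaign},
        lookup,
        initial_data=[
            initial_data.sample(frac=p) for _ in range(settings.n_mc_iterations)
        ],
    )
    for p in (0.01, 0.05, 0.1)
]
results += [
    simulate({"0": campaign}, lookup, n_mc_iterations=settings.n_mc_iterations),
    simulate(
        {"non-TL": Campaign(searchspace=searchspace_nontl, objective=objective)},
        lookup,
        n_mc_iterations=settings.n_mc_iterations,
    ),
]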
Rebase onto main
Otherwise we get the following error when running the benchmark actions: `Missing optional dependency 'openpyxl'.`
Co-authored-by: Martin Fitzner <[email protected]>
EDIT: The scenarios are now TL (`TL`), TL with no source data (`TL-noSource`), and a non-TL model with only target data (`non-TL`); originally they were a run with no source data (`0`) and one with a non-TL model using only task data (`nonTL`).
OLD:
Started by adding a direct arylation TL campaign with Temp as task, adapted from the paper.
Will continue adding more. If someone could already check whether this goes in the right direction, that would be great, so that I do not repeat the mistakes for the others.
Things to discuss: