
Add benchmarks #491

Open · wants to merge 29 commits into main from feature/benchmark_cases

Conversation

@Hrovatin (Collaborator) commented Feb 19, 2025:

EDIT:

  • Added 2 chemical-data TL benchmarks and 3 synthetic ones.
  • Lines:
    • For the chemical ones and Hartmann: a TL model with 0-10% of source data (e.g. "0") and a non-TL model using only task data ("nonTL").
    • For Michalewicz and Easom: a TL model with a single source data size ("TL"), a TL model with no source data ("TL-noSource"), and a non-TL model with only target data ("non-TL").

OLD:
Started by adding a direct arylation TL campaign with temperature as the task - adapted from the paper.

Will continue adding more. If someone could already check whether this goes in the right direction, that would be great, so that I do not repeat the same mistakes for the others.

Things to discuss:

  • Where to read data from?
  • How should TL benchmarks be set up? E.g., should we always include a comparison to no TL and different source data proportions? We didn't do this before on some datasets, as we mainly focused on comparing different TL methods.

@Hrovatin changed the title from "Add direct arylation benchmark for TL with temperature as a task" to "Add benchmarks" on Feb 19, 2025
@Scienfitz (Collaborator):
@Hrovatin
If there are discussion items before we start the review, can you open one thread per item so we can collect thoughts? If needed, we can also have meetings.

"""
# TODO change path
data_dir = (
"/Users/karinhrovatin/Documents/code/" + "BayBE_benchmark/domains/ArylHalides/"
Collaborator Author (@Hrovatin):

Where to read data from?
This data is from BayBE_benchmark - should I instead use the data in the example/Backtesting folder's lookup.xlsx and copy the missing files of the other datasets there?

Collaborator:

Yes, I think it would be consistent to move all datasets into a new subfolder in the benchmark module and change the paths in the corresponding example(s).

One other thought: we used .xlsx for that particular example, but it doesn't render nicely on GitHub, and afaik the size doesn't really require that compression. So we might want to consider simply turning it into a plain .csv.
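If we go that route, the conversion is a one-liner; a minimal sketch (path as mentioned above, assuming an Excel reader such as openpyxl is installed):

import pandas as pd

# Re-save the Excel lookup as a plain CSV that renders and diffs nicely on GitHub
pd.read_excel("example/Backtesting/lookup.xlsx").to_csv(
    "example/Backtesting/lookup.csv", index=False
)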

Collaborator Author (@Hrovatin):

I added the data to benchmarks/data but didn't adjust the file format and examples yet - so let me know if the above consideration was meant as something that should be done or just as a discussion idea.

Collaborator:

I would do it, but let's wait for the others' opinions.

Collaborator:

Given that there is now the corresponding utility, is this comment here still relevant?

Collaborator:

Now we have two copies of the arylhalides dataset in the repo, which certainly makes no sense. So I guess the one from the examples folder should vanish now. This means we need a clean way to access benchmarking data from all places in the repo.

Collaborator:

Should this be done in this PR already, or can we consider it a clean-up for a follow-up PR?

objective=objective,
)

results = []
Collaborator Author (@Hrovatin):

How should TL benchmarks be set up? E.g., should we always include a comparison to no TL and different source data proportions?

@Scienfitz (Collaborator) commented Feb 21, 2025:

So far, our TL benchmarks have shown the effect of different amounts of source data. I don't think there is a general recipe for how many points it should be in the general case; we simply have to decide case by case, and I would fully leave this to you. For the chemistry cases, e.g., using 1, 2, 5, 10, and 20% of source data seemed reasonable to me. We can discuss whether it needs that many variants or if just 3 would suffice.

Baseline 1: No TL
Then this should be compared to some baseline, and I also have a new thought for that. We used to compare simply to 0% source data used, and I still think this is reasonable. When we add no data, the task parameter is not even required. This allows for two variants: no data added and no task parameter, and no data added with a task parameter present. Perhaps one could save one of the two variants, but I still think that if the performance of these two variants differs strongly, it might indicate fit problems we need to solve. So I would vote for having both of these variants in the plot as well.

Baseline 2: Naive TL
And here comes the additional idea, although I don't know yet whether we can simply integrate it into a plot together with the one above. Thinking about what the baseline for TL should be: no TL is one possible choice, as discussed above. But there is another option: naive TL. Naive TL would just mean: do not use a task parameter, completely discard the parameter that corresponds to the task, and just add the source data, acting like there is no difference between the tasks (a sketch follows at the end of this comment). This comes with complications:

  • Here we could also vary the amount of source data used, so it's not just one line added to the above plot.
  • We could copy the settings from above, though, and produce a stacked plot consisting of actual TL on top and naive TL on the bottom - thoughts?
  • You might have to be careful with the allow_recommending_* flags: in a TL dataset that discards the task parameter, several data points are degenerate, and the recommender should be allowed to recommend previously recommended and measured points again.

Baseline 3: Explicitly Modelled TL
This last thought can even be extended: there is also the variant of doing explicit TL by modelling the task parameter as a numerical parameter or a substance parameter with restricted active_values, whatever corresponds to the task (of course this is not possible for a complex example where we don't know the parameters differentiating the tasks). This would then even correspond to a third variant... Not sure if we really want that, but conceptually it would make sense to me. Please share your thoughts @AdrianSosic @AVHopp @Hrovatin

Note on cost
From my experience so far, baselines 2 and 3 are much cheaper than the campaigns with task parameters present; I think they would make up maybe 10-20% of the runtime/cost, even if they fully replicate settings like the number of source points used.
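For concreteness, a minimal sketch of what the naive-TL baseline could look like, reusing the simulate_scenarios pattern from the code under review; parameters_without_task, the task column name, and where the allow_recommending_* flags live are assumptions, not the PR's actual code:

# Naive TL: drop the task parameter and pool source data as if tasks were identical.
# (parameters_without_task is a hypothetical list of all parameters minus the task.)
searchspace_naive = SearchSpace.from_product(parameters=parameters_without_task)
campaign_naive = Campaign(searchspace=searchspace_naive, objective=objective)
# NOTE: with the task column discarded, several rows become degenerate, so the
# recommender must be allowed to revisit measured points (allow_recommending_*;
# exact flag placement depends on the BayBE version).

for p in [0.01, 0.05, 0.1]:
    results.append(
        simulate_scenarios(
            {f"naive-{int(100 * p)}": campaign_naive},
            lookup,
            initial_data=[
                initial_data.drop(columns=["aryl_halide"]).sample(frac=p)
                for _ in range(settings.n_mc_iterations)
            ],
            batch_size=settings.batch_size,
            n_doe_iterations=settings.n_doe_iterations,
            impute_mode="error",
        )
    )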

@Hrovatin (Collaborator Author) commented Feb 21, 2025:

Baseline 1: Both variants are already added.

> We could copy the settings from above though and produce a stacked plot consisting of actual TL on top and naive TL on bottom, thoughts?

Not sure what you mean here.

I think Baseline 3 could be useful if such data were generally available. But since it isn't, I don't find such a benchmark practically useful.

@Scienfitz (Collaborator) commented Feb 21, 2025:

Baselines 2 and 3 are entirely new plots with the same curves as the original plot, like:
[image: example plot]

Combining all of that into one single plot will make it explode visually. Maybe it's also not the task of the benchmark to arrange the plots, but rather just to provide the results.

Collaborator:

Is this thread still relevant or can it be resolved?

Collaborator:

Have any changes been made to the baselines?

Collaborator:

Did not check - I was just going through all of the open comments and saw that there had been no update on this for three weeks, so I thought I'd simply ask :D Let me digest this in more detail.

Collaborator:

Ok, now I get what you mean here. My opinion: these benchmarks should only contain baselines of type 1.
Reason: this makes the scope of the benchmarks very clear - compare the influence of transfer learning with respect to the TaskParameter, that is, investigate the influence of this parameter and how actually using it by adding more data influences the behavior.

I think the other baselines are also interesting, but they have a different scope, and if we decide to add them, I would rather do so in a separate benchmark, which might then e.g. contain a more focused test - like "adding {10,20,30} points naively and adding {10,20,30} points properly". This would then test the "proper" TL, with the "naive" variants basically serving as the baselines.

Opinions?

Collaborator:

I think the actual baseline for the general TL setting is variant 2 and not any form of variant 1. However, I can see that adding it for all TL benchmarks is beyond this current PR, and we can postpone that.

Variant 3 is imo optional and also benchmark-dependent, if even possible.

One downside if we don't change it now, though: plots won't be 100% comparable after the change.

Collaborator:

Also open to changing this to variant 2 - we just need to align on a definition of "baseline", and I think multiple are feasible. What is @AdrianSosic's opinion here?


test_task = "1-iodo-4-methoxybenzene"
source_task = [
    # Dissimilar source task
Collaborator:

Not saying we should run every possible combo here, but how was this particular choice selected? Why not several source tasks? Is there any chance this is a lucky/unlucky example?

Collaborator Author (@Hrovatin):

I selected it as the yield outcomes (color) seemed relatively distinct between the two across the parameters (x-axis). But I tested a few different combinations before and cannot say that things that seem different on the plot are also more challenging for TL.
[image: cluster plot of the aryl halide tasks]

Collaborator:

Is this thread still relevant or can it be resolved?

Collaborator:

I somehow have a bad feeling about just picking one combination, but I also don't know a solution other than expanding the benchmarks to more combinations.

Collaborator:

We could rename the benchmark accordingly - this could be the ArylHalides_1I4M-1C4(T)B-Benchmark (or similar). This would make it clear what is being tested, we could easily extend it, and it would actually open up the way to more benchmarks here that could be added later and would actually reuse the utilities. Opinions, @Scienfitz and also @AdrianSosic?

Collaborator:

I can't parse that string, as I apparently forgot the rule already, but in general sub-naming the benchmark(s) makes sense.
So, as a compromise: can we select 3 interesting variants based on the cluster plot above and just have three ArylHalides benchmarks?

Collaborator:

Will propose something

@Hrovatin force-pushed the feature/benchmark_cases branch from 0ea9be1 to 87d9deb on March 3, 2025 12:42
ConvergenceBenchmarkSettings,
)

DIMENSION = 4 # input dimensionality of the function
Collaborator:

@AdrianSosic, @Scienfitz: there was the wish to also have a continuous TL example, and Michalewicz might be the best choice for that.
My question: how do we set this example up? Given that this is continuous, we cannot do something with "X% of the available points". My suggestion: take a given number of source points (tbd which numbers are reasonable) and have "no TaskParameter, TaskParameter with {0, 5, 15, 20, 50} points". Gucci?
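For illustration, a minimal sketch of generating such fixed-size source datasets (the Michalewicz formula is the standard one; DIMENSION = 4 follows the snippet above, while the column names and point counts are illustrative assumptions):

import numpy as np
import pandas as pd

DIMENSION = 4  # input dimensionality of the function


def michalewicz(x: np.ndarray, m: int = 10) -> np.ndarray:
    """Michalewicz test function on [0, pi]^d (to be minimized)."""
    i = np.arange(1, x.shape[-1] + 1)
    return -np.sum(np.sin(x) * np.sin(i * x**2 / np.pi) ** (2 * m), axis=-1)


rng = np.random.default_rng(1337)
source_data = {}
for n_source in (0, 5, 15, 20, 50):
    # Sample source points uniformly from the function's domain [0, pi]^d
    x = rng.uniform(0.0, np.pi, size=(n_source, DIMENSION))
    df = pd.DataFrame(x, columns=[f"x_{i}" for i in range(DIMENSION)])
    df["Target"] = michalewicz(x)
    source_data[n_source] = df  # later passed as initial_data per scenario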

Collaborator:

15 and 20 seem like a strange stride; I suggest:

{0, 10, 25, 50, 100} if it doesn't converge fast,
or {0, 5, 10, 20, 50} if it is already converged and there's no difference between 50 and 100.

Collaborator:

Specific numbers are still tbd, will test a bit

@AdrianSosic (Collaborator) left a comment:

First batch of comments, only regarding one benchmark file, but they probably apply to the others as well.

@@ -36,6 +36,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- `SubstanceParameter`, `CustomDiscreteParameter` and `CategoricalParameter` now also
  support restricting the search space via `active_values`, while `values` continue to
  identify allowed measurement inputs
- Additional benchmarks
Collaborator:

Can we also list the names after "Additional benchmarks"?


from pathlib import Path

DATA_PATH = Path(*Path(__file__).parts[:-1])
Collaborator:

This is a very weird way to construct the path. There is, for example, the parent attribute.
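For illustration, a minimal sketch of that suggestion (equivalent behavior, assuming DATA_PATH should point at the module's own directory):

from pathlib import Path

# Path(*Path(__file__).parts[:-1]) rebuilds the path piece by piece;
# .parent expresses the same intent directly:
DATA_PATH = Path(__file__).parent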

Collaborator:

Note that it might be necessary to have a somewhat weird construction, since this needs to run both locally and in the pipeline - I know that I got strange paths as a result of that previously. But I will try to come up with something more reasonable.

Collaborator:

Will add a version that works locally but keep this open until it has been confirmed to also work in the pipeline.

Collaborator:

Why is this file called data_raw.csv but the other one data.csv?

Collaborator:

Will rename the files properly.

Data for benchmark.
"""
data_path = DATA_PATH / "ArylHalides"
data = pd.read_table(data_path / "data_raw.csv", sep=",").dropna(
Collaborator:

Aren't there many unused columns? They could be dropped right at data loading.
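A minimal sketch of that suggestion against the snippet above (the column names here are illustrative, not the dataset's actual schema):

import pandas as pd

# Parse only the columns the benchmark actually uses and drop incomplete rows
data = pd.read_csv(
    data_path / "data_raw.csv",
    usecols=["aryl_halide", "base", "ligand", "yield"],
).dropna()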

Collaborator:

Will change them in the upcoming suggestion.

Comment on lines +33 to +35
data = get_data()

target_tasks = ["1-iodo-4-methoxybenzene"]
source_tasks = [
    # Dissimilar source task
    "1-chloro-4-(trifluoromethyl)benzene"
]
Collaborator:

Executing these lines directly in the module scope is suboptimal
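For illustration, a minimal sketch of deferring this work behind a function (the function name and return type are hypothetical; get_data and the task strings come from the snippet above):

import pandas as pd


def load_benchmark_tasks() -> tuple[pd.DataFrame, list[str], list[str]]:
    """Load the data and define target/source tasks only when actually needed."""
    data = get_data()
    target_tasks = ["1-iodo-4-methoxybenzene"]
    source_tasks = ["1-chloro-4-(trifluoromethyl)benzene"]  # dissimilar source task
    return data, target_tasks, source_tasks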

Collaborator:

Will be gone in the upcoming suggestion

]


def space_data() -> (
Collaborator:

This is a typical example of the "now that my prototype is done, I wrap my spaghetti-style script in a function and just return everything that is needed" style of function 🙃 ... which pretty much defeats the purpose of having it as a function in the first place. Can we structure this a bit more elegantly?

Collaborator:

I am currently designing it like that since this is a nice, elegant pattern that can then easily be used in other examples:
[image: code screenshot]
Is this more to your liking?

Comment on lines +77 to +76
lookup = data.query("aryl_halide.isin(@target_tasks)").copy(deep=True)
initial_data = data.query("aryl_halide.isin(@source_tasks)", engine="python").copy(
    deep=True
)
Collaborator:

I think I don't get the purpose of these copies here

Collaborator:

I also think that it won't be necessary, will investigate.

Comment on lines +121 to +154
for p in [0.01, 0.05, 0.1]:
    results.append(
        simulate_scenarios(
            {f"{int(100 * p)}": campaign},
            lookup,
            initial_data=[
                initial_data.sample(frac=p) for _ in range(settings.n_mc_iterations)
            ],
            batch_size=settings.batch_size,
            n_doe_iterations=settings.n_doe_iterations,
            impute_mode="error",
        )
    )
# No training data
results.append(
    simulate_scenarios(
        {"0": campaign},
        lookup,
        batch_size=settings.batch_size,
        n_doe_iterations=settings.n_doe_iterations,
        n_mc_iterations=settings.n_mc_iterations,
        impute_mode="error",
    )
)
# Non-TL campaign
results.append(
    simulate_scenarios(
        {"non-TL": Campaign(searchspace=searchspace_nontl, objective=objective)},
        lookup,
        batch_size=settings.batch_size,
        n_doe_iterations=settings.n_doe_iterations,
        n_mc_iterations=settings.n_mc_iterations,
        impute_mode="error",
    )
)
Collaborator:

I guess this boilerplate code can be condensed by a lot using partials and some nicer list constructions
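For illustration, a sketch of that condensation with functools.partial (written against the snippet above; assumes simulate_scenarios accepts these keywords exactly as shown there):

from functools import partial

# Bind the arguments shared by all scenarios once
simulate = partial(
    simulate_scenarios,
    batch_size=settings.batch_size,
    n_doe_iterations=settings.n_doe_iterations,
    impute_mode="error",
)

results = [
    simulate(
        {f"{int(100 * p)}": campaign},
        lookup,
        initial_data=[
            initial_data.sample(frac=p) for _ in range(settings.n_mc_iterations)
        ],
    )
    for p in [0.01, 0.05, 0.1]
]
# The two baseline scenarios differ only in the campaign used
for name, campaign_ in [
    ("0", campaign),
    ("non-TL", Campaign(searchspace=searchspace_nontl, objective=objective)),
]:
    results.append(
        simulate({name: campaign_}, lookup, n_mc_iterations=settings.n_mc_iterations)
    )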

Collaborator:

Already on it :)

@AVHopp force-pushed the feature/benchmark_cases branch from 277ccee to e994f44 on March 12, 2025 14:55
@AVHopp force-pushed the feature/benchmark_cases branch from e994f44 to e35c0d0 on March 12, 2025 14:56
@AVHopp force-pushed the feature/benchmark_cases branch from 5deb353 to 238b082 on March 12, 2025 15:48
@AVHopp force-pushed the feature/benchmark_cases branch from e29bc15 to 31fa3c0 on March 12, 2025 16:15