Add benchmarks #491
base: main
Conversation
@Hrovatin
""" | ||
# TODO change path | ||
data_dir = ( | ||
"/Users/karinhrovatin/Documents/code/" + "BayBE_benchmark/domains/ArylHalides/" |
Where to read data from?
This data is from BayBE_benchmark.
- Should I instead use the data in the example/Backtesting folder in lookup.xlsx and copy the missing files of the other datasets there?
Yes, I think it would be consistent to move all datasets into a new subfolder in the benchmark module and change the paths in the corresponding example(s).
One other thought: we used .xlsx for that particular example, but it doesn't render nicely on GitHub, and afaik the size doesn't really require that compression. So we might want to consider simply turning it into a plain .csv.
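A minimal sketch of such a conversion (paths are illustrative, and openpyxl is assumed to be installed for reading .xlsx):

import pandas as pd

# One-off conversion of the Excel lookup into a plain CSV (paths illustrative).
pd.read_excel("examples/Backtesting/lookup.xlsx").to_csv(
    "examples/Backtesting/lookup.csv", index=False
)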
I added the data to benchmarks/data but didn't adjust the file format and examples yet. So let me know if the above consideration was meant as something that should be done or just as a discussion idea.
I would do it, but let's wait for the others' opinions.
Given that there is now the corresponding utility, is this comment here still relevant?
Now we have two copies of the arylhalides dataset in the repo, which certainly makes no sense. So I guess the one from the examples folder should vanish now. This means we need a clean way to access benchmarking data from all places in the repo.
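One possible shape for such an accessor, sketched under the assumption that all datasets end up under benchmarks/data/<Dataset>/ (the function and file names are illustrative, not the PR's actual code):

from pathlib import Path

import pandas as pd

# Hypothetical helper, e.g. in benchmarks/data/utils.py; resolves paths relative
# to the module file so it works regardless of the caller's working directory.
DATA_PATH = Path(__file__).resolve().parent


def load_benchmark_data(dataset: str, filename: str = "data.csv") -> pd.DataFrame:
    """Load a benchmark dataset by its folder name, e.g. 'ArylHalides'."""
    return pd.read_csv(DATA_PATH / dataset / filename)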
Should this be done in this PR already, or can we consider this a clean-up in a follow-up PR?
    objective=objective,
)

results = []
How should TL benchmarks be set up? E.g., should we always include a comparison to no TL and different source data proportions?
So far, the TL benchmarks have shown the effect of different amounts of source data. I don't think there is a general recipe for how many points to use in the general case; we simply have to decide it case by case, and I would fully leave this to you. To me, e.g. for the chemistry cases, using 1, 2, 5, 10 and 20 % of source data seemed reasonable. We can discuss whether it needs that many variants or if just 3 would also suffice.
Baseline 1: No TL
Then this should be compared to some baseline, and I also have a new thought for that. We used to compare this simply to 0 % source data used, and I still think this is reasonable. When we add 0 data, the task parameter is not even required. This allows for 2 variants: 0 data added and no task parameter, and 0 data added with a task parameter present. Perhaps one could skip one of the two variants, but I still think that if the performance of these two variants differs strongly, it might indicate fit problems we need to solve. So I would vote for having both of these variants in the plot as well.
Baseline 2: Naive TL
And here comes the additional idea, although I don't really know yet whether we can simply integrate that into a plot together with the one above. Thinking about what the baseline for TL is: no TL is one possible choice, as discussed above. But there is another option: naive TL. Naive TL would just mean: do not use a task parameter and completely discard the parameter that corresponds to the task; just add the source data, acting as if there were no difference between the tasks. This comes with complications:
- Here we could also vary the amount of source data used, so it's not just 1 line added to the above plot.
- We could copy the settings from above though and produce a stacked plot consisting of actual TL on top and naive TL on bottom, thoughts?
- You might have to be careful with the allow_recommending_* flags, since in a TL dataset that discards the task parameter, several data points are degenerate; the recommender should be allowed to recommend previously recommended and measured points again (see the sketch at the end of this comment).
Baseline 3: Explicitly Modelled TL
This last thought can even be extended: there is also the variant of doing explicit TL, by modelling the task parameter as a numerical parameter or a substance parameter with restricting active_values, whatever corresponds to the task (of course this is not possible for a complex example where we don't know the parameters differentiating the tasks). This would then even correspond to a third task... Not sure if we really want that, but conceptually it would make sense to me. Please share your thoughts @AdrianSosic @AVHopp @Hrovatin
Note on cost
From my experience so far, baselines 2 and 3 are much cheaper compared to the campaigns with task parameters present; I think they would maybe make up 10-20 % of the runtime / cost even if they fully replicate the settings, like the number of source points used etc.
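For concreteness, a minimal sketch of how such a naive-TL scenario could be assembled, assuming the names from the benchmark code in this PR (data, source_tasks, lookup, settings, searchspace_nontl, objective) and simply dropping the task column from the source data; the 5 % fraction is just a placeholder:

# Sketch only: "naive TL" = no task parameter; source data is added as if it
# belonged to the target task.
naive_source = data[data["aryl_halide"].isin(source_tasks)].drop(
    columns=["aryl_halide"]
)

results.append(
    simulate_scenarios(
        {"naive-TL": Campaign(searchspace=searchspace_nontl, objective=objective)},
        lookup,
        initial_data=[
            naive_source.sample(frac=0.05) for _ in range(settings.n_mc_iterations)
        ],
        batch_size=settings.batch_size,
        n_doe_iterations=settings.n_doe_iterations,
        impute_mode="error",
    )
)
# Note: the allow_recommending_* settings may need to permit re-recommending
# already measured points, since dropping the task column makes points degenerate.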
Baseline 1: Both are already added
We could copy the settings from above though and produce a stacked plot consisting of actual TL on top and naive TL on bottom, thoughts?
Not sure what you mean here
I think Baseline 3 could be useful if such data were generally available. But since it isn't, I don't find such a benchmark practically useful.
Is this thread still relevant or can it be resolved?
Have any changes been made to the baselines?
Did not check - I was just going through all of the open comments and saw that there had been no update on this for three weeks, so I thought I'd simply ask :D Let me digest this in more detail.
Ok, now I get what you mean here. My opinion: these benchmarks should only contain baselines of type 1.
Reason: this makes the scope of these benchmarks very clear: compare the influence of transfer learning with respect to the TaskParameter, that is, investigate the influence of this parameter and how actually using it by adding more data influences the behavior.
I think the other baselines are also interesting, but they have a different scope, and if we decide to add them, I would rather add them in a separate benchmark which might then e.g. contain a more focused test - like "adding {10,20,30} points naively vs. adding {10,20,30} points properly". This would then essentially test the "proper" TL with the "naive" variants serving as the baselines.
Opinions?
I think the actual baseline for the general TL setting is variant 2 and not any form of variant 1. However, I can see that it is beyond this current PR to add it for all TL benchmarks, and we can postpone that.
Variant 3 is imo optional and also benchmark-dependent, if even possible.
One downside if we don't change it now, though: the plots won't be 100% comparable after the change.
Also open to changing this to variant 2 - we just need to align on a definition of "baseline", and I think multiple are feasible. What is @AdrianSosic's opinion here?
test_task = "1-iodo-4-methoxybenzene"
source_task = [
    # Dissimilar source task
Not saying we should run every possible combo here, but how was this particular choice selected? Why not several source tasks? Is there any chance this is a lucky/unlucky example?
Is this thread still relevant or can it be resolved?
I somehow have a bad feeling about just picking one combination, but I also don't know a solution other than expanding the benchmarks to more combinations.
We could rename the benchmark accordingly - this could be the ArylHalides_1I4M-1C4(T)B-Benchmark (or similar). This would make it clear what is being tested, we could easily extend this, and it would actually open up the way to more benchmarks here that could be added later and would actually reuse the utilities. Opinions @Scienfitz and also @AdrianSosic ?
I can't parse that string as apparently I forgot the rule already, but in general sub-naming the benchmark(s) makes sense.
So, as a compromise: can we select 3 interesting variants based on the cluster plot above and just have three ArylHalides benchmarks?
Will propose something
    ConvergenceBenchmarkSettings,
)

DIMENSION = 4  # input dimensionality of the function
@AdrianSosic , @Scienfitz : There was the wish to also have a continuous TL example, and Michalewicz might be the best choice for that.
My question: how do we set this example up? Given that this is continuous, we cannot do something like "X % of the available points". My suggestion: take a given number of source points (tbd which numbers are reasonable) and have "no TaskParameter; TaskParameter with {0, 5, 15, 20, 50} points". Gucci?
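For illustration, one way the source data could be generated for a fixed number of points (a sketch only; the shifted "source" variant of Michalewicz, the column names, and the task value are assumptions, not the PR's code):

import numpy as np
import pandas as pd

DIMENSION = 4  # matches the benchmark's input dimensionality


def michalewicz(x: np.ndarray, m: int = 10) -> float:
    """Standard Michalewicz function on [0, pi]^d."""
    i = np.arange(1, len(x) + 1)
    return -np.sum(np.sin(x) * np.sin(i * x**2 / np.pi) ** (2 * m))


def make_source_data(n_points: int, shift: float = 0.1, seed: int = 0) -> pd.DataFrame:
    """Evaluate a slightly shifted Michalewicz variant at random inputs."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.0, np.pi, size=(n_points, DIMENSION))
    y = np.array([michalewicz(row + shift) for row in x])
    df = pd.DataFrame(x, columns=[f"x_{i}" for i in range(DIMENSION)])
    df["Target"] = y
    df["Function"] = "source"  # value of the hypothetical TaskParameter
    return df


# one candidate source set per size under discussion
source_sets = {n: make_source_data(n) for n in (0, 5, 15, 20, 50)}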
15 and 20 seem like a strange stride. I suggest {0, 10, 25, 50, 100} if it doesn't converge fast, or {0, 5, 10, 20, 50} if it is already converged and there's no difference between 50 and 100.
Specific numbers are still tbd, will test a bit
First batch of comments; they only regard one benchmark file but probably apply to the others as well.
@@ -36,6 +36,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- `SubstanceParameter`, `CustomDiscreteParameter` and `CategoricalParameter` now also
  support restricting the search space via `active_values`, while `values` continue to
  identify allowed measurement inputs
- Additional benchmarks
Can we also list the names after Additional benchmarks:
benchmarks/data/utils.py
from pathlib import Path

DATA_PATH = Path(*Path(__file__).parts[:-1])
This is a very weird way to construct the path. There's, for example, the `parent` method.
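For reference, the equivalent but more readable construction would be:

from pathlib import Path

# Same directory as Path(*Path(__file__).parts[:-1]), expressed via .parent
DATA_PATH = Path(__file__).parent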
Note that it might be necessary to have a somewhat weird construction, since this needs to run both locally and in the pipeline - I know that I got strange paths as a result of that previously. But I will try to come up with something more reasonable.
Will add a version that works locally but keep this open until it has been confirmed to also work in the pipeline.
Why is this file called `data_raw.csv` but the other `data.csv`?
Will rename files properly
    Data for benchmark.
    """
    data_path = DATA_PATH / "ArylHalides"
    data = pd.read_table(data_path / "data_raw.csv", sep=",").dropna(
Aren't there many unused columns? They could be dropped right at data loading.
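For example, something along these lines (the column names are placeholders, not necessarily those of the dataset):

# Only read the columns the benchmark actually uses (names are illustrative).
USED_COLUMNS = ["aryl_halide", "base", "ligand", "additive", "yield"]
data = pd.read_csv(data_path / "data_raw.csv", usecols=USED_COLUMNS).dropna()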
They will be changed in the upcoming suggestion.
data = get_data()

target_tasks = ["1-iodo-4-methoxybenzene"]
source_tasks = [
    # Dissimilar source task
    "1-chloro-4-(trifluoromethyl)benzene"
]
Executing these lines directly in the module scope is suboptimal
Will be gone in the upcoming suggestion
]


def space_data() -> (
This is a typical example of the "Now that my prototype is done, I wrap my spaghetti-style script in a function and just return everything that is needed"-style of function 🙃 ... which pretty much defeats the purpose of having it as a function in the first place. Can we structure this a bit more elegantly?
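Purely as an illustration of one possible restructuring (names and signatures are assumptions, since the full function is not visible in this diff):

import pandas as pd


def make_lookup(data: pd.DataFrame, target_tasks: list[str]) -> pd.DataFrame:
    """Target-task measurements serving as the lookup table."""
    return data[data["aryl_halide"].isin(target_tasks)]


def make_initial_data(data: pd.DataFrame, source_tasks: list[str]) -> pd.DataFrame:
    """Source-task measurements used as transfer-learning initial data."""
    return data[data["aryl_halide"].isin(source_tasks)]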
lookup = data.query("aryl_halide.isin(@target_tasks)").copy(deep=True)
initial_data = data.query("aryl_halide.isin(@source_tasks)", engine="python").copy(
    deep=True
)
I think I don't get the purpose of these copies here
I also think that it won't be necessary, will investigate.
for p in [0.01, 0.05, 0.1]:
    results.append(
        simulate_scenarios(
            {f"{int(100 * p)}": campaign},
            lookup,
            initial_data=[
                initial_data.sample(frac=p) for _ in range(settings.n_mc_iterations)
            ],
            batch_size=settings.batch_size,
            n_doe_iterations=settings.n_doe_iterations,
            impute_mode="error",
        )
    )
# No training data
results.append(
    simulate_scenarios(
        {"0": campaign},
        lookup,
        batch_size=settings.batch_size,
        n_doe_iterations=settings.n_doe_iterations,
        n_mc_iterations=settings.n_mc_iterations,
        impute_mode="error",
    )
)
# Non-TL campaign
results.append(
    simulate_scenarios(
        {"non-TL": Campaign(searchspace=searchspace_nontl, objective=objective)},
        lookup,
        batch_size=settings.batch_size,
        n_doe_iterations=settings.n_doe_iterations,
        n_mc_iterations=settings.n_mc_iterations,
        impute_mode="error",
    )
I guess this boilerplate code can be condensed by a lot using partials and some nicer list constructions
Already on it :)
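One possible condensed version, sketched from the snippet above with functools.partial (not necessarily the final code; all other names are taken from the snippet):

from functools import partial

# Fix the keyword arguments shared by all scenario calls once.
simulate = partial(
    simulate_scenarios,
    batch_size=settings.batch_size,
    n_doe_iterations=settings.n_doe_iterations,
    impute_mode="error",
)

results = [
    simulate(
        {f"{int(100 * p)}": campaign},
        lookup,
        initial_data=[
            initial_data.sample(frac=p) for _ in range(settings.n_mc_iterations)
        ],
    )
    for p in (0.01, 0.05, 0.1)
]
results += [
    simulate({"0": campaign}, lookup, n_mc_iterations=settings.n_mc_iterations),
    simulate(
        {"non-TL": Campaign(searchspace=searchspace_nontl, objective=objective)},
        lookup,
        n_mc_iterations=settings.n_mc_iterations,
    ),
]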
Rebase onto main
Otherwise we get the following error when running the benchmark actions: `Missing optional dependency 'openpyxl'.`
Co-authored-by: Martin Fitzner <[email protected]>
EDIT: The scenarios are now TL (`TL`), TL with no source data (`TL-noSource`), and a non-TL model with only target data (`non-TL`); originally they were a run with no source data (`0`) and one with a non-TL model using only task data (`nonTL`).
OLD:
Started by adding a direct arylation TL campaign with Temp as task, adapted from the paper.
Will continue adding more. If someone could already check whether this goes in the right direction, that would be great, so that I do not repeat the mistakes for the others.
Things to discuss: