Hardness benchmark #440
Open · wants to merge 10 commits into main
266 changes: 266 additions & 0 deletions benchmarks/domains/Hardness.py
Reviewer comment: file should be called hardness.py

@@ -0,0 +1,266 @@
# Hardness benchmarking, a maximization task on an experimental hardness dataset.
Reviewer comment: You have a longer explanation in the __main__ section of the script. That should go here instead, and you can then print it from __main__ via the __doc__ attribute of the file. Overall, it should become very clear from the text what the goal of this benchmark is. Details that are not relevant to understanding the overall task, e.g. how exactly the data is loaded (for example, that you consider only compositions with more than 5 points), should not be mentioned here but in their respective code sections. For the data loading, for example, you'd add a data loading function whose docstring/comments explain it.
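A minimal sketch of that suggestion (the docstring wording here is illustrative only, not the final text):

"""Hardness benchmark: a maximization task on an experimental hardness dataset.

The goal is to find the composition maximizing the integrated hardness, obtained
by integrating a cubic-spline fit of each composition's hardness-vs-load curve.
"""

if __name__ == "__main__":
    print(__doc__)  # prints the module docstring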


from __future__ import annotations

import os

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scipy as sp
from pandas import DataFrame

from baybe.campaign import Campaign
from baybe.parameters import NumericalDiscreteParameter, TaskParameter
from baybe.recommenders.pure.nonpredictive.sampling import RandomRecommender
from baybe.searchspace import SearchSpace
from baybe.simulation import simulate_scenarios
from baybe.targets import NumericalTarget
from benchmarks.definition.convergence import (
ConvergenceBenchmark,
ConvergenceBenchmarkSettings,
)

# Set up directory and load datasets
Reviewer comment: General comment: you execute all these commands in the main scope of the module, which is suboptimal. Please split the logic into meaningful pieces and extract them into reasonable functions, e.g. one for data loading, one for data pre-processing (spline interpolation), etc.
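As a rough sketch of such a decomposition (function names, signatures, and the integration bounds are suggestions, not part of the PR):

from pathlib import Path

import pandas as pd
import scipy as sp


def load_datasets(data_dir: Path) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Load the MP bulk modulus and experimental hardness datasets."""
    df_mp = pd.read_csv(data_dir / "mp_bulkModulus_goodOverlap.csv", index_col=0)
    df_exp = pd.read_csv(data_dir / "exp_hardness_goodOverlap.csv", index_col=0)
    return df_mp, df_exp


def integrate_hardness(df_exp: pd.DataFrame, lower: float, upper: float) -> pd.DataFrame:
    """Integrate a cubic-spline fit of each composition's hardness-vs-load curve."""
    element_cols = df_exp.columns.to_list()[4:]
    parts = []
    for _, group in df_exp.groupby("composition"):
        group = group.sort_values("load").drop_duplicates("load")
        if len(group) < 5:  # too few points for a meaningful spline fit
            continue
        spline = sp.interpolate.CubicSpline(group["load"], group["hardness"])
        summary = group[["strComposition", "composition"] + element_cols].head(1).copy()
        summary["integratedHardness"] = spline.integrate(lower, upper, extrapolate=True)
        parts.append(summary)
    return pd.concat(parts, ignore_index=True)

This also demonstrates the list-collect-then-concat pattern suggested further down.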

home_dir = os.getcwd()
Reviewer comment: The data should not live here but in a separate folder, probably under /benchmarks/data/hardness/

# Materials Project (MP) bulk modulus dataset
df_mp = pd.read_csv(
os.path.join(home_dir, "benchmarks", "domains", "mp_bulkModulus_goodOverlap.csv"),
Reviewer comment: Avoid os.path; please use pathlib.Path instead for path manipulations.
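For example (a sketch; DATA_DIR and its location are assumptions following the folder suggestion above):

from pathlib import Path

DATA_DIR = Path("benchmarks") / "data" / "hardness"
df_mp = pd.read_csv(DATA_DIR / "mp_bulkModulus_goodOverlap.csv", index_col=0)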

index_col=0,
)
# Experimental (Exp) hardness dataset
df_exp = pd.read_csv(
os.path.join(home_dir, "benchmarks", "domains", "exp_hardness_goodOverlap.csv"),
index_col=0,
)
element_cols = df_exp.columns.to_list()[4:]

# Initialize an empty dataframe to store the integrated hardness values
df_exp_integrated_hardness = pd.DataFrame()

# For each unique composition in df_exp, make a cubic spline interpolation of the hardness vs load curve
for composition_i in df_exp["composition"].unique():
composition_subset = df_exp[df_exp["composition"] == composition_i]
# Sort the data by load
composition_subset = composition_subset.sort_values(by="load")
composition_subset = composition_subset.drop_duplicates(subset="load")
Reviewer comment: What if there are multiple identical load values where the other column values differ?
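One possible remedy, sketched: average the hardness over duplicate loads instead of keeping an arbitrary row (the non-numeric columns would then have to be re-attached separately):

# Hypothetical alternative to drop_duplicates: aggregate duplicate loads
composition_subset = (
    composition_subset.groupby("load", as_index=False)["hardness"].mean()
)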

if len(composition_subset) < 5: # Continue to the next composition
continue

# Perform cubic spline interpolation of the hardness vs load curve
spline = sp.interpolate.CubicSpline(
composition_subset["load"], composition_subset["hardness"]
)
# Integrate the spline from the minimum load to the maximum load
integrated_value = spline.integrate(0.5, 5, extrapolate=True)
Reviewer comment: Avoid hard-coded magic numbers like 0.5 or 5. Either these are user-defined constants and should be declared as such at the top of the file, or they are derived from some other quantities/data.
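E.g., declared at the top of the file (the names are suggestions):

LOAD_LOWER = 0.5  # lower integration bound for the hardness-vs-load spline
LOAD_UPPER = 5.0  # upper integration bound

integrated_value = spline.integrate(LOAD_LOWER, LOAD_UPPER, extrapolate=True)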


# Make a new dataframe with the element_cols from composition_subset
composition_summary = composition_subset[
["strComposition", "composition"] + element_cols
]
composition_summary = composition_summary.drop_duplicates(subset="composition")
composition_summary["integratedHardness"] = integrated_value

df_exp_integrated_hardness = pd.concat(
Reviewer comment: Loop-based dataframe concatenation is an antipattern since it is inefficient. Collect the individual parts in a list and concatenate after the loop.
[df_exp_integrated_hardness, composition_summary]
)
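Sketched, with the per-composition loop body elided (composition_summary is built exactly as in the loop above):

summaries: list[pd.DataFrame] = []
for composition_i in df_exp["composition"].unique():
    # ... build composition_summary for this composition, as above ...
    summaries.append(composition_summary)
df_exp_integrated_hardness = pd.concat(summaries, ignore_index=True)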

# ----- Target function (integrated hardness) -----
Reviewer comment: You have these "headings" all over the place, which is a result of having all the logic dumped sequentially into the main scope of the module. As mentioned in another comment, once you tear the logic apart into individual functions, the need for these headings should naturally disappear.

df_searchspace_target = df_exp_integrated_hardness[element_cols]
df_searchspace_target["Function"] = "targetFunction"

# Make a lookup table for the task function (integrate hardness) - add the 'integratedHardness' column
df_lookup_target = pd.concat(
[df_searchspace_target, df_exp_integrated_hardness["integratedHardness"]], axis=1
)
df_lookup_target = df_lookup_target.rename(columns={"integratedHardness": "Target"})

# ----- Source function (voigt bulk modulus) -----
df_searchspace_source = df_mp[element_cols]
df_searchspace_source["Function"] = "sourceFunction"

# Make a lookup table for the source function (voigt bulk modulus) - add the 'vrh' column
df_lookup_source = pd.concat([df_searchspace_source, df_mp["vrh"]], axis=1)
df_lookup_source = df_lookup_source.rename(columns={"vrh": "Target"})

# Combine the search space
df_searchspace = pd.concat([df_searchspace_target, df_searchspace_source])


def hardness(settings: ConvergenceBenchmarkSettings) -> DataFrame:
"""Integrated hardness benchmark, compares across random, default, and no task parameter set up

Inputs:
B discrete {0.8, 0.66666667, 0.92307692 ...} |B| = 13
Sc discrete {0., 0.00384615, 0.01923077 ...} |Sc| = 26
Cr discrete {0.01, 0.06, 0.1 ...} |Cr| = 20
Y discrete {0., 0.07307692, 0.05769231 ...} |Y| = 31
Zr discrete {0., 0.07307692, 0.05769231 ...} |Zr| = 19
Gd discrete {0., 0.03968254, 0.01587302 ...} |Gd| = 12
Hf discrete {0., 0.008, 0.02 ...} |Hf| = 13
Ta discrete {0., 0.006, 0.008 ...} |Ta| = 17
W discrete {0.19, 0.14, 0.1 ...} |W| = 30
Re discrete {0., 0.2, 0.33333 ...} |Re| = 15
Output: discrete
Objective: maximization
"""
parameters = []
parameters_no_task = []

# For each column in df_searchspace except the last one, create a NumericalDiscreteParameter
for column in df_searchspace.columns[:-1]:
discrete_parameter_i = NumericalDiscreteParameter(
name=column,
values=np.unique(df_searchspace[column]),
tolerance=0.0,
)
parameters.append(discrete_parameter_i)
parameters_no_task.append(discrete_parameter_i)

task_parameter = TaskParameter(
name="Function",
values=["targetFunction", "sourceFunction"],
active_values=["targetFunction"],
)
parameters.append(task_parameter)

searchspace = SearchSpace.from_dataframe(df_searchspace, parameters=parameters)
searchspace_no_task = SearchSpace.from_dataframe(
df_searchspace_target[element_cols], parameters=parameters_no_task
)

objective = NumericalTarget(name="Target", mode="MAX").to_objective()

scenarios: dict[str, Campaign] = {
"Random Recommender": Campaign(
searchspace=SearchSpace.from_dataframe(
df_searchspace_target[element_cols], parameters=parameters_no_task
),
recommender=RandomRecommender(),
objective=objective,
),
"Default Recommender": Campaign(
searchspace=searchspace,
objective=objective,
),
"No Task Parameter": Campaign(
searchspace=searchspace_no_task,
objective=objective,
),
}

return simulate_scenarios(
scenarios,
df_lookup_target,
batch_size=settings.batch_size,
n_doe_iterations=settings.n_doe_iterations,
n_mc_iterations=settings.n_mc_iterations,
impute_mode="error",
)


def hardness_transfer_learning(settings: ConvergenceBenchmarkSettings) -> DataFrame:
"""Integrated hardness benchmark, transfer learning with different initialized data sizes

Inputs:
B discrete {0.8, 0.66666667, 0.92307692 ...} |B| = 13
Sc discrete {0., 0.00384615, 0.01923077 ...} |Sc| = 26
Cr discrete {0.01, 0.06, 0.1 ...} |Cr| = 20
Y discrete {0., 0.07307692, 0.05769231 ...} |Y| = 31
Zr discrete {0., 0.07307692, 0.05769231 ...} |Zr| = 19
Gd discrete {0., 0.03968254, 0.01587302 ...} |Gd| = 12
Hf discrete {0., 0.008, 0.02 ...} |Hf| = 13
Ta discrete {0., 0.006, 0.008 ...} |Ta| = 17
W discrete {0.19, 0.14, 0.1 ...} |W| = 30
Re discrete {0., 0.2, 0.33333 ...} |Re| = 15
Output: discrete
Objective: maximization
"""
parameters = []

# For each column in df_searchspace except the last one, create a NumericalDiscreteParameter
for column in df_searchspace.columns[:-1]:
discrete_parameter_i = NumericalDiscreteParameter(
name=column,
values=np.unique(df_searchspace[column]),
tolerance=0.0,
)
parameters.append(discrete_parameter_i)

task_parameter = TaskParameter(
name="Function",
values=["targetFunction", "sourceFunction"],
active_values=["targetFunction"],
)
parameters.append(task_parameter)

objective = NumericalTarget(name="Target", mode="MAX").to_objective()

searchspace = SearchSpace.from_dataframe(df_searchspace, parameters=parameters)

# Separate campaigns for different initial data sizes, preventing unintended data overwriting
scenarios: dict[str, Campaign] = {
f"{n} Initial Data": Campaign(searchspace=searchspace, objective=objective)
for n in (2, 4, 6, 30)
}

# Create an iterable of datasets with different initial sizes
initial_data_sets = [df_lookup_source.sample(n) for n in (2, 4, 6, 30)]

return simulate_scenarios(
scenarios,
df_lookup_target,
initial_data=initial_data_sets,
batch_size=settings.batch_size,
n_doe_iterations=settings.n_doe_iterations,
impute_mode="error",
)


benchmark_config = ConvergenceBenchmarkSettings(
batch_size=1,
n_doe_iterations=20,
n_mc_iterations=5,
)

hardness_benchmark = ConvergenceBenchmark(
function=hardness,
settings=benchmark_config,
optimal_target_values=None,
Reviewer comment: How come we don't know the optimal value? This should be clear from the dataset, no?
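Indeed, since the lookup table is exhaustive, the optimum could be read off directly, e.g. (a sketch, assuming optimal_target_values accepts a mapping from target name to value):

optimal_target_values={"Target": df_lookup_target["Target"].max()},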

)

hardness_transfer_learning_benchmark = ConvergenceBenchmark(
function=hardness_transfer_learning,
settings=benchmark_config,
optimal_target_values=None,
)

if __name__ == "__main__":
# Describe the benchmark task
print(
"Hardness benchmark is a maximization task on an experimental hardness dataset. "
)
print(
"The dataset is down-selected to the 94 compositions with more than 5 hardness values. "
)
print(
"The hardness values are integrated using cubic spline interpolation, and the task is to maximize the integrated hardness. \n"
)
print(
"The hardness benchmark compares random, default, and no-task-parameter setups. \n"
)
print(
"The hardness transfer learning benchmark compares different initial data sizes. "
)

# Visualize the Hardness value histogram
fig, ax = plt.subplots(
1, 1, figsize=(8, 5), facecolor="w", edgecolor="k", constrained_layout=True
)

# Plot a histogram of the hardness values
ax.hist(df_exp["hardness"], bins=20)
ax.set_xlabel("Hardness")
ax.set_ylabel("Frequency")
ax.set_title("Integrated Hardness Distribution")
ax.grid()
Reviewer comment: plt.show() is missing.

26 changes: 16 additions & 10 deletions benchmarks/domains/__init__.py
Reviewer comment: Changelog entry is missing.

Reviewer comment: A general comment: the basic code requirements are not yet met, because it seems you haven't installed the pre-commit hooks while developing. Please:

  • Run the hooks (you can also trigger them manually via pre-commit run --all-files) and fix the problems
  • Run mypy and fix the typing issues.

You can also find more information here: https://emdgroup.github.io/baybe/stable/misc/contributing_link.html

@@ -1,10 +1,16 @@
"""Benchmark domains."""

from benchmarks.definition.base import Benchmark
from benchmarks.domains.synthetic_2C1D_1C import synthetic_2C1D_1C_benchmark

BENCHMARKS: list[Benchmark] = [
synthetic_2C1D_1C_benchmark,
]

__all__ = ["BENCHMARKS"]
"""Benchmark domains."""

from benchmarks.definition.base import Benchmark
from benchmarks.domains.Hardness import (
hardness_benchmark,
hardness_transfer_learning_benchmark,
)
from benchmarks.domains.synthetic_2C1D_1C import synthetic_2C1D_1C_benchmark

BENCHMARKS: list[Benchmark] = [
synthetic_2C1D_1C_benchmark,
hardness_benchmark,
hardness_transfer_learning_benchmark,
]

__all__ = ["BENCHMARKS"]