Benchmarks feature requests #493

Closed · 3 tasks done
Hrovatin opened this issue Feb 24, 2025 · 11 comments

Hrovatin commented Feb 24, 2025

Created this issue to keep track of things I would like to see for the benchmarks. I may add more topics in the future as needed.

  • The planned 12 h limit will not suffice for benchmarks that would be as comprehensive as my local tests w.r.t. the number of MC iterations and domains. Instead, running each benchmark case in parallel (with a 6/12 h limit each) may be nice.
  • Would it be possible to log during a benchmark run which benchmark class is currently running and how long it has been running? Otherwise, when the run terminates due to the time limit, it is hard to figure out why it did.
  • Since the same datasets may be used across benchmarks (e.g. for features on different branches), I would really like easier re-use of lookups & search spaces. I made a quick & dirty implementation in my own code, but having this in a more general form could be beneficial:
    E.g. a general benchmark defining a data domain for TL that is then imported into a benchmark on a new feature branch (see the sketch after this list).
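
To make the re-use idea more concrete, here is a minimal sketch of a shared domain module that benchmark variants on different branches could import. All module, class, and column names below (`BenchmarkDomain`, `load_domain`, the CSV path) are hypothetical illustrations, not the actual benchmark code:

```python
# Hypothetical shared module, e.g. benchmarks/domains/shared/aryl_halides.py
# (illustrative layout only, not the actual benchmark module structure)
from __future__ import annotations

from dataclasses import dataclass

import pandas as pd


@dataclass(frozen=True)
class BenchmarkDomain:
    """Bundles the reusable pieces of a benchmark: the lookup data and its column roles."""

    lookup: pd.DataFrame
    parameter_columns: list[str]
    target_column: str


def load_domain(csv_path: str = "data/aryl_halides.csv") -> BenchmarkDomain:
    """Load the lookup table once so every benchmark variant shares the same domain."""
    lookup = pd.read_csv(csv_path)  # assumed lookup table with a "yield" target column
    return BenchmarkDomain(
        lookup=lookup,
        parameter_columns=[c for c in lookup.columns if c != "yield"],
        target_column="yield",
    )


# On a feature branch, a benchmark would then only import the shared domain and
# vary the surrogate/kernel settings it wants to test, e.g.:
#
#     from benchmarks.domains.shared.aryl_halides import load_domain
#     domain = load_domain()
#     campaign = build_campaign(domain, kernel_preset="botorch-default")  # hypothetical helper
```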

@AdrianSosic @Scienfitz @AVHopp @fabianliebig

Scienfitz commented Feb 24, 2025

@AVHopp @fabianliebig, can you chime in on how we can increase the runtime or possibly achieve the parallelization requested above?

@Hrovatin
can you elaborate on what you mean by the last point? Why is there a kernel_presets folder in the domains folder?

@Hrovatin

kernel_presets is a folder using the botorch kernel presets for testing the botorch preset feature. As I understood it, we decided that I start running new-feature benchmarks on branches instead of locally, as I did before.

@Scienfitz

I don't think that's necessary. From what I understood: once we have all benchmarks implemented, we will have two results from:

  • main: runs all benchmarks with the current settings
  • another_branch: this branch just changes the default kernels in the code; it does not alter the benchmark code at all

Those two will be compared in the dashboard. No code adjustments to the benchmarks are needed.

@Hrovatin

Some features add new arguments to the code, so the benchmark must be changed. E.g. the botorch kernel factory was not added as a default but is used similarly to how one would use EDBO.

@Scienfitz

Well, this would result in a complicated way of comparing results. Why would you prefer that instead of just changing the default and triggering the benchmark action on the feature branch? Then we check the result and, depending on that, keep or drop the default. Also, even if the benchmark code changes, there is no reason to make copies and maintain the unchanged benchmarks in the same branch, as they always have their comparison in the reference branch. I think this makes the third point somewhat obsolete.

@Hrovatin

The issue arises, for example, when we add many small changes, which would mean that for each one we need to create a new branch and set it as the default (e.g. a StratifiedScaler that can optionally be used for the botorch MultiTaskGP). Then branch management gets really hard, as in the above example we would need to create two branches with the new MultiTaskGP feature, one with and one without the StandardScaler, and I would need to constantly make sure they are synchronised.

@Scienfitz

It is not intended to check every small change. Once per PR / feature proposal is fine, e.g. once the potential prior change is fully implemented.

I'm not entirely sure, but I think you can also compare them based on commits, so even if you wanted two snapshots from the same branch, that should be no problem.

@fabianliebig

Hi @Hrovatin, many thanks for those ideas. Sorry for my late reply. I have to confess (even though we have already talked) that I'm not sure I understand the full scope of your requirements. My thoughts on your points are as follows:

  • Increasing the runtime to at least 24 h per job is possible and only requires one additional line in the job description. However, I cannot yet say whether more than 24 h is feasible, since the GITHUB_TOKEN expires after that time period and I couldn't find clear documentation on whether that impacts our use case. If not, a runtime of up to 35 days is theoretically possible.
  • Parallelization is certainly possible. From what I saw regarding the CPU utilization, we should be safe running two benchmarks in one container. Besides that, we can also start as many containers as we want, since the workflow itself is completely independent, as long as the results are distinguishable by date, commit hash, name, or branch; otherwise, they overwrite each other. I will have to look into the details, but I plan to come up with more concrete ideas in the upcoming week.
  • We can log the name of the benchmark right before it starts if that helps. The simulation will provide a progress bar showing the number of performed iterations and, afterwards, the runtime. Moreover, the benchmarks are executed in the order of the list, so if you know how many finished in time (by observing the progress bar of the simulation package, for example), you can directly link that to the list's order.
  • You can also separate the results by commit; it might be hard to remember the hash, to be honest, but it would be an alternative to branch management. Would it help to have some kind of command-line option that can be used to separate things more clearly? FYI: you can also change the function description (docstring), as this will be stored and displayed in the dashboard, in case you need to describe a small code change for your observation.

Sorry for the long comment. Please let me know if I missed something regarding your requirements. We may also talk about your workflow at some point, as I have the impression that more local functionality for the benchmarking module would also help :)

@fabianliebig

I was curious and wanted to test what happens if a job exceeds 24 h. Well, the container just kept running, so I would guess it will work as long as the GITHUB_TOKEN is not used.

fabianliebig added a commit that referenced this issue Mar 3, 2025
Hi everyone, as one of the requested changes from #493, this PR sets the
GitHub-side runtime limit for the container to 24 hours. This refers to
the GITHUB_TOKEN generated per job, which expires after that time
period. I've also tested longer runtimes and found that we should also
be safe to go up to 35 days if necessary. Further changes regarding
parallelization may follow once discussed.
@fabianliebig

I've added a PR for basic logging of benchmark information to the INFO channel, including the runtime. Just for the record: please take the runtime with a grain of salt, as it may vary considerably.
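
For intuition only, a minimal sketch of what such logging could look like; the wrapper function, the `name` attribute, and the log format are assumptions, not the actual PR code:

```python
import logging
import time

logger = logging.getLogger(__name__)


def run_with_logging(benchmark) -> None:
    """Log the benchmark name before execution and its wall-clock runtime afterwards."""
    # `benchmark` is assumed to be a callable with a `name` attribute (illustrative only).
    logger.info("Starting benchmark '%s'", benchmark.name)
    start = time.perf_counter()
    benchmark()
    logger.info(
        "Finished benchmark '%s' after %.1f s", benchmark.name, time.perf_counter() - start
    )
```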

fabianliebig added a commit that referenced this issue Mar 5, 2025
…499)

This PR adds two lines for logging which benchmark is started and how long it took, with regard to #493.
I think the name and random seed are sufficient, but feel free to suggest more or less information.
An exemplary log looks like this:

![image](https://github.com/user-attachments/assets/3e627acb-e585-44f7-8c84-930ce5644edd)

@Hrovatin I kindly ask you to comment if that fits your needs. Thanks :)
fabianliebig self-assigned this Mar 5, 2025
@Scienfitz

I think we can close this issue (at the very latest after #491 is merged).

  • @AVHopp will finish Add benchmarks #491; an optional addition was maybe a handful of non-TL benchmarks. There we don't have to reinvent the wheel, just take some of the examples we already have or some from the old benchmark repo. This should have priority, as PRs like Acquisition function builder #490 are already excellent use cases for the whole benchmarking idea.
  • Thanks to @fabianliebig's changes, I believe we can tick off point 1.
  • I think we can also tick off point 3: if used correctly, the benchmark app allows direct comparison of curves. There is no immediate need for these reusable structures, since re-coding existing benchmarks is not necessary when the benchmarking action and app are used correctly.

AdrianSosic added a commit that referenced this issue Mar 12, 2025
Hi everyone, this PR adds benchmark parallelization as requested in #493
by using a matrix in the workflow. The container still needs to be
deployed separately, so two matrices are added. To start only a subset,
`__main__.py` was altered to take a command-line argument so that
existing domains don't have to be changed. The selection of different
benchmarks was a bit tricky, so the separation into groups may be removed
if it overcomplicates things.
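
Purely as an illustration, a minimal sketch of how such a command-line selection could look; the flag name, the grouping, and the benchmark registry below are assumptions, not the actual `__main__.py`:

```python
# Hypothetical excerpt of a benchmarks __main__.py; all names are illustrative only.
import argparse

# Assumed registry mapping group names to benchmark callables.
BENCHMARK_GROUPS: dict[str, list] = {
    "non_tl": [],  # e.g. synthetic single-task benchmarks
    "tl": [],      # e.g. transfer-learning benchmarks
}


def main() -> None:
    parser = argparse.ArgumentParser(description="Run a subset of the benchmarks.")
    parser.add_argument(
        "--benchmark-list",
        default="all",
        help="Comma-separated group names to run, or 'all'.",
    )
    args = parser.parse_args()

    groups = (
        list(BENCHMARK_GROUPS)
        if args.benchmark_list == "all"
        else args.benchmark_list.split(",")
    )
    for group in groups:
        for benchmark in BENCHMARK_GROUPS[group]:
            benchmark()  # each matrix job would invoke only its own group


if __name__ == "__main__":
    main()
```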

![image](https://github.com/user-attachments/assets/1f83bb81-7c52-47f2-acaa-775ae99f5964)
I also noticed that the average CPU utilization was about 48% for the
15 h benchmarks that ran recently:

![image](https://github.com/user-attachments/assets/e2e67e46-120d-463c-994a-c68dec10bd08)
So I reduced the requested compute from 16 vCPUs to 8.