Benchmarks feature requests #493

Closed · 3 tasks done
Hrovatin opened this issue Feb 24, 2025 · 11 comments

Hrovatin commented Feb 24, 2025

Created this issue to keep track of things I would like to see for the benchmarks. I may add more topics in the future as needed.

  • The planned 12 h limit will not suffice for benchmarks that would be as comprehensive as my local tests w.r.t. the number of MC iterations and domains. Instead, running each benchmark case in parallel (with a 6/12 h limit each) may be nice.
  • Would it be possible to log during a benchmark run which benchmark class is currently running and how long it has been running? Otherwise, when the run terminates due to the time limit, it is hard to figure out why it did.
  • Since the same datasets may be used across benchmarks (e.g. for features on different branches), I would really like easier re-use of lookups & search spaces. I made a quick & dirty implementation in my own code, but having this in a more general form could be beneficial:
    E.g. a general benchmark defining a data domain for TL that is then imported into a benchmark on a new feature branch (see the sketch after this list).
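
To make the re-use idea more concrete, here is a minimal sketch of a shared domain module that benchmark variants on different branches could import. All module, class, and column names below (`BenchmarkDomain`, `load_domain`, the CSV path) are hypothetical illustrations, not the actual benchmark code:

```python
# Hypothetical shared module, e.g. benchmarks/domains/shared/aryl_halides.py
# (illustrative layout only, not the actual benchmark module structure)
from __future__ import annotations

from dataclasses import dataclass

import pandas as pd


@dataclass(frozen=True)
class BenchmarkDomain:
    """Bundles the reusable pieces of a benchmark: the lookup data and its column roles."""

    lookup: pd.DataFrame
    parameter_columns: list[str]
    target_column: str


def load_domain(csv_path: str = "data/aryl_halides.csv") -> BenchmarkDomain:
    """Load the lookup table once so every benchmark variant shares the same domain."""
    lookup = pd.read_csv(csv_path)  # assumed lookup table with a "yield" target column
    return BenchmarkDomain(
        lookup=lookup,
        parameter_columns=[c for c in lookup.columns if c != "yield"],
        target_column="yield",
    )


# On a feature branch, a benchmark would then only import the shared domain and
# vary the surrogate/kernel settings it wants to test, e.g.:
#
#     from benchmarks.domains.shared.aryl_halides import load_domain
#     domain = load_domain()
#     campaign = build_campaign(domain, kernel_preset="botorch-default")  # hypothetical helper
```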

@AdrianSosic @Scienfitz @AVHopp @fabianliebig

Scienfitz commented Feb 24, 2025

@AVHopp @fabianliebig, can you chime in on how we can increase the runtime or possibly achieve the parallelization requested above?

@Hrovatin
can you elaborate on what you mean by the last point? Why is there a kernel_presets folder in the domains folder?

@Hrovatin

kernel_presets is a folder using the botorch kernel presets for testing the botorch preset feature. As I understood it, we decided that I start running new-feature benchmarks on branches instead of locally, as I did before.

@Scienfitz

I don't think that's necessary. From what I understood: once we have all benchmarks implemented, we will have two results from:

  • main: runs all benchmarks with the current settings
  • another_branch: this branch just changes the default kernels in the code; it does not alter the benchmark code at all

Those two will be compared in the dashboard. No code adjustments to the benchmarks are needed.

@Hrovatin

Some features add new arguments to the code, so the benchmark must be changed. E.g. the botorch kernel factory was not added as a default but is used similarly to how one would use EDBO.

@Scienfitz

Well, this would result in a complicated way of comparing results. Why would you prefer that instead of just changing the default and triggering the benchmark action on the feature branch? Then we check the result and, depending on that, keep or drop the default. Also, even if the benchmark code changes, there is no reason to make copies and maintain the unchanged benchmarks in the same branch, as they always have their comparison in the reference branch. I think this makes the third point somewhat obsolete.

@Hrovatin

The issue arises, for example, when we add many small changes, which would mean that for each one we need to create a new branch and set it as the default (e.g. a StratifiedScaler that can optionally be used for the botorch MultiTaskGP). Then branch management gets really hard, as in the above example we would need to create two branches with the new MultiTaskGP feature, one with and one without the StandardScaler, and I would need to constantly make sure they are synchronised.

@Scienfitz

It is not intended to check every small change. Once per PR / feature proposal is fine, e.g. once the potential prior change is fully implemented.

I'm not entirely sure, but I think you can also compare them based on commits, so even if you wanted two snapshots from the same branch, that should be no problem.

@fabianliebig

Hi @Hrovatin, many thanks for those ideas. Sorry for my late reply. I have to confess (even though we have already talked) that I'm not sure I understand the full scope of your requirements. My thoughts on your points are as follows:

  • Increasing the runtime to at least 24 h per job is possible and only requires one additional line in the job description. However, I cannot yet say whether more than 24 h is feasible, since the GITHUB_TOKEN expires after that time period and I couldn't find clear documentation on whether that impacts our use case. If not, a runtime of up to 35 days is theoretically possible.
  • Parallelization is certainly possible. From what I saw regarding the CPU utilization, we should be safe running two benchmarks in one container. Besides that, we can also start as many containers as we want, since the workflow itself is completely independent, as long as the results are distinguishable by date, commit hash, name, or branch; otherwise, they overwrite each other. I will have to look into the details, but I plan to come up with more concrete ideas in the upcoming week.
  • We can log the name of the benchmark right before it starts if that helps. The simulation will provide a progress bar showing the number of performed iterations and, afterwards, the runtime. Moreover, the benchmarks are executed in the order of the list, so if you know how many finished in time (by observing the progress bar of the simulation package, for example), you can directly link that to the list's order.
  • You can also separate the results by commit; it might be hard to remember the hash, to be honest, but it would be an alternative to branch management. Would it help to have some kind of command-line option that can be used to separate things more clearly? FYI: you can also change the function description (docstring), as this will be stored and displayed in the dashboard, in case you need to describe a small code change for your observation.

Sorry for the long comment. Please let me know if I missed something regarding your requirements. We may also talk about your workflow at some point, as I have the impression that more local functionality for the benchmarking module would also help :)

@fabianliebig

I was curious and wanted to test what happens if a job exceeds 24 h. Well, the container just kept running, so I would guess it will work as long as the GITHUB_TOKEN is not used.

fabianliebig added a commit that referenced this issue Mar 3, 2025
Hi everyone, as one of the requested changes from #493, this PR sets the
GitHub-side runtime limit for the container to 24 hours. This refers to
the GITHUB_TOKEN generated per job, which expires after that time
period. I've also tested longer runtimes and found that we should also
be safe to go up to 35 days if necessary. Further changes regarding
parallelization may follow once discussed.
@fabianliebig

I've added a PR for basic logging of benchmark information to the INFO channel, including the runtime. Just for the record: please take the runtime with a grain of salt, as it may vary considerably.
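
For intuition only, a minimal sketch of what such logging could look like; the wrapper function, the `name` attribute, and the log format are assumptions, not the actual PR code:

```python
import logging
import time

logger = logging.getLogger(__name__)


def run_with_logging(benchmark) -> None:
    """Log the benchmark name before execution and its wall-clock runtime afterwards."""
    # `benchmark` is assumed to be a callable with a `name` attribute (illustrative only).
    logger.info("Starting benchmark '%s'", benchmark.name)
    start = time.perf_counter()
    benchmark()
    logger.info(
        "Finished benchmark '%s' after %.1f s", benchmark.name, time.perf_counter() - start
    )
```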

fabianliebig added a commit that referenced this issue Mar 5, 2025
…499)

This PR adds two lines for logging which benchmark is started and how long it took, with regard to #493.
I think the name and random seed are sufficient, but feel free to suggest more or less information.
An exemplary log looks like this:

![image](https://github.com/user-attachments/assets/3e627acb-e585-44f7-8c84-930ce5644edd)

@Hrovatin I kindly ask you to comment if that fits your needs. Thanks :)
fabianliebig self-assigned this Mar 5, 2025
@Scienfitz

I think we can close this issue (at the very latest after #491 is merged).

  • @AVHopp will finish Add benchmarks #491; an optional addition was maybe a handful of non-TL benchmarks. There we don't have to reinvent the wheel, just take some of the examples we already have or some from the old benchmark repo. This should have priority, as PRs like Acquisition function builder #490 are already excellent use cases for the whole benchmarking idea.
  • Thanks to @fabianliebig's changes, I believe we can tick off point 1.
  • I think we can also tick off point 3: if used correctly, the benchmark app allows direct comparison of curves. There is no immediate need for these reusable structures, since re-coding existing benchmarks is not necessary when the benchmarking action and app are used correctly.

AdrianSosic added a commit that referenced this issue Mar 12, 2025
Hi everyone, this PR adds benchmark parallelization as requested in #493
by using a matrix in the workflow. The container still needs to be
deployed separately, so two matrices are added. To start only a subset,
`__main__.py` was altered to take a command-line argument so that
existing domains don't have to be changed. The selection of different
benchmarks was a bit tricky, so the separation into groups may be removed
if it overcomplicates things.
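
Purely as an illustration, a minimal sketch of how such a command-line selection could look; the flag name, the grouping, and the benchmark registry below are assumptions, not the actual `__main__.py`:

```python
# Hypothetical excerpt of a benchmarks __main__.py; all names are illustrative only.
import argparse

# Assumed registry mapping group names to benchmark callables.
BENCHMARK_GROUPS: dict[str, list] = {
    "non_tl": [],  # e.g. synthetic single-task benchmarks
    "tl": [],      # e.g. transfer-learning benchmarks
}


def main() -> None:
    parser = argparse.ArgumentParser(description="Run a subset of the benchmarks.")
    parser.add_argument(
        "--benchmark-list",
        default="all",
        help="Comma-separated group names to run, or 'all'.",
    )
    args = parser.parse_args()

    groups = (
        list(BENCHMARK_GROUPS)
        if args.benchmark_list == "all"
        else args.benchmark_list.split(",")
    )
    for group in groups:
        for benchmark in BENCHMARK_GROUPS[group]:
            benchmark()  # each matrix job would invoke only its own group


if __name__ == "__main__":
    main()
```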

![image](https://github.com/user-attachments/assets/1f83bb81-7c52-47f2-acaa-775ae99f5964)
I also noticed that the average CPU utilization was about 48% for the
15 h benchmarks that ran recently:

![image](https://github.com/user-attachments/assets/e2e67e46-120d-463c-994a-c68dec10bd08)
So I reduced the requested compute from 16 vCPUs to 8.