frequent CI failures with Julia nightly and 1.9 on macOS due to timeout #1888
I just thought I could re-run the last good job on GitHub Actions to see if the newer Julia version would cause this to timeout as well, but this installed an even older Julia version now:
I did another retry of that build, this time it used:
and the tests succeeded in 1.5 hours. Better but still quite a bit worse than before. I did notice one particularly bad testcase in the failed log:
In the new one that I linked here, and in the last good one linked by Max, this took only about 2-3 minutes.
I think at least to some extent the CI runners are to blame; I have been running a few tests manually here. I added some system stats at the beginning and enabled GC logging.
Edit: This did happen earlier as well, e.g. about 3 weeks ago here with Julia …
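For reference, a minimal sketch of that kind of instrumentation (assumed, not the exact code used on the runner): print a few system stats up front and turn on Julia's per-collection GC logging, which is available since Julia 1.8.

```julia
# Print some basic system stats at the start of the test run.
println("CPU threads:  ", Sys.CPU_THREADS)
println("Total memory: ", round(Sys.total_memory() / 2^30; digits=2), " GiB")
println("Free memory:  ", round(Sys.free_memory() / 2^30; digits=2), " GiB")

# Available since Julia 1.8: log every GC invocation (pause time, memory freed)
# to stderr, so long pauses show up directly in the CI log.
GC.enable_logging(true)
```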
I wonder if these GC pauses could be caused by us caching things too eagerly: we create lots and lots of rings and other stuff that might get cached. That would fit with what Benjamin observed, namely the heap running full, which causes more and more GCs to happen. Perhaps we can flush at least some of those caches between certain parts of the test suite? Regarding the …
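A minimal sketch of what such a flush between test groups could look like; `RING_CACHE` and `flush_caches!` are hypothetical names for illustration, not actual Oscar API:

```julia
# Hypothetical stand-in for the global caches mentioned above.
const RING_CACHE = Dict{Any,Any}()

function flush_caches!()
    empty!(RING_CACHE)   # drop cached rings and other objects we no longer need
    GC.gc(true)          # force a full collection so the freed memory is reclaimed
end

# Usage idea: call this between test groups, e.g.
#   run_testgroup("AbelianClosure"); flush_caches!()
```

Whether this helps depends on how much of the heap is really held alive by such caches rather than by the tests themselves.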
There is something like … In my experience, even if all caching is disabled, GC time is strictly increasing for long-running jobs. I had something running for three days (all caching disabled), and now in this session every GC invocation takes … (although it is doing effectively nothing).
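A quick way to see how expensive a single collection has become in such a long-running session is to time one forced full collection and look at the cumulative GC counters. A rough sketch; note that `Base.gc_num()` is internal and its field names may differ between Julia versions:

```julia
# Wall time of one forced full collection in the current session.
@time GC.gc(true)

# Cumulative GC statistics collected by the runtime (internal API).
stats = Base.gc_num()
println("GC pauses so far:  ", stats.pause)
println("total GC time (s): ", stats.total_time / 1e9)
```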
Yes, that is possible. There is a little bit of randomization in the code. Though I am surprised that it may take 20 minutes.
Is there a way to reproduce this on Linux? I wonder where this behavior comes from; so far this has never happened for me. There is indeed some randomization going on, but I am surprised that it has such a big effect.
I haven't been able to reproduce this either after letting it run in a loop for a while; the timings I got after about 1000 iterations are all between 117 and 139 seconds. (Note that I did restart Julia for each iteration.)

I believe the reason for that long duration was extreme memory starvation of that macOS runner. I found one further log with such a timing in my experiments on GitHub Actions: https://github.com/oscar-system/Oscar.jl/actions/runs/4091477929/jobs/7064161311#step:9:3447

The reason why that testcase is particularly affected is that it requires somewhat more memory than other test groups. For the most recent run on my test branch I added …

I think there is not really a specific commit for Julia nightly that triggers all this, but:
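For reference, a rough sketch of such a reproduction loop; the test file path is a placeholder, not necessarily Oscar's real test layout. Each iteration starts a fresh Julia process so that no heap or cache state carries over, and records the wall time of the run.

```julia
timings = Float64[]
for i in 1:10   # bump this to ~1000 for a longer soak test
    # Placeholder invocation: run one test group in a fresh process.
    cmd = `julia --project=. test/Rings/AbelianClosure.jl`
    t = @elapsed run(cmd)
    push!(timings, t)
    println("iteration ", i, ": ", round(t; digits=1), " s")
end
println("min/max over all iterations: ", extrema(timings))
```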
I have started a new CI job to compare nightly with 1.8 and 1.9 now.
I see. Thank you @benlorenz for your explanations. Indeed, the algorithms/implementation are allocation-heavy.
After torturing the GitHub Actions runners some more, it looks like Julia 1.9 (on macOS) is similarly affected by this, for example here: https://github.com/oscar-system/Oscar.jl/actions/runs/4125228965/jobs/7131303651#step:8:6741. There are a few more attempts to look at; sometimes nightly fails, sometimes 1.9, sometimes they do succeed, but even then there appears to be a lot of memory pressure for still-unknown reasons.
This should be fixed for nightly now (or at least be significantly better) after the merge of JuliaLang/julia#48614. I re-ran a bunch of GitHub jobs and all of them succeeded for nightly. Also, the last two merges to master ran fine on macOS nightly. It is still an issue for 1.9, but the backport label is already there, so this should be fixed soon as well.
Thanks for keeping track of this.
Looking at e.g. <https://github.com/oscar-system/Oscar.jl/actions/runs/4026539430>, it exceeds our self-imposed 2-hour timeout limit.
In contrast, Julia nightly on Linux runs the test suite in ~1 hour, while Julia 1.6 on macOS requires roughly 1 hour 25 minutes (which is about the same as Julia 1.6 on Linux in that CI run).
So the failure started with our commit c6ece85, but due to the nature of the failure I don't think that commit is at fault; it seems more likely that a change on the Julia side is responsible?
The last "good" run used
The first "bad" run used
Changes on the Julia side: JuliaLang/julia@7630606...9b1ffbb
Of course there could also be changes in one of our dependencies (one could try to extract this from the CI logs above). It might also be interesting to find out which tests got slower. A quick check shows that e.g. the `AbelianClousre` (sic!) tests went from 52s to 81.3s, and `MPolyAnyMap/MPolyRing` went from 4m04.7s to 6m47.2s. But why, and why only on macOS + Julia nightly?