codegen gcroot optimization #14543
👍 I think this is a great direction for our codegen in general.
… wrong method early in bootstrapping. later in bootstrapping, the new image is sufficiently similar that the error is no longer important, but early in bootstrapping it may try to call the wrong method and not have enough methods defined to succeed
the emitted roots are essentially placeholders for later simplification by a root optimization pass. the first step here is to help the codegen system distinguish between roots used for function arguments and roots used for locals
…runtime computation matches the result of is_stable_expr
…so that it is easier to elide the store when finalizing the gc frame
…inalize_gc_frame this computes frame variable liveness at a basicblock level, then uses that to build an optimized jlcall frame
detects simple `store -> load -> store other` or `store, store other` patterns and removes the redundant root; also detects dead roots and removes them
this moves it a bit closer to becoming an llvm pass and makes it easier to pass around state to the helper functions
this pass looks for gcroots that are trivially unnecessary and elides them. rewriting this functionality this way made the allocate_frame liveness computation much more straightforward while making both passes more powerful
also drop the jltype argument from boxed; the uses of this argument were invalid anyways
… a few direct consumers
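The store-elision idea described in the commits above can be illustrated with a toy model. This is a minimal sketch, not the code from this PR: it assumes a single basic block, integer slot ids, and that a root slot is only observed by a load from that slot or by a safepoint (a call that may trigger GC). `Op`, `Inst`, and `elide_dead_root_stores` are hypothetical names made up for the illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <vector>

// Toy instruction set for one basic block (hypothetical, not julia's IR).
enum class Op { Store, Load, Safepoint };

struct Inst {
    Op op;
    int slot; // root slot id; ignored for Safepoint
};

// Returns one flag per instruction. A store gets `false` when nothing can
// observe it: the slot is overwritten (or the block ends with the roots
// dead) before any load from that slot or any safepoint is reached.
std::vector<bool> elide_dead_root_stores(const std::vector<Inst> &block,
                                         bool roots_live_out)
{
    std::vector<bool> keep(block.size(), true);
    // Scanning backwards: is a slot observed later unless stated otherwise?
    bool default_observed = roots_live_out;
    // Per-slot overrides of the default above.
    std::unordered_map<int, bool> observed;
    for (std::size_t i = block.size(); i-- > 0;) {
        const Inst &inst = block[i];
        switch (inst.op) {
        case Op::Safepoint:
            // The GC may scan every slot here, so all earlier stores matter.
            default_observed = true;
            observed.clear();
            break;
        case Op::Load:
            assert(inst.slot >= 0);
            observed[inst.slot] = true;
            break;
        case Op::Store: {
            assert(inst.slot >= 0);
            auto it = observed.find(inst.slot);
            bool is_observed =
                (it != observed.end()) ? it->second : default_observed;
            keep[i] = is_observed;
            // Anything stored earlier to this slot is overwritten here.
            observed[inst.slot] = false;
            break;
        }
        }
    }
    return keep;
}
```

In this model a `store, store other` pair to the same slot with no safepoint in between loses its first store, and a trailing store that nothing observes is dropped as a dead root. A real pass would also need to reason across basic blocks.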
i removed the WIP label, since, while I know there are more optimizations possible, i believe this to be fairly competitive with master now
jl_cgval_t arg2 = emit_expr(args[3],ctx);
ifelse_result = builder.CreateSelect(isfalse,
                                     boxed(arg2, ctx),
                                     boxed(arg1, ctx));
should this just be on one line?
I believe this has started causing a bootstrap segfault on win64: https://ci.appveyor.com/project/StefanKarpinski/julia/build/1.0.13330
Confirmed by bisect to have been introduced here.
I've also got intermittent
So far seems more reproducible on aarch64...
Actually, on aarch64, the issue is a stack overflow in the serializer during a super deep type inference. @tkelman is stack overflow detection working on win64? In any case, it might be good to check whether any stack overflow was detected.
Increasing the stack size during compilation works around the issue (I increased it from 8M to 80M just to be safe, but I think we probably don't actually need that much), so this is something worth checking or using as a temporary workaround.
i guess openblas has already been doing this for us on x64, causing me to miss this entirely in testing :(. Moreover, the automated gc-rooting (4a51309) significantly increased the size of the code (a proxy for measuring the average stackframe, with all else unchanged). The follow-on commits slowly brought that back down via optimizations, but I suspect some common optimizable patterns may still be lacking. Also, fwiw, the win64 stackframe is often much larger than the platform ABI standard one, and type inference already often runs fairly close to the value we've been using, so it wouldn't be unreasonable to double it again.
OpenBLAS only does that for the git version, and I think the issue isn't that bad on x64 (I patched it out in openblas-git and still don't really see the issue, only once with threading).
I think trying to optimize it on master is totally fine, and I guess the way you propose to flatten the type inference should also help this a lot?
this rewrites the codegen gcroot allocation algorithm to exist entirely as an llvm pass that can be run as a separate optimization pass (instead of updating `argDepth` along the way). it computes liveness intervals to perform register coloring at the basic-block level (fine-grained tracking for the instruction level has not been implemented yet).
at this stage, i believe this should yield results comparable to the current algorithm. however, the expected benefits are:
… + boxed + get_gcrooted)
TODO:
… `needsgcroot` computation in a few places
… `mark_gc_use`
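To make the "liveness intervals plus coloring at the basic-block level" idea concrete, here is a minimal sketch under strong simplifying assumptions; it is not the pass from this PR. It assumes the basic blocks are already laid out in a linear order and that each root is live over one contiguous range of block indices. `Interval`, `Coloring`, and `color_roots` are hypothetical names for the illustration. Two roots share a frame slot only if their block ranges do not touch, which mirrors the block-level (not instruction-level) granularity described above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Liveness of one root, as an inclusive range of basic-block indices
// (assumes a linear block order; a real CFG needs a dataflow analysis).
struct Interval {
    int first_block;
    int last_block;
};

struct Coloring {
    std::vector<int> slot; // slot[i] is the frame slot assigned to root i
    int num_slots;         // total number of frame slots needed
};

// Greedy linear-scan coloring: visit roots by increasing first_block and
// reuse the lowest-numbered slot whose previous occupant died in an
// earlier block.
Coloring color_roots(const std::vector<Interval> &roots)
{
    std::vector<std::size_t> order(roots.size());
    std::iota(order.begin(), order.end(), 0);
    std::stable_sort(order.begin(), order.end(),
                     [&](std::size_t a, std::size_t b) {
                         return roots[a].first_block < roots[b].first_block;
                     });

    Coloring result;
    result.slot.assign(roots.size(), -1);
    std::vector<int> busy_until; // last live block of each slot's occupant
    for (std::size_t idx : order) {
        const Interval &r = roots[idx];
        assert(r.first_block <= r.last_block);
        std::size_t s = 0;
        // Sharing a block counts as a conflict (block-level granularity).
        while (s < busy_until.size() && busy_until[s] >= r.first_block)
            ++s;
        if (s == busy_until.size())
            busy_until.push_back(r.last_block);
        else
            busy_until[s] = r.last_block;
        result.slot[idx] = static_cast<int>(s);
    }
    result.num_slots = static_cast<int>(busy_until.size());
    return result;
}
```

With instruction-level tracking, two roots whose lifetimes merely meet in the same block could share a slot; in this block-level model they cannot, so the frame can come out larger than strictly necessary.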