codegen gcroot optimization #14543

vtjnash · 2016-01-03T05:33:23Z

this rewrites the codegen gcroot allocation algorithm to exist entirely as a llvm pass that can be run as a separate optimization pass (instead of updating argDepth along the way). it computes liveness intervals to perform register coloring at the basic-block level (fine-grained tracking for the instruction level has not been implemented yet).

at this stage, i believe this should yield results comparable to the current algorithm. however, the expected benefits are:

the algorithm can handle an SSA-form Julia AST without degradation
flexibility for writing new / more advanced algorithms is a possibility
may make it easier to try new gc strategies (e.g. stack maps) and insert the correct gc-state transitions
simpler codegen (gets us closer to merging emit_expr + emit_unboxed ~~+ boxed + get_gcrooted~~)

TODO:

fix LLVM 3.7 compatibility
cleanup the needsgcroot computation in a few places
finish implementation and usage of mark_gc_use

Keno · 2016-01-03T19:50:21Z

👍 I think this is a great direction for our codegen in general.

… wrong method early in bootstrapping later in bootstrapping, the new image is sufficiently similar, the error is no longer important but early in bootstrapping, it may try to call the wrong method and not have enough method defined to succeed

the emitted roots are essentially placeholders for later simplification by a root optimization pass. the first step here is to help the codegen system distinguish between roots used for function arguments and roots used for locals

…runtime computation matches the result of is_stable_expr

…y version

…so that it is easier to elid the store when finalizing the gc frame

…inalize_gc_frame this computes frame variable liveness at a basicblock level, then uses that to build an optimized jlcall frame

…function

detects simple `store -> load -> store other` or `store, store other` patterns and removes the redundant root and detects dead roots, and removes them

this moves it a bit closer to becoming an llvm pass and makes it easier to pass around state to the helper functions

this pass looks for gcroots that are trivially unnecessary and elides them rewriting this functionality this way made the allocate_frame liveness computation much more straight-forward while making both passes more powerful

also drop the jltype argument from boxed; the uses of this argument were invalid anyways

… a few direct consumers

vtjnash · 2016-02-12T05:20:36Z

i removed the WIP label, since, while I know there are more optimizations possible, i believe this to be fairly competitive with master now

tkelman · 2016-02-12T05:32:09Z

src/intrinsics.cpp

+            jl_cgval_t arg2 = emit_expr(args[3],ctx);
+            ifelse_result = builder.CreateSelect(isfalse,
+                    boxed(arg2, ctx),
+                    boxed(arg1, ctx));


should this just be on one line?

codegen gcroot optimization

tkelman · 2016-02-12T21:11:13Z

I believe this has started causing a bootstrap segfault on win64: https://ci.appveyor.com/project/StefanKarpinski/julia/build/1.0.13330

tkelman · 2016-02-13T12:24:10Z

Confirmed by bisect to have been introduced here. skip was a build failure, good means for i in seq 20; do make clean && make -j8 || break; done succeeded all 20 times, bad means at least one in 20-in-a-row rebuilds and bootstraps caused a segfault or an assertion failure.

$ git bisect bad
There are only 'skip'ped commits left to test.
The first bad commit could be any of:
0b2758f4754f910992c7a6af0b5d865f6ae29720
623ed94fe4c8530744c10e443cdaa4c752768f89
cdd6a41fa5bdec029d2c7555e51124eddc0daa3d
4a513090f627673ecbb655eda7cfb2f6fb2699b4
a0c4b2554b76d191ab0779fe2ef3085e2b9f90cd
15bd22f8c43f157d9d8229267f5c0f1d4b241dcd
2c202be7a313046211ea139a67c7dfa02f7c2250
269008015364c3d31d9c5d7d533951e9314e74d4
b2869f9794804aafe2298f8e9bd3c099f6a42293
0c5218c30a5b7b54f2daeaabffa284ee9f9c65fb
8b1d5a73a03bc3dfe1dfa34d02aaf0016031fc76
97ed64a7722e6b1de9e7b758f56b6380afe5022e
fcd595ab9d441299d5de901fb2f2c44378fa9ece
33ae5da49ba9c34f2c55b57d1b3de3485af8e194
35954a41277e284da540dc340b52b09dd8d8532b
44459f000d2e2a360be95dd4ae9ad11f5b3a2f26
We cannot bisect more!

Tony@LAPTOP-O230JCFF ~/julia
$ git bisect log
git bisect start
# bad: [e304c1b4454d3a00d1c7f2b8fcc9cbb3d2051b0b] Merge pull request #15037 from tkelman/tk/xcompile
git bisect bad e304c1b4454d3a00d1c7f2b8fcc9cbb3d2051b0b
# good: [f080ae35ce472d38b9accbacbad206b3cd330755] Merge pull request #15011 from JuliaLang/yyc/async-unwind
git bisect good f080ae35ce472d38b9accbacbad206b3cd330755
# bad: [00dd5d599ca7c379378b234f05c1b9be35e3d507] add remaining mark_gc_use needed for ccall, llvmcall, emit_unbox, and a few direct consumers
git bisect bad 00dd5d599ca7c379378b234f05c1b9be35e3d507
# skip: [269008015364c3d31d9c5d7d533951e9314e74d4] convert gcroot / type inspection of variable assignments to on-the-fly version
git bisect skip 269008015364c3d31d9c5d7d533951e9314e74d4
# skip: [33ae5da49ba9c34f2c55b57d1b3de3485af8e194] generate really poor, but really simple gc frames
git bisect skip 33ae5da49ba9c34f2c55b57d1b3de3485af8e194
# bad: [d2a8147e2976bbb98c841ef5f1049e975597c79e] extract collapseRedundantRoots into a separate function
git bisect bad d2a8147e2976bbb98c841ef5f1049e975597c79e
# skip: [0c5218c30a5b7b54f2daeaabffa284ee9f9c65fb] increase usage of jl_cgval_t info, rather than expression inspection
git bisect skip 0c5218c30a5b7b54f2daeaabffa284ee9f9c65fb
# good: [e824393a3e0994861f2ce09f257df146811c73d9] Type -> convert overwrite warning may result in Inference calling the wrong method early in bootstrapping
git bisect good e824393a3e0994861f2ce09f257df146811c73d9
# skip: [0b2758f4754f910992c7a6af0b5d865f6ae29720] add a simple stats counter for gcframe size (disabled)
git bisect skip 0b2758f4754f910992c7a6af0b5d865f6ae29720
# skip: [b2869f9794804aafe2298f8e9bd3c099f6a42293] optimize several common gcroot patterns
git bisect skip b2869f9794804aafe2298f8e9bd3c099f6a42293
# skip: [623ed94fe4c8530744c10e443cdaa4c752768f89] add codectx as an argument to mark_julia_type
git bisect skip 623ed94fe4c8530744c10e443cdaa4c752768f89
# skip: [8b1d5a73a03bc3dfe1dfa34d02aaf0016031fc76] minor additional optimizations for jlcall frame creation
git bisect skip 8b1d5a73a03bc3dfe1dfa34d02aaf0016031fc76
# skip: [4a513090f627673ecbb655eda7cfb2f6fb2699b4] start work of marking lifetime regions of all gc-rooted objects in a function
git bisect skip 4a513090f627673ecbb655eda7cfb2f6fb2699b4
# skip: [2c202be7a313046211ea139a67c7dfa02f7c2250] move gc frame allocation for variables to finalize_gc_frame
git bisect skip 2c202be7a313046211ea139a67c7dfa02f7c2250
# skip: [cdd6a41fa5bdec029d2c7555e51124eddc0daa3d] move entire gc frame allocation and optimization for jlcall args to finalize_gc_frame
git bisect skip cdd6a41fa5bdec029d2c7555e51124eddc0daa3d
# skip: [35954a41277e284da540dc340b52b09dd8d8532b] refactor the gcroot algorithm into a class
git bisect skip 35954a41277e284da540dc340b52b09dd8d8532b
# skip: [15bd22f8c43f157d9d8229267f5c0f1d4b241dcd] fix windows compile handling of gcroot_func declaration
git bisect skip 15bd22f8c43f157d9d8229267f5c0f1d4b241dcd
# skip: [97ed64a7722e6b1de9e7b758f56b6380afe5022e] delete gcframe if all of the locals and tempslots get removed
git bisect skip 97ed64a7722e6b1de9e7b758f56b6380afe5022e
# skip: [a0c4b2554b76d191ab0779fe2ef3085e2b9f90cd] use the locals frame as the gc root of a SA value in emit_assignment so that it is easier to elid the store when finalizing the gc frame
git bisect skip a0c4b2554b76d191ab0779fe2ef3085e2b9f90cd
# skip: [fcd595ab9d441299d5de901fb2f2c44378fa9ece] correct isimmutable info and add needsgcroot annotations so that the runtime computation matches the result of is_stable_expr
git bisect skip fcd595ab9d441299d5de901fb2f2c44378fa9ece
# bad: [44459f000d2e2a360be95dd4ae9ad11f5b3a2f26] move the gcroot algorithm as late as possible -- almost inside the FPM
git bisect bad 44459f000d2e2a360be95dd4ae9ad11f5b3a2f26
# only skipped commits left to test
# possible first bad commit: [44459f000d2e2a360be95dd4ae9ad11f5b3a2f26] move the gcroot algorithm as late as possible -- almost inside the FPM
# possible first bad commit: [35954a41277e284da540dc340b52b09dd8d8532b] refactor the gcroot algorithm into a class
# possible first bad commit: [97ed64a7722e6b1de9e7b758f56b6380afe5022e] delete gcframe if all of the locals and tempslots get removed
# possible first bad commit: [8b1d5a73a03bc3dfe1dfa34d02aaf0016031fc76] minor additional optimizations for jlcall frame creation
# possible first bad commit: [b2869f9794804aafe2298f8e9bd3c099f6a42293] optimize several common gcroot patterns
# possible first bad commit: [15bd22f8c43f157d9d8229267f5c0f1d4b241dcd] fix windows compile handling of gcroot_func declaration
# possible first bad commit: [4a513090f627673ecbb655eda7cfb2f6fb2699b4] start work of marking lifetime regions of all gc-rooted objects in a function
# possible first bad commit: [623ed94fe4c8530744c10e443cdaa4c752768f89] add codectx as an argument to mark_julia_type
# possible first bad commit: [0b2758f4754f910992c7a6af0b5d865f6ae29720] add a simple stats counter for gcframe size (disabled)
# possible first bad commit: [cdd6a41fa5bdec029d2c7555e51124eddc0daa3d] move entire gc frame allocation and optimization for jlcall args to finalize_gc_frame
# possible first bad commit: [a0c4b2554b76d191ab0779fe2ef3085e2b9f90cd] use the locals frame as the gc root of a SA value in emit_assignment so that it is easier to elid the store when finalizing the gc frame
# possible first bad commit: [2c202be7a313046211ea139a67c7dfa02f7c2250] move gc frame allocation for variables to finalize_gc_frame
# possible first bad commit: [269008015364c3d31d9c5d7d533951e9314e74d4] convert gcroot / type inspection of variable assignments to on-the-fly version
# possible first bad commit: [0c5218c30a5b7b54f2daeaabffa284ee9f9c65fb] increase usage of jl_cgval_t info, rather than expression inspection
# possible first bad commit: [fcd595ab9d441299d5de901fb2f2c44378fa9ece] correct isimmutable info and add needsgcroot annotations so that the runtime computation matches the result of is_stable_expr
# possible first bad commit: [33ae5da49ba9c34f2c55b57d1b3de3485af8e194] generate really poor, but really simple gc frames

yuyichao · 2016-02-13T14:46:09Z

I've also got intermittent TypeError during boot strap.

error during bootstrap:
TypeError(func=:abstract_call_gf, context="typeassert", expected=Core.Inference.CallStack, got=Core.Inference.EmptyCallStack())
rec_backtrace at /home/alarm/projects/julia/aarch64-test/src/task.c:667
[inline] at /home/alarm/projects/julia/aarch64-test/src/task.c:712
record_backtrace at /home/alarm/projects/julia/aarch64-test/src/task.c:853
unknown function (ip: 0x7fb7d39cb4)
unknown function (ip: 0x7db25a9e18)
jl_apply_generic at /home/alarm/projects/julia/aarch64-test/src/gf.c:1919
unknown function (ip: 0x7db25ef864)

So far seems more reproducible on aarch64...............

yuyichao · 2016-02-13T14:58:03Z

Actually, on aarch64, the issue is a stack overflow in the serializer during a super deep type inference. The StackOverflowError() thrown then seems to confuse type inference and causing it to fail on type assertion.

@tkelman is stack overflow detection working on win64? In any case, it might be good to to check if there was any stackoverflow detected.

yuyichao · 2016-02-13T15:17:23Z

Increase the stack size during compilation works around the issue (I increased it from 8M to 80M just to be safe but I think we probably don't actually need that much) so this is something that worth checking/use as temporary workaround.

vtjnash · 2016-02-13T15:54:04Z

i guess openblas has already been doing this for us on x64, causing me to miss this entirely in testing :(. Moreover, the automated gc-rooting (4a51309) significantly increased the size of the code (proxy for measuring the average stackframe, with all else unchanged). The follow on commits slowly brought that back down via optimizations, but I suspect some common optimizable patterns may be still lacking. Also, fwiw, the win64 stackframe is often much larger than the platform ABI standard one, and type inference already often runs fairly close to the value we've been using so it wouldn't be unreasonable to double it again.

yuyichao · 2016-02-13T15:58:07Z

OpenBlas only does that for the git version and I think the issue isn't that bad on x64 (I patched it out in openblas-git and still don't really see the issue (only once with threading)).

some common optimizable patterns may be still lacking

I think trying to optimize it on master is totally fine and I guess the way you propose to flatten the type inference should also help this a lot?

yuyichao added the compiler:codegen Generation of LLVM IR and native code label Jan 3, 2016

vtjnash force-pushed the jn/codegen_rewrite2 branch 2 times, most recently from 4f2e812 to 55fd11a Compare January 6, 2016 22:20

hayd mentioned this pull request Jan 8, 2016

RFC: Create local gc frame for throw #11508

Closed

vtjnash force-pushed the jn/codegen_rewrite2 branch 2 times, most recently from af35da1 to a5f25fc Compare January 20, 2016 07:13

vtjnash force-pushed the jn/codegen_rewrite2 branch 2 times, most recently from f6eb4d3 to 36a58c7 Compare February 11, 2016 23:26

vtjnash added 20 commits February 11, 2016 22:14

generate really poor, but really simple gc frames

33ae5da

the emitted roots are essentially placeholders for later simplification by a root optimization pass. the first step here is to help the codegen system distinguish between roots used for function arguments and roots used for locals

correct isimmutable info and add needsgcroot annotations so that the …

fcd595a

…runtime computation matches the result of is_stable_expr

increase usage of jl_cgval_t info, rather than expression inspection

0c5218c

convert gcroot / type inspection of variable assignments to on-the-fl…

2690080

…y version

move gc frame allocation for variables to finalize_gc_frame

2c202be

use the locals frame as the gc root of a SA value in emit_assignment …

a0c4b25

…so that it is easier to elid the store when finalizing the gc frame

move entire gc frame allocation and optimization for jlcall args to f…

cdd6a41

…inalize_gc_frame this computes frame variable liveness at a basicblock level, then uses that to build an optimized jlcall frame

add a simple stats counter for gcframe size (disabled)

0b2758f

add codectx as an argument to mark_julia_type

623ed94

start work of marking lifetime regions of all gc-rooted objects in a …

4a51309

…function

fix windows compile handling of gcroot_func declaration

15bd22f

optimize several common gcroot patterns

b2869f9

detects simple `store -> load -> store other` or `store, store other` patterns and removes the redundant root and detects dead roots, and removes them

minor additional optimizations for jlcall frame creation

8b1d5a7

delete gcframe if all of the locals and tempslots get removed

97ed64a

refactor the gcroot algorithm into a class

35954a4

this moves it a bit closer to becoming an llvm pass and makes it easier to pass around state to the helper functions

move the gcroot algorithm as late as possible -- almost inside the FPM

44459f0

extract collapseRedundantRoots into a separate function

d2a8147

this pass looks for gcroots that are trivially unnecessary and elides them rewriting this functionality this way made the allocate_frame liveness computation much more straight-forward while making both passes more powerful

fold get_gcrooted into simply boxed

3ac4574

also drop the jltype argument from boxed; the uses of this argument were invalid anyways

add remaining mark_gc_use needed for ccall, llvmcall, emit_unbox, and…

00dd5d5

… a few direct consumers

vtjnash force-pushed the jn/codegen_rewrite2 branch from 36a58c7 to 00dd5d5 Compare February 12, 2016 03:28

vtjnash changed the title ~~WIP: codegen gcroot optimization~~ codegen gcroot optimization Feb 12, 2016

tkelman reviewed Feb 12, 2016
View reviewed changes

vtjnash added a commit that referenced this pull request Feb 12, 2016

Merge pull request #14543 from JuliaLang/jn/codegen_rewrite2

0b2027e

codegen gcroot optimization

vtjnash merged commit 0b2027e into master Feb 12, 2016

vtjnash deleted the jn/codegen_rewrite2 branch February 12, 2016 20:14

yuyichao mentioned this pull request Feb 15, 2016

New segfault in build during inference #15079

Closed

nalimilan mentioned this pull request Feb 28, 2016

"#test_cartesian not defined" error in subarrays test #14991

Closed

vtjnash mentioned this pull request Mar 11, 2016

reuse boxed versions of values in codegen #9194

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

codegen gcroot optimization #14543

codegen gcroot optimization #14543

vtjnash commented Jan 3, 2016

Keno commented Jan 3, 2016

vtjnash commented Feb 12, 2016

tkelman Feb 12, 2016

tkelman commented Feb 12, 2016

tkelman commented Feb 13, 2016

yuyichao commented Feb 13, 2016

yuyichao commented Feb 13, 2016

yuyichao commented Feb 13, 2016

vtjnash commented Feb 13, 2016

yuyichao commented Feb 13, 2016

codegen gcroot optimization #14543

codegen gcroot optimization #14543

Conversation

vtjnash commented Jan 3, 2016

Keno commented Jan 3, 2016

vtjnash commented Feb 12, 2016

tkelman Feb 12, 2016

Choose a reason for hiding this comment

tkelman commented Feb 12, 2016

tkelman commented Feb 13, 2016

yuyichao commented Feb 13, 2016

yuyichao commented Feb 13, 2016

yuyichao commented Feb 13, 2016

vtjnash commented Feb 13, 2016

yuyichao commented Feb 13, 2016