PTX: Lower unreachable control flow to avoid bad CFG reconstruction #467
Bug: this needs to handle phis at the start of the successor:
br i1 %7, label %L12, label %L41, !dbg !139
L12: ; preds = %L4
call fastcc void @julia__throw_boundserror_8234([1 x i64] %state), !dbg !139
unreachable -> br label %L41
L23: ; preds = %conversion
%8 = getelementptr inbounds [1 x i64], [1 x i64]* %4, i64 0, i64 0, !dbg !140
store i64 1, i64* %8, align 8, !dbg !140, !tbaa !120, !alias.scope !122, !noalias !123
%9 = icmp slt i64 %.fca.2.extract, 1, !dbg !145
br i1 %9, label %L31, label %L41, !dbg !152
L31: ; preds = %L23
call fastcc void @julia__throw_boundserror_8234([1 x i64] %state), !dbg !152
unreachable -> br label %L41
L41: ; preds = %L12, %L31, %L23, %L4
%storemerge = phi i64 [ 1, %L4 ], [ 2, %L23 ], !dbg !153
This is invalid, as the phi only expects two incoming edges.
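To make the failure concrete, here is a minimal sketch in Python of the consistency rule being violated: every predecessor of a block needs an incoming entry in that block's phis. The helper names (`predecessors`, `invalid_phis`) and the dictionary encoding of the CFG are hypothetical and only model the edges visible in the snippet above; this is not how GPUCompiler.jl or LLVM represent it.

```python
# Toy model (assumption): a CFG is {block: [successor, ...]} and
# phis is {block: [incoming block, ...]} for the phi at the start of that block.

def predecessors(cfg):
    """Compute the predecessor set of every block from the successor lists."""
    preds = {name: set() for name in cfg}
    for name, succs in cfg.items():
        for succ in succs:
            preds[succ].add(name)
    return preds

def invalid_phis(cfg, phis):
    """Return {block: [predecessors that the phi has no incoming entry for]}."""
    preds = predecessors(cfg)
    problems = {}
    for block, incoming in phis.items():
        missing = preds[block] - set(incoming)
        if missing:
            problems[block] = sorted(missing)
    return problems

# Before the rewrite: L12 and L31 end in `unreachable`, so they have no successors.
before = {"L4": ["L12", "L41"], "L12": [],
          "L23": ["L31", "L41"], "L31": [], "L41": []}

# After rewriting `unreachable` to `br label %L41` in L12 and L31.
after = {"L4": ["L12", "L41"], "L12": ["L41"],
         "L23": ["L31", "L41"], "L31": ["L41"], "L41": []}

# %storemerge in L41 only lists L4 and L23 as incoming blocks.
phis = {"L41": ["L4", "L23"]}

print(invalid_phis(before, phis))
print(invalid_phis(after, phis))
```

In this model the original CFG reports no problems, while the rewritten one reports that the phi in L41 lacks entries for L12 and L31. Any pass that redirects `unreachable` to a successor would therefore also have to extend that successor's phis.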
With that fixed, there's the converse: cloning paths to get blocks that only succeed a single other block results in phis that take too many inputs:
L354: ; preds = %L325.us
%.lcssa22 = phi i64 [ %111, %L325.preheader.split.us ], [ %111, %L325.preheader.L325.preheader.split_crit_edge ], [ %66, %L325.us ], [ undef, %L375 ]
%.lcssa = phi i64 [ %112, %L325.preheader.split.us ], [ %112, %L325.preheader.L325.preheader.split_crit_edge ], [ %67, %L325.us ]
%.sroa.0.0..sroa_idx = getelementptr inbounds [1 x [1 x [2 x i64]]], [1 x [1 x [2 x i64]]]* %15, i64 0, i64 0, i64 0, i64 0
store i64 %.lcssa22, i64* %.sroa.0.0..sroa_idx, align 8
%.sroa.2.0..sroa_idx4 = getelementptr inbounds [1 x [1 x [2 x i64]]], [1 x [1 x [2 x i64]]]* %15, i64 0, i64 0, i64 0, i64 1
store i64 %.lcssa, i64* %.sroa.2.0..sroa_idx4, align 8
call fastcc void @julia__throw_boundserror_15707([1 x i64] %state)
br label %L388.us
See how this is a block with an
Simplified the approach, now just emitting a branch to the unreachable block itself (or a return when it's the entry block we're looking at). This isn't ideal, as LLVM may have merged blocks:
I'm not sure how
The good news is that, at first sight, this seems to fix the issue. I've only checked sm_52 on CUDA 12.1 though, so this will need some more testing.
Just emitting
Similar situation on sm_37; let's try this out.
Upstreamed at https://reviews.llvm.org/D152789
During back-end compilation, `ptxas` inserts instructions to manage the hardware's reconvergence stack (SSY and SYNC). In order to do so, it needs to identify divergent regions:
Meanwhile, LLVM's branch-folder and block-placement MIR passes will try to optimize the block layout, e.g., by placing unlikely blocks at the end of the function:
That is not a problem as long as the unlikely block continues back into the divergent region. Crucially, this is not the case with unreachable control flow:
Dynamically, this is fine, because the called function does not return. However, `ptxas` does not know that and adds a successor edge to the `exit` block, widening the divergence range. In this example, that's not allowed, as `bar.sync` cannot be executed divergently on Pascal hardware or earlier.
To avoid these fall-through successors that change the control flow, we replace `unreachable` instructions with a branch to the current block (or in case of entry blocks, with a return from the function). That appears sufficient to allow `ptxas` to correctly identify the divergent region.
For anybody who would want to review: ignore the old code when looking at the diff; it's entirely unrelated to the new pass.
Potential fix to JuliaGPU/CUDA.jl#1746; alternative to JuliaGPU/CUDA.jl#1942.
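The lowering described above (replace `unreachable` with a branch to the current block, or with a return in the entry block) can be sketched on a toy CFG. This is a rough Python illustration under stated assumptions, not the pass from this PR: the function name `lower_unreachable` and the tuple encoding of terminators are hypothetical.

```python
# Toy model (assumption): each block maps to its terminator, encoded as a tuple:
#   ("unreachable",), ("ret",), ("br", target), ("condbr", if_true, if_false)

def lower_unreachable(blocks, entry):
    """Return a copy of `blocks` in which no block ends in `unreachable`."""
    lowered = {}
    for name, term in blocks.items():
        if term[0] == "unreachable":
            # The entry block returns; any other block branches to itself,
            # so it never falls through into whatever block is placed after it.
            lowered[name] = ("ret",) if name == entry else ("br", name)
        else:
            lowered[name] = term
    return lowered

# Hypothetical function: a bounds check that either throws or continues.
blocks = {
    "entry": ("condbr", "throw", "body"),
    "throw": ("unreachable",),  # follows a call that does not return
    "body": ("ret",),
}

print(lower_unreachable(blocks, "entry"))
print(lower_unreachable({"entry": ("unreachable",)}, "entry"))
```

In this model the self-branch gives the block an explicit successor without making it a predecessor of any other block, so phis elsewhere in the function keep their original incoming edges, unlike the earlier approach of branching to a successor.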