Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PTX: Lower unreachable control flow to avoid bad CFG reconstruction #467

Merged
merged 11 commits into from
Jun 12, 2023

Conversation

maleadt
Copy link
Member

@maleadt maleadt commented Jun 8, 2023

During back-end compilation, ptxas inserts instructions to manage the harware's reconvergence stack (SSY and SYNC). In order to do so, it needs to identify
divergent regions:

entry:
  // start of divergent region
  @%p0 bra cont;
  ...
  bra.uni cont;
cont:
  // end of divergent region
  bar.sync 0;

Meanwhile, LLVM's branch-folder and block-placement MIR passes will try to optimize the block layout, e.g., by placing unlikely blocks at the end of the function:

entry:
  // start of divergent region
  @%p0 bra cont;
  @%p1 bra unlikely;
  bra.uni cont;
cont:
  // end of divergent region
  bar.sync 0;
unlikely:
  bra.uni cont;

That is not a problem as long as the unlikely block continunes back into the divergent region. Crucially, this is not the case with unreachable control flow:

  entry:
  // start of divergent region
  @%p0 bra cont;
  @%p1 bra throw;
  bra.uni cont;
cont:
  bar.sync 0;
throw:
  call throw_and_trap();
  // unreachable
exit:
  // end of divergent region
  ret;

Dynamically, this is fine, because the called function does not return. However, ptxas does not know that and adds a successor edge to the exit block, widening the divergence range. In this example, that's not allowed, as bar.sync cannot be executed divergently on Pascal hardware or earlier.

To avoid these fall-through successors that change the control flow, we replace unreachable instructions with a branch to the current block (or in case of entry blocks, with a return from the function). That appears sufficient to allow ptxas to correctly identify the divergent region.


For anybody who would want to review: Ignore the old code when looking at the diff; it's entirely unrelated to the new pass.

Potential fix to JuliaGPU/CUDA.jl#1746; alternative to JuliaGPU/CUDA.jl#1942.

@maleadt
Copy link
Member Author

maleadt commented Jun 8, 2023

Bug: needs to handle phi's at the start of the successor:

  br i1 %7, label %L12, label %L41, !dbg !139

L12:                                              ; preds = %L4
  call fastcc void @julia__throw_boundserror_8234([1 x i64] %state), !dbg !139
  unreachable -> br label %L41

L23:                                              ; preds = %conversion
  %8 = getelementptr inbounds [1 x i64], [1 x i64]* %4, i64 0, i64 0, !dbg !140
  store i64 1, i64* %8, align 8, !dbg !140, !tbaa !120, !alias.scope !122, !noalias !123
  %9 = icmp slt i64 %.fca.2.extract, 1, !dbg !145
  br i1 %9, label %L31, label %L41, !dbg !152

L31:                                              ; preds = %L23
  call fastcc void @julia__throw_boundserror_8234([1 x i64] %state), !dbg !152
  unreachable -> br label %L41

L41:                                              ; preds = %L12, %L31, %L23, %L4
  %storemerge = phi i64 [ 1, %L4 ], [ 2, %L23 ], !dbg !153

This is invalid, as the phi only expected two incoming edges.
I'm not sure what the best approach is here; create a value out of thin air, try to walk up the CFG until we find a branch that doesn't target a phi, ...

@maleadt maleadt added the ptx Stuff about the NVIDIA PTX back-end. label Jun 8, 2023
@codecov
Copy link

codecov bot commented Jun 8, 2023

Codecov Report

Patch coverage: 100.00% and project coverage change: +0.03 🎉

Comparison is base (e82a68a) 82.51% compared to head (ffa0884) 82.54%.

❗ Current head ffa0884 differs from pull request most recent head 16c5b5d. Consider uploading reports for the commit 16c5b5d to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #467      +/-   ##
==========================================
+ Coverage   82.51%   82.54%   +0.03%     
==========================================
  Files          23       23              
  Lines        3122     3094      -28     
==========================================
- Hits         2576     2554      -22     
+ Misses        546      540       -6     
Impacted Files Coverage Δ
src/precompile.jl 81.08% <100.00%> (ø)
src/ptx.jl 96.35% <100.00%> (+2.71%) ⬆️

... and 1 file with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

@maleadt
Copy link
Member Author

maleadt commented Jun 8, 2023

With that fixed, there's the converse: Cloning paths to get blocks that only succeed a single other block results in phis that take too many inputs:

L354:                                             ; preds = %L325.us
  %.lcssa22 = phi i64 [ %111, %L325.preheader.split.us ], [ %111, %L325.preheader.L325.preheader.split_crit_edge ], [ %66, %L325.us ], [ undef, %L375 ]
  %.lcssa = phi i64 [ %112, %L325.preheader.split.us ], [ %112, %L325.preheader.L325.preheader.split_crit_edge ], [ %67, %L325.us ]
  %.sroa.0.0..sroa_idx = getelementptr inbounds [1 x [1 x [2 x i64]]], [1 x [1 x [2 x i64]]]* %15, i64 0, i64 0, i64 0, i64 0
  store i64 %.lcssa22, i64* %.sroa.0.0..sroa_idx, align 8
  %.sroa.2.0..sroa_idx4 = getelementptr inbounds [1 x [1 x [2 x i64]]], [1 x [1 x [2 x i64]]]* %15, i64 0, i64 0, i64 0, i64 1
  store i64 %.lcssa, i64* %.sroa.2.0..sroa_idx4, align 8
  call fastcc void @julia__throw_boundserror_15707([1 x i64] %state)
  br label %L388.us

See how this is a block with an unreachable in it, however, it had many predecessors. So we cloned the block, but kept the phi expecting many incoming edges. We should prune these.

@maleadt maleadt marked this pull request as ready for review June 9, 2023 13:27
@maleadt
Copy link
Member Author

maleadt commented Jun 9, 2023

Simplified the approach, now just emitting a branch to the unreachable block itself (or a return when it's the entry block we're looking at). This isn't ideal, as LLVM may have merged blocks:

entry:
  // start of divergent region 1
  @%p0 bra cont1;
  @%p1 bra throw;
  bra.uni cont1;
cont1:
  // end of divergent region 1
  bar.sync 0;   // is this executed divergently?
  // start of divergent region 2
  @%p2 bra cont2;
  @%p3 bra throw;
  bra.uni cont2;
cont2:
  // end of divergent region 2
  ...
throw:
  trap;
  br throw;

I'm not sure how ptxas will treat this, i.e., whether the barrier will be executed divergently or not. My original approach took care of this by cloning every block that had multiple predecessors and splitting the edges so that each unreachable block is only associated with a single divergent region, and branching to an alternative block by walking up the chain of predecessors and identifying an alternative successor (i.e., identifying %cont in br i1 %cond, %leading_to_unreachable, %cont). That turned out pretty tricky though, easily introducing loops (i.e. endlessly generating new code), and there were some issues with non-dominated values being introduced.

The good news is that, at first sight, this seems to fix the issue. I've only checked sm_52 on CUDA 12.1 though, so this will need some more testing.

@maleadt
Copy link
Member Author

maleadt commented Jun 12, 2023

Just emitting exit seems to work as well. sm_52 on ptxas from CUDA 11.3 or below still seems to fail, but we can just refuse to support those as they're upgradable to CUDA 11.8 without a driver change.

@maleadt maleadt changed the title PTX: Structurize unreachable control flow PTX: Lower unreachable control flow to avoid bad CFG reconstruction Jun 12, 2023
@maleadt
Copy link
Member Author

maleadt commented Jun 12, 2023

Similar situation on sm_37; let's try this out.

@maleadt maleadt merged commit 88bf4fd into master Jun 12, 2023
@maleadt maleadt deleted the tb/structurize_unreachable branch June 12, 2023 10:42
@maleadt
Copy link
Member Author

maleadt commented Jun 13, 2023

Upstreamed at https://reviews.llvm.org/D152789

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ptx Stuff about the NVIDIA PTX back-end.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant