Commit 1ee4d88

maleadt authored and Artem-B committed
NVPTX: Lower unreachable to exit to allow ptxas to accurately reconstruct the CFG.
PTX does not have a notion of `unreachable`, which results in emitted basic blocks having an edge to the next block:

```
block1:
  call @does_not_return();
  // unreachable
block2:
  // ptxas will create a CFG edge from block1 to block2
```

This may result in significant changes to the control flow graph, e.g., when LLVM moves unreachable blocks to the end of the function. That's a problem in the context of divergent control flow, as `ptxas` uses the CFG to determine divergent regions, while some instructions may not be executed divergently.

For example, `bar.sync` is not allowed to be executed divergently on Pascal or earlier. If we start with the following:

```
entry:
  // start of divergent region
  @%p0 bra cont;
  @%p1 bra unlikely;
  ...
  bra.uni cont;
unlikely:
  ...
  // unreachable
cont:
  // end of divergent region
  bar.sync 0;
  bra.uni exit;
exit:
  ret;
```

it is transformed by the branch-folder and block-placement passes to:

```
entry:
  // start of divergent region
  @%p0 bra cont;
  @%p1 bra unlikely;
  ...
  bra.uni cont;
cont:
  bar.sync 0;
  bra.uni exit;
unlikely:
  ...
  // unreachable
exit:
  // end of divergent region
  ret;
```

After moving the `unlikely` block to the end of the function, it has an edge to the `exit` block, which widens the divergent region and makes the `bar.sync` instruction happen divergently. That causes wrong computations, as we've been running into for years with Julia code (which emits a lot of `trap` + `unreachable` code all over the place).

To work around this, add an `exit` instruction before every `unreachable`, as `ptxas` understands that `exit` terminates the CFG. Note that `trap` is not equivalent, and only future versions of `ptxas` will model it like `exit`. Another alternative would be to emit a branch to the block itself, but emitting `exit` seems like a cleaner solution to represent `unreachable` to me.

Also note that this may not be sufficient, as it's possible that the block with unreachable control flow is branched to from different divergent regions, e.g. after block merging, in which case `ptxas` could still reconstruct a CFG where divergent regions are merged (I haven't confirmed this, but also haven't encountered this pattern in the wild yet):

```
entry:
  // start of divergent region 1
  @%p0 bra cont1;
  @%p1 bra unlikely;
  bra.uni cont1;
cont1:
  // intended end of divergent region 1
  bar.sync 0;
  // start of divergent region 2
  @%p2 bra cont2;
  @%p3 bra unlikely;
  bra.uni cont2;
cont2:
  // intended end of divergent region 2
  bra.uni exit;
unlikely:
  ...
  exit;
exit:
  // possible end of merged divergent region?
```

I originally tried to avoid the above by cloning paths towards `unreachable` and splitting the outgoing edges, but that quickly became too complicated. I propose we go with the simple solution first, also because modern GPUs with more flexible hardware thread schedulers don't even suffer from this issue.

Finally, although I expect this to fix most of https://bugs.llvm.org/show_bug.cgi?id=27738, I do still encounter miscompilations with Julia's unreachable-heavy code when targeting these older GPUs using an older `ptxas` version (specifically, from CUDA 11.4 or below). This is likely due to related bugs in `ptxas` which have been fixed since, as I have filed several reproducers with NVIDIA over the past couple of years. I'm not inclined to look into fixing those issues over here, and will instead recommend that our users upgrade CUDA to 11.5+ when using these GPUs.

Also see:
- JuliaGPU/CUDAnative.jl#4
- JuliaGPU/CUDA.jl#1746
- https://discourse.llvm.org/t/llvm-reordering-blocks-breaks-ptxas-divergence-analysis/71126

Reviewed By: jdoerfert, tra

Differential Revision: https://reviews.llvm.org/D152789
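The block-placement problem described in the commit message can be illustrated with a small standalone model. This is only a sketch under stated assumptions: `Block`, `Terminator`, `reconstructEdges`, and `placedLayout` are hypothetical names invented for the illustration, and the fallthrough rule is the behavior the commit message attributes to `ptxas`, not its actual implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical toy model of how a block ends; not an LLVM or ptxas type.
enum class Terminator { FallsThrough, UniformBranch, Exit, Ret };

// Hypothetical toy basic block in emitted PTX order.
struct Block {
  std::string Name;
  std::vector<std::string> CondTargets; // targets of predicated `bra`
  Terminator Term;                      // how the block ends
  std::string UniTarget;                // used only for UniformBranch
};

using EdgeSet = std::set<std::pair<std::string, std::string>>;

// Rebuild CFG edges from the textual layout alone: branch targets, plus a
// fallthrough edge for any block that does not end control flow explicitly.
EdgeSet reconstructEdges(const std::vector<Block> &Layout) {
  EdgeSet Edges;
  for (std::size_t I = 0; I < Layout.size(); ++I) {
    const Block &B = Layout[I];
    for (const std::string &T : B.CondTargets)
      Edges.insert(std::make_pair(B.Name, T));
    if (B.Term == Terminator::UniformBranch) {
      assert(!B.UniTarget.empty() && "uniform branch needs a target");
      Edges.insert(std::make_pair(B.Name, B.UniTarget));
    } else if (B.Term == Terminator::FallsThrough && I + 1 < Layout.size()) {
      // An `unreachable` leaves nothing in the PTX, so the block appears
      // to fall through into whatever block was placed after it.
      Edges.insert(std::make_pair(B.Name, Layout[I + 1].Name));
    }
  }
  return Edges;
}

// The layout after block placement from the commit message, with the
// terminator of `unlikely` left as a parameter.
std::vector<Block> placedLayout(Terminator UnlikelyTerm) {
  return {
      {"entry", {"cont", "unlikely"}, Terminator::UniformBranch, "cont"},
      {"cont", {}, Terminator::UniformBranch, "exit"},
      {"unlikely", {}, UnlikelyTerm, ""},
      {"exit", {}, Terminator::Ret, ""},
  };
}
```

In this model, calling `reconstructEdges(placedLayout(Terminator::FallsThrough))` yields a spurious `unlikely` to `exit` edge, while `placedLayout(Terminator::Exit)` does not, which is the effect the inserted `exit` instruction is meant to have.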
1 parent 0c5b632 commit 1ee4d88

5 files changed

+155
-0
lines changed

llvm/lib/Target/NVPTX/CMakeLists.txt

+1
@@ -25,6 +25,7 @@ set(NVPTXCodeGen_sources
   NVPTXLowerAggrCopies.cpp
   NVPTXLowerArgs.cpp
   NVPTXLowerAlloca.cpp
+  NVPTXLowerUnreachable.cpp
   NVPTXPeephole.cpp
   NVPTXMCExpr.cpp
   NVPTXPrologEpilogPass.cpp

llvm/lib/Target/NVPTX/NVPTX.h

+1
@@ -47,6 +47,7 @@ MachineFunctionPass *createNVPTXReplaceImageHandlesPass();
 FunctionPass *createNVPTXImageOptimizerPass();
 FunctionPass *createNVPTXLowerArgsPass();
 FunctionPass *createNVPTXLowerAllocaPass();
+FunctionPass *createNVPTXLowerUnreachablePass();
 MachineFunctionPass *createNVPTXPeephole();
 MachineFunctionPass *createNVPTXProxyRegErasurePass();

llvm/lib/Target/NVPTX/NVPTXLowerUnreachable.cpp

+126
@@ -0,0 +1,126 @@
+//===-- NVPTXLowerUnreachable.cpp - Lower unreachables to exit =====--===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+//
+// PTX does not have a notion of `unreachable`, which results in emitted basic
+// blocks having an edge to the next block:
+//
+//   block1:
+//     call @does_not_return();
+//     // unreachable
+//   block2:
+//     // ptxas will create a CFG edge from block1 to block2
+//
+// This may result in significant changes to the control flow graph, e.g., when
+// LLVM moves unreachable blocks to the end of the function. That's a problem
+// in the context of divergent control flow, as `ptxas` uses the CFG to
+// determine divergent regions, and some instructions may not be executed
+// divergently.
+//
+// For example, `bar.sync` is not allowed to be executed divergently on Pascal
+// or earlier. If we start with the following:
+//
+//   entry:
+//     // start of divergent region
+//     @%p0 bra cont;
+//     @%p1 bra unlikely;
+//     ...
+//     bra.uni cont;
+//   unlikely:
+//     ...
+//     // unreachable
+//   cont:
+//     // end of divergent region
+//     bar.sync 0;
+//     bra.uni exit;
+//   exit:
+//     ret;
+//
+// it is transformed by the branch-folder and block-placement passes to:
+//
+//   entry:
+//     // start of divergent region
+//     @%p0 bra cont;
+//     @%p1 bra unlikely;
+//     ...
+//     bra.uni cont;
+//   cont:
+//     bar.sync 0;
+//     bra.uni exit;
+//   unlikely:
+//     ...
+//     // unreachable
+//   exit:
+//     // end of divergent region
+//     ret;
+//
+// After moving the `unlikely` block to the end of the function, it has an edge
+// to the `exit` block, which widens the divergent region and makes the
+// `bar.sync` instruction happen divergently.
+//
+// To work around this, we add an `exit` instruction before every `unreachable`,
+// as `ptxas` understands that exit terminates the CFG. Note that `trap` is not
+// equivalent, and only future versions of `ptxas` will model it like `exit`.
+//
+//===----------------------------------------------------------------------===//
+
+#include "NVPTX.h"
+#include "llvm/IR/Function.h"
+#include "llvm/IR/InlineAsm.h"
+#include "llvm/IR/Instructions.h"
+#include "llvm/IR/Type.h"
+#include "llvm/Pass.h"
+
+using namespace llvm;
+
+namespace llvm {
+void initializeNVPTXLowerUnreachablePass(PassRegistry &);
+}
+
+namespace {
+class NVPTXLowerUnreachable : public FunctionPass {
+  bool runOnFunction(Function &F) override;
+
+public:
+  static char ID; // Pass identification, replacement for typeid
+  NVPTXLowerUnreachable() : FunctionPass(ID) {}
+  StringRef getPassName() const override {
+    return "add an exit instruction before every unreachable";
+  }
+};
+} // namespace
+
+char NVPTXLowerUnreachable::ID = 1;
+
+INITIALIZE_PASS(NVPTXLowerUnreachable, "nvptx-lower-unreachable",
+                "Lower Unreachable", false, false)
+
+// =============================================================================
+// Main function for this pass.
+// =============================================================================
+bool NVPTXLowerUnreachable::runOnFunction(Function &F) {
+  if (skipFunction(F))
+    return false;
+
+  LLVMContext &C = F.getContext();
+  FunctionType *ExitFTy = FunctionType::get(Type::getVoidTy(C), false);
+  InlineAsm *Exit = InlineAsm::get(ExitFTy, "exit;", "", true);
+
+  bool Changed = false;
+  for (auto &BB : F)
+    for (auto &I : BB) {
+      if (auto unreachableInst = dyn_cast<UnreachableInst>(&I)) {
+        Changed = true;
+        CallInst::Create(ExitFTy, Exit, "", unreachableInst);
+      }
+    }
+  return Changed;
+}
+
+FunctionPass *llvm::createNVPTXLowerUnreachablePass() {
+  return new NVPTXLowerUnreachable();
+}
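The effect of `runOnFunction` on a function body can be sketched without the LLVM API. The following is an illustrative model only: `ToyBlock` and `lowerUnreachable` are hypothetical names, and instructions are plain strings rather than LLVM `Instruction` objects. Like the pass, it inserts an `exit;` immediately before each `unreachable` and reports whether anything changed.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a basic block: instruction text in program order.
using ToyBlock = std::vector<std::string>;

// Insert "exit;" directly before every "unreachable" in every block.
// Returns true if at least one instruction was inserted.
bool lowerUnreachable(std::vector<ToyBlock> &Function) {
  bool Changed = false;
  for (ToyBlock &BB : Function) {
    for (std::size_t I = 0; I < BB.size(); ++I) {
      if (BB[I] != "unreachable")
        continue;
      BB.insert(BB.begin() + static_cast<std::ptrdiff_t>(I), "exit;");
      ++I; // step past the "exit;" so I names the shifted "unreachable"
      assert(BB[I] == "unreachable" && "insertion must precede unreachable");
      Changed = true;
    }
  }
  return Changed;
}
```

As with the pass in this commit, the sketch is not idempotent: running it twice over the same body inserts a second `exit;`. The `unreachable` itself is kept, so the block still has a terminator at the IR level.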

llvm/lib/Target/NVPTX/NVPTXTargetMachine.cpp

+4
@@ -72,6 +72,7 @@ void initializeNVPTXAtomicLowerPass(PassRegistry &);
 void initializeNVPTXCtorDtorLoweringLegacyPass(PassRegistry &);
 void initializeNVPTXLowerAggrCopiesPass(PassRegistry &);
 void initializeNVPTXLowerAllocaPass(PassRegistry &);
+void initializeNVPTXLowerUnreachablePass(PassRegistry &);
 void initializeNVPTXCtorDtorLoweringLegacyPass(PassRegistry &);
 void initializeNVPTXLowerArgsPass(PassRegistry &);
 void initializeNVPTXProxyRegErasurePass(PassRegistry &);
@@ -98,6 +99,7 @@ extern "C" LLVM_EXTERNAL_VISIBILITY void LLVMInitializeNVPTXTarget() {
   initializeNVPTXAtomicLowerPass(PR);
   initializeNVPTXLowerArgsPass(PR);
   initializeNVPTXLowerAllocaPass(PR);
+  initializeNVPTXLowerUnreachablePass(PR);
   initializeNVPTXCtorDtorLoweringLegacyPass(PR);
   initializeNVPTXLowerAggrCopiesPass(PR);
   initializeNVPTXProxyRegErasurePass(PR);
@@ -400,6 +402,8 @@ void NVPTXPassConfig::addIRPasses() {
     addPass(createLoadStoreVectorizerPass());
     addPass(createSROAPass());
   }
+
+  addPass(createNVPTXLowerUnreachablePass());
 }
 
 bool NVPTXPassConfig::addInstSelector() {
+23
@@ -0,0 +1,23 @@
+; RUN: llc < %s -march=nvptx -mcpu=sm_20 -verify-machineinstrs | FileCheck %s
+; RUN: llc < %s -march=nvptx64 -mcpu=sm_20 -verify-machineinstrs | FileCheck %s
+; RUN: %if ptxas && !ptxas-12.0 %{ llc < %s -march=nvptx -mcpu=sm_20 -verify-machineinstrs | %ptxas-verify %}
+; RUN: %if ptxas %{ llc < %s -march=nvptx64 -mcpu=sm_20 -verify-machineinstrs | %ptxas-verify %}
+
+; CHECK: .extern .func throw
+declare void @throw() #0
+
+; CHECK: .entry kernel_func
+define void @kernel_func() {
+; CHECK: call.uni
+; CHECK: throw,
+  call void @throw()
+; CHECK: exit
+  unreachable
+}
+
+attributes #0 = { noreturn }
+
+
+!nvvm.annotations = !{!1}
+
+!1 = !{ptr @kernel_func, !"kernel", i32 1}
