Unreachable control flow leads to illegal divergent barriers #1746
Unsure what's happening here; the log doesn't contain anything useful except for test failures. It would be good to narrow this down to the actual operations that fail (esp. the
What can I do to help? Detailed instructions would be very helpful. Calculations in a third-party package that uses the GPU via
Hmm, actually one of the failures seems to point to JuliaGPU/CUDAnative.jl#4
This has been a longstanding issue, caused by bugs in ptxas which NVIDIA apparently doesn't manage to fix. The test there was meant to detect exactly such issues (lines 69 to 115 at f85dd7b).
For your old device, we already set all the workarounds we know of: https://github.com/JuliaGPU/CUDA.jl/blob/master/src/compiler/gpucompiler.jl#L21-L35. Bottom line: if you're using mapreduce kernels, or a combination of shared memory and code that may exit early (e.g. because of exceptions), then no, the calculations cannot be trusted, even though it's unlikely you're running into this issue with arbitrary code. If I have some time I'll grab the oldest GPU I have lying around to see if I can reproduce this, but in the meantime using a more recent GPU may be the best option.
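For readers wondering whether their own device falls in the affected range, here is a minimal sketch of how to check; the `v"7.0"` cutoff is only a rough assumption based on this thread (pre-Volta parts with the less flexible hardware scheduler), not an exact boundary:

```julia
using CUDA

dev = CUDA.device()
cap = CUDA.capability(dev)   # compute capability as a VersionNumber, e.g. v"3.7"
println(CUDA.name(dev), " has compute capability ", cap)

# Rough heuristic only: the failures discussed here were observed on pre-Volta parts.
if cap < v"7.0"
    @warn "Older architecture: kernels mixing shared memory and early exits may miscompile"
end
```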
One thing you could try is upgrading CUDA.jl so that the 11.8 or even 12.0 compiler is used (#1742), although I don't think NVIDIA has touched ptxas for sm_37 in those releases (this architecture is deprecated and slated for removal).
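In case it helps, a sketch of how to switch the toolkit that CUDA.jl uses, assuming CUDA.jl 4.x where `set_runtime_version!` is available (the exact version string is just an example):

```julia
using CUDA

# Request a newer artifact-provided toolkit (and thus a newer ptxas).
# Takes effect after restarting Julia; verify with CUDA.versioninfo().
CUDA.set_runtime_version!(v"11.8")
```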
#1660 looks like another instance of this, but on more recent hardware (sm_75)...
Thank you @maleadt for your response! I tried to test
Those failures are probably not an issue, unless you rely on that specific CUSPARSE functionality. If so, do test the latest release of CUDA.jl 4.0, since much has changed. If it still happens, please open a separate issue; let's keep the current one about the unstructured-CFG-related codegen issue.

FWIW, I can reproduce on my old sm_35 GPU. I'll try to reduce it; however, it's unlikely that NVIDIA will fix this, as sm_35/sm_37 are officially unsupported as of CUDA 12.0.
The following MWE, extracted from the CUDA.jl test suite, fails for me on Kepler hardware (sm_35; a GTX Titan) using Julia 1.9, not even using

```julia
using CUDA, Test

@noinline function throw_some()
    throw(42)
    return
end

@inbounds function kernel(input, output, n)
    i = threadIdx().x
    temp = CuStaticSharedArray(Int, 1)
    if i == 1
        1 <= n || throw_some()
        temp[1] = input
    end
    sync_threads()
    1 <= n || throw_some()
    unsafe_store!(output, temp[1], i)
    return
end

function gpu(input)
    output = CuArray(zeros(eltype(input), 2))
    ptr = pointer(output)
    ptr = reinterpret(Ptr{eltype(input)}, ptr)
    @cuda threads=2 kernel(input, ptr, 99)
    return Array(output)
end

function cpu(input)
    output = zeros(eltype(input), 2)
    for j in 1:2
        @inbounds output[j] = input
    end
    return output
end

input = rand(1:100)
@test cpu(input) == gpu(input)
```

Would be good if other people can confirm.

EDIT: can also reproduce this on a GTX 970 (sm_52), using the same driver (470.182.3, for CUDA 11.4, but using

EDIT: also reproduces on the GTX 970 using 530.41.3 (CUDA 12.1) -- I was only using driver 470 because that's the latest supporting my GTX Titan.
On a 1050 mobile GPU:
```c
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include <time.h>
#include <cuda.h>

int main(int argc, char *argv[]) {
    if (argc != 2) {
        printf("Usage: %s <PTX file>\n", argv[0]);
        exit(EXIT_FAILURE);
    }

    CUdevice cuDevice;
    CUcontext cuContext;
    CUmodule cuModule;
    CUfunction cuKernel;
    CUdeviceptr d_output;
    int threads = 2, n = 99;
    int h_output[2];

    // Initialize the CUDA driver API
    assert(cuInit(0) == CUDA_SUCCESS);
    assert(cuDeviceGet(&cuDevice, 0) == CUDA_SUCCESS);
    assert(cuCtxCreate(&cuContext, 0, cuDevice) == CUDA_SUCCESS);

    // Load the PTX file
    assert(cuModuleLoad(&cuModule, argv[1]) == CUDA_SUCCESS);
    assert(cuModuleGetFunction(&cuKernel, cuModule, "kernel") == CUDA_SUCCESS);

    // Allocate GPU memory for the output
    assert(cuMemAlloc(&d_output, sizeof(int) * 2) == CUDA_SUCCESS);

    // Set up the kernel parameters
    void *kernel_params[] = {&d_output, &n};

    // Launch the kernel
    assert(cuLaunchKernel(cuKernel, 1, 1, 1, threads, 1, 1, 0, NULL, kernel_params, NULL) == CUDA_SUCCESS);

    // Copy the output back to the host
    assert(cuMemcpyDtoH(h_output, d_output, sizeof(int) * 2) == CUDA_SUCCESS);

    // Print the output
    printf("Output: %d %d\n", h_output[0], h_output[1]);

    // Clean up
    cuMemFree(d_output);
    cuModuleUnload(cuModule);
    cuCtxDestroy(cuContext);
    return 0;
}
```
Note the difference in
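For anyone wanting to reproduce this outside of Julia: a rough sketch of how PTX for the MWE can be generated with CUDA.jl's reflection macros, for feeding to the harness above. The directory name is arbitrary, and note that the kernel's entry point in the generated PTX will be name-mangled, so the literal `"kernel"` string passed to `cuModuleGetFunction` may need adjusting:

```julia
using CUDA

# With the MWE from above already included, dump the PTX for the kernel launch
# into a directory so it can be passed to the stand-alone harness.
@device_code_ptx dir="ptx" gpu(rand(1:100))
```

The harness itself can then be built with something like `cc harness.c -o harness -lcuda` and run as `./harness ptx/<file>.ptx` (file names assumed).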
Filed with NVIDIA as bug 4078847.
The Julia MWE fails on sm_52 for me as well, with NVIDIA driver 530.41.3 and CUDA 12.1. Guess this is one way NVIDIA will get me to upgrade 😞.
CUDA 12 still supports Pascal AFAIK, so maybe it could get fixed in 12.2 or 12.3.
I'm able to replicate the bug on Julia 1.9-rc2, with:
I see it on Julia 1.8.5, too:

```julia
julia> include("..\\testbug.jl")
ERROR: LoadError: AssertionError: sum(a) ≈ 24
Stacktrace:
 [1] top-level scope
   @ dev\CliMA\testbug.jl:61
 [2] include(fname::String)
   @ Base.MainInclude .\client.jl:476
 [3] top-level scope
   @ REPL[5]:1
 [4] top-level scope
   @ .julia\packages\CUDA\Ey3w2\src\initialization.jl:52
in expression starting at dev\CliMA\testbug.jl:61

julia> versioninfo()
Julia Version 1.8.5
Commit 17cfb8e65e (2023-01-08 06:45 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 8 × Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, skylake)
  Threads: 1 on 8 virtual cores
Environment:
  JULIA_PKG_DEVDIR = dev/julia

julia> CUDA.versioninfo()
CUDA toolkit 11.7, artifact installation
Unknown NVIDIA driver, for CUDA 12.0
CUDA driver 12.0

Libraries:
- CUBLAS: 11.10.1
- CURAND: 10.2.10
- CUFFT: 10.7.1
- CUSOLVER: 11.3.5
- CUSPARSE: 11.7.3
- CUPTI: 17.0.0
- NVML: missing
- CUDNN: 8.30.2 (for CUDA 11.5.0)
- CUTENSOR: 1.4.0 (for CUDA 11.5.0)

Toolchain:
- Julia: 1.8.5
- LLVM: 13.0.1
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0, 7.1, 7.2
- Device capability support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80, sm_86

1 device:
  0: NVIDIA GeForce GTX 1050 (sm_61, 3.263 GiB / 4.000 GiB available)
```
I heard back from NVIDIA. The issue is with the code we generate. Simplifying the MWE above:
This is indeed mentioned in the PTX ISA: https://docs.nvidia.com/cuda/parallel-thread-execution/#parallel-synchronization-and-communication-instructions-bar:
In Julia's case, we can have exceptions everywhere, resulting in branches to blocks outside of divergent regions, which causes reconvergence issues. NVIDIA mentioned that they will look into special-casing
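To make the "exceptions everywhere" point concrete, a small sketch (the kernel and its name are hypothetical, made up for illustration) showing where such branches come from: a plain bounds-checked store already carries an exception path ending in `unreachable`, which CUDA.jl's reflection macros make visible:

```julia
using CUDA

function store_one!(a)           # hypothetical example kernel
    i = threadIdx().x
    a[i] = 1                     # the bounds check emits an exception branch + unreachable
    return
end

a = CUDA.zeros(Int, 2)
# Inspect the optimized LLVM IR and the PTX that ptxas will consume;
# look for the exception blocks / `unreachable` terminators.
@device_code_llvm @cuda threads=2 store_one!(a)
@device_code_ptx  @cuda threads=2 store_one!(a)
```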
Thanks for the update!
FYI, we saw a similar problem, but in reverse (i.e. the results were correct on a P100, but incorrect on a V100), when using the CUDA 11.3 runtime. It appears to be fixed by upgrading to CUDA 11.8.
Posted on the LLVM Discourse: https://discourse.llvm.org/t/llvm-reordering-blocks-breaks-ptxas-divergence-analysis/71126
I've implemented a workaround in #1942. It solves the issue seen here by disabling the problematic LLVM passes. That's not great, as disabling those passes also affects CPU codegen, and it is likely not a complete solution. But it's better than nothing for now.
Another, better workaround: JuliaGPU/GPUCompiler.jl#467
I think I have a fix: #1951. Everyone who has encountered this issue, please test out that PR. It requires a newly-released version of GPUCompiler (v0.21), so be sure to update your environment. Although the fix seems to cover the remaining instances of this bug (while simplifying the quirks we had in place), it does require a sufficiently recent version of the CUDA toolkit. Anything below CUDA 11.5 is unsupported, and will run into miscompilations (due to bugs in
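Roughly, testing the fix could look like the following sketch (the reproducer file name is whatever you saved the MWE from this thread as):

```julia
using Pkg
Pkg.update()                      # should pull in GPUCompiler v0.21 and the updated CUDA.jl
Pkg.status("GPUCompiler")

using CUDA
CUDA.versioninfo()                # the bundled toolkit should report >= 11.5
include("testbug.jl")             # the MWE from earlier in this thread
```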
It passes where it used to fail on my 1050.
…ruct the CFG.

PTX does not have a notion of `unreachable`, which results in emitted basic blocks having an edge to the next block:

```
block1:
  call @does_not_return();
  // unreachable
block2:
  // ptxas will create a CFG edge from block1 to block2
```

This may result in significant changes to the control flow graph, e.g., when LLVM moves unreachable blocks to the end of the function. That's a problem in the context of divergent control flow, as `ptxas` uses the CFG to determine divergent regions, while some instructions may not be executed divergently. For example, `bar.sync` is not allowed to be executed divergently on Pascal or earlier.

If we start with the following:

```
entry:
  // start of divergent region
  @%p0 bra cont;
  @%p1 bra unlikely;
  ...
  bra.uni cont;
unlikely:
  ...
  // unreachable
cont:
  // end of divergent region
  bar.sync 0;
  bra.uni exit;
exit:
  ret;
```

it is transformed by the branch-folder and block-placement passes to:

```
entry:
  // start of divergent region
  @%p0 bra cont;
  @%p1 bra unlikely;
  ...
  bra.uni cont;
cont:
  bar.sync 0;
  bra.uni exit;
unlikely:
  ...
  // unreachable
exit:
  // end of divergent region
  ret;
```

After moving the `unlikely` block to the end of the function, it has an edge to the `exit` block, which widens the divergent region and makes the `bar.sync` instruction happen divergently. That causes wrong computations, as we've been running into for years with Julia code (which emits a lot of `trap` + `unreachable` code all over the place).

To work around this, add an `exit` instruction before every `unreachable`, as `ptxas` understands that exit terminates the CFG. Note that `trap` is not equivalent, and only future versions of `ptxas` will model it like `exit`. Another alternative would be to emit a branch to the block itself, but emitting `exit` seems like a cleaner solution to represent `unreachable` to me.

Also note that this may not be sufficient, as it's possible that the block with unreachable control flow is branched to from different divergent regions, e.g. after block merging, in which case it may still be the case that `ptxas` could reconstruct a CFG where divergent regions are merged (I haven't confirmed this, but also haven't encountered this pattern in the wild yet):

```
entry:
  // start of divergent region 1
  @%p0 bra cont1;
  @%p1 bra unlikely;
  bra.uni cont1;
cont1:
  // intended end of divergent region 1
  bar.sync 0;
  // start of divergent region 2
  @%p2 bra cont2;
  @%p3 bra unlikely;
  bra.uni cont2;
cont2:
  // intended end of divergent region 2
  bra.uni exit;
unlikely:
  ...
  exit;
exit:
  // possible end of merged divergent region?
```

I originally tried to avoid the above by cloning paths towards `unreachable` and splitting the outgoing edges, but that quickly became too complicated. I propose we go with the simple solution first, also because modern GPUs with more flexible hardware thread schedulers don't even suffer from this issue.

Finally, although I expect this to fix most of https://bugs.llvm.org/show_bug.cgi?id=27738, I do still encounter miscompilations with Julia's unreachable-heavy code when targeting these older GPUs using an older `ptxas` version (specifically, from CUDA 11.4 or below). This is likely due to related bugs in `ptxas` which have been fixed since, as I have filed several reproducers with NVIDIA over the past couple of years. I'm not inclined to look into fixing those issues over here, and will instead be recommending our users to upgrade CUDA to 11.5+ when using these GPUs.

Also see:
- JuliaGPU/CUDAnative.jl#4
- JuliaGPU/CUDA.jl#1746
- https://discourse.llvm.org/t/llvm-reordering-blocks-breaks-ptxas-divergence-analysis/71126

Reviewed By: jdoerfert, tra

Differential Revision: https://reviews.llvm.org/D152789
Test run `] test CUDA` finished with errors. The full log file is attached.

Details on Julia:

Details on CUDA:

test_report.txt