
Unreachable control flow leads to illegal divergent barriers #1746

Closed
parfenyev opened this issue Jan 28, 2023 · 23 comments · Fixed by #1951
Labels: bug (Something isn't working), needs information (Further information is requested)

Comments

@parfenyev commented Jan 28, 2023

Running ] test CUDA finished with errors. The full log file is attached.

Details on Julia:

Julia Version 1.8.5
Commit 17cfb8e65ea (2023-01-08 06:45 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 12 × Intel(R) Core(TM) i5-10600 CPU @ 3.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, skylake)
  Threads: 12 on 12 virtual cores

Details on CUDA:

CUDA toolkit 11.7, artifact installation
NVIDIA driver 470.161.3, for CUDA 11.4
CUDA driver 11.4

Libraries:
- CUBLAS: 11.10.1
- CURAND: 10.2.10
- CUFFT: 10.7.2
- CUSOLVER: 11.3.5
- CUSPARSE: 11.7.3
- CUPTI: 17.0.0
- NVML: 11.0.0+470.161.3
- CUDNN: 8.30.2 (for CUDA 11.5.0)
- CUTENSOR: 1.4.0 (for CUDA 11.5.0)

Toolchain:
- Julia: 1.8.5
- LLVM: 13.0.1
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0, 7.1, 7.2
- Device capability support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80, sm_86

2 devices:
  0: Tesla K80 (sm_37, 11.170 GiB / 11.173 GiB available)
  1: Tesla K80 (sm_37, 11.109 GiB / 11.173 GiB available)

test_report.txt

parfenyev added the bug label on Jan 28, 2023
@maleadt (Member) commented Jan 31, 2023

Unsure what's happening here; the log doesn't contain anything useful beyond the test failures. It would be good to narrow this down to the actual operations that fail (the mapreduce failures in particular are unexpected).

maleadt added the needs information label on Jan 31, 2023
@parfenyev (Author) commented:

What can I do to help? Detailed instructions would be very helpful.

Calculations in a third-party package that uses the GPU via CUDA.jl work as expected. Can these calculations be trusted even though not all tests pass?
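One way to help narrow this down might be to re-run only the suspect test sets. A rough sketch (the test-set name and the argument handling are assumptions and may differ between CUDA.jl versions):

```julia
# Hypothetical invocation: recent CUDA.jl test harnesses forward test_args
# to the runner, which can select individual test sets by name.
# "codegen" is a guess at the relevant test-set name.
using Pkg
Pkg.test("CUDA"; test_args=["codegen"])
```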

@maleadt (Member) commented Jan 31, 2023

Hmm, actually one of the failures seems to point to JuliaGPU/CUDAnative.jl#4

codegen: Test Failed at /home/students/parfeniev/.julia/packages/CUDA/BbliS/test/codegen.jl:112
  Expression: cpu(input) == gpu(input)
   Evaluated: [8, 8] == [8, 1076158697]

This has been a longstanding issue, caused by bugs in ptxas which NVIDIA apparently doesn't manage to fix. The test there was meant to detect exactly such issues:

CUDA.jl/test/codegen.jl, lines 69 to 115 in f85dd7b:

@testset "ptxas-compatible control flow" begin
    @noinline function throw_some()
        throw(42)
        return
    end

    @inbounds function kernel(input, output, n)
        i = threadIdx().x

        temp = CuStaticSharedArray(Int, 1)
        if i == 1
            1 <= n || throw_some()
            temp[1] = input
        end
        sync_threads()

        1 <= n || throw_some()
        unsafe_store!(output, temp[1], i)

        return
    end

    function gpu(input)
        output = CuArray(zeros(eltype(input), 2))
        ptr = pointer(output)
        ptr = reinterpret(Ptr{eltype(input)}, ptr)

        @cuda threads=2 kernel(input, ptr, 99)

        return Array(output)
    end

    function cpu(input)
        output = zeros(eltype(input), 2)

        for j in 1:2
            @inbounds output[j] = input
        end

        return output
    end

    input = rand(1:100)
    @test cpu(input) == gpu(input)
end
end

In practice it triggers with mapreduce kernels, which would explain the plethora of failures you're seeing.

For your old device we already apply all the workarounds we know of: https://github.com/JuliaGPU/CUDA.jl/blob/master/src/compiler/gpucompiler.jl#L21-L35. ptxas for newer devices appears to use a different codebase/compiler stack, which doesn't exhibit this issue.

Bottom line: if you're using mapreduce kernels, or a combination of shared memory and code that may exit early (e.g. because of exceptions), then no, the calculations cannot be trusted, even though it's unlikely you're running into this issue for arbitrary code. If I have some time I'll grab the oldest GPU I have lying around to see if I can reproduce this, but in the meantime using a more recent GPU may be the best option.

maleadt changed the title from "CUDA Test failed" to "Shared memory + multiple function exits cause invalid results on sm_37" on Jan 31, 2023
@maleadt (Member) commented Jan 31, 2023

CUDA toolkit 11.7, artifact installation

One thing you could try is upgrading CUDA.jl so that the 11.8 or even 12.0 compiler is used (#1742), although I don't think NVIDIA has touched ptxas for sm_37 in those releases (the architecture is deprecated and slated for removal).
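For reference, a rough sketch of what that upgrade might look like, assuming a CUDA.jl release that provides CUDA.set_runtime_version! (older releases selected the toolkit through the JULIA_CUDA_VERSION environment variable instead):

```julia
using Pkg
Pkg.update("CUDA")                    # move to the latest CUDA.jl release

using CUDA
CUDA.set_runtime_version!(v"11.8")    # or v"12.0"; takes effect after restarting Julia
# after the restart, CUDA.versioninfo() should report the requested toolkit
```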

maleadt changed the title from "Shared memory + multiple function exits cause invalid results on sm_37" to "Shared memory + multiple function exits cause invalid results" on Feb 1, 2023
@maleadt (Member) commented Feb 1, 2023

#1660 looks like another instance of this, but on more recent hardware (sm_75)...

@parfenyev (Author) commented:

Thank you @maleadt for your response! I ran ] test CUDA on a more modern GPU, and now there are far fewer errors. Can you comment on what is wrong here and how critical it is for the calculations?

Julia Version 1.8.5
Commit 17cfb8e65ea (2023-01-08 06:45 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 96 × AMD EPYC 7443 24-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, znver3)
  Threads: 1 on 96 virtual cores
Environment:
  LD_LIBRARY_PATH = /opt/AMD/aocl//4.0/lib:
CUDA toolkit 11.7, artifact installation
NVIDIA driver 525.60.13, for CUDA 12.0
CUDA driver 12.0

Libraries:
- CUBLAS: 11.10.1
- CURAND: 10.2.10
- CUFFT: 10.7.2
- CUSOLVER: 11.3.5
- CUSPARSE: 11.7.3
- CUPTI: 17.0.0
- NVML: 12.0.0+525.60.13
- CUDNN: 8.30.2 (for CUDA 11.5.0)
- CUTENSOR: 1.4.0 (for CUDA 11.5.0)

Toolchain:
- Julia: 1.8.5
- LLVM: 13.0.1
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0, 7.1, 7.2
- Device capability support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80, sm_86

4 devices:
  0: NVIDIA A100-SXM4-80GB (sm_80, 79.181 GiB / 80.000 GiB available)
  1: NVIDIA A100-SXM4-80GB (sm_80, 79.181 GiB / 80.000 GiB available)
  2: NVIDIA A100-SXM4-80GB (sm_80, 79.181 GiB / 80.000 GiB available)
  3: NVIDIA A100-SXM4-80GB (sm_80, 79.181 GiB / 80.000 GiB available)

test_report_v2.txt

@maleadt (Member) commented Feb 2, 2023

Those failures are probably not an issue, unless you rely on that specific CUSPARSE functionality. If so, do test the latest release of CUDA.jl 4.0, since much has changed. If it still happens, please open a separate issue.

Let's keep the current issue about the unstructured-CFG-related codegen problem. FWIW, I can reproduce on my old sm_35 GPU. I'll try to reduce it; however, it's unlikely that NVIDIA will fix this, as sm_35/sm_37 are officially unsupported on CUDA 12.0.

@maleadt (Member) commented Apr 19, 2023

The following MWE, extracted from the CUDA.jl test suite, fails for me on Kepler hardware (sm_35; a GTX Titan) using Julia 1.9, without even needing --check-bounds=yes:

using CUDA, Test

@noinline function throw_some()
    throw(42)
    return
end

@inbounds function kernel(input, output, n)
    i = threadIdx().x

    temp = CuStaticSharedArray(Int, 1)
    if i == 1
        1 <= n || throw_some()
        temp[1] = input
    end
    sync_threads()

    1 <= n || throw_some()
    unsafe_store!(output, temp[1], i)

    return
end

function gpu(input)
    output = CuArray(zeros(eltype(input), 2))
    ptr = pointer(output)
    ptr = reinterpret(Ptr{eltype(input)}, ptr)

    @cuda threads=2 kernel(input, ptr, 99)

    return Array(output)
end

function cpu(input)
    output = zeros(eltype(input), 2)

    for j in 1:2
        @inbounds output[j] = input
    end

    return output
end

input = rand(1:100)
@test cpu(input) == gpu(input)

Would be good if other people can confirm.

EDIT: can also reproduce this on a GTX 970 (sm_52), using the same driver (470.182.3, for CUDA 11.4, but using ptxas from 11.8).

EDIT: also reproduces on the GTX 970 using 530.41.3 (CUDA 12.1) -- I was only using driver 470 because that's the latest supporting my GTX Titan.

@christiangnrd (Member) commented:

On a 1050 mobile GPU:

julia> @test cpu(input) == gpu(input)
Test Failed at REPL[17]:1
  Expression: cpu(input) == gpu(input)
   Evaluated: [72, 72] == [72, 0]

ERROR: There was an error during testing

julia> versioninfo()
Julia Version 1.9.0-rc2
Commit 72aec423c2a (2023-04-01 10:41 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 8 × Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, skylake)
  Threads: 8 on 8 virtual cores
Environment:
  JULIA_NUM_THREADS = auto

julia> CUDA.versioninfo()
CUDA runtime 12.1, artifact installation
CUDA driver 12.0
NVIDIA driver 525.105.17

Libraries: 
- CUBLAS: 12.1.0
- CURAND: 10.3.2
- CUFFT: 11.0.2
- CUSOLVER: 11.4.4
- CUSPARSE: 12.0.2
- CUPTI: 18.0.0
- NVML: 12.0.0+525.105.17

Toolchain:
- Julia: 1.9.0-rc2
- LLVM: 14.0.6
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0, 7.1, 7.2, 7.3, 7.4, 7.5
- Device capability support: sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80, sm_86

1 device:
  0: NVIDIA GeForce GTX 1050 (sm_61, 3.104 GiB / 4.000 GiB available)

@maleadt (Member) commented Apr 19, 2023

mwe.ptx

.version 6.3
.target sm_52
.address_size 64

.visible .entry kernel(
    .param .u64 output_ptr,
    .param .u32 n
)
{
    .reg .pred %p<2>;
    .reg .b32 %r<8>;
    .reg .b64 %rd<5>;

    .shared .align 32 .b8 shmem[4];

    // load the thread idx, check if we're thread 0
    mov.u32 %r1, %tid.x;
    setp.ne.s32 %p0, %r1, 0;

    // something that used to be an OOB check
    ld.param.u32 %r2, [n];
    setp.lt.s32 %p1, %r2, 1;

    @%p0 bra load_shmem;

    @%p1 bra oob1;

    bra.uni set_shmem;

    // if we set the oob1 block here, we get the expected output:
oob1:
    trap;

set_shmem:
    // thread 0 initializes the shmem
    st.shared.u32 [shmem], 42;

load_shmem:
    bar.sync 0;

    // all threads load from shmem and store to global memory
    @%p1 bra oob2;
    ld.param.u64 %rd1, [output_ptr];
    ld.shared.u32 %r3, [shmem];
    mul.wide.u32 %rd3, %r1, 4;
    add.s64 %rd4, %rd1, %rd3;
    shr.u32 %r5, %r3, 24;
    st.u8 [%rd4+3], %r5;
    shr.u32 %r6, %r3, 16;
    st.u8 [%rd4+2], %r6;
    shr.u32 %r7, %r3, 8;
    st.u8 [%rd4+1], %r7;
    st.u8 [%rd4], %r3;
    ret;

    // if we set the oob1 block here, we get a miscompile
oob1:
    trap;

oob2:
    trap;
}

loader.c

#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include <time.h>
#include <cuda.h>

int main(int argc, char *argv[]) {
    if (argc != 2) {
        printf("Usage: %s <PTX file>\n", argv[0]);
        exit(EXIT_FAILURE);
    }

    CUdevice cuDevice;
    CUcontext cuContext;
    CUmodule cuModule;
    CUfunction cuKernel;
    CUdeviceptr d_output;

    int threads = 2, n = 99;
    int h_output[2];

    // Initialize the CUDA driver API
    assert(cuInit(0) == CUDA_SUCCESS);
    assert(cuDeviceGet(&cuDevice, 0) == CUDA_SUCCESS);
    assert(cuCtxCreate(&cuContext, 0, cuDevice) == CUDA_SUCCESS);

    // Load the PTX file
    assert(cuModuleLoad(&cuModule, argv[1]) == CUDA_SUCCESS);
    assert(cuModuleGetFunction(&cuKernel, cuModule, "kernel") == CUDA_SUCCESS);

    // Allocate GPU memory for the output
    assert(cuMemAlloc(&d_output, sizeof(int) * 2) == CUDA_SUCCESS);

    // Set up the kernel parameters
    void *kernel_params[] = {&d_output, &n};

    // Launch the kernel
    assert(cuLaunchKernel(cuKernel, 1, 1, 1, threads, 1, 1, 0, NULL, kernel_params, NULL) == CUDA_SUCCESS);

    // Copy the output back to the host
    assert(cuMemcpyDtoH(h_output, d_output, sizeof(int) * 2) == CUDA_SUCCESS);

    // Print the output
    printf("Output: %d %d\n", h_output[0], h_output[1]);

    // Clean up
    cuMemFree(d_output);
    cuModuleUnload(cuModule);
    cuCtxDestroy(cuContext);

    return 0;
}
❯ gcc loader.c -isystem /opt/cuda/include -lcuda -o loader && (./loader good.ptx; ./loader bad.ptx)
Output: 42 42
Output: 42 0

Note the difference in oob1 placement. This looks really bad. I can't see us working around it...

@maleadt (Member) commented Apr 19, 2023

Filed with NVIDIA as bug 4078847.

@lcw (Contributor) commented Apr 19, 2023

Would be good if other people can confirm.

The julia MWE fails on sm_52 for me as well with NVIDIA driver 530.41.3 and CUDA 12.1. Guess this is one way NVIDIA will get me to upgrade 😞.

❯ julia --project
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.9.0-rc2 (2023-04-01)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |

julia> include("bad.jl")
Test Failed at /tmp/scratch/lucas/bug/bad.jl:45
  Expression: cpu(input) == gpu(input)
   Evaluated: [84, 84] == [84, 0]

ERROR: LoadError: There was an error during testing
in expression starting at /tmp/scratch/lucas/bug/bad.jl:45

julia> versioninfo()
Julia Version 1.9.0-rc2
Commit 72aec423c2a (2023-04-01 10:41 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 32 × AMD Ryzen 9 7950X 16-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, znver3)
  Threads: 1 on 32 virtual cores

julia> CUDA.versioninfo()
CUDA runtime 12.1, artifact installation
CUDA driver 12.1
NVIDIA driver 530.41.3

Libraries: 
- CUBLAS: 12.1.0
- CURAND: 10.3.2
- CUFFT: 11.0.2
- CUSOLVER: 11.4.4
- CUSPARSE: 12.0.2
- CUPTI: 18.0.0
- NVML: 12.0.0+530.41.3

Toolchain:
- Julia: 1.9.0-rc2
- LLVM: 14.0.6
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0, 7.1, 7.2, 7.3, 7.4, 7.5
- Device capability support: sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80, sm_86

1 device:
  0: NVIDIA GeForce GTX TITAN X (sm_52, 10.411 GiB / 12.000 GiB available)

(bug) pkg> st --manifest
Status `/tmp/scratch/lucas/bug/Manifest.toml`
  [621f4979] AbstractFFTs v1.3.1
  [79e6a3ab] Adapt v3.6.1
  [a9b6321e] Atomix v0.1.0
  [ab4f0b2a] BFloat16s v0.4.2
  [fa961155] CEnum v0.4.2
  [052768ef] CUDA v4.1.4
  [1af6417a] CUDA_Runtime_Discovery v0.2.2
  [d360d2e6] ChainRulesCore v1.15.7
  [9e997f8a] ChangesOfVariables v0.1.6
  [34da2185] Compat v4.6.1
  [ffbed154] DocStringExtensions v0.9.3
  [e2ba6199] ExprTools v0.1.9
  [0c68f7d7] GPUArrays v8.6.6
  [46192b85] GPUArraysCore v0.1.4
  [61eb1bfa] GPUCompiler v0.19.2
  [3587e190] InverseFunctions v0.1.8
  [92d709cd] IrrationalConstants v0.2.2
  [692b3bcd] JLLWrappers v1.4.1
  [63c18a36] KernelAbstractions v0.9.2
  [929cbde3] LLVM v5.0.0
  [2ab3a3ac] LogExpFunctions v0.3.23
  [1914dd2f] MacroTools v0.5.10
  [21216c6a] Preferences v1.3.0
  [74087812] Random123 v1.6.0
  [e6cf234a] RandomNumbers v1.5.3
  [189a3867] Reexport v1.2.2
  [ae029012] Requires v1.3.0
  [6c6a2e73] Scratch v1.2.0
  [66db9d55] SnoopPrecompile v1.0.3
  [276daf66] SpecialFunctions v2.2.0
  [90137ffa] StaticArrays v1.5.21
  [1e83bf80] StaticArraysCore v1.4.0
  [a759f4b9] TimerOutputs v0.5.22
  [013be700] UnsafeAtomics v0.2.1
  [d80eeb9a] UnsafeAtomicsLLVM v0.1.2
  [4ee394cb] CUDA_Driver_jll v0.5.0+1
  [76a88914] CUDA_Runtime_jll v0.5.0+2
  [dad2f222] LLVMExtra_jll v0.0.21+0
  [efe28fd5] OpenSpecFun_jll v0.5.5+0
  [0dad84c5] ArgTools v1.1.1
  [56f22d72] Artifacts
  [2a0f44e3] Base64
  [ade2ca70] Dates
  [f43a241f] Downloads v1.6.0
  [7b1f6079] FileWatching
  [b77e0a4c] InteractiveUtils
  [4af54fe1] LazyArtifacts
  [b27032c2] LibCURL v0.6.3
  [76f85450] LibGit2
  [8f399da3] Libdl
  [37e2e46d] LinearAlgebra
  [56ddb016] Logging
  [d6f4376e] Markdown
  [ca575930] NetworkOptions v1.2.0
  [44cfe95a] Pkg v1.8.0
  [de0858da] Printf
  [3fa0cd96] REPL
  [9a3f8284] Random
  [ea8e919c] SHA v0.7.0
  [9e88b42a] Serialization
  [6462fe0b] Sockets
  [2f01184e] SparseArrays
  [10745b16] Statistics
  [fa267f1f] TOML v1.0.0
  [a4e569a6] Tar v1.10.1
  [8dfed614] Test
  [cf7118a7] UUIDs
  [4ec0a83e] Unicode
  [e66e0078] CompilerSupportLibraries_jll v1.0.1+0
  [deac9b47] LibCURL_jll v7.84.0+0
  [29816b5a] LibSSH2_jll v1.10.2+0
  [c8ffd9c3] MbedTLS_jll v2.28.0+0
  [14a3606d] MozillaCACerts_jll v2022.2.1
  [4536629a] OpenBLAS_jll v0.3.20+0
  [05823500] OpenLibm_jll v0.8.1+0
  [83775a58] Zlib_jll v1.2.12+3
  [8e850b90] libblastrampoline_jll v5.1.1+0
  [8e850ede] nghttp2_jll v1.48.0+0
  [3f19e933] p7zip_jll v17.4.0+0

@maleadt (Member) commented Apr 19, 2023

The julia MWE fails on sm_52 for me as well with NVIDIA driver 530.41.3 and CUDA 12.1. Guess this is one way NVIDIA will get me to upgrade 😞.

CUDA 12 still supports Pascal AFAIK, so maybe it could get fixed in 12.2 or 12.3.

@simonbyrne (Contributor) commented:

I'm able to replicate the bug on Julia 1.9 rc2, with:

julia> CUDA.versioninfo()
CUDA runtime 11.8, artifact installation
CUDA driver 11.3
NVIDIA driver 465.19.1

Libraries: 
- CUBLAS: 11.4.1
- CURAND: 10.3.0
- CUFFT: 10.9.0
- CUSOLVER: 11.4.1
- CUSPARSE: 11.7.5
- CUPTI: 18.0.0
- NVML: 11.0.0+465.19.1

Toolchain:
- Julia: 1.9.0-rc2
- LLVM: 14.0.6
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0, 7.1, 7.2, 7.3
- Device capability support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80, sm_86

Environment:
- JULIA_CUDA_USE_BINARYBUILDER: false

1 device:
  0: NVIDIA Tesla P100-PCIE-16GB (sm_60, 15.597 GiB / 15.899 GiB available)

@charleskawczynski (Contributor) commented:

I see it on Julia 1.8.5, too:

julia> include("..\\testbug.jl")
ERROR: LoadError: AssertionError: sum(a)  24
Stacktrace:
 [1] top-level scope
   @ dev\CliMA\testbug.jl:61
 [2] include(fname::String)
   @ Base.MainInclude .\client.jl:476
 [3] top-level scope
   @ REPL[5]:1
 [4] top-level scope
   @ .julia\packages\CUDA\Ey3w2\src\initialization.jl:52
in expression starting at dev\CliMA\testbug.jl:61

julia> versioninfo()
Julia Version 1.8.5
Commit 17cfb8e65e (2023-01-08 06:45 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 8 × Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, skylake)
  Threads: 1 on 8 virtual cores
Environment:
  JULIA_PKG_DEVDIR = dev/julia

julia> CUDA.versioninfo()
CUDA toolkit 11.7, artifact installation
Unknown NVIDIA driver, for CUDA 12.0
CUDA driver 12.0

Libraries:
- CUBLAS: 11.10.1
- CURAND: 10.2.10
- CUFFT: 10.7.1
- CUSOLVER: 11.3.5
- CUSPARSE: 11.7.3
- CUPTI: 17.0.0
- NVML: missing
- CUDNN: 8.30.2 (for CUDA 11.5.0)
- CUTENSOR: 1.4.0 (for CUDA 11.5.0)

Toolchain:
- Julia: 1.8.5
- LLVM: 13.0.1
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0, 7.1, 7.2
- Device capability support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80, sm_86

1 device:
  0: NVIDIA GeForce GTX 1050 (sm_61, 3.263 GiB / 4.000 GiB available)

@maleadt (Member) commented May 17, 2023

I heard back from NVIDIA. The issue is with the code we generate. Simplifying the MWE above:

24 @%p0 bra load_shmem; // thread-divergent branch (%p0 is true for threads 1 only),
25                      // start of divergence region
26 @%p1 bra oob1;       // this branch target goes out of the divergence region
...
34 load_shmem:          // end of divergence region
35 bar.sync 0;          // threads need to reconverge before this instruction
...
52 oob1:
53 trap;

To generate the correct code for Pascal and earlier GPUs, the compiler needs to be able to generate SASS instructions (SSY / SYNC) that reconverge threads within a warp before executing the bar.sync instruction.

To do so, the compiler needs to be able to identify the divergence region, which spans line 24 to line 34 in the example. It then needs to ensure that there is no other entry to or exit from that region. In the example above, the branch at line 26 exits from the middle of the region. As a result, the compiler is not able to generate SSY / SYNC for this region, and threads may execute bar.sync divergently, which leads to undefined behavior per the PTX spec.

This is indeed mentioned in the PTX ISA: https://docs.nvidia.com/cuda/parallel-thread-execution/#parallel-synchronization-and-communication-instructions-bar:

For .target sm_6x or below, all threads in warp (except for those have exited) must execute barrier{.cta} instruction in convergence.

In Julia's case, we can have exceptions everywhere, resulting in branches to blocks outside of divergent regions, which causes exactly these reconvergence issues. NVIDIA mentioned that they will look into special-casing trap, because the program is not expected to resume execution once it reaches that point, but that will only materialize in a future CUDA release.
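Until a compiler-side fix lands, a possible kernel-level workaround (a sketch derived from the MWE above, not an official recommendation) is to hoist the failure check out of the divergent region, so that no branch exits between the divergent if and sync_threads():

```julia
@inbounds function kernel_hoisted(input, output, n)
    i = threadIdx().x

    # uniform check: every thread evaluates it, so the branch to the
    # trap/unreachable block is no longer taken from inside a divergent region
    1 <= n || throw_some()

    temp = CuStaticSharedArray(Int, 1)
    if i == 1
        temp[1] = input          # no early exit inside the divergent branch
    end
    sync_threads()               # all threads reconverge before the barrier

    unsafe_store!(output, temp[1], i)
    return
end
```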

@lcw (Contributor) commented May 17, 2023

Thanks for the update!

@simonbyrne (Contributor) commented:

FYI, we saw a similar problem, but in reverse (i.e. the results were correct on a P100, but incorrect on a V100) when using CUDA 11.3 runtime. It appears to be fixed by upgrading to CUDA 11.8.

@maleadt (Member) commented Jun 5, 2023

@maleadt (Member) commented Jun 6, 2023

I've implemented a workaround in #1942. It solves the issue seen here by disabling the problematic LLVM passes. That's not great, as disabling those passes also affects CPU codegen, and is likely not a complete solution. But it's better than nothing for now.

@maleadt (Member) commented Jun 9, 2023

Another, better workaround: JuliaGPU/GPUCompiler.jl#467

@maleadt (Member) commented Jun 13, 2023

I think I have a fix: #1951. Everyone who has encountered this issue, please test out that PR. It requires a newly-released version of GPUCompiler (v0.21), so be sure to update your environment.

Although the fix seems to cover the remaining instances of this bug (while simplifying the quirks we had in place), it does require a sufficiently recent version of the CUDA toolkit. Anything below CUDA 11.5 is unsupported, and will run into miscompilations (due to bugs in ptxas).
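A rough sketch of one way to try the PR (the branch name below is a placeholder for whatever #1951 actually uses):

```julia
using Pkg
Pkg.update("GPUCompiler")    # the fix requires GPUCompiler v0.21
# placeholder branch name; substitute the actual branch of PR #1951
Pkg.add(url="https://github.com/JuliaGPU/CUDA.jl", rev="PR_1951_BRANCH")

# then re-run the reproducer posted earlier in this thread, e.g. saved as mwe.jl:
include("mwe.jl")
```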

@christiangnrd (Member) commented:

The test passes where it used to fail on my 1050.

maleadt changed the title from "Shared memory + multiple function exits cause invalid results" to "Unreachable control flow leads to illegal divergent barriers" on Jun 13, 2023
Artem-B pushed a commit to llvm/llvm-project that referenced this issue Jun 21, 2023
…ruct the CFG.

PTX does not have a notion of `unreachable`, which results in emitted basic
blocks having an edge to the next block:

```
block1:
  call @does_not_return();
  // unreachable
block2:
  // ptxas will create a CFG edge from block1 to block2
```

This may result in significant changes to the control flow graph, e.g., when
LLVM moves unreachable blocks to the end of the function. That's a problem
in the context of divergent control flow, as `ptxas` uses the CFG to determine
divergent regions, while some instructions may not be executed divergently.

For example, `bar.sync` is not allowed to be executed divergently on Pascal
or earlier. If we start with the following:

```
entry:
  // start of divergent region
  @%p0 bra cont;
  @%p1 bra unlikely;
  ...
  bra.uni cont;
unlikely:
  ...
  // unreachable
cont:
  // end of divergent region
  bar.sync 0;
  bra.uni exit;
exit:
  ret;
```

it is transformed by the branch-folder and block-placement passes to:

```
entry:
  // start of divergent region
  @%p0 bra cont;
  @%p1 bra unlikely;
  ...
  bra.uni cont;
cont:
  bar.sync 0;
  bra.uni exit;
unlikely:
  ...
  // unreachable
exit:
  // end of divergent region
  ret;
```

After moving the `unlikely` block to the end of the function, it has an edge
to the `exit` block, which widens the divergent region and makes the `bar.sync`
instruction happen divergently. That causes wrong computations, as we've been
running into for years with Julia code (which emits a lot of `trap` +
`unreachable` code all over the place).

To work around this, add an `exit` instruction before every `unreachable`,
as `ptxas` understands that exit terminates the CFG. Note that `trap` is not
equivalent, and only future versions of `ptxas` will model it like `exit`.
Another alternative would be to emit a branch to the block itself, but emitting
`exit` seems like a cleaner solution to represent `unreachable` to me.

Also note that this may not be sufficient, as it's possible that the block
with unreachable control flow is branched to from different divergent regions,
e.g. after block merging, in which case it may still be the case that `ptxas`
could reconstruct a CFG where divergent regions are merged (I haven't confirmed
this, but also haven't encountered this pattern in the wild yet):

```
entry:
  // start of divergent region 1
  @%p0 bra cont1;
  @%p1 bra unlikely;
  bra.uni cont1;
cont1:
  // intended end of divergent region 1
  bar.sync 0;
  // start of divergent region 2
  @%p2 bra cont2;
  @%p3 bra unlikely;
  bra.uni cont2;
cont2:
  // intended end of divergent region 2
  bra.uni exit;
unlikely:
  ...
  exit;
exit:
  // possible end of merged divergent region?
```

I originally tried to avoid the above by cloning paths towards `unreachable` and
splitting the outgoing edges, but that quickly became too complicated. I propose
we go with the simple solution first, also because modern GPUs with more flexible
hardware thread schedulers don't even suffer from this issue.

Finally, although I expect this to fix most of
https://bugs.llvm.org/show_bug.cgi?id=27738, I do still encounter
miscompilations with Julia's unreachable-heavy code when targeting these
older GPUs using an older `ptxas` version (specifically, from CUDA 11.4 or
below). This is likely due to related bugs in `ptxas` which have been fixed
since, as I have filed several reproducers with NVIDIA over the past couple of
years. I'm not inclined to look into fixing those issues over here, and will
instead be recommending our users to upgrade CUDA to 11.5+ when using these GPUs.

Also see:
- JuliaGPU/CUDAnative.jl#4
- JuliaGPU/CUDA.jl#1746
- https://discourse.llvm.org/t/llvm-reordering-blocks-breaks-ptxas-divergence-analysis/71126

Reviewed By: jdoerfert, tra

Differential Revision: https://reviews.llvm.org/D152789
Chenyang-L pushed a commit to intel/llvm that referenced this issue Jul 11, 2023 (same commit message as above)
veselypeta pushed a commit to veselypeta/cherillvm that referenced this issue Aug 30, 2024 (same commit message as above)