
Unreachable control flow leads to illegal divergent barriers #1746

Closed
parfenyev opened this issue Jan 28, 2023 · 23 comments · Fixed by #1951
Labels: bug (Something isn't working), needs information (Further information is requested)

Comments

@parfenyev commented Jan 28, 2023

Running ] test CUDA finished with errors. The full log file is attached.

Details on Julia:

Julia Version 1.8.5
Commit 17cfb8e65ea (2023-01-08 06:45 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 12 × Intel(R) Core(TM) i5-10600 CPU @ 3.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, skylake)
  Threads: 12 on 12 virtual cores

Details on CUDA:

CUDA toolkit 11.7, artifact installation
NVIDIA driver 470.161.3, for CUDA 11.4
CUDA driver 11.4

Libraries:
- CUBLAS: 11.10.1
- CURAND: 10.2.10
- CUFFT: 10.7.2
- CUSOLVER: 11.3.5
- CUSPARSE: 11.7.3
- CUPTI: 17.0.0
- NVML: 11.0.0+470.161.3
- CUDNN: 8.30.2 (for CUDA 11.5.0)
- CUTENSOR: 1.4.0 (for CUDA 11.5.0)

Toolchain:
- Julia: 1.8.5
- LLVM: 13.0.1
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0, 7.1, 7.2
- Device capability support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80, sm_86

2 devices:
  0: Tesla K80 (sm_37, 11.170 GiB / 11.173 GiB available)
  1: Tesla K80 (sm_37, 11.109 GiB / 11.173 GiB available)

test_report.txt

parfenyev added the bug label on Jan 28, 2023
@maleadt (Member) commented Jan 31, 2023

Unsure what's happening here; the log doesn't contain anything useful beyond the test failures. It would be good to narrow this down to the actual operations that fail (the mapreduce failures in particular are unexpected).

maleadt added the needs information label on Jan 31, 2023
@parfenyev (Author) commented:

What can I do to help? Detailed instructions would be very helpful.

Calculations in a third-party package that uses the GPU via CUDA.jl work as expected. Can these calculations be trusted even though not all tests pass?
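One way to help narrow this down might be to re-run only the suspect test sets. A rough sketch (the test-set name and the argument handling are assumptions and may differ between CUDA.jl versions):

```julia
# Hypothetical invocation: recent CUDA.jl test harnesses forward test_args
# to the runner, which can select individual test sets by name.
# "codegen" is a guess at the relevant test-set name.
using Pkg
Pkg.test("CUDA"; test_args=["codegen"])
```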

@maleadt (Member) commented Jan 31, 2023

Hmm, actually one of the failures seems to point to JuliaGPU/CUDAnative.jl#4

codegen: Test Failed at /home/students/parfeniev/.julia/packages/CUDA/BbliS/test/codegen.jl:112
  Expression: cpu(input) == gpu(input)
   Evaluated: [8, 8] == [8, 1076158697]

This has been a longstanding issue, caused by bugs in ptxas which NVIDIA apparently doesn't manage to fix. The test there was meant to detect exactly such issues:

CUDA.jl/test/codegen.jl, lines 69 to 115 in f85dd7b:

@testset "ptxas-compatible control flow" begin
    @noinline function throw_some()
        throw(42)
        return
    end

    @inbounds function kernel(input, output, n)
        i = threadIdx().x

        temp = CuStaticSharedArray(Int, 1)
        if i == 1
            1 <= n || throw_some()
            temp[1] = input
        end
        sync_threads()

        1 <= n || throw_some()
        unsafe_store!(output, temp[1], i)

        return
    end

    function gpu(input)
        output = CuArray(zeros(eltype(input), 2))
        ptr = pointer(output)
        ptr = reinterpret(Ptr{eltype(input)}, ptr)

        @cuda threads=2 kernel(input, ptr, 99)

        return Array(output)
    end

    function cpu(input)
        output = zeros(eltype(input), 2)

        for j in 1:2
            @inbounds output[j] = input
        end

        return output
    end

    input = rand(1:100)
    @test cpu(input) == gpu(input)
end
end

In practice it triggers with mapreduce kernels, which would explain the plethora of failures you're seeing.

For your old device we already apply all the workarounds we know of: https://github.com/JuliaGPU/CUDA.jl/blob/master/src/compiler/gpucompiler.jl#L21-L35. ptxas for newer devices appears to use a different codebase/compiler stack, which doesn't exhibit this issue.

Bottom line: if you're using mapreduce kernels, or a combination of shared memory and code that may exit early (e.g. because of exceptions), then no, the calculations cannot be trusted, even though it's unlikely you're running into this issue for arbitrary code. If I have some time I'll grab the oldest GPU I have lying around to see if I can reproduce this, but in the meantime using a more recent GPU may be the best option.

maleadt changed the title from "CUDA Test failed" to "Shared memory + multiple function exits cause invalid results on sm_37" on Jan 31, 2023
@maleadt (Member) commented Jan 31, 2023

CUDA toolkit 11.7, artifact installation

One thing you could try is upgrading CUDA.jl so that the 11.8 or even 12.0 compiler is used (#1742), although I don't think NVIDIA has touched ptxas for sm_37 in those releases (the architecture is deprecated and slated for removal).
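For reference, a rough sketch of what that upgrade might look like, assuming a CUDA.jl release that provides CUDA.set_runtime_version! (older releases selected the toolkit through the JULIA_CUDA_VERSION environment variable instead):

```julia
using Pkg
Pkg.update("CUDA")                    # move to the latest CUDA.jl release

using CUDA
CUDA.set_runtime_version!(v"11.8")    # or v"12.0"; takes effect after restarting Julia
# after the restart, CUDA.versioninfo() should report the requested toolkit
```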

maleadt changed the title from "Shared memory + multiple function exits cause invalid results on sm_37" to "Shared memory + multiple function exits cause invalid results" on Feb 1, 2023
@maleadt (Member) commented Feb 1, 2023

#1660 looks like another instance of this, but on more recent hardware (sm_75)...

@parfenyev (Author) commented:

Thank you @maleadt for your response! I ran ] test CUDA on a more modern GPU, and now there are far fewer errors. Can you comment on what is wrong here and how critical it is for the calculations?

Julia Version 1.8.5
Commit 17cfb8e65ea (2023-01-08 06:45 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 96 × AMD EPYC 7443 24-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, znver3)
  Threads: 1 on 96 virtual cores
Environment:
  LD_LIBRARY_PATH = /opt/AMD/aocl//4.0/lib:
CUDA toolkit 11.7, artifact installation
NVIDIA driver 525.60.13, for CUDA 12.0
CUDA driver 12.0

Libraries:
- CUBLAS: 11.10.1
- CURAND: 10.2.10
- CUFFT: 10.7.2
- CUSOLVER: 11.3.5
- CUSPARSE: 11.7.3
- CUPTI: 17.0.0
- NVML: 12.0.0+525.60.13
- CUDNN: 8.30.2 (for CUDA 11.5.0)
- CUTENSOR: 1.4.0 (for CUDA 11.5.0)

Toolchain:
- Julia: 1.8.5
- LLVM: 13.0.1
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0, 7.1, 7.2
- Device capability support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80, sm_86

4 devices:
  0: NVIDIA A100-SXM4-80GB (sm_80, 79.181 GiB / 80.000 GiB available)
  1: NVIDIA A100-SXM4-80GB (sm_80, 79.181 GiB / 80.000 GiB available)
  2: NVIDIA A100-SXM4-80GB (sm_80, 79.181 GiB / 80.000 GiB available)
  3: NVIDIA A100-SXM4-80GB (sm_80, 79.181 GiB / 80.000 GiB available)

test_report_v2.txt

@maleadt (Member) commented Feb 2, 2023

Those failures are probably not an issue, unless you rely on that specific CUSPARSE functionality. If so, do test the latest release of CUDA.jl 4.0, since much has changed. If it still happens, please open a separate issue.

Let's keep the current issue about the unstructured-CFG-related codegen problem. FWIW, I can reproduce on my old sm_35 GPU. I'll try to reduce it; however, it's unlikely that NVIDIA will fix this, as sm_35/sm_37 are officially unsupported on CUDA 12.0.

@maleadt (Member) commented Apr 19, 2023

The following MWE, extracted from the CUDA.jl test suite, fails for me on Kepler hardware (sm_35; a GTX Titan) using Julia 1.9, without even needing --check-bounds=yes:

using CUDA, Test

@noinline function throw_some()
    throw(42)
    return
end

@inbounds function kernel(input, output, n)
    i = threadIdx().x

    temp = CuStaticSharedArray(Int, 1)
    if i == 1
        1 <= n || throw_some()
        temp[1] = input
    end
    sync_threads()

    1 <= n || throw_some()
    unsafe_store!(output, temp[1], i)

    return
end

function gpu(input)
    output = CuArray(zeros(eltype(input), 2))
    ptr = pointer(output)
    ptr = reinterpret(Ptr{eltype(input)}, ptr)

    @cuda threads=2 kernel(input, ptr, 99)

    return Array(output)
end

function cpu(input)
    output = zeros(eltype(input), 2)

    for j in 1:2
        @inbounds output[j] = input
    end

    return output
end

input = rand(1:100)
@test cpu(input) == gpu(input)

Would be good if other people can confirm.

EDIT: can also reproduce this on a GTX 970 (sm_52), using the same driver (470.182.3, for CUDA 11.4, but using ptxas from 11.8).

EDIT: also reproduces on the GTX 970 using 530.41.3 (CUDA 12.1) -- I was only using driver 470 because that's the latest supporting my GTX Titan.

@christiangnrd (Member) commented:

On a 1050 mobile GPU:

julia> @test cpu(input) == gpu(input)
Test Failed at REPL[17]:1
  Expression: cpu(input) == gpu(input)
   Evaluated: [72, 72] == [72, 0]

ERROR: There was an error during testing

julia> versioninfo()
Julia Version 1.9.0-rc2
Commit 72aec423c2a (2023-04-01 10:41 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 8 × Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, skylake)
  Threads: 8 on 8 virtual cores
Environment:
  JULIA_NUM_THREADS = auto

julia> CUDA.versioninfo()
CUDA runtime 12.1, artifact installation
CUDA driver 12.0
NVIDIA driver 525.105.17

Libraries: 
- CUBLAS: 12.1.0
- CURAND: 10.3.2
- CUFFT: 11.0.2
- CUSOLVER: 11.4.4
- CUSPARSE: 12.0.2
- CUPTI: 18.0.0
- NVML: 12.0.0+525.105.17

Toolchain:
- Julia: 1.9.0-rc2
- LLVM: 14.0.6
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0, 7.1, 7.2, 7.3, 7.4, 7.5
- Device capability support: sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80, sm_86

1 device:
  0: NVIDIA GeForce GTX 1050 (sm_61, 3.104 GiB / 4.000 GiB available)

@maleadt (Member) commented Apr 19, 2023

mwe.ptx

.version 6.3
.target sm_52
.address_size 64

.visible .entry kernel(
    .param .u64 output_ptr,
    .param .u32 n
)
{
    .reg .pred %p<2>;
    .reg .b32 %r<8>;
    .reg .b64 %rd<5>;

    .shared .align 32 .b8 shmem[4];

    // load the thread idx, check if we're thread 0
    mov.u32 %r1, %tid.x;
    setp.ne.s32 %p0, %r1, 0;

    // something that used to be an OOB check
    ld.param.u32 %r2, [n];
    setp.lt.s32 %p1, %r2, 1;

    @%p0 bra load_shmem;

    @%p1 bra oob1;

    bra.uni set_shmem;

    // if we set the oob1 block here, we get the expected output:
oob1:
    trap;

set_shmem:
    // thread 0 initializes the shmem
    st.shared.u32 [shmem], 42;

load_shmem:
    bar.sync 0;

    // all threads load from shmem and store to global memory
    @%p1 bra oob2;
    ld.param.u64 %rd1, [output_ptr];
    ld.shared.u32 %r3, [shmem];
    mul.wide.u32 %rd3, %r1, 4;
    add.s64 %rd4, %rd1, %rd3;
    shr.u32 %r5, %r3, 24;
    st.u8 [%rd4+3], %r5;
    shr.u32 %r6, %r3, 16;
    st.u8 [%rd4+2], %r6;
    shr.u32 %r7, %r3, 8;
    st.u8 [%rd4+1], %r7;
    st.u8 [%rd4], %r3;
    ret;

    // if we set the oob1 block here, we get a miscompile
oob1:
    trap;

oob2:
    trap;
}

loader.c

#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include <time.h>
#include <cuda.h>

int main(int argc, char *argv[]) {
    if (argc != 2) {
        printf("Usage: %s <PTX file>\n", argv[0]);
        exit(EXIT_FAILURE);
    }

    CUdevice cuDevice;
    CUcontext cuContext;
    CUmodule cuModule;
    CUfunction cuKernel;
    CUdeviceptr d_output;

    int threads = 2, n = 99;
    int h_output[2];

    // Initialize the CUDA driver API
    assert(cuInit(0) == CUDA_SUCCESS);
    assert(cuDeviceGet(&cuDevice, 0) == CUDA_SUCCESS);
    assert(cuCtxCreate(&cuContext, 0, cuDevice) == CUDA_SUCCESS);

    // Load the PTX file
    assert(cuModuleLoad(&cuModule, argv[1]) == CUDA_SUCCESS);
    assert(cuModuleGetFunction(&cuKernel, cuModule, "kernel") == CUDA_SUCCESS);

    // Allocate GPU memory for the output
    assert(cuMemAlloc(&d_output, sizeof(int) * 2) == CUDA_SUCCESS);

    // Set up the kernel parameters
    void *kernel_params[] = {&d_output, &n};

    // Launch the kernel
    assert(cuLaunchKernel(cuKernel, 1, 1, 1, threads, 1, 1, 0, NULL, kernel_params, NULL) == CUDA_SUCCESS);

    // Copy the output back to the host
    assert(cuMemcpyDtoH(h_output, d_output, sizeof(int) * 2) == CUDA_SUCCESS);

    // Print the output
    printf("Output: %d %d\n", h_output[0], h_output[1]);

    // Clean up
    cuMemFree(d_output);
    cuModuleUnload(cuModule);
    cuCtxDestroy(cuContext);

    return 0;
}
❯ gcc loader.c -isystem /opt/cuda/include -lcuda -o loader && (./loader good.ptx; ./loader bad.ptx)
Output: 42 42
Output: 42 0

Note the difference in oob1 placement. This looks really bad. I can't see us working around it...

@maleadt (Member) commented Apr 19, 2023

Filed with NVIDIA as bug 4078847.

@lcw (Contributor) commented Apr 19, 2023

Would be good if other people can confirm.

The julia MWE fails on sm_52 for me as well with NVIDIA driver 530.41.3 and CUDA 12.1. Guess this is one way NVIDIA will get me to upgrade 😞.

❯ julia --project
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.9.0-rc2 (2023-04-01)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |

julia> include("bad.jl")
Test Failed at /tmp/scratch/lucas/bug/bad.jl:45
  Expression: cpu(input) == gpu(input)
   Evaluated: [84, 84] == [84, 0]

ERROR: LoadError: There was an error during testing
in expression starting at /tmp/scratch/lucas/bug/bad.jl:45

julia> versioninfo()
Julia Version 1.9.0-rc2
Commit 72aec423c2a (2023-04-01 10:41 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 32 × AMD Ryzen 9 7950X 16-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, znver3)
  Threads: 1 on 32 virtual cores

julia> CUDA.versioninfo()
CUDA runtime 12.1, artifact installation
CUDA driver 12.1
NVIDIA driver 530.41.3

Libraries: 
- CUBLAS: 12.1.0
- CURAND: 10.3.2
- CUFFT: 11.0.2
- CUSOLVER: 11.4.4
- CUSPARSE: 12.0.2
- CUPTI: 18.0.0
- NVML: 12.0.0+530.41.3

Toolchain:
- Julia: 1.9.0-rc2
- LLVM: 14.0.6
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0, 7.1, 7.2, 7.3, 7.4, 7.5
- Device capability support: sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80, sm_86

1 device:
  0: NVIDIA GeForce GTX TITAN X (sm_52, 10.411 GiB / 12.000 GiB available)

(bug) pkg> st --manifest
Status `/tmp/scratch/lucas/bug/Manifest.toml`
  [621f4979] AbstractFFTs v1.3.1
  [79e6a3ab] Adapt v3.6.1
  [a9b6321e] Atomix v0.1.0
  [ab4f0b2a] BFloat16s v0.4.2
  [fa961155] CEnum v0.4.2
  [052768ef] CUDA v4.1.4
  [1af6417a] CUDA_Runtime_Discovery v0.2.2
  [d360d2e6] ChainRulesCore v1.15.7
  [9e997f8a] ChangesOfVariables v0.1.6
  [34da2185] Compat v4.6.1
  [ffbed154] DocStringExtensions v0.9.3
  [e2ba6199] ExprTools v0.1.9
  [0c68f7d7] GPUArrays v8.6.6
  [46192b85] GPUArraysCore v0.1.4
  [61eb1bfa] GPUCompiler v0.19.2
  [3587e190] InverseFunctions v0.1.8
  [92d709cd] IrrationalConstants v0.2.2
  [692b3bcd] JLLWrappers v1.4.1
  [63c18a36] KernelAbstractions v0.9.2
  [929cbde3] LLVM v5.0.0
  [2ab3a3ac] LogExpFunctions v0.3.23
  [1914dd2f] MacroTools v0.5.10
  [21216c6a] Preferences v1.3.0
  [74087812] Random123 v1.6.0
  [e6cf234a] RandomNumbers v1.5.3
  [189a3867] Reexport v1.2.2
  [ae029012] Requires v1.3.0
  [6c6a2e73] Scratch v1.2.0
  [66db9d55] SnoopPrecompile v1.0.3
  [276daf66] SpecialFunctions v2.2.0
  [90137ffa] StaticArrays v1.5.21
  [1e83bf80] StaticArraysCore v1.4.0
  [a759f4b9] TimerOutputs v0.5.22
  [013be700] UnsafeAtomics v0.2.1
  [d80eeb9a] UnsafeAtomicsLLVM v0.1.2
  [4ee394cb] CUDA_Driver_jll v0.5.0+1
  [76a88914] CUDA_Runtime_jll v0.5.0+2
  [dad2f222] LLVMExtra_jll v0.0.21+0
  [efe28fd5] OpenSpecFun_jll v0.5.5+0
  [0dad84c5] ArgTools v1.1.1
  [56f22d72] Artifacts
  [2a0f44e3] Base64
  [ade2ca70] Dates
  [f43a241f] Downloads v1.6.0
  [7b1f6079] FileWatching
  [b77e0a4c] InteractiveUtils
  [4af54fe1] LazyArtifacts
  [b27032c2] LibCURL v0.6.3
  [76f85450] LibGit2
  [8f399da3] Libdl
  [37e2e46d] LinearAlgebra
  [56ddb016] Logging
  [d6f4376e] Markdown
  [ca575930] NetworkOptions v1.2.0
  [44cfe95a] Pkg v1.8.0
  [de0858da] Printf
  [3fa0cd96] REPL
  [9a3f8284] Random
  [ea8e919c] SHA v0.7.0
  [9e88b42a] Serialization
  [6462fe0b] Sockets
  [2f01184e] SparseArrays
  [10745b16] Statistics
  [fa267f1f] TOML v1.0.0
  [a4e569a6] Tar v1.10.1
  [8dfed614] Test
  [cf7118a7] UUIDs
  [4ec0a83e] Unicode
  [e66e0078] CompilerSupportLibraries_jll v1.0.1+0
  [deac9b47] LibCURL_jll v7.84.0+0
  [29816b5a] LibSSH2_jll v1.10.2+0
  [c8ffd9c3] MbedTLS_jll v2.28.0+0
  [14a3606d] MozillaCACerts_jll v2022.2.1
  [4536629a] OpenBLAS_jll v0.3.20+0
  [05823500] OpenLibm_jll v0.8.1+0
  [83775a58] Zlib_jll v1.2.12+3
  [8e850b90] libblastrampoline_jll v5.1.1+0
  [8e850ede] nghttp2_jll v1.48.0+0
  [3f19e933] p7zip_jll v17.4.0+0

@maleadt (Member) commented Apr 19, 2023

The julia MWE fails on sm_52 for me as well with NVIDIA driver 530.41.3 and CUDA 12.1. Guess this is one way NVIDIA will get me to upgrade 😞.

CUDA 12 still supports Pascal AFAIK, so maybe it could get fixed in 12.2 or 12.3.

@simonbyrne (Contributor) commented:

I'm able to replicate the bug on Julia 1.9 rc2, with:

julia> CUDA.versioninfo()
CUDA runtime 11.8, artifact installation
CUDA driver 11.3
NVIDIA driver 465.19.1

Libraries: 
- CUBLAS: 11.4.1
- CURAND: 10.3.0
- CUFFT: 10.9.0
- CUSOLVER: 11.4.1
- CUSPARSE: 11.7.5
- CUPTI: 18.0.0
- NVML: 11.0.0+465.19.1

Toolchain:
- Julia: 1.9.0-rc2
- LLVM: 14.0.6
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0, 7.1, 7.2, 7.3
- Device capability support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80, sm_86

Environment:
- JULIA_CUDA_USE_BINARYBUILDER: false

1 device:
  0: NVIDIA Tesla P100-PCIE-16GB (sm_60, 15.597 GiB / 15.899 GiB available)

@charleskawczynski (Contributor) commented:

I see it on Julia 1.8.5, too:

julia> include("..\\testbug.jl")
ERROR: LoadError: AssertionError: sum(a)  24
Stacktrace:
 [1] top-level scope
   @ dev\CliMA\testbug.jl:61
 [2] include(fname::String)
   @ Base.MainInclude .\client.jl:476
 [3] top-level scope
   @ REPL[5]:1
 [4] top-level scope
   @ .julia\packages\CUDA\Ey3w2\src\initialization.jl:52
in expression starting at dev\CliMA\testbug.jl:61

julia> versioninfo()
Julia Version 1.8.5
Commit 17cfb8e65e (2023-01-08 06:45 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 8 × Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, skylake)
  Threads: 1 on 8 virtual cores
Environment:
  JULIA_PKG_DEVDIR = dev/julia

julia> CUDA.versioninfo()
CUDA toolkit 11.7, artifact installation
Unknown NVIDIA driver, for CUDA 12.0
CUDA driver 12.0

Libraries:
- CUBLAS: 11.10.1
- CURAND: 10.2.10
- CUFFT: 10.7.1
- CUSOLVER: 11.3.5
- CUSPARSE: 11.7.3
- CUPTI: 17.0.0
- NVML: missing
- CUDNN: 8.30.2 (for CUDA 11.5.0)
- CUTENSOR: 1.4.0 (for CUDA 11.5.0)

Toolchain:
- Julia: 1.8.5
- LLVM: 13.0.1
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0, 7.1, 7.2
- Device capability support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80, sm_86

1 device:
  0: NVIDIA GeForce GTX 1050 (sm_61, 3.263 GiB / 4.000 GiB available)

@maleadt (Member) commented May 17, 2023

I heard back from NVIDIA. The issue is with the code we generate. Simplifying the MWE above:

24 @%p0 bra load_shmem; // thread-divergent branch (%p0 is true for threads 1 only),
25                      // start of divergence region
26 @%p1 bra oob1;       // this branch target goes out of the divergence region
...
34 load_shmem:          // end of divergence region
35 bar.sync 0;          // threads need to reconverge before this instruction
...
52 oob1:
53 trap;

To generate the correct code for Pascal and earlier GPUs, the compiler needs to be able to generate SASS instructions (SSY / SYNC) that reconverge threads within a warp before executing the bar.sync instruction.

To do so, the compiler needs to be able to identify the divergence region, which spans line 24 to line 34 in the example. It then needs to ensure that there is no other entry to or exit from that region. In the example above, the branch at line 26 exits from the middle of the region. As a result, the compiler is not able to generate SSY / SYNC for this region, and threads may execute bar.sync divergently, which leads to undefined behavior per the PTX spec.

This is indeed mentioned in the PTX ISA: https://docs.nvidia.com/cuda/parallel-thread-execution/#parallel-synchronization-and-communication-instructions-bar:

For .target sm_6x or below, all threads in warp (except for those have exited) must execute barrier{.cta} instruction in convergence.

In Julia's case, we can have exceptions everywhere, resulting in branches to blocks outside of divergent regions, which causes exactly these reconvergence issues. NVIDIA mentioned that they will look into special-casing trap, because the program is not expected to resume execution once it reaches that point, but that will only materialize in a future CUDA release.
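Until a compiler-side fix lands, a possible kernel-level workaround (a sketch derived from the MWE above, not an official recommendation) is to hoist the failure check out of the divergent region, so that no branch exits between the divergent if and sync_threads():

```julia
@inbounds function kernel_hoisted(input, output, n)
    i = threadIdx().x

    # uniform check: every thread evaluates it, so the branch to the
    # trap/unreachable block is no longer taken from inside a divergent region
    1 <= n || throw_some()

    temp = CuStaticSharedArray(Int, 1)
    if i == 1
        temp[1] = input          # no early exit inside the divergent branch
    end
    sync_threads()               # all threads reconverge before the barrier

    unsafe_store!(output, temp[1], i)
    return
end
```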

@lcw (Contributor) commented May 17, 2023

Thanks for the update!

@simonbyrne (Contributor) commented:

FYI, we saw a similar problem, but in reverse (i.e. the results were correct on a P100, but incorrect on a V100) when using CUDA 11.3 runtime. It appears to be fixed by upgrading to CUDA 11.8.

@maleadt (Member) commented Jun 5, 2023

@maleadt (Member) commented Jun 6, 2023

I've implemented a workaround in #1942. It solves the issue seen here by disabling the problematic LLVM passes. That's not great, as disabling those passes also affects CPU codegen, and is likely not a complete solution. But it's better than nothing for now.

@maleadt (Member) commented Jun 9, 2023

Another, better workaround: JuliaGPU/GPUCompiler.jl#467

@maleadt (Member) commented Jun 13, 2023

I think I have a fix: #1951. Everyone who has encountered this issue, please test out that PR. It requires a newly-released version of GPUCompiler (v0.21), so be sure to update your environment.

Although the fix seems to cover the remaining instances of this bug (while simplifying the quirks we had in place), it does require a sufficiently recent version of the CUDA toolkit. Anything below CUDA 11.5 is unsupported, and will run into miscompilations (due to bugs in ptxas).
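A rough sketch of one way to try the PR (the branch name below is a placeholder for whatever #1951 actually uses):

```julia
using Pkg
Pkg.update("GPUCompiler")    # the fix requires GPUCompiler v0.21
# placeholder branch name; substitute the actual branch of PR #1951
Pkg.add(url="https://github.com/JuliaGPU/CUDA.jl", rev="PR_1951_BRANCH")

# then re-run the reproducer posted earlier in this thread, e.g. saved as mwe.jl:
include("mwe.jl")
```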

@christiangnrd (Member) commented:

The test passes where it used to fail on my 1050.

maleadt changed the title from "Shared memory + multiple function exits cause invalid results" to "Unreachable control flow leads to illegal divergent barriers" on Jun 13, 2023
Artem-B pushed a commit to llvm/llvm-project that referenced this issue Jun 21, 2023
…ruct the CFG.

PTX does not have a notion of `unreachable`, which results in emitted basic
blocks having an edge to the next block:

```
block1:
  call @does_not_return();
  // unreachable
block2:
  // ptxas will create a CFG edge from block1 to block2
```

This may result in significant changes to the control flow graph, e.g., when
LLVM moves unreachable blocks to the end of the function. That's a problem
in the context of divergent control flow, as `ptxas` uses the CFG to determine
divergent regions, while some instructions may not be executed divergently.

For example, `bar.sync` is not allowed to be executed divergently on Pascal
or earlier. If we start with the following:

```
entry:
  // start of divergent region
  @%p0 bra cont;
  @%p1 bra unlikely;
  ...
  bra.uni cont;
unlikely:
  ...
  // unreachable
cont:
  // end of divergent region
  bar.sync 0;
  bra.uni exit;
exit:
  ret;
```

it is transformed by the branch-folder and block-placement passes to:

```
entry:
  // start of divergent region
  @%p0 bra cont;
  @%p1 bra unlikely;
  ...
  bra.uni cont;
cont:
  bar.sync 0;
  bra.uni exit;
unlikely:
  ...
  // unreachable
exit:
  // end of divergent region
  ret;
```

After moving the `unlikely` block to the end of the function, it has an edge
to the `exit` block, which widens the divergent region and makes the `bar.sync`
instruction happen divergently. That causes wrong computations, as we've been
running into for years with Julia code (which emits a lot of `trap` +
`unreachable` code all over the place).

To work around this, add an `exit` instruction before every `unreachable`,
as `ptxas` understands that exit terminates the CFG. Note that `trap` is not
equivalent, and only future versions of `ptxas` will model it like `exit`.
Another alternative would be to emit a branch to the block itself, but emitting
`exit` seems like a cleaner solution to represent `unreachable` to me.

Also note that this may not be sufficient, as it's possible that the block
with unreachable control flow is branched to from different divergent regions,
e.g. after block merging, in which case it may still be the case that `ptxas`
could reconstruct a CFG where divergent regions are merged (I haven't confirmed
this, but also haven't encountered this pattern in the wild yet):

```
entry:
  // start of divergent region 1
  @%p0 bra cont1;
  @%p1 bra unlikely;
  bra.uni cont1;
cont1:
  // intended end of divergent region 1
  bar.sync 0;
  // start of divergent region 2
  @%p2 bra cont2;
  @%p3 bra unlikely;
  bra.uni cont2;
cont2:
  // intended end of divergent region 2
  bra.uni exit;
unlikely:
  ...
  exit;
exit:
  // possible end of merged divergent region?
```

I originally tried to avoid the above by cloning paths towards `unreachable` and
splitting the outgoing edges, but that quickly became too complicated. I propose
we go with the simple solution first, also because modern GPUs with more flexible
hardware thread schedulers don't even suffer from this issue.

Finally, although I expect this to fix most of
https://bugs.llvm.org/show_bug.cgi?id=27738, I do still encounter
miscompilations with Julia's unreachable-heavy code when targeting these
older GPUs using an older `ptxas` version (specifically, from CUDA 11.4 or
below). This is likely due to related bugs in `ptxas` which have been fixed
since, as I have filed several reproducers with NVIDIA over the past couple of
years. I'm not inclined to look into fixing those issues over here, and will
instead be recommending our users to upgrade CUDA to 11.5+ when using these GPUs.

Also see:
- JuliaGPU/CUDAnative.jl#4
- JuliaGPU/CUDA.jl#1746
- https://discourse.llvm.org/t/llvm-reordering-blocks-breaks-ptxas-divergence-analysis/71126

Reviewed By: jdoerfert, tra

Differential Revision: https://reviews.llvm.org/D152789
Chenyang-L pushed a commit to intel/llvm that referenced this issue Jul 11, 2023 (same commit message as above)
veselypeta pushed a commit to veselypeta/cherillvm that referenced this issue Aug 30, 2024 (same commit message as above)