Misc. bug: CUDA errors with multi-threaded use #11804
Comments
Do you still get errors or incorrect results when you set the environment variable GGML_CUDA_DISABLE_GRAPHS=1?
I'm still getting errors. I'm using the llama backend build from the LLamaSharp repository.
Could you explain this option? Should it be used when building llama.cpp?
By default CUDA graphs are used. At runtime they can be disabled by setting the environment variable GGML_CUDA_DISABLE_GRAPHS=1.
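For illustration, a minimal sketch (my own, not the actual ggml-cuda source) of how a runtime kill switch like GGML_CUDA_DISABLE_GRAPHS is typically read; the point is that it is an environment variable checked at runtime, not a build flag:

```cpp
// Hedged sketch: reading a runtime environment-variable switch such as
// GGML_CUDA_DISABLE_GRAPHS. Any non-null value counts as "set".
#include <cstdlib>

static bool cuda_graphs_disabled() {
    // Evaluated once; later calls reuse the cached result.
    static const bool disabled = std::getenv("GGML_CUDA_DISABLE_GRAPHS") != nullptr;
    return disabled;
}
```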
I'm checking this flag. |
I have set GGML_CUDA_DISABLE_GRAPHS=1 and it does indeed work with this flag. But why is this the only correct solution in this case? As far as I understand, there may be a drop in performance. As soon as I remove the flag, I get CUDA errors again (one GPU):
2025-02-11 17:15:15.2025 LLama.Native.SafeLLamaContextHandle.llama_decode Error: ggml_cuda_compute_forward: ADD failed
2025-02-11 17:16:24.6475 LLama.Native.SafeLLamaContextHandle.llama_decode Error: CUDA error: operation failed due to a previous error during capture
One more question: will a call like llama_tokenize(llama_model_get_vocab(this), ...) be thread-safe, given that there is only one model instance shared by everyone?
I don't know, @agray3 is probably a better person to answer this.
I don't know.
@slaren
Yes. The question is why the flag GGML_CUDA_DISABLE_GRAPHS=1 solved the problem. llama_tokenize(llama_model_get_vocab(this), ...)
Please, why did the flag GGML_CUDA_DISABLE_GRAPHS=1 solve the problem?
If I knew, I would have already told you. What's the point of insisting?
I'm AFK on holiday this week, so I can't look in detail, but you can try changing the mode at https://github.com/ggerganov/llama.cpp/blob/369be5598ac5f71f6aa6d6606e9aec769a23dafa/ggml/src/ggml-cuda/ggml-cuda.cu#L2757 to cudaStreamCaptureModeThreadLocal.
Please excuse me.
A thread's mode is one of the following:
- cudaStreamCaptureModeGlobal: This is the default mode. If the local thread has an ongoing capture sequence that was not initiated with cudaStreamCaptureModeRelaxed at cuStreamBeginCapture, or if any other thread has a concurrent capture sequence initiated with cudaStreamCaptureModeGlobal, this thread is prohibited from potentially unsafe API calls.
- cudaStreamCaptureModeThreadLocal: If the local thread has an ongoing capture sequence not initiated with cudaStreamCaptureModeRelaxed, it is prohibited from potentially unsafe API calls. Concurrent capture sequences in other threads are ignored.
- cudaStreamCaptureModeRelaxed: The local thread is not prohibited from potentially unsafe API calls. Note that the thread is still prohibited from API calls which necessarily conflict with stream capture, for example, attempting cudaEventQuery on an event that was last recorded inside a capture sequence.
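For reference, a self-contained sketch (made-up kernel and sizes, not llama.cpp code) of what changing the capture mode means in practice: with cudaStreamCaptureModeGlobal, a capture in progress on one thread can make CUDA calls on other threads fail with "operation not permitted when stream is capturing", whereas cudaStreamCaptureModeThreadLocal scopes that restriction to the capturing thread only:

```cuda
// Minimal CUDA graph capture example (assumption: CUDA 12.x for the 3-argument
// cudaGraphInstantiate). The capture mode argument is the only point of interest.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void scale(float * x) { x[threadIdx.x] *= 2.0f; }

int main() {
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    float * d = nullptr;
    cudaMalloc(&d, 32 * sizeof(float));

    // ThreadLocal: only this thread is restricted while the capture is active;
    // concurrent CUDA API calls from other host threads are ignored by this capture.
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeThreadLocal);
    scale<<<1, 32, 0, stream>>>(d);

    cudaGraph_t graph;
    cudaGraphExec_t exec;
    cudaStreamEndCapture(stream, &graph);
    cudaGraphInstantiate(&exec, graph, 0);
    cudaGraphLaunch(exec, stream);
    cudaStreamSynchronize(stream);

    printf("replay status: %s\n", cudaGetErrorString(cudaGetLastError()));

    cudaGraphExecDestroy(exec);
    cudaGraphDestroy(graph);
    cudaFree(d);
    cudaStreamDestroy(stream);
    return 0;
}
```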
I tried the options with cudaStreamCaptureModeThreadLocal and cudaStreamCaptureModeGlobal, but the error did not disappear.
A few more details I missed earlier. This is how the errors are split between threads.

This thread calls the decoder:
2025-02-13 19:24:25.4454|LLamaStatelessExecutor InferAsync Evaluate New Token
2025-02-13 19:24:25.4584 LLama.Native.SafeLLamaContextHandle.llama_decode Error: CUDA error: operation failed due to a previous error during capture

This thread calls the create context:
2025-02-13 19:24:25.5281||LLamaStatelessExecutor InferAsync CreateContext
2025-02-13 19:24:25.5394 LLama.Native.SafeLLamaContextHandle.llama_init_from_model Error: CUDA error: operation not permitted when stream is capturing

This thread calls the create context:
2025-02-13 19:24:25.5552|LLamaStatelessExecutor InferAsync CreateContext
SafeLLamaContextHandle.Create => SafeLLamaContextHandle.llama_init_from_model => SafeLLamaContextHandle.llama_init_from_model

Do I understand correctly that creating a context is not thread-safe?
@agray3 |
I'm using LLamaSharp 0.21.0 (CUDA 12.8 backend) with llama.cpp commit:
5783575
Model for Inference (one instance for all users): Qwen2.5-14B-1M-Q5-K-M
Model for Embedding (one instance for all users): Qwen2.5-1.5B-Q5-K-M
All models together use 12 GB of VRAM.
FlashAttention = true!
Memory on the server: 32-48 GB of RAM.
There is always enough memory (RAM, VRAM) for queries with a margin.
Everything works fine for one user.
When I run 3-4 web requests at the same time, the application crashes with fatal CUDA errors. The errors are almost always the same. One request uses about 3.3 GB of VRAM (context size: 16K, nbatch: 2048, ubatch: 512).
I see errors both when using one GPU (RTX 4090) and when using two GPUs (2 x RTX 4090, layer split mode, tensors and VRAM 50/50).
I've read the following, and apparently this shouldn't be happening:
#3960
#6017
The errors disappear when I add locking at the LLamaSharp level around the sections that create a context, delete a context, run the decoder, compute embeddings, clear the KV cache, etc.
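For example, a rough sketch of that workaround at the native level (my own illustration, not LLamaSharp's or llama.cpp's code; it assumes the llama.cpp API names that appear in the logs above, such as llama_init_from_model and llama_decode):

```cpp
// Hedged sketch of the locking workaround: one shared llama_model, and every call
// that touches the CUDA backend (context creation/destruction, decode) goes through
// a single mutex. Batch construction and error handling are omitted for brevity.
#include <mutex>
#include "llama.h"

static std::mutex    g_llama_mutex;       // serializes all GPU-touching llama.cpp calls
static llama_model * g_model = nullptr;   // loaded once, shared by all requests

llama_context * create_context_locked() {
    std::lock_guard<std::mutex> lock(g_llama_mutex);
    llama_context_params cparams = llama_context_default_params();
    cparams.n_ctx = 16 * 1024;            // 16K context, matching the report above
    return llama_init_from_model(g_model, cparams);
}

int32_t decode_locked(llama_context * ctx, llama_batch batch) {
    std::lock_guard<std::mutex> lock(g_llama_mutex);
    return llama_decode(ctx, batch);
}

void free_context_locked(llama_context * ctx) {
    std::lock_guard<std::mutex> lock(g_llama_mutex);
    llama_free(ctx);
}
```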
The LLamaSharp team is aware of the problem, but their code is clean and simply calls the native llama.cpp API.
LLamaSharp has its own partial (not complete) resource locking for multithreaded use.
At the same time, wasn't #3960 supposed to solve all of these problems?
Can you tell me what to do and where the problem might be?
What is the correct approach to using llama.cpp with a GPU in a multithreaded environment?
Are there any recommendations for building llama.cpp for multithreaded use?
If I understand correctly, does each thread need its own ggml_backend instance?
Is it possible to create a ggml_backend instance without reloading the model?
Thanks.
CUDA Errors:
2025-02-09 16:44:06.2064 LLama.Native.SafeLLamaContextHandle.llama_decode Error: CUDA error: operation failed due to a previous error during capture
SafeLLamaContextHandle.Decode => SafeLLamaContextHandle.llama_decode => SafeLLamaContextHandle.llama_decode
2025-02-09 16:44:06.2064 LLama.Native.NativeApi.llama_kv_cache_clear Error: CUDA error: operation not permitted when stream is capturing
SafeLLamaContextHandle.KvCacheClear => NativeApi.llama_kv_cache_clear => NativeApi.llama_kv_cache_clear
2025-02-09 16:44:06.2427 LLama.Native.SafeLLamaContextHandle.llama_decode Error: current device: 1, in function ggml_cuda_op_mul_mat at D:\a\LLamaSharp\LLamaSharp\ggml\src\ggml-cuda\ggml-cuda.cu:1615
SafeLLamaContextHandle.Decode => SafeLLamaContextHandle.llama_decode => SafeLLamaContextHandle.llama_decode
2025-02-09 16:44:06.2427 LLama.Native.NativeApi.llama_kv_cache_clear Error: current device: 1, in function ggml_backend_cuda_buffer_clear at D:\a\LLamaSharp\LLamaSharp\ggml\src\ggml-cuda\ggml-cuda.cu:605
SafeLLamaContextHandle.KvCacheClear => NativeApi.llama_kv_cache_clear => NativeApi.llama_kv_cache_clear
2025-02-09 16:44:06.2427 LLama.Native.NativeApi.llama_kv_cache_clear Error: cudaDeviceSynchronize()
SafeLLamaContextHandle.KvCacheClear => NativeApi.llama_kv_cache_clear => NativeApi.llama_kv_cache_clear
2025-02-09 16:44:06.2427 LLama.Native.SafeLLamaContextHandle.llama_decode Error: cudaGetLastError()
SafeLLamaContextHandle.Decode => SafeLLamaContextHandle.llama_decode => SafeLLamaContextHandle.llama_decode
2025-02-09 16:48:54.9660 LLama.Native.SafeLLamaContextHandle.llama_decode Error: ggml_cuda_compute_forward: ADD failed
SafeLLamaContextHandle.Decode => SafeLLamaContextHandle.llama_decode => SafeLLamaContextHandle.llama_decode
2025-02-09 16:48:54.9660 LLama.Native.NativeApi.llama_kv_cache_clear Error: CUDA error: operation not permitted when stream is capturing
SafeLLamaContextHandle.KvCacheClear => NativeApi.llama_kv_cache_clear => NativeApi.llama_kv_cache_clear
2025-02-09 16:48:54.9864 LLama.Native.SafeLLamaContextHandle.llama_decode Error: CUDA error: operation failed due to a previous error during capture
SafeLLamaContextHandle.Decode => SafeLLamaContextHandle.llama_decode => SafeLLamaContextHandle.llama_decode
2025-02-09 16:48:54.9864 LLama.Native.NativeApi.llama_kv_cache_clear Error: current device: 1, in function ggml_backend_cuda_buffer_clear at D:\a\LLamaSharp\LLamaSharp\ggml\src\ggml-cuda\ggml-cuda.cu:607
SafeLLamaContextHandle.KvCacheClear => NativeApi.llama_kv_cache_clear => NativeApi.llama_kv_cache_clear
2025-02-09 16:48:54.9864 LLama.Native.SafeLLamaContextHandle.llama_decode Error: current device: 1, in function ggml_cuda_compute_forward at D:\a\LLamaSharp\LLamaSharp\ggml\src\ggml-cuda\ggml-cuda.cu:2313
SafeLLamaContextHandle.Decode => SafeLLamaContextHandle.llama_decode => SafeLLamaContextHandle.llama_decode
2025-02-09 16:48:54.9864 LLama.Native.NativeApi.llama_kv_cache_clear Error: cudaDeviceSynchronize()
SafeLLamaContextHandle.KvCacheClear => NativeApi.llama_kv_cache_clear => NativeApi.llama_kv_cache_clear
2025-02-09 16:48:54.9864 LLama.Native.SafeLLamaContextHandle.llama_decode Error: err
SafeLLamaContextHandle.Decode => SafeLLamaContextHandle.llama_decode => SafeLLamaContextHandle.llama_decode
SciSharp/LLamaSharp#1091
Operating systems
Windows
Which llama.cpp modules do you know to be affected?
libllama (core library)