Misc. bug: CUDA errors with multi-threaded use #11804
Comments
Do you still get errors or incorrect results when you set the environment variable GGML_CUDA_DISABLE_GRAPHS=1?
I'm still getting errors. I'm using the llama backend build from the LLamaSharp repository.
Could you explain this option? Should it be used when building llama.cpp?
By default CUDA graphs are used. At runtime they can be disabled by setting the environment variable GGML_CUDA_DISABLE_GRAPHS=1.
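For illustration, a minimal sketch (my own, not the actual ggml-cuda source) of how a runtime kill switch like GGML_CUDA_DISABLE_GRAPHS is typically read; the point is that it is an environment variable checked at runtime, not a build flag:

```cpp
// Hedged sketch: reading a runtime environment-variable switch such as
// GGML_CUDA_DISABLE_GRAPHS. Any non-null value counts as "set".
#include <cstdlib>

static bool cuda_graphs_disabled() {
    // Evaluated once; later calls reuse the cached result.
    static const bool disabled = std::getenv("GGML_CUDA_DISABLE_GRAPHS") != nullptr;
    return disabled;
}
```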
I'm checking this flag. |
I have set GGML_CUDA_DISABLE_GRAPHS=1 and it does indeed work with this flag. But why is this the only correct solution in this case? As far as I understand, there may be a drop in performance. As soon as I remove the flag, I get CUDA errors again (one GPU):
2025-02-11 17:15:15.2025 LLama.Native.SafeLLamaContextHandle.llama_decode Error: ggml_cuda_compute_forward: ADD failed
2025-02-11 17:16:24.6475 LLama.Native.SafeLLamaContextHandle.llama_decode Error: CUDA error: operation failed due to a previous error during capture
One more question: will a call like llama_tokenize(llama_model_get_vocab(this), ...) be thread-safe, given that there is only one model instance shared by everyone?
I don't know, @agray3 is probably a better person to answer this.
I don't know.
@slaren
Yes. The question is why the flag GGML_CUDA_DISABLE_GRAPHS=1 solved the problem. llama_tokenize(llama_model_get_vocab(this), ...)
Please, why did the flag GGML_CUDA_DISABLE_GRAPHS=1 solve the problem?
If I knew, I would have already told you. What's the point of insisting?
I'm AFK on holiday this week, so I can't look in detail, but you can try changing the mode at https://github.com/ggerganov/llama.cpp/blob/369be5598ac5f71f6aa6d6606e9aec769a23dafa/ggml/src/ggml-cuda/ggml-cuda.cu#L2757 to cudaStreamCaptureModeThreadLocal.
Please excuse me.
A thread's mode is one of the following:
- cudaStreamCaptureModeGlobal: This is the default mode. If the local thread has an ongoing capture sequence that was not initiated with cudaStreamCaptureModeRelaxed at cuStreamBeginCapture, or if any other thread has a concurrent capture sequence initiated with cudaStreamCaptureModeGlobal, this thread is prohibited from potentially unsafe API calls.
- cudaStreamCaptureModeThreadLocal: If the local thread has an ongoing capture sequence not initiated with cudaStreamCaptureModeRelaxed, it is prohibited from potentially unsafe API calls. Concurrent capture sequences in other threads are ignored.
- cudaStreamCaptureModeRelaxed: The local thread is not prohibited from potentially unsafe API calls. Note that the thread is still prohibited from API calls which necessarily conflict with stream capture, for example, attempting cudaEventQuery on an event that was last recorded inside a capture sequence.
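For reference, a self-contained sketch (made-up kernel and sizes, not llama.cpp code) of what changing the capture mode means in practice: with cudaStreamCaptureModeGlobal, a capture in progress on one thread can make CUDA calls on other threads fail with "operation not permitted when stream is capturing", whereas cudaStreamCaptureModeThreadLocal scopes that restriction to the capturing thread only:

```cuda
// Minimal CUDA graph capture example (assumption: CUDA 12.x for the 3-argument
// cudaGraphInstantiate). The capture mode argument is the only point of interest.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void scale(float * x) { x[threadIdx.x] *= 2.0f; }

int main() {
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    float * d = nullptr;
    cudaMalloc(&d, 32 * sizeof(float));

    // ThreadLocal: only this thread is restricted while the capture is active;
    // concurrent CUDA API calls from other host threads are ignored by this capture.
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeThreadLocal);
    scale<<<1, 32, 0, stream>>>(d);

    cudaGraph_t graph;
    cudaGraphExec_t exec;
    cudaStreamEndCapture(stream, &graph);
    cudaGraphInstantiate(&exec, graph, 0);
    cudaGraphLaunch(exec, stream);
    cudaStreamSynchronize(stream);

    printf("replay status: %s\n", cudaGetErrorString(cudaGetLastError()));

    cudaGraphExecDestroy(exec);
    cudaGraphDestroy(graph);
    cudaFree(d);
    cudaStreamDestroy(stream);
    return 0;
}
```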
I tried the options with cudaStreamCaptureModeThreadLocal and cudaStreamCaptureModeGlobal, but the error did not disappear.
A few more details I missed earlier. This is how the errors are split between threads.

This thread calls the decoder:
2025-02-13 19:24:25.4454|LLamaStatelessExecutor InferAsync Evaluate New Token
2025-02-13 19:24:25.4584 LLama.Native.SafeLLamaContextHandle.llama_decode Error: CUDA error: operation failed due to a previous error during capture

This thread calls the create context:
2025-02-13 19:24:25.5281||LLamaStatelessExecutor InferAsync CreateContext
2025-02-13 19:24:25.5394 LLama.Native.SafeLLamaContextHandle.llama_init_from_model Error: CUDA error: operation not permitted when stream is capturing

This thread calls the create context:
2025-02-13 19:24:25.5552|LLamaStatelessExecutor InferAsync CreateContext
SafeLLamaContextHandle.Create => SafeLLamaContextHandle.llama_init_from_model => SafeLLamaContextHandle.llama_init_from_model

Do I understand correctly that creating a context is not thread-safe?
@agray3 |
I'm using LLamaSharp 0.21.0 (CUDA 12.8 backend) with llama.cpp commit:
5783575
Model for Inference (one instance for all users): Qwen2.5-14B-1M-Q5-K-M
Model for Embedding (one instance for all users): Qwen2.5-1.5B-Q5-K-M
All models together use 12 GB of VRAM.
FlashAttention = true!
Memory on the server: 32-48 GB of RAM.
There is always enough memory (RAM, VRAM) for queries with a margin.
Everything works fine for one user.
When I run 3-4 web requests at the same time, the application crashes with fatal CUDA errors. The errors are almost always the same. One request uses about 3.3 GB of VRAM (context size: 16K, nbatch: 2048, ubatch: 512).
I see errors both when using one GPU (RTX 4090) and when using two GPUs (2 x RTX 4090, layer split mode, tensors and VRAM 50/50).
I've read the following, and apparently this shouldn't be happening:
#3960
#6017
The errors disappear when I add locking at the LLamaSharp level around the sections that create a context, delete a context, run the decoder, compute embeddings, clear the KV cache, etc.
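For example, a rough sketch of that workaround at the native level (my own illustration, not LLamaSharp's or llama.cpp's code; it assumes the llama.cpp API names that appear in the logs above, such as llama_init_from_model and llama_decode):

```cpp
// Hedged sketch of the locking workaround: one shared llama_model, and every call
// that touches the CUDA backend (context creation/destruction, decode) goes through
// a single mutex. Batch construction and error handling are omitted for brevity.
#include <mutex>
#include "llama.h"

static std::mutex    g_llama_mutex;       // serializes all GPU-touching llama.cpp calls
static llama_model * g_model = nullptr;   // loaded once, shared by all requests

llama_context * create_context_locked() {
    std::lock_guard<std::mutex> lock(g_llama_mutex);
    llama_context_params cparams = llama_context_default_params();
    cparams.n_ctx = 16 * 1024;            // 16K context, matching the report above
    return llama_init_from_model(g_model, cparams);
}

int32_t decode_locked(llama_context * ctx, llama_batch batch) {
    std::lock_guard<std::mutex> lock(g_llama_mutex);
    return llama_decode(ctx, batch);
}

void free_context_locked(llama_context * ctx) {
    std::lock_guard<std::mutex> lock(g_llama_mutex);
    llama_free(ctx);
}
```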
The LLamaSharp team is aware of the problem, but their code is clean and simply calls the native llama.cpp API.
LLamaSharp has its own partial (not complete) resource locking for multithreaded use.
At the same time, wasn't #3960 supposed to solve all of these problems?
Can you tell me what to do and where the problem might be?
What is the correct approach to using llama.cpp with a GPU in a multithreaded environment?
Are there any recommendations for building llama.cpp for multithreaded use?
If I understand correctly, does each thread need its own ggml_backend instance?
Is it possible to create a ggml_backend instance without reloading the model?
Thanks.
CUDA Errors:
2025-02-09 16:44:06.2064 LLama.Native.SafeLLamaContextHandle.llama_decode Error: CUDA error: operation failed due to a previous error during capture
SafeLLamaContextHandle.Decode => SafeLLamaContextHandle.llama_decode => SafeLLamaContextHandle.llama_decode
2025-02-09 16:44:06.2064 LLama.Native.NativeApi.llama_kv_cache_clear Error: CUDA error: operation not permitted when stream is capturing
SafeLLamaContextHandle.KvCacheClear => NativeApi.llama_kv_cache_clear => NativeApi.llama_kv_cache_clear
2025-02-09 16:44:06.2427 LLama.Native.SafeLLamaContextHandle.llama_decode Error: current device: 1, in function ggml_cuda_op_mul_mat at D:\a\LLamaSharp\LLamaSharp\ggml\src\ggml-cuda\ggml-cuda.cu:1615
SafeLLamaContextHandle.Decode => SafeLLamaContextHandle.llama_decode => SafeLLamaContextHandle.llama_decode
2025-02-09 16:44:06.2427 LLama.Native.NativeApi.llama_kv_cache_clear Error: current device: 1, in function ggml_backend_cuda_buffer_clear at D:\a\LLamaSharp\LLamaSharp\ggml\src\ggml-cuda\ggml-cuda.cu:605
SafeLLamaContextHandle.KvCacheClear => NativeApi.llama_kv_cache_clear => NativeApi.llama_kv_cache_clear
2025-02-09 16:44:06.2427 LLama.Native.NativeApi.llama_kv_cache_clear Error: cudaDeviceSynchronize()
SafeLLamaContextHandle.KvCacheClear => NativeApi.llama_kv_cache_clear => NativeApi.llama_kv_cache_clear
2025-02-09 16:44:06.2427 LLama.Native.SafeLLamaContextHandle.llama_decode Error: cudaGetLastError()
SafeLLamaContextHandle.Decode => SafeLLamaContextHandle.llama_decode => SafeLLamaContextHandle.llama_decode
2025-02-09 16:48:54.9660 LLama.Native.SafeLLamaContextHandle.llama_decode Error: ggml_cuda_compute_forward: ADD failed
SafeLLamaContextHandle.Decode => SafeLLamaContextHandle.llama_decode => SafeLLamaContextHandle.llama_decode
2025-02-09 16:48:54.9660 LLama.Native.NativeApi.llama_kv_cache_clear Error: CUDA error: operation not permitted when stream is capturing
SafeLLamaContextHandle.KvCacheClear => NativeApi.llama_kv_cache_clear => NativeApi.llama_kv_cache_clear
2025-02-09 16:48:54.9864 LLama.Native.SafeLLamaContextHandle.llama_decode Error: CUDA error: operation failed due to a previous error during capture
SafeLLamaContextHandle.Decode => SafeLLamaContextHandle.llama_decode => SafeLLamaContextHandle.llama_decode
2025-02-09 16:48:54.9864 LLama.Native.NativeApi.llama_kv_cache_clear Error: current device: 1, in function ggml_backend_cuda_buffer_clear at D:\a\LLamaSharp\LLamaSharp\ggml\src\ggml-cuda\ggml-cuda.cu:607
SafeLLamaContextHandle.KvCacheClear => NativeApi.llama_kv_cache_clear => NativeApi.llama_kv_cache_clear
2025-02-09 16:48:54.9864 LLama.Native.SafeLLamaContextHandle.llama_decode Error: current device: 1, in function ggml_cuda_compute_forward at D:\a\LLamaSharp\LLamaSharp\ggml\src\ggml-cuda\ggml-cuda.cu:2313
SafeLLamaContextHandle.Decode => SafeLLamaContextHandle.llama_decode => SafeLLamaContextHandle.llama_decode
2025-02-09 16:48:54.9864 LLama.Native.NativeApi.llama_kv_cache_clear Error: cudaDeviceSynchronize()
SafeLLamaContextHandle.KvCacheClear => NativeApi.llama_kv_cache_clear => NativeApi.llama_kv_cache_clear
2025-02-09 16:48:54.9864 LLama.Native.SafeLLamaContextHandle.llama_decode Error: err
SafeLLamaContextHandle.Decode => SafeLLamaContextHandle.llama_decode => SafeLLamaContextHandle.llama_decode
SciSharp/LLamaSharp#1091
Operating systems
Windows
Which llama.cpp modules do you know to be affected?
libllama (core library)