[BUG]: CUDA errors with two GPUs (multiple parallel requests) #1091
Comments
The thread-safety story in LLamaSharp is basically just: whatever the thread safety of llama.cpp is. I'm not sure what the state of that is at the moment, so I'd recommend asking about this upstream. Many of the error messages you're getting come straight from llama.cpp, so they should be understandable to the upstream maintainers. In the past I added a lock (see here) which ensures only one inference call is ever happening (even across multiple contexts). The issues mentioned in that comment have since been resolved, which is interesting; potentially that lock can be removed! It sounds like the issues you're getting are mostly around creation and destruction of contexts, so potentially we should add a similar locking system inside those methods. Would you be interested in working on a PR in this area? It would probably just involve a static lock (like the current one) wrapped around the internals of …
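For illustration, the kind of static lock suggested here might look something like the sketch below. The class and wrapper names are hypothetical, not LLamaSharp's actual internals; the idea is only that context creation and destruction funnel through one shared lock, the same way inference already does.

```csharp
using System;

// Hypothetical illustration only: a process-wide lock that native context
// creation/destruction calls could be funnelled through, similar in spirit
// to the existing global inference lock mentioned above.
internal static class GlobalContextLock
{
    private static readonly object _sync = new();

    // Run an arbitrary native call while holding the shared lock.
    public static T Run<T>(Func<T> nativeCall)
    {
        lock (_sync)
        {
            return nativeCall();
        }
    }

    public static void Run(Action nativeCall)
    {
        lock (_sync)
        {
            nativeCall();
        }
    }
}

// Usage sketch (the exact native wrapper names are assumptions):
// var ctx = GlobalContextLock.Run(() => NativeApi.llama_new_context_with_model(model, ctxParams));
// GlobalContextLock.Run(() => NativeApi.llama_free(ctx));
```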
Thanks for your reply. I'm not ready to make changes at the PR level yet.
I looked at it, but I didn't understand much :)
I noticed that the errors always start with: … Maybe they're interfering with each other, too. At the same time, I see this in decode(): … Apparently, all operations with the handle need to be protected using lock().
Since #3960 is resolved I was hoping we could remove the …
Very good news: I managed to fix the bug for 2 GPUs. In addition to the above, I added a lock to the tokenizer (there is a decoder there) and locked the decoder in InferAsync. This means that if you add a lock wherever one is needed, everything works. So far it looks very ugly, but it works! :)
InferAsync: …
ContextLocker is based on SemaphoreSlim.
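Since the original snippets did not survive formatting, here is a rough sketch of what a SemaphoreSlim-based ContextLocker could look like. Only the class name comes from the comment above; the rest is an assumed reconstruction, not actual project code.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Guessed reconstruction of the "ContextLocker" described above:
// a single async-friendly gate shared by context creation, tokenization and decoding.
public sealed class ContextLocker
{
    private static readonly SemaphoreSlim Semaphore = new(1, 1);

    // Acquire the gate; dispose the returned token to release it.
    public static async Task<IDisposable> AcquireAsync(CancellationToken cancellationToken = default)
    {
        await Semaphore.WaitAsync(cancellationToken);
        return new Releaser();
    }

    private sealed class Releaser : IDisposable
    {
        public void Dispose() => Semaphore.Release();
    }
}

// Usage inside e.g. InferAsync (sketch):
// using (await ContextLocker.AcquireAsync())
// {
//     // create context / tokenize / decode here
// }
```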
But there is a question about performance, so that it does not degrade. There are no errors after adding the locks, but of course I see delays in responses due to waiting for the locks, and it is not clear whether this is correct. Here you need to understand very well what is happening inside llama.cpp. Maybe the locking solution is not quite right and there is another way to properly solve the problem with CUDA errors for multiple requests.
I checked again: as soon as I removed the locks inside InferAsync(), the error immediately appeared:
2025-02-10 14:21:49.7078 LLama.Native.NativeApi.llama_kv_cache_clear Error: CUDA error: operation not permitted when stream is capturing
Or maybe it's the llama.cpp build [Adds the build parameter LLAMA_SCHED_MAX_COPIES]?
ggml-org/llama.cpp#3960 (comment) But here we are talking about launching from different threads and on different GPUs!? If so, how can this be managed? The more I read, the more questions I have :)
Exactly my feelings on llama.cpp thread safety! This is why I was a little vague about it in my initial reply.
Maybe you need to ask the llama.cpp team a question? They write that llama.cpp is thread-safe.
@martindevans Does that mean anything to you? Can I now create a new ggml_backend instance for each thread in LLamaSharp without reloading the models?
Creating a … It's possible that's done in llama.cpp as part of creating a context or loading weights, in which case you can create a context-per-thread.
Yes, of course, the context is always different for each thread. I'm testing GGML_CUDA_DISABLE_GRAPHS=1.
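For anyone trying the same thing, the variable can also be set from managed code, under the assumption that ggml's CUDA backend reads GGML_CUDA_DISABLE_GRAPHS via getenv at runtime and that it is set before any native llama.cpp/ggml code initializes CUDA. On non-Windows platforms it may be more reliable to export the variable in the shell before starting the process. The model path below is a placeholder.

```csharp
using System;
using LLama;
using LLama.Common;

// Assumption: the native CUDA backend checks GGML_CUDA_DISABLE_GRAPHS at runtime,
// so it must be set before the native library is loaded and initialized.
Environment.SetEnvironmentVariable("GGML_CUDA_DISABLE_GRAPHS", "1");

// Hypothetical model path; load weights only after the variable is set.
var parameters = new ModelParams("models/Qwen2.5-14B-1M-Q5-K-M.gguf");
using var weights = LLamaWeights.LoadFromFile(parameters);
```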
Is there a way to build a CUDA 12 llama.cpp for Windows with this change?
If you've got the CUDA toolchains installed locally, you can make that change and run the cmake file. If not, you could: …
Unfortunately, I won't be able to do it.
The second one really isn't as complex as it perhaps sounds! Modifying llama.cpp is as simple as changing that one line from … Modifying the build script just requires replacing all the lines that look like … Running the build script just requires going into … After the build is done (it takes about 2 hours) you can download the binaries and replace them in your copy of LLamaSharp.
I'll try, but I'm not sure whether it will work, which is why I looked up these flags: https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__STREAM.html
Currently cudaStreamCaptureModeRelaxed is in use. A thread's mode is one of the following:
cudaStreamCaptureModeGlobal: This is the default mode. If the local thread has an ongoing capture sequence that was not initiated with cudaStreamCaptureModeRelaxed at cuStreamBeginCapture, or if any other thread has a concurrent capture sequence initiated with cudaStreamCaptureModeGlobal, this thread is prohibited from potentially unsafe API calls.
cudaStreamCaptureModeThreadLocal: If the local thread has an ongoing capture sequence not initiated with cudaStreamCaptureModeRelaxed, it is prohibited from potentially unsafe API calls. Concurrent capture sequences in other threads are ignored.
cudaStreamCaptureModeRelaxed: The local thread is not prohibited from potentially unsafe API calls. Note that the thread is still prohibited from API calls which necessarily conflict with stream capture, for example, attempting cudaEventQuery on an event that was last recorded inside a capture sequence.
I tried the cudaStreamCaptureModeThreadLocal and cudaStreamCaptureModeGlobal options, but the error did not disappear.
A little more information about the error. I added a lot of logging :) Here is what happens when the error occurs:
First thread, when creating a context in StatelessExecutor.InferAsync():
2025-02-13 19:24:25.5394||LLama.LLamaWeights|ERROR|CUDA error: operation not permitted when stream is capturing
Second thread, when creating a context in StatelessExecutor.InferAsync():
2025-02-13 19:24:25.5839||LLama.LLamaWeights|ERROR|CUDA error: operation not permitted when stream is capturing
Third thread, when executing this code in StatelessExecutor.InferAsync(): …
2025-02-13 19:24:25.4584||LLama.LLamaWeights|ERROR|CUDA error: operation failed due to a previous error during capture
Do you have any ideas what else to check? But there is one clear problem: I think that the creation/destruction of a context definitely needs to be protected with lock().
Description
I run several requests (3-4) at the same time; each one is executed sequentially by LLamaEmbedder.GetEmbeddings() and StatelessExecutor.InferAsync().
The models used for these calls are different:
For Infer (one instance for all users): Qwen2.5-14B-1M-Q5-K-M
For Embedding (one instance for all users): Qwen2.5-1.5B-Q5-K-M
There is always enough memory for the requests, with a margin to spare.
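For concreteness, a rough sketch of this setup is shown below. Paths and prompts are placeholders, and exact LLamaSharp signatures (for example the embedding flag name and the GetEmbeddings return type) vary between versions, so treat it as an outline rather than a verified reproduction.

```csharp
using System.Linq;
using System.Threading.Tasks;
using LLama;
using LLama.Common;

// Placeholder paths; one shared instance of each model for all "users".
var inferParams = new ModelParams("models/Qwen2.5-14B-1M-Q5-K-M.gguf");
var embedParams = new ModelParams("models/Qwen2.5-1.5B-Q5-K-M.gguf")
{
    Embeddings = true // property name may differ in older LLamaSharp versions
};

using var inferWeights = LLamaWeights.LoadFromFile(inferParams);
using var embedWeights = LLamaWeights.LoadFromFile(embedParams);

var executor = new StatelessExecutor(inferWeights, inferParams);
using var embedder = new LLamaEmbedder(embedWeights, embedParams);

// 3-4 parallel requests; each one runs embedding then inference sequentially.
var tasks = Enumerable.Range(0, 4).Select(async i =>
{
    await embedder.GetEmbeddings($"query {i}");
    await foreach (var token in executor.InferAsync($"prompt {i}"))
    {
        // consume generated tokens
    }
});

await Task.WhenAll(tasks);
```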
1. One GPU
-- First there were these CUDA errors:
CUDA error: operation failed due to a previous error during capture
CUDA error: operation not permitted when stream is capturing
ggml_cuda_compute_forward: ADD failed
-- The errors went away when I added locking around GetEmbeddings() and around CreateContext/Destroy in InferAsync().
Why did I have to do this? Is it the right approach?
Questions:
What are the general limitations of multithreading in LLamaSharp? What needs to be considered in this case? Does anyone have experience implementing a multi-threaded web application?
2. Two GPUs
GPUSplitMode = GPUSplitMode.Layer;
Despite the fixes for one GPU, errors still occur on two GPUs:
2025-02-09 16:44:06.2064 LLama.Native.SafeLLamaContextHandle.llama_decode Error: CUDA error: operation failed due to a previous error during capture
SafeLLamaContextHandle.Decode => SafeLLamaContextHandle.llama_decode => SafeLLamaContextHandle.llama_decode
2025-02-09 16:44:06.2064 LLama.Native.NativeApi.llama_kv_cache_clear Error: CUDA error: operation not permitted when stream is capturing
SafeLLamaContextHandle.KvCacheClear => NativeApi.llama_kv_cache_clear => NativeApi.llama_kv_cache_clear
2025-02-09 16:44:06.2427 LLama.Native.SafeLLamaContextHandle.llama_decode Error: current device: 1, in function ggml_cuda_op_mul_mat at D:\a\LLamaSharp\LLamaSharp\ggml\src\ggml-cuda\ggml-cuda.cu:1615
SafeLLamaContextHandle.Decode => SafeLLamaContextHandle.llama_decode => SafeLLamaContextHandle.llama_decode
2025-02-09 16:44:06.2427 LLama.Native.NativeApi.llama_kv_cache_clear Error: current device: 1, in function ggml_backend_cuda_buffer_clear at D:\a\LLamaSharp\LLamaSharp\ggml\src\ggml-cuda\ggml-cuda.cu:605
SafeLLamaContextHandle.KvCacheClear => NativeApi.llama_kv_cache_clear => NativeApi.llama_kv_cache_clear
2025-02-09 16:44:06.2427 LLama.Native.NativeApi.llama_kv_cache_clear Error: cudaDeviceSynchronize()
SafeLLamaContextHandle.KvCacheClear => NativeApi.llama_kv_cache_clear => NativeApi.llama_kv_cache_clear
2025-02-09 16:44:06.2427 LLama.Native.SafeLLamaContextHandle.llama_decode Error: cudaGetLastError()
SafeLLamaContextHandle.Decode => SafeLLamaContextHandle.llama_decode => SafeLLamaContextHandle.llama_decode
2025-02-09 16:48:54.9660 LLama.Native.SafeLLamaContextHandle.llama_decode Error: ggml_cuda_compute_forward: ADD failed
SafeLLamaContextHandle.Decode => SafeLLamaContextHandle.llama_decode => SafeLLamaContextHandle.llama_decode
2025-02-09 16:48:54.9660 LLama.Native.NativeApi.llama_kv_cache_clear Error: CUDA error: operation not permitted when stream is capturing
SafeLLamaContextHandle.KvCacheClear => NativeApi.llama_kv_cache_clear => NativeApi.llama_kv_cache_clear
2025-02-09 16:48:54.9864 LLama.Native.SafeLLamaContextHandle.llama_decode Error: CUDA error: operation failed due to a previous error during capture
SafeLLamaContextHandle.Decode => SafeLLamaContextHandle.llama_decode => SafeLLamaContextHandle.llama_decode
2025-02-09 16:48:54.9864 LLama.Native.NativeApi.llama_kv_cache_clear Error: current device: 1, in function ggml_backend_cuda_buffer_clear at D:\a\LLamaSharp\LLamaSharp\ggml\src\ggml-cuda\ggml-cuda.cu:607
SafeLLamaContextHandle.KvCacheClear => NativeApi.llama_kv_cache_clear => NativeApi.llama_kv_cache_clear
2025-02-09 16:48:54.9864 LLama.Native.SafeLLamaContextHandle.llama_decode Error: current device: 1, in function ggml_cuda_compute_forward at D:\a\LLamaSharp\LLamaSharp\ggml\src\ggml-cuda\ggml-cuda.cu:2313
SafeLLamaContextHandle.Decode => SafeLLamaContextHandle.llama_decode => SafeLLamaContextHandle.llama_decode
2025-02-09 16:48:54.9864 LLama.Native.NativeApi.llama_kv_cache_clear Error: cudaDeviceSynchronize()
SafeLLamaContextHandle.KvCacheClear => NativeApi.llama_kv_cache_clear => NativeApi.llama_kv_cache_clear
2025-02-09 16:48:54.9864 LLama.Native.SafeLLamaContextHandle.llama_decode Error: err
SafeLLamaContextHandle.Decode => SafeLLamaContextHandle.llama_decode => SafeLLamaContextHandle.llama_decode
Questions:
What should I do? What should I pay attention to?
If each subsequent request is sent after a 2-3 second delay, everything works!
After many hours of experimentation, I think that creating and deleting a context (where VRAM is allocated) must be done in a thread-safe way (inside a lock).
The same may also be needed in other places where the GPU resource is used.
Thanks.
Reproduction Steps
Multiple parallel requests
Environment & Configuration
Known Workarounds
No response