[BUG]: CUDA errors with two GPUs (multiple parallel requests) #1091
Comments
The thread-safety story in LLamaSharp is basically just: whatever the thread safety of llama.cpp is. I'm not sure what the state of that is at the moment, so I'd recommend asking about this upstream. Many of the error messages you're getting come straight from llama.cpp, so they should be understandable to the upstream maintainers. In the past I added a lock (see here) which ensures only one inference call is ever happening (even across multiple contexts). The issues mentioned in that comment have since been resolved, which is interesting; potentially that lock can be removed! It sounds like the issues you're getting are mostly around creation and destruction of contexts, so potentially we should add a similar locking system inside those methods. Would you be interested in working on a PR in this area? It would probably just involve a static lock (like the current one) wrapped around the internals of …
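For illustration, the kind of static lock suggested here might look something like the sketch below. The class and wrapper names are hypothetical, not LLamaSharp's actual internals; the idea is only that context creation and destruction funnel through one shared lock, the same way inference already does.

```csharp
using System;

// Hypothetical illustration only: a process-wide lock that native context
// creation/destruction calls could be funnelled through, similar in spirit
// to the existing global inference lock mentioned above.
internal static class GlobalContextLock
{
    private static readonly object _sync = new();

    // Run an arbitrary native call while holding the shared lock.
    public static T Run<T>(Func<T> nativeCall)
    {
        lock (_sync)
        {
            return nativeCall();
        }
    }

    public static void Run(Action nativeCall)
    {
        lock (_sync)
        {
            nativeCall();
        }
    }
}

// Usage sketch (the exact native wrapper names are assumptions):
// var ctx = GlobalContextLock.Run(() => NativeApi.llama_new_context_with_model(model, ctxParams));
// GlobalContextLock.Run(() => NativeApi.llama_free(ctx));
```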
Thanks for your reply. I'm not ready to make changes at the PR level yet.
I looked at it, but I didn't understand much :)
I noticed that the errors always start with: … Maybe they're interfering with each other, too. At the same time, I see this in decode(): … Apparently, all operations with the handle need to be protected using lock().
Since #3960 is resolved I was hoping we could remove the …
Very good news: I managed to fix the bug for 2 GPUs. In addition to the above, I added a lock to the tokenizer (there is a decoder there) and locked the decoder in InferAsync. This means that if you add a lock wherever one is needed, everything works. So far it looks very ugly, but it works! :)
InferAsync: …
ContextLocker is based on SemaphoreSlim.
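Since the original snippets did not survive formatting, here is a rough sketch of what a SemaphoreSlim-based ContextLocker could look like. Only the class name comes from the comment above; the rest is an assumed reconstruction, not actual project code.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Guessed reconstruction of the "ContextLocker" described above:
// a single async-friendly gate shared by context creation, tokenization and decoding.
public sealed class ContextLocker
{
    private static readonly SemaphoreSlim Semaphore = new(1, 1);

    // Acquire the gate; dispose the returned token to release it.
    public static async Task<IDisposable> AcquireAsync(CancellationToken cancellationToken = default)
    {
        await Semaphore.WaitAsync(cancellationToken);
        return new Releaser();
    }

    private sealed class Releaser : IDisposable
    {
        public void Dispose() => Semaphore.Release();
    }
}

// Usage inside e.g. InferAsync (sketch):
// using (await ContextLocker.AcquireAsync())
// {
//     // create context / tokenize / decode here
// }
```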
But there is a question about performance, so that it does not degrade. There are no errors after adding the locks, but of course I see delays in responses due to waiting for the locks, and it is not clear whether this is correct. Here you need to understand very well what is happening inside llama.cpp. Maybe the locking solution is not quite right and there is another way to properly solve the problem with CUDA errors for multiple requests.
I checked again: as soon as I removed the locks inside InferAsync(), the error immediately appeared:
2025-02-10 14:21:49.7078 LLama.Native.NativeApi.llama_kv_cache_clear Error: CUDA error: operation not permitted when stream is capturing
Or maybe it's the llama.cpp build [Adds the build parameter LLAMA_SCHED_MAX_COPIES]?
ggml-org/llama.cpp#3960 (comment) But here we are talking about launching from different threads and on different GPUs!? If so, how can this be managed? The more I read, the more questions I have :)
Exactly my feelings on llama.cpp thread safety! This is why I was a little vague about it in my initial reply.
Maybe you need to ask the llama.cpp team a question? They write that llama.cpp is thread-safe.
@martindevans Does that mean anything to you? Can I now create a new ggml_backend instance for each thread in LLamaSharp without reloading the models?
Creating a … It's possible that's done in llama.cpp as part of creating a context or loading weights, in which case you can create a context-per-thread.
Yes, of course, the context is always different for each thread. I'm testing GGML_CUDA_DISABLE_GRAPHS=1.
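For anyone trying the same thing, the variable can also be set from managed code, under the assumption that ggml's CUDA backend reads GGML_CUDA_DISABLE_GRAPHS via getenv at runtime and that it is set before any native llama.cpp/ggml code initializes CUDA. On non-Windows platforms it may be more reliable to export the variable in the shell before starting the process. The model path below is a placeholder.

```csharp
using System;
using LLama;
using LLama.Common;

// Assumption: the native CUDA backend checks GGML_CUDA_DISABLE_GRAPHS at runtime,
// so it must be set before the native library is loaded and initialized.
Environment.SetEnvironmentVariable("GGML_CUDA_DISABLE_GRAPHS", "1");

// Hypothetical model path; load weights only after the variable is set.
var parameters = new ModelParams("models/Qwen2.5-14B-1M-Q5-K-M.gguf");
using var weights = LLamaWeights.LoadFromFile(parameters);
```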
Is there a way to build a CUDA 12 llama.cpp for Windows with this change?
If you've got the CUDA toolchains installed locally, you can make that change and run the cmake file. If not, you could: …
Unfortunately, I won't be able to do it.
The second one really isn't as complex as it perhaps sounds! Modifying llama.cpp is as simple as changing that one line from … Modifying the build script just requires replacing all the lines that look like … Running the build script just requires going into … After the build is done (it takes about 2 hours) you can download the binaries and replace them in your copy of LLamaSharp.
I'll try, but I'm not sure whether it will work, which is why I looked up these flags: https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__STREAM.html
Currently cudaStreamCaptureModeRelaxed is in use. A thread's mode is one of the following:
cudaStreamCaptureModeGlobal: This is the default mode. If the local thread has an ongoing capture sequence that was not initiated with cudaStreamCaptureModeRelaxed at cuStreamBeginCapture, or if any other thread has a concurrent capture sequence initiated with cudaStreamCaptureModeGlobal, this thread is prohibited from potentially unsafe API calls.
cudaStreamCaptureModeThreadLocal: If the local thread has an ongoing capture sequence not initiated with cudaStreamCaptureModeRelaxed, it is prohibited from potentially unsafe API calls. Concurrent capture sequences in other threads are ignored.
cudaStreamCaptureModeRelaxed: The local thread is not prohibited from potentially unsafe API calls. Note that the thread is still prohibited from API calls which necessarily conflict with stream capture, for example, attempting cudaEventQuery on an event that was last recorded inside a capture sequence.
I tried the cudaStreamCaptureModeThreadLocal and cudaStreamCaptureModeGlobal options, but the error did not disappear.
A little more information about the error. I added a lot of logging :) Here is what happens when the error occurs:
First thread, when creating a context in StatelessExecutor.InferAsync():
2025-02-13 19:24:25.5394||LLama.LLamaWeights|ERROR|CUDA error: operation not permitted when stream is capturing
Second thread, when creating a context in StatelessExecutor.InferAsync():
2025-02-13 19:24:25.5839||LLama.LLamaWeights|ERROR|CUDA error: operation not permitted when stream is capturing
Third thread, when executing this code in StatelessExecutor.InferAsync(): …
2025-02-13 19:24:25.4584||LLama.LLamaWeights|ERROR|CUDA error: operation failed due to a previous error during capture
Do you have any ideas what else to check? But there is one clear problem: I think that the creation/destruction of a context definitely needs to be protected with lock().
Description
I run several requests (3-4) at the same time; each one is executed sequentially by LLamaEmbedder.GetEmbeddings() and StatelessExecutor.InferAsync().
The models used for these calls are different:
For Infer (one instance for all users): Qwen2.5-14B-1M-Q5-K-M
For Embedding (one instance for all users): Qwen2.5-1.5B-Q5-K-M
There is always enough memory for the requests, with a margin to spare.
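For concreteness, a rough sketch of this setup is shown below. Paths and prompts are placeholders, and exact LLamaSharp signatures (for example the embedding flag name and the GetEmbeddings return type) vary between versions, so treat it as an outline rather than a verified reproduction.

```csharp
using System.Linq;
using System.Threading.Tasks;
using LLama;
using LLama.Common;

// Placeholder paths; one shared instance of each model for all "users".
var inferParams = new ModelParams("models/Qwen2.5-14B-1M-Q5-K-M.gguf");
var embedParams = new ModelParams("models/Qwen2.5-1.5B-Q5-K-M.gguf")
{
    Embeddings = true // property name may differ in older LLamaSharp versions
};

using var inferWeights = LLamaWeights.LoadFromFile(inferParams);
using var embedWeights = LLamaWeights.LoadFromFile(embedParams);

var executor = new StatelessExecutor(inferWeights, inferParams);
using var embedder = new LLamaEmbedder(embedWeights, embedParams);

// 3-4 parallel requests; each one runs embedding then inference sequentially.
var tasks = Enumerable.Range(0, 4).Select(async i =>
{
    await embedder.GetEmbeddings($"query {i}");
    await foreach (var token in executor.InferAsync($"prompt {i}"))
    {
        // consume generated tokens
    }
});

await Task.WhenAll(tasks);
```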
1. One GPU
-- First there were these CUDA errors:
CUDA error: operation failed due to a previous error during capture
CUDA error: operation not permitted when stream is capturing
ggml_cuda_compute_forward: ADD failed
-- The errors went away when I added locking around GetEmbeddings() and around CreateContext/Destroy in InferAsync().
Why did I have to do this? Is it the right approach?
Questions:
What are the general limitations of multithreading in LLamaSharp? What needs to be considered in this case? Does anyone have experience implementing a multi-threaded web application?
2. Two GPUs
GPUSplitMode = GPUSplitMode.Layer;
Despite the fixes for one GPU, errors still occur on two GPUs:
2025-02-09 16:44:06.2064 LLama.Native.SafeLLamaContextHandle.llama_decode Error: CUDA error: operation failed due to a previous error during capture
SafeLLamaContextHandle.Decode => SafeLLamaContextHandle.llama_decode => SafeLLamaContextHandle.llama_decode
2025-02-09 16:44:06.2064 LLama.Native.NativeApi.llama_kv_cache_clear Error: CUDA error: operation not permitted when stream is capturing
SafeLLamaContextHandle.KvCacheClear => NativeApi.llama_kv_cache_clear => NativeApi.llama_kv_cache_clear
2025-02-09 16:44:06.2427 LLama.Native.SafeLLamaContextHandle.llama_decode Error: current device: 1, in function ggml_cuda_op_mul_mat at D:\a\LLamaSharp\LLamaSharp\ggml\src\ggml-cuda\ggml-cuda.cu:1615
SafeLLamaContextHandle.Decode => SafeLLamaContextHandle.llama_decode => SafeLLamaContextHandle.llama_decode
2025-02-09 16:44:06.2427 LLama.Native.NativeApi.llama_kv_cache_clear Error: current device: 1, in function ggml_backend_cuda_buffer_clear at D:\a\LLamaSharp\LLamaSharp\ggml\src\ggml-cuda\ggml-cuda.cu:605
SafeLLamaContextHandle.KvCacheClear => NativeApi.llama_kv_cache_clear => NativeApi.llama_kv_cache_clear
2025-02-09 16:44:06.2427 LLama.Native.NativeApi.llama_kv_cache_clear Error: cudaDeviceSynchronize()
SafeLLamaContextHandle.KvCacheClear => NativeApi.llama_kv_cache_clear => NativeApi.llama_kv_cache_clear
2025-02-09 16:44:06.2427 LLama.Native.SafeLLamaContextHandle.llama_decode Error: cudaGetLastError()
SafeLLamaContextHandle.Decode => SafeLLamaContextHandle.llama_decode => SafeLLamaContextHandle.llama_decode
2025-02-09 16:48:54.9660 LLama.Native.SafeLLamaContextHandle.llama_decode Error: ggml_cuda_compute_forward: ADD failed
SafeLLamaContextHandle.Decode => SafeLLamaContextHandle.llama_decode => SafeLLamaContextHandle.llama_decode
2025-02-09 16:48:54.9660 LLama.Native.NativeApi.llama_kv_cache_clear Error: CUDA error: operation not permitted when stream is capturing
SafeLLamaContextHandle.KvCacheClear => NativeApi.llama_kv_cache_clear => NativeApi.llama_kv_cache_clear
2025-02-09 16:48:54.9864 LLama.Native.SafeLLamaContextHandle.llama_decode Error: CUDA error: operation failed due to a previous error during capture
SafeLLamaContextHandle.Decode => SafeLLamaContextHandle.llama_decode => SafeLLamaContextHandle.llama_decode
2025-02-09 16:48:54.9864 LLama.Native.NativeApi.llama_kv_cache_clear Error: current device: 1, in function ggml_backend_cuda_buffer_clear at D:\a\LLamaSharp\LLamaSharp\ggml\src\ggml-cuda\ggml-cuda.cu:607
SafeLLamaContextHandle.KvCacheClear => NativeApi.llama_kv_cache_clear => NativeApi.llama_kv_cache_clear
2025-02-09 16:48:54.9864 LLama.Native.SafeLLamaContextHandle.llama_decode Error: current device: 1, in function ggml_cuda_compute_forward at D:\a\LLamaSharp\LLamaSharp\ggml\src\ggml-cuda\ggml-cuda.cu:2313
SafeLLamaContextHandle.Decode => SafeLLamaContextHandle.llama_decode => SafeLLamaContextHandle.llama_decode
2025-02-09 16:48:54.9864 LLama.Native.NativeApi.llama_kv_cache_clear Error: cudaDeviceSynchronize()
SafeLLamaContextHandle.KvCacheClear => NativeApi.llama_kv_cache_clear => NativeApi.llama_kv_cache_clear
2025-02-09 16:48:54.9864 LLama.Native.SafeLLamaContextHandle.llama_decode Error: err
SafeLLamaContextHandle.Decode => SafeLLamaContextHandle.llama_decode => SafeLLamaContextHandle.llama_decode
Questions:
What should I do? What should I pay attention to?
If each subsequent request is sent after a 2-3 second delay, everything works!
After many hours of experimentation, I think that creating and deleting a context (where VRAM is allocated) must be done in a thread-safe way (inside a lock).
The same may also be needed in other places where the GPU resource is used.
Thanks.
Reproduction Steps
Multiple parallel requests
Environment & Configuration
Known Workarounds
No response