
[BUG]: CUDA errors with two GPUs (multiple parallel requests) #1091

Open
aropb opened this issue Feb 9, 2025 · 24 comments
Labels: bug (Something isn't working), good first issue (Good for newcomers), help wanted (Extra attention is needed)

Comments


aropb commented Feb 9, 2025

Description

I run several requests (3-4) at the same time; each request calls LLamaEmbedder.GetEmbeddings() and then StatelessExecutor.InferAsync() sequentially.

The two calls use different models.
For inference (one instance shared by all users): Qwen2.5-14B-1M-Q5-K-M
For embeddings (one instance shared by all users): Qwen2.5-1.5B-Q5-K-M

There is always enough VRAM for these requests, with a margin to spare.
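
Roughly, the setup looks like this (a simplified sketch; the file paths are placeholders and the constructor calls are the standard LLamaSharp ones as I understand them):

using LLama;
using LLama.Common;

// One embedder instance shared by all users (sketch; paths are placeholders)
var embedParams = new ModelParams(@"models\Qwen2.5-1.5B-Q5-K-M.gguf");
using var embedWeights = LLamaWeights.LoadFromFile(embedParams);
using var embedder = new LLamaEmbedder(embedWeights, embedParams);

// One set of weights shared by all users for inference
var inferParams = new ModelParams(@"models\Qwen2.5-14B-1M-Q5-K-M.gguf");
using var inferWeights = LLamaWeights.LoadFromFile(inferParams);
var executor = new StatelessExecutor(inferWeights, inferParams);

// 3-4 requests run in parallel; each one does roughly:
//   var embeddings = await embedder.GetEmbeddings(text);
//   await foreach (var piece in executor.InferAsync(prompt, inferenceParams)) { ... }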

1. One GPU
-- At first there were these CUDA errors:
CUDA error: operation failed due to a previous error during capture
CUDA error: operation not permitted when stream is capturing
ggml_cuda_compute_forward: ADD failed

-- The errors went away after I added locking around GetEmbeddings() and around context creation/destruction (CreateContext/Destroy) in InferAsync().

Why did I have to do this? Is it the right approach?

Questions:
What are the general limitations of multithreading in LLamaSharp? What needs to be considered here? Does anyone have experience building a multi-threaded web application on top of it?

2. Two GPUs

GPUSplitMode = GPUSplitMode.Layer;

Despite the fixes for one GPU, errors still occur on two GPUs:

2025-02-09 16:44:06.2064 LLama.Native.SafeLLamaContextHandle.llama_decode Error: CUDA error: operation failed due to a previous error during capture
SafeLLamaContextHandle.Decode => SafeLLamaContextHandle.llama_decode => SafeLLamaContextHandle.llama_decode
2025-02-09 16:44:06.2064 LLama.Native.NativeApi.llama_kv_cache_clear Error: CUDA error: operation not permitted when stream is capturing
SafeLLamaContextHandle.KvCacheClear => NativeApi.llama_kv_cache_clear => NativeApi.llama_kv_cache_clear
2025-02-09 16:44:06.2427 LLama.Native.SafeLLamaContextHandle.llama_decode Error: current device: 1, in function ggml_cuda_op_mul_mat at D:\a\LLamaSharp\LLamaSharp\ggml\src\ggml-cuda\ggml-cuda.cu:1615
SafeLLamaContextHandle.Decode => SafeLLamaContextHandle.llama_decode => SafeLLamaContextHandle.llama_decode
2025-02-09 16:44:06.2427 LLama.Native.NativeApi.llama_kv_cache_clear Error: current device: 1, in function ggml_backend_cuda_buffer_clear at D:\a\LLamaSharp\LLamaSharp\ggml\src\ggml-cuda\ggml-cuda.cu:605
SafeLLamaContextHandle.KvCacheClear => NativeApi.llama_kv_cache_clear => NativeApi.llama_kv_cache_clear
2025-02-09 16:44:06.2427 LLama.Native.NativeApi.llama_kv_cache_clear Error: cudaDeviceSynchronize()
SafeLLamaContextHandle.KvCacheClear => NativeApi.llama_kv_cache_clear => NativeApi.llama_kv_cache_clear
2025-02-09 16:44:06.2427 LLama.Native.SafeLLamaContextHandle.llama_decode Error: cudaGetLastError()
SafeLLamaContextHandle.Decode => SafeLLamaContextHandle.llama_decode => SafeLLamaContextHandle.llama_decode

2025-02-09 16:48:54.9660 LLama.Native.SafeLLamaContextHandle.llama_decode Error: ggml_cuda_compute_forward: ADD failed
SafeLLamaContextHandle.Decode => SafeLLamaContextHandle.llama_decode => SafeLLamaContextHandle.llama_decode
2025-02-09 16:48:54.9660 LLama.Native.NativeApi.llama_kv_cache_clear Error: CUDA error: operation not permitted when stream is capturing
SafeLLamaContextHandle.KvCacheClear => NativeApi.llama_kv_cache_clear => NativeApi.llama_kv_cache_clear
2025-02-09 16:48:54.9864 LLama.Native.SafeLLamaContextHandle.llama_decode Error: CUDA error: operation failed due to a previous error during capture
SafeLLamaContextHandle.Decode => SafeLLamaContextHandle.llama_decode => SafeLLamaContextHandle.llama_decode
2025-02-09 16:48:54.9864 LLama.Native.NativeApi.llama_kv_cache_clear Error: current device: 1, in function ggml_backend_cuda_buffer_clear at D:\a\LLamaSharp\LLamaSharp\ggml\src\ggml-cuda\ggml-cuda.cu:607
SafeLLamaContextHandle.KvCacheClear => NativeApi.llama_kv_cache_clear => NativeApi.llama_kv_cache_clear
2025-02-09 16:48:54.9864 LLama.Native.SafeLLamaContextHandle.llama_decode Error: current device: 1, in function ggml_cuda_compute_forward at D:\a\LLamaSharp\LLamaSharp\ggml\src\ggml-cuda\ggml-cuda.cu:2313
SafeLLamaContextHandle.Decode => SafeLLamaContextHandle.llama_decode => SafeLLamaContextHandle.llama_decode
2025-02-09 16:48:54.9864 LLama.Native.NativeApi.llama_kv_cache_clear Error: cudaDeviceSynchronize()
SafeLLamaContextHandle.KvCacheClear => NativeApi.llama_kv_cache_clear => NativeApi.llama_kv_cache_clear
2025-02-09 16:48:54.9864 LLama.Native.SafeLLamaContextHandle.llama_decode Error: err
SafeLLamaContextHandle.Decode => SafeLLamaContextHandle.llama_decode => SafeLLamaContextHandle.llama_decode

Questions:
What should I do? What should I pay attention to?

If each subsequent request is delayed by 2-3 seconds, everything works!

After many hours of experimentation, I think that creating and destroying a context (where VRAM is allocated) must be done in a thread-safe way (inside a lock); roughly what I mean is sketched below.

The same probably needs to be done in other places where the GPU is used.
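
A sketch of that guard (my own illustrative code, not LLamaSharp's; the API names are the usual LLamaWeights/LLamaContext ones):

using System;
using LLama;
using LLama.Common;

// Sketch: route every context create/dispose through one static lock,
// so VRAM allocation/free never overlaps with another thread's create/destroy.
public static class GuardedContext
{
    private static readonly object CreateDestroyLock = new();

    public static LLamaContext Create(LLamaWeights weights, ModelParams parameters)
    {
        lock (CreateDestroyLock)
        {
            return weights.CreateContext(parameters);
        }
    }

    public static void Destroy(LLamaContext context)
    {
        lock (CreateDestroyLock)
        {
            context.Dispose();
        }
    }
}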

Thanks.

Reproduction Steps

Multiple parallel requests

Environment & Configuration

  • Operating system: Windows Server 2019
  • .NET runtime version: 9.0.1
  • LLamaSharp version: 0.21.0
  • CUDA version (if you are using cuda backend): 12.8
  • CPU & GPU device: 2 x RTX 4090 24 GB

Known Workarounds

No response


martindevans commented Feb 9, 2025

The thread-safety story in LLamaSharp is basically just: whatever the thread safety of llama.cpp is. I'm not sure what the state of that is at the moment, so I'd recommend asking about this upstream. Many of the error messages you're getting come straight from llama.cpp, so they should be understandable to the upstream maintainers.

In the past I added a lock (see here) which ensures only one inference call is ever happening at a time (even across multiple contexts). The issues mentioned in that comment have since been resolved, which is interesting; potentially this lock can be removed!

It sounds like the issues you're getting are mostly around creation and destruction of contexts, so potentially we should add a similar locking system inside those methods.

Would you be interested in working on a PR in this area? It would probably just involve a static lock (like the current one) wrapped around the internals of Create and ReleaseHandle (and of course some testing to see if it actually resolves your issue!).
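
To sketch the shape of that change (simplified stand-in types only, not the actual LLamaSharp internals):

using System;
using System.Runtime.InteropServices;

// Simplified stand-in for SafeLLamaContextHandle, just to show where the lock would go.
public sealed class ContextHandleSketch : SafeHandle
{
    // One static lock shared by creation and release of every context handle.
    private static readonly object GlobalHandleLock = new();

    private ContextHandleSketch() : base(IntPtr.Zero, ownsHandle: true) { }

    public override bool IsInvalid => handle == IntPtr.Zero;

    // createNative stands in for the native context-creation call (llama_init_from_model).
    public static ContextHandleSketch Create(Func<IntPtr> createNative)
    {
        lock (GlobalHandleLock)
        {
            var result = new ContextHandleSketch();
            result.SetHandle(createNative());
            return result;
        }
    }

    protected override bool ReleaseHandle()
    {
        lock (GlobalHandleLock)
        {
            // The real implementation would call the native free function here.
            return true;
        }
    }
}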

martindevans added the bug label Feb 9, 2025

aropb commented Feb 9, 2025

Thanks for your reply. I'm not ready to make changes at the PR level yet.

martindevans added the good first issue and help wanted labels Feb 9, 2025

aropb commented Feb 9, 2025

#596
ggml-org/llama.cpp#3960

I looked at it, but I didn't understand much :)


aropb commented Feb 9, 2025

@martindevans

I noticed that the errors always start in one of these calls:
SafeLLamaContextHandle.llama_decode
or
SafeLLamaContextHandle.llama_init_from_model
or
NativeApi.llama_kv_cache_clear

Maybe they're interfering with each other, too.

At the same time, I see that decode() already does this:
lock (GlobalInferenceLock)

Apparently all operations on the handle need to be protected with a lock().

@martindevans
Member

Since #3960 is resolved I was hoping we could remove the GlobalInferenceLock entirely, but yeah it's possible it actually needs to be widened.


aropb commented Feb 9, 2025

@martindevans

Very good news: I managed to fix the bug for 2 GPUs. In addition to the changes above, I added a lock around the tokenizer (there is a decode call in there) and locked the decode calls in InferAsync.

This means that if you add a lock everywhere it's needed, everything works.

So far it looks very ugly, but it works! :)

InferAsync:

...

// Decode the prompt under the lock
using (var lockerResult = await ContextLocker.LockAsync(cancellationToken: token))
{
    result = await Context.DecodeAsync(tokens, LLamaSeqId.Zero, _batch, n_past);
}

...

// Sample the next token under the lock
using (var lockerResult = await ContextLocker.LockAsync(cancellationToken: token))
{
    id = InferenceParams.SamplingPipeline.Sample(...);
}

...

// Shift the KV cache under the lock
using (var lockerResult = await ContextLocker.LockAsync(cancellationToken: token))
{
    NativeApi.llama_kv_cache_seq_rm(Context.NativeHandle, LLamaSeqId.Zero, tokensKeep, tokensKeep + n_discard);
    NativeApi.llama_kv_cache_seq_add(Context.NativeHandle, LLamaSeqId.Zero, tokensKeep + n_discard, n_past, -n_discard);
}

...

// Decode the newly sampled token under the lock
using (var lockerResult = await ContextLocker.LockAsync(cancellationToken: token))
{
    returnCode = await Context.DecodeAsync(_batch, token);
}

if (returnCode != 0)
    throw new LLamaDecodeError(returnCode);

...

ContextLocker is based on SemaphoreSlim.
Please fix this at the LLamaSharp level. I think this is a problem for many people.
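
For reference, a minimal sketch of what such a SemaphoreSlim-based ContextLocker can look like (illustrative only; not LLamaSharp code):

using System;
using System.Threading;
using System.Threading.Tasks;

// Async lock: one SemaphoreSlim(1, 1), released by disposing the object returned from LockAsync().
public static class ContextLocker
{
    private static readonly SemaphoreSlim Semaphore = new(1, 1);

    public static async Task<IDisposable> LockAsync(CancellationToken cancellationToken = default)
    {
        await Semaphore.WaitAsync(cancellationToken).ConfigureAwait(false);
        return new Releaser();
    }

    private sealed class Releaser : IDisposable
    {
        private int _disposed;

        public void Dispose()
        {
            // Guard against a double-dispose releasing the semaphore twice.
            if (Interlocked.Exchange(ref _disposed, 1) == 0)
                Semaphore.Release();
        }
    }
}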


aropb commented Feb 10, 2025

But there is also the question of performance: it should not degrade.

There are no errors after adding the locks, but of course I see delays in responses while requests wait on the locks, and it is not clear whether this is correct. To answer that you need to understand very well what is going on inside llama.cpp.

Maybe the locking approach is not quite right and there is another way to properly solve the problem of CUDA errors under multiple requests.


aropb commented Feb 10, 2025

I checked again: as soon as I removed the locks inside InferAsync(), the error immediately reappeared:

2025-02-10 14:21:49.7078 LLama.Native.NativeApi.llama_kv_cache_clear Error: CUDA error: operation not permitted when stream is capturing
SafeLLamaContextHandle.KvCacheClear => NativeApi.llama_kv_cache_clear => NativeApi.llama_kv_cache_clear
2025-02-10 14:21:49.7078 LLama.Native.SafeLLamaContextHandle.llama_decode Error: ggml_cuda_compute_forward: RMS_NORM failed
SafeLLamaContextHandle.Decode => SafeLLamaContextHandle.llama_decode => SafeLLamaContextHandle.llama_decode
2025-02-10 14:21:49.7322 LLama.Native.NativeApi.llama_kv_cache_clear Error: current device: 1, in function ggml_backend_cuda_buffer_clear at D:\a\LLamaSharp\LLamaSharp\ggml\src\ggml-cuda\ggml-cuda.cu:605
SafeLLamaContextHandle.KvCacheClear => NativeApi.llama_kv_cache_clear => NativeApi.llama_kv_cache_clear
2025-02-10 14:21:49.7322 LLama.Native.SafeLLamaContextHandle.llama_decode Error: CUDA error: operation failed due to a previous error during capture
SafeLLamaContextHandle.Decode => SafeLLamaContextHandle.llama_decode => SafeLLamaContextHandle.llama_decode
2025-02-10 14:21:49.7322 LLama.Native.SafeLLamaContextHandle.llama_decode Error: current device: 1, in function ggml_cuda_compute_forward at D:\a\LLamaSharp\LLamaSharp\ggml\src\ggml-cuda\ggml-cuda.cu:2313
SafeLLamaContextHandle.Decode => SafeLLamaContextHandle.llama_decode => SafeLLamaContextHandle.llama_decode
2025-02-10 14:21:49.7322 LLama.Native.SafeLLamaContextHandle.llama_decode Error: err
SafeLLamaContextHandle.Decode => SafeLLamaContextHandle.llama_decode => SafeLLamaContextHandle.llama_decode
2025-02-10 14:21:49.7322 LLama.Native.NativeApi.llama_kv_cache_clear Error: cudaDeviceSynchronize()
SafeLLamaContextHandle.KvCacheClear => NativeApi.llama_kv_cache_clear => NativeApi.llama_kv_cache_clear


aropb commented Feb 10, 2025

Since #3960 is resolved I was hoping we could remove the GlobalInferenceLock entirely, but yeah it's possible it actually needs to be widened.

Or maybe it's related to how llama.cpp is built [it adds the build parameter LLAMA_SCHED_MAX_COPIES]?

ggml-org/llama.cpp#6017


aropb commented Feb 10, 2025

ggml-org/llama.cpp#3960 (comment)

But here we are talking about launching from different threads and on different GPUs!? If so, how can this be managed? The more I read, the more questions I have :)

@martindevans
Member

The more I read, the more questions I have

Exactly my feelings on llama.cpp thread safety! This is why I was a little vague about thread safety in my initial reply.


aropb commented Feb 10, 2025

Maybe we need to ask the llama.cpp team?
It only concerns CUDA with one or multiple GPUs.
But I can't even formulate the question correctly :)

They write that llama.cpp is thread-safe.


aropb commented Feb 11, 2025

ggml-org/llama.cpp#11804


aropb commented Feb 11, 2025

@martindevans
If I understand correctly, each thread needs its own ggml_backend instance.
Yes, this is about whisper.cpp, but it may still be relevant for llama.cpp.

Does that mean anything to you?
ggerganov/whisper.cpp#1986 (comment)

Can I create a new ggml_backend instance for each thread in LLamaSharp now, without reloading the models?

@martindevans
Member

Creating a ggml_backend isn't something that LLamaSharp does.

It's possible that's done in llama.cpp as part of creating a context or loading weights - in which case you can create a context-per-thread.
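
A minimal sketch of that pattern (assuming the usual LLamaWeights/LLamaContext API; the model path and prompts are placeholders):

using System.Threading.Tasks;
using LLama;
using LLama.Common;

// Weights are loaded once and shared; each worker creates (and disposes) its own context.
var parameters = new ModelParams(@"models\model.gguf");
using var weights = LLamaWeights.LoadFromFile(parameters);

var prompts = new[] { "first request", "second request", "third request" };

await Parallel.ForEachAsync(prompts, async (prompt, ct) =>
{
    // Separate context (and therefore separate KV cache) per thread/request.
    using var context = weights.CreateContext(parameters);
    // ... tokenize / decode / sample against this context here ...
    await Task.CompletedTask;
});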


aropb commented Feb 11, 2025

Yes, of course, each thread always gets its own context.

I'm testing GGML_CUDA_DISABLE_GRAPHS=1
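
In case it's useful, roughly how I set it (a sketch; the variable is read by ggml's CUDA backend, not by LLamaSharp):

using System;

// Must happen before anything touches the native llama.cpp/ggml libraries,
// e.g. as the very first line of Main(). Note: depending on the platform/CRT,
// a variable set this late may not be visible to native getenv(), so setting
// GGML_CUDA_DISABLE_GRAPHS=1 in the service/shell environment is the safer option.
Environment.SetEnvironmentVariable("GGML_CUDA_DISABLE_GRAPHS", "1");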


aropb commented Feb 11, 2025

@martindevans
ggml-org/llama.cpp#11804 (comment)


aropb commented Feb 12, 2025

@martindevans

Is there a way to build llama.cpp with CUDA 12 for Windows with this change included?
ggml-org/llama.cpp#11804 (comment)

@martindevans
Member

If you've got the CUDA toolchain installed locally, you can make that change and run the CMake build yourself.

If not, you could:

  1. Fork llama.cpp and make your change
  2. Fork LLamaSharp repo and modify the build script (https://github.com/SciSharp/LLamaSharp/blob/master/.github/workflows/compile.yml) to use your custom repo
  3. Trigger the action on GitHub; it will install the dependencies and run a complete build, producing exactly the binaries you need.


aropb commented Feb 12, 2025

Unfortunately, I won't be able to do it.

@martindevans
Member

The second one really isn't as complex as it perhaps sounds!

Modifying llama.cpp is as simple as changing that one line from cudaStreamCaptureModeRelaxed to cudaStreamCaptureModeThreadLocal and pushing it to your fork.

Modifying the build script just requires replacing all the lines that look like repository: ggerganov/llama.cpp with your fork, i.e. repository: aropb/llama.cpp.

Running the build script just requires going into https://github.com/aropb/LLamaSharp/actions/workflows/compile.yml and clicking Run Workflow (top right).

After the build is done (it takes about 2 hours) you can download the binaries and replace them in your copy of LLamaSharp.


aropb commented Feb 12, 2025

I'll try, but I'm not sure whether it will work. In the meantime, I found out about these flags.

https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__STREAM.html

The mode currently in use is cudaStreamCaptureModeRelaxed.

A thread's mode is one of the following:

cudaStreamCaptureModeGlobal: This is the default mode. If the local thread has an ongoing capture sequence that was not initiated with cudaStreamCaptureModeRelaxed at cuStreamBeginCapture, or if any other thread has a concurrent capture sequence initiated with cudaStreamCaptureModeGlobal, this thread is prohibited from potentially unsafe API calls.

cudaStreamCaptureModeThreadLocal: If the local thread has an ongoing capture sequence not initiated with cudaStreamCaptureModeRelaxed, it is prohibited from potentially unsafe API calls. Concurrent capture sequences in other threads are ignored.

cudaStreamCaptureModeRelaxed: The local thread is not prohibited from potentially unsafe API calls. Note that the thread is still prohibited from API calls which necessarily conflict with stream capture, for example, attempting cudaEventQuery on an event that was last recorded inside a capture sequence.


aropb commented Feb 12, 2025

I tried the cudaStreamCaptureModeThreadLocal and cudaStreamCaptureModeGlobal options, but the error did not disappear.


aropb commented Feb 13, 2025

@martindevans

A little more information about the error. I added a lot of logging :)
There are three threads in total, and each thread has its own context.

Here is what happens when the error occurs.

First thread:

When creating a context in StatelessExecutor.InferAsync()

2025-02-13 19:24:25.5394||LLama.LLamaWeights|ERROR|CUDA error: operation not permitted when stream is capturing
2025-02-13 19:24:25.5394||LLama.LLamaWeights|ERROR| current device: 0, in function ggml_backend_cuda_buffer_clear at D:\a\LLamaSharp\LLamaSharp\ggml\src\ggml-cuda\ggml-cuda.cu:605
2025-02-13 19:24:25.5394||LLama.LLamaWeights|ERROR| cudaDeviceSynchronize()

Second thread:

When creating a context in StatelessExecutor.InferAsync()

2025-02-13 19:24:25.5839||LLama.LLamaWeights|ERROR|CUDA error: operation not permitted when stream is capturing
2025-02-13 19:24:25.5839||LLama.LLamaWeights|ERROR| current device: 0, in function ggml_backend_cuda_buffer_clear at D:\a\LLamaSharp\LLamaSharp\ggml\src\ggml-cuda\ggml-cuda.cu:605
2025-02-13 19:24:25.5839||LLama.LLamaWeights|ERROR| cudaDeviceSynchronize()

Third thread:

When executing this code in StatelessExecutor.InferAsync()

                // Evaluate with this new token
                _batch.Clear();
                _batch.Add(id, n_past++, LLamaSeqId.Zero, true);

                var returnCode = await Context.DecodeAsync(_batch, token);

2025-02-13 19:24:25.4584||LLama.LLamaWeights|ERROR|CUDA error: operation failed due to a previous error during capture
2025-02-13 19:24:25.4765||LLama.LLamaWeights|ERROR| current device: 0, in function ggml_cuda_op_mul_mat at D:\a\LLamaSharp\LLamaSharp\ggml\src\ggml-cuda\ggml-cuda.cu:1511
2025-02-13 19:24:25.4765||LLama.LLamaWeights|ERROR| cudaGetLastError()

Do you have any ideas what else to check? There is clearly a problem here.
I really want to figure this out, because it will obviously affect many people.

I think context creation/destruction definitely needs to be protected with a lock().
In other places the lock can probably be removed.
