
Misc. bug: CUDA errors with multi-threaded use #11804

Open
aropb opened this issue Feb 11, 2025 · 20 comments


aropb commented Feb 11, 2025

I'm using LLamaSharp 0.21.0 (CUDA 12.8 backend) built against llama.cpp commit 5783575.

Model for Inference (one instance for all users): Qwen2.5-14B-1M-Q5-K-M
Model for Embedding (one instance for all users): Qwen2.5-1.5B-Q5-K-M

Together the models use 12 GB of VRAM.
FlashAttention = true!

Server RAM: 32-48 GB.

There is always enough memory (RAM and VRAM) for the requests, with a margin to spare.
Everything works fine for a single user.

When I run 3-4 web requests at the same time, the application crashes with fatal CUDA errors. The errors are almost always the same. One request uses about 3.3 GB of VRAM (context size: 16K, n_batch: 2048, n_ubatch: 512).
I see the errors both with one GPU (RTX 4090) and with two GPUs (2 x RTX 4090, layer split mode, tensors and VRAM split 50/50).

I've read the following issues and, as far as I can tell, this shouldn't be happening:
#3960
#6017

The errors disappear when I add locked sections at the LLamaSharp level around creating a context, deleting a context, decoding, embedding, clearing the KV cache, etc. (a rough sketch of this serialization follows below).
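
For illustration, a rough C++ analogue of that serialization, assuming a single global mutex around the non-thread-safe native calls; this is only a sketch of the workaround, not LLamaSharp's actual code:

```cpp
// Sketch: serialize the non-thread-safe native calls behind one global mutex.
#include <mutex>
#include "llama.h"

static std::mutex g_llama_mutex;

llama_context * create_context_locked(llama_model * model, llama_context_params params) {
    std::lock_guard<std::mutex> lock(g_llama_mutex);
    return llama_init_from_model(model, params);
}

int32_t decode_locked(llama_context * ctx, llama_batch batch) {
    std::lock_guard<std::mutex> lock(g_llama_mutex);
    return llama_decode(ctx, batch);
}

void free_context_locked(llama_context * ctx) {
    std::lock_guard<std::mutex> lock(g_llama_mutex);
    llama_free(ctx);
}
```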

The LLamaSharp team is aware of the problem, but their code is a thin wrapper that calls the native llama.cpp API.
LLamaSharp already does some (but not complete) resource locking for multithreaded use.

At the same time, wasn't #3960 supposed to solve all of these problems?
Can you tell me what to do and where the problem might be?
What is the correct way to use llama.cpp with a GPU in a multithreaded environment?
Are there any build recommendations for llama.cpp in multithreaded mode?

If I understand correctly, does each thread need its own ggml_backend instance?
Is it possible to create a ggml_backend instance without reloading the model?

Thanks.

CUDA Errors:

2025-02-09 16:44:06.2064 LLama.Native.SafeLLamaContextHandle.llama_decode Error: CUDA error: operation failed due to a previous error during capture
SafeLLamaContextHandle.Decode => SafeLLamaContextHandle.llama_decode => SafeLLamaContextHandle.llama_decode
2025-02-09 16:44:06.2064 LLama.Native.NativeApi.llama_kv_cache_clear Error: CUDA error: operation not permitted when stream is capturing
SafeLLamaContextHandle.KvCacheClear => NativeApi.llama_kv_cache_clear => NativeApi.llama_kv_cache_clear
2025-02-09 16:44:06.2427 LLama.Native.SafeLLamaContextHandle.llama_decode Error: current device: 1, in function ggml_cuda_op_mul_mat at D:\a\LLamaSharp\LLamaSharp\ggml\src\ggml-cuda\ggml-cuda.cu:1615
SafeLLamaContextHandle.Decode => SafeLLamaContextHandle.llama_decode => SafeLLamaContextHandle.llama_decode
2025-02-09 16:44:06.2427 LLama.Native.NativeApi.llama_kv_cache_clear Error: current device: 1, in function ggml_backend_cuda_buffer_clear at D:\a\LLamaSharp\LLamaSharp\ggml\src\ggml-cuda\ggml-cuda.cu:605
SafeLLamaContextHandle.KvCacheClear => NativeApi.llama_kv_cache_clear => NativeApi.llama_kv_cache_clear
2025-02-09 16:44:06.2427 LLama.Native.NativeApi.llama_kv_cache_clear Error: cudaDeviceSynchronize()
SafeLLamaContextHandle.KvCacheClear => NativeApi.llama_kv_cache_clear => NativeApi.llama_kv_cache_clear
2025-02-09 16:44:06.2427 LLama.Native.SafeLLamaContextHandle.llama_decode Error: cudaGetLastError()
SafeLLamaContextHandle.Decode => SafeLLamaContextHandle.llama_decode => SafeLLamaContextHandle.llama_decode

2025-02-09 16:48:54.9660 LLama.Native.SafeLLamaContextHandle.llama_decode Error: ggml_cuda_compute_forward: ADD failed
SafeLLamaContextHandle.Decode => SafeLLamaContextHandle.llama_decode => SafeLLamaContextHandle.llama_decode
2025-02-09 16:48:54.9660 LLama.Native.NativeApi.llama_kv_cache_clear Error: CUDA error: operation not permitted when stream is capturing
SafeLLamaContextHandle.KvCacheClear => NativeApi.llama_kv_cache_clear => NativeApi.llama_kv_cache_clear
2025-02-09 16:48:54.9864 LLama.Native.SafeLLamaContextHandle.llama_decode Error: CUDA error: operation failed due to a previous error during capture
SafeLLamaContextHandle.Decode => SafeLLamaContextHandle.llama_decode => SafeLLamaContextHandle.llama_decode
2025-02-09 16:48:54.9864 LLama.Native.NativeApi.llama_kv_cache_clear Error: current device: 1, in function ggml_backend_cuda_buffer_clear at D:\a\LLamaSharp\LLamaSharp\ggml\src\ggml-cuda\ggml-cuda.cu:607
SafeLLamaContextHandle.KvCacheClear => NativeApi.llama_kv_cache_clear => NativeApi.llama_kv_cache_clear
2025-02-09 16:48:54.9864 LLama.Native.SafeLLamaContextHandle.llama_decode Error: current device: 1, in function ggml_cuda_compute_forward at D:\a\LLamaSharp\LLamaSharp\ggml\src\ggml-cuda\ggml-cuda.cu:2313
SafeLLamaContextHandle.Decode => SafeLLamaContextHandle.llama_decode => SafeLLamaContextHandle.llama_decode
2025-02-09 16:48:54.9864 LLama.Native.NativeApi.llama_kv_cache_clear Error: cudaDeviceSynchronize()
SafeLLamaContextHandle.KvCacheClear => NativeApi.llama_kv_cache_clear => NativeApi.llama_kv_cache_clear
2025-02-09 16:48:54.9864 LLama.Native.SafeLLamaContextHandle.llama_decode Error: err
SafeLLamaContextHandle.Decode => SafeLLamaContextHandle.llama_decode => SafeLLamaContextHandle.llama_decode

SciSharp/LLamaSharp#1091

Operating systems

Windows

Which llama.cpp modules do you know to be affected?

libllama (core library)

Problem description & steps to reproduce

  1. CUDA 12, 1-2 GPUs
  2. Multiple concurrent user requests
JohannesGaessler (Collaborator) commented:

Do you still get errors or incorrect results when you set the environment variable GGML_CUDA_DISABLE_GRAPHS=1?


aropb commented Feb 11, 2025

I'm getting the errors; I'm using the llama.cpp backend build from the LLamaSharp repository.


aropb commented Feb 11, 2025

GGML_CUDA_DISABLE_GRAPHS=1

Could you explain this option? Should it be set when building llama.cpp?

JohannesGaessler (Collaborator) commented:

By default, CUDA graphs are used. At runtime, ggml-cuda.cu checks whether the environment variable GGML_CUDA_DISABLE_GRAPHS is set (to anything); if it is, CUDA graphs are not used. The environment variable has no influence at compile time.
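
A minimal sketch of ensuring the variable is present in the process environment before the first llama.cpp / ggml-cuda call; since it is read at runtime, no rebuild is needed, and it can equally be set in the shell or service configuration:

```cpp
// Sketch: set GGML_CUDA_DISABLE_GRAPHS before any llama.cpp / CUDA work starts.
#include <cstdlib>

int main() {
#ifdef _WIN32
    _putenv_s("GGML_CUDA_DISABLE_GRAPHS", "1");
#else
    setenv("GGML_CUDA_DISABLE_GRAPHS", "1", 1 /* overwrite */);
#endif
    // ... initialize the backend, load the model, and run inference as usual ...
    return 0;
}
```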


aropb commented Feb 11, 2025

I'm testing with this flag now.


aropb commented Feb 11, 2025

checks for whether the environment variable GGML_CUDA_DISABLE_GRAPHS

I have set GGML_CUDA_DISABLE_GRAPHS=1.

Indeed, it works with this flag. But why is this the only correct solution in this case? As far as I understand, there may be a drop in performance.

As soon as I remove the flag, I get CUDA errors again (one GPU):

2025-02-11 17:15:15.2025 LLama.Native.SafeLLamaContextHandle.llama_decode Error: ggml_cuda_compute_forward: ADD failed
SafeLLamaContextHandle.Decode => SafeLLamaContextHandle.llama_decode => SafeLLamaContextHandle.llama_decode
2025-02-11 17:15:15.2025 LLama.Native.SafeLLamaContextHandle.llama_init_from_model Error: CUDA error: operation not permitted when stream is capturing
SafeLLamaContextHandle.Create => SafeLLamaContextHandle.llama_init_from_model => SafeLLamaContextHandle.llama_init_from_model
2025-02-11 17:15:15.2345 LLama.Native.SafeLLamaContextHandle.llama_decode Error: CUDA error: operation failed due to a previous error during capture
SafeLLamaContextHandle.Decode => SafeLLamaContextHandle.llama_decode => SafeLLamaContextHandle.llama_decode
2025-02-11 17:15:15.2345 LLama.Native.SafeLLamaContextHandle.llama_init_from_model Error: current device: 0, in function ggml_backend_cuda_buffer_clear at D:\a\LLamaSharp\LLamaSharp\ggml\src\ggml-cuda\ggml-cuda.cu:605
SafeLLamaContextHandle.Create => SafeLLamaContextHandle.llama_init_from_model => SafeLLamaContextHandle.llama_init_from_model
2025-02-11 17:15:15.2345 LLama.Native.SafeLLamaContextHandle.llama_decode Error: current device: 0, in function ggml_cuda_compute_forward at D:\a\LLamaSharp\LLamaSharp\ggml\src\ggml-cuda\ggml-cuda.cu:2313
SafeLLamaContextHandle.Decode => SafeLLamaContextHandle.llama_decode => SafeLLamaContextHandle.llama_decode
2025-02-11 17:15:15.2345 LLama.Native.SafeLLamaContextHandle.llama_decode Error: err
SafeLLamaContextHandle.Decode => SafeLLamaContextHandle.llama_decode => SafeLLamaContextHandle.llama_decode
2025-02-11 17:15:15.2345 LLama.Native.SafeLLamaContextHandle.llama_init_from_model Error: cudaDeviceSynchronize()
SafeLLamaContextHandle.Create => SafeLLamaContextHandle.llama_init_from_model => SafeLLamaContextHandle.llama_init_from_model

2025-02-11 17:16:24.6475 LLama.Native.SafeLLamaContextHandle.llama_decode Error: CUDA error: operation failed due to a previous error during capture
SafeLLamaContextHandle.Decode => SafeLLamaContextHandle.llama_decode => SafeLLamaContextHandle.llama_decode
2025-02-11 17:16:24.6622 LLama.Native.SafeLLamaContextHandle.llama_decode Error: current device: 0, in function launch_fattn at D:\a\LLamaSharp\LLamaSharp\ggml\src\ggml-cuda\fattn-common.cuh:694
SafeLLamaContextHandle.Decode => SafeLLamaContextHandle.llama_decode => SafeLLamaContextHandle.llama_decode
2025-02-11 17:16:24.6622 LLama.Native.SafeLLamaContextHandle.llama_decode Error: cudaGetLastError()
SafeLLamaContextHandle.Decode => SafeLLamaContextHandle.llama_decode => SafeLLamaContextHandle.llama_decode
2025-02-11 17:16:24.7242 LLama.Native.SafeLLamaContextHandle.llama_init_from_model Error: CUDA error: operation not permitted when stream is capturing
SafeLLamaContextHandle.Create => SafeLLamaContextHandle.llama_init_from_model => SafeLLamaContextHandle.llama_init_from_model
2025-02-11 17:16:24.7253 LLama.Native.SafeLLamaContextHandle.llama_init_from_model Error: current device: 0, in function ggml_backend_cuda_buffer_clear at D:\a\LLamaSharp\LLamaSharp\ggml\src\ggml-cuda\ggml-cuda.cu:605
SafeLLamaContextHandle.Create => SafeLLamaContextHandle.llama_init_from_model => SafeLLamaContextHandle.llama_init_from_model
2025-02-11 17:16:24.7253 LLama.Native.SafeLLamaContextHandle.llama_init_from_model Error: cudaDeviceSynchronize()
SafeLLamaContextHandle.Create => SafeLLamaContextHandle.llama_init_from_model => SafeLLamaContextHandle.llama_init_from_model
2025-02-11 17:16:25.6419 LLama.Native.SafeLLamaContextHandle.llama_init_from_model Error: CUDA error: operation not permitted when stream is capturing
SafeLLamaContextHandle.Create => SafeLLamaContextHandle.llama_init_from_model => SafeLLamaContextHandle.llama_init_from_model
2025-02-11 17:16:25.6419 LLama.Native.SafeLLamaContextHandle.llama_init_from_model Error: current device: 0, in function ggml_backend_cuda_buffer_clear at D:\a\LLamaSharp\LLamaSharp\ggml\src\ggml-cuda\ggml-cuda.cu:605
SafeLLamaContextHandle.Create => SafeLLamaContextHandle.llama_init_from_model => SafeLLamaContextHandle.llama_init_from_model
2025-02-11 17:16:25.6419 LLama.Native.SafeLLamaContextHandle.llama_init_from_model Error: cudaDeviceSynchronize()
SafeLLamaContextHandle.Create => SafeLLamaContextHandle.llama_init_from_model => SafeLLamaContextHandle.llama_init_from_model


aropb commented Feb 11, 2025

One more question: will the following call be thread-safe, given that there is only one model instance shared by all users?

llama_tokenize(llama_model_get_vocab(this), ...)

JohannesGaessler (Collaborator) commented:

Indeed, it works with this flag. But why is this the only correct solution in this case? As far as I understand, there may be a drop in performance.

I don't know, @agray3 is probably a better person to answer this.

One more question, will such a call be thread-safe, given that there is only one instance of the model for everyone?

I don't know.


aropb commented Feb 12, 2025

@slaren
Maybe you can help me?
Thanks.

slaren (Member) commented Feb 12, 2025

llama_context objects are not thread-safe; you will need a different one for each thread. I don't know if you are already doing that, since the logs from the C# bindings are not useful to me. If you are already using a different llama_context object per thread and are convinced that it is a bug in llama.cpp, please write a minimal repro in C/C++.
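
For reference, a minimal sketch of the per-thread-context pattern described above: one shared llama_model, one llama_context per thread. Function names follow the llama.cpp C API as of early 2025 (llama_model_load_from_file, llama_init_from_model, llama_model_free) and may differ in other versions; error handling and the actual decode loop are omitted.

```cpp
#include <thread>
#include <vector>
#include "llama.h"

int main() {
    llama_backend_init();

    llama_model_params mparams = llama_model_default_params();
    llama_model * model = llama_model_load_from_file("model.gguf", mparams);
    if (model == nullptr) {
        return 1;
    }

    std::vector<std::thread> workers;
    for (int i = 0; i < 4; ++i) {
        workers.emplace_back([model]() {
            // Each thread owns its own llama_context; contexts are never shared.
            llama_context_params cparams = llama_context_default_params();
            llama_context * ctx = llama_init_from_model(model, cparams);

            // ... tokenize the prompt and run llama_decode(ctx, batch) here ...

            llama_free(ctx);
        });
    }
    for (auto & t : workers) {
        t.join();
    }

    llama_model_free(model);
    llama_backend_free();
    return 0;
}
```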


aropb commented Feb 12, 2025

llama_context objects are not thread safe, you will need a different one for each thread

Yes.

The question is why the flag GGML_CUDA_DISABLE_GRAPHS=1 solves the problem.
Underneath every C# call is the llama.cpp function of the same name.

llama_tokenize(llama_model_get_vocab(this), ...)
Is such a call safe across all threads?

slaren (Member) commented Feb 12, 2025

llama_tokenize and llama_model_get_vocab are thread-safe.
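
A minimal sketch of that usage: several threads tokenizing concurrently against the single shared model's vocabulary. The llama_tokenize signature follows the llama.cpp C API of early 2025 and should be checked against the version actually in use.

```cpp
#include <string>
#include <thread>
#include <vector>
#include "llama.h"

void tokenize_concurrently(const llama_model * model, const std::string & text) {
    // The vocab is owned by the shared model; reading it is thread-safe.
    const llama_vocab * vocab = llama_model_get_vocab(model);

    std::vector<std::thread> workers;
    for (int i = 0; i < 4; ++i) {
        workers.emplace_back([vocab, &text]() {
            std::vector<llama_token> tokens(text.size() + 8);
            int32_t n = llama_tokenize(vocab, text.c_str(), (int32_t) text.size(),
                                       tokens.data(), (int32_t) tokens.size(),
                                       /*add_special=*/true, /*parse_special=*/false);
            tokens.resize(n > 0 ? n : 0);
        });
    }
    for (auto & t : workers) {
        t.join();
    }
}
```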


aropb commented Feb 12, 2025

llama_tokenize and llama_model_get_vocab are thread-safe.

Could you please explain why the flag GGML_CUDA_DISABLE_GRAPHS=1 solves the problem?

slaren (Member) commented Feb 12, 2025

If I knew, I would have already told you. What's the point of insisting?

agray3 (Contributor) commented Feb 12, 2025 via email


aropb commented Feb 12, 2025

If I knew, I would have already told you. What's the point of insisting?

Please excuse me.


aropb commented Feb 12, 2025

From the CUDA documentation, a thread's stream-capture mode is one of the following (see the sketch after the list):

cudaStreamCaptureModeGlobal: This is the default mode. If the local thread has an ongoing capture sequence that was not initiated with cudaStreamCaptureModeRelaxed at cuStreamBeginCapture, or if any other thread has a concurrent capture sequence initiated with cudaStreamCaptureModeGlobal, this thread is prohibited from potentially unsafe API calls.

cudaStreamCaptureModeThreadLocal: If the local thread has an ongoing capture sequence not initiated with cudaStreamCaptureModeRelaxed, it is prohibited from potentially unsafe API calls. Concurrent capture sequences in other threads are ignored.

cudaStreamCaptureModeRelaxed: The local thread is not prohibited from potentially unsafe API calls. Note that the thread is still prohibited from API calls which necessarily conflict with stream capture, for example, attempting cudaEventQuery on an event that was last recorded inside a capture sequence.
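
To illustrate where these modes enter the API: a capture mode is passed to cudaStreamBeginCapture() when a graph is recorded. This is a standalone sketch, not the actual ggml-cuda code.

```cpp
#include <cuda_runtime.h>

void capture_on(cudaStream_t stream) {
    cudaGraph_t graph = nullptr;

    // With cudaStreamCaptureModeThreadLocal, only this thread is restricted
    // from potentially unsafe API calls while the capture is in progress;
    // concurrent captures on other threads are ignored.
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeThreadLocal);

    // ... enqueue kernels / async copies on `stream` here ...

    cudaStreamEndCapture(stream, &graph);

    // The captured graph would normally be instantiated and launched; it is
    // simply destroyed here to keep the sketch short.
    cudaGraphDestroy(graph);
}
```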


aropb commented Feb 12, 2025

cudaStreamCaptureModeThreadLocal

I tried the options with cudaStreamCaptureModeThreadLocal and cudaStreamCaptureModeGlobal, but the error did not disappear.


aropb commented Feb 13, 2025

A few more details about the errors. This is how they are split across threads.

This thread calls the decoder:

2025-02-13 19:24:25.4454|LLamaStatelessExecutor InferAsync Evaluate New Token

2025-02-13 19:24:25.4584 LLama.Native.SafeLLamaContextHandle.llama_decode Error: CUDA error: operation failed due to a previous error during capture
SafeLLamaContextHandle.Decode => SafeLLamaContextHandle.llama_decode => SafeLLamaContextHandle.llama_decode
2025-02-13 19:24:25.4765 LLama.Native.SafeLLamaContextHandle.llama_decode Error: current device: 0, in function ggml_cuda_op_mul_mat at D:\a\LLamaSharp\LLamaSharp\ggml\src\ggml-cuda\ggml-cuda.cu:1511
SafeLLamaContextHandle.Decode => SafeLLamaContextHandle.llama_decode => SafeLLamaContextHandle.llama_decode
2025-02-13 19:24:25.4765 LLama.Native.SafeLLamaContextHandle.llama_decode Error: cudaGetLastError()
SafeLLamaContextHandle.Decode => SafeLLamaContextHandle.llama_decode => SafeLLamaContextHandle.llama_decode

This thread creates a context:

2025-02-13 19:24:25.5281||LLamaStatelessExecutor InferAsync CreateContext

2025-02-13 19:24:25.5394 LLama.Native.SafeLLamaContextHandle.llama_init_from_model Error: CUDA error: operation not permitted when stream is capturing
SafeLLamaContextHandle.Create => SafeLLamaContextHandle.llama_init_from_model => SafeLLamaContextHandle.llama_init_from_model
2025-02-13 19:24:25.5394 LLama.Native.SafeLLamaContextHandle.llama_init_from_model Error: current device: 0, in function ggml_backend_cuda_buffer_clear at D:\a\LLamaSharp\LLamaSharp\ggml\src\ggml-cuda\ggml-cuda.cu:605
SafeLLamaContextHandle.Create => SafeLLamaContextHandle.llama_init_from_model => SafeLLamaContextHandle.llama_init_from_model
2025-02-13 19:24:25.5394 LLama.Native.SafeLLamaContextHandle.llama_init_from_model Error: cudaDeviceSynchronize()

This thread creates a context:

2025-02-13 19:24:25.5552|LLamaStatelessExecutor InferAsync CreateContext

SafeLLamaContextHandle.Create => SafeLLamaContextHandle.llama_init_from_model => SafeLLamaContextHandle.llama_init_from_model
2025-02-13 19:24:25.5839 LLama.Native.SafeLLamaContextHandle.llama_init_from_model Error: CUDA error: operation not permitted when stream is capturing
SafeLLamaContextHandle.Create => SafeLLamaContextHandle.llama_init_from_model => SafeLLamaContextHandle.llama_init_from_model
2025-02-13 19:24:25.5839 LLama.Native.SafeLLamaContextHandle.llama_init_from_model Error: current device: 0, in function ggml_backend_cuda_buffer_clear at D:\a\LLamaSharp\LLamaSharp\ggml\src\ggml-cuda\ggml-cuda.cu:605
SafeLLamaContextHandle.Create => SafeLLamaContextHandle.llama_init_from_model => SafeLLamaContextHandle.llama_init_from_model
2025-02-13 19:24:25.5839 LLama.Native.SafeLLamaContextHandle.llama_init_from_model Error: cudaDeviceSynchronize()
SafeLLamaContextHandle.Create => SafeLLamaContextHandle.llama_init_from_model => SafeLLamaContextHandle.llama_init_from_model

Do I understand correctly that creating a context is not thread-safe?


aropb commented Feb 17, 2025

@agray3
Please, can you help me figure out this problem?
Thanks.
