Intel CPU and Graphics card Macbook pro: failed to create context with model './models/model.q4_k_s.gguf' #3129
Comments
Same here. The only workaround for me is to not use Metal:
|
Same. My solution so far is to use -ngl 0
|
Seeing the same thing, my env:
|
Same here, I think it's a bug with MacBooks with an AMD Radeon GPU |
@mounta11n can you show an example of what you mean by -ngl 0? |
For example, if you set ngl to zero, you are saying that no layers should be offloaded to the GPU. So -ngl 0 means that you don't utilize the GPU at all. And yes, I think it's an issue with Macs and AMD (not only MacBooks, since I have an iMac 5K 2017). |
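A minimal sketch of the difference (the model path and -t 4 are taken from this issue's reproduce command; the layer count is just an example):

# No offloading: all layers stay on the CPU, which is the workaround reported in this thread
./main -m ./models/model.q4_k_s.gguf -t 4 -ngl 0 -p "Hello"

# Offload 22 layers to the GPU – the kind of run several people here report failing on Intel/AMD Macs
./main -m ./models/model.q4_k_s.gguf -t 4 -ngl 22 -p "Hello"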
I found that even when forcing ggml-metal.m to use integrated graphics, the issue persisted. |
I'm having a similar issue using an Intel Mac Mini and an AMD Radeon RX Vega eGPU. It persists across several Llama models. I'm about to try the same process on an M2.
|
@RobinWinters @nchudleigh @ro8inmorgan @ssainz @pkrmf @Bateoriginal if you are still interested, I have found an acceptable workaround that will allow you to utilize your GPU and offload layers to it.
Some of you should certainly benefit from layer offloading. In my case offloading layers doesn't really give me any benefit, since my GPU (Radeon Pro 575) is about as fast as my CPU (FYI: I have tried offloading everything between 1 and 22 layers). The other aspect is the 3 GB of extra VRAM, but that isn't relevant for me either, since I have enough CPU RAM. The loading time, however, is about 20x faster now thanks to CLBlast:
Without CLBlast (-t 3): about 17 seconds until the first token.
With CLBlast (-t 3 -ngl 0): under 1 second until the first token, and even a little faster with mlock (-t 3 -ngl 0 --mlock): about 860 ms until the first token. |
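A minimal sketch of the CLBlast-instead-of-Metal build described above, assuming the CMake flags quoted later in this thread (-DLLAMA_CLBLAST=on -DLLAMA_METAL=off); exact option names and binary paths may differ between llama.cpp versions:

# Build llama.cpp with CLBlast and without Metal (CLBlast itself must be installed first, e.g. via Homebrew)
cmake -B build -DLLAMA_CLBLAST=on -DLLAMA_METAL=off
cmake --build build --config Release

# Run entirely on the CPU (-ngl 0), with CLBlast speeding up model loading and prompt processing
./build/bin/main -t 3 -ngl 0 --mlock -m ./models/model.q4_k_s.gguf -p "Hello"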
Couldn't get it all to work, but I've been using llama_cpp python. 2.3 GHz 8-Core Intel i9. (Rewrote this comment as I made a boo-boo.)
Had to update the Xcode command line tools. It fails to install the wheel packages... error:
Not sure where to begin to resolve it all... EDIT:
Yeah, this worked from ./main, but not from the llama_cpp python bindings, which keep giving errors related to Metal. Maybe I'll submit an issue after more testing... i.e. GPU working from ./main but not from llama_cpp |
I could not get llama_cpp python to work so far either, but I was able to build and use llama.cpp without Metal and with CLBlast by simply following @mounta11n's instructions. An even simpler solution: I used -ngl 0 with a llama_cpp build that had Metal enabled, and it worked fine (no need to rebuild unless you want GPU acceleration).
For me the GPU actually slows down everything except model loading and user-input token eval, but I am still experimenting with various offload values :-) (10 to 20 so far).
Hope this helps; looking forward to seeing how you get the Python version working. |
@vainceha may I ask what hardware you use, and give you some general advice? I assume that you have an 8-core CPU, right? If so, it is highly recommended to set -t to at most 7; even -t 4 is often much faster. |
That would be great, here is some data and more info. The testing I was doing over the weekend was on this system: Quad-Core Intel Core i7 / 2.6 GHz with 16 GB RAM.
I will be doing future testing on a relatively newer machine with the specs below: 8-Core Intel i9 / 2.3 GHz with 16 GB RAM. |
Unfortunately I can only give you personal recommendations based on my own trial-and-error experience. The llama.cpp documentation itself is not easy to keep track of; I guess that's the reason why there is not much else to find on the internet at the moment. At least I don't know of any other good references right now. This is not meant as criticism of the llama.cpp team, because one also has to remember that this is absolutely bleeding-edge technology that is developing incredibly fast. If I were as skilled a developer as the people behind llama.cpp and understood everything as soon as I saw the code, my time would probably be too precious to write simple manuals and documentation as well ^^'. Okay, enough monological small talk, sorry.

Both of these seem to be MacBooks. You can't upgrade the RAM, unfortunately, too bad. With the quad-core i7 you should not use more than 3 threads, so -t 3. With that you should get the fastest results in most cases. That's because you always need a "reserve" core, which orchestrates the rest and is left for the system's own work.

About top-k: with this value you specify, for each word that should be generated next (strictly speaking a token, but let's say word), how big the "pot" of words should be from which the next word is randomly selected. Concretely, --top-k 1000 means that each time, after each word, the next one is picked out of 1000 candidate words. But with LLMs it is similar to us humans and our brains: when we speak, we almost always have a near-100% idea of what the next word should be. Sometimes we still think very briefly about whether we want wording A or wording B. For example, if I want to say "Because of this event I am quite... 1. disappointed... 2. sad... 3. heartsick", then I am already relatively undecided. But I will never be so indecisive that I have to look at 1000 words before I can decide. That's why, in my opinion, it's quite sufficient to use at most --top-k 3.

Then it's a matter of how "wildly" to decide between those words. If I am someone who prefers a conservative way of thinking and speaking, I will almost certainly choose the most common word, in my case "disappointed", and rarely or never venture something exotic like "heartsick" in that sentence. This corresponds roughly to a setting of --temp 0.2. My personal approach is actually always to use --top-k 1, because that shows me the true core of a particular language model and leaves nothing to chance.

Yes, it is definitely worth trying the new quants. Quantization is something like compression: Q4 means that the parameters of the model have been "compressed" to 4-bit. In Q4_K_M, most layers of the model are in 4-bit, but some layers with certain key functions are quantized to 6-bit, giving better and smarter results than their q4_0 siblings.

Your i9 machine is a great device! You probably won't need GPU layer offloading there either. However, make sure to always leave at least one core free on that machine as well, so use at most -t 7. |
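A sketch of invocations using the settings recommended above; the model path, context size, and prompt are simply the ones from this issue's reproduce command, not anything specific to these machines:

# Quad-core i7: 3 threads, conservative sampling
./main -t 3 --top-k 3 --temp 0.2 -m ./models/model.q4_k_s.gguf -c 4096 -n -1 -p "### Instruction: Write a story\n### Response:"

# 8-core i9: leave one core free, so at most 7 threads
./main -t 7 --top-k 1 -m ./models/model.q4_k_s.gguf -c 4096 -n -1 -p "### Instruction: Write a story\n### Response:"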
Try building with the latest master |
Tested today with latest master 95bd60a on an Intel MacBook with AMD: it doesn't crash now, but the performance with |
Try CMAKE_ARGS="-DLLAMA_CLBLAST=on -DLLAMA_METAL=off" pip install llama-cpp-python --no-cache-dir --force-reinstall |
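If that install goes through, a quick (hypothetical) check that the Python bindings can now create a context without Metal, using the model path from this issue and n_gpu_layers=0 to keep everything on the CPU:

python -c "from llama_cpp import Llama; Llama(model_path='./models/model.q4_k_s.gguf', n_gpu_layers=0)"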
This issue was closed because it has been inactive for 14 days since being marked as stale. |
Issue: Error when loading model on MacBook Pro with Intel Core i7 and Intel Iris Plus
System Information:
Steps to Reproduce:
wget https://huggingface.co/substratusai/Llama-2-13B-chat-GGUF/resolve/main/model.bin -O model.q4_k_s.gguf
./main -t 4 -m ./models/model.q4_k_s.gguf --color -c 4096 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "### Instruction: Write a story\n### Response:"
Error Message:
I would appreciate any guidance or advice on how to resolve this issue. Thank you!