Intel CPU and Graphics card Macbook pro: failed to create context with model './models/model.q4_k_s.gguf' #3129
Comments
Same here. The only workaround for me is to not use Metal:
|
Same. My solution so far is to use -ngl 0
|
Seeing the same thing, my env:
|
Same here, I think it's a bug with MacBooks with an AMD Radeon GPU |
@mounta11n can you show an example of what you mean by -ngl 0? |
For example, if you set ngl to zero, you are saying that no layers should be offloaded to the GPU. So -ngl 0 means that you don't utilize the GPU at all. And yes, I think it's an issue with Macs and AMD (not only MacBooks, since I have an iMac 5K 2017). |
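A minimal sketch of the difference (the model path and -t 4 are taken from this issue's reproduce command; the layer count is just an example):

# No offloading: all layers stay on the CPU, which is the workaround reported in this thread
./main -m ./models/model.q4_k_s.gguf -t 4 -ngl 0 -p "Hello"

# Offload 22 layers to the GPU – the kind of run several people here report failing on Intel/AMD Macs
./main -m ./models/model.q4_k_s.gguf -t 4 -ngl 22 -p "Hello"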
I found that even when forcing ggml-metal.m to use integrated graphics, the issue persisted. |
I'm having a similar issue using an Intel Mac Mini and an AMD Radeon RX Vega eGPU. It persists across several Llama models. I'm about to try the same process on an M2.
|
@RobinWinters @nchudleigh @ro8inmorgan @ssainz @pkrmf @Bateoriginal if you are still interested, I have found an acceptable workaround that will allow you to utilize your GPU and offload layers to it.
Some of you should certainly benefit from layer offloading. In my case offloading layers doesn't really give me any benefit, since my GPU (Radeon Pro 575) is about as fast as my CPU (FYI: I have tried offloading everything between 1 and 22 layers). The other aspect is the 3 GB of extra VRAM, but that isn't relevant for me either, since I have enough CPU RAM. The loading time, however, is about 20x faster now thanks to CLBlast:
Without CLBlast (-t 3): about 17 seconds until the first token.
With CLBlast (-t 3 -ngl 0): under 1 second until the first token, and even a little faster with mlock (-t 3 -ngl 0 --mlock): about 860 ms until the first token. |
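A minimal sketch of the CLBlast-instead-of-Metal build described above, assuming the CMake flags quoted later in this thread (-DLLAMA_CLBLAST=on -DLLAMA_METAL=off); exact option names and binary paths may differ between llama.cpp versions:

# Build llama.cpp with CLBlast and without Metal (CLBlast itself must be installed first, e.g. via Homebrew)
cmake -B build -DLLAMA_CLBLAST=on -DLLAMA_METAL=off
cmake --build build --config Release

# Run entirely on the CPU (-ngl 0), with CLBlast speeding up model loading and prompt processing
./build/bin/main -t 3 -ngl 0 --mlock -m ./models/model.q4_k_s.gguf -p "Hello"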
Couldn't get it all to work, but I've been using llama_cpp python. 2.3 GHz 8-Core Intel i9. (Rewrote this comment as I made a boo-boo.)
Had to update the Xcode command line tools. It fails to install the wheel packages... error:
Not sure where to begin to resolve it all... EDIT:
Yeah, this worked from ./main, but not from the llama_cpp python bindings, which keep giving errors related to Metal. Maybe I'll submit an issue after more testing... i.e. GPU working from ./main but not from llama_cpp |
I could not get llama_cpp python to work so far either, but I was able to build and use llama.cpp without Metal and with CLBlast by simply following @mounta11n's instructions. An even simpler solution: I used -ngl 0 with a llama_cpp build that had Metal enabled, and it worked fine (no need to rebuild unless you want GPU acceleration).
For me the GPU actually slows down everything except model loading and user-input token eval, but I am still experimenting with various offload values :-) (10 to 20 so far).
Hope this helps; looking forward to seeing how you get the Python version working. |
@vainceha may I ask what hardware you use, and give you some general advice? I assume that you have an 8-core CPU, right? If so, it is highly recommended to set -t to at most 7; even -t 4 is often much faster. |
That would be great, here is some data and more info. The testing I was doing over the weekend was on this system: Quad-Core Intel Core i7 / 2.6 GHz with 16 GB RAM.
I will be doing future testing on a relatively newer machine with the specs below: 8-Core Intel i9 / 2.3 GHz with 16 GB RAM. |
Unfortunately I can only give you personal recommendations based on my own trial-and-error experience. The llama.cpp documentation itself is not easy to keep track of; I guess that's the reason why there is not much else to find on the internet at the moment. At least I don't know of any other good references right now. This is not meant as criticism of the llama.cpp team, because one also has to remember that this is absolutely bleeding-edge technology that is developing incredibly fast. If I were as skilled a developer as the people behind llama.cpp and understood everything as soon as I saw the code, my time would probably be too precious to write simple manuals and documentation as well ^^'. Okay, enough monological small talk, sorry.

Both of these seem to be MacBooks. You can't upgrade the RAM, unfortunately, too bad. With the quad-core i7 you should not use more than 3 threads, so -t 3. With that you should get the fastest results in most cases. That's because you always need a "reserve" core, which orchestrates the rest and is left for the system's own work.

About top-k: with this value you specify, for each word that should be generated next (strictly speaking a token, but let's say word), how big the "pot" of words should be from which the next word is randomly selected. Concretely, --top-k 1000 means that each time, after each word, the next one is picked out of 1000 candidate words. But with LLMs it is similar to us humans and our brains: when we speak, we almost always have a near-100% idea of what the next word should be. Sometimes we still think very briefly about whether we want wording A or wording B. For example, if I want to say "Because of this event I am quite... 1. disappointed... 2. sad... 3. heartsick", then I am already relatively undecided. But I will never be so indecisive that I have to look at 1000 words before I can decide. That's why, in my opinion, it's quite sufficient to use at most --top-k 3.

Then it's a matter of how "wildly" to decide between those words. If I am someone who prefers a conservative way of thinking and speaking, I will almost certainly choose the most common word, in my case "disappointed", and rarely or never venture something exotic like "heartsick" in that sentence. This corresponds roughly to a setting of --temp 0.2. My personal approach is actually always to use --top-k 1, because that shows me the true core of a particular language model and leaves nothing to chance.

Yes, it is definitely worth trying the new quants. Quantization is something like compression: Q4 means that the parameters of the model have been "compressed" to 4-bit. In Q4_K_M, most layers of the model are in 4-bit, but some layers with certain key functions are quantized to 6-bit, giving better and smarter results than their q4_0 siblings.

Your i9 machine is a great device! You probably won't need GPU layer offloading there either. However, make sure to always leave at least one core free on that machine as well, so use at most -t 7. |
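A sketch of invocations using the settings recommended above; the model path, context size, and prompt are simply the ones from this issue's reproduce command, not anything specific to these machines:

# Quad-core i7: 3 threads, conservative sampling
./main -t 3 --top-k 3 --temp 0.2 -m ./models/model.q4_k_s.gguf -c 4096 -n -1 -p "### Instruction: Write a story\n### Response:"

# 8-core i9: leave one core free, so at most 7 threads
./main -t 7 --top-k 1 -m ./models/model.q4_k_s.gguf -c 4096 -n -1 -p "### Instruction: Write a story\n### Response:"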
Try building with the latest master |
Tested today with latest master 95bd60a on an Intel MacBook with AMD: it doesn't crash now, but the performance with |
Try CMAKE_ARGS="-DLLAMA_CLBLAST=on -DLLAMA_METAL=off" pip install llama-cpp-python --no-cache-dir --force-reinstall |
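If that install goes through, a quick (hypothetical) check that the Python bindings can now create a context without Metal, using the model path from this issue and n_gpu_layers=0 to keep everything on the CPU:

python -c "from llama_cpp import Llama; Llama(model_path='./models/model.q4_k_s.gguf', n_gpu_layers=0)"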
This issue was closed because it has been inactive for 14 days since being marked as stale. |
Issue: Error when loading model on MacBook Pro with Intel Core i7 and Intel Iris Plus
System Information:
Steps to Reproduce:
wget https://huggingface.co/substratusai/Llama-2-13B-chat-GGUF/resolve/main/model.bin -O model.q4_k_s.gguf
./main -t 4 -m ./models/model.q4_k_s.gguf --color -c 4096 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "### Instruction: Write a story\n### Response:"
Error Message:
I would appreciate any guidance or advice on how to resolve this issue. Thank you!