-
Same here on Mac (Intel MacBook Pro, i7 + iGPU / i9 + Radeon Pro 550).
-
I think I figured it out, but I don't know about your case. I was using the base gemma2-2b-instruct-q6_K_S.llamafile, while on Ollama I was using a q2_S quant (the Q number indicates the quantization level; the lower the number, the smaller and faster the model), which made the llamafile run 1.5-1.8 times slower. You might be using a bigger model with a bigger difference.

Update: I tested this on a Pi 5 and the difference between Ollama and llamafile is now noticeable: llamafile is 1.8-2.2 times faster than Ollama (tested on a Pi 5 at 3 GHz with 4 GB RAM, using gemma2:2b-instruct-q2_K.gguf).
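For a rough sense of why that quant gap alone explains a lot: CPU token generation is largely memory-bandwidth bound, so throughput scales roughly with how many bytes of weights are read per token. A minimal back-of-the-envelope sketch, using approximate bits-per-weight figures (they vary a little by model and llama.cpp version) and treating q6_K_S as roughly Q6_K:

```python
# Rough back-of-the-envelope: CPU token generation is largely memory-bound,
# so throughput scales roughly with the inverse of weight bytes read per token.
# Bits-per-weight values below are approximate and vary slightly by model
# and llama.cpp version.
APPROX_BITS_PER_WEIGHT = {
    "Q2_K": 2.6,
    "Q4_K_M": 4.8,
    "Q6_K": 6.6,
    "Q8_0": 8.5,
}

def expected_slowdown(quant_a: str, quant_b: str) -> float:
    """Rough upper bound on how much slower quant_a is than quant_b on CPU."""
    return APPROX_BITS_PER_WEIGHT[quant_a] / APPROX_BITS_PER_WEIGHT[quant_b]

# Comparing a Q6_K llamafile against a Q2_K model in Ollama:
print(f"~{expected_slowdown('Q6_K', 'Q2_K'):.1f}x more weight data per token")
# Prints ~2.5x. The observed 1.5-1.8x slowdown is lower than that because
# compute and KV-cache traffic do not shrink with the quant level.
```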
-
OK, some benchmarks... and an "answer".
It is as I expected, so here is a bit more explanation: on ARM CPUs, Q4 quants have special optimizations in llama.cpp, so it can be faster than llamafile there (I did not find time to benchmark that). So it all depends on the CPU and the quantization type; both projects have recent optimizations. For me, llamafile is the faster one on K-quants, but not on the others.
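One easy pitfall when comparing the two runtimes is accidentally mixing quantization types. A small, purely illustrative sketch that pulls the quant suffix out of a GGUF filename or Ollama tag, so both sides can be confirmed to be the same (K-quant vs non-K) variant before timing them:

```python
import re

# Illustrative only: extract the quantization suffix (e.g. Q4_0, Q4_K_M, Q6_K)
# from a GGUF filename or Ollama tag, so both runtimes can be checked for the
# same quant type before comparing speeds.
QUANT_RE = re.compile(r"(I?Q\d+(?:_[A-Z0-9]+)*)", re.IGNORECASE)

def quant_type(name: str):
    match = QUANT_RE.search(name)
    return match.group(1).upper() if match else None

for name in ("gemma2-2b-instruct-q6_K_S.llamafile",
             "gemma2:2b-instruct-q2_K",
             "qwen2.5-coder-7b-instruct-q4_k_m.gguf"):
    q = quant_type(name)
    print(f"{name}: {q} ({'K-quant' if q and '_K' in q else 'non-K quant'})")
```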
-
Admittedly, I am pretty new to LLMs and have a lot to learn, so this could be a basic mistake in observation or interpretation. I tried using Ollama and llamafile on the same Ubuntu MATE 24.04.1 desktop, running on an Intel i5-8440 with 32 GB of DDR4 (single-channel) RAM and no discrete GPU -- the main reason I was hoping to see a faster tokens/sec rate with llamafile, as per the claims I had seen. The goal is a local LLM setup serving the Qwen-2.5-Coder-7B (4-bit quantized, K-M) model, in FIM mode, through its OpenAI-compatible web interface, to be used by the Continue.dev plugin running in VS Code.
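As a quick way to exercise that FIM path outside of Continue.dev, here is a rough sketch; the port, route, and model name are assumptions (llamafile's default port 8080, the legacy `/v1/completions` route, and Qwen-2.5-Coder's documented `<|fim_prefix|>`/`<|fim_suffix|>`/`<|fim_middle|>` format):

```python
# Rough sketch only: assumes llamafile is serving on its default port 8080,
# that it exposes the legacy OpenAI-style /v1/completions route, and that the
# loaded model follows Qwen2.5-Coder's fill-in-the-middle prompt format.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1",
                api_key="sk-no-key-required")  # local server ignores the key

prefix = "def read_rows(path):\n    with open(path) as f:\n"
suffix = "\n    return rows\n"

# Qwen2.5-Coder FIM format: prefix, then suffix, then ask for the middle.
fim_prompt = f"<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>"

resp = client.completions.create(
    model="qwen2.5-coder-7b-instruct-q4_k_m",  # placeholder; local servers often ignore it
    prompt=fim_prompt,
    max_tokens=64,
    temperature=0,
)
print(resp.choices[0].text)  # the model's proposed "middle" code
```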
First I tried Ollama with the model above, and it reported the metrics shown in the terminal output further below.
Then I ran llamafile with a freshly downloaded Hugging Face GGUF of the same model, and read the metrics reported below the chat page of its web UI (also shown further below).
So, if my interpretation is correct, Ollama produced 3.38 t/s while llamafile produced 2.64 t/s, which seems odd and does not match what I expected (llamafile being faster). Am I missing something?
The prompt used: "Write a python program to create a csv file with 5 columns having 10 rows, where first column has firstname, second column has age between 11 & 16, third column has a random number between 40 & 100, fourth column has a random zip code, fifth column has a city name."
Here is the llamafile chat UI output:

Here is the ollama terminal output:

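For a more controlled comparison than reading the two UIs, here is a rough sketch that sends the identical prompt to both servers through their OpenAI-compatible endpoints and derives tokens/sec from wall-clock time; the ports (8080 for llamafile, 11434 for Ollama) and the model names are assumptions to adjust to the actual setup:

```python
# Rough apples-to-apples check: send an identical prompt to both local servers
# through their OpenAI-compatible /v1 endpoints and compute completion tokens
# per second from wall-clock time. Ports and model names below are assumptions;
# adjust them to your setup. Wall-clock t/s includes prompt processing, so it
# will read slightly lower than the "eval rate" each server reports itself.
import time
from openai import OpenAI

PROMPT = ("Write a python program to create a csv file with 5 columns having "
          "10 rows, where first column has firstname, second column has age "
          "between 11 & 16, third column has a random number between 40 & 100, "
          "fourth column has a random zip code, fifth column has a city name.")

SERVERS = {
    "llamafile": ("http://localhost:8080/v1", "qwen2.5-coder-7b-instruct-q4_k_m"),
    "ollama": ("http://localhost:11434/v1", "qwen2.5-coder:7b-instruct-q4_K_M"),
}

for name, (base_url, model) in SERVERS.items():
    client = OpenAI(base_url=base_url, api_key="sk-no-key-required")
    start = time.monotonic()
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT}],
        temperature=0,   # reduce run-to-run variance
        max_tokens=512,
    )
    elapsed = time.monotonic() - start
    completion_tokens = resp.usage.completion_tokens if resp.usage else None
    if completion_tokens:
        print(f"{name}: {completion_tokens} tokens in {elapsed:.1f}s "
              f"-> {completion_tokens / elapsed:.2f} t/s")
    else:
        print(f"{name}: finished in {elapsed:.1f}s (no usage info returned)")
```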