-
Same here on Mac (Intel MacBook Pro, i7 + iGPU / i9 + Radeon Pro 550).
-
I think I figured it out, but I don't know about your case. I was using the base gemma2-2b-instruct-q6_K_S.llamafile, while on Ollama I was using a q2_S quant (the Q number indicates the quantization level; the lower the number, the smaller and faster the model), which made the llamafile run 1.5-1.8 times slower. You might be using a bigger model with a bigger difference.

Update: I tested this on a Pi 5 and the difference between Ollama and llamafile is now noticeable: llamafile is 1.8-2.2 times faster than Ollama (tested on a Pi 5 at 3 GHz with 4 GB RAM, using gemma2:2b-instruct-q2_K.gguf).
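For a rough sense of why that quant gap alone explains a lot: CPU token generation is largely memory-bandwidth bound, so throughput scales roughly with how many bytes of weights are read per token. A minimal back-of-the-envelope sketch, using approximate bits-per-weight figures (they vary a little by model and llama.cpp version) and treating q6_K_S as roughly Q6_K:

```python
# Rough back-of-the-envelope: CPU token generation is largely memory-bound,
# so throughput scales roughly with the inverse of weight bytes read per token.
# Bits-per-weight values below are approximate and vary slightly by model
# and llama.cpp version.
APPROX_BITS_PER_WEIGHT = {
    "Q2_K": 2.6,
    "Q4_K_M": 4.8,
    "Q6_K": 6.6,
    "Q8_0": 8.5,
}

def expected_slowdown(quant_a: str, quant_b: str) -> float:
    """Rough upper bound on how much slower quant_a is than quant_b on CPU."""
    return APPROX_BITS_PER_WEIGHT[quant_a] / APPROX_BITS_PER_WEIGHT[quant_b]

# Comparing a Q6_K llamafile against a Q2_K model in Ollama:
print(f"~{expected_slowdown('Q6_K', 'Q2_K'):.1f}x more weight data per token")
# Prints ~2.5x. The observed 1.5-1.8x slowdown is lower than that because
# compute and KV-cache traffic do not shrink with the quant level.
```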
-
OK, some benchmarks... and an "answer".
It is as I expected, so here is a bit more explanation: on ARM CPUs, Q4 quants have special optimizations in llama.cpp, so it can be faster than llamafile there (I did not find time to benchmark that). So it all depends on the CPU and the quantization type; both projects have recent optimizations. For me, llamafile is the faster one on K-quants, but not on the others.
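One easy pitfall when comparing the two runtimes is accidentally mixing quantization types. A small, purely illustrative sketch that pulls the quant suffix out of a GGUF filename or Ollama tag, so both sides can be confirmed to be the same (K-quant vs non-K) variant before timing them:

```python
import re

# Illustrative only: extract the quantization suffix (e.g. Q4_0, Q4_K_M, Q6_K)
# from a GGUF filename or Ollama tag, so both runtimes can be checked for the
# same quant type before comparing speeds.
QUANT_RE = re.compile(r"(I?Q\d+(?:_[A-Z0-9]+)*)", re.IGNORECASE)

def quant_type(name: str):
    match = QUANT_RE.search(name)
    return match.group(1).upper() if match else None

for name in ("gemma2-2b-instruct-q6_K_S.llamafile",
             "gemma2:2b-instruct-q2_K",
             "qwen2.5-coder-7b-instruct-q4_k_m.gguf"):
    q = quant_type(name)
    print(f"{name}: {q} ({'K-quant' if q and '_K' in q else 'non-K quant'})")
```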
-
Admittedly, I am pretty new to LLMs and have a lot to learn, so this could be a basic mistake in observation or interpretation. I tried using Ollama and llamafile on the same Ubuntu MATE 24.04.1 desktop, running on an Intel i5-8440 with 32 GB of DDR4 (single-channel) RAM and no discrete GPU -- the main reason I was hoping to see a faster tokens/sec rate with llamafile, as per the claims I had seen. The goal is a local LLM setup serving the Qwen-2.5-Coder-7B (4-bit quantized, K-M) model, in FIM mode, through its OpenAI-compatible web interface, to be used by the Continue.dev plugin running in VS Code.
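As a quick way to exercise that FIM path outside of Continue.dev, here is a rough sketch; the port, route, and model name are assumptions (llamafile's default port 8080, the legacy `/v1/completions` route, and Qwen-2.5-Coder's documented `<|fim_prefix|>`/`<|fim_suffix|>`/`<|fim_middle|>` format):

```python
# Rough sketch only: assumes llamafile is serving on its default port 8080,
# that it exposes the legacy OpenAI-style /v1/completions route, and that the
# loaded model follows Qwen2.5-Coder's fill-in-the-middle prompt format.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1",
                api_key="sk-no-key-required")  # local server ignores the key

prefix = "def read_rows(path):\n    with open(path) as f:\n"
suffix = "\n    return rows\n"

# Qwen2.5-Coder FIM format: prefix, then suffix, then ask for the middle.
fim_prompt = f"<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>"

resp = client.completions.create(
    model="qwen2.5-coder-7b-instruct-q4_k_m",  # placeholder; local servers often ignore it
    prompt=fim_prompt,
    max_tokens=64,
    temperature=0,
)
print(resp.choices[0].text)  # the model's proposed "middle" code
```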
First I tried Ollama with the model above, and it reported the metrics shown in the terminal output further below.
Then I ran llamafile with a freshly downloaded Hugging Face GGUF of the same model, and read the metrics reported below the chat page of its web UI (also shown further below).
So, if my interpretation is correct, Ollama produced 3.38 t/s while llamafile produced 2.64 t/s, which seems odd and does not match what I expected (llamafile being faster). Am I missing something?
The prompt used: "Write a python program to create a csv file with 5 columns having 10 rows, where first column has firstname, second column has age between 11 & 16, third column has a random number between 40 & 100, fourth column has a random zip code, fifth column has a city name."
Here is the llamafile chat UI output:

Here is the ollama terminal output:

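For a more controlled comparison than reading the two UIs, here is a rough sketch that sends the identical prompt to both servers through their OpenAI-compatible endpoints and derives tokens/sec from wall-clock time; the ports (8080 for llamafile, 11434 for Ollama) and the model names are assumptions to adjust to the actual setup:

```python
# Rough apples-to-apples check: send an identical prompt to both local servers
# through their OpenAI-compatible /v1 endpoints and compute completion tokens
# per second from wall-clock time. Ports and model names below are assumptions;
# adjust them to your setup. Wall-clock t/s includes prompt processing, so it
# will read slightly lower than the "eval rate" each server reports itself.
import time
from openai import OpenAI

PROMPT = ("Write a python program to create a csv file with 5 columns having "
          "10 rows, where first column has firstname, second column has age "
          "between 11 & 16, third column has a random number between 40 & 100, "
          "fourth column has a random zip code, fifth column has a city name.")

SERVERS = {
    "llamafile": ("http://localhost:8080/v1", "qwen2.5-coder-7b-instruct-q4_k_m"),
    "ollama": ("http://localhost:11434/v1", "qwen2.5-coder:7b-instruct-q4_K_M"),
}

for name, (base_url, model) in SERVERS.items():
    client = OpenAI(base_url=base_url, api_key="sk-no-key-required")
    start = time.monotonic()
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT}],
        temperature=0,   # reduce run-to-run variance
        max_tokens=512,
    )
    elapsed = time.monotonic() - start
    completion_tokens = resp.usage.completion_tokens if resp.usage else None
    if completion_tokens:
        print(f"{name}: {completion_tokens} tokens in {elapsed:.1f}s "
              f"-> {completion_tokens / elapsed:.2f} t/s")
    else:
        print(f"{name}: finished in {elapsed:.1f}s (no usage info returned)")
```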