Name and Version
./llama-cli --version
version: 4077 (af148c9)
built with Apple clang version 15.0.0 (clang-1500.3.9.4) for arm64-apple-darwin23.6.0
Operating systems
Mac
Which llama.cpp modules do you know to be affected?
Other (Please specify in the next section)
Problem description & steps to reproduce
The affected module is the speculative decoding example (llama-speculative).
I was testing the speedup from speculative decoding on my Mac mini, using llama-160m.Q8_0 as the draft model. I found that decoding speed increased when the target model is in fp16, but decreased when the target is in Q4. My test table and some logs are included below. Quantized models are much smaller and should decode faster, so I don't see how the quantization/dequantization overhead could be this large. Have others run into similar behavior, and what could be the cause?
| Target Model | Model Size | Base Speed (tok/s) | Speculative Speed (tok/s) | Speed Up |
| --- | --- | --- | --- | --- |
| Llama-2-7b-chat-hf-f16 | 13G | 17.3 | 26.8 | +54.9% |
| Llama-2-7b-chat.Q4_K_M | 3.8G | 40.7 | 22.3 | -45.2% |
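For reference, the Speed Up column is the relative change of the speculative speed versus the base (non-speculative) speed; a minimal sketch of that arithmetic:

```python
# How the "Speed Up" column is computed: relative change of speculative vs base speed.
for name, base_tps, spec_tps in [
    ("Llama-2-7b-chat-hf-f16", 17.3, 26.8),
    ("Llama-2-7b-chat.Q4_K_M", 40.7, 22.3),
]:
    print(f"{name}: {100.0 * (spec_tps / base_tps - 1.0):+.1f}%")
# -> Llama-2-7b-chat-hf-f16: +54.9%
# -> Llama-2-7b-chat.Q4_K_M: -45.2%
```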
Command I ran:
./llama-speculative -m /Users/baihuajun/Documents/models/Llama-2-7b-chat-hf-f16.gguf -md /Users/baihuajun/Documents/models/llama-160m.Q8_0.gguf -p "Building a website can be done in 10 simple steps:\nStep 1:" -e --temp -1 -n 64 --repeat-last-n 0 --repeat-penalty 1.0 --draft 5 -np 1
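For context (this is not part of the original report), here is a rough per-round throughput model built from the numbers in the logs below: each round the draft model generates K tokens one by one, then the target verifies K+1 positions in a single batch. The assumption that batched verification runs at the target's reported "prompt eval" rate is mine, and the model ignores sampling and host-side overhead, so treat the results as ballpark only.

```python
# Rough speculative-decoding throughput model (illustrative; assumptions noted above).
def spec_speed(k, draft_tps, verify_tps, accept_rate):
    # one round: draft k tokens autoregressively, then batch-verify k+1 positions on the target
    round_time = k / draft_tps + (k + 1) / verify_tps
    # expected tokens kept per round: accepted draft tokens plus the target's own next token
    tokens_per_round = accept_rate * k + 1
    return tokens_per_round / round_time

# fp16 target: draft eval ~525 t/s, target prompt-eval ~66.3 t/s, accept ~36.7% (from the log)
print(f"fp16  : {spec_speed(5, 525.0, 66.3, 0.367):.1f} t/s")   # ~28 t/s (measured: 26.8)

# Q4_K_M target: draft eval ~516 t/s, target prompt-eval ~61.4 t/s, accept ~31.2% (from the log)
print(f"Q4_K_M: {spec_speed(5, 515.7, 61.4, 0.312):.1f} t/s")   # ~24 t/s (measured: 22.4)
```

Under this sketch, the Q4_K_M target loses out because its batched verification throughput in the logs (~61 t/s) is no better than fp16's (~66 t/s), while its single-token base speed (40.7 t/s) is already much higher, so speculation has far less room to help. Whether that is the actual cause here is for the maintainers to confirm.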
First Bad Commit
No response
Relevant log output
# This is for fp16 speculative
baihuajun@baihuajundeMac-mini bin % ./llama-speculative -m /Users/baihuajun/Documents/models/Llama-2-7b-chat-hf-f16.gguf -md /Users/baihuajun/Documents/models/llama-160m.Q8_0.gguf -p "Building a website can be done in 10 simple steps:\nStep 1:" -e --temp -1 -n 64 --repeat-last-n 0 --repeat-penalty 1.0 --draft 5 -np 1
build: 4077 (af148c93) with Apple clang version 15.0.0 (clang-1500.3.9.4) for arm64-apple-darwin23.6.0
llama_load_model_from_file: using device Metal (Apple M4 Pro) - 16383 MiB free
llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from /Users/baihuajun/Documents/models/Llama-2-7b-chat-hf-f16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = LLaMA v2
llama_model_loader: - kv 2: llama.vocab_size u32 = 32000
llama_model_loader: - kv 3: llama.context_length u32 = 4096
llama_model_loader: - kv 4: llama.embedding_length u32 = 4096
llama_model_loader: - kv 5: llama.block_count u32 = 32
llama_model_loader: - kv 6: llama.feed_forward_length u32 = 11008
llama_model_loader: - kv 7: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 8: llama.attention.head_count u32 = 32
llama_model_loader: - kv 9: llama.attention.head_count_kv u32 = 32
llama_model_loader: - kv 10: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 11: general.file_type u32 = 1
llama_model_loader: - kv 12: tokenizer.ggml.model str = llama
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 19: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 20: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 21: tokenizer.chat_template str = {% if messages[0]['role'] == 'system'...
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type f16: 226 tensors
llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
llm_load_vocab: special tokens cache size = 3
llm_load_vocab: token to piece cache size = 0.1684 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 4096
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 32
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 4096
llm_load_print_meta: n_embd_v_gqa = 4096
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 11008
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 4096
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 7B
llm_load_print_meta: model ftype = F16
llm_load_print_meta: model params = 6.74 B
llm_load_print_meta: model size = 12.55 GiB (16.00 BPW)
llm_load_print_meta: general.name = LLaMA v2
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_print_meta: EOG token = 2 '</s>'
llm_load_print_meta: max token length = 48
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading output layer to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: Metal_Mapped model buffer size = 12603.02 MiB
llm_load_tensors: CPU_Mapped model buffer size = 250.00 MiB
...................................................................................................
llama_new_context_with_model: n_seq_max = 1
llama_new_context_with_model: n_ctx = 4096
llama_new_context_with_model: n_ctx_per_seq = 4096
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M4 Pro
ggml_metal_init: picking default device: Apple M4 Pro
ggml_metal_init: using embedded metal library
ggml_metal_init: GPU name: Apple M4 Pro
ggml_metal_init: GPU family: MTLGPUFamilyApple9 (1009)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3 (5001)
ggml_metal_init: simdgroup reduction = true
ggml_metal_init: simdgroup matrix mul. = true
ggml_metal_init: has bfloat = true
ggml_metal_init: use bfloat = true
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: recommendedMaxWorkingSetSize = 17179.89 MB
llama_kv_cache_init: Metal KV buffer size = 2048.00 MiB
llama_new_context_with_model: KV self size = 2048.00 MiB, K (f16): 1024.00 MiB, V (f16): 1024.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.12 MiB
llama_new_context_with_model: Metal compute buffer size = 296.00 MiB
llama_new_context_with_model: CPU compute buffer size = 16.01 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 2
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
llama_load_model_from_file: using device Metal (Apple M4 Pro) - 1181 MiB free
llama_model_loader: loaded meta data with 31 key-value pairs and 111 tensors from /Users/baihuajun/Documents/models/llama-160m.Q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Llama 160m
llama_model_loader: - kv 3: general.organization str = JackFram
llama_model_loader: - kv 4: general.basename str = llama
llama_model_loader: - kv 5: general.size_label str = 160M
llama_model_loader: - kv 6: general.license str = apache-2.0
llama_model_loader: - kv 7: general.tags arr[str,1] = ["text-generation"]
llama_model_loader: - kv 8: general.languages arr[str,1] = ["en"]
llama_model_loader: - kv 9: general.datasets arr[str,1] = ["wikipedia"]
llama_model_loader: - kv 10: llama.block_count u32 = 12
llama_model_loader: - kv 11: llama.context_length u32 = 2048
llama_model_loader: - kv 12: llama.embedding_length u32 = 768
llama_model_loader: - kv 13: llama.feed_forward_length u32 = 3072
llama_model_loader: - kv 14: llama.attention.head_count u32 = 12
llama_model_loader: - kv 15: llama.attention.head_count_kv u32 = 12
llama_model_loader: - kv 16: llama.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 17: general.file_type u32 = 7
llama_model_loader: - kv 18: llama.vocab_size u32 = 32000
llama_model_loader: - kv 19: llama.rope.dimension_count u32 = 64
llama_model_loader: - kv 20: tokenizer.ggml.model str = llama
llama_model_loader: - kv 21: tokenizer.ggml.pre str = default
llama_model_loader: - kv 22: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv 23: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 24: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 25: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 26: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 27: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 28: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 29: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 30: general.quantization_version u32 = 2
llama_model_loader: - type f32: 25 tensors
llama_model_loader: - type q8_0: 86 tensors
llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
llm_load_vocab: special tokens cache size = 3
llm_load_vocab: token to piece cache size = 0.1684 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 2048
llm_load_print_meta: n_embd = 768
llm_load_print_meta: n_layer = 12
llm_load_print_meta: n_head = 12
llm_load_print_meta: n_head_kv = 12
llm_load_print_meta: n_rot = 64
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 64
llm_load_print_meta: n_embd_head_v = 64
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 768
llm_load_print_meta: n_embd_v_gqa = 768
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 3072
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 2048
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = ?B
llm_load_print_meta: model ftype = Q8_0
llm_load_print_meta: model params = 162.42 M
llm_load_print_meta: model size = 164.63 MiB (8.50 BPW)
llm_load_print_meta: general.name = Llama 160m
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_print_meta: EOG token = 2 '</s>'
llm_load_print_meta: max token length = 48
llm_load_tensors: offloading 12 repeating layers to GPU
llm_load_tensors: offloading output layer to GPU
llm_load_tensors: offloaded 13/13 layers to GPU
llm_load_tensors: Metal_Mapped model buffer size = 164.64 MiB
llm_load_tensors: CPU_Mapped model buffer size = 24.90 MiB
.......................................................
llama_new_context_with_model: n_seq_max = 1
llama_new_context_with_model: n_ctx = 4096
llama_new_context_with_model: n_ctx_per_seq = 4096
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: n_ctx_per_seq (4096) > n_ctx_train (2048) -- possible training context overflow
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M4 Pro
ggml_metal_init: picking default device: Apple M4 Pro
ggml_metal_init: using embedded metal library
ggml_metal_init: GPU name: Apple M4 Pro
ggml_metal_init: GPU family: MTLGPUFamilyApple9 (1009)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3 (5001)
ggml_metal_init: simdgroup reduction = true
ggml_metal_init: simdgroup matrix mul. = true
ggml_metal_init: has bfloat = true
ggml_metal_init: use bfloat = true
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: recommendedMaxWorkingSetSize = 17179.89 MB
llama_kv_cache_init: Metal KV buffer size = 144.00 MiB
llama_new_context_with_model: KV self size = 144.00 MiB, K (f16): 72.00 MiB, V (f16): 72.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.12 MiB
llama_new_context_with_model: Metal compute buffer size = 110.00 MiB
llama_new_context_with_model: CPU compute buffer size = 9.51 MiB
llama_new_context_with_model: graph nodes = 390
llama_new_context_with_model: graph splits = 2
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
<s> Building a website can be done in 10 simple steps:
Step 1: Define Your Website's Purpose
Step 2: Choose a Domain Name
Step 3: Select a Web Host
Step 4: Plan Your Website's Structure
Step 5: Design Your Website
Step 6: Build Your Website
Step 7: Launch Your Website
Step 8: Opt
encoded 19 tokens in 0.186 seconds, speed: 102.365 t/s
decoded 69 tokens in 2.570 seconds, speed: 26.848 t/s
n_draft = 5
n_predict = 69
n_drafted = 120
n_accept = 44
accept = 36.667%
draft:
llama_perf_context_print: load time = 218.38 ms
llama_perf_context_print: prompt eval time = 2288.40 ms / 66 tokens ( 34.67 ms per token, 28.84 tokens per second)
llama_perf_context_print: eval time = 182.85 ms / 96 runs ( 1.90 ms per token, 525.02 tokens per second)
llama_perf_context_print: total time = 2757.47 ms / 162 tokens
target:
llama_perf_sampler_print: sampling time = 1.54 ms / 69 runs ( 0.02 ms per token, 44660.19 tokens per second)
llama_perf_context_print: load time = 6931.61 ms
llama_perf_context_print: prompt eval time = 2457.71 ms / 163 tokens ( 15.08 ms per token, 66.32 tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 2976.91 ms / 164 tokens
ggml_metal_free: deallocating
ggml_metal_free: deallocating
# This is for Q4 speculative
baihuajun@baihuajundeMac-mini bin % ./llama-speculative -m /Users/baihuajun/Documents/models/llama-2-7b-chat.Q4_K_M.gguf -md /Users/baihuajun/Documents/models/llama-160m.Q8_0.gguf -p "Building a website can be done in 10 simple steps:\nStep 1:" -e --temp -1 -n 64 --repeat-last-n 0 --repeat-penalty 1.0 --draft 5 -np 1
build: 4077 (af148c93) with Apple clang version 15.0.0 (clang-1500.3.9.4) for arm64-apple-darwin23.6.0
llama_load_model_from_file: using device Metal (Apple M4 Pro) - 16383 MiB free
llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from /Users/baihuajun/Documents/models/llama-2-7b-chat.Q4_K_M.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = LLaMA v2
llama_model_loader: - kv 2: llama.context_length u32 = 4096
llama_model_loader: - kv 3: llama.embedding_length u32 = 4096
llama_model_loader: - kv 4: llama.block_count u32 = 32
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 11008
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 7: llama.attention.head_count u32 = 32
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 32
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 10: general.file_type u32 = 15
llama_model_loader: - kv 11: tokenizer.ggml.model str = llama
llama_model_loader: - kv 12: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv 13: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 14: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 15: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 16: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 17: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 18: general.quantization_version u32 = 2
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type q4_K: 193 tensors
llama_model_loader: - type q6_K: 33 tensors
llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
llm_load_vocab: special tokens cache size = 3
llm_load_vocab: token to piece cache size = 0.1684 MB
llm_load_print_meta: format = GGUF V2
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 4096
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 32
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 4096
llm_load_print_meta: n_embd_v_gqa = 4096
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 11008
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 4096
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 7B
llm_load_print_meta: model ftype = Q4_K - Medium
llm_load_print_meta: model params = 6.74 B
llm_load_print_meta: model size = 3.80 GiB (4.84 BPW)
llm_load_print_meta: general.name = LLaMA v2
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_print_meta: EOG token = 2 '</s>'
llm_load_print_meta: max token length = 48
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading output layer to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: Metal_Mapped model buffer size = 3820.93 MiB
llm_load_tensors: CPU_Mapped model buffer size = 70.31 MiB
..................................................................................................
llama_new_context_with_model: n_seq_max = 1
llama_new_context_with_model: n_ctx = 4096
llama_new_context_with_model: n_ctx_per_seq = 4096
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M4 Pro
ggml_metal_init: picking default device: Apple M4 Pro
ggml_metal_init: using embedded metal library
ggml_metal_init: GPU name: Apple M4 Pro
ggml_metal_init: GPU family: MTLGPUFamilyApple9 (1009)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3 (5001)
ggml_metal_init: simdgroup reduction = true
ggml_metal_init: simdgroup matrix mul. = true
ggml_metal_init: has bfloat = true
ggml_metal_init: use bfloat = true
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: recommendedMaxWorkingSetSize = 17179.89 MB
llama_kv_cache_init: Metal KV buffer size = 2048.00 MiB
llama_new_context_with_model: KV self size = 2048.00 MiB, K (f16): 1024.00 MiB, V (f16): 1024.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.12 MiB
llama_new_context_with_model: Metal compute buffer size = 296.00 MiB
llama_new_context_with_model: CPU compute buffer size = 16.01 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 2
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
llama_load_model_from_file: using device Metal (Apple M4 Pro) - 10213 MiB free
llama_model_loader: loaded meta data with 31 key-value pairs and 111 tensors from /Users/baihuajun/Documents/models/llama-160m.Q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Llama 160m
llama_model_loader: - kv 3: general.organization str = JackFram
llama_model_loader: - kv 4: general.basename str = llama
llama_model_loader: - kv 5: general.size_label str = 160M
llama_model_loader: - kv 6: general.license str = apache-2.0
llama_model_loader: - kv 7: general.tags arr[str,1] = ["text-generation"]
llama_model_loader: - kv 8: general.languages arr[str,1] = ["en"]
llama_model_loader: - kv 9: general.datasets arr[str,1] = ["wikipedia"]
llama_model_loader: - kv 10: llama.block_count u32 = 12
llama_model_loader: - kv 11: llama.context_length u32 = 2048
llama_model_loader: - kv 12: llama.embedding_length u32 = 768
llama_model_loader: - kv 13: llama.feed_forward_length u32 = 3072
llama_model_loader: - kv 14: llama.attention.head_count u32 = 12
llama_model_loader: - kv 15: llama.attention.head_count_kv u32 = 12
llama_model_loader: - kv 16: llama.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 17: general.file_type u32 = 7
llama_model_loader: - kv 18: llama.vocab_size u32 = 32000
llama_model_loader: - kv 19: llama.rope.dimension_count u32 = 64
llama_model_loader: - kv 20: tokenizer.ggml.model str = llama
llama_model_loader: - kv 21: tokenizer.ggml.pre str = default
llama_model_loader: - kv 22: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv 23: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 24: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 25: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 26: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 27: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 28: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 29: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 30: general.quantization_version u32 = 2
llama_model_loader: - type f32: 25 tensors
llama_model_loader: - type q8_0: 86 tensors
llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
llm_load_vocab: special tokens cache size = 3
llm_load_vocab: token to piece cache size = 0.1684 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 2048
llm_load_print_meta: n_embd = 768
llm_load_print_meta: n_layer = 12
llm_load_print_meta: n_head = 12
llm_load_print_meta: n_head_kv = 12
llm_load_print_meta: n_rot = 64
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 64
llm_load_print_meta: n_embd_head_v = 64
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 768
llm_load_print_meta: n_embd_v_gqa = 768
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 3072
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 2048
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = ?B
llm_load_print_meta: model ftype = Q8_0
llm_load_print_meta: model params = 162.42 M
llm_load_print_meta: model size = 164.63 MiB (8.50 BPW)
llm_load_print_meta: general.name = Llama 160m
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_print_meta: EOG token = 2 '</s>'
llm_load_print_meta: max token length = 48
llm_load_tensors: offloading 12 repeating layers to GPU
llm_load_tensors: offloading output layer to GPU
llm_load_tensors: offloaded 13/13 layers to GPU
llm_load_tensors: Metal_Mapped model buffer size = 164.64 MiB
llm_load_tensors: CPU_Mapped model buffer size = 24.90 MiB
.......................................................
llama_new_context_with_model: n_seq_max = 1
llama_new_context_with_model: n_ctx = 4096
llama_new_context_with_model: n_ctx_per_seq = 4096
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: n_ctx_per_seq (4096) > n_ctx_train (2048) -- possible training context overflow
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M4 Pro
ggml_metal_init: picking default device: Apple M4 Pro
ggml_metal_init: using embedded metal library
ggml_metal_init: GPU name: Apple M4 Pro
ggml_metal_init: GPU family: MTLGPUFamilyApple9 (1009)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3 (5001)
ggml_metal_init: simdgroup reduction = true
ggml_metal_init: simdgroup matrix mul. = true
ggml_metal_init: has bfloat = true
ggml_metal_init: use bfloat = true
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: recommendedMaxWorkingSetSize = 17179.89 MB
llama_kv_cache_init: Metal KV buffer size = 144.00 MiB
llama_new_context_with_model: KV self size = 144.00 MiB, K (f16): 72.00 MiB, V (f16): 72.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.12 MiB
llama_new_context_with_model: Metal compute buffer size = 110.00 MiB
llama_new_context_with_model: CPU compute buffer size = 9.51 MiB
llama_new_context_with_model: graph nodes = 390
llama_new_context_with_model: graph splits = 2
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
<s> Building a website can be done in 10 simple steps:
Step 1: Define your website's purpose and goals
Step 2: Choose a domain name and web hosting provider
Step 3: Plan your website's design and layout
Step 4: Create content for your website
Step 5: Design and build your website's pages
Step 6: Add features
encoded 19 tokens in 0.155 seconds, speed: 122.295 t/s
decoded 65 tokens in 2.907 seconds, speed: 22.359 t/s
n_draft = 5
n_predict = 65
n_drafted = 125
n_accept = 39
accept = 31.200%
draft:
llama_perf_context_print: load time = 141.23 ms
llama_perf_context_print: prompt eval time = 2603.95 ms / 68 tokens ( 38.29 ms per token, 26.11 tokens per second)
llama_perf_context_print: eval time = 193.91 ms / 100 runs ( 1.94 ms per token, 515.70 tokens per second)
llama_perf_context_print: total time = 3062.81 ms / 168 tokens
target:
llama_perf_sampler_print: sampling time = 1.44 ms / 65 runs ( 0.02 ms per token, 45201.67 tokens per second)
llama_perf_context_print: load time = 2282.95 ms
llama_perf_context_print: prompt eval time = 2751.54 ms / 169 tokens ( 16.28 ms per token, 61.42 tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 3204.07 ms / 170 tokens
ggml_metal_free: deallocating
ggml_metal_free: deallocating