Bug: failed to start in embedding mode #702

Open
harikt opened this issue Mar 1, 2025 · 1 comment

Comments


harikt commented Mar 1, 2025

What happened?

./Llama-3.2-1B-Instruct.Q6_K.llamafile --server --embedding --nobrowser
error: unknown argument: --nobrowser
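
(My guess is that --nobrowser is only parsed by the regular server and not by embedding_cli. Assuming --help is still accepted in this mode, it should list the flags embedding mode actually understands:)

./Llama-3.2-1B-Instruct.Q6_K.llamafile --server --embedding --help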

Without passing --nobrowser, it starts loading but then aborts:

./Llama-3.2-1B-Instruct.Q6_K.llamafile --server --embedding
Apple Metal GPU support successfully loaded
Cmd: ./Llama-3.2-1B-Instruct.Q6_K.llamafile -m Llama-3.2-1B-Instruct.Q6_K.gguf --server --embedding
embedding_cli: llamafile version 0.9.0
embedding_cli: seed  = 580381
llama_model_loader: loaded meta data with 28 key-value pairs and 147 tensors from Llama-3.2-1B-Instruct.Q6_K.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                         general.size_label str              = 1.2B
llama_model_loader: - kv   3:                            general.license str              = llama3.2
llama_model_loader: - kv   4:                               general.tags arr[str,6]       = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv   5:                          general.languages arr[str,8]       = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv   6:                          llama.block_count u32              = 16
llama_model_loader: - kv   7:                       llama.context_length u32              = 131072
llama_model_loader: - kv   8:                     llama.embedding_length u32              = 2048
llama_model_loader: - kv   9:                  llama.feed_forward_length u32              = 8192
llama_model_loader: - kv  10:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv  11:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  12:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  13:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  14:                 llama.attention.key_length u32              = 64
llama_model_loader: - kv  15:               llama.attention.value_length u32              = 64
llama_model_loader: - kv  16:                          general.file_type u32              = 18
llama_model_loader: - kv  17:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  18:                 llama.rope.dimension_count u32              = 64
llama_model_loader: - kv  19:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  20:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  21:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  22:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  23:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  24:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  25:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  26:                    tokenizer.chat_template str              = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv  27:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   34 tensors
llama_model_loader: - type q6_K:  113 tensors
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 131072
llm_load_print_meta: n_embd           = 2048
llm_load_print_meta: n_layer          = 16
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 64
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 64
llm_load_print_meta: n_embd_head_v    = 64
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 512
llm_load_print_meta: n_embd_v_gqa     = 512
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 8192
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 131072
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = ?B
llm_load_print_meta: model ftype      = Q6_K
llm_load_print_meta: model params     = 1.24 B
llm_load_print_meta: model size       = 967.00 MiB (6.56 BPW)
llm_load_print_meta: general.name     = n/a
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128009 '<|eot_id|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: ggml ctx size =    0.16 MiB
ggml_backend_metal_log_allocated_size: allocated buffer, size =   967.02 MiB, (  967.08 / 10922.67)
llm_load_tensors: offloading 16 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 17/17 layers to GPU
llm_load_tensors:      Metal buffer size =   967.01 MiB
llm_load_tensors:        CPU buffer size =   205.49 MiB
............................................................
llama_new_context_with_model: n_ctx      = 8192
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 2048
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M1
ggml_metal_init: picking default device: Apple M1
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: GGML_METAL_PATH_RESOURCES = nil
ggml_metal_init: loading '/Users/harikt/.llamafile/v/0.9.0/ggml-metal.metal'
ggml_metal_init: GPU name:   Apple M1
ggml_metal_init: GPU family: MTLGPUFamilyApple7  (1007)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_init: simdgroup reduction support   = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory              = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 11453.25 MB
llama_kv_cache_init:      Metal KV buffer size =   256.00 MiB
llama_new_context_with_model: KV self size  =  256.00 MiB, K (f16):  128.00 MiB, V (f16):  128.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.01 MiB
llama_new_context_with_model:      Metal compute buffer size =  2176.02 MiB
llama_new_context_with_model:        CPU compute buffer size =    80.02 MiB
llama_new_context_with_model: graph nodes  = 518
llama_new_context_with_model: graph splits = 2
warming up the model with an empty run
llama.cpp/main/embedding.cpp:125: GGML_ASSERT(params.n_batch >= params.n_ctx) failed

error: Uncaught SIGABRT (SI_0) on Hari.local pid 77238 tid 77238
 /Users/harikt/Downloads/llamafile/Llama-3.2-1B-Instruct.Q6_K.llamafile
 No such file or directory
 Darwin Cosmopolitan 4.0.2 MODE=aarch64; Darwin Kernel Version 22.6.0: Fri Sep 15 13:41:30 PDT 2023; root:xnu-8796.141.3.700.8~1/RELEASE_ARM64_T8103 Hari.local 22.6.0
 cosmoaddr2line /Users/harikt/Downloads/llamafile/Llama-3.2-1B-Instruct.Q6_K.llamafile 197d10744 405b000197c1e46c 1f038001046eeb6c 800330f08 8000007dc 8000be8cc 800011798 800004904 800003d4c 800000140
 0000000000000000 x0 91c9670a4d62d8fa x8  0000000000000148 x16 0000000000000000 x24
 0000000000000000 x1 91c9670bbfbff87a x9  00000001f78f73f8 x17 0000000000000000 x25
 0000000000000008 x2 000000000000000a x10 0000000000000000 x18 0000000000000000 x26
 0000000802bb2000 x3 0000000000000002 x11 0000000000000006 x19 0000000000000000 x27
 0000000000000000 x4 0000000000000000 x12 00000001f2dd2080 x20 0000000802ba7540 x28
 0000000000012db6 x5 0555555555555555 x13 0000000000000103 x21 000000016b7101e0 x29
 0000000000000008 x6 0000000000180000 x14 00000001f2dd2160 x22 0000000197d47c28 x30
 0000000000000021 x7 0000000000080000 x15 0000000000000000 x23 000000016b7101c0 x31
 000000016b7101c0 sp 197d10744 pc NULL-1748168628
 000000016b7101e0 fp 405b000197c1e46c lr NULL-1749160588
 000000016b710200 fp 1f038001046eeb6c lr NULL+74118260
 000000016b710210 fp 800330f08 lr raise+48
 000000016b710220 fp 8000007dc lr abort+48
 000000016b710240 fp 8000be8cc lr NULL+520660
 000000016b710260 fp 800011798 lr embedding_cli(int, char**)+3628
 000000016b710e28 fp 800004904 lr main+320
 000000016b711ea0 fp 800003d4c lr cosmo+1144
 000000016b711f00 fp 800000140 lr _start
zsh: abort      ./Llama-3.2-1B-Instruct.Q6_K.llamafile --server --embedding
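
From the log above, n_ctx is 8192 while n_batch is 2048, which is exactly what trips GGML_ASSERT(params.n_batch >= params.n_ctx). As a possible workaround (my own guess, assuming embedding mode still honors the standard llama.cpp -c/--ctx-size and -b/--batch-size flags), shrinking the context to the batch size, or growing the batch to match the context, might get past the assertion:

./Llama-3.2-1B-Instruct.Q6_K.llamafile --server --embedding -c 2048
# or
./Llama-3.2-1B-Instruct.Q6_K.llamafile --server --embedding -b 8192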

I have a different llamafile, Meta-Llama-3-8B-Instruct.Q5_K_M.llamafile, which I can start in embedding mode with --nobrowser and so on. I can also test whether embeddings work via:

curl -X POST http://localhost:8080/embedding \
  -H "Content-Type: application/json" \
  -d '{
    "content": "This is a test sentence to generate embeddings for."
  }'
./Meta-Llama-3-8B-Instruct.Q5_K_M.llamafile --server --embedding --nobrowser
extracting /zip/llama.cpp/ggml.h to /Users/harikt/.llamafile/v/0.8.9/ggml.h
extracting /zip/llamafile/llamafile.h to /Users/harikt/.llamafile/v/0.8.9/llamafile.h
extracting /zip/llama.cpp/ggml-impl.h to /Users/harikt/.llamafile/v/0.8.9/ggml-impl.h
extracting /zip/llama.cpp/ggml-metal.h to /Users/harikt/.llamafile/v/0.8.9/ggml-metal.h
extracting /zip/llama.cpp/ggml-alloc.h to /Users/harikt/.llamafile/v/0.8.9/ggml-alloc.h
extracting /zip/llama.cpp/ggml-common.h to /Users/harikt/.llamafile/v/0.8.9/ggml-common.h
extracting /zip/llama.cpp/ggml-quants.h to /Users/harikt/.llamafile/v/0.8.9/ggml-quants.h
extracting /zip/llama.cpp/ggml-backend.h to /Users/harikt/.llamafile/v/0.8.9/ggml-backend.h
extracting /zip/llama.cpp/ggml-metal.metal to /Users/harikt/.llamafile/v/0.8.9/ggml-metal.metal
extracting /zip/llama.cpp/ggml-backend-impl.h to /Users/harikt/.llamafile/v/0.8.9/ggml-backend-impl.h
extracting /zip/llama.cpp/ggml-metal.m to /Users/harikt/.llamafile/v/0.8.9/ggml-metal.m
building ggml-metal.dylib with xcode...
llamafile_log_command: cc -I. -O3 -fPIC -shared -pthread -DNDEBUG -ffixed-x28 -DTARGET_OS_OSX -DGGML_MULTIPLATFORM /Users/harikt/.llamafile/v/0.8.9/ggml-metal.m -o /Users/harikt/.llamafile/v/0.8.9/ggml-metal.dylib.5snynm -framework Foundation -framework Metal -framework MetalKit
In file included from /Users/harikt/.llamafile/v/0.8.9/ggml-metal.m:3:
In file included from /Users/harikt/.llamafile/v/0.8.9/ggml-metal.h:22:
In file included from /Users/harikt/.llamafile/v/0.8.9/ggml.h:219:
In file included from /usr/local/include/stdio.h:64:
/usr/local/include/_stdio.h:93:16: warning: pointer is missing a nullability type specifier (_Nonnull, _Nullable, or _Null_unspecified) [-Wnullability-completeness]
        unsigned char   *_base;
                        ^
/usr/local/include/_stdio.h:93:16: note: insert '_Nullable' if the pointer may be null
        unsigned char   *_base;
                        ^
                          _Nullable
/usr/local/include/_stdio.h:93:16: note: insert '_Nonnull' if the pointer should never be null
        unsigned char   *_base;
                        ^
                          _Nonnull
/usr/local/include/_stdio.h:138:32: warning: pointer is missing a nullability type specifier (_Nonnull, _Nullable, or _Null_unspecified) [-Wnullability-completeness]
        int     (* _Nullable _read) (void *, char *, int);
                                          ^
/usr/local/include/_stdio.h:138:32: note: insert '_Nullable' if the pointer may be null
        int     (* _Nullable _read) (void *, char *, int);
                                          ^
                                           _Nullable
/usr/local/include/_stdio.h:138:32: note: insert '_Nonnull' if the pointer should never be null
        int     (* _Nullable _read) (void *, char *, int);
                                          ^
                                           _Nonnull
/usr/local/include/_stdio.h:138:40: warning: pointer is missing a nullability type specifier (_Nonnull, _Nullable, or _Null_unspecified) [-Wnullability-completeness]
        int     (* _Nullable _read) (void *, char *, int);
                                                  ^
/usr/local/include/_stdio.h:138:40: note: insert '_Nullable' if the pointer may be null
        int     (* _Nullable _read) (void *, char *, int);
                                                  ^
                                                   _Nullable
/usr/local/include/_stdio.h:138:40: note: insert '_Nonnull' if the pointer should never be null
        int     (* _Nullable _read) (void *, char *, int);
                                                  ^
                                                   _Nonnull
/usr/local/include/_stdio.h:139:35: warning: pointer is missing a nullability type specifier (_Nonnull, _Nullable, or _Null_unspecified) [-Wnullability-completeness]
        fpos_t  (* _Nullable _seek) (void *, fpos_t, int);
                                          ^
/usr/local/include/_stdio.h:139:35: note: insert '_Nullable' if the pointer may be null
        fpos_t  (* _Nullable _seek) (void *, fpos_t, int);
                                          ^
                                           _Nullable
/usr/local/include/_stdio.h:139:35: note: insert '_Nonnull' if the pointer should never be null
        fpos_t  (* _Nullable _seek) (void *, fpos_t, int);
                                          ^
                                           _Nonnull
/usr/local/include/_stdio.h:140:32: warning: pointer is missing a nullability type specifier (_Nonnull, _Nullable, or _Null_unspecified) [-Wnullability-completeness]
        int     (* _Nullable _write)(void *, const char *, int);
                                          ^
/usr/local/include/_stdio.h:140:32: note: insert '_Nullable' if the pointer may be null
        int     (* _Nullable _write)(void *, const char *, int);
                                          ^
                                           _Nullable
/usr/local/include/_stdio.h:140:32: note: insert '_Nonnull' if the pointer should never be null
        int     (* _Nullable _write)(void *, const char *, int);
                                          ^
                                           _Nonnull
/usr/local/include/_stdio.h:140:46: warning: pointer is missing a nullability type specifier (_Nonnull, _Nullable, or _Null_unspecified) [-Wnullability-completeness]
        int     (* _Nullable _write)(void *, const char *, int);
                                                        ^
/usr/local/include/_stdio.h:140:46: note: insert '_Nullable' if the pointer may be null
        int     (* _Nullable _write)(void *, const char *, int);
                                                        ^
                                                         _Nullable
/usr/local/include/_stdio.h:140:46: note: insert '_Nonnull' if the pointer should never be null
        int     (* _Nullable _write)(void *, const char *, int);
                                                        ^
                                                         _Nonnull
/usr/local/include/_stdio.h:144:18: warning: pointer is missing a nullability type specifier (_Nonnull, _Nullable, or _Null_unspecified) [-Wnullability-completeness]
        struct __sFILEX *_extra; /* additions to FILE to not break ABI */
                        ^
/usr/local/include/_stdio.h:144:18: note: insert '_Nullable' if the pointer may be null
        struct __sFILEX *_extra; /* additions to FILE to not break ABI */
                        ^
                          _Nullable
/usr/local/include/_stdio.h:144:18: note: insert '_Nonnull' if the pointer should never be null
        struct __sFILEX *_extra; /* additions to FILE to not break ABI */
                        ^
                          _Nonnull
In file included from /Users/harikt/.llamafile/v/0.8.9/ggml-metal.m:3:
In file included from /Users/harikt/.llamafile/v/0.8.9/ggml-metal.h:22:
In file included from /Users/harikt/.llamafile/v/0.8.9/ggml.h:219:
/usr/local/include/stdio.h:67:13: warning: pointer is missing a nullability type specifier (_Nonnull, _Nullable, or _Null_unspecified) [-Wnullability-completeness]
extern FILE *__stdinp;
            ^
/usr/local/include/stdio.h:67:13: note: insert '_Nullable' if the pointer may be null
extern FILE *__stdinp;
            ^
              _Nullable
/usr/local/include/stdio.h:67:13: note: insert '_Nonnull' if the pointer should never be null
extern FILE *__stdinp;
            ^
              _Nonnull
/usr/local/include/stdio.h:395:41: warning: pointer is missing a nullability type specifier (_Nonnull, _Nullable, or _Null_unspecified) [-Wnullability-completeness]
                 int (* _Nullable)(void *, const char *, int),
                                        ^
/usr/local/include/stdio.h:395:41: note: insert '_Nullable' if the pointer may be null
                 int (* _Nullable)(void *, const char *, int),
                                        ^
                                         _Nullable
/usr/local/include/stdio.h:395:41: note: insert '_Nonnull' if the pointer should never be null
                 int (* _Nullable)(void *, const char *, int),
                                        ^
                                         _Nonnull
/usr/local/include/stdio.h:395:55: warning: pointer is missing a nullability type specifier (_Nonnull, _Nullable, or _Null_unspecified) [-Wnullability-completeness]
                 int (* _Nullable)(void *, const char *, int),
                                                      ^
/usr/local/include/stdio.h:395:55: note: insert '_Nullable' if the pointer may be null
                 int (* _Nullable)(void *, const char *, int),
                                                      ^
                                                       _Nullable
/usr/local/include/stdio.h:395:55: note: insert '_Nonnull' if the pointer should never be null
                 int (* _Nullable)(void *, const char *, int),
                                                      ^
                                                       _Nonnull
/usr/local/include/stdio.h:396:44: warning: pointer is missing a nullability type specifier (_Nonnull, _Nullable, or _Null_unspecified) [-Wnullability-completeness]
                 fpos_t (* _Nullable)(void *, fpos_t, int),
                                           ^
/usr/local/include/stdio.h:396:44: note: insert '_Nullable' if the pointer may be null
                 fpos_t (* _Nullable)(void *, fpos_t, int),
                                           ^
                                            _Nullable
/usr/local/include/stdio.h:396:44: note: insert '_Nonnull' if the pointer should never be null
                 fpos_t (* _Nullable)(void *, fpos_t, int),
                                           ^
                                            _Nonnull
/usr/local/include/stdio.h:397:41: warning: pointer is missing a nullability type specifier (_Nonnull, _Nullable, or _Null_unspecified) [-Wnullability-completeness]
                 int (* _Nullable)(void *));
                                        ^
/usr/local/include/stdio.h:397:41: note: insert '_Nullable' if the pointer may be null
                 int (* _Nullable)(void *));
                                        ^
                                         _Nullable
/usr/local/include/stdio.h:397:41: note: insert '_Nonnull' if the pointer should never be null
                 int (* _Nullable)(void *));
                                        ^
                                         _Nonnull
/usr/local/include/stdio.h:393:6: warning: pointer is missing a nullability type specifier (_Nonnull, _Nullable, or _Null_unspecified) [-Wnullability-completeness]
FILE    *funopen(const void *,
        ^
/usr/local/include/stdio.h:393:6: note: insert '_Nullable' if the pointer may be null
FILE    *funopen(const void *,
        ^
          _Nullable
/usr/local/include/stdio.h:393:6: note: insert '_Nonnull' if the pointer should never be null
FILE    *funopen(const void *,
        ^
          _Nonnull
13 warnings generated.
Apple Metal GPU support successfully loaded
{"build":1500,"commit":"a30b324","function":"server_cli","level":"INFO","line":2869,"msg":"build info","tid":"34364426944","timestamp":1740806526}
{"function":"server_cli","level":"INFO","line":2872,"msg":"system info","n_threads":4,"n_threads_batch":-1,"system_info":"AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | ","tid":"34364426944","timestamp":1740806526,"total_threads":8}
llama_model_loader: loaded meta data with 21 key-value pairs and 291 tensors from Meta-Llama-3-8B-Instruct.Q5_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                          llama.block_count u32              = 32
llama_model_loader: - kv   2:                       llama.context_length u32              = 8192
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   5:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   6:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   7:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv   8:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv   9:                          general.file_type u32              = 17
llama_model_loader: - kv  10:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  11:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  13:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  14:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  16:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  17:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  18:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  19:                    tokenizer.chat_template str              = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv  20:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q5_K:  193 tensors
llama_model_loader: - type q6_K:   33 tensors
llm_load_vocab: special tokens definition check successful ( 256/128256 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 8192
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 8B
llm_load_print_meta: model ftype      = Q5_K - Medium
llm_load_print_meta: model params     = 8.03 B
llm_load_print_meta: model size       = 5.33 GiB (5.70 BPW)
llm_load_print_meta: general.name     = n/a
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128009 '<|eot_id|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
llm_load_tensors: ggml ctx size =    0.34 MiB
ggml_backend_metal_log_allocated_size: allocated buffer, size =  5115.48 MiB, ( 5115.55 / 10922.67)
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloaded 32/33 layers to GPU
llm_load_tensors:      Metal buffer size =  5115.48 MiB
llm_load_tensors:        CPU buffer size =  5459.93 MiB
........................................................................................
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M1
ggml_metal_init: picking default device: Apple M1
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: GGML_METAL_PATH_RESOURCES = nil
ggml_metal_init: loading '/Users/harikt/.llamafile/v/0.8.9/ggml-metal.metal'
ggml_metal_init: GPU name:   Apple M1
ggml_metal_init: GPU family: MTLGPUFamilyApple7  (1007)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_init: simdgroup reduction support   = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory              = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 11453.25 MB
llama_kv_cache_init:      Metal KV buffer size =   256.00 MiB
llama_new_context_with_model: KV self size  =  256.00 MiB, K (f16):  128.00 MiB, V (f16):  128.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.50 MiB
llama_new_context_with_model:      Metal compute buffer size =   164.00 MiB
llama_new_context_with_model:        CPU compute buffer size =   258.50 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 3
{"function":"initialize","level":"INFO","line":489,"msg":"initializing slots","n_slots":1,"tid":"34364426944","timestamp":1740806532}
{"function":"initialize","level":"INFO","line":498,"msg":"new slot","n_ctx_slot":2048,"slot_id":0,"tid":"34364426944","timestamp":1740806532}
{"function":"server_cli","level":"INFO","line":3090,"msg":"model loaded","tid":"34364426944","timestamp":1740806532}

llama server listening at http://127.0.0.1:8080

One question: do all llamafiles have embedding support, or do some not? (I know embedding is off by default and only works when --embedding is passed; I'm asking whether passing --embedding to any llamafile model should work.) I noticed many of the models crash on my end.
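
For reference, this is the minimal check I run against each llamafile (the file name here is a placeholder; I'm assuming every llamafile exposes the same --embedding flag and /embedding route):

./SomeModel.llamafile --server --embedding
curl -X POST http://localhost:8080/embedding \
  -H "Content-Type: application/json" \
  -d '{"content": "test"}'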

Version

Meta-Llama-3-8B-Instruct.Q5_K_M.llamafile (llamafile v0.8.9) works fine.

Llama-3.2-1B-Instruct.Q6_K.llamafile (llamafile v0.9.0) has the bug.

What operating system are you seeing the problem on?

Mac

Relevant log output

I am using an M1. I can provide more details if needed.
harikt commented Mar 1, 2025

I wonder whether the --embedding option was removed in v2. This is what I tried:

./Llama-3.2-1B-Instruct.Q6_K.llamafile --v2 --server
Apple Metal GPU support successfully loaded
2025-03-01T15:14:47.318523 llamafile/server/listen.cpp:41 server listen http://127.0.0.1:8080
2025-03-01T15:14:47.318589 llamafile/server/worker.cpp:143  warning: gpu mode disables pledge security
2025-03-01T15:14:54.200581 llamafile/server/client.cpp:679 52266 GET /embedding
curl -v "http://127.0.0.1:8080/embedding?content=hello+world"
*   Trying 127.0.0.1:8080...
* Connected to 127.0.0.1 (127.0.0.1) port 8080
* using HTTP/1.x
> GET /embedding?content=hello+world HTTP/1.1
> Host: 127.0.0.1:8080
> User-Agent: curl/8.11.1
> Accept: */*
>
* Request completely sent off
< HTTP/1.1 200 OK
< Server: llamafile/0.9.0
< Referrer-Policy: origin
< Cache-Control: private; max-age=0
< Date: Sat, 01 Mar 2025 09:44:54 GMT
< Content-Type: application/json
< X-Wall-Micros: 119904
< X-User-Micros: 0
< X-System-Micros: 0
< Content-Length: 27925
<
{
  "add_special": true,
  "parse_special": false,
  "tokens_provided": 3,
  "tokens_used": 3,
  "embedding": [0.009888176, 0.0065530497, -0.012054207, 0.007288858, 0.022060089, -0.0105772605, 0.025267405, 0.013900169, 0.0019114544, 0.019479051, -0.0054129953, 0.018611025, -0.01781048, 0.012734498, 0.

and I can see the embedding coming back. Is the /embedding endpoint a POST or a GET?
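
For comparison, the equivalent POST (using the JSON shape the older v0.8.9 server accepted; I haven't confirmed the v2 server parses the body the same way) would be:

curl -X POST http://127.0.0.1:8080/embedding \
  -H "Content-Type: application/json" \
  -d '{"content": "hello world"}'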
