Releases · ggml-org/llama.cpp
b3327
added support for Authorization Bearer tokens when downloading model …
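A minimal sketch of what bearer-token authentication on a download request can look like with libcurl (which llama.cpp's download helpers use). `download_with_token` is a hypothetical helper for illustration, not the function added in this release, and the file-writing callback is omitted for brevity.

```cpp
// Hypothetical helper (not the repo's actual implementation): attach a
// bearer token to a model download request via libcurl.
#include <curl/curl.h>
#include <string>

bool download_with_token(const std::string & url, const std::string & token) {
    CURL * curl = curl_easy_init();
    if (!curl) {
        return false;
    }

    // The Authorization header carries the bearer token; the header list
    // must stay alive for the duration of the transfer.
    struct curl_slist * headers = nullptr;
    const std::string auth = "Authorization: Bearer " + token;
    headers = curl_slist_append(headers, auth.c_str());

    curl_easy_setopt(curl, CURLOPT_URL, url.c_str());
    curl_easy_setopt(curl, CURLOPT_HTTPHEADER, headers);
    curl_easy_setopt(curl, CURLOPT_FOLLOWLOCATION, 1L);
    // A CURLOPT_WRITEFUNCTION callback writing to a file is omitted here.

    const CURLcode res = curl_easy_perform(curl);

    curl_slist_free_all(headers);
    curl_easy_cleanup(curl);
    return res == CURLE_OK;
}
```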
b3325
llama : add early return for empty range (#8327)

* llama : add early return for empty range

  This commit adds an early return to the llama_kv_cache_seq_add and llama_kv_cache_seq_div functions. The motivation for adding this is to avoid looping over the cache when the range is empty. I ran into this when using the self-extend feature in main.cpp.

  Signed-off-by: Daniel Bevenius <[email protected]>

* llama : add static_cast to fix CI warning/error

  This commit attempts to fix the following warning/error:

  ```console
  src/llama.cpp:7271:31: error: comparison of integer expressions of different signedness: ‘int’ and ‘uint32_t’ {aka ‘unsigned int’} [-Werror=sign-compare]
   7271 | if (i < hparams.n_layer_dense_lead) {
        |     ~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~
  ```

  This can be reproduced locally by setting -Wsign-compare in the Makefile.

  Signed-off-by: Daniel Bevenius <[email protected]>

* squash! llama : add early return for empty range

  Remove the setting of cache.head to 0 when the range is empty.

  Signed-off-by: Daniel Bevenius <[email protected]>

* Update src/llama.cpp

---------

Signed-off-by: Daniel Bevenius <[email protected]>
Co-authored-by: Georgi Gerganov <[email protected]>
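An illustrative sketch of the early-return guard described above, over a simplified stand-in for the KV cache; the real llama_kv_cache_seq_add operates on internal cache state and also takes a sequence id.

```cpp
// Simplified model of the KV cache guard: skip the full pass over the
// cells when the requested position range [p0, p1) is empty.
#include <cstdint>
#include <vector>

struct kv_cell {
    int32_t pos = -1; // position of the token stored in this cell
};

// Shift the positions of all cells whose pos falls in [p0, p1) by delta.
static void kv_cache_seq_add(std::vector<kv_cell> & cells,
                             int32_t p0, int32_t p1, int32_t delta) {
    // Early return: an empty range means there is nothing to shift,
    // so avoid looping over the entire cache.
    if (p0 == p1) {
        return;
    }

    for (auto & cell : cells) {
        if (cell.pos >= p0 && cell.pos < p1) {
            cell.pos += delta;
        }
    }
}
```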
b3324
Detokenizer fixes (#8039)

* Add llama_detokenize():
  - Update header files location
  - UNKNOWN and CONTROL are 'special pieces'
  - Remove space after UNKNOWN and CONTROL
  - Refactor llama_token_to_piece()
  - Add flag: clean_up_tokenization_spaces
  - Symmetric params for llama_tokenize() and llama_detokenize()

* Update and fix tokenizer tests:
  - Use llama_detokenize()
  - Treat an unexpected vocab type as a test failure instead of an error; this is useful when automating tests where the vocab type is not known in advance, and to differentiate other loading errors
  - Skip unicode surrogates and undefined codepoints
  - Gracefully exit threads (using exit() was throwing random exceptions)
  - Clean up old known-problematic codepoints
  - Minor: fix a confusing hexadecimal codepoint

* Update brute-force random tests:
  - Add detokenizer checks
  - New generator: ascii_lr_strip
  - New generator: apostrophe
  - Add more vocab files
  - Detokenize special tokens
  - Replace errors with '\uFFFD' when detokenizing to 'utf-8'
  - More edge cases
  - Better checks of detokenization results

* Fix add_space_prefix, set false by default

* Better leading space removal

* Do not remove space when decoding special tokens

* Bugfix: custom regexes split undefined unicode codepoints

* 'viking' detokenizer: clean spaces
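A hedged sketch of the symmetric tokenize/detokenize round trip this change introduces. The exact llama_detokenize() argument list has varied across versions, so treat the remove_special/unparse_special flags and the negative-return "buffer too small" convention (mirroring llama_tokenize()) as assumptions here.

```cpp
// Assumed-era API: llama_detokenize(model, tokens, n_tokens, text,
// text_len_max, remove_special, unparse_special), returning the byte
// count on success or the negated required size on failure.
#include <string>
#include <vector>
#include "llama.h"

std::string detokenize(const llama_model * model,
                       const std::vector<llama_token> & tokens) {
    // Start with a guessed buffer; grow it if the first call reports
    // (as a negative value) that more space is needed.
    std::string text(tokens.size() * 8, '\0');
    int32_t n = llama_detokenize(model, tokens.data(), (int32_t) tokens.size(),
                                 text.data(), (int32_t) text.size(),
                                 /*remove_special  =*/ false,
                                 /*unparse_special =*/ false);
    if (n < 0) {
        text.resize(-n);
        n = llama_detokenize(model, tokens.data(), (int32_t) tokens.size(),
                             text.data(), (int32_t) text.size(),
                             false, false);
    }
    text.resize(n > 0 ? n : 0);
    return text;
}
```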
b3322
llama : fix compile warning (#8304)
b3317
CUDA: MMQ support for iq4_nl, iq4_xs (#8278)
b3316
CUDA: revert part of the RDNA1 optimizations (#8309)

The launch_bounds change was causing a small performance drop of 25 t/s when measuring perplexity.
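For context, __launch_bounds__ is the CUDA attribute in question: it caps the threads per block and hints a minimum number of resident blocks per multiprocessor, which lets the compiler trade registers for occupancy, a trade that can backfire on some architectures such as RDNA1. A generic illustration (CUDA extension to C++, not the reverted kernel itself):

```cpp
// Generic __launch_bounds__ example: at most 256 threads per block,
// with a hint that at least 2 blocks should be resident per SM. The
// compiler may spill registers to honor the occupancy hint, which can
// hurt performance on some hardware.
__global__ void __launch_bounds__(256, 2) scale_kernel(float * x, float s, int n) {
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        x[i] *= s;
    }
}
```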
b3315
llama : streamline embeddings from "non-embedding" models (#8087)
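A hedged sketch of reading a pooled sequence embedding after decoding; it assumes a context created with embeddings enabled and a pooling type set, and that llama_get_embeddings_seq() behaves as in this era of the API.

```cpp
// Assumes embeddings were requested at context creation and that a
// pooling type other than "none" is active, so one vector per sequence
// is available after llama_decode().
#include <vector>
#include "llama.h"

std::vector<float> get_sequence_embedding(llama_context * ctx,
                                          const llama_model * model,
                                          llama_seq_id seq_id) {
    const int n_embd = llama_n_embd(model);

    // Returns a pointer to the pooled embedding for this sequence,
    // or nullptr if none is available.
    const float * embd = llama_get_embeddings_seq(ctx, seq_id);
    if (embd == nullptr) {
        return {};
    }
    return std::vector<float>(embd, embd + n_embd);
}
```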
b3314
CUDA: fix MMQ stream-k rounding if ne00 % 128 != 0 (#8311)
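The fix concerns work partitioning when the row size ne00 is not a multiple of the 128-wide MMQ tile; the usual remedy is to round the extent up to the next multiple before dividing work among blocks, sketched here with hypothetical names:

```cpp
// Round-up-to-multiple arithmetic: when ne00 % 128 != 0, partitioning
// must be computed over the padded size, not ne00 itself.
#include <cstdio>

constexpr int round_up(int x, int multiple) {
    return ((x + multiple - 1) / multiple) * multiple;
}

int main() {
    const int ne00 = 4192; // not a multiple of 128
    printf("ne00 = %d -> padded = %d\n", ne00, round_up(ne00, 128));
    // prints: ne00 = 4192 -> padded = 4224
    return 0;
}
```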
b3311
llama : prefer n_ over num_ prefix (#8308)
b3309
[SYCL] Fix WARP_SIZE=16 bug on Intel GPUs (#8266)

* fix group_norm unit test
* split softmax
* fix softmax
* add concat support condition
* revert debug code
* move QK_WARP_SIZE to presets.hpp