
Releases: ggml-org/llama.cpp

b3327 (06 Jul 21:20, 86e7299)
added support for Authorization Bearer tokens when downloading model …
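As a sketch of the mechanism the title names: llama.cpp performs model downloads via libcurl when built with LLAMA_CURL, and a bearer token travels in a standard HTTP header. The helper below is illustrative only (the name `add_bearer_header` and the wiring are assumptions, not the commit's code):

```cpp
#include <curl/curl.h>

#include <string>

// Hypothetical helper, not from the commit: attach an Authorization
// header to a libcurl handle before starting the model download.
void add_bearer_header(CURL * curl, curl_slist *& headers, const std::string & token) {
    headers = curl_slist_append(headers, ("Authorization: Bearer " + token).c_str());
    curl_easy_setopt(curl, CURLOPT_HTTPHEADER, headers);
}
```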

b3325 (06 Jul 08:31, 87e25a1)
llama : add early return for empty range (#8327)

* llama : add early return for empty range

This commit adds an early return to the llama_kv_cache_seq_add and
llama_kv_cache_seq_div functions.

The motivation for adding this is to avoid looping over the cache
when the range is empty. I ran into this when using the self-extend
feature in main.cpp.
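A minimal sketch of the guard (the types and the p1 < 0 convention follow llama.cpp, but the cache walk is elided and this is not the verbatim diff):

```cpp
#include <cstdint>
#include <limits>

using llama_pos = int32_t;

// Clamp the range as the real functions do, then bail out before
// touching any cache cell when the range [p0, p1) is empty.
void seq_shift_sketch(llama_pos p0, llama_pos p1, llama_pos delta) {
    if (p0 < 0) p0 = 0;
    if (p1 < 0) p1 = std::numeric_limits<llama_pos>::max();

    if (p0 == p1) return; // empty range: nothing to do

    (void) delta; // applied by the elided loop over cells in [p0, p1)
}
```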

Signed-off-by: Daniel Bevenius <[email protected]>

* llama : add static_cast to fix CI warning/error

This commit attempts to fix the following warning/error:

```console
src/llama.cpp:7271:31: error:
comparison of integer expressions of different signedness:
‘int’ and ‘uint32_t’ {aka ‘unsigned int’} [-Werror=sign-compare]
 7271 |                         if (i < hparams.n_layer_dense_lead) {
      |                             ~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~
```
This can be reproduced locally by setting -Wsign-compare in the
Makefile.
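The conventional fix, and presumably what the added static_cast does here (a sketch of the pattern, not the exact diff), is to give both operands the same signedness:

```cpp
#include <cstdint>

struct hparams_sketch {
    uint32_t n_layer_dense_lead; // unsigned, as in llama.cpp's hparams
};

// Casting the unsigned field to the index's type (or the index to
// uint32_t) silences -Wsign-compare without changing the comparison.
bool leads_dense_ffn(int i, const hparams_sketch & hp) {
    return i < static_cast<int>(hp.n_layer_dense_lead);
}
```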

Signed-off-by: Daniel Bevenius <[email protected]>

* squash! llama : add early return for empty range

Remove the setting of cache.head to 0 when the range is empty.

Signed-off-by: Daniel Bevenius <[email protected]>

* Update src/llama.cpp

---------

Signed-off-by: Daniel Bevenius <[email protected]>
Co-authored-by: Georgi Gerganov <[email protected]>

b3324 (05 Jul 17:54, 213701b)
Detokenizer fixes (#8039)

* Add llama_detokenize():
  - Update header file locations
  - UNKNOWN and CONTROL are 'special pieces'
  - Remove space after UNKNOWN and CONTROL
  - Refactor llama_token_to_piece()
  - Add flag: clean_up_tokenization_spaces
  - Symmetric params for llama_tokenize() and llama_detokenize()

* Update and fix tokenizer tests:
  - Using llama_detokenize()
  - Treat an unexpected vocab type as a test failure instead of an error
    - Useful when automating tests:
    - If you don't know the vocab type in advance
    - Differentiates this case from other loading errors
  - Skip Unicode surrogates and undefined codepoints
  - Gracefully exit threads
    - Using exit() was throwing random exceptions
  - Clean up old known-problematic codepoints
  - Minor: fix a confusing hexadecimal codepoint

* Update bruteforce random tests
  - Add detokenizer checks
  - New generator: ascii_lr_strip
  - New generator: apostrophe
  - Add more vocabs files
  - Detokenize special tokens
  - Replace errors with '\uFFFD' when detokenizing to 'utf-8'
  - More edge cases
  - Better detokenization results check

* Fix add_space_prefix, set to false by default
* Better leading space removal
* Do not remove space when decoding special tokens
* Bugfix: custom regexes split undefined Unicode codepoints
* Clean spaces in the 'viking' detokenizer
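For context, a sketch of the round-trip these tests exercise. It assumes llama_detokenize() mirrors llama_tokenize() (the symmetric params noted above) and returns the required buffer size as a negative value when the caller's buffer is too small; the exact signature in the merged PR may differ:

```cpp
#include "llama.h"

#include <cstdint>
#include <string>
#include <vector>

// Detokenize into a std::string, growing the buffer once if needed.
std::string detokenize_sketch(const llama_model * model,
                              const std::vector<llama_token> & tokens,
                              bool special) {
    std::string text(tokens.size() * 8 + 16, '\0'); // heuristic first guess
    int32_t n = llama_detokenize(model, tokens.data(), (int32_t) tokens.size(),
                                 text.data(), (int32_t) text.size(),
                                 /*remove_special=*/false,
                                 /*unparse_special=*/special);
    if (n < 0) { // buffer too small: -n is the exact size required
        text.resize(-n);
        n = llama_detokenize(model, tokens.data(), (int32_t) tokens.size(),
                             text.data(), (int32_t) text.size(), false, special);
    }
    text.resize(n > 0 ? n : 0);
    return text;
}
```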

b3322 (05 Jul 16:05, 7ed03b8)
llama : fix compile warning (#8304)

b3317 (05 Jul 10:38, 8e55830)
CUDA: MMQ support for iq4_nl, iq4_xs (#8278)

b3316 (05 Jul 10:38, 0a42380)
CUDA: revert part of the RDNA1 optimizations (#8309)

The change to launch_bounds was causing a small performance drop of 25 t/s in perplexity evaluation.

b3315 (05 Jul 10:23, d12f781)
llama : streamline embeddings from "non-embedding" models (#8087)

b3314 (05 Jul 10:01, bcefa03)
CUDA: fix MMQ stream-k rounding if ne00 % 128 != 0 (#8311)

b3311 (05 Jul 07:06, aa5898d)
llama : prefer n_ over num_ prefix (#8308)

b3309 (05 Jul 05:56, a9554e2)
[SYCL] Fix WARP_SIZE=16 bug of Intel GPU (#8266)

* fix group_norm unit test

* split softmax

* fix softmax

* add concat support condition

* revert debug code

* move QK_WARP_SIZE to presets.hpp