
Releases: ggml-org/llama.cpp

b3353

09 Jul 12:11
9925ca4
cmake : allow external ggml (#8370)

b3347

08 Jul 19:32
2ec846d
sycl : fix powf call in device code (#8368)

b3345

08 Jul 10:52
sync : ggml

ggml-ci

b3342

08 Jul 09:05
470939d
common : preallocate sampling token data vector (#8363)

Calling `emplace_back` repeatedly is slower than preallocating the vector to the vocab size and inserting the data directly. Some rudimentary profiling with `chrono` shows this change improving the block of code from ~500us/op to ~40us/op.

Overall, this slightly improves sampling performance, with a more substantial impact on the `examples/lookahead` implementation -- I see a ~10% performance boost in lookahead inference.
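
The change amounts to sizing the candidate vector up front and writing each entry in place instead of growing the vector one `emplace_back` at a time. Below is a minimal, self-contained sketch of the pattern; the struct, vocab size, logits, and timing harness are stand-ins, not the actual `llama_token_data` code in llama.cpp:

```cpp
#include <chrono>
#include <cstdio>
#include <vector>

// Stand-in for llama_token_data (id, logit, p) -- illustration only,
// not the definition from llama.h.
struct token_data {
    int   id;
    float logit;
    float p;
};

int main() {
    const int n_vocab = 128000;                 // hypothetical vocab size
    std::vector<float> logits(n_vocab, 0.0f);   // dummy logits

    std::vector<token_data> cur;

    // Variant A: grow the vector one element at a time.
    auto t0 = std::chrono::steady_clock::now();
    cur.clear();
    for (int id = 0; id < n_vocab; ++id) {
        cur.emplace_back(token_data{id, logits[id], 0.0f});
    }
    auto t1 = std::chrono::steady_clock::now();

    // Variant B: size the vector once, then assign in place.
    cur.clear();
    cur.resize(n_vocab);
    for (int id = 0; id < n_vocab; ++id) {
        cur[id] = token_data{id, logits[id], 0.0f};
    }
    auto t2 = std::chrono::steady_clock::now();

    using us = std::chrono::microseconds;
    printf("emplace_back : %lld us\n", (long long) std::chrono::duration_cast<us>(t1 - t0).count());
    printf("resize+assign: %lld us\n", (long long) std::chrono::duration_cast<us>(t2 - t1).count());
    return 0;
}
```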

b3341

08 Jul 09:01
6f0dbf6
infill : assert prefix/suffix tokens + remove old space logic (#8351)

b3340

08 Jul 08:02
ffd0079
common : avoid unnecessary logits fetch (#8358)

b3334

07 Jul 15:03
f7cab35
gguf-hash: model wide and per tensor hashing using xxhash and sha1 (#…

b3333

07 Jul 14:39
905942a
llama : support glm3 and glm4 (#8031)

* add chatglm3-6b model support (Hugging Face model: https://hf-mirror.com/THUDM/chatglm3-6b)

Signed-off-by: XingXing Qiao <[email protected]>

* remove .rotary_pos_emb.inv_freq and unused code for the chatglm3 model

Signed-off-by: XingXing Qiao <[email protected]>

* fix lint error

Signed-off-by: XingXing Qiao <[email protected]>

* optimize convert-hf-to-gguf.py for chatglm model

Signed-off-by: XingXing Qiao <[email protected]>

* support glm-4-9b-chat

Signed-off-by: XingXing Qiao <[email protected]>

* fix eos tokens for glm4

* remove unused log

* add preprocess to chatglm3 and chatglm4

* add eos_id_list to llama.cpp

* fix code style

* fix code style

* fix conflicts

* fix conflicts

* Revert "add eos_id_list to llama.cpp"

This reverts commit 3a4d5790bfdc205c5b658204239f168fc21cc1a8.

* set <|endoftext|> as eos and <|user|> as eot

* fix chat template bug

* add comment to glm prefix and suffix

* fix conflicts and add rope_ratio & ChatGLMForConditionalGeneration

* fix chat template bug

* fix codestyle

* fix conflicts

* modified the general name of glm model

* fix conflicts

* remove prefix and suffix

* use normal glm4 chat template & use LLM_FFN_SWIGLU in phi3 (see the SwiGLU sketch after this change list)

* fix: resolve Flake8 errors in `convert-hf-to-gguf.py`

- Fix E302 by adding two blank lines before top-level function definitions
- Replace print statements to fix NP100
- Fix E303 by ensuring only one blank line between lines of code

* fix rope ratio to solve incorrect answers

* fix by comments

---------

Signed-off-by: XingXing Qiao <[email protected]>
Co-authored-by: XingXing Qiao <[email protected]>
Co-authored-by: Umpire2018 <[email protected]>
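
The SwiGLU feed-forward referenced by `LLM_FFN_SWIGLU` computes `down( silu(x·W_gate) ⊙ (x·W_up) )`. Below is a minimal plain-C++ sketch of that formula on dense vectors; it is an illustration only, not the ggml graph code (which may fuse the gate and up projections into a single matrix), and the dimensions and weights are made up:

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// silu(x) = x * sigmoid(x)
static float silu(float x) {
    return x / (1.0f + std::exp(-x));
}

// y = W * x, where W is (n_out x n_in) stored row-major.
static std::vector<float> matvec(const std::vector<float> & W, int n_out, int n_in,
                                 const std::vector<float> & x) {
    std::vector<float> y(n_out, 0.0f);
    for (int i = 0; i < n_out; ++i) {
        for (int j = 0; j < n_in; ++j) {
            y[i] += W[i*n_in + j] * x[j];
        }
    }
    return y;
}

// SwiGLU FFN: down( silu(gate(x)) * up(x) )
static std::vector<float> ffn_swiglu(const std::vector<float> & x,
                                     const std::vector<float> & W_gate,
                                     const std::vector<float> & W_up,
                                     const std::vector<float> & W_down,
                                     int n_embd, int n_ff) {
    std::vector<float> gate = matvec(W_gate, n_ff, n_embd, x);
    std::vector<float> up   = matvec(W_up,   n_ff, n_embd, x);
    for (int i = 0; i < n_ff; ++i) {
        gate[i] = silu(gate[i]) * up[i];   // element-wise gating
    }
    return matvec(W_down, n_embd, n_ff, gate);
}

int main() {
    const int n_embd = 4, n_ff = 8;        // toy sizes
    std::vector<float> x(n_embd, 0.5f);
    std::vector<float> W_gate(n_ff*n_embd, 0.01f);
    std::vector<float> W_up  (n_ff*n_embd, 0.02f);
    std::vector<float> W_down(n_embd*n_ff, 0.03f);

    for (float v : ffn_swiglu(x, W_gate, W_up, W_down, n_embd, n_ff)) {
        printf("%f\n", v);
    }
    return 0;
}
```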

b3332

07 Jul 14:38
b504008
llama : fix n_rot default (#8348)

ggml-ci

b3328

07 Jul 09:55
cb4d86c
server: Retrieve prompt template in /props (#8337)

* server: Retrieve prompt template in /props

This PR adds the following:
- Expose the model's Jinja2 prompt template in the /props endpoint.
- Change the log level from Error to Warning for the template-mismatch warning.

The front-end stands a better chance of executing the Jinja template correctly than the server, which is currently just guessing it.

Ideally this would live inside a JSON block that exposes the same key/value pairs as those listed during startup by the `llm_load_print_meta` function.

* Make string buffer dynamic

* Add doc and better string handling

* Using chat_template naming convention

* Use intermediate vector for string assignment (see the sketch below)
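
The "dynamic string buffer" and "intermediate vector" items above roughly correspond to querying the template's length first and then reading it into a `std::vector<char>` before building the `std::string`. A hedged sketch of that approach, assuming `llama_model_meta_val_str` returns the value length and a negative result for a missing key; the helper name below is made up and this is not the server's exact code:

```cpp
#include <string>
#include <vector>

#include "llama.h"

// Fetch the model's chat template from the GGUF metadata, sized dynamically.
static std::string get_chat_template(const struct llama_model * model) {
    const char * key = "tokenizer.chat_template";

    // First call: ask for the length only (assumed: negative means the key is missing).
    const int32_t len = llama_model_meta_val_str(model, key, nullptr, 0);
    if (len < 0) {
        return ""; // model has no embedded chat template
    }

    // Second call: read into an intermediate, dynamically sized buffer.
    std::vector<char> buf(len + 1, 0);
    llama_model_meta_val_str(model, key, buf.data(), buf.size());

    return std::string(buf.data()); // buffer is null-terminated
}
```

The resulting string can then be returned under the `chat_template` key of the `/props` JSON response, matching the naming convention adopted above.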