Releases · ggml-org/llama.cpp
b4799
main: use jinja chat template system prompt by default (#12118)
* Use jinja chat template system prompt by default
* Reorder conditionals for faster evaluation
* Remove nested ternary
Co-authored-by: Xuan Son Nguyen <[email protected]>
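For context, a minimal sketch of how a system message is rendered through a chat template via llama.cpp's C API. This is not the code from #12118; it assumes the `llama_chat_apply_template()` signature from `llama.h` around this release, and it passes the built-in "chatml" template name only so the example is self-contained (in-tree code would normally use the template embedded in the model).

```c
#include <stdio.h>
#include "llama.h"

int main(void) {
    // role/content pairs; the system message is what #12118 now renders
    // through the model's jinja chat template by default
    llama_chat_message chat[] = {
        { "system", "You are a helpful assistant." },
        { "user",   "Hello!" },
    };
    char buf[1024];
    // "chatml" is one of the template names the built-in matcher recognizes
    int32_t n = llama_chat_apply_template("chatml", chat, 2,
                                          /*add_ass=*/true, buf, (int32_t) sizeof(buf));
    if (n > 0 && n < (int32_t) sizeof(buf)) {
        printf("%.*s\n", n, buf);
    }
    return 0;
}
```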
b4798
main: update outdated system prompt message (followup to #12131) (#12…
b4797
common : add --system-prompt parameter, replace behavior of -p in con…
b4796
CUDA: compress mode option and default to size (#12029)
CUDA 12.8 added the option to specify stronger compression for binaries, so we now default to "size".
b4793
ggml : upgrade init_tensor API to return a ggml_status (#11854)
* Upgrade the init_tensor API to return a ggml_status, to prepare for an "abort-free" ggml (one that returns an OOM status rather than aborting on OOM), as agreed with Diego in the ggml repo; this changes the init_tensor() and view_init() APIs to return a ggml_status.
* Misc fixes
Co-authored-by: slaren <[email protected]>
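A hedged sketch of the new calling convention from the caller's side. It assumes `ggml_backend_buffer_init_tensor()` is the public entry point that gained the `ggml_status` return type, and that `ggml_status_to_string()` is available for diagnostics; the helper name `place_tensor` is hypothetical.

```c
#include <stdbool.h>
#include <stdio.h>
#include "ggml.h"
#include "ggml-backend.h"

// Hypothetical helper: returns true when the tensor was successfully
// placed in the buffer, instead of relying on ggml to abort on failure.
static bool place_tensor(ggml_backend_buffer_t buffer, struct ggml_tensor * t) {
    enum ggml_status st = ggml_backend_buffer_init_tensor(buffer, t);
    if (st != GGML_STATUS_SUCCESS) {
        // e.g. GGML_STATUS_ALLOC_FAILED on OOM: callers can now recover
        fprintf(stderr, "init_tensor failed: %s\n", ggml_status_to_string(st));
        return false;
    }
    return true;
}
```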
b4792
llama : add Phi-4-mini support (supersedes #12099) (#12108)
* Added Phi-4-mini-instruct support
* Update regex per ngxson
* Change the vocab base to Xenova/gpt-4o
* Fix the conversion update script
* No need to check longrope
* Minor style fix
* Fix Python style
Co-authored-by: Nicholas Sparks <[email protected]>
b4790
vulkan: add specific MMV kernels for IQ2 and IQ3 quants + optimizatio…
b4789
CUDA: fix logic for V100 + GGML_CUDA_FORCE_MMQ (#12098)
b4788
ggml: aarch64: implement SVE kernels for q2_k_q8_k vector dot (#12064)
* Added SVE support for Q2_K quantized models
* Use 4-space indentation in the switch cases
* Removed comment lines
* Remove the loop; retain the curly braces for readability
* Remove the comment line added for the q3_k_q8_k kernel
Co-authored-by: vithulep <[email protected]>
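To illustrate the shape of such a kernel, here is a plain f32 dot product written in the SVE predicated-loop style (a sketch, not the actual kernel from #12064): the real q2_K·q8_K kernel additionally unpacks 2-bit quantized blocks and applies per-block scales, which this omits. Requires an SVE-capable toolchain (e.g. compile with -march=armv8.2-a+sve).

```c
#include <arm_sve.h>

// Vector-length-agnostic dot product: svwhilelt builds a predicate that
// masks off the tail lanes, so no scalar remainder loop is needed.
float dot_f32_sve(const float * a, const float * b, int n) {
    svfloat32_t acc = svdup_n_f32(0.0f);
    for (int i = 0; i < n; i += (int) svcntw()) {
        svbool_t    pg = svwhilelt_b32_s32(i, n);
        svfloat32_t va = svld1_f32(pg, a + i);
        svfloat32_t vb = svld1_f32(pg, b + i);
        acc = svmla_f32_m(pg, acc, va, vb);  // acc += va * vb on active lanes
    }
    return svaddv_f32(svptrue_b32(), acc);   // horizontal sum across lanes
}
```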
b4786
vulkan: matmul dequantization improvements (#12015)
* Faster dequant for old quants
* Don't use unpack for iq4_nl
* vec2 unpack for q8