Releases · ggml-org/llama.cpp
b4799
main: use jinja chat template system prompt by default (#12118)
* Use jinja chat template system prompt by default
* Reorder conditionals for faster evaluation
* Remove nested ternary
Co-authored-by: Xuan Son Nguyen <[email protected]>
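For context, a minimal sketch of how a system message is rendered through a chat template via llama.cpp's C API. This is not the code from #12118; it assumes the `llama_chat_apply_template()` signature from `llama.h` around this release, and it passes the built-in "chatml" template name only so the example is self-contained (in-tree code would normally use the template embedded in the model).

```c
#include <stdio.h>
#include "llama.h"

int main(void) {
    // role/content pairs; the system message is what #12118 now renders
    // through the model's jinja chat template by default
    llama_chat_message chat[] = {
        { "system", "You are a helpful assistant." },
        { "user",   "Hello!" },
    };
    char buf[1024];
    // "chatml" is one of the template names the built-in matcher recognizes
    int32_t n = llama_chat_apply_template("chatml", chat, 2,
                                          /*add_ass=*/true, buf, (int32_t) sizeof(buf));
    if (n > 0 && n < (int32_t) sizeof(buf)) {
        printf("%.*s\n", n, buf);
    }
    return 0;
}
```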
b4798
main: update outdated system prompt message (followup to #12131) (#12…
b4797
common : add --system-prompt parameter, replace behavior of -p in con…
b4796
CUDA: compress mode option and default to size (#12029)
CUDA 12.8 added the option to specify stronger compression for binaries, so we now default to "size".
b4793
ggml : upgrade init_tensor API to return a ggml_status (#11854)
* Upgrade the init_tensor API to return a ggml_status, to prepare for an "abort-free" ggml (one that returns an OOM status rather than aborting on OOM), as agreed with Diego in the ggml repo; this changes the init_tensor() and view_init() APIs to return a ggml_status.
* Misc fixes
Co-authored-by: slaren <[email protected]>
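A hedged sketch of the new calling convention from the caller's side. It assumes `ggml_backend_buffer_init_tensor()` is the public entry point that gained the `ggml_status` return type, and that `ggml_status_to_string()` is available for diagnostics; the helper name `place_tensor` is hypothetical.

```c
#include <stdbool.h>
#include <stdio.h>
#include "ggml.h"
#include "ggml-backend.h"

// Hypothetical helper: returns true when the tensor was successfully
// placed in the buffer, instead of relying on ggml to abort on failure.
static bool place_tensor(ggml_backend_buffer_t buffer, struct ggml_tensor * t) {
    enum ggml_status st = ggml_backend_buffer_init_tensor(buffer, t);
    if (st != GGML_STATUS_SUCCESS) {
        // e.g. GGML_STATUS_ALLOC_FAILED on OOM: callers can now recover
        fprintf(stderr, "init_tensor failed: %s\n", ggml_status_to_string(st));
        return false;
    }
    return true;
}
```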
b4792
llama : add Phi-4-mini support (supersedes #12099) (#12108)
* Added Phi-4-mini-instruct support
* Update regex per ngxson
* Change the vocab base to Xenova/gpt-4o
* Fix the conversion update script
* No need to check longrope
* Minor style fix
* Fix Python style
Co-authored-by: Nicholas Sparks <[email protected]>
b4790
vulkan: add specific MMV kernels for IQ2 and IQ3 quants + optimizatio…
b4789
CUDA: fix logic for V100 + GGML_CUDA_FORCE_MMQ (#12098)
b4788
ggml: aarch64: implement SVE kernels for q2_k_q8_k vector dot (#12064)
* Added SVE support for Q2_K quantized models
* Use 4-space indentation in the switch cases
* Removed comment lines
* Remove the loop; retain the curly braces for readability
* Remove the comment line added for the q3_k_q8_k kernel
Co-authored-by: vithulep <[email protected]>
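To illustrate the shape of such a kernel, here is a plain f32 dot product written in the SVE predicated-loop style (a sketch, not the actual kernel from #12064): the real q2_K·q8_K kernel additionally unpacks 2-bit quantized blocks and applies per-block scales, which this omits. Requires an SVE-capable toolchain (e.g. compile with -march=armv8.2-a+sve).

```c
#include <arm_sve.h>

// Vector-length-agnostic dot product: svwhilelt builds a predicate that
// masks off the tail lanes, so no scalar remainder loop is needed.
float dot_f32_sve(const float * a, const float * b, int n) {
    svfloat32_t acc = svdup_n_f32(0.0f);
    for (int i = 0; i < n; i += (int) svcntw()) {
        svbool_t    pg = svwhilelt_b32_s32(i, n);
        svfloat32_t va = svld1_f32(pg, a + i);
        svfloat32_t vb = svld1_f32(pg, b + i);
        acc = svmla_f32_m(pg, acc, va, vb);  // acc += va * vb on active lanes
    }
    return svaddv_f32(svptrue_b32(), acc);   // horizontal sum across lanes
}
```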
b4786
vulkan: matmul dequantization improvements (#12015)
* Faster dequant for old quants
* Don't use unpack for iq4_nl
* vec2 unpack for q8