Releases: ggml-org/llama.cpp
b4767
[SYCL] Optimize mul_mat for Q4_0 on Intel GPU (#12035)
* optimize performance via a reorder for Intel GPU
* detect hw type, save the opt feature, and print the opt feature
* correct name
* optimize the graph once when computing the graph, record the opt status in tensor->extra; makes CI pass
* add env variable GGML_SYCL_DISABLE_OPT for debugging
* use syclex::architecture to replace the custom hw define; update the guide for GGML_SYCL_DISABLE_OPT
* add performance data
* move the getrows functions to separate files
* fix global variables
---------
Co-authored-by: arthw <[email protected]>
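The debug switch above is a plain environment variable, so it can be toggled per invocation from the shell. A minimal sketch — the variable name comes from the change itself, while the commented binary invocation is illustrative only:

```shell
# Disable the SYCL Q4_0 reorder optimization for a debugging session.
export GGML_SYCL_DISABLE_OPT=1

# Any SYCL-enabled llama.cpp binary launched from this shell now skips
# the reorder path, e.g. (illustrative, requires a model file):
#   ./build/bin/llama-cli -m model.gguf -p "hello"

echo "GGML_SYCL_DISABLE_OPT=$GGML_SYCL_DISABLE_OPT"
```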
b4765
SYCL: Fix GGML_SYCL_DEBUG macro (#11995)
b4764
run: allow customizing the prompt via the env var LLAMA_PROMPT_PREFIX (#12041)
Signed-off-by: Florent Benoit <[email protected]>
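Since the prefix is read from the environment, it can be set per run without any flags. A minimal sketch — the prefix value and the commented model invocation are illustrative:

```shell
# Override the llama-run prompt prefix for this shell only.
export LLAMA_PROMPT_PREFIX="model> "

# Illustrative invocation (commented out; requires a model):
#   llama-run granite-code

echo "prefix is: $LLAMA_PROMPT_PREFIX"
```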
b4763
Some llama-run cleanups (#11973)
* Use the consolidated open function call from the File class.
* Rename read_all to to_string().
* Remove exclusive locking: the intent of that lock is to prevent multiple processes from writing to the same file, which is not an issue for readers, although we may want to consider adding a shared lock.
* Stop passing nullptr as a reference; references are never supposed to be null.
* clang-format the code for consistent styling.
Signed-off-by: Eric Curtin <[email protected]>
gg-ci-fix-arm-b4760-f343850
cont : fix archive name to use matrix
b4762
ggml-cpu: Support s390x SIMD Instruction Set (#12019)
* ggml: add s390x ARCH_FLAGS for compilation
* ggml: add SIMD for s390x using vector intrinsics — activated for ggml_vec_dot_f32, ggml_vec_dot_f16, ggml_vec_mad_f32, ggml_vec_mad_f16, ggml_vec_mad_f32_unroll, ggml_vec_scale_f32, ggml_vec_scale_f16; NOT activated for ggml_vec_dot_f16_unroll (pending bugfix)
* ggml: fix missing escape character in GGML_F32x4_REDUCE
* ggml: add temporary patch for GGML_F32_ARR and GGML_F16_ARR
* ggml: fix s390x GGML_F32x4_REDUCE
* ggml: full SIMD activation for F32, F16 on s390x
* ggml: add option to disable s390x VXE/VXE2
* ggml: change vecintrin.h include to ggml-cpu-impl; add __VXE__ and __VXE2__ macros
* cmake: add s390x target detection for VX/VXE/VXE2
* ggml: move s390x vector intrinsics to ggml-cpu-impl.h
* ggml: s390x Q8_0 SIMD
* ggml: correct documentation for Q8_0
* ggml: reduce code complexity for s390x Q8_0
* ggml: fix s390x typo in Q8_0
* ggml: s390x SIMD activation for Q4_1
* ggml: s390x inline vec_reve
* ggml: s390x SIMD activation for Q4_0
* ggml: add VXE backend feature
* ggml: remove test.py
* ggml: s390x SIMD activation for quantize_row_q8_0
* ggml: s390x SIMD activation for quantize_row_q8_1
* ggml: s390x SIMD activation for iq4_xs
* ggml: bugfix iq4_xs
* ggml: s390x SIMD activation for iq4_nl
* ggml: add float, double, and long vector data types
* ggml: clean up iq4_xs SIMD
* ggml: fix improper use of the restrict keyword
* ggml: update warning message for ggml_vec_tbl
* ggml: untested implementation of ggml_vec_dot_iq2_xxs_q8_K
* ggml: update ggml_vec_dot_q4_1_q8_1 to use typedefs
* ggml: switch to restrict for iq4_nl
* ggml: slight dot-product speed improvement for q4_1_q8_1
* ggml: s390x SIMD activation for q6_K
* ggml: add missing `_t` to ggml_int8x16x4_t
* ggml: fix missing `_t` for ggml_vec_xl_s8x4
* ggml: fix more missing `_t`
* ggml: add unroll and prefetch to Q8_0 (increase of 3.86% for prompt processing and 32.22% for token generation)
* ggml: patch Q8_0 to use proper vector sizes
* ggml: optimise Q8_0 dot product compute kernel further
* ggml: add unroll and prefetch to Q4_1
* ggml: refactor Q6_K variable naming for readability
* ggml: fix Q6_K typos
* ggml: s390x SIMD activation for Q5_K
* ggml: fix wrong char*x16_t naming
* ggml: fix wrong signedness of Q5_K y0
* ggml: fix Q5_K invalid uchar type
* ggml: s390x SIMD activation for Q4_K
* ggml: fix Q4_K invalid vector intrinsics
* ggml: simplify ggml_padd_s16 compute kernel
* ggml: correct ggml-cpu VXE wording
* ggml: change ggml_aligned_malloc alignment to 256 (the cache line size on s390x)
* ggml: resolve PR merge via cherry-pick 225bbbf
* ggml: fix LoongArch compile error with 128-bit SIMD (#11701)
* ggml: resolve PR merge via cherry-pick 4571953
* cmake: remove fork when determining s390x machine type (thank you @ericcurtin)
---------
Signed-off-by: Aaron Teo <[email protected]>
Co-authored-by: Jinyang He <[email protected]>
Co-authored-by: junchao-zhao <[email protected]>
b4761
CUDA: add option to compile without FlashAttention (#12025)
b4760
llava: build clip image from pixels (#11999)
* llava: export function `clip_build_img_from_pixels` to build an image from pixels decoded by other libraries, instead of stb_image.h, for better performance
* Apply suggestions from code review
---------
Co-authored-by: Xuan-Son Nguyen <[email protected]>
b4759
ci : fix arm upload artifacts (#12024)
* ci : fix arm upload artifacts
* cont : fix archive name to use matrix
b4756
server : disable Nagle's algorithm (#12020)