Opinions about template based GEMM kernels #3965

mingfeima · 2023-11-06T06:07:02Z

mingfeima
Nov 6, 2023
Collaborator

Hi, this is Mingfei from intel pytorch team and we want to help optimize the performance of llama.cpp on intel hardware. I need some guidelines about how to make contributions in this project:

Firstly about the intel Xe GPU: the programming language is SYCL and also we have a template based GEMM solution called XeTLA (you can consider they are counterparts to cuda and cutlass). So I was wondering is it proper to use XeTLA in this project, or is there any plan to integrate nvidia's cutlass?
Second question is about the x86 CPU optimization: is it proper to use jitted kernels in this project? e.g. xbyak which is a C++ JIT assembler for x86. Some of my colleagues already did some good work in jblas which shows a pretty good result.

Any opinion is welcome :) feel free to comment so that we can find the most proper manner to contribute.

ggerganov · 2023-11-06T08:02:22Z

ggerganov
Nov 6, 2023
Maintainer

Hi, thanks for bringing up the discussion. I'll try to outline some of the practices that we have followed so far to accommodate different backends into ggml / llama.cpp and hopefully through discussion we can find the best way to support Intel GPUs and potentially JIT kernels.

Background

The project started out with ggml.h + ggml.c implementing the core computation engine (matrix operations, activation functions, inference graph construction and processing, etc.) with initial focus on CPU-only inference. SIMD support (ARM, AVX, WASM, etc.) was added by implementing custom routines for matrix multiplication:

https://github.com/ggerganov/llama.cpp/blob/2833a6f63c1b87c7f4ac574bcf7a15a2f3bf3ede/ggml.c#L1102-L1137

Note that these are naive dot product implementations, without any advanced GEMM optimizations.

Later, support for OpenBLAS and other BLAS CPU libraries was added directly in ggml.c by calling F32 GEMM when possible and appropriate:

https://github.com/ggerganov/llama.cpp/blob/2833a6f63c1b87c7f4ac574bcf7a15a2f3bf3ede/ggml.c#L9498-L9504

At some point, the following idea for adding GPU support to ggml was proposed: #915
For me it was important to keep the GPU code highly separated from the core ggml implementation because at that time I didn't have a good understanding about what would be necessary to support different GPU backends and was also worried that adding GPU-related code in ggml would increase the technical debt tremendously. Therefore, the idea was to keep ggml.c oblivious to the GPU stuff and use it just to produce computation graphs (i.e. the inference steps) which would then be passed to a GPU backend for processing. The Metal backend was the prime example of this idea: #1642. Other backends, such as CUDA and OpenCL followed, so we ended up with the current state:

Metal backend: ggml-metal.h + ggml-metal.m
CUDA backend: ggml-cuda.h + ggml-cuda.cu
OpenCL backend: ggml-opencl.h + ggml-opencl.cpp
Vulkan backend: in the works (Vulkan Implementation #2059)

The existing backend implementations, even though mostly decoupled from the core ggml code, still rely on multiple hacks and custom tricks to be able to function properly. This prevents a more widespread adoption in other ggml-based projects so a more tightly integrated solution is necessary. @slaren was able to identify some of the existing problems with the backends and has been leading the effort for implementing a new backend interface which is better integrated with the core ggml library (ggml-org/ggml#547, ggml-org/ggml#586). The goal is in the future, all backend implementations to utilize the newly proposed interface in ggml-backend so that we have more unified and seamless way for GPU support.

This is a short background and overview of how we support various backends in ggml that I think is useful to know when discussing future integrations.

Back to the specific questions:

So I was wondering is it proper to use XeTLA in this project, or is there any plan to integrate nvidia's cutlass?

In case XeTLA is something similar to BLAS, then it could be integrated straight into ggml.c, given that it provides GEMM API that we can call (F32 required, and F16 would be great). If XeTLA is something that can be used in a similar way as we use CUDA for example (i.e. where we can write custom kernels for different ops), then we can think about integrating it as a new backend in ggml. I'm not familiar with cutlass, so would defer this to some of the collaborators who understand better.

Note that if we decide to integrate it as a custom backend, I would like to have all the implementation contained in 1 or 2 files similar to the existing backends. It can of course include 3rd party libs (as we do with CUDA, Metal, etc.) but the ggml related code has to be compact and localized in small amount of code.

is it proper to use jitted kernels in this project?

I understand the basic principle of JIT-ing, but I don't have experience with implementing this technique. If it is something that we can write in pure C and help to optimize the existing SIMD routines in ggml without introducing a huge amount of bloat, we can think about adding it to the library. But I think I would need to see some sort of "hello world" for jitted ggml to make a better decision.

4 replies

mingfeima Nov 8, 2023
Collaborator Author

@ggerganov thanks for the detailed reply! It really helps us to understand the principles to contribute in this project.

For Intel Xe GPU, we will stick to current pattern similar to other backends (maybe like this): ggml-sycl.h + ggml-sycl.cpp.

For the CPU part, the optimization can be done in multiple ways. After some internal discussion, we propose 3 options:

Option-1: Use jblas and refactor the source code into `ggml-jblas.h` + `ggml-jblas.cpp`:

jblas is a lightweight, header-only acceleration library for high-performance GEMM and related computations on Intel platform. The library is inspired by Cutlass, providing high-level template class abstractions for various elements required for computation.

The source code of jblas is here containing the following files:

jit_base.hpp
jit_blas.h
jit_blas_epilogue.h
jit_blas_gemm.h
jit_blas_transformer.h
jit_blas_utils.h
jit_blas_weight_compression.h
jit_blas_wrapper.h
kernel_avx2.h
kernel_avx512_bf16.h
kernel_avx512f.h
kernel_jit.h
kernel_jit_injector.h
kernel_ref.h
kernel_wrapper.h

Also xbyak (this is c++ jit assembler) is needed, the source code is here, it contains the following headers:

xbyak.h
xbyak_bin2hex.h
xbyak_mnemonic.h
xbyak_util.h

And a minimal example of using jblas is like:

/// unit test for int4 gemm
using GEMMKernel = wrapper::transformer_default::weight_comp::avx512_vnni::QKVGemmDynamicS4Fp32KBlock;
using PrologueB = GEMMKernel::WeightType;
GEMMKernel kernel;
auto packedw = kernel.getWeightPtr()->createStorage(n, k, blocksize);
avector<int8_t> buffer(packedw.mSize);
packedw.assign(buffer.data());
kernel.getWeightPtr()->packQWeight(n, k, quanW.data(), ldb, scales.data(), nullptr, &packedw);
std::vector<PrologueB::Param> bparams{{&packedw}, {&packedw}, {&packedw}};
auto quanA = kernel.getActivationPtr()->createStorage(m, k, blocksize);
avector<int8_t> bufA(quanA.mSize);
quanA.assign(bufA.data());
kernel.getActivationPtr()->quantize({_data.matA.data(), lda, &quanA}, m, k);

ut::buffer_error(quanA.template get<uint8_t>(), matAu8.data(), matAu8.size());
ut::buffer_error(quanA.mZPtr, matAzp.data(), matAzp.size());
ut::buffer_error(quanA.mSPtr, AScales.data(), AScales.size(), 0.001f);
int batch = 3;
aligned_vector<float> matCBatch(3 * m * n);
std::vector<GEMMKernel::CParam> cparams(batch);
for (size_t i = 0; i < batch; i++) {
  cparams[i].ldc = ldc;
  cparams[i].C = matCBatch.data() + i * m * n;
}
GEMMKernel::Arguments args{m, n, k, batch, _data.matA.data(), lda, &quanA, bparams.data(), cparams.data(), NULL};

kernel.compute(args);

The idea is to refactor all the source code into ggml-jblas.h and ggml-jblas.cpp, the might be against the current pattern of ggml+backend since jblas is for the cpu backend. But put the jblas source code into ggml.c is not impossible since it requires a c++ jit assembler.

Option-2: Use jblas as a third party library (git submodule)

We fully understand that using 3rd party library is not the best practice in the project of llamma.cpp and ggml, but have to admit the fact adding entire jblas source code would not be considered as a minor change even though jblas is light weighted. Including it as a 3rd party library would alleviate the burden from ggml, so we put this as an option here for you to make the decision.

Option-3: Direct optimize GEMM kernels with intrinsics

This would be the old-fashioned manner to get the job done, which is directly use intrinsics to write kernels (for bfloat16, we can still use mkl, e.g. cblas_gemm_bf16bf16f32). The following is a simple example doing bf16 * int4 -> fp32 gemm with avx512, code link here. This kernel is used in our product intel-extension-for-pytorch, accelerating WOQ int4, it's just doing tiling, dequant, fma with intrincics:

static_assert(COLS == 4, "expect register block size 4 for weights");
_mm_prefetch(ADDRESS(B, k + PREFETCH_K_DIST, 0, ldb / 2), _MM_HINT_T0);
// load 64 elements from ADDRESS(B, k, 0 ldb / 2) with 4-bit each
// and then, unpack them and convert them into 64 fp32 numbers held in
// four avx512 registers: vb[0] - vb[3]
// Load the buffer into a 256-bit register
__m256i packed = _mm256_load_si256((__m256i*)ADDRESS(B, k, 0, ldb / 2));
__m512i int32[4];
{
  auto low_4bit = _mm512_cvtepu8_epi32(_mm256_castsi256_si128(packed));
  auto high_4bit = _mm512_srli_epi32(low_4bit, 4);
  int32[0] = low_4bit;
  int32[2] = high_4bit;
}
{
  auto low_4bit =
      _mm512_cvtepu8_epi32(_mm256_extracti128_si256(packed, 1));
  auto high_4bit = _mm512_srli_epi32(low_4bit, 4);
  int32[1] = low_4bit;
  int32[3] = high_4bit;
}

auto dequant_int32 = [&](auto idx) {
  vb[idx] = _mm512_permutexvar_ps(int32[idx], lut);
  vb[idx] = _mm512_sub_ps(vb[idx], float_zero_point[idx]);
  vb[idx] = _mm512_mul_ps(vb[idx], float_scale[idx]);
};
compile_time_for<COLS>::op(dequant_int32);

Please advice!

ggerganov Nov 9, 2023
Maintainer

Option 1 does not seem possible as it would be too much external code added to the project, so Option 2 seems better.

Option 3:

We don't support BF16 - only FP16.
If we want to support GEMM with quantized data, it will need to be implemented for the quantization types available in ggml, which I believe in general are different compared to other quantization implementations. For example, when you say int4 it is likely different from the 4-bit quantizations that we have in ggml. At the very least, the memory layout is different.

Our experience so far indicates that weight-only quantization is inferior in terms of performance compared to both weight and activation quantization. That is why, we quantize the activations to 8-bits and dequantize the weights to 8-bits and perform 8-bit dot-product based matrix multiplication. Implementing this as a proper GEMM routine (i.e. with proper CPU cache utilization) would be beneficial and will likely improve the batched decoding and prefill performance on the CPU, but my intuition is that we have to stick to integer-based operations (i.e. avoid unpacking to FP32 for example)

So for me, it makes sense to implement option 2 and add FP16 and FP32 jblas-based GEMM and see how this works out without any quantization involved. Maybe compare it to the existing OpenBLAS and MKL results. Having this available, it will automatically enable WOQ because ggml will dequantize the weights and call the external GEMM routine:

https://github.com/ggerganov/llama.cpp/blob/2833a6f63c1b87c7f4ac574bcf7a15a2f3bf3ede/ggml.c#L9460-L9511

I expect that dequantizing to FP32 and doing GEMM will likely remain slower as it is now, but dequantizing to FP16 and performing FP16 GEMM (assuming that this is implemented efficiently) might lead to some performance gains, so it would be interesting to see.

Adding quantized GEMM looks like a more serious undertaking. But in any case, it has to take into account the quantization types that we have in ggml and perform quantization both of the weights and the activations. See the dot product implementations in ggml-quants.c for a starting point. I think even without JIT, the performance can be improved with a better tiling implementation, compared to what we currently have:

https://github.com/ggerganov/llama.cpp/blob/2833a6f63c1b87c7f4ac574bcf7a15a2f3bf3ede/ggml.c#L9570-L9616

mingfeima Nov 9, 2023
Collaborator Author

Oh, just for clarify: option-3 is not intended for proposing WOQ to ggml (we also know that it is not the fastest way), it's just that we don't have any int4-int4 gemm kernels written in intrinsics at the moment. The idea is to directly write the kernels in intrincics.

Anyway, if we can use option-2, that would be most convinient. @airMeng could you please provide benchmark data on jblas v.s. mkl ?

airMeng Nov 9, 2023
Collaborator

Note this is only comparison of gemm since MKL not support int4 or any other data types lower than 8 bits.

Conducted on Xeon 8480+, compared with oneMKL 2023.1

			AMX_INT8			AVX512_VNNI			AMX_BF16			AVX512_F
Batch	MxNxK	threads number	onemkl	jblas	scale	onemkl	jblas	scale	onemkl	jblas	scale	onemkl	jblas	scale
64	1024x4096x4096	32x1	14004	36149	2.581333905	11595	13594	1.172401897	8013	17814	2.223137402	2892	3144	1.087136929
		48x1	13336	25169	1.88729754	12412	15433	1.24339349	8333	11901	1.428177127	3363	3843	1.142729706
		56x1	18555	35130	1.893290218	13830	18729	1.354229935	11376	18746	1.647855134	4396	4906	1.116014559
32 16(f32 bf16)	4096x4096x4096	32x1	26574	37774	1.421464589	17057	19493	1.142815266	13983	20938	1.497389687	3594	3877	1.078742348
		48x1	30186	52510	1.739548135	20244	23481	1.159899229	15727	24615	1.565142748	5289	5537	1.046889771
		56x1	41114	66636	1.620761784	22337	26381	1.181044903	16928	33357	1.970522212	6202	6097	0.983069977
16	1024x16384x4096	32x1	26971	38098	1.412554225	17538	17722	1.010491504	16248	18615	1.145679468	2898	3538	1.22084196
		48x1	32044	51528	1.608038946	19412	18783	0.967597362	18207	23932	1.314439501	3596	4553	1.266129032
		56x1	33223	44859	1.350239292	19675	21891	1.112630241	17820	20411	1.145398429	4166	4952	1.188670187

airMeng · 2023-11-08T08:00:53Z

airMeng
Nov 8, 2023
Collaborator

Add some comments might help

And a minimal example of using jblas is like:

/// unit test for int4  QKV gemm
// dynamic quantized activation, int4 static quantized weight, avx512vnni gemm, float32 scale
using GEMMKernel = wrapper::transformer_default::weight_comp::avx512_vnni::QKVGemmDynamicS4Fp32KBlock;
using PrologueB = GEMMKernel::WeightType;
GEMMKernel kernel;
// weight buffer
auto packedw = kernel.getWeightPtr()->createStorage(n, k, blocksize);
// pytorch like int4x2 weight (stored as int8)
avector<int8_t> buffer(packedw.mSize);
packedw.assign(buffer.data());
// pack the quantized weight, scale, zero_points into jblas weight buffer
kernel.getWeightPtr()->packQWeight(n, k, quanW.data(), ldb, scales.data(), zero_points.data(), &packedw);
// same weight for Q, K V
std::vector<PrologueB::Param> bparams{{&packedw}, {&packedw}, {&packedw}};
// activation buffer
auto quanA = kernel.getActivationPtr()->createStorage(m, k, blocksize);
avector<int8_t> bufA(quanA.mSize);
quanA.assign(bufA.data());
// activation dynamic quantize
kernel.getActivationPtr()->quantize({_data.matA.data(), lda, &quanA}, m, k);
// output buffer
std::vector<GEMMKernel::CParam> cparams(batch);
GEMMKernel::Arguments args{m, n, k, batch, _data.matA.data(), lda, &quanA, bparams.data(), cparams.data(), NULL};

kernel.compute(args);

The whole case would be like

end to end cases


#include "jblas/jit_blas_transformer.h"
#include "jblas/jit_blas_weight_compression.h"
#include "jit_blas_ut.h"

namespace jblas {
using namespace utils;
using CompType = jblas::prologue::weight_comp::gemm_kblcok::PrologueBIDs;
namespace wrapper {
namespace transformer {
class UT_AVX512VNNI_NN_QKV_INT4_BLOCK {
 public:
  UT_AVX512VNNI_NN_QKV_INT4_BLOCK() {
    UT_START();
    CheckISA(AVX512_VNNI);
    ut(2, 4096, 4096, 4096, 4096, 4096, 0, 1.f, 0.f, 128);
    ut(2, 4096, 4096, 4096, 4096, 4096, 0, 1.f, 0.f, 64);
    ut(2, 4096, 4096, 4096, 4096, 4096, 0, 1.f, 0.f, 32);
  }

  void ut(int m, int n, int k, int lda, int ldb, int ldc, int ldd, float alpha, float beta, int blocksize) {
    printf("Test Case %s: %d %d %d-%d %d %d %d %f %f\n", __FUNCTION__, m, n, k, blocksize, lda, ldc, ldd, alpha, beta);

    ut::UT_GEMMData_Row_f32 _data(m, n, k, lda, ldb, ldc, ldd);
    int kblk_num = utils::updiv(k, blocksize);
    utils::aligned_vector<float> scales(kblk_num * n);
    ut::fill_buffer_randn(scales.data(), scales.size(), 0.001f, 0.005f);
    ut::UT_vector_s8 quanW;
    quanW.resize(k * n);
    quanW.fill_rand(-127, 127);
    for (int i = 0; i < k; i++) {
      for (int j = 0; j < n; j++) {
        if (i % blocksize == 0) {
          quanW.data()[i * n + j] = 127;
        }
        quanW.data()[i * n + j] = quanW.data()[i * n + j] & 0xf0;
      }
    }

    utils::aligned_vector<uint8_t> matAu8(m * lda), matAzp(m * kblk_num);
    ut::fill_buffer_randn(matAu8.data(), matAu8.size(), uint8_t(0), uint8_t(255));
    ut::fill_buffer_randn(matAzp.data(), matAzp.size(), uint8_t(100), uint8_t(150));
    utils::aligned_vector<float> AScales(kblk_num * m);
    ut::fill_buffer_randn(AScales.data(), AScales.size(), 0.001f, 0.005f);
    for (int j = 0; j < m; j++) {
      for (int i = 0; i < k; i++) {
        if (i % blocksize == 0) {
          matAu8.data()[i + j * lda] = 255;
        }
        if (i % blocksize == blocksize - 1) {
          matAu8.data()[i + j * lda] = 0;
        }
      }
    }
    for (int j = 0; j < k; j++) {
      for (int i = 0; i < n; i++) {
        _data.matB[j * ldb + i] = float(quanW.data()[j * ldb + i]) * scales[j / blocksize * n + i];
      }
    }
    for (int j = 0; j < m; j++) {
      for (int i = 0; i < k; i++) {
        _data.matA[j * lda + i] = (float(matAu8.data()[j * lda + i]) - matAzp[i / blocksize + j * kblk_num]) *
                                  AScales[i / blocksize + j * kblk_num];
      }
    }
    _data.calc_ref(alpha, beta);

    using GEMMKernel = wrapper::transformer_default::weight_comp::avx512_vnni::QKVGemmDynamicS4Fp32KBlock;
    using PrologueB = GEMMKernel::WeightType;
    GEMMKernel kernel;
    auto packedw = kernel.getWeightPtr()->createStorage(n, k, blocksize);
    avector<int8_t> buffer(packedw.mSize);
    packedw.assign(buffer.data());
    kernel.getWeightPtr()->packQWeight(n, k, quanW.data(), ldb, scales.data(), nullptr, &packedw);
    std::vector<PrologueB::Param> bparams{{&packedw}, {&packedw}, {&packedw}};
    auto quanA = kernel.getActivationPtr()->createStorage(m, k, blocksize);
    avector<int8_t> bufA(quanA.mSize);
    quanA.assign(bufA.data());
    kernel.getActivationPtr()->quantize({_data.matA.data(), lda, &quanA}, m, k);

    ut::buffer_error(quanA.template get<uint8_t>(), matAu8.data(), matAu8.size());
    ut::buffer_error(quanA.mZPtr, matAzp.data(), matAzp.size());
    ut::buffer_error(quanA.mSPtr, AScales.data(), AScales.size(), 0.001f);
    int batch = 3;
    aligned_vector<float> matCBatch(3 * m * n);
    std::vector<GEMMKernel::CParam> cparams(batch);
    for (size_t i = 0; i < batch; i++) {
      cparams[i].ldc = ldc;
      cparams[i].C = matCBatch.data() + i * m * n;
    }
    GEMMKernel::Arguments args{m, n, k, batch, _data.matA.data(), lda, &quanA, bparams.data(), cparams.data(), NULL};

    kernel.compute(args);

    ut::buffer_error(_data.matRef.data(), cparams[0].C, _data.matRef.size(), 0.001f);
    for (size_t i = 1; i < batch; i++) {
      ut::buffer_error(cparams[0].C, cparams[i].C, _data.matRef.size(), 0.f);
    }
  }
};

6 replies

choochtech Dec 31, 2023

Hi @mingfeima

Are we able to use jblas with llama cpp for Intel ?

Thanks

mingfeima Jan 2, 2024
Collaborator Author

@choochtech after some internal discussion, we made a decision that to go with intrinsics first. oneDNN will be released with similar feature with jblas in the near future, and using oneDNN in ggml would be more straight-forward.

choochtech Jan 2, 2024

Thanks @mingfeima. Which library should we use for text generation on Intel ? intel-extension-for-transformers or llama cpp ?

Please advise.

hshen14 Jan 2, 2024

@choochtech We (@mingfeima @airMeng ) plan to contribute our optimizations to llama.cpp, while intel-extension-for-transformers provides the Transformer-like APIs by leveraging the underlying optimizations. We expect those Transformer-like APIs would also be upstreamed through Hugging Face libraries e.g., transformers or Optimum-Intel.

choochtech Jan 2, 2024

Thanks @hshen14 great news.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Opinions about template based GEMM kernels #3965

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 2 comments 10 replies

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

Select a reply

Opinions about template based GEMM kernels #3965

mingfeima Nov 6, 2023 Collaborator

Replies: 2 comments · 10 replies

ggerganov Nov 6, 2023 Maintainer

Background

mingfeima Nov 8, 2023 Collaborator Author

Option-1: Use jblas and refactor the source code into ggml-jblas.h + ggml-jblas.cpp:

Option-2: Use jblas as a third party library (git submodule)

Option-3: Direct optimize GEMM kernels with intrinsics

ggerganov Nov 9, 2023 Maintainer

mingfeima Nov 9, 2023 Collaborator Author

airMeng Nov 9, 2023 Collaborator

airMeng Nov 8, 2023 Collaborator

choochtech Dec 31, 2023

mingfeima Jan 2, 2024 Collaborator Author

choochtech Jan 2, 2024

hshen14 Jan 2, 2024

choochtech Jan 2, 2024

mingfeima
Nov 6, 2023
Collaborator

Replies: 2 comments 10 replies

ggerganov
Nov 6, 2023
Maintainer

mingfeima Nov 8, 2023
Collaborator Author

Option-1: Use jblas and refactor the source code into `ggml-jblas.h` + `ggml-jblas.cpp`:

ggerganov Nov 9, 2023
Maintainer

mingfeima Nov 9, 2023
Collaborator Author

airMeng Nov 9, 2023
Collaborator

airMeng
Nov 8, 2023
Collaborator

mingfeima Jan 2, 2024
Collaborator Author