Add support for AMX instructions (bf16 and/or int8) #2555
Comments
Can you share what performance you are seeing (t/s)? Also, did you compile with Intel MKL? In our testing we are seeing about 18-20 tokens/sec on a GCP C3-44 with 35 threads (22 hardware cores, but going over that seems to make things about 10% faster). This is from building llama.cpp inside a container. I am not familiar enough with Intel tooling to tell whether MKL is correctly dynamically dispatching AMX instructions, though, so the plan is to compare it to non-MKL builds later this week.
Hi @kiratp
It depends: which model? Which inference params? Which prompt?
Nope, but I'll try. Thanks.
Hmm, let's first make sure that this CPU has AMX units: can you show the output of lscpu on such a machine?
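In case it helps while waiting for the lscpu output, here is a minimal sketch (not from this thread) of checking the same AMX feature bits directly via CPUID; on Linux the equivalent is simply looking for the amx_tile / amx_bf16 / amx_int8 flags in lscpu or /proc/cpuinfo:

```c
/* Sketch: query CPUID.(EAX=7,ECX=0):EDX for the AMX feature bits
 * (bit 22 = AMX-BF16, bit 24 = AMX-TILE, bit 25 = AMX-INT8),
 * which correspond to the amx_* flags shown by lscpu. */
#include <cpuid.h>
#include <stdio.h>

int main(void) {
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) {
        printf("CPUID leaf 7 not supported\n");
        return 1;
    }
    printf("AMX-BF16: %s\n", (edx >> 22) & 1 ? "yes" : "no");
    printf("AMX-TILE: %s\n", (edx >> 24) & 1 ? "yes" : "no");
    printf("AMX-INT8: %s\n", (edx >> 25) & 1 ? "yes" : "no");
    return 0;
}
```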
I don't have access to the VM at this moment, but Google documents this specifically: https://cloud.google.com/compute/docs/general-purpose-machines#c3_series Could you run the simple "make a website" prompt from this repo and share the timings? Using MKL should in theory at least speed up prompt eval, since that is done batched (as opposed to generation, which is sequential). I intend to compare C2 (Cascade Lake) and C3 (Sapphire Rapids) to verify my statement above. I will share my results.
@ggerganov how do you feel about having official Dockerfiles for this project to build and deploy the server executable (with MKL etc.)? We are gearing up to deploy to production and I'm happy to share some of our build infrastructure.
GCP instance: c3-highcpu-44
Compiled with:
Using 22 threads (the hardware core count) is slower, at around 18-19 t/s with Q8_0.
First time I hear about Intel AMX - would be cool if we can add support. Docker images: if it is something simple - sure.
Given that the code already uses cblas_sgemm, I would expect that using Q8 would trigger the appropriate instructions out of oneMKL. However, the perf numbers don't line up with that assumption. Hugging Face is showing a 60% uplift compared to previous-gen CPUs.
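For what it's worth, here is a minimal sketch of the fp32 path in question, assuming oneMKL is installed and linked (the shapes and buffers are made up for illustration). As far as I understand it, the AMX TMUL units only operate on bf16/int8 tiles, so a plain fp32 sgemm would be dispatched to AVX-512 kernels rather than AMX, which could explain why the numbers don't line up:

```c
/* Sketch (assumption: oneMKL available). This mirrors the ggml BLAS path:
 * quantized weights are dequantized to fp32 and fed to cblas_sgemm. AMX tiles
 * only implement bf16/int8 multiplies, so an fp32 SGEMM like this is served by
 * AVX-512 kernels, not AMX; reaching AMX through oneMKL would presumably need
 * the bf16/int8 GEMM entry points (e.g. cblas_gemm_bf16bf16f32 /
 * cblas_gemm_s8u8s32) instead. */
#include <mkl.h>
#include <stdlib.h>

int main(void) {
    const MKL_INT M = 512, N = 512, K = 512;
    float *A = calloc((size_t)(M * K), sizeof(float)); /* dequantized weight tile */
    float *B = calloc((size_t)(K * N), sizeof(float)); /* activations             */
    float *C = calloc((size_t)(M * N), sizeof(float)); /* output                  */

    /* C = 1.0 * A * B + 0.0 * C, row-major, no transposes */
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                M, N, K, 1.0f, A, K, B, N, 0.0f, C, N);

    free(A); free(B); free(C);
    return 0;
}
```

Running the same binary with MKL_VERBOSE=1 set in the environment makes oneMKL log each BLAS call along with timing and the detected ISA, which should help confirm which code path is actually being dispatched.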
I just met with the Intel team at GCP Next. They were great and have offered to have someone come take a look at this.
Some more data from c3-highcpu-44:
4-bit
8-bit
I have tested an Intel AMX llama acceleration lib, based on the list below:
Model: llama2 13b
Test method:
Run command:
The test_hybrid.py is supplied by the Intel team. It extends LlamaForCausalLM and supports AMX (bf16 & fp16). Output example: Super-fused BF16+FP16 Llama Infer Latency: 1133237.55 ms. The full results are below:
bf16 consumes much more memory; Intel will fix it. According to my tests, batch size is very useful for improving throughput. I hope llama.cpp will support batched inference, not only the n_batch param. I also tested Llama 2 70B (INT8) on different CPUs. The best result is acceptable, 4.01 tokens/s (on an AMD 9654P platform), but I want much more throughput.
@ggerganov I know this has been discussed here before, but what is your stance on supporting batched inference? Some rough benchmarking and cost analysis indicates that a llama.cpp server on spot instances would be the most cost-effective way to do inference at the moment. I won't begin to claim I understand what it would take to add a batch dim through llama.cpp - if you can summarize a bit, that would be cool.
@xinchun-wang if I'm reading that correctly, your results for batch size 1 are about 1/3 of llama.cpp's (118 ms/tok = 8.47 tok/s). It would seem that Intel's implementation leverages batching well to hide memory bottlenecks, but llama.cpp is better optimized around the throughput limitation.
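A rough back-of-the-envelope for why batching helps here (illustrative numbers, not measurements from this thread): at batch size 1 every generated token has to stream essentially all of the weights from memory, so a 13B fp16 model (~26 GB of weights) on a socket with, say, ~300 GB/s of effective bandwidth is capped at roughly 26 GB / 300 GB/s ≈ 87 ms/token ≈ 11 tok/s, no matter how much matmul throughput (AMX or otherwise) is available. With a batch of B sequences, those same weight reads are shared across B tokens, which is when the extra compute from AMX can actually show up as throughput.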
@kiratp run command:
main: build = 0 (unknown)
system_info: n_threads = 64 / 128 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
llama_print_timings: load time = 910.15 ms
Intel (bf16+fp16) is better than llama.cpp: 7.41 tokens/s vs 6.12 tokens/s.
I see now - my results were 7B q4_k_m, while you're testing FP16 13B.
Batched inference will be implemented - see #2813 |
Quick test using oneDNN on an AWS m7i:
and
@xinchun-wang Warning: your linux kernel is perhaps too old to enable amx |
The kernel version 5.10.0-136.36.0.112.1.oe2203sp1.x86_64 should be an openEuler 22.03 LTS SP1 system. I also use the same system, and it supports the AMX features.
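For anyone else hitting that warning: as I understand it, AMX state management landed in Linux 5.16, and even on a kernel with AMX support (backported or not) each process is expected to request permission for the tile-data state before executing AMX instructions, roughly like this (a sketch; the constants come from the kernel uapi headers):

```c
/* Sketch: ask the kernel for AMX tile-data permission (Linux >= 5.16, or a
 * kernel with the AMX patches backported). If this syscall fails, AMX tile
 * instructions will fault even on AMX-capable hardware. */
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

#define ARCH_REQ_XCOMP_PERM 0x1023  /* from asm/prctl.h */
#define XFEATURE_XTILEDATA  18      /* AMX tile-data state component */

int main(void) {
    if (syscall(SYS_arch_prctl, ARCH_REQ_XCOMP_PERM, XFEATURE_XTILEDATA) != 0) {
        perror("ARCH_REQ_XCOMP_PERM failed (kernel too old or AMX not supported)");
        return 1;
    }
    printf("AMX tile data permission granted\n");
    return 0;
}
```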
I'm now more motivated to get motherboards for the Intel® Xeon Phi™ Processor 7290 that I have acquired. So much untapped potential with MKL (oneAPI) and both old and new Intel processors. |
Here is the RFC (#3965) that we've discussed with @ggerganov. AMX will be part of the support in upcoming PRs. CC @mingfeima @airMeng |
This issue was closed because it has been inactive for 14 days since being marked as stale. |
Prerequisites
Please answer the following questions for yourself before submitting an issue.
Expected Behavior
Please provide a detailed written description of what you were trying to do: running any ggml at int8 precision
What you expected llama.cpp to do: using AMX acceleration
Current Behavior
llama.cpp does not seem to use any AMX instructions
Environment and Context
Operating System, e.g. for Linux:
ref:
https://www.intel.com/content/www/us/en/products/docs/accelerator-engines/advanced-matrix-extensions/overview.html
https://aws.amazon.com/about-aws/whats-new/2023/08/amazon-ec2-m7i-flex-m7i-instances/