Add support for AMX instructions (bf16 and/or int8) #2555
Comments
Can you share what performance you are seeing (t/s)? Also, did you compile with Intel MKL? In our testing we are seeing about 18-20 tokens/sec on a GCP C3-44 with 35 threads (22 hardware cores, but going over that seems to make things about 10% faster). This is from building llama.cpp inside a container. I am not familiar enough with Intel tooling to tell whether MKL is correctly dynamically dispatching AMX instructions, though, so the plan is to compare it to non-MKL builds later this week.
Hi @kiratp
It depends: which model? Which inference params? Which prompt?
Nope, but I'll try. Thanks.
Hmm, let's first make sure that this CPU has AMX units: can you show the output of lscpu on such a machine?
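In case it helps while waiting for the lscpu output, here is a minimal sketch (not from this thread) of checking the same AMX feature bits directly via CPUID; on Linux the equivalent is simply looking for the amx_tile / amx_bf16 / amx_int8 flags in lscpu or /proc/cpuinfo:

```c
/* Sketch: query CPUID.(EAX=7,ECX=0):EDX for the AMX feature bits
 * (bit 22 = AMX-BF16, bit 24 = AMX-TILE, bit 25 = AMX-INT8),
 * which correspond to the amx_* flags shown by lscpu. */
#include <cpuid.h>
#include <stdio.h>

int main(void) {
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) {
        printf("CPUID leaf 7 not supported\n");
        return 1;
    }
    printf("AMX-BF16: %s\n", (edx >> 22) & 1 ? "yes" : "no");
    printf("AMX-TILE: %s\n", (edx >> 24) & 1 ? "yes" : "no");
    printf("AMX-INT8: %s\n", (edx >> 25) & 1 ? "yes" : "no");
    return 0;
}
```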
I don't have access to the VM at this moment, but Google documents this specifically: https://cloud.google.com/compute/docs/general-purpose-machines#c3_series Could you run the simple "make a website" prompt from this repo and share the timings? Using MKL should in theory at least speed up prompt eval, since that is done batched (as opposed to generation, which is sequential). I intend to compare C2 (Cascade Lake) and C3 (Sapphire Rapids) to verify my statement above. I will share my results.
@ggerganov how do you feel about having official Dockerfiles for this project to build and deploy the server executable (with MKL etc.)? We are gearing up to deploy to production and I'm happy to share some of our build infrastructure.
GCP instance: c3-highcpu-44
Compiled with:
Using 22 threads (the hardware core count) is slower, at around 18-19 t/s with Q8_0.
First time I hear about Intel AMX - would be cool if we can add support. Docker images: if it is something simple - sure.
Given that the code already uses cblas_sgemm, I would expect that using Q8 would trigger the appropriate instructions out of oneMKL. However, the perf numbers don't line up with that assumption. Hugging Face is showing a 60% uplift compared to previous-gen CPUs.
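For what it's worth, here is a minimal sketch of the fp32 path in question, assuming oneMKL is installed and linked (the shapes and buffers are made up for illustration). As far as I understand it, the AMX TMUL units only operate on bf16/int8 tiles, so a plain fp32 sgemm would be dispatched to AVX-512 kernels rather than AMX, which could explain why the numbers don't line up:

```c
/* Sketch (assumption: oneMKL available). This mirrors the ggml BLAS path:
 * quantized weights are dequantized to fp32 and fed to cblas_sgemm. AMX tiles
 * only implement bf16/int8 multiplies, so an fp32 SGEMM like this is served by
 * AVX-512 kernels, not AMX; reaching AMX through oneMKL would presumably need
 * the bf16/int8 GEMM entry points (e.g. cblas_gemm_bf16bf16f32 /
 * cblas_gemm_s8u8s32) instead. */
#include <mkl.h>
#include <stdlib.h>

int main(void) {
    const MKL_INT M = 512, N = 512, K = 512;
    float *A = calloc((size_t)(M * K), sizeof(float)); /* dequantized weight tile */
    float *B = calloc((size_t)(K * N), sizeof(float)); /* activations             */
    float *C = calloc((size_t)(M * N), sizeof(float)); /* output                  */

    /* C = 1.0 * A * B + 0.0 * C, row-major, no transposes */
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                M, N, K, 1.0f, A, K, B, N, 0.0f, C, N);

    free(A); free(B); free(C);
    return 0;
}
```

Running the same binary with MKL_VERBOSE=1 set in the environment makes oneMKL log each BLAS call along with timing and the detected ISA, which should help confirm which code path is actually being dispatched.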
I just met with the Intel team at GCP Next. They were great and have offered to have someone come take a look at this.
Some more data from c3-highcpu-44:
4-bit
8-bit
I have tested an Intel AMX llama acceleration lib, based on the list below:
Model: llama2 13b
Test method:
Run command:
The test_hybrid.py is supplied by the Intel team. It extends LlamaForCausalLM and supports AMX (bf16 & fp16). Output example: Super-fused BF16+FP16 Llama Infer Latency: 1133237.55 ms. The full results are below:
bf16 consumes much more memory; Intel will fix it. According to my tests, batch size is very useful for improving throughput. I hope llama.cpp will support batched inference, not only the n_batch param. I also tested Llama 2 70B (INT8) on different CPUs. The best result is acceptable, 4.01 tokens/s (on an AMD 9654P platform), but I want much more throughput.
@ggerganov I know this has been discussed here before, but what is your stance on supporting batched inference? Some rough benchmarking and cost analysis indicates that a llama.cpp server on spot instances would be the most cost-effective way to do inference at the moment. I won't begin to claim I understand what it would take to add a batch dim through llama.cpp - if you can summarize a bit, that would be cool.
@xinchun-wang if I'm reading that correctly, your results for batch size 1 are about 1/3 of llama.cpp's (118 ms/tok = 8.47 tok/s). It would seem that Intel's implementation leverages batching well to hide memory bottlenecks, but llama.cpp is better optimized around the throughput limitation.
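A rough back-of-the-envelope for why batching helps here (illustrative numbers, not measurements from this thread): at batch size 1 every generated token has to stream essentially all of the weights from memory, so a 13B fp16 model (~26 GB of weights) on a socket with, say, ~300 GB/s of effective bandwidth is capped at roughly 26 GB / 300 GB/s ≈ 87 ms/token ≈ 11 tok/s, no matter how much matmul throughput (AMX or otherwise) is available. With a batch of B sequences, those same weight reads are shared across B tokens, which is when the extra compute from AMX can actually show up as throughput.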
@kiratp run command:
main: build = 0 (unknown)
system_info: n_threads = 64 / 128 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
llama_print_timings: load time = 910.15 ms
Intel (bf16+fp16) is better than llama.cpp: 7.41 tokens/s vs 6.12 tokens/s.
I see now - my results were 7B q4_k_m, while you're testing FP16 13B.
Batched inference will be implemented - see #2813 |
Quick test using oneDNN on an AWS m7i:
and
@xinchun-wang Warning: your linux kernel is perhaps too old to enable amx |
The kernel version 5.10.0-136.36.0.112.1.oe2203sp1.x86_64 should be an openEuler 22.03 LTS SP1 system. I also use the same system, and it supports the AMX features.
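For anyone else hitting that warning: as I understand it, AMX state management landed in Linux 5.16, and even on a kernel with AMX support (backported or not) each process is expected to request permission for the tile-data state before executing AMX instructions, roughly like this (a sketch; the constants come from the kernel uapi headers):

```c
/* Sketch: ask the kernel for AMX tile-data permission (Linux >= 5.16, or a
 * kernel with the AMX patches backported). If this syscall fails, AMX tile
 * instructions will fault even on AMX-capable hardware. */
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

#define ARCH_REQ_XCOMP_PERM 0x1023  /* from asm/prctl.h */
#define XFEATURE_XTILEDATA  18      /* AMX tile-data state component */

int main(void) {
    if (syscall(SYS_arch_prctl, ARCH_REQ_XCOMP_PERM, XFEATURE_XTILEDATA) != 0) {
        perror("ARCH_REQ_XCOMP_PERM failed (kernel too old or AMX not supported)");
        return 1;
    }
    printf("AMX tile data permission granted\n");
    return 0;
}
```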
I'm now more motivated to get motherboards for the Intel® Xeon Phi™ Processor 7290 that I have acquired. So much untapped potential with MKL (oneAPI) and both old and new Intel processors. |
Here is the RFC (#3965) that we've discussed with @ggerganov. AMX will be part of the support in upcoming PRs. CC @mingfeima @airMeng |
This issue was closed because it has been inactive for 14 days since being marked as stale. |
Prerequisites
Please answer the following questions for yourself before submitting an issue.
Expected Behavior
Please provide a detailed written description of what you were trying to do: running any ggml at int8 precision
What you expected llama.cpp to do: using AMX acceleration
Current Behavior
llama.cpp does not seem to use any AMX instructions
Environment and Context
Operating System, e.g. for Linux:
ref:
https://www.intel.com/content/www/us/en/products/docs/accelerator-engines/advanced-matrix-extensions/overview.html
https://aws.amazon.com/about-aws/whats-new/2023/08/amazon-ec2-m7i-flex-m7i-instances/