Add support for AMX instructions (bf16 and/or int8) #2555

Closed · 4 tasks done
WilliamTambellini opened this issue Aug 8, 2023 · 20 comments

@WilliamTambellini (Contributor) commented Aug 8, 2023

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

Expected Behavior

What I was trying to do: run any ggml model at int8 precision.
What I expected llama.cpp to do: use AMX acceleration.

Current Behavior

llama.cpp does not seem to use any AMX instructions.

Environment and Context

Physical (or virtual) hardware you are using, e.g. for Linux:

    $ lscpu
    Architecture:            x86_64
      CPU op-mode(s):        32-bit, 64-bit
      Address sizes:         46 bits physical, 48 bits virtual
      Byte Order:            Little Endian
    CPU(s):                  8
      On-line CPU(s) list:   0-7
    Vendor ID:               GenuineIntel
      Model name:            Intel(R) Xeon(R) Platinum 8488C
        CPU family:          6
        Model:               143
        Thread(s) per core:  2
        Core(s) per socket:  4
        Socket(s):           1
        Stepping:            8
        BogoMIPS:            4800.00
        Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves avx_vnni avx512_bf16 wbnoinvd ida arat avx512vbmi umip pku ospke waitpkg avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq rdpid cldemote movdiri movdir64b md_clear serialize amx_bf16 avx512_fp16 amx_tile amx_int8 flush_l1d arch_capabilities

Operating System, e.g. for Linux:

    Fedora 37
    $ uname -a
    Linux 6.1.9-200.fc37.x86_64 #1 SMP PREEMPT_DYNAMIC Thu Feb  2 00:21:48 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
    $ g++ --version
    11.3

ref:
https://www.intel.com/content/www/us/en/products/docs/accelerator-engines/advanced-matrix-extensions/overview.html

https://aws.amazon.com/about-aws/whats-new/2023/08/amazon-ec2-m7i-flex-m7i-instances/

@kiratp commented Aug 18, 2023

Can you share what performance you are seeing (t/sec)? Also, did you compile with Intel MKL?

In our testing we are seeing about 18-20 tokens/sec on a GCP C3-44 with 35 threads (22 hardware cores, but going over that seems to make things faster by about 10%). This is from building llama.cpp inside a container with base image intel/oneapi-basekit:latest and -DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=Intel10_64lp -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx.

I am not familiar enough with Intel tooling to tell whether MKL is correctly dispatching AMX instructions at runtime, so the plan is to just compare against non-MKL builds later this week.

@WilliamTambellini (Contributor, Author) commented Aug 18, 2023

Hi @kiratp

Can you share what performance you are seeing (t/sec)?

It depends: which model? Which inference params? Which prompt?

Also, did you compile with Intel MKL?

Nope, but I'll try. Thanks.

GCP C3-44

Hmm, let's first make sure that CPU has AMX units: can you show the output of lscpu on such a machine?
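
(For anyone who can't easily grab lscpu output: a minimal C++ sketch, not llama.cpp code, that reads the AMX feature bits directly from CPUID leaf 7, subleaf 0. The bit positions are from Intel's documentation; note this only shows what the hardware advertises, not whether the kernel has enabled the tile state.)

    // Hedged sketch: query CPUID leaf 7, subleaf 0.
    // EDX bit 22 = AMX-BF16, bit 24 = AMX-TILE, bit 25 = AMX-INT8 -- the same
    // features lscpu reports as amx_bf16 / amx_tile / amx_int8.
    #include <cpuid.h>
    #include <cstdio>

    int main() {
        unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;
        if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) {
            std::printf("CPUID leaf 7 not supported\n");
            return 1;
        }
        std::printf("AMX-BF16: %s\n", (edx & (1u << 22)) ? "yes" : "no");
        std::printf("AMX-TILE: %s\n", (edx & (1u << 24)) ? "yes" : "no");
        std::printf("AMX-INT8: %s\n", (edx & (1u << 25)) ? "yes" : "no");
        return 0;
    }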

@kiratp commented Aug 19, 2023

I don't have access to the VM at this moment, but Google documents this specifically:

https://cloud.google.com/compute/docs/general-purpose-machines#c3_series

Could you run the simple "make a website" prompt from this repo and share the timings?

Using MKL should in theory at least speed up prompt eval as that is done batched (as opposed to generation that is sequential).

I intend to compare C2 (Cascade Lake) and C3 (Sapphire Rapids) to verify my above statement. I will share my results.

@kiratp commented Aug 19, 2023

@ggerganov how do you feel about having official Dockerfiles for this project to build and deploy the server executable (with MKL etc)? We are gearing up to deploy to production and I'm happy to share out some of our build infrastructure.

@kiratp commented Aug 19, 2023

GCP instance: c3-highcpu-44

Compiled with:
cmake .. -DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=Intel10_64lp -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -DCMAKE_CXX_FLAGS="-Ofast -mSAPPHIRERAPIDS -xSAPPHIRERAPIDS -qopt-zmm-usage=high -mno-shstk"

./build/bin/main --model /usr/src/models/<llama2 merged LORA model>.ggml.q4_k_m.bin --ctx-size 4096 --threads 42 -eps 1e-5 -p "Building a website can be done in 10 simple steps:"

llama_print_timings:        load time =   217.25 ms
llama_print_timings:      sample time =   300.20 ms /   491 runs   (    0.61 ms per token,  1635.58 tokens per second)
llama_print_timings: prompt eval time =   215.73 ms /    14 tokens (   15.41 ms per token,    64.90 tokens per second)
llama_print_timings:        eval time = 22698.24 ms /   490 runs   (   46.32 ms per token,    21.59 tokens per second)
llama_print_timings:       total time = 23328.41 ms

22 threads (the hardware core count) is slower at around 18-19 t/sec

Q8_0

llama_print_timings:        load time =   367.61 ms
llama_print_timings:      sample time =   292.28 ms /   477 runs   (    0.61 ms per token,  1631.97 tokens per second)
llama_print_timings: prompt eval time =   232.65 ms /    14 tokens (   16.62 ms per token,    60.18 tokens per second)
llama_print_timings:        eval time = 33848.05 ms /   476 runs   (   71.11 ms per token,    14.06 tokens per second)
llama_print_timings:       total time = 34484.84 ms

@ggerganov (Member) commented

First time I'm hearing about Intel AMX - it would be cool if we could add support.

Docker images: if it is something simple - sure.

@kiratp commented Aug 21, 2023

Given that the code already uses cblas_sgemm, I would expect that using Q8 would trigger the appropriate instructions out of oneMKL. However, the perf numbers don't line up with that assumption.

Hugging Face is showing a 60% uplift compared to previous-gen CPUs:

https://huggingface.co/blog/intel-sapphire-rapids-inference
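
(For reference, a minimal, hypothetical sketch of running an int8 matmul through oneDNN on the CPU; oneDNN picks the kernel at runtime and, on Sapphire Rapids, can select an AMX implementation such as brg:avx512_core_amx. The sizes and plain row-major layouts below are illustrative and are not llama.cpp's actual GEMM path.)

    // Hedged sketch: C = A (s8) x B (s8) accumulated into s32 via oneDNN matmul.
    // Assumes oneDNN v3.x; link with -ldnnl.
    #include <cstdint>
    #include <cstdio>
    #include <vector>
    #include "oneapi/dnnl/dnnl.hpp"

    int main() {
        using namespace dnnl;
        const memory::dim M = 1024, K = 1024, N = 1024;   // illustrative sizes

        engine eng(engine::kind::cpu, 0);
        stream strm(eng);

        // Plain row-major layouts; format_tag::any for the weights would let
        // oneDNN pick a blocked layout that is friendlier to AMX tiles.
        memory::desc a_md({M, K}, memory::data_type::s8,  memory::format_tag::ab);
        memory::desc b_md({K, N}, memory::data_type::s8,  memory::format_tag::ab);
        memory::desc c_md({M, N}, memory::data_type::s32, memory::format_tag::ab);

        std::vector<int8_t>  a(M * K, 1), b(K * N, 1);
        std::vector<int32_t> c(M * N, 0);

        memory a_mem(a_md, eng, a.data());
        memory b_mem(b_md, eng, b.data());
        memory c_mem(c_md, eng, c.data());

        matmul::primitive_desc pd(eng, a_md, b_md, c_md);
        matmul(pd).execute(strm, {{DNNL_ARG_SRC, a_mem},
                                  {DNNL_ARG_WEIGHTS, b_mem},
                                  {DNNL_ARG_DST, c_mem}});
        strm.wait();

        std::printf("c[0] = %d (expected %d)\n", c[0], (int)K);
        return 0;
    }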

@kiratp commented Aug 30, 2023

I just met with the Intel team at GCP Next. They were great - they've offered to have someone come take a look at this.

@kiratp commented Aug 30, 2023

Some more data on c3-highcpu-44:

4 bit

root@ml-perf-testing:/usr/src/app/llama.cpp# ./build/bin/llama-bench -t 2,8,16,20,21,22,32,38,40,42,43,44 -m /usr/src/models/<llama2 finetuned>.ggml.q4_k_m.bin -r 2
| model                          | backend    |  n_threads | test       |             t/s |
| ------------------------------ | ---------- | ---------: | ---------- | --------------: |
| LLaMA 7B mostly Q4_K - Medium  | BLAS       |          2 | pp 512     |    39.63 ± 0.39 |
| LLaMA 7B mostly Q4_K - Medium  | BLAS       |          8 | pp 512     |    39.69 ± 0.34 |
| LLaMA 7B mostly Q4_K - Medium  | BLAS       |         16 | pp 512     |    39.71 ± 0.57 |
| LLaMA 7B mostly Q4_K - Medium  | BLAS       |         20 | pp 512     |    39.69 ± 0.45 |
| LLaMA 7B mostly Q4_K - Medium  | BLAS       |         21 | pp 512     |    39.63 ± 0.57 |
| LLaMA 7B mostly Q4_K - Medium  | BLAS       |         22 | pp 512     |    39.61 ± 0.45 |
| LLaMA 7B mostly Q4_K - Medium  | BLAS       |         32 | pp 512     |    39.50 ± 0.64 |
| LLaMA 7B mostly Q4_K - Medium  | BLAS       |         38 | pp 512     |    39.68 ± 0.56 |
| LLaMA 7B mostly Q4_K - Medium  | BLAS       |         40 | pp 512     |    39.52 ± 0.62 |
| LLaMA 7B mostly Q4_K - Medium  | BLAS       |         42 | pp 512     |    39.75 ± 0.47 |
| LLaMA 7B mostly Q4_K - Medium  | BLAS       |         43 | pp 512     |    39.52 ± 0.34 |
| LLaMA 7B mostly Q4_K - Medium  | BLAS       |         44 | pp 512     |    39.57 ± 0.74 |
| LLaMA 7B mostly Q4_K - Medium  | BLAS       |          2 | tg 128     |     4.23 ± 0.01 |
| LLaMA 7B mostly Q4_K - Medium  | BLAS       |          8 | tg 128     |    12.32 ± 0.02 |
| LLaMA 7B mostly Q4_K - Medium  | BLAS       |         16 | tg 128     |    18.12 ± 0.06 |
| LLaMA 7B mostly Q4_K - Medium  | BLAS       |         20 | tg 128     |    19.93 ± 0.12 |
| LLaMA 7B mostly Q4_K - Medium  | BLAS       |         21 | tg 128     |    20.17 ± 0.06 |
| LLaMA 7B mostly Q4_K - Medium  | BLAS       |         22 | tg 128     |    20.73 ± 0.03 |
| LLaMA 7B mostly Q4_K - Medium  | BLAS       |         32 | tg 128     |    21.78 ± 0.33 |
| LLaMA 7B mostly Q4_K - Medium  | BLAS       |         38 | tg 128     |    22.01 ± 1.05 |
| LLaMA 7B mostly Q4_K - Medium  | BLAS       |         40 | tg 128     |    22.21 ± 0.41 |
| LLaMA 7B mostly Q4_K - Medium  | BLAS       |         42 | tg 128     |    22.45 ± 0.05 |
| LLaMA 7B mostly Q4_K - Medium  | BLAS       |         43 | tg 128     |    22.33 ± 0.21 |
| LLaMA 7B mostly Q4_K - Medium  | BLAS       |         44 | tg 128     |    22.39 ± 0.01 |

build: 9e232f0 (1009)

8 Bit

root@ml-perf-testing:/usr/src/app/llama.cpp# ./build/bin/llama-bench -t 16,20,21,22,32,38,40,42,43,44 -m /usr/src/models/<llama2 finetuned>.ggml.q8_0.bin -r 2
| model                          | backend    |  n_threads | test       |             t/s |
| ------------------------------ | ---------- | ---------: | ---------- | --------------: |
| LLaMA 7B mostly Q8_0           | BLAS       |         16 | pp 512     |    37.50 ± 0.39 |
| LLaMA 7B mostly Q8_0           | BLAS       |         20 | pp 512     |    37.62 ± 0.53 |
| LLaMA 7B mostly Q8_0           | BLAS       |         21 | pp 512     |    37.77 ± 0.45 |
| LLaMA 7B mostly Q8_0           | BLAS       |         22 | pp 512     |    37.62 ± 0.62 |
| LLaMA 7B mostly Q8_0           | BLAS       |         32 | pp 512     |    37.69 ± 0.55 |
| LLaMA 7B mostly Q8_0           | BLAS       |         38 | pp 512     |    37.66 ± 0.44 |
| LLaMA 7B mostly Q8_0           | BLAS       |         40 | pp 512     |    37.62 ± 0.46 |
| LLaMA 7B mostly Q8_0           | BLAS       |         42 | pp 512     |    37.49 ± 0.43 |
| LLaMA 7B mostly Q8_0           | BLAS       |         43 | pp 512     |    37.62 ± 0.70 |
| LLaMA 7B mostly Q8_0           | BLAS       |         44 | pp 512     |    37.63 ± 0.41 |
| LLaMA 7B mostly Q8_0           | BLAS       |         16 | tg 128     |    12.79 ± 0.03 |
| LLaMA 7B mostly Q8_0           | BLAS       |         20 | tg 128     |    13.92 ± 0.12 |
| LLaMA 7B mostly Q8_0           | BLAS       |         21 | tg 128     |    13.68 ± 0.48 |
| LLaMA 7B mostly Q8_0           | BLAS       |         22 | tg 128     |    14.12 ± 0.09 |
| LLaMA 7B mostly Q8_0           | BLAS       |         32 | tg 128     |    14.36 ± 0.08 |
| LLaMA 7B mostly Q8_0           | BLAS       |         38 | tg 128     |    14.47 ± 0.04 |
| LLaMA 7B mostly Q8_0           | BLAS       |         40 | tg 128     |    14.26 ± 0.37 |
| LLaMA 7B mostly Q8_0           | BLAS       |         42 | tg 128     |    14.40 ± 0.16 |
| LLaMA 7B mostly Q8_0           | BLAS       |         43 | tg 128     |    14.36 ± 0.11 |
| LLaMA 7B mostly Q8_0           | BLAS       |         44 | tg 128     |    14.25 ± 0.31 |

build: 9e232f0 (1009)

@xinchun-wang commented Aug 31, 2023

@kiratp @ggerganov

I have tested an Intel AMX llama acceleration lib, based on:

  • pip install mkl==2023.1.0 intel-openmp==2023.1.0 onednn-cpu-gomp==2023.1.0
  • llama_hybrid-0.0.0-cp38-cp38-linux_x86_64.whl

Model: llama2 13b
Platform:

  • CPU: Intel(R) Xeon(R) Gold 6430
  • MEM: 1TB (512G Per Socket)
  • OS kernel: 5.10.0-136.36.0.112.1.oe2203sp1.x86_64

Test method:

  • use different batch sizes (from 1 to 256) for inference and measure token throughput.

Run command:

  • numactl -N 0 python test_hybrid.py --model /home/apps/models/Llama-2-13b-chat-hf/ --outlength 350 --batch 8

The test_hybrid.py script was supplied by the Intel team. It extends LlamaForCausalLM and supports AMX (bf16 & fp16).
INPUT Tokens: Prompt Seq Length = 683

Output result example:

Super-fused BF16+FP16 Llama Infer Latency: 1133237.55 ms
Super-fused BF16+FP16 Llama Generated Token: 350 x 256
Super-fused BF16+FP16 Llama Throughput: 79.07 token/sec

The full results are below:

| batch size | Precision   | Throughput (token/sec) | Output Tokens | Infer Latency (ms) | Mem RSS (GB) |
| ---------: | ----------- | ---------------------: | ------------- | -----------------: | -----------: |
|          1 | FP16 + BF16 |                   7.41 | 324 x 1       |           43748.32 |        118.0 |
|          2 | FP16 + BF16 |                  13.24 | 321 x 2       |           48490.29 |        143.1 |
|          4 | FP16 + BF16 |                  23.48 | 321 x 4       |           54695.89 |        144.7 |
|          8 | FP16 + BF16 |                  38.94 | 321 x 8       |           65944.30 |        148.7 |
|         16 | FP16 + BF16 |                  57.02 | 321 x 16      |           90070.38 |        153.8 |
|         32 | FP16 + BF16 |                  69.56 | 321 x 32      |          147679.44 |        166.1 |
|         64 | FP16 + BF16 |                  76.08 | 321 x 64      |          270014.76 |        198.8 |
|        128 | FP16 + BF16 |                  77.99 | 321 x 128     |          526834.45 |        255.9 |
|        256 | FP16 + BF16 |                  79.07 | 350 x 256     |         1133237.55 |        343.3 |

bf16 consumes much more memory; Intel will fix it.

According to my tests, batch size is very effective for improving throughput. I hope llama.cpp will support batch inference, not only the n_batch param.

I also tested llama2 70B (INT8) on different CPUs. The best result is acceptable, 4.01 token/s (on an AMD 9654P platform), but I want much more throughput:

  • build/bin/main -m /home/apps/models/wizardlm-70b-q8_0.bin -gqa 8 -eps 1e-5 -t 128 -n 1024 --repeat_penalty 1.0 --color -c 512 --temp 0.6 -p "Please introduce me something about vipshop holdings ltd."

@kiratp commented Sep 1, 2023

@ggerganov I know this has been discussed here before but what is your stance on supporting batched inference?

Some rough benchmarking and cost analysis indicates that the llama.cpp server on spot instances would be the most cost-effective way to do inference at the moment.

I won't begin to claim I understand what it would take to add a batch dim through llama.cpp - if you can summarize a bit that would be cool.

@kiratp commented Sep 1, 2023

@xinchun-wang if I'm reading that correctly, your results for batch size 1 are about 1/3 of llama.cpp (118 ms/tok = 8.47 tok/sec). It would seem that Intel's implementation leverages batching well to hide memory bottlenecks, but llama.cpp is better optimized around the throughput limitation.

@xinchun-wang commented

@kiratp
This is my llama.cpp test result, using the same input and the same platform.

run command:
numactl -N 1 build/bin/main -m /home/apps/models/llama2-13b-fp16.bin -t 64 -n 1024  --repeat_penalty 1.0 --color -c 1024 --temp 1 --repeat_penalty 1.1 -p "long text..."

main: build = 0 (unknown)
main: seed = 1693554066
llama.cpp: loading model from /home/apps/models/llama2-13b-f16.bin
llama_model_load_internal: format = ggjt v1 (pre #1405)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 1024
llama_model_load_internal: n_embd = 5120
llama_model_load_internal: n_mult = 6912
llama_model_load_internal: n_head = 40
llama_model_load_internal: n_head_kv = 40
llama_model_load_internal: n_layer = 40
llama_model_load_internal: n_rot = 128
llama_model_load_internal: n_gqa = 1
llama_model_load_internal: rnorm_eps = 5.0e-06
llama_model_load_internal: n_ff = 13824
llama_model_load_internal: freq_base = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype = 1 (mostly F16)
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 0.11 MB
llama_model_load_internal: mem required = 24826.69 MB (+ 800.00 MB per state)
llama_new_context_with_model: kv self size = 800.00 MB
llama_new_context_with_model: compute buffer total size = 111.35 MB

system_info: n_threads = 64 / 128 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 1.000000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 1024, n_batch = 512, n_predict = 1024, n_keep = 0

llama_print_timings: load time = 910.15 ms
llama_print_timings: sample time = 281.63 ms / 478 runs ( 0.59 ms per token, 1697.27 tokens per second)
llama_print_timings: prompt eval time = 72985.45 ms / 1195 tokens ( 61.08 ms per token, 16.37 tokens per second)
llama_print_timings: eval time = 77995.59 ms / 477 runs ( 163.51 ms per token, 6.12 tokens per second)
llama_print_timings: total time = 151404.05 ms

Intel (bf16+fp16) is better than llama.cpp: 7.41 tokens/sec vs 6.12 tokens/sec.

@kiratp commented Sep 1, 2023

I see now - my results were for 7B q4_k_m, whereas you're testing 13B FP16.

@ggerganov (Member) commented

Batched inference will be implemented - see #2813

@WilliamTambellini (Contributor, Author) commented

Quick test using oneDNN on an AWS m7i:

 $ lscpu
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         46 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  8
  On-line CPU(s) list:   0-7
Vendor ID:               GenuineIntel
  Model name:            Intel(R) Xeon(R) Platinum 8488C
  Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves avx_vnni avx512_bf16 wbnoinvd ida arat avx512vbmi umip pku ospke waitpkg avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq rdpid cldemote movdiri movdir64b md_clear serialize amx_bf16 avx512_fp16 amx_tile amx_int8 flush_l1d arch_capabilities

and

$ ./benchdnn --mode=P --matmul --dt=f32 1024x1024:1024x1024
Output template: perf,%engine%,%impl%,%name%,%prb%,%Gops%,%-time%,%-Gflops%,%0time%,%0Gflops%
perf,cpu,brg:avx512_core,,--mode=P --matmul 1024x1024:1024x1024,2.14748,2.60864,823.219,2.75142,780.5
tests:1 passed:1 skipped:0 mistrusted:0 unimplemented:0 invalid_arguments:0 failed:0 listed:0
total perf: min(ms):2.60864 avg(ms):2.75142
$ ./benchdnn --mode=P --matmul --dt=bf16 1024x1024:1024x1024
Output template: perf,%engine%,%impl%,%name%,%prb%,%Gops%,%-time%,%-Gflops%,%0time%,%0Gflops%
perf,cpu,brg:avx512_core_amx,,--mode=P --matmul --dt=bf16:bf16:bf16 1024x1024:1024x1024,2.14748,0.497803,4313.92,0.543487,3951.31
tests:1 passed:1 skipped:0 mistrusted:0 unimplemented:0 invalid_arguments:0 failed:0 listed:0
total perf: min(ms):0.497803 avg(ms):0.543487
$ ./benchdnn --mode=P --matmul --dt=s8 1024x1024:1024x1024
Output template: perf,%engine%,%impl%,%name%,%prb%,%Gops%,%-time%,%-Gflops%,%0time%,%0Gflops%
perf,cpu,brg:avx512_core_amx,,--mode=P --matmul --dt=s8:s8:s8 1024x1024:1024x1024,2.14748,0.215332,9972.89,0.234601,9153.77
tests:1 passed:1 skipped:0 mistrusted:0 unimplemented:0 invalid_arguments:0 failed:0 listed:0
total perf: min(ms):0.215332 avg(ms):0.234601

@xinchun-wang Warning: your Linux kernel (5.10) is perhaps too old to enable AMX.
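
(Context on the kernel side, as a hedged note: mainline Linux gained AMX state management in 5.16, and user space must request permission for the XTILEDATA state via arch_prctl before executing tile instructions. Running a small check like the sketch below is a quick way to see whether a given kernel supports AMX at all.)

    // Hedged sketch: request AMX tile-data permission from the Linux kernel.
    // ARCH_REQ_XCOMP_PERM (0x1023) and XFEATURE_XTILEDATA (18) follow the
    // kernel's asm/prctl.h; the call fails on kernels without AMX support.
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <cstdio>

    #ifndef ARCH_REQ_XCOMP_PERM
    #define ARCH_REQ_XCOMP_PERM 0x1023
    #endif
    #define XFEATURE_XTILEDATA 18

    int main() {
        if (syscall(SYS_arch_prctl, ARCH_REQ_XCOMP_PERM, XFEATURE_XTILEDATA) != 0) {
            std::perror("ARCH_REQ_XCOMP_PERM (kernel too old for AMX?)");
            return 1;
        }
        std::printf("AMX XTILEDATA permission granted\n");
        return 0;
    }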

@leesheen commented Sep 8, 2023

> Quick test using oneDNN on an AWS m7i: […]
>
> @xinchun-wang Warning: your Linux kernel (5.10) is perhaps too old to enable AMX.

The kernel version 5.10.0-136.36.0.112.1.oe2203sp1.x86_64 should be openEuler 22.03 LTS SP1. I also use the same system, and it supports the AMX features.

$ lscpu | grep amx
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cat_l2 cdp_l3 invpcid_single intel_ppin cdp_l2 ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves xfd cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local split_lock_detect avx_vnni avx512_bf16 wbnoinvd dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req hfi avx512vbmi umip pku ospke waitpkg avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq rdpid bus_lock_detect cldemote movdiri movdir64b enqcmd fsrm md_clear serialize tsxldtrk pconfig arch_lbr amx_bf16 avx512_fp16 amx_tile amx_int8 flush_l1d arch_capabilities

@JohnnyOpcode commented

I'm now more motivated to get motherboards for the Intel® Xeon Phi™ Processor 7290 that I have acquired. So much untapped potential with MKL (oneAPI) and both old and new Intel processors.

@hshen14 commented Dec 13, 2023

Here is the RFC (#3965) that we've discussed with @ggerganov. AMX will be part of the support in upcoming PRs. CC @mingfeima @airMeng

@github-actions github-actions bot added the stale label Mar 25, 2024
@github-actions github-actions bot removed the stale label Apr 2, 2024
@github-actions github-actions bot added the stale label May 3, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.
