[ACL] Convolution performance regression #2588

alvoron · 2025-02-04T09:00:52Z

The performance issue has been reproduced on Apple M2 Pro.

Several benchdnn reproducers:

benchdnn --max-ms-per-prb=3e3 --mode=P --conv --reset --allow-enum-tags-only=0 --engine=cpu --dir=FWD_B --alg=direct --dt=f32:f32:f32 --stag=acdb --wtag=any --dtag=acdb --attr-post-ops=eltwise_clip:0:6:1.0 --attr-scratchpad=user mb1_ic16oc96_ih112oh112kh1sh1dh0ph0_iw112ow112kw1sw1dw0pw0
benchdnn --max-ms-per-prb=3e3 --mode=P --conv --reset --allow-enum-tags-only=0 --engine=cpu --dir=FWD_B --alg=direct --dt=f32:f32:f32 --stag=acdb --wtag=any --dtag=acdb --attr-scratchpad=user mb1_ic144oc24_ih56oh56kh1sh1dh0ph0_iw56ow56kw1sw1dw0pw0
benchdnn --max-ms-per-prb=3e3 --mode=P --conv --reset --allow-enum-tags-only=0 --engine=cpu --dir=FWD_B --alg=direct --dt=f32:f32:f32 --stag=acdb --wtag=any --dtag=acdb --attr-post-ops=eltwise_clip:0:6:1.0 --attr-scratchpad=user mb1_ic24oc144_ih56oh56kh1sh1dh0ph0_iw56ow56kw1sw1dw0pw0

oneDNN 3.6.2 with ACL 24.09 gives 0.135 / 0.1 / 0.099 ms, respectively, on Apple M2 Pro.
oneDNN 3.6.2 with ACL 24.11 gives 0.4 / 0.22 / 0.196 ms, respectively, on Apple M2 Pro.
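
To put the numbers above in relative terms, here is a minimal sketch that computes the per-reproducer slowdown. The timings are copied from this comment; the function and variable names are only illustrative.

```python
# Timings in ms for the three reproducers, as reported above (Apple M2 Pro).
acl_24_09 = [0.135, 0.1, 0.099]
acl_24_11 = [0.4, 0.22, 0.196]


def slowdowns(baseline, candidate):
    """Return candidate/baseline ratios; values above 1.0 mean slower."""
    return [c / b for b, c in zip(baseline, candidate)]


ratios = slowdowns(acl_24_09, acl_24_11)
for index, ratio in enumerate(ratios, start=1):
    print(f"reproducer {index}: {ratio:.2f}x the ACL 24.09 time")
```

With these inputs, ACL 24.11 takes roughly two to three times as long as ACL 24.09 on each reproducer.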


milpuz01 · 2025-02-04T23:14:08Z

Hi @alvoron,

How did you build oneDNN 3.6.2 with ACL 24.09? The minimum ACL version required for building oneDNN 3.6.2 is 24.11.1, as set here:

oneDNN/cmake/ACL.cmake

Line 34 in 2eb3dd1

set(ACL_MINIMUM_VERSION "24.11.1")

Also, how many threads are you using to run benchdnn?

If I build oneDNN 3.6.2 with ACL 24.11.1 and oneDNN 3.5.3 with ACL 24.09 on Apple M3 Pro, I do not see the performance regression that you do. There is a small regression with 3.6.2, but given that the runtime is less than 1 ms, I don't think it is significant.

This is the result of running the first reproducer with oneDNN 3.5.3:

perf,cpu,gemm:acl,,--mode=P --conv --allow-enum-tags-only=false --stag=acdb --dtag=acdb --attr-post-ops=clip:0:6 --attr-scratchpad=user mb1ic16ih112oc96oh112kh1ph0,0.0385352,3.45062,0.107125,359.722,0.13201,291.911
tests:1 passed:1 skipped:0 mistrusted:0 unimplemented:0 invalid_arguments:0 failed:0 listed:0
total perf: min(ms):0.107125 avg(ms):0.13201
total: 3.02s; fill: 0.01s (0%);

And this is when running with oneDNN 3.6.2:

perf,cpu,gemm:acl,,--mode=P --conv --allow-enum-tags-only=false --stag=acdb --dtag=acdb --attr-post-ops=clip:0:6 --attr-scratchpad=user mb1ic16ih112oc96oh112kh1ph0,0.0385352,4.19671,0.114708,335.941,0.146712,262.658
tests:1 passed:1 skipped:0 mistrusted:0 unimplemented:0 invalid_arguments:0 failed:0 listed:0
total perf: min(ms):0.114708 avg(ms):0.146712
total: 3.02s; fill: 0.01s (0%);
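
To quantify the "small regression" between the two runs above, here is a rough sketch that pulls the min and avg times out of the two `total perf:` lines and reports the relative change. The regular expression and helper names are illustrative assumptions, not part of benchdnn.

```python
import re

# Matches the summary line benchdnn printed in the two runs above.
TOTAL_RE = re.compile(r"total perf: min\(ms\):([\d.]+) avg\(ms\):([\d.]+)")


def parse_total(line):
    """Return (min_ms, avg_ms) from a 'total perf:' summary line."""
    match = TOTAL_RE.search(line)
    if match is None:
        raise ValueError(f"not a 'total perf' line: {line!r}")
    return float(match.group(1)), float(match.group(2))


def relative_change(old, new):
    """Percentage change from old to new; positive means slower."""
    return (new - old) / old * 100.0


old_min, old_avg = parse_total("total perf: min(ms):0.107125 avg(ms):0.13201")
new_min, new_avg = parse_total("total perf: min(ms):0.114708 avg(ms):0.146712")

min_change = relative_change(old_min, new_min)
avg_change = relative_change(old_avg, new_avg)
print(f"min: {min_change:+.1f}%  avg: {avg_change:+.1f}%")
```

For these two runs the change is about 7% on the minimum time and about 11% on the average time.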

Serenagirl · 2025-02-11T09:18:53Z

perf,cpu,gemm:acl,,--mode=P --conv --allow-enum-tags-only=false --stag=acdb --dtag=acdb --attr-post-ops=clip:0:6 --attr-scratchpad=user mb1ic16ih112oc96oh112kh1ph0,0.0385352,3.45062,0.107125,359.722,0.13201,291.911

With the first reproducer:

./benchdnn --max-ms-per-prb=3e3 --mode=P --conv --reset --allow-enum-tags-only=0 --engine=cpu --dir=FWD_B --alg=direct --dt=f32:f32:f32 --stag=acdb --wtag=any --dtag=acdb --attr-post-ops=eltwise_clip:0:6:1.0 --attr-scratchpad=user mb1_ic16oc96_ih112oh112kh1sh1dh0ph0_iw112ow112kw1sw1dw0pw0

@vpirogov, why do I get the result below on AArch64 with oneDNN 3.4 and ACL 24.11?

Output template: perf,%engine%,%impl%,%name%,%prb%,%Gops%,%+ctime%,%-time%,%-Gflops%,%0time%,%0Gflops%
perf,cpu,gemm:ref,,--mode=P --conv --allow-enum-tags-only=false --stag=acdb --dtag=acdb --attr-post-ops=clip:0:6 --attr-scratchpad=user mb1ic16ih112oc96oh112kh1ph0,0.0385352,1.80981,1.04932,36.7241,2.24633,17.1547
tests:1 passed:1 skipped:0 mistrusted:0 unimplemented:0 invalid_arguments:0 failed:0 listed:0
total perf: min(ms):1.04932 avg(ms):2.24633
total: 3.02s; fill: 0.01s (0%);
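
The `Output template` line above names each comma-separated field of the `perf` line. As a minimal sketch, assuming the problem string itself contains no commas (true for the lines in this thread), the fields can be mapped to names to check which implementation was dispatched and to cross-check the reported Gflops. The helper names are hypothetical.

```python
TEMPLATE = ("perf,%engine%,%impl%,%name%,%prb%,%Gops%,%+ctime%,"
            "%-time%,%-Gflops%,%0time%,%0Gflops%")

PERF_LINE = ("perf,cpu,gemm:ref,,--mode=P --conv --allow-enum-tags-only=false "
             "--stag=acdb --dtag=acdb --attr-post-ops=clip:0:6 "
             "--attr-scratchpad=user mb1ic16ih112oc96oh112kh1ph0,"
             "0.0385352,1.80981,1.04932,36.7241,2.24633,17.1547")


def parse_perf(line, template=TEMPLATE):
    """Map a benchdnn perf line to {field name: string value}."""
    names = [part.strip("%") for part in template.split(",")[1:]]
    values = line.split(",")
    if values[0] != "perf" or len(values) != len(names) + 1:
        raise ValueError("line does not match the output template")
    return dict(zip(names, values[1:]))


row = parse_perf(PERF_LINE)
# '-time' lines up with min(ms) in the 'total perf' line of the same run.
min_ms = float(row["-time"])
gflops = float(row["Gops"]) / (min_ms / 1000.0)
print(row["impl"], f"{gflops:.2f} Gflops at min time")
```

Here the `impl` field is `gemm:ref`, whereas the runs posted earlier for the same problem show `gemm:acl`.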

vpirogov · 2025-02-11T15:30:55Z

@Serenagirl, oneDNN v3.4 is rather old; I'd suggest investigating issues with the current version instead. If you want to understand why the gemm:ref implementation is dispatched in your case, try running benchdnn with ONEDNN_VERBOSE=dispatch.
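
As a rough sketch of how such a dispatch log could be sifted, the snippet below collects the reasons ACL gave for rejecting a primitive. The sample lines are copied from a oneDNN v3.4 log posted later in this thread; the line layout may differ in other versions, and the helper name is hypothetical.

```python
from collections import Counter

SAMPLE_LOG = """\
onednn_verbose,info,oneDNN v3.4.0 (commit N/A)
onednn_verbose,info,cpu,isa:AArch64 SVE (256 bits)
onednn_verbose,cpu,acl,unsupported: in has_opt_impl src/cpu/operators/internal/CpuGemmAssemblyDispatch.cpp:848: We could not find an optimized kernel for F32 input
onednn_verbose,cpu,acl,unsupported: in has_opt_impl src/cpu/operators/internal/CpuGemmAssemblyDispatch.cpp:848: We could not find an optimized kernel for F32 input
"""


def acl_rejections(verbose_output):
    """Return the 'unsupported: ...' messages reported by ACL, in order."""
    reasons = []
    for line in verbose_output.splitlines():
        # Assumed layout: onednn_verbose,<engine>,acl,<message>
        parts = line.split(",", 3)
        if (len(parts) == 4 and parts[0] == "onednn_verbose"
                and parts[2] == "acl" and parts[3].startswith("unsupported")):
            reasons.append(parts[3])
    return reasons


reasons = acl_rejections(SAMPLE_LOG)
for message, count in Counter(reasons).items():
    print(f"{count}x {message}")
```

Grouping identical messages makes it easier to see which ACL check caused the fallback when the log is long.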

alvoron · 2025-02-21T18:54:44Z

@Serenagirl
I had a sync with @morgolock, where Pablo mentioned that the patch that affected convolution was reverted.
Indeed, I no longer see the heavy regressions I saw initially; however, some models are still affected, such as ssdlite-mobilenet-v3-small-320.

The following benchdnn config gives 0.038 ms on ACL v25.02 and 0.028 ms on ACL v24.09:

benchdnn --cold-cache=all --max-ms-per-prb=3e3 --mode=P --conv --reset --allow-enum-tags-only=0 --engine=cpu --dir=FWD_B --alg=direct --dt=f32:f32:f32 --stag=acdb --wtag=any --dtag=acdb --attr-scratchpad=user mb1_ic144oc144_ih2oh2kh3sh1dh0ph1_iw2ow2kw3sw1dw0pw1

Could you please try this reproducer on v25.02 and v24.09 on Apple silicon?
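
For readers unfamiliar with the benchdnn problem descriptor, here is a minimal sketch that splits it into its fields (mb = minibatch, ic/oc = channels, ih/oh = input/output height, kh = kernel, sh = stride, dh = dilation, ph = padding, and the same for width) and estimates the operation count. It assumes no groups and counts only kernel taps that land inside the input. That counting rule is inferred from the fact that it reproduces the Gops values benchdnn prints in this thread (0.000663552 and 0.0385352), not taken from benchdnn documentation. All function names are hypothetical.

```python
import re


def parse_desc(desc):
    """Split 'mb1_ic144oc144_ih2...' into {'mb': 1, 'ic': 144, ...}."""
    pairs = re.findall(r"([a-z]+)(\d+)", desc.replace("_", ""))
    return {key: int(value) for key, value in pairs}


def valid_taps(i, o, k, s, d, p):
    """Count (output, kernel tap) pairs that read a real input element."""
    count = 0
    for out in range(o):
        for tap in range(k):
            pos = out * s - p + tap * (d + 1)  # dh0 means a dense kernel
            if 0 <= pos < i:
                count += 1
    return count


def conv_gops(desc):
    """Estimated multiply-add operation count in Gops (2 ops per MAC)."""
    p = parse_desc(desc)
    h = valid_taps(p["ih"], p["oh"], p["kh"], p["sh"], p["dh"], p["ph"])
    w = valid_taps(p["iw"], p["ow"], p["kw"], p["sw"], p["dw"], p["pw"])
    return 2 * p["mb"] * p["ic"] * p["oc"] * h * w / 1e9


small = conv_gops("mb1_ic144oc144_ih2oh2kh3sh1dh0ph1_iw2ow2kw3sw1dw0pw1")
large = conv_gops("mb1_ic16oc96_ih112oh112kh1sh1dh0ph0_iw112ow112kw1sw1dw0pw0")
print(f"{small:.9f} Gops, {large:.7f} Gops")
```

The 3x3 problem above is tiny (well under a million operations), so fixed per-call overhead is likely to make up a large share of timings in the 0.03 ms range.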

Serenagirl · 2025-02-22T02:03:44Z

Sorry, I tested on AArch64, not Apple silicon. My results with oneDNN 3.4 alone were:

Output template: perf,%engine%,%impl%,%name%,%prb%,%Gops%,%+ctime%,%-time%,%-Gflops%,%0time%,%0Gflops%
perf,cpu,gemm:ref,,--mode=P --conv --allow-enum-tags-only=false --cold-cache=all --stag=acdb --dtag=acdb --attr-scratchpad=user mb1ic144ih2oc144oh2kh3ph1,0.000663552,0.450195,4.73242,0.140214,5.52453,0.12011
tests:1 passed:1 skipped:0 mistrusted:0 unimplemented:0 invalid_arguments:0 failed:0 listed:0
total perf: min(ms):4.73242 avg(ms):5.52453

The results with ACL 23.11 were:

Output template: perf,%engine%,%impl%,%name%,%prb%,%Gops%,%+ctime%,%-time%,%-Gflops%,%0time%,%0Gflops%
perf,cpu,gemm:ref,,--mode=P --conv --allow-enum-tags-only=false --cold-cache=all --stag=acdb --dtag=acdb --attr-scratchpad=user mb1ic144ih2oc144oh2kh3ph1,0.000663552,1.94092,4.5459,0.145967,5.51199,0.120383
tests:1 passed:1 skipped:0 mistrusted:0 unimplemented:0 invalid_arguments:0 failed:0 listed:0
total perf: min(ms):4.5459 avg(ms):5.51199

When I set --stag=any --dtag=any, I got about 2.9 ms on average:

ONEDNN_VERBOSE=dispatch ./benchdnn --cold-cache=all --max-ms-per-prb=3e3 --mode=P --conv --reset --allow-enum-tags-only=0 --engine=cpu --dir=FWD_B --alg=direct --dt=f32:f32:f32 --stag=any --wtag=any --dtag=any --attr-scratchpad=user mb1_ic144oc144_ih2oh2kh3sh1dh0ph1_iw2ow2kw3sw1dw0pw1
onednn_verbose,info,oneDNN v3.4.0 (commit N/A)
onednn_verbose,info,cpu,runtime:OpenMP,nthr:128
onednn_verbose,info,cpu,isa:AArch64 SVE (256 bits)
onednn_verbose,info,gpu,runtime:none
onednn_verbose,info,graph,backend,0:dnnl_backend
onednn_verbose,primitive,info,template:operation,engine,primitive,implementation,prop_kind,memory_descriptors,attributes,auxiliary,problem_desc,exec_time
onednn_verbose,graph,info,template:operation,engine,partition_id,partition_kind,op_names,data_formats,logical_tensors,fpmath_mode,backend,exec_time
onednn_verbose,cpu,acl,unsupported: in has_opt_impl src/cpu/operators/internal/CpuGemmAssemblyDispatch.cpp:848: We could not find an optimized kernel for F32 input
onednn_verbose,cpu,acl,unsupported: in has_opt_impl src/cpu/operators/internal/CpuGemmAssemblyDispatch.cpp:848: We could not find an optimized kernel for F32 input
Output template: perf,%engine%,%impl%,%name%,%prb%,%Gops%,%+ctime%,%-time%,%-Gflops%,%0time%,%0Gflops%
perf,cpu,gemm:ref,,--mode=P --conv --allow-enum-tags-only=false --cold-cache=all --attr-scratchpad=user mb1ic144ih2oc144oh2kh3ph1,0.000663552,2.06494,1.4436,0.45965,2.94146,0.225586
tests:1 passed:1 skipped:0 mistrusted:0 unimplemented:0 invalid_arguments:0 failed:0 listed:0
total perf: min(ms):1.4436 avg(ms):2.94146
total: 3.23s; fill: 0.01s (0%);

But when I tested the first reproducer with ACL:

./benchdnn --max-ms-per-prb=3e3 --mode=P --conv --reset --allow-enum-tags-only=0 --engine=cpu --dir=FWD_B --alg=direct --dt=f32:f32:f32 --stag=acdb --wtag=any --dtag=acdb --attr-post-ops=eltwise_clip:0:6:1.0 --attr-scratchpad=user mb1_ic16oc96_ih112oh112kh1sh1dh0ph0_iw112ow112kw1sw1dw0pw0
Output template: perf,%engine%,%impl%,%name%,%prb%,%Gops%,%+ctime%,%-time%,%-Gflops%,%0time%,%0Gflops%
perf,cpu,gemm:ref,,--mode=P --conv --allow-enum-tags-only=false --stag=acdb --dtag=acdb --attr-post-ops=clip:0:6 --attr-scratchpad=user mb1ic16ih112oc96oh112kh1ph0,0.0385352,2.12256,0.216309,178.149,0.377764,102.008
tests:1 passed:1 skipped:0 mistrusted:0 unimplemented:0 invalid_arguments:0 failed:0 listed:0
total perf: min(ms):0.216309 avg(ms):0.377764
total: 3.02s; fill: 0.01s (0%);

the results were normal, although gemm:ref was still dispatched in the ACL build. I'll analyze it.

alvoron added the sighting Suspicious library behavior. Should be promoted to a bug when confirmed label Feb 4, 2025

alvoron mentioned this issue Feb 4, 2025

Convolution (GEMM and Winograd) regression ARM-software/ComputeLibrary#1157

Open

vpirogov added help wanted platform:cpu-aarch64 Codeowner: @oneapi-src/onednn-cpu-aarch64 labels Feb 4, 2025

vpirogov assigned milpuz01 Feb 4, 2025

vpirogov removed the help wanted label Feb 4, 2025
