[ACL] Convolution performance regression #2588

Open
alvoron opened this issue Feb 4, 2025 · 5 comments
Labels
platform:cpu-aarch64 Codeowner: @oneapi-src/onednn-cpu-aarch64 sighting Suspicious library behavior. Should be promoted to a bug when confirmed

@alvoron
Contributor

alvoron commented Feb 4, 2025

The performance issue has been reproduced on Apple M2 Pro.

Several benchdnn reproducers:

benchdnn --max-ms-per-prb=3e3 --mode=P --conv --reset --allow-enum-tags-only=0 --engine=cpu --dir=FWD_B --alg=direct --dt=f32:f32:f32 --stag=acdb --wtag=any --dtag=acdb --attr-post-ops=eltwise_clip:0:6:1.0 --attr-scratchpad=user mb1_ic16oc96_ih112oh112kh1sh1dh0ph0_iw112ow112kw1sw1dw0pw0
benchdnn --max-ms-per-prb=3e3 --mode=P --conv --reset --allow-enum-tags-only=0 --engine=cpu --dir=FWD_B --alg=direct --dt=f32:f32:f32 --stag=acdb --wtag=any --dtag=acdb --attr-scratchpad=user mb1_ic144oc24_ih56oh56kh1sh1dh0ph0_iw56ow56kw1sw1dw0pw0
benchdnn --max-ms-per-prb=3e3 --mode=P --conv --reset --allow-enum-tags-only=0 --engine=cpu --dir=FWD_B --alg=direct --dt=f32:f32:f32 --stag=acdb --wtag=any --dtag=acdb --attr-post-ops=eltwise_clip:0:6:1.0 --attr-scratchpad=user mb1_ic24oc144_ih56oh56kh1sh1dh0ph0_iw56ow56kw1sw1dw0pw0

oneDNN 3.6.2 with ACL 24.09 gives 0.135 / 0.1 / 0.099 ms, respectively, on Apple M2 Pro.
oneDNN 3.6.2 with ACL 24.11 gives 0.4 / 0.22 / 0.196 ms, respectively, on Apple M2 Pro.
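
To put a number on the regression, the timings above can be turned into slowdown factors. This is only a minimal sketch over the six values quoted in this report; the variable and function names are illustrative and not part of benchdnn.

```python
# Timings in ms copied from the report above, in reproducer order.
acl_24_09 = [0.135, 0.1, 0.099]
acl_24_11 = [0.4, 0.22, 0.196]


def slowdown(old_ms, new_ms):
    """Return new/old, so 2.0 means the newer build takes twice as long."""
    return new_ms / old_ms


ratios = [slowdown(old, new) for old, new in zip(acl_24_09, acl_24_11)]
for index, ratio in enumerate(ratios, start=1):
    print(f"reproducer {index}: {ratio:.2f}x slower with ACL 24.11")
```

By this measure all three reproducers are roughly 2x to 3x slower with ACL 24.11.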

@alvoron alvoron added the sighting Suspicious library behavior. Should be promoted to a bug when confirmed label Feb 4, 2025
@vpirogov vpirogov added help wanted platform:cpu-aarch64 Codeowner: @oneapi-src/onednn-cpu-aarch64 labels Feb 4, 2025
@milpuz01
Contributor

milpuz01 commented Feb 4, 2025

Hi @alvoron,

How did you build oneDNN 3.6.2 with ACL 24.09? The minimum ACL version required to build oneDNN 3.6.2 is 24.11.1, as set here:

set(ACL_MINIMUM_VERSION "24.11.1")

Also, how many threads are you using to run benchdnn?

If I build oneDNN 3.6.2 with ACL 24.11.1 and oneDNN 3.5.3 with ACL 24.09 on Apple M3 Pro, I do not see the performance regression that you do. There is a small regression with 3.6.2, but given that the runtime is less than 1 ms, I don't think it is significant.

This is the result of running the first reproducer with oneDNN 3.5.3:

perf,cpu,gemm:acl,,--mode=P --conv --allow-enum-tags-only=false --stag=acdb --dtag=acdb --attr-post-ops=clip:0:6 --attr-scratchpad=user mb1ic16ih112oc96oh112kh1ph0,0.0385352,3.45062,0.107125,359.722,0.13201,291.911
tests:1 passed:1 skipped:0 mistrusted:0 unimplemented:0 invalid_arguments:0 failed:0 listed:0
total perf: min(ms):0.107125 avg(ms):0.13201
total: 3.02s; fill: 0.01s (0%);

And this is when running with oneDNN 3.6.2:

perf,cpu,gemm:acl,,--mode=P --conv --allow-enum-tags-only=false --stag=acdb --dtag=acdb --attr-post-ops=clip:0:6 --attr-scratchpad=user mb1ic16ih112oc96oh112kh1ph0,0.0385352,4.19671,0.114708,335.941,0.146712,262.658
tests:1 passed:1 skipped:0 mistrusted:0 unimplemented:0 invalid_arguments:0 failed:0 listed:0
total perf: min(ms):0.114708 avg(ms):0.146712
total: 3.02s; fill: 0.01s (0%);
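
For comparing the two runs above, the perf lines can be parsed and diffed with a few lines of Python. This is a rough sketch that assumes the field order of the default output template (perf,%engine%,%impl%,%name%,%prb%,%Gops%,%+ctime%,%-time%,%-Gflops%,%0time%,%0Gflops%); `PerfLine` and `parse_perf` are hypothetical helper names, and the field labels are my reading of the template rather than an official schema.

```python
from typing import NamedTuple


class PerfLine(NamedTuple):
    engine: str
    impl: str
    name: str
    prb: str
    gops: float
    ctime_ms: float    # %+ctime%
    min_ms: float      # %-time%
    min_gflops: float  # %-Gflops%
    avg_ms: float      # %0time%
    avg_gflops: float  # %0Gflops%


def parse_perf(line: str) -> PerfLine:
    # Peel the six numeric fields off the right first, so that commas
    # inside the problem description cannot shift them.
    head, *nums = line.strip().rsplit(",", 6)
    tag, engine, impl, name, prb = head.split(",", 4)
    if tag != "perf":
        raise ValueError("not a perf line")
    return PerfLine(engine, impl, name, prb, *map(float, nums))


PRB = ("--mode=P --conv --allow-enum-tags-only=false --stag=acdb --dtag=acdb "
       "--attr-post-ops=clip:0:6 --attr-scratchpad=user mb1ic16ih112oc96oh112kh1ph0")
v353 = parse_perf(f"perf,cpu,gemm:acl,,{PRB},0.0385352,3.45062,0.107125,359.722,0.13201,291.911")
v362 = parse_perf(f"perf,cpu,gemm:acl,,{PRB},0.0385352,4.19671,0.114708,335.941,0.146712,262.658")

for label, old, new in [("min", v353.min_ms, v362.min_ms),
                        ("avg", v353.avg_ms, v362.avg_ms)]:
    print(f"{label}: {old:.6f} ms -> {new:.6f} ms ({(new / old - 1) * 100:+.1f}%)")
```

On these two lines the difference works out to roughly 7% on the minimum time and 11% on the average time.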

@Serenagirl

Serenagirl commented Feb 11, 2025

perf,cpu,gemm:acl,,--mode=P --conv --allow-enum-tags-only=false --stag=acdb --dtag=acdb --attr-post-ops=clip:0:6 --attr-scratchpad=user mb1ic16ih112oc96oh112kh1ph0,0.0385352,3.45062,0.107125,359.722,0.13201,291.911

Running the first reproducer:

./benchdnn --max-ms-per-prb=3e3 --mode=P --conv --reset --allow-enum-tags-only=0 --engine=cpu --dir=FWD_B --alg=direct --dt=f32:f32:f32 --stag=acdb --wtag=any --dtag=acdb --attr-post-ops=eltwise_clip:0:6:1.0 --attr-scratchpad=user mb1_ic16oc96_ih112oh112kh1sh1dh0ph0_iw112ow112kw1sw1dw0pw0

@vpirogov, why do I get the result below? o(╥﹏╥)o This is on AArch64 with oneDNN 3.4 and ACL 24.11:

Output template: perf,%engine%,%impl%,%name%,%prb%,%Gops%,%+ctime%,%-time%,%-Gflops%,%0time%,%0Gflops%
perf,cpu,gemm:ref,,--mode=P --conv --allow-enum-tags-only=false --stag=acdb --dtag=acdb --attr-post-ops=clip:0:6 --attr-scratchpad=user mb1ic16ih112oc96oh112kh1ph0,0.0385352,1.80981,1.04932,36.7241,2.24633,17.1547
tests:1 passed:1 skipped:0 mistrusted:0 unimplemented:0 invalid_arguments:0 failed:0 listed:0
total perf: min(ms):1.04932 avg(ms):2.24633
total: 3.02s; fill: 0.01s (0%);

@vpirogov
Member

@Serenagirl, oneDNN v3.4 is rather old; I'd suggest investigating issues with the current version instead. If you want to understand why the gemm:ref implementation is dispatched in your case, try running benchdnn with ONEDNN_VERBOSE=dispatch.

@alvoron
Contributor Author

alvoron commented Feb 21, 2025

@Serenagirl
I had a sync with @morgolock, where Pablo mentioned that the patch that affected convolution was reverted.
Indeed, I no longer see regressions as heavy as the ones I saw initially; however, some models are still affected, such as ssdlite-mobilenet-v3-small-320.

The following benchdnn config gives 0.038 ms on ACL v25.02 and 0.028 ms on ACL v24.09:

benchdnn --cold-cache=all --max-ms-per-prb=3e3 --mode=P --conv --reset --allow-enum-tags-only=0 --engine=cpu --dir=FWD_B --alg=direct --dt=f32:f32:f32 --stag=acdb --wtag=any --dtag=acdb --attr-scratchpad=user mb1_ic144oc144_ih2oh2kh3sh1dh0ph1_iw2ow2kw3sw1dw0pw1

Could you please try this reproducer on v25.02 and v24.09 on Apple silicon?
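
For anyone reading the problem descriptor by hand, here is a small sketch that decodes it and recomputes the Gops figure benchdnn reports for this shape. It assumes that benchdnn counts two operations per multiply-accumulate, skips kernel taps that fall into padding, and treats dh0/dw0 as "no dilation"; these assumptions reproduce the Gops values printed in this thread, but I have not checked them against the benchdnn sources. `parse_desc`, `valid_taps` and `conv_gops` are hypothetical names, and groups are ignored.

```python
import re


def parse_desc(desc):
    """Turn 'mb1_ic144oc144_ih2...' into {'mb': 1, 'ic': 144, ...}."""
    return {key: int(val) for key, val in re.findall(r"([a-z]+)(\d+)", desc)}


def valid_taps(i, o, k, s, d, p):
    """Count (output index, kernel index) pairs that hit a real input element."""
    count = 0
    for out in range(o):
        for ker in range(k):
            pos = out * s - p + ker * (d + 1)
            if 0 <= pos < i:
                count += 1
    return count


def conv_gops(desc):
    """Giga-operations for a forward convolution, ignoring padded taps."""
    p = parse_desc(desc)
    taps_h = valid_taps(p["ih"], p["oh"], p["kh"], p["sh"], p["dh"], p["ph"])
    taps_w = valid_taps(p["iw"], p["ow"], p["kw"], p["sw"], p["dw"], p["pw"])
    return 2 * p["mb"] * p["oc"] * p["ic"] * taps_h * taps_w / 1e9


small = "mb1_ic144oc144_ih2oh2kh3sh1dh0ph1_iw2ow2kw3sw1dw0pw1"
print(parse_desc(small))
print(f"{conv_gops(small):.9f} Gops")
```

The point of the exercise is that this reproducer is tiny: with a 2x2 input and padding 1, most of the 3x3 kernel taps land in padding, so the per-call time is likely dominated by fixed overheads rather than arithmetic.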

@Serenagirl

Serenagirl commented Feb 22, 2025

Sorry, I tested on AArch64, not Apple silicon. My test results with oneDNN 3.4 alone were:

Output template: perf,%engine%,%impl%,%name%,%prb%,%Gops%,%+ctime%,%-time%,%-Gflops%,%0time%,%0Gflops%
perf,cpu,gemm:ref,,--mode=P --conv --allow-enum-tags-only=false --cold-cache=all --stag=acdb --dtag=acdb --attr-scratchpad=user mb1ic144ih2oc144oh2kh3ph1,0.000663552,0.450195,4.73242,0.140214,5.52453,0.12011
tests:1 passed:1 skipped:0 mistrusted:0 unimplemented:0 invalid_arguments:0 failed:0 listed:0
total perf: min(ms):4.73242 avg(ms):5.52453

The test results with ACL 23.11 were:

Output template: perf,%engine%,%impl%,%name%,%prb%,%Gops%,%+ctime%,%-time%,%-Gflops%,%0time%,%0Gflops%
perf,cpu,gemm:ref,,--mode=P --conv --allow-enum-tags-only=false --cold-cache=all --stag=acdb --dtag=acdb --attr-scratchpad=user mb1ic144ih2oc144oh2kh3ph1,0.000663552,1.94092,4.5459,0.145967,5.51199,0.120383
tests:1 passed:1 skipped:0 mistrusted:0 unimplemented:0 invalid_arguments:0 failed:0 listed:0
total perf: min(ms):4.5459 avg(ms):5.51199

When I set --stag=any --dtag=any, the average time was about 2.9 ms:

ONEDNN_VERBOSE=dispatch ./benchdnn --cold-cache=all --max-ms-per-prb=3e3 --mode=P --conv --reset --allow-enum-tags-only=0 --engine=cpu --dir=FWD_B --alg=direct --dt=f32:f32:f32 --stag=any --wtag=any --dtag=any --attr-scratchpad=user mb1_ic144oc144_ih2oh2kh3sh1dh0ph1_iw2ow2kw3sw1dw0pw1
onednn_verbose,info,oneDNN v3.4.0 (commit N/A)
onednn_verbose,info,cpu,runtime:OpenMP,nthr:128
onednn_verbose,info,cpu,isa:AArch64 SVE (256 bits)
onednn_verbose,info,gpu,runtime:none
onednn_verbose,info,graph,backend,0:dnnl_backend
onednn_verbose,primitive,info,template:operation,engine,primitive,implementation,prop_kind,memory_descriptors,attributes,auxiliary,problem_desc,exec_time
onednn_verbose,graph,info,template:operation,engine,partition_id,partition_kind,op_names,data_formats,logical_tensors,fpmath_mode,backend,exec_time
onednn_verbose,cpu,acl,unsupported: in has_opt_impl src/cpu/operators/internal/CpuGemmAssemblyDispatch.cpp:848: We could not find an optimized kernel for F32 input
onednn_verbose,cpu,acl,unsupported: in has_opt_impl src/cpu/operators/internal/CpuGemmAssemblyDispatch.cpp:848: We could not find an optimized kernel for F32 input
Output template: perf,%engine%,%impl%,%name%,%prb%,%Gops%,%+ctime%,%-time%,%-Gflops%,%0time%,%0Gflops%
perf,cpu,gemm:ref,,--mode=P --conv --allow-enum-tags-only=false --cold-cache=all --attr-scratchpad=user mb1ic144ih2oc144oh2kh3ph1,0.000663552,2.06494,1.4436,0.45965,2.94146,0.225586
tests:1 passed:1 skipped:0 mistrusted:0 unimplemented:0 invalid_arguments:0 failed:0 listed:0
total perf: min(ms):1.4436 avg(ms):2.94146
total: 3.23s; fill: 0.01s (0%);

But when I tested the first reproducer with ACL:

./benchdnn --max-ms-per-prb=3e3 --mode=P --conv --reset --allow-enum-tags-only=0 --engine=cpu --dir=FWD_B --alg=direct --dt=f32:f32:f32 --stag=acdb --wtag=any --dtag=acdb --attr-post-ops=eltwise_clip:0:6:1.0 --attr-scratchpad=user mb1_ic16oc96_ih112oh112kh1sh1dh0ph0_iw112ow112kw1sw1dw0pw0
Output template: perf,%engine%,%impl%,%name%,%prb%,%Gops%,%+ctime%,%-time%,%-Gflops%,%0time%,%0Gflops%
perf,cpu,gemm:ref,,--mode=P --conv --allow-enum-tags-only=false --stag=acdb --dtag=acdb --attr-post-ops=clip:0:6 --attr-scratchpad=user mb1ic16ih112oc96oh112kh1ph0,0.0385352,2.12256,0.216309,178.149,0.377764,102.008
tests:1 passed:1 skipped:0 mistrusted:0 unimplemented:0 invalid_arguments:0 failed:0 listed:0
total perf: min(ms):0.216309 avg(ms):0.377764
total: 3.02s; fill: 0.01s (0%);

These results were normal, although the build with ACL still dispatched gemm:ref. I'll analyze it.
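
As a starting point for that analysis, the ONEDNN_VERBOSE=dispatch output can be filtered down to the lines that explain why an implementation was skipped. The sketch below assumes the line layout seen in the log earlier in this comment (onednn_verbose,<engine>,<component>,unsupported: <message>); other oneDNN versions may format these lines differently, and `unsupported_reasons` is a hypothetical helper name.

```python
from collections import Counter


def unsupported_reasons(log):
    """Count 'unsupported' verbose lines, keyed by (component, message)."""
    counts = Counter()
    for line in log.splitlines():
        fields = line.strip().split(",", 3)
        if (len(fields) == 4
                and fields[0] == "onednn_verbose"
                and fields[3].startswith("unsupported")):
            counts[(fields[2], fields[3])] += 1
    return counts


# Excerpt of the verbose output posted in this comment.
sample = """\
onednn_verbose,info,oneDNN v3.4.0 (commit N/A)
onednn_verbose,info,cpu,runtime:OpenMP,nthr:128
onednn_verbose,cpu,acl,unsupported: in has_opt_impl src/cpu/operators/internal/CpuGemmAssemblyDispatch.cpp:848: We could not find an optimized kernel for F32 input
onednn_verbose,cpu,acl,unsupported: in has_opt_impl src/cpu/operators/internal/CpuGemmAssemblyDispatch.cpp:848: We could not find an optimized kernel for F32 input
"""

for (component, message), count in unsupported_reasons(sample).items():
    print(f"{count}x [{component}] {message}")
```

In the excerpt, the only reason reported is ACL failing to find an optimized F32 kernel, which would be consistent with the fallback to gemm:ref.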
