2-Pass Sdpa Inference Kernel #1597
Conversation
Amazing speedup!! LGTM!
        keys += blocks * stride;
        values += blocks * stride;
    }
    threadgroup_barrier(mem_flags::mem_threadgroup);
That barrier may be unnecessary?
This is actually a great catch. It should be inside the loop, and the same goes for the 1-pass kernel. The reasoning is that it makes sure the whole threadgroup reads each block at the same time, so one simdgroup cannot just run ahead. I had seen it provide a small improvement, but then in one of the edits it probably got restored back outside the loop.
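For illustration, a minimal sketch of the placement described here, with the barrier moved inside the streaming loop (the follow-up below ultimately removes the barrier instead). The keys/values/blocks/stride names follow the diff above; the kernel signature and loop body are stand-ins, not the PR's actual kernel:

    #include <metal_stdlib>
    using namespace metal;

    kernel void sdpa_block_stream_sketch(
        const device float* keys   [[buffer(0)]],
        const device float* values [[buffer(1)]],
        device float* out          [[buffer(2)]],
        constant int& num_blocks   [[buffer(3)]], // hypothetical loop bound
        constant int& blocks       [[buffer(4)]],
        constant int& stride       [[buffer(5)]],
        uint tid [[thread_position_in_threadgroup]]) {
      float acc = 0.0f;
      for (int i = 0; i < num_blocks; i++) {
        // Inside the loop: the whole threadgroup reads each block together,
        // so no simdgroup can run ahead of the others.
        threadgroup_barrier(mem_flags::mem_threadgroup);

        acc += keys[tid] * values[tid]; // stand-in for the real block computation

        // Advance to the next block of the K/V cache, as in the diff.
        keys += blocks * stride;
        values += blocks * stride;
      }
      out[tid] = acc; // stand-in write, assuming one threadgroup per output row
    }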
I take it back... I tested again and I get mixed results, which is probably why I reverted it in some previous commit. I will remove it since, indeed, conceptually it needn't be there.
This PR aims to improve long-context generation performance by increasing parallelization for large numbers of keys/values. There are mild benefits for smaller machines and very significant benefits for Ultra machines.
The main benefit for small machines stems from accessing the keys and values in a more cache-friendly way when there is GQA; for the Ultra machines it stems from launching more threadgroups, which allows using more of the chip.
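To sketch the 2-pass idea itself: pass 1 processes disjoint partitions of the keys/values in independent threadgroups, and pass 2 combines the partial results. Below is a sketch of the combine step in a flash-decoding style, assuming pass 1 wrote each partition's normalized output plus its running logit max and softmax sum; all buffer names and layouts here are assumptions, not the PR's actual interface:

    #include <metal_stdlib>
    using namespace metal;

    kernel void sdpa_2pass_combine_sketch(
        const device float* partial_out [[buffer(0)]], // [n_partitions, head_dim], normalized per partition
        const device float* partial_max [[buffer(1)]], // [n_partitions], running logit max
        const device float* partial_sum [[buffer(2)]], // [n_partitions], softmax sum at that max
        device float* out               [[buffer(3)]], // [head_dim]
        constant uint& n_partitions     [[buffer(4)]],
        constant uint& head_dim         [[buffer(5)]],
        uint d [[thread_position_in_grid]]) {
      if (d >= head_dim) return;

      // Global max across partitions keeps the rescaling numerically stable.
      float m = -INFINITY;
      for (uint p = 0; p < n_partitions; p++) {
        m = max(m, partial_max[p]);
      }

      // softmax(QK^T)V over the full sequence is recovered by rescaling each
      // partition to the global max:
      //   out = sum_p exp(m_p - m) * s_p * o_p / sum_p exp(m_p - m) * s_p
      float denom = 0.0f;
      float acc = 0.0f;
      for (uint p = 0; p < n_partitions; p++) {
        float w = exp(partial_max[p] - m) * partial_sum[p];
        denom += w;
        acc += w * partial_out[p * head_dim + d];
      }
      out[d] = acc / denom;
    }

Because the pass-1 partitions are independent, many more threadgroups can be launched at once, which is what lets the Ultra-class chips use more of their compute.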
Speedup for M2 Max
The following speedup is in total tokens per second, not attention-only speedup. Note that the phi model, the one that does not improve, is also the one without GQA. The 1-pass SDPA on the M2 Max already achieves ~350 to 380 GB/s of reads at sequence length ~2048, close to the chip's 400 GB/s peak memory bandwidth, so there isn't much room left for speedup.
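To make that headroom concrete (a back-of-the-envelope bound; the symbols are mine, not from the PR): single-token decoding attention mostly streams the K/V cache once per token, so

$$t_{\text{attn}} \approx \frac{2\,L\,H_{kv}\,d_{head}\,b}{BW_{\text{achieved}}}, \qquad \text{remaining speedup} \lesssim \frac{BW_{\text{peak}}}{BW_{\text{achieved}}} \approx \frac{400}{350\text{--}380} \approx 1.05\text{--}1.14\times$$

where $L$ is the sequence length, $H_{kv}$ the number of K/V heads, $d_{head}$ the head dimension, $b$ the bytes per element, and 400 GB/s the M2 Max's peak memory bandwidth.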
Speedup for M2 Ultra
Again, the speedup is in total tokens per second, not attention-specific. The M2 Ultra is sped up in all cases, no GQA required. At sequence length 2048 without GQA the kernel peaks at >800 GB/s, at or above the chip's nominal peak memory bandwidth, which also means there is probably little room for improvement there (though there could be for longer sequences).