Use SSE for 128-bit atomic load/store on Intel CPU with AVX #16

taiki-e · 2022-06-18T02:53:57Z

x86_64 part of #10

The following are the results of a simple microbenchmark:

bench_portable_atomic_arch/u128_load
                        time:   [1.4598 ns 1.4671 ns 1.4753 ns]
                        change: [-81.510% -81.210% -80.950%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild
  2 (2.00%) high severe
bench_portable_atomic_arch/u128_store
                        time:   [1.3852 ns 1.3937 ns 1.4024 ns]
                        change: [-82.318% -81.989% -81.621%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  3 (3.00%) low mild
  5 (5.00%) high mild
  2 (2.00%) high severe
bench_portable_atomic_arch/u128_concurrent_load
                        time:   [56.422 us 56.767 us 57.204 us]
                        change: [-70.807% -70.143% -69.443%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  3 (3.00%) high mild
  6 (6.00%) high severe
bench_portable_atomic_arch/u128_concurrent_load_store
                        time:   [136.53 us 139.96 us 145.39 us]
                        change: [-82.570% -81.879% -80.820%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 15 outliers among 100 measurements (15.00%)
  4 (4.00%) high mild
  11 (11.00%) high severe
bench_portable_atomic_arch/u128_concurrent_store
                        time:   [146.03 us 147.67 us 149.98 us]
                        change: [-90.486% -90.124% -89.483%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  1 (1.00%) high mild
  8 (8.00%) high severe
bench_portable_atomic_arch/u128_concurrent_store_swap
                        time:   [765.11 us 766.69 us 768.29 us]
                        change: [-51.204% -50.967% -50.745%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  4 (4.00%) low mild
  2 (2.00%) high mild
  1 (1.00%) high severe

Closes #10

taiki-e · 2022-06-18T04:16:46Z

src/imp/atomic128/x86_64.rs

+            // TODO: GCC always uses mfence here, so the current implementation
+            // follows that behavior. However, probably only SeqCst store
+            // actually requires mfence.
+            "mfence",


That said, I'm not sure why GCC always uses mfence.

Yeah I don't think this needs an mfence unless it's SeqCst as with regular atomic stores. GCC might have overlooked it? Probably worth filing an issue there.

taiki-e · 2022-06-18T04:30:27Z

src/imp/atomic128/x86_64.rs

+        // Check cmpxchg16b anyway to prevent mixing atomic and non-atomic access.
+        if detect::has_cmpxchg16b() && detect::is_intel_and_has_avx() {
+            _atomic_load_vmovdqa
+        } else {
+            _atomic_load_cmpxchg16b
+        })


Technically, we could support determining this at compile time by parsing the target-cpu in the build script, but there are many target-cpu's in x86_64...

We can probably use info from https://github.com/llvm/llvm-project/blob/1f69f7ea9af4bd4e8001a25044aa026557f366f7/llvm/include/llvm/Support/X86TargetParser.def#L31-L92

taiki-e · 2022-06-18T05:03:50Z

cc @ibraheemdev

taiki-e · 2022-06-19T06:38:31Z

Removed mfence from non-SeqCst stores.

bors r+

16: Use SSE for 128-bit atomic load/store on Intel CPU with AVX r=taiki-e a=taiki-e x86_64 part of #10 The following are the results of a simple microbenchmark: ``` bench_portable_atomic_arch/u128_load time: [1.4598 ns 1.4671 ns 1.4753 ns] change: [-81.510% -81.210% -80.950%] (p = 0.00 < 0.05) Performance has improved. Found 7 outliers among 100 measurements (7.00%) 1 (1.00%) low mild 4 (4.00%) high mild 2 (2.00%) high severe bench_portable_atomic_arch/u128_store time: [1.3852 ns 1.3937 ns 1.4024 ns] change: [-82.318% -81.989% -81.621%] (p = 0.00 < 0.05) Performance has improved. Found 10 outliers among 100 measurements (10.00%) 3 (3.00%) low mild 5 (5.00%) high mild 2 (2.00%) high severe bench_portable_atomic_arch/u128_concurrent_load time: [56.422 us 56.767 us 57.204 us] change: [-70.807% -70.143% -69.443%] (p = 0.00 < 0.05) Performance has improved. Found 9 outliers among 100 measurements (9.00%) 3 (3.00%) high mild 6 (6.00%) high severe bench_portable_atomic_arch/u128_concurrent_load_store time: [136.53 us 139.96 us 145.39 us] change: [-82.570% -81.879% -80.820%] (p = 0.00 < 0.05) Performance has improved. Found 15 outliers among 100 measurements (15.00%) 4 (4.00%) high mild 11 (11.00%) high severe bench_portable_atomic_arch/u128_concurrent_store time: [146.03 us 147.67 us 149.98 us] change: [-90.486% -90.124% -89.483%] (p = 0.00 < 0.05) Performance has improved. Found 9 outliers among 100 measurements (9.00%) 1 (1.00%) high mild 8 (8.00%) high severe bench_portable_atomic_arch/u128_concurrent_store_swap time: [765.11 us 766.69 us 768.29 us] change: [-51.204% -50.967% -50.745%] (p = 0.00 < 0.05) Performance has improved. Found 7 outliers among 100 measurements (7.00%) 4 (4.00%) low mild 2 (2.00%) high mild 1 (1.00%) high severe ``` Closes #10 Co-authored-by: Taiki Endo <[email protected]>

bors · 2022-06-19T07:22:02Z

Build failed:

ci

taiki-e · 2022-06-19T07:25:45Z

bors r+

bors · 2022-06-19T08:18:05Z

Build succeeded:

ci

taiki-e · 2022-06-19T17:27:06Z

Published in 0.3.2

taiki-e · 2022-06-25T14:34:55Z

Hmm, due to odd rebase error (?) remove of mfence from non-SeqCst stores is not in main branch... https://github.com/taiki-e/portable-atomic/compare/48fbed348144c699819952286739cb1a28a196ef..078b4abd69676297b14f0426859fddc7f2fcacc2

taiki-e · 2022-06-25T15:17:04Z

Hmm, due to odd rebase error (?) remove of mfence from non-SeqCst stores is not in main branch...

Fixed in 0.3.4.

taiki-e added the O-x86 Target: x86/x64 processors label Jun 18, 2022

taiki-e force-pushed the vmovdqa branch 8 times, most recently from 1a16ce1 to d1320d1 Compare June 18, 2022 04:26

taiki-e commented Jun 18, 2022

View reviewed changes

taiki-e force-pushed the vmovdqa branch 2 times, most recently from 124bbbb to 94aa896 Compare June 18, 2022 05:03

taiki-e mentioned this pull request Jun 18, 2022

Optimize AtomicU128::{load, store} #10

Closed

taiki-e force-pushed the vmovdqa branch 2 times, most recently from de3b7d0 to 48fbed3 Compare June 19, 2022 06:33

Use SSE for 128-bit atomic load/store on Intel CPU with AVX

078b4ab

taiki-e force-pushed the vmovdqa branch from 48fbed3 to 078b4ab Compare June 19, 2022 07:25

bors bot merged commit 10b561a into main Jun 19, 2022

bors bot deleted the vmovdqa branch June 19, 2022 08:18

taiki-e mentioned this pull request Jul 27, 2022

x86_64: Detect Intel chips from target-cpu #21

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use SSE for 128-bit atomic load/store on Intel CPU with AVX #16

Use SSE for 128-bit atomic load/store on Intel CPU with AVX #16

taiki-e commented Jun 18, 2022 •

edited

Loading

taiki-e Jun 18, 2022 •

edited

Loading

ibraheemdev Jun 18, 2022

taiki-e Jun 18, 2022

taiki-e Jun 25, 2022

taiki-e commented Jun 18, 2022

taiki-e commented Jun 19, 2022

bors bot commented Jun 19, 2022

taiki-e commented Jun 19, 2022

bors bot commented Jun 19, 2022

taiki-e commented Jun 19, 2022

taiki-e commented Jun 25, 2022

taiki-e commented Jun 25, 2022

Use SSE for 128-bit atomic load/store on Intel CPU with AVX #16

Use SSE for 128-bit atomic load/store on Intel CPU with AVX #16

Conversation

taiki-e commented Jun 18, 2022 • edited Loading

taiki-e Jun 18, 2022 • edited Loading

Choose a reason for hiding this comment

ibraheemdev Jun 18, 2022

Choose a reason for hiding this comment

taiki-e Jun 18, 2022

Choose a reason for hiding this comment

taiki-e Jun 25, 2022

Choose a reason for hiding this comment

taiki-e commented Jun 18, 2022

taiki-e commented Jun 19, 2022

bors bot commented Jun 19, 2022

taiki-e commented Jun 19, 2022

bors bot commented Jun 19, 2022

taiki-e commented Jun 19, 2022

taiki-e commented Jun 25, 2022

taiki-e commented Jun 25, 2022

taiki-e commented Jun 18, 2022 •

edited

Loading

taiki-e Jun 18, 2022 •

edited

Loading