Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use SSE for 128-bit atomic load/store on Intel CPU with AVX #16

Merged
merged 1 commit into from
Jun 19, 2022
Merged

Conversation

taiki-e
Copy link
Owner

@taiki-e taiki-e commented Jun 18, 2022

x86_64 part of #10

The following are the results of a simple microbenchmark:

bench_portable_atomic_arch/u128_load
                        time:   [1.4598 ns 1.4671 ns 1.4753 ns]
                        change: [-81.510% -81.210% -80.950%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild
  2 (2.00%) high severe
bench_portable_atomic_arch/u128_store
                        time:   [1.3852 ns 1.3937 ns 1.4024 ns]
                        change: [-82.318% -81.989% -81.621%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  3 (3.00%) low mild
  5 (5.00%) high mild
  2 (2.00%) high severe
bench_portable_atomic_arch/u128_concurrent_load
                        time:   [56.422 us 56.767 us 57.204 us]
                        change: [-70.807% -70.143% -69.443%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  3 (3.00%) high mild
  6 (6.00%) high severe
bench_portable_atomic_arch/u128_concurrent_load_store
                        time:   [136.53 us 139.96 us 145.39 us]
                        change: [-82.570% -81.879% -80.820%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 15 outliers among 100 measurements (15.00%)
  4 (4.00%) high mild
  11 (11.00%) high severe
bench_portable_atomic_arch/u128_concurrent_store
                        time:   [146.03 us 147.67 us 149.98 us]
                        change: [-90.486% -90.124% -89.483%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  1 (1.00%) high mild
  8 (8.00%) high severe
bench_portable_atomic_arch/u128_concurrent_store_swap
                        time:   [765.11 us 766.69 us 768.29 us]
                        change: [-51.204% -50.967% -50.745%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  4 (4.00%) low mild
  2 (2.00%) high mild
  1 (1.00%) high severe

Closes #10

@taiki-e taiki-e added the O-x86 Target: x86/x64 processors label Jun 18, 2022
@taiki-e taiki-e force-pushed the vmovdqa branch 8 times, most recently from 1a16ce1 to d1320d1 Compare June 18, 2022 04:26
Comment on lines +178 to +181
// TODO: GCC always uses mfence here, so the current implementation
// follows that behavior. However, probably only SeqCst store
// actually requires mfence.
"mfence",
Copy link
Owner Author

@taiki-e taiki-e Jun 18, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That said, I'm not sure why GCC always uses mfence.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I don't think this needs an mfence unless it's SeqCst as with regular atomic stores. GCC might have overlooked it? Probably worth filing an issue there.

Comment on lines +219 to +224
// Check cmpxchg16b anyway to prevent mixing atomic and non-atomic access.
if detect::has_cmpxchg16b() && detect::is_intel_and_has_avx() {
_atomic_load_vmovdqa
} else {
_atomic_load_cmpxchg16b
})
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically, we could support determining this at compile time by parsing the target-cpu in the build script, but there are many target-cpu's in x86_64...

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@taiki-e taiki-e force-pushed the vmovdqa branch 2 times, most recently from 124bbbb to 94aa896 Compare June 18, 2022 05:03
@taiki-e
Copy link
Owner Author

taiki-e commented Jun 18, 2022

cc @ibraheemdev

@taiki-e taiki-e force-pushed the vmovdqa branch 2 times, most recently from de3b7d0 to 48fbed3 Compare June 19, 2022 06:33
@taiki-e
Copy link
Owner Author

taiki-e commented Jun 19, 2022

Removed mfence from non-SeqCst stores.

bors r+

bors bot added a commit that referenced this pull request Jun 19, 2022
16: Use SSE for 128-bit atomic load/store on Intel CPU with AVX r=taiki-e a=taiki-e

x86_64 part of #10

The following are the results of a simple microbenchmark:

```
bench_portable_atomic_arch/u128_load
                        time:   [1.4598 ns 1.4671 ns 1.4753 ns]
                        change: [-81.510% -81.210% -80.950%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild
  2 (2.00%) high severe
bench_portable_atomic_arch/u128_store
                        time:   [1.3852 ns 1.3937 ns 1.4024 ns]
                        change: [-82.318% -81.989% -81.621%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  3 (3.00%) low mild
  5 (5.00%) high mild
  2 (2.00%) high severe
bench_portable_atomic_arch/u128_concurrent_load
                        time:   [56.422 us 56.767 us 57.204 us]
                        change: [-70.807% -70.143% -69.443%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  3 (3.00%) high mild
  6 (6.00%) high severe
bench_portable_atomic_arch/u128_concurrent_load_store
                        time:   [136.53 us 139.96 us 145.39 us]
                        change: [-82.570% -81.879% -80.820%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 15 outliers among 100 measurements (15.00%)
  4 (4.00%) high mild
  11 (11.00%) high severe
bench_portable_atomic_arch/u128_concurrent_store
                        time:   [146.03 us 147.67 us 149.98 us]
                        change: [-90.486% -90.124% -89.483%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  1 (1.00%) high mild
  8 (8.00%) high severe
bench_portable_atomic_arch/u128_concurrent_store_swap
                        time:   [765.11 us 766.69 us 768.29 us]
                        change: [-51.204% -50.967% -50.745%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  4 (4.00%) low mild
  2 (2.00%) high mild
  1 (1.00%) high severe
```

Closes #10

Co-authored-by: Taiki Endo <[email protected]>
@bors
Copy link
Contributor

bors bot commented Jun 19, 2022

Build failed:

@taiki-e
Copy link
Owner Author

taiki-e commented Jun 19, 2022

bors r+

@bors
Copy link
Contributor

bors bot commented Jun 19, 2022

Build succeeded:

@bors bors bot merged commit 10b561a into main Jun 19, 2022
@bors bors bot deleted the vmovdqa branch June 19, 2022 08:18
@taiki-e
Copy link
Owner Author

taiki-e commented Jun 19, 2022

Published in 0.3.2

@taiki-e
Copy link
Owner Author

taiki-e commented Jun 25, 2022

Hmm, due to odd rebase error (?) remove of mfence from non-SeqCst stores is not in main branch... https://github.com/taiki-e/portable-atomic/compare/48fbed348144c699819952286739cb1a28a196ef..078b4abd69676297b14f0426859fddc7f2fcacc2

@taiki-e
Copy link
Owner Author

taiki-e commented Jun 25, 2022

Hmm, due to odd rebase error (?) remove of mfence from non-SeqCst stores is not in main branch...

Fixed in 0.3.4.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
O-x86 Target: x86/x64 processors
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Optimize AtomicU128::{load, store}
2 participants