-
-
Notifications
You must be signed in to change notification settings - Fork 17
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use SSE for 128-bit atomic load/store on Intel CPU with AVX #16
Conversation
1a16ce1
to
d1320d1
Compare
// TODO: GCC always uses mfence here, so the current implementation | ||
// follows that behavior. However, probably only SeqCst store | ||
// actually requires mfence. | ||
"mfence", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That said, I'm not sure why GCC always uses mfence.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I don't think this needs an mfence
unless it's SeqCst
as with regular atomic stores. GCC might have overlooked it? Probably worth filing an issue there.
// Check cmpxchg16b anyway to prevent mixing atomic and non-atomic access. | ||
if detect::has_cmpxchg16b() && detect::is_intel_and_has_avx() { | ||
_atomic_load_vmovdqa | ||
} else { | ||
_atomic_load_cmpxchg16b | ||
}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Technically, we could support determining this at compile time by parsing the target-cpu in the build script, but there are many target-cpu's in x86_64...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
124bbbb
to
94aa896
Compare
cc @ibraheemdev |
de3b7d0
to
48fbed3
Compare
Removed mfence from non-SeqCst stores. bors r+ |
16: Use SSE for 128-bit atomic load/store on Intel CPU with AVX r=taiki-e a=taiki-e x86_64 part of #10 The following are the results of a simple microbenchmark: ``` bench_portable_atomic_arch/u128_load time: [1.4598 ns 1.4671 ns 1.4753 ns] change: [-81.510% -81.210% -80.950%] (p = 0.00 < 0.05) Performance has improved. Found 7 outliers among 100 measurements (7.00%) 1 (1.00%) low mild 4 (4.00%) high mild 2 (2.00%) high severe bench_portable_atomic_arch/u128_store time: [1.3852 ns 1.3937 ns 1.4024 ns] change: [-82.318% -81.989% -81.621%] (p = 0.00 < 0.05) Performance has improved. Found 10 outliers among 100 measurements (10.00%) 3 (3.00%) low mild 5 (5.00%) high mild 2 (2.00%) high severe bench_portable_atomic_arch/u128_concurrent_load time: [56.422 us 56.767 us 57.204 us] change: [-70.807% -70.143% -69.443%] (p = 0.00 < 0.05) Performance has improved. Found 9 outliers among 100 measurements (9.00%) 3 (3.00%) high mild 6 (6.00%) high severe bench_portable_atomic_arch/u128_concurrent_load_store time: [136.53 us 139.96 us 145.39 us] change: [-82.570% -81.879% -80.820%] (p = 0.00 < 0.05) Performance has improved. Found 15 outliers among 100 measurements (15.00%) 4 (4.00%) high mild 11 (11.00%) high severe bench_portable_atomic_arch/u128_concurrent_store time: [146.03 us 147.67 us 149.98 us] change: [-90.486% -90.124% -89.483%] (p = 0.00 < 0.05) Performance has improved. Found 9 outliers among 100 measurements (9.00%) 1 (1.00%) high mild 8 (8.00%) high severe bench_portable_atomic_arch/u128_concurrent_store_swap time: [765.11 us 766.69 us 768.29 us] change: [-51.204% -50.967% -50.745%] (p = 0.00 < 0.05) Performance has improved. Found 7 outliers among 100 measurements (7.00%) 4 (4.00%) low mild 2 (2.00%) high mild 1 (1.00%) high severe ``` Closes #10 Co-authored-by: Taiki Endo <[email protected]>
Build failed: |
bors r+ |
Build succeeded: |
Published in 0.3.2 |
Hmm, due to odd rebase error (?) remove of mfence from non-SeqCst stores is not in main branch... https://github.com/taiki-e/portable-atomic/compare/48fbed348144c699819952286739cb1a28a196ef..078b4abd69676297b14f0426859fddc7f2fcacc2 |
Fixed in 0.3.4. |
x86_64 part of #10
The following are the results of a simple microbenchmark:
Closes #10