-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Vectorize shorter buffers for CRC-32 on Intel #86539
Conversation
Tagging subscribers to this area: @dotnet/area-system-io Issue DetailsThe current vectorized implementation for CRC-32 requires at least 64 bytes to function. However, it is possible to vectorize spans as short as 16 bytes for significant performance gains for spans from 16 to 63 bytes in length on Intel. Additionally, when processing on ARM it appears CRC-32 intrinsics are actually preferable for lengths up to approximately 128 bytes. This change therefore no longer vectorizes from 64 to 127 bytes on ARM if the CRC-32 intrinsics are available. Finally, reuse VectorHelper methods added with CRC-64 support for a cleaner implementation. This also appears to produce better JIT output, showing some minor performance improvements on vectorized processing. x64BenchmarkDotNet=v0.13.2.2052-nightly, OS=Windows 11 (10.0.22621.1776) PowerPlanMode=00000000-0000-0000-0000-000000000000 Arguments=/p:EnableUnsafeBinaryFormatterSerialization=true IterationTime=250.0000 ms
Arm64BenchmarkDotNet=v0.13.2.2052-nightly, OS=ubuntu 22.04 PowerPlanMode=00000000-0000-0000-0000-000000000000 Arguments=/p:EnableUnsafeBinaryFormatterSerialization=true IterationTime=250.0000 ms
|
/cc @adamsitnik |
Nice! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thank you for another very impressive contribution @brantburnett !
And thank you for providing both x64 and arm64 numbers!
&& source.Length >= Vector128<byte>.Count * 4; | ||
// Vectorization can process spans as short as a single vector (16 bytes), but if ARM intrinsics are supported they | ||
// seem to be more performant for spans less than 8 vectors (128 bytes). | ||
&& source.Length >= Vector128<byte>.Count * (System.Runtime.Intrinsics.Arm.Crc32.IsSupported ? 8 : 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is JIT smart enough to turn Vector128<byte>.Count * (System.Runtime.Intrinsics.Arm.Crc32.IsSupported ? 8 : 1)
into a constant?
cc @EgorBo
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is JIT smart enough to turn
Vector128<byte>.Count * (System.Runtime.Intrinsics.Arm.Crc32.IsSupported ? 8 : 1)
into a constant?cc @EgorBo
Just checked: folded to a constant (on both x64 and arm64)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@EgorBo thank you very much!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was also one of my concerns. My testing in sharplab.io also shows it folding into a constant.
@brantburnett I just realized that we have no We are using these benchmarks to detect regressions (and improvements). |
Certainly, they're nothing fancy but that's where I was writing them locally in the first place. I was wondering about adding them permanently, but I wasn't sure what the threshold for warranting inclusion was. I'll take a look at it and put in a PR. The one odd spot is I had to do some trickery in the benchmark csproj since this library is distributed via NuGet, I'll have to figure out the "right" way to wire it up. I'll call out if I need help with that detail. |
I agree, this part of In theory following line is all we need: <PackageReference Include="System.IO.Hashing" Version="$(SystemVersion)" /> |
The failure is unrelated (#85145), merging. |
The current vectorized implementation for CRC-32 requires at least 64 bytes to function. However, it is possible to vectorize spans as short as 16 bytes for significant performance gains for spans from 16 to 63 bytes in length on Intel.
Additionally, when processing on ARM it appears CRC-32 intrinsics are actually preferable for lengths up to approximately 128 bytes. This change therefore no longer vectorizes from 64 to 127 bytes on ARM if the CRC-32 intrinsics are available.
Finally, reuse VectorHelper methods added with CRC-64 support for a cleaner implementation. This also appears to produce better JIT output, showing some minor performance improvements on vectorized processing.
x64
BenchmarkDotNet=v0.13.2.2052-nightly, OS=Windows 11 (10.0.22621.1776)
Intel Core i7-10850H CPU 2.70GHz, 1 CPU, 12 logical and 6 physical cores
.NET SDK=8.0.100-preview.4.23260.5
[Host] : .NET 8.0.0 (8.0.23.25905), X64 RyuJIT AVX2
Job-BLNPAG : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX2
Job-XFRTDF : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX2
PowerPlanMode=00000000-0000-0000-0000-000000000000 Arguments=/p:EnableUnsafeBinaryFormatterSerialization=true IterationTime=250.0000 ms
MaxIterationCount=20 MinIterationCount=15 WarmupCount=1
Arm64
BenchmarkDotNet=v0.13.2.2052-nightly, OS=ubuntu 22.04
AWS m6g.xlarge Graviton2
[Host] : .NET 8.0.0 (8.0.23.25905), Arm64 RyuJIT AdvSIMD
Job-FJUDNU : .NET 8.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
Job-JYHJTX : .NET 8.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
PowerPlanMode=00000000-0000-0000-0000-000000000000 Arguments=/p:EnableUnsafeBinaryFormatterSerialization=true IterationTime=250.0000 ms
MaxIterationCount=20 MinIterationCount=15 WarmupCount=1