Vectorize shorter buffers for CRC-32 on Intel #86539

Merged (1 commit) on May 22, 2023

brantburnett
Contributor

The current vectorized implementation for CRC-32 requires at least 64 bytes to function. However, it is possible to vectorize spans as short as 16 bytes for significant performance gains for spans from 16 to 63 bytes in length on Intel.

Additionally, when processing on ARM it appears CRC-32 intrinsics are actually preferable for lengths up to approximately 128 bytes. This change therefore no longer vectorizes from 64 to 127 bytes on ARM if the CRC-32 intrinsics are available.

Finally, reuse VectorHelper methods added with CRC-64 support for a cleaner implementation. This also appears to produce better JIT output, showing some minor performance improvements on vectorized processing.
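
For readers skimming the thresholds, the length-based path selection described above can be sketched roughly as follows. This is an illustration only, not the code in this PR: `ChoosePath` and the path names are hypothetical, and it assumes the hardware support needed for the vectorized path is present and that buffers below the threshold fall back to either the ARM CRC-32 intrinsics or a scalar implementation.

```csharp
using System;

// Hypothetical helper (not part of System.IO.Hashing): names the path a buffer
// of the given length would take under the thresholds described in this PR.
static string ChoosePath(int length, bool armCrc32Supported)
{
    const int VectorSize = 16; // Vector128<byte>.Count is 16 bytes

    // A single vector is enough to vectorize, unless ARM CRC-32 intrinsics are
    // available, in which case vectorization starts at 8 vectors (128 bytes).
    int minVectorizedLength = VectorSize * (armCrc32Supported ? 8 : 1);

    if (length >= minVectorizedLength)
    {
        return "vectorized";
    }

    // Assumption: shorter buffers use the ARM intrinsics when present,
    // otherwise a scalar fallback.
    return armCrc32Supported ? "arm-crc32-intrinsics" : "scalar";
}

foreach (int length in new[] { 15, 16, 63, 64, 127, 128 })
{
    Console.WriteLine(
        $"{length,4} bytes: x64 -> {ChoosePath(length, false)}, Arm64 -> {ChoosePath(length, true)}");
}
```

The ARM flag is passed in as a parameter here only so both platforms can be shown from one machine; the real check is a hardware capability query.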

x64

BenchmarkDotNet=v0.13.2.2052-nightly, OS=Windows 11 (10.0.22621.1776)
Intel Core i7-10850H CPU 2.70GHz, 1 CPU, 12 logical and 6 physical cores
.NET SDK=8.0.100-preview.4.23260.5
[Host] : .NET 8.0.0 (8.0.23.25905), X64 RyuJIT AVX2
Job-BLNPAG : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX2
Job-XFRTDF : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX2

PowerPlanMode=00000000-0000-0000-0000-000000000000 Arguments=/p:EnableUnsafeBinaryFormatterSerialization=true IterationTime=250.0000 ms
MaxIterationCount=20 MinIterationCount=15 WarmupCount=1

| Method | Job | BufferSize | Mean | Error | StdDev | Median | Min | Max | Ratio |
|---|---|---|---|---|---|---|---|---|---|
| Append | Current | 16 | 27.322 ns | 0.4068 ns | 0.3606 ns | 27.334 ns | 26.834 ns | 28.051 ns | 1.00 |
| Append | New | 16 | 8.069 ns | 0.0169 ns | 0.0141 ns | 8.068 ns | 8.045 ns | 8.102 ns | 0.29 |
| Append | Current | 32 | 54.541 ns | 0.3201 ns | 0.2994 ns | 54.431 ns | 54.078 ns | 55.090 ns | 1.00 |
| Append | New | 32 | 10.021 ns | 0.0296 ns | 0.0247 ns | 10.009 ns | 10.001 ns | 10.087 ns | 0.18 |
| Append | Current | 64 | 13.722 ns | 0.0278 ns | 0.0246 ns | 13.726 ns | 13.687 ns | 13.773 ns | 1.00 |
| Append | New | 64 | 13.811 ns | 0.0454 ns | 0.0402 ns | 13.792 ns | 13.777 ns | 13.909 ns | 1.01 |
| Append | Current | 1024 | 46.089 ns | 0.6261 ns | 0.5857 ns | 45.768 ns | 45.654 ns | 47.135 ns | 1.00 |
| Append | New | 1024 | 43.048 ns | 0.0765 ns | 0.0716 ns | 43.030 ns | 42.970 ns | 43.187 ns | 0.93 |

Arm64

BenchmarkDotNet=v0.13.2.2052-nightly, OS=ubuntu 22.04
AWS m6g.xlarge Graviton2
[Host] : .NET 8.0.0 (8.0.23.25905), Arm64 RyuJIT AdvSIMD
Job-FJUDNU : .NET 8.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
Job-JYHJTX : .NET 8.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD

PowerPlanMode=00000000-0000-0000-0000-000000000000 Arguments=/p:EnableUnsafeBinaryFormatterSerialization=true IterationTime=250.0000 ms
MaxIterationCount=20 MinIterationCount=15 WarmupCount=1

| Method | Job | BufferSize | Mean | Error | StdDev | Median | Min | Max | Ratio |
|---|---|---|---|---|---|---|---|---|---|
| Append | Current | 64 | 24.849 ns | 0.0079 ns | 0.0074 ns | 24.847 ns | 24.840 ns | 24.861 ns | 1.00 |
| Append | New | 64 | 10.546 ns | 0.0635 ns | 0.0594 ns | 10.517 ns | 10.484 ns | 10.652 ns | 0.42 |
| Append | Current | 128 | 28.898 ns | 0.0318 ns | 0.0297 ns | 28.907 ns | 28.843 ns | 28.926 ns | 1.00 |
| Append | New | 128 | 27.023 ns | 0.0131 ns | 0.0122 ns | 27.019 ns | 27.003 ns | 27.041 ns | 0.94 |
| Append | Current | 1024 | 90.481 ns | 0.0451 ns | 0.0400 ns | 90.487 ns | 90.404 ns | 90.532 ns | 1.00 |
| Append | New | 1024 | 72.729 ns | 0.0619 ns | 0.0549 ns | 72.700 ns | 72.684 ns | 72.858 ns | 0.80 |

@ghost added the community-contribution label on May 20, 2023
ghost commented May 20, 2023

Tagging subscribers to this area: @dotnet/area-system-io
See info in area-owners.md if you want to be subscribed.

Issue Details

Author: brantburnett
Assignees: -
Labels: area-System.IO, community-contribution

Milestone: -

@brantburnett marked this pull request as ready for review on May 20, 2023, 16:41
@brantburnett
Contributor Author

/cc @adamsitnik

@danmoseley
Member

Nice!

@adamsitnik added the tenet-performance label on May 22, 2023

@adamsitnik (Member) left a comment

LGTM, thank you for another very impressive contribution, @brantburnett!

And thank you for providing both x64 and arm64 numbers!

- && source.Length >= Vector128<byte>.Count * 4;
+ // Vectorization can process spans as short as a single vector (16 bytes), but if ARM intrinsics are supported they
+ // seem to be more performant for spans less than 8 vectors (128 bytes).
+ && source.Length >= Vector128<byte>.Count * (System.Runtime.Intrinsics.Arm.Crc32.IsSupported ? 8 : 1);
Member

Is JIT smart enough to turn Vector128<byte>.Count * (System.Runtime.Intrinsics.Arm.Crc32.IsSupported ? 8 : 1) into a constant?

cc @EgorBo

@EgorBo (Member) commented on May 22, 2023

> Is JIT smart enough to turn Vector128<byte>.Count * (System.Runtime.Intrinsics.Arm.Crc32.IsSupported ? 8 : 1) into a constant?
>
> cc @EgorBo

Just checked: folded to a constant (on both x64 and arm64)

Member

@EgorBo thank you very much!

Contributor Author

It was also one of my concerns. My testing in sharplab.io also shows it folding into a constant.
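
For anyone who wants to reproduce the check, a minimal sketch of the expression in isolation is below. It only evaluates the threshold on the current machine; it does not by itself prove folding, which still needs a look at the generated code (for example in sharplab.io, as mentioned above). The expectation that it folds rests on the JIT treating both `Vector128<byte>.Count` and the `IsSupported` check as constants for the target hardware. The alias `ArmCrc32` and the variable name are just for this example.

```csharp
using System;
using System.Runtime.Intrinsics;
using ArmCrc32 = System.Runtime.Intrinsics.Arm.Crc32;

// Same shape as the expression under review: 16 bytes per vector, times 8 when
// the ARM CRC-32 intrinsics are available, otherwise times 1.
int minVectorizedLength = Vector128<byte>.Count * (ArmCrc32.IsSupported ? 8 : 1);

Console.WriteLine($"ARM CRC-32 intrinsics supported: {ArmCrc32.IsSupported}");
Console.WriteLine($"Minimum vectorized length: {minVectorizedLength} bytes");
```

On x64 this should report 16 bytes, and on Arm64 hardware with CRC-32 intrinsics 128 bytes.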

@adamsitnik
Member

@brantburnett I just realized that we have no System.IO.Hashing benchmarks in https://github.com/dotnet/performance/tree/main/src/benchmarks/micro/libraries. Would you be interested in contributing the ones you were using?

We are using these benchmarks to detect regressions (and improvements).

@brantburnett
Contributor Author

> @brantburnett I just realized that we have no System.IO.Hashing benchmarks in https://github.com/dotnet/performance/tree/main/src/benchmarks/micro/libraries. Would you be interested in contributing the ones you were using?
>
> We are using these benchmarks to detect regressions (and improvements).

Certainly. They're nothing fancy, but that's where I was writing them locally in the first place. I was wondering about adding them permanently, but I wasn't sure what the threshold for inclusion was.

I'll take a look at it and put in a PR. The one odd spot is I had to do some trickery in the benchmark csproj since this library is distributed via NuGet, I'll have to figure out the "right" way to wire it up. I'll call out if I need help with that detail.

@adamsitnik
Member

> The one odd spot is I had to do some trickery in the benchmark csproj since this library is distributed via NuGet, I'll have to figure out the "right" way to wire it up.

I agree, this part of MicroBenchmarks.csproj is not intuitive or easy to use:

https://github.com/dotnet/performance/blob/c9f9cf9a31795e3a2f61a84801218d1e1af5019c/src/benchmarks/micro/MicroBenchmarks.csproj#L23-L46

In theory, the following line is all we need:

<PackageReference Include="System.IO.Hashing" Version="$(SystemVersion)" />

@adamsitnik
Member

The failure is unrelated (#85145), merging.
