Vectorize shorter buffers for CRC-32 on Intel #86539

brantburnett · 2023-05-20T15:09:39Z

The current vectorized implementation for CRC-32 requires at least 64 bytes to function. However, spans as short as 16 bytes can be vectorized, giving significant performance gains on Intel for spans of 16 to 63 bytes.

Additionally, when processing on ARM it appears CRC-32 intrinsics are actually preferable for lengths up to approximately 128 bytes. This change therefore no longer vectorizes from 64 to 127 bytes on ARM if the CRC-32 intrinsics are available.

Finally, reuse VectorHelper methods added with CRC-64 support for a cleaner implementation. This also appears to produce better JIT output, showing some minor performance improvements on vectorized processing.
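
The length thresholds described above can be sketched as a small dispatch function. This is only an illustration, not the code in Crc32.Vectorized.cs: the names `ThresholdSketch`, `CrcPath`, and the `armCrc32Supported` parameter are hypothetical, and the sketch assumes 128-bit vector hardware is available and says nothing about how any remaining bytes are processed.

```csharp
using System;

// Hypothetical sketch of the length thresholds described in this PR.
public static class ThresholdSketch
{
    // Number of bytes in one 128-bit vector (Vector128<byte>.Count).
    private const int VectorSize = 16;

    public enum CrcPath { NonVectorized, Vectorized }

    // Threshold before this change: four vectors (64 bytes) on every platform.
    public static int OldMinVectorizedLength => VectorSize * 4;

    // Threshold after this change: one vector (16 bytes), or eight vectors
    // (128 bytes) when the ARM CRC-32 intrinsics are available.
    public static int MinVectorizedLength(bool armCrc32Supported) =>
        VectorSize * (armCrc32Supported ? 8 : 1);

    // Picks a path purely from the span length and the platform flag.
    public static CrcPath Choose(int length, bool armCrc32Supported) =>
        length >= MinVectorizedLength(armCrc32Supported)
            ? CrcPath.Vectorized
            : CrcPath.NonVectorized;
}
```

Under these assumptions, a 32-byte span takes the vectorized path when `armCrc32Supported` is false and the non-vectorized path when it is true.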

x64

BenchmarkDotNet=v0.13.2.2052-nightly, OS=Windows 11 (10.0.22621.1776)
Intel Core i7-10850H CPU 2.70GHz, 1 CPU, 12 logical and 6 physical cores
.NET SDK=8.0.100-preview.4.23260.5
[Host] : .NET 8.0.0 (8.0.23.25905), X64 RyuJIT AVX2
Job-BLNPAG : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX2
Job-XFRTDF : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX2

PowerPlanMode=00000000-0000-0000-0000-000000000000 Arguments=/p:EnableUnsafeBinaryFormatterSerialization=true IterationTime=250.0000 ms
MaxIterationCount=20 MinIterationCount=15 WarmupCount=1

Method	Job	BufferSize	Mean	Error	StdDev	Median	Min	Max	Ratio
Append	Current	16	27.322 ns	0.4068 ns	0.3606 ns	27.334 ns	26.834 ns	28.051 ns	1.00
Append	New	16	8.069 ns	0.0169 ns	0.0141 ns	8.068 ns	8.045 ns	8.102 ns	0.29

Append	Current	32	54.541 ns	0.3201 ns	0.2994 ns	54.431 ns	54.078 ns	55.090 ns	1.00
Append	New	32	10.021 ns	0.0296 ns	0.0247 ns	10.009 ns	10.001 ns	10.087 ns	0.18

Append	Current	64	13.722 ns	0.0278 ns	0.0246 ns	13.726 ns	13.687 ns	13.773 ns	1.00
Append	New	64	13.811 ns	0.0454 ns	0.0402 ns	13.792 ns	13.777 ns	13.909 ns	1.01

Append	Current	1024	46.089 ns	0.6261 ns	0.5857 ns	45.768 ns	45.654 ns	47.135 ns	1.00
Append	New	1024	43.048 ns	0.0765 ns	0.0716 ns	43.030 ns	42.970 ns	43.187 ns	0.93

Arm64

BenchmarkDotNet=v0.13.2.2052-nightly, OS=ubuntu 22.04
AWS m6g.xlarge Graviton2
[Host] : .NET 8.0.0 (8.0.23.25905), Arm64 RyuJIT AdvSIMD
Job-FJUDNU : .NET 8.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
Job-JYHJTX : .NET 8.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD

PowerPlanMode=00000000-0000-0000-0000-000000000000 Arguments=/p:EnableUnsafeBinaryFormatterSerialization=true IterationTime=250.0000 ms
MaxIterationCount=20 MinIterationCount=15 WarmupCount=1

Method	Job	BufferSize	Mean	Error	StdDev	Median	Min	Max	Ratio
Append	Current	64	24.849 ns	0.0079 ns	0.0074 ns	24.847 ns	24.840 ns	24.861 ns	1.00
Append	New	64	10.546 ns	0.0635 ns	0.0594 ns	10.517 ns	10.484 ns	10.652 ns	0.42

Append	Current	128	28.898 ns	0.0318 ns	0.0297 ns	28.907 ns	28.843 ns	28.926 ns	1.00
Append	New	128	27.023 ns	0.0131 ns	0.0122 ns	27.019 ns	27.003 ns	27.041 ns	0.94

Append	Current	1024	90.481 ns	0.0451 ns	0.0400 ns	90.487 ns	90.404 ns	90.532 ns	1.00
Append	New	1024	72.729 ns	0.0619 ns	0.0549 ns	72.700 ns	72.684 ns	72.858 ns	0.80

ghost · 2023-05-20T15:09:52Z

Tagging subscribers to this area: @dotnet/area-system-io
See info in area-owners.md if you want to be subscribed.

Issue Details

Author:	brantburnett
Assignees:	-
Labels:	`area-System.IO`, `community-contribution`
Milestone:	-

brantburnett · 2023-05-20T16:42:11Z

/cc @adamsitnik

danmoseley · 2023-05-21T04:22:27Z

Nice!

adamsitnik

LGTM, thank you for another very impressive contribution @brantburnett !

And thank you for providing both x64 and arm64 numbers!

adamsitnik · 2023-05-22T12:18:51Z

src/libraries/System.IO.Hashing/src/System/IO/Hashing/Crc32.Vectorized.cs

-            && source.Length >= Vector128<byte>.Count * 4;
+            // Vectorization can process spans as short as a single vector (16 bytes), but if ARM intrinsics are supported they
+            // seem to be more performant for spans less than 8 vectors (128 bytes).
+            && source.Length >= Vector128<byte>.Count * (System.Runtime.Intrinsics.Arm.Crc32.IsSupported ? 8 : 1);


Is JIT smart enough to turn Vector128<byte>.Count * (System.Runtime.Intrinsics.Arm.Crc32.IsSupported ? 8 : 1) into a constant?

cc @EgorBo

EgorBo · 2023-05-22

Just checked: folded to a constant (on both x64 and arm64)

adamsitnik · 2023-05-22

@EgorBo thank you very much!

brantburnett · 2023-05-22

It was also one of my concerns. My testing in sharplab.io also shows it folding into a constant.

adamsitnik · 2023-05-22T12:57:30Z

@brantburnett I just realized that we have no System.IO.Hashing benchmarks in https://github.com/dotnet/performance/tree/main/src/benchmarks/micro/libraries. Would you be interested in contributing the ones you were using?

We are using these benchmarks to detect regressions (and improvements).

brantburnett · 2023-05-22T13:02:52Z

Certainly, they're nothing fancy but that's where I was writing them locally in the first place. I was wondering about adding them permanently, but I wasn't sure what the threshold for warranting inclusion was.

I'll take a look at it and put in a PR. The one odd spot is that I had to do some trickery in the benchmark csproj since this library is distributed via NuGet, so I'll have to figure out the "right" way to wire it up. I'll call out if I need help with that detail.

adamsitnik · 2023-05-22T14:56:49Z

I agree, this part of MicroBenchmarks.csproj is not intuitive or easy to use:

https://github.com/dotnet/performance/blob/c9f9cf9a31795e3a2f61a84801218d1e1af5019c/src/benchmarks/micro/MicroBenchmarks.csproj#L23-L46

In theory, the following line is all we need:

<PackageReference Include="System.IO.Hashing" Version="$(SystemVersion)" />

adamsitnik · 2023-05-22T14:58:50Z

The failure is unrelated (#85145), merging.

Vectorize shorter buffers for CRC-32 on Intel

b04adce

ghost added the community-contribution label May 20, 2023

dotnet-issue-labeler bot added the area-System.IO label May 20, 2023

brantburnett marked this pull request as ready for review May 20, 2023 16:41

adamsitnik added the tenet-performance label May 22, 2023

adamsitnik approved these changes May 22, 2023

adamsitnik merged commit ed33e6c into dotnet:main May 22, 2023

adamsitnik added this to the 8.0.0 milestone May 22, 2023

adamsitnik added area-System.IO.Hashing and removed area-System.IO labels May 22, 2023

This was referenced May 22, 2023

Failed USB connection via port 54050, error 61, in tvOS arm64 Release AllSubsets_Mono #82637

Open

Build ios-arm64 Release AllSubsets_Mono failures dotnet/arcade#13625

Closed

brantburnett deleted the crc32-shorter-vectors branch May 22, 2023 15:46

brantburnett mentioned this pull request May 22, 2023

Add microbenchmarks for Crc32 and Crc64 algorithms dotnet/performance#3035

Merged

ghost locked as resolved and limited conversation to collaborators Jun 21, 2023
