Additional cross-platform hardware intrinsic APIs for loading/storing, reordering, and extracting a per-element "mask" #63331
Comments
Tagging subscribers to this area: @dotnet/area-system-runtime-intrinsics
Issue Details
Summary
With #49397 we approved and exposed cross platform APIs on Vector64/128/256 to help developers more easily support multiple platforms. This was done by mirroring the surface area exposed by `Vector<T>`. The APIs exposed would include the following:
API Proposal
namespace System.Runtime.Intrinsics
{
public static partial class Vector64
{
public static uint ExtractMostSignificantBits<T>(Vector64<T> vector);
public static Vector64<T> Load<T>(T* address);
public static Vector64<T> LoadAligned<T>(T* address);
public static Vector64<T> LoadAlignedNonTemporal<T>(T* address);
public static Vector64<T> LoadUnsafe<T>(ref T address);
public static Vector64<T> LoadUnsafe<T>(ref T address, nuint index);
public static void Store<T>(T* address, Vector64<T> source);
public static void StoreAligned<T>(T* address, Vector64<T> source);
public static void StoreAlignedNonTemporal<T>(T* address, Vector64<T> source);
public static void StoreUnsafe<T>(ref T address, Vector64<T> source);
public static void StoreUnsafe<T>(ref T address, nuint index, Vector64<T> source);
}
public static partial class Vector128
{
public static uint ExtractMostSignificantBits<T>(Vector128<T> vector);
public static Vector128<T> Load<T>(T* address);
public static Vector128<T> LoadAligned<T>(T* address);
public static Vector128<T> LoadAlignedNonTemporal<T>(T* address);
public static Vector128<T> LoadUnsafe<T>(ref T address);
public static Vector128<T> LoadUnsafe<T>(ref T address, nuint index);
public static void Store<T>(T* address, Vector128<T> source);
public static void StoreAligned<T>(T* address, Vector128<T> source);
public static void StoreAlignedNonTemporal<T>(T* address, Vector128<T> source);
public static void StoreUnsafe<T>(ref T address, Vector128<T> source);
public static void StoreUnsafe<T>(ref T address, nuint index, Vector128<T> source);
}
public static partial class Vector256
{
public static uint ExtractMostSignificantBits<T>(Vector256<T> vector);
public static Vector256<T> Load<T>(T* address);
public static Vector256<T> LoadAligned<T>(T* address);
public static Vector256<T> LoadAlignedNonTemporal<T>(T* address);
public static Vector256<T> LoadUnsafe<T>(ref T address);
public static Vector256<T> LoadUnsafe<T>(ref T address, nuint index);
public static void Store<T>(T* address, Vector256<T> source);
public static void StoreAligned<T>(T* address, Vector256<T> source);
public static void StoreAlignedNonTemporal<T>(T* address, Vector256<T> source);
public static void StoreUnsafe<T>(ref T address, Vector256<T> source);
public static void StoreUnsafe<T>(ref T address, nuint index, Vector256<T> source);
}
}
Additional Notes
Ideally we would also expose "shuffle" APIs allowing the elements of a single or multiple vectors to be reordered.
Due to the complexities of these APIs, they can't trivially be exposed as a "single" generic API. Likewise, while the behavior for `Vector128<T>` is consistent on all platforms, `Vector64<T>` is ARM64 specific and `Vector256<T>` is x86/x64 specific.
For the single-vector reordering, the APIs are "trivial":
public static Vector128<byte> Shuffle(Vector128<byte> vector, Vector128<byte> indices)
public static Vector128<sbyte> Shuffle(Vector128<sbyte> vector, Vector128<sbyte> indices)
public static Vector128<short> Shuffle(Vector128<short> vector, Vector128<short> indices)
public static Vector128<ushort> Shuffle(Vector128<ushort> vector, Vector128<ushort> indices)
public static Vector128<int> Shuffle(Vector128<int> vector, Vector128<int> indices)
public static Vector128<uint> Shuffle(Vector128<uint> vector, Vector128<uint> indices)
public static Vector128<float> Shuffle(Vector128<float> vector, Vector128<int> indices)
public static Vector128<long> Shuffle(Vector128<long> vector, Vector128<long> indices)
public static Vector128<ulong> Shuffle(Vector128<ulong> vector, Vector128<ulong> indices)
public static Vector128<double> Shuffle(Vector128<double> vector, Vector128<long> indices)
For the two-vector reordering, the APIs are generally the same:
public static Vector128<short> Shuffle(Vector128<short> lower, Vector128<short> upper, Vector128<sbyte> indices)
public static Vector128<ushort> Shuffle(Vector128<ushort> lower, Vector128<ushort> upper, Vector128<byte> indices)
public static Vector128<int> Shuffle(Vector128<int> lower, Vector128<int> upper, Vector128<short> indices)
public static Vector128<uint> Shuffle(Vector128<uint> lower, Vector128<uint> upper, Vector128<ushort> indices)
public static Vector128<float> Shuffle(Vector128<float> lower, Vector128<float> upper, Vector128<short> indices)
public static Vector128<long> Shuffle(Vector128<long> lower, Vector128<long> upper, Vector128<int> indices)
public static Vector128<ulong> Shuffle(Vector128<ulong> lower, Vector128<ulong> upper, Vector128<uint> indices)
public static Vector128<double> Shuffle(Vector128<double> lower, Vector128<double> upper, Vector128<int> indices)
The exception here is for `byte` and `sbyte`:
public static Vector128<byte> Shuffle(Vector128<byte> lower, Vector128<byte> upper, Vector128<sbyte> lowerIndices, Vector128<sbyte> upperIndices)
public static Vector128<sbyte> Shuffle(Vector128<sbyte> lower, Vector128<sbyte> upper, Vector128<byte> lowerIndices, Vector128<byte> upperIndices)
The names for the
An upside of these APIs is that, for common input scenarios involving constant indices, these can be massively simplified. This is ultimately no worse than a few other scenarios on each platform where one platform may have slightly better instruction generation due to the instructions it provides.
|
namespace System.Runtime.Intrinsics
{
public static partial class Vector64
{
public static uint ExtractMostSignificantBits<T>(this Vector64<T> vector);
public static Vector64<T> Load<T>(T* source);
public static Vector64<T> LoadAligned<T>(T* source);
public static Vector64<T> LoadAlignedNonTemporal<T>(T* source);
public static Vector64<T> LoadUnsafe<T>(ref T source);
public static Vector64<T> LoadUnsafe<T>(ref T source, nuint index);
public static void Store<T>(this Vector64<T> source, T* destination);
public static void StoreAligned<T>(this Vector64<T> source, T* destination);
public static void StoreAlignedNonTemporal<T>(this Vector64<T> source, T* destination);
public static void StoreUnsafe<T>(this Vector64<T> source, ref T destination);
public static void StoreUnsafe<T>(this Vector64<T> source, ref T destination, nuint index);
public static Vector64<byte> Shuffle(Vector64<byte> vector, Vector64<byte> indices);
public static Vector64<sbyte> Shuffle(Vector64<sbyte> vector, Vector64<sbyte> indices);
public static Vector64<short> Shuffle(Vector64<short> vector, Vector64<short> indices);
public static Vector64<ushort> Shuffle(Vector64<ushort> vector, Vector64<ushort> indices);
public static Vector64<int> Shuffle(Vector64<int> vector, Vector64<int> indices);
public static Vector64<uint> Shuffle(Vector64<uint> vector, Vector64<uint> indices);
public static Vector64<float> Shuffle(Vector64<float> vector, Vector64<int> indices);
public static Vector64<byte> Shuffle(Vector64<byte> lower, Vector64<byte> upper, Vector64<byte> indices);
public static Vector64<sbyte> Shuffle(Vector64<sbyte> lower, Vector64<sbyte> upper, Vector64<sbyte> indices);
public static Vector64<short> Shuffle(Vector64<short> lower, Vector64<short> upper, Vector64<short> indices);
public static Vector64<ushort> Shuffle(Vector64<ushort> lower, Vector64<ushort> upper, Vector64<ushort> indices);
public static Vector64<int> Shuffle(Vector64<int> lower, Vector64<int> upper, Vector64<int> indices);
public static Vector64<uint> Shuffle(Vector64<uint> lower, Vector64<uint> upper, Vector64<uint> indices);
public static Vector64<float> Shuffle(Vector64<float> lower, Vector64<float> upper, Vector64<int> indices);
}
public static partial class Vector128
{
public static uint ExtractMostSignificantBits<T>(this Vector128<T> vector);
public static Vector128<T> Load<T>(T* source);
public static Vector128<T> LoadAligned<T>(T* source);
public static Vector128<T> LoadAlignedNonTemporal<T>(T* source);
public static Vector128<T> LoadUnsafe<T>(ref T source);
public static Vector128<T> LoadUnsafe<T>(ref T source, nuint index);
public static void Store<T>(this Vector128<T> source, T* destination);
public static void StoreAligned<T>(this Vector128<T> source, T* destination);
public static void StoreAlignedNonTemporal<T>(this Vector128<T> source, T* destination);
public static void StoreUnsafe<T>(this Vector128<T> source, ref T destination);
public static void StoreUnsafe<T>(this Vector128<T> source, ref T destination, nuint index);
public static Vector128<byte> Shuffle(Vector128<byte> vector, Vector128<byte> indices);
public static Vector128<sbyte> Shuffle(Vector128<sbyte> vector, Vector128<sbyte> indices);
public static Vector128<short> Shuffle(Vector128<short> vector, Vector128<short> indices);
public static Vector128<ushort> Shuffle(Vector128<ushort> vector, Vector128<ushort> indices);
public static Vector128<int> Shuffle(Vector128<int> vector, Vector128<int> indices);
public static Vector128<uint> Shuffle(Vector128<uint> vector, Vector128<uint> indices);
public static Vector128<float> Shuffle(Vector128<float> vector, Vector128<int> indices);
public static Vector128<long> Shuffle(Vector128<long> vector, Vector128<long> indices);
public static Vector128<ulong> Shuffle(Vector128<ulong> vector, Vector128<ulong> indices);
public static Vector128<double> Shuffle(Vector128<double> vector, Vector128<long> indices);
public static Vector128<byte> Shuffle(Vector128<byte> lower, Vector128<byte> upper, Vector128<byte> indices);
public static Vector128<sbyte> Shuffle(Vector128<sbyte> lower, Vector128<sbyte> upper, Vector128<sbyte> indices);
public static Vector128<short> Shuffle(Vector128<short> lower, Vector128<short> upper, Vector128<short> indices);
public static Vector128<ushort> Shuffle(Vector128<ushort> lower, Vector128<ushort> upper, Vector128<ushort> indices);
public static Vector128<int> Shuffle(Vector128<int> lower, Vector128<int> upper, Vector128<int> indices);
public static Vector128<uint> Shuffle(Vector128<uint> lower, Vector128<uint> upper, Vector128<uint> indices);
public static Vector128<float> Shuffle(Vector128<float> lower, Vector128<float> upper, Vector128<int> indices);
public static Vector128<long> Shuffle(Vector128<long> lower, Vector128<long> upper, Vector128<long> indices);
public static Vector128<ulong> Shuffle(Vector128<ulong> lower, Vector128<ulong> upper, Vector128<ulong> indices);
public static Vector128<double> Shuffle(Vector128<double> lower, Vector128<double> upper, Vector128<long> indices);
}
public static partial class Vector256
{
public static uint ExtractMostSignificantBits<T>(this Vector256<T> vector);
public static Vector256<T> Load<T>(T* source);
public static Vector256<T> LoadAligned<T>(T* source);
public static Vector256<T> LoadAlignedNonTemporal<T>(T* source);
public static Vector256<T> LoadUnsafe<T>(ref T source);
public static Vector256<T> LoadUnsafe<T>(ref T source, nuint index);
public static void Store<T>(this Vector256<T> source, T* destination);
public static void StoreAligned<T>(this Vector256<T> source, T* destination);
public static void StoreAlignedNonTemporal<T>(this Vector256<T> source, T* destination);
public static void StoreUnsafe<T>(this Vector256<T> source, ref T destination);
public static void StoreUnsafe<T>(this Vector256<T> source, ref T destination, nuint index);
public static Vector256<byte> Shuffle(Vector256<byte> vector, Vector256<byte> indices);
public static Vector256<sbyte> Shuffle(Vector256<sbyte> vector, Vector256<sbyte> indices);
public static Vector256<short> Shuffle(Vector256<short> vector, Vector256<short> indices);
public static Vector256<ushort> Shuffle(Vector256<ushort> vector, Vector256<ushort> indices);
public static Vector256<int> Shuffle(Vector256<int> vector, Vector256<int> indices);
public static Vector256<uint> Shuffle(Vector256<uint> vector, Vector256<uint> indices);
public static Vector256<float> Shuffle(Vector256<float> vector, Vector256<int> indices);
public static Vector256<long> Shuffle(Vector256<long> vector, Vector256<long> indices);
public static Vector256<ulong> Shuffle(Vector256<ulong> vector, Vector256<ulong> indices);
public static Vector256<double> Shuffle(Vector256<double> vector, Vector256<long> indices);
public static Vector256<byte> Shuffle(Vector256<byte> lower, Vector256<byte> upper, Vector256<byte> indices);
public static Vector256<sbyte> Shuffle(Vector256<sbyte> lower, Vector256<sbyte> upper, Vector256<sbyte> indices);
public static Vector256<short> Shuffle(Vector256<short> lower, Vector256<short> upper, Vector256<short> indices);
public static Vector256<ushort> Shuffle(Vector256<ushort> lower, Vector256<ushort> upper, Vector256<ushort> indices);
public static Vector256<int> Shuffle(Vector256<int> lower, Vector256<int> upper, Vector256<int> indices);
public static Vector256<uint> Shuffle(Vector256<uint> lower, Vector256<uint> upper, Vector256<uint> indices);
public static Vector256<float> Shuffle(Vector256<float> lower, Vector256<float> upper, Vector256<int> indices);
public static Vector256<long> Shuffle(Vector256<long> lower, Vector256<long> upper, Vector256<long> indices);
public static Vector256<ulong> Shuffle(Vector256<ulong> lower, Vector256<ulong> upper, Vector256<ulong> indices);
public static Vector256<double> Shuffle(Vector256<double> lower, Vector256<double> upper, Vector256<long> indices);
}
} |
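As a rough illustration of how the approved load and mask APIs compose, here is a minimal sketch of a byte search. The class name `MaskSearchSketch` and the helper `IndexOfByte` are hypothetical and not part of the proposal; the sketch assumes .NET 7 or later, where `Vector128.LoadUnsafe`, `Vector128.Equals`, and the `ExtractMostSignificantBits` extension method are available.

```csharp
using System;
using System.Numerics;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

// Hypothetical helper type, for illustration only.
public static class MaskSearchSketch
{
    // Returns the index of the first occurrence of `value` in `data`, or -1.
    public static int IndexOfByte(ReadOnlySpan<byte> data, byte value)
    {
        ref byte start = ref MemoryMarshal.GetReference(data);
        nuint length = (nuint)data.Length;
        nuint i = 0;

        if (Vector128.IsHardwareAccelerated && data.Length >= Vector128<byte>.Count)
        {
            Vector128<byte> target = Vector128.Create(value);
            nuint lastBlock = length - (nuint)Vector128<byte>.Count;

            for (; i <= lastBlock; i += (nuint)Vector128<byte>.Count)
            {
                // ref + index overload: no pinning and no explicit Unsafe.Add call.
                Vector128<byte> block = Vector128.LoadUnsafe(ref start, i);

                // Matching elements compare to all-ones, so each match sets one mask bit.
                uint mask = Vector128.Equals(block, target).ExtractMostSignificantBits();
                if (mask != 0)
                {
                    // The lowest set bit corresponds to the first matching element.
                    return (int)i + BitOperations.TrailingZeroCount(mask);
                }
            }
        }

        // Scalar loop for the remaining elements (or for all of them without acceleration).
        for (; i < length; i++)
        {
            if (data[(int)i] == value)
            {
                return (int)i;
            }
        }

        return -1;
    }
}
```

The same loop could use `Vector128.Load` with a pinned pointer instead; the `ref` overload is used here only because it avoids the `fixed` block.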
@tannergooding are there any more APIs left to implement from this issue? |
The three operand shuffle APIs:
public static Vector256<byte> Shuffle(Vector256<byte> lower, Vector256<byte> upper, Vector256<byte> indices);
public static Vector256<sbyte> Shuffle(Vector256<sbyte> lower, Vector256<sbyte> upper, Vector256<sbyte> indices);
public static Vector256<short> Shuffle(Vector256<short> lower, Vector256<short> upper, Vector256<short> indices);
public static Vector256<ushort> Shuffle(Vector256<ushort> lower, Vector256<ushort> upper, Vector256<ushort> indices);
public static Vector256<int> Shuffle(Vector256<int> lower, Vector256<int> upper, Vector256<int> indices);
public static Vector256<uint> Shuffle(Vector256<uint> lower, Vector256<uint> upper, Vector256<uint> indices);
public static Vector256<float> Shuffle(Vector256<float> lower, Vector256<float> upper, Vector256<int> indices);
public static Vector256<long> Shuffle(Vector256<long> lower, Vector256<long> upper, Vector256<long> indices);
public static Vector256<ulong> Shuffle(Vector256<ulong> lower, Vector256<ulong> upper, Vector256<ulong> indices);
public static Vector256<double> Shuffle(Vector256<double> lower, Vector256<double> upper, Vector256<long> indices);
I'm still working on them and expect them to be in by code complete. That being said, in the worst case these won't be available in .NET 7 due to time constraints and other work on my plate taking precedence. The two operand shuffle APIs are already in and cover a large number of the scenarios, so this one being missing won't significantly hurt the feature. |
cc @fanyang-mono for Mono implementations |
Everything is done here except for the shuffle APIs that take two inputs and the index mask. Perf improvements are still needed for the other shuffle APIs. |
The three operand APIs missed .NET 8 as well. We did manage to land the AVX-512 full permute instructions and the AdvSimd multi-input table lookup instructions, however. So we should be able to more easily land this support in the future. |
Summary
With #49397 we approved and exposed cross platform APIs on Vector64/128/256 to help developers more easily support multiple platforms.
This was done by mirroring the surface area exposed by `Vector<T>`. However, due to their fixed size there are some additional APIs that would be beneficial to expose. Likewise, there are a few APIs for loading/storing vectors that are commonly used for hardware intrinsics that would be beneficial to have cross platform helpers for.
The APIs exposed would include the following:
- `ExtractMostSignificantBits`
  - On x86/x64 this is `MoveMask` and performs exactly as expected
  - On ARM64 this is emitted as `and, element-wise shift-right, 64-bit pairwise add, extract`. The JIT could optionally detect if the `input` is the result of a `Compare` instruction and elide the `shift-right`.
  - On WASM this is `bitmask` and works identically to `MoveMask`
- `Load/Store`
- `LoadAligned/StoreAligned`
- `LoadAlignedNonTemporal/StoreAlignedNonTemporal`
  - Behaves as `LoadAligned/StoreAligned` but may optionally treat the memory access as `non-temporal` and avoid polluting the cache
- `LoadUnsafe/StoreUnsafe`
  - The overload taking a `ref T` behaves exactly like the version that takes a `pointer`, just without requiring pinning
  - The overload taking a `nuint index` behaves like `ref Unsafe.Add(ref value, index)` and avoids needing to further bloat IL and hinder readability
API Proposal
Additional Notes
Ideally we would also expose "shuffle" APIs allowing the elements of a single or multiple vectors to be reordered:
- `Shuffle` or `Permute` (generally takes two elements and one element, respectively; but that isn't always the case)
- `VectorTableLookup` (only takes two elements)
- `Shuffle` (takes two elements) and `Swizzle` (takes one element)
- `VectorShuffle` (only takes two elements)
Due to the complexities of these APIs, they can't trivially be exposed as a "single" generic API. Likewise, while the behavior for `Vector128<T>` is consistent on all platforms, `Vector64<T>` is ARM64 specific and `Vector256<T>` is x86/x64 specific. The former behaves like `Vector128<T>` while the latter generally behaves like `2x Vector128<T>` (outside a few APIs called `Permute#x#`). For consistency, the `Vector256<T>` APIs exposed here would behave identically to `Vector128<T>` and allow "cross lane permutation".
For the single-vector reordering, the APIs are "trivial":
For the two-vector reordering, the APIs are generally the same:
An upside of these APIs is that for common input scenarios involving constant indices, these can be massively simplified.
A downside for these APIs is that non-constant inputs on older hardware, or certain `Vector256<T>` shuffles involving `byte`, `sbyte`, `short`, or `ushort` that cross the 128-bit lane boundary, can take a couple of instructions rather than being a single instruction.
This is ultimately no worse than a few other scenarios on each platform where one platform may have slightly better instruction generation due to the instructions it provides.
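To make the single-vector reordering and the "cross lane permutation" point concrete, here is a small sketch using constant indices. The class `ShuffleSketch` and its method names are hypothetical; the sketch assumes the two-operand `Vector128.Shuffle` and `Vector256.Shuffle` overloads as they shipped in .NET 7, where each result element is taken from `vector` at the position given by the matching element of `indices`.

```csharp
using System;
using System.Runtime.Intrinsics;

// Hypothetical helper type, for illustration only.
public static class ShuffleSketch
{
    // Reverses the four elements: result[i] = vector[indices[i]].
    public static Vector128<int> Reverse(Vector128<int> vector)
        => Vector128.Shuffle(vector, Vector128.Create(3, 2, 1, 0));

    // Swaps the two 128-bit halves of a Vector256<int>.
    // Every element moves across the 128-bit lane boundary.
    public static Vector256<int> SwapHalves(Vector256<int> vector)
        => Vector256.Shuffle(vector, Vector256.Create(4, 5, 6, 7, 0, 1, 2, 3));
}
```

Because the indices here are constants, this is the kind of input the issue expects to simplify well; how many instructions the `Vector256<int>` case actually takes depends on the hardware and the JIT.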