Essay on latency with regards to SIMD #87
Conversation
essay/swar/latency.md
Outdated
```cpp
    return uint32_t(by10001base2to32.value() >> 32);
}
```
resulting in identical code generation even with minimal optimization, and we will keep iterating so that the code reflects this is just one example of what we call "associative iteration" --[Glossary](../../glossary.md#associative-iteration)--.
A minor suggestion:
This results in identical code generation even with minimal optimization. Furthermore, we are going to iterate on this code so we can express it as a specialisation of an abstraction we call "Associative Iteration".
I think it will add nice context to link to the code, for example. It makes it feel concrete.
@thecppzoo I think this expresses my idea more clearly. You'll agree I'm sure that the point I was making was minor.
But I am not writing specialiSation.
I enjoyed reading it! I had some small comments to help with readability.
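For context on the snippet quoted above: it is the tail of a log-depth, parallel base-10 parse. A minimal, self-contained sketch of the same combination scheme, essentially the eight-digit SWAR parse Lemire has published; the name and layout here are illustrative, not the zoo library's actual code, and a little-endian load is assumed:

```cpp
#include <cstdint>
#include <cstring>

uint32_t parseEightDigits(const char *chars) {
    uint64_t v;
    std::memcpy(&v, chars, 8);  // eight ASCII digits, little endian
    // Stage 1: clear the ASCII '0' bits, then combine adjacent digits
    // into two-digit values: multiply by 10*256 + 1, keep alternate bytes.
    v = (v & 0x0F0F0F0F0F0F0F0F) * 2561 >> 8;
    // Stage 2: combine adjacent two-digit values into four-digit values,
    // multiplying by 100*65536 + 1.
    v = (v & 0x00FF00FF00FF00FF) * 6553601 >> 16;
    // Stage 3: combine the two four-digit halves with 10000*2^32 + 1,
    // the "by10001base2to32" multiplier of the quoted snippet.
    return uint32_t((v & 0x0000FFFF0000FFFF) * 42949672960001 >> 32);
}
```

Three multiplications on the critical path for eight digits: that is the associativity payoff the quoted paragraph alludes to.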
essay/swar/latency.md
Outdated
I think our colleagues in GLIBC that are so fond of assembler (we have proven this since our equivalents are more than 90% just C++ with very topical use of intrinsics, but theirs are through and through assembler code) would have identified the architectural primitive that would have spared them to have to use general purpose computation to identify the index into the block where the first null occurs. Since they need to create a "mask" of booleans, it seems that those horizontal primitives are missing from AVX2 and Arm Neon.
In our Robin Hood implementation there are many examples of our generation of a SIMD of boolean results and the immediate use of those booleans as SIMD values for further computation. I believe this helps our SWAR, ["software SIMD"](https://github.com/thecppzoo/zoo/blob/em/essay-swar-latency.md/glossary.md#software-simd) implementations to be competitive with the software that uses "hardware SIMD".
looks like this link to the glossary is absolute and the other is relative, does this matter?
I think absolute/relative links are a "pick your poison" situation, it is not clear to me which is better. The absolute link is easier because I copy and paste and it's done; but I make some links relative by taking the trouble to craft the link.
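Returning to the missing horizontal primitive the quoted paragraph describes: a hedged sketch of the detour AVX2 forces for "index of the first null byte", where the boolean lanes must leave the SIMD domain through a scalar mask (function name hypothetical; requires AVX2):

```cpp
#include <immintrin.h>
#include <cstdint>

int indexOfFirstNull(const char *block32) {  // examines 32 bytes
    auto bytes = _mm256_loadu_si256(reinterpret_cast<const __m256i *>(block32));
    auto nulls = _mm256_cmpeq_epi8(bytes, _mm256_setzero_si256());  // boolean lanes
    // The "horizontal" step: per-lane booleans become a scalar bitfield...
    auto mask = uint32_t(_mm256_movemask_epi8(nulls));
    // ...that must then be processed with general purpose instructions.
    return mask ? __builtin_ctz(mask) : -1;  // -1: no null in this block
}
```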
essay/swar/latency.md
Outdated
This also shows a notorious deficiency in the efforts to standardize in C++ the interfaces for hardware SIMD: The results of lane-wise processing (vertical processing) that generate booleans would generate "mask" values of types that are bitfields of the results of comparisons, not SIMD type values. So, if you were to consume these results in further SIMD processing, you'd need to convert and adjust the bitfields in general purpose registers into SIMD registers, this "back and forth SIMD-scalar" conversions would add further performance penalties that compiler optimizers might not be able to eliminate, for a variety of reasons that are out of the scope of this document, wait for us to write an essay that contains the phrase "cuckoo-bananas".
Our SWAR library, in this regard, is "future proof": boolean results are just an specific form of SIMD values. We use this already for SIMD operations such as saturated addition: unsigned overflow (carry) is detected lane wise and lane-wise propagated to the whole lane, without conversion to scalar. Saturation is implemented seamlessly as after-processing of computation, without having to convert to "scalar" (the result mask) and back. |
Saturation is implemented seamlessly as after-processing of computation, without having to convert to "scalar" (the result mask) and back.
This feels like a productive point to mention.
Do you mean simply that I covered an important point, or are you implying this needs to be featured more prominently?
I think he's just pointing out it is a critical thing to mention, and you've mentioned it.
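Since saturated addition keeps coming up: a freestanding sketch of the idea under discussion, for eight 8-bit lanes in a 64-bit word; an illustration of the technique, not the zoo library's actual implementation:

```cpp
#include <cstdint>

uint64_t addSaturatedLanes(uint64_t a, uint64_t b) {  // eight 8-bit lanes
    constexpr uint64_t MSBs = 0x8080808080808080ull;
    // Add the low 7 bits of every lane; carries cannot cross lanes.
    uint64_t lowSum = (a & ~MSBs) + (b & ~MSBs);
    uint64_t sum = lowSum ^ (a & MSBs) ^ (b & MSBs);  // XOR in the lane MSBs
    // BooleanSWAR of lane overflow: majority of {a's MSB, b's MSB, carry in}.
    uint64_t overflow = ((a & b) | ((a | b) & lowSum)) & MSBs;
    // Propagate each detected MSB down its own lane: 0x80 | (0x80 - 0x01) = 0xFF.
    uint64_t saturation = overflow | (overflow - (overflow >> 7));
    return sum | saturation;  // overflowed lanes become 0xFF, others unchanged
}
```

No mask ever travels to a scalar register: the boolean lanes are consumed as SIMD values, which is the point the comment highlights.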
The essence of parallel algorithms is the "mapping" of an operation that is associative to multiple data elements simultaneously or in parallel, and the collection (reduction, gathering, combination) of the partial results into the final result.
The combination of the partial results induces a tree structure, whose height is in the order of the logarithm of the number of elements N.
maybe outside the scope of what you want to do, but possibly a diagram of the "tree structure" could be illuminating?
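Seconding the diagram request, something like this could be dropped in; for N = 8 the combination tree has height log2(8) = 3, versus 7 dependent steps for a serial fold:

```
a0  a1  a2  a3  a4  a5  a6  a7
 \  /    \  /    \  /    \  /
 a0+a1   a2+a3   a4+a5   a6+a7       level 1
    \     /         \     /
   a0+...+a3       a4+...+a7         level 2
        \             /
          a0+...+a7                  level 3 = log2(8)
```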
essay/swar/latency.md
Outdated
A comparison with a fundamentally different technique is to revert to the linear, serial mechanism of "multiplying by 10 the previous result and adding the current digit". That obviously leads to a latency of 7 partial results, multiplication is present in the stages of both schemes, and multiplication is the dominant latency[^1], 7 is worse than 3, so, Lemire's mechanism has, relatively, much lower, not higher, latency.
Then Lemire is comparing his mechanism to some unknown mechanism that might have smaller latency that can not be the normal mathematics of multiplication by a base. The mathematical need is to convert numbers in base 10 into base 2. There may be exotic mathematics for base conversions that use other methods, perhaps something based on the Fourier transform as in the best multiplication algorithms, but for sure would be practical only for immense numbers of digits.
It seems like this paragraph starts by referencing a quote from the blog.
Then Lemire is comparing his mechanism to some unknown mechanism...
Could it be useful to mention what this is, or maybe have it as an inline link?
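For side-by-side contrast with the parseEightDigits sketch above, the serial scheme the paragraph describes looks like this; every step consumes the previous result, so eight digits cost seven dependent multiply-adds:

```cpp
#include <cstdint>

uint32_t parseEightDigitsSerial(const char *chars) {
    uint32_t result = 0;
    for (int i = 0; i != 8; ++i) {
        // Each iteration depends on the previous one: a chain of 7
        // multiply-adds for 8 digits, versus 3 multiplications above.
        result = result * 10 + uint32_t(chars[i] - '0');
    }
    return result;
}
```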
Lemire's point of view that his mechanism has high latency seems to be just wrong.
This example goes to show why SWAR mechanisms also have good latencies: If the SWAR operation is the parallel mapping of computation and the parallel combination of partial results, they very well be of optimal latency. |
I don't understand this final clause here.
If the SWAR operation is the parallel mapping of computation and the parallel combination of partial results, they very well be of optimal latency.
Either I'm missing context or the phrasing might be unclear.
s/very well/will, maybe?
essay/swar/latency.md
Outdated
In our Robin Hood implementation there are many examples of our generation of a SIMD of boolean results and the immediate use of those booleans as SIMD values for further computation. I believe this helps our SWAR, ["software SIMD"](https://github.com/thecppzoo/zoo/blob/em/essay-swar-latency.md/glossary.md#software-simd) implementations to be competitive with the software that uses "hardware SIMD".
This also shows a notorious deficiency in the efforts to standardize in C++ the interfaces for hardware SIMD: The results of lane-wise processing (vertical processing) that generate booleans would generate "mask" values of types that are bitfields of the results of comparisons, not SIMD type values. So, if you were to consume these results in further SIMD processing, you'd need to convert and adjust the bitfields in general purpose registers into SIMD registers, this "back and forth SIMD-scalar" conversions would add further performance penalties that compiler optimizers might not be able to eliminate, for a variety of reasons that are out of the scope of this document, wait for us to write an essay that contains the phrase "cuckoo-bananas". |
this "back and forth SIMD-scalar" conversions would add further performance penalties that compiler optimizers might not be able to eliminate
I think this is something that people overlook as an overhead to their hardware SIMD optimisations (it's not something I thought about when I was first writing and speaking about SIMD!), and is possibly worth highlighting. But as you say, maybe people need to wait for the next instalment!
glossary.md
Outdated
### Associative Iteration:
Defined in code, [associative_iteration.h](https://github.com/thecppzoo/zoo/blob/8f2e29d48194fb17bbf79688106bd28f44f7e11c/inc/zoo/swar/associative_iteration.h#L366-L387).
Will be upgraded to a recursive template that changes function arguments and return types in a way in which the new return type is the same as the next recursive invocation argument type.
### Hardware SIMD
Refers to SIMD implemented in actual, real, SIMD architectures such as AVX or ARM Neon. Implies the direct application of architectural intrinsics, or compiler intrinsics (`__builtin_popcount`) that translate almost directly to assembler instructions. Note: there are compiler intrinsics such as GCC/Clang's `__builtin_ctz` that would be implemented, in ARM little endian implementations, as bit reversal followed with counting *leading* zeros: [compiler explorer](https://godbolt.org/z/xsKerbKzK)
### Software SIMD
Refers to the emulation of SIMD using software. The most prominent example is our SWAR library. |
This is an absolutely excellent idea.
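A tiny example of the `__builtin_ctz` note in the glossary entry; the AArch64 lowering transcribed below is what the linked Compiler Explorer output shows (subject to compiler and flags):

```cpp
unsigned trailingZeros(unsigned x) {
    return __builtin_ctz(x);  // precondition: x != 0
}
// Typical AArch64 code generation:
//     rbit w0, w0   // reverse the bits...
//     clz  w0, w0   // ...then count *leading* zeros
//     ret
```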
Explains the relationship between our code and Lemire's. Includes appendix to multiplications
essay/swar/latency.md
Outdated
I strongly suspect not, since the cheapest instructions such as bitwise operations and additions have latencies in the order of 1 cycle. This method requires 1 multiplication, 1 mask, and 1 shift per stage, not more than 6 cycles, let's say that all of the register renaming, constant loading take one cycle. That's 7 cycles per stage, and the number of stages is minimal. See that multiplying by 10 as an addition chain requires 5 steps, so, that's a dead end. Even the trick of using "Load Effective Address" in x86-64 using its very complex addressing modes ends quickly. We can see this in the [compiler explorer](https://godbolt.org/z/PjMoGzbPa), multiplying by 10 might be implemented with just 2 instructions of minimal latency (adding and `LEA`), but the compiler optimizer would not multiply by 100 without using the actual multiplication instruction.
I strongly suspect not, since the cheapest instructions such as bitwise operations and additions have latencies in the order of 1 cycle. This method requires 1 multiplication, 1 mask, and 1 shift per stage, not more than 6 cycles, let's say that all of the register renaming, constant loading take one cycle. That's 7 cycles per stage, and the number of stages is minimal. See that multiplying by 10 as an addition chain requires 5 steps, so, that's a dead end. Even the trick of using "Load Effective Address" in x86-64 using its very complex addressing modes could calculate the necessary things for base 10 conversion, but with instruction counts that won't outperform the 64 bit multiplication in each Lemire's stage. We can see this in the [compiler explorer](https://godbolt.org/z/PjMoGzbPa), multiplying by 10 might be implemented with just 2 instructions of minimal latency (adding and `LEA`), but the compiler optimizer would not multiply by 100 without using the actual multiplication instruction.
See the appendix where we list the generated code in x86-64 for multiplications of factors up to 100, you'll see that the optimizer seems to "give up" on addition chains using the `LEA` instruction and the indexed addressing modes at a rate of four instructions of addition chains for one multiplication. Since other architectures can at most match the complexity of x86-64's `LEA` and addressing modes, we can be **certain** that there isn't a "budget" of non-multiplication operations that would outperform the base conversion that relies on multiplication.
Very nice explanatory sentence here.
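To make the `LEA` point concrete without reproducing the whole appendix, a condensed pair of cases; the assembly is transcribed as comments from typical x86-64 output, and exact instruction selection varies by compiler:

```cpp
#include <cstdint>

uint64_t by10(uint64_t x) { return x * 10; }
//     lea rax, [rdi + 4*rdi]   ; 5*x in one instruction, via addressing modes
//     add rax, rax             ; 10*x: two cheap instructions, no multiply
uint64_t by100(uint64_t x) { return x * 100; }
//     imul rax, rdi, 100       ; the optimizer stops chaining LEAs here
```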
essay/swar/latency.md
Outdated
@@ -93,13 +98,398 @@ This also shows a notorious deficiency in the efforts to standardize in C++ the
Our SWAR library, in this regard, is "future proof": boolean results are just an specific form of SIMD values. We use this already for SIMD operations such as saturated addition: unsigned overflow (carry) is detected lane wise and lane-wise propagated to the whole lane, without conversion to scalar. Saturation is implemented seamlessly as after-processing of computation, without having to convert to "scalar" (the result mask) and back.
In the early planning stages for support of "hardware" SIMD (architectures that do SIMD), I've thought that perhaps we will have to keep our emulated SIMD implementations on "hardware SIMD" as a flavor for users to choose from, since they may be the only way to avoid performance penalties that are higher than the cost of emulation, to prevent the error in the standard library.
In the early planning stages for support of "hardware" SIMD (architectures that do SIMD), I've thought that perhaps we will have to keep our emulated SIMD implementations on "hardware SIMD" as a flavor for users to choose from, since they may be the only way to avoid performance penalties that are higher than the cost of emulation, to prevent the error in the architecture, an error that no standard library could avoid except synthesizing the operations as we do.
Run on could use a break apart. I think it's saying: Early, we planned hardware vs software SIMD as flavors for users to choose from. Using a flavor seems like the only way to avoid <some performance penalty and solution I can't quite understand from the sentence>
## Appendix, multiplication codegen for x86-64
For this code, multiplying by a compile-time constant, we get this generated code [nice Compiler Explorer link](https://godbolt.org/z/z3b7e1Tnx): |
The example assembly is wildly long: I suspect it would be more illustrative to break out a few critical examples and rely on the godbolt link for those that wish to see them all?
I still see a double 'for example' in a single sentence, probably notes from email not yet addressed. Thanks for the iteration.
Important footnotes (treatises)
essay/swar/latency.md
Outdated
Curiously enough, our SWAR library gives that capability directly; in the `strlen` case, the operation is "count trailing zeros", however, most SIMD architectures, what we call "hardware SIMD", do not!
I think our colleagues in GLIBC that are so fond of assembler (we have proven this since our equivalents are more than 90% just C++ with very topical use of intrinsics, but theirs are through and through assembler code) would have identified the architectural primitive that would have spared them to have to use general purpose computation to identify the index into the block where the first null occurs. Since they need to create a "mask" of booleans, it seems that those horizontal primitives are missing from AVX2 and Arm Neon.
In our Robin Hood implementation there are many examples of our generation of a SIMD of boolean results and the immediate use of those booleans as SIMD values for further computation. I believe this helps our SWAR, ["software SIMD"](https://github.com/thecppzoo/zoo/blob/em/essay-swar-latency.md/glossary.md#software-simd) implementations to be competitive with the software that uses "hardware SIMD".
This also shows a notorious deficiency in the efforts to standardize in C++ the interfaces for hardware SIMD: The results of lane-wise processing (vertical processing) that generate booleans would generate "mask" values of types that are bitfields of the results of comparisons, not SIMD type values. So, if you were to consume these results in further SIMD processing, you'd need to convert and adjust the bitfields in general purpose registers into SIMD registers, this "back and forth SIMD-scalar" conversions would add further performance penalties that compiler optimizers might not be able to eliminate, for a variety of reasons that are out of the scope of this document, wait for us to write an essay that contains the phrase "cuckoo-bananas".
This also shows a notorious deficiency in the efforts to standardize in C++ the interfaces for hardware SIMD: The results of lane-wise processing (vertical processing) that generate booleans would generate "mask" values of types that are bitfields of the results of comparisons, an "scalar" to be used in the non-vector "General Purpose Registers", GPRs, not SIMD type values [^3]. So, if you were to consume these results in further SIMD processing, you'd need to convert and adjust the bitfields in general purpose registers into SIMD registers, this "back and forth SIMD-scalar" conversions would add further performance penalties that compiler optimizers might not be able to eliminate, for a variety of reasons that are out of the scope of this document. Wait for us to write an essay that contains the phrase "cuckoo-bananas".
Suggest:
s/an "scalar"/a "scalar"
s/So, if/If
It might be more compact to say, "Using the output of one hardware SIMD bifield to control the next hardware SIMD instruction requires processing via a general purpose register, incurring a devastating overhead" or similar?
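A hedged sketch of the round trip under discussion, in AVX2 terms: the comparison produces boolean lanes, `vpmovmskb` moves them to a GPR, and re-entering the SIMD domain needs a broadcast, a shuffle, and a per-lane bit test, because there is no inverse of movemask (function name hypothetical; requires AVX2):

```cpp
#include <immintrin.h>
#include <cstdint>

__m256i booleanLanesThroughAGPR(__m256i a, __m256i b) {
    auto greater = _mm256_cmpgt_epi8(a, b);                // boolean lanes
    auto mask = uint32_t(_mm256_movemask_epi8(greater));   // SIMD -> GPR
    // ... scalar processing of `mask` would happen here ...
    // GPR -> SIMD again: replicate the mask, route each of its bytes to
    // the lanes it governs, and test each lane's own bit.
    auto replicated = _mm256_set1_epi32(int(mask));
    auto routed = _mm256_shuffle_epi8(replicated, _mm256_setr_epi8(
        0,0,0,0,0,0,0,0, 1,1,1,1,1,1,1,1, 2,2,2,2,2,2,2,2, 3,3,3,3,3,3,3,3));
    auto bits = _mm256_setr_epi8(1,2,4,8,16,32,64,-128, 1,2,4,8,16,32,64,-128,
                                 1,2,4,8,16,32,64,-128, 1,2,4,8,16,32,64,-128);
    return _mm256_cmpeq_epi8(_mm256_and_si256(routed, bits), bits);
}
```

Several extra instructions and a domain crossing, just to recover in SIMD form the value the comparison had already produced.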
essay/swar/latency.md
Outdated
[^2] We're beginning to work on explaining the basics of processor instructions and operations costs.
[^3] In AVX/AVX2 (of x86-64) typically a SIMD comparison such as "greater than" would generate a mask: for inputs `a` and `b`, the assembler would look like this (ChatGPT generated): |
Ugh, please, never mention the stochastic parrots by name. Best case they gave you something correct, worst case it is a crutch understood by your readers as a near verboten substitute for the sentence "I didn't understand what I published."
GitHub markdown does not work well with footnotes of several paragraphs
essay/swar/latency.md
Outdated
Take AVX512, for example, the latest and best of x86_64 SIMD: in AVX512, there are mask registers, of 64 bits, so they can map one bit to every byte-lane of a `ZMM` register. It recognizes the need to prevent computation on certain lanes, the ones that are "masked out" by the mask register. However, this is not the same thing as having a bit in the lane itself that you can use in computation. To do that, you need to go to the GPRs moving a mask, back to the `ZMM` registers via a move followed by a bit-broadcast, just like in AVX/AVX2. The reason why apparently nobody else has "felt" this pain might be, I speculate, because SIMD is mostly used with floating point numbers. It wouldn't make sense to turn on a bit in lanes.
Our SWAR work demonstrates that there are plentiful opportunities to turn normally serial computation into ad-hoc parallelism, for this we use integers always, and we can really benefit from the lanes themselves having a bit in them, as opposed to predicating each lane by a mask. Our implementation of saturated addition is one example: we detect overflow (carry) as a `BooleanSWAR`, we know that is the most significant bit of each lane, we can copy down that bit by subtracting the MSB as the LSB of each lane.
[^4] When we mean "future proof" we mean that the discernible tendencies agree with our approach: Fundamental physics is impeding further miniaturization of transistors, I speculate that, for example, the execution units dedicated to SIMD and "scalar" or General Purpose operations would have to be physically separated in the chip, moving data from one side to the other would be ever more expensive. Actually, in x86-64 similar penalties existed such as having the result of a floating point operation being an input to an integer operation, changing the "type" of value held, back when AVX and AVX2 were relatively new, exhibited a penalty that was not negligible. Because the execution units for FP and Integers were different. This changed, nowadays they use the same execution units, as evidenced in the port allocations for the microarchitectures, but I conjecture that "something has to give", I see that operations on 64 bit "scalar" integers and 512 bit `ZMM` values with internal structure of lanes are so different that "one type fits all" execution units would be less effective. We can foresee ever wider SIMD computation, ARM's design has an extension that would allow immense widths of 2048 bits of SIMD, it is called "Scalable Vector Extensions", RISC-V has widening baked into the very design, they call it "Vector Length Agnostic" with its acronym VLA, so that you program in a way that is not specific to a width... Just thinking about the extreme challenge of merely propagating signals in buses of 2048, that the wider they are they also have to be longer, that the speed of signal propagation, the speed of light, the maximum speed in the universe, has been a limiting factor that is ever more limiting; I really think that the round trip of SIMD boolean lane-wise test to mask in SIMD register to GPR and back to SIMD will be more significant, our SWAR library already avoids this. Yet, it is crucial to work with the result of boolean tests in a SIMD way, this is the key to branchless programming techniques, the only way to avoid control dependencies.
I thought that converting between int and float types generally had a cost overhead? Page 41 in Agner Fog C++ optimization manual: https://www.agner.org/optimize/optimizing_cpp.pdf is where I saw this ten plus years ago. It still says going from int to float takes 4-16 clocks, but says float to int either takes 50-100 or 'is fast' if SSE2 is enabled. I'm not sure what 'is fast' means there. Not that it matters to this footnote, which is an excellent illustration of "as pushing things back and forth across the SIMD/GP boundary is already bad, and it is going to get much worse"
I believe the essay is ready at this point.
Our SWAR library, in this regard, is "future proof": boolean results are just an specific form of SIMD values. We use this already for SIMD operations such as saturated addition: unsigned overflow (carry) is detected lane wise and lane-wise propagated to the whole lane, without conversion to scalar. Saturation is implemented seamlessly as after-processing of computation, without having to convert to "scalar" (the result mask) and back.
This also shows a notorious deficiency in the efforts to standardize in C++ the interfaces for hardware SIMD: The results of lane-wise processing (vertical processing) that generate booleans would generate "mask" values of types that are bitfields of the results of comparisons, an "scalar" to be used in the non-vector "General Purpose Registers", GPRs, not SIMD type values [^3]. So, if you were to consume these results in further SIMD processing, you'd need to convert and adjust the bitfields in general purpose registers into SIMD registers, this "back and forth SIMD-scalar" conversions would add further performance penalties that compiler optimizers might not be able to eliminate, for a variety of reasons that are out of the scope of this document. Wait for us to write an essay that contains the phrase "cuckoo-bananas". |
s/an "scalar"/a "scalar"
s/this "back and forth SIMD-scalar" conversions/these "back and forth SIMD-scalar" conversions
" compiler optimizers might not be able to eliminate" feels like a WILD understatement. I have no idea how a compiler would eliminate a single one of these cases currently.
essay/swar/latency.md
Outdated
In the early planning stages for support of "hardware" SIMD (architectures that do SIMD), I've thought that perhaps we will have to keep our emulated SIMD implementations on "hardware SIMD" as a flavor for users to choose from, since they may be the only way to avoid performance penalties that are higher than the cost of emulation, to prevent the error in the architecture, an error that no standard library could avoid except synthesizing the operations as we do.
Our SWAR library, in this regard, is "future proof" [^4]: boolean results are just an specific form of SIMD values. We use this already for SIMD operations such as saturated addition: unsigned overflow (carry) is detected lane wise and lane-wise propagated to the whole lane, without conversion to scalar. Saturation is implemented seamlessly as after-processing of computation, without having to convert to "scalar" (the result mask) and back. |
s/an specific/a specific
essay/swar/latency.md
Outdated
In the early planning stages for support of "hardware" SIMD (architectures that do SIMD), I've thought that perhaps we will have to keep our emulated SIMD implementations on "hardware SIMD" as a flavor for users to choose from, since they may be the only way to avoid performance penalties that are higher than the cost of emulation, to prevent the error in the architecture, an error that no standard library could avoid except synthesizing the operations as we do.
Our SWAR library, in this regard, is "future proof" [^4]: boolean results are just an specific form of SIMD values. We use this already for SIMD operations such as saturated addition: unsigned overflow (carry) is detected lane wise and lane-wise propagated to the whole lane, without conversion to scalar. Saturation is implemented seamlessly as after-processing of computation, without having to convert to "scalar" (the result mask) and back.
In the early planning stages for support of "hardware" SIMD (architectures that do SIMD), I've thought that perhaps we will have to keep our emulated SIMD implementations on "hardware SIMD" as a flavor for users to choose from. If the round trip to GPRs is expensive, which might be the only way, that "hardware SIMD" gives to use the result of SIMD-boolean tests in further SIMD processing, then the emulation of some operations might be more performant... |
gives to use the result
Is 'use' supposed to be 'us' or something else?
essay/swar/latency.md
Outdated
Which seems the same idea, the wonder that it works. I prefer to side with Richard P. Feynman, also NPW, and perhaps more famous than E. Wigner, that for him, mathematics is just "organized reasoning". Then Wigner's point of view gets reduced by Feynman to the apparent fact that nature is amenable to be described by reasoning, or that nature is reasonable. This begs for a Wittgenstein's retort of how would an unreasonable nature would look like given that all we know is our apparently reasonable universe --complex numbers end up making sense after deep study, the sooner you reject the perspective that real numbers describe the universe the sooner you'd regain your senses, the universe is not unreasonable, it is that it is complex numbers what describe the universe, not the too-simple real numbers--
Fortunately, as software engineers, we know what it looks like when SIMD is not approached from a principled stand point, you get Intel's SSE, AVX, AVX512: an unholy mess of marketing-driven engineering (Google for what Agner Fog says about this design, and all of Intel's ISAs for that matter); back to SWAR, it really doesn't surprise me the effectiveness of integer arithmetic in combination with bitwise operations to express tactical parallelism: they are what was designed and proved to work as the basis for all computation. In the end, it is all functions with binary values, binary functions are the simplest, **where you have to be smart is into combining them in ways that are combinable further** |
s/stand point/standpoint
s/where you have to be smart is into combining them in ways that are combinable further/where you have to be smart is in combining them in ways that are combinable further/
or maybe 'combining them in ways that allow more combinations' or something.
[^3]: See: [SIMD-GPR round trips](#round-trips-between-simd-and-gpr)
The SIMD architectures I know all turn lane-wise comparisons such as "greater than" into SIMD registers with each lane full of 1s or 0s depending on the comparison. This can be used for further SIMD computation. The problem arises when you want to apply a "horizontal" operation on the SIMD, for example, "what is the first, or least significant lane in which the condition is true?", as in `strlen`, "what is the index of the lane with the first null byte?" Typically, the SIMD ISA does **not** have an instruction for this, forcing you to create a "mask" extracting bits from each lane, as in x86_64 "vpmovmskb", then process the mask using GPRs, scalar, registers. If you then want to convert your scalar result back to a SIMD value/register, for further branchless SIMD processing, you are very much like going up a creek without a paddle. |
s/you are very much like going up a creek/you are going up a creek/
or possibly /you are absolutely going up a creek/
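For contrast with the footnote's description of the mask detour, the SWAR formulation of the same `strlen` question stays in the GPRs the whole time, so the horizontal step is one count-trailing-zeros; a sketch using the classic zero-byte detector, little endian assumed, not the zoo library's actual code:

```cpp
#include <cstdint>
#include <cstring>

int firstNullIndex(const char *block8) {  // examines 8 bytes
    uint64_t v;
    std::memcpy(&v, block8, 8);
    // Classic detector: flags the MSB of every zero byte; false flags can
    // only appear above the first zero byte, so the lowest flag is exact.
    uint64_t flags = (v - 0x0101010101010101ull) & ~v & 0x8080808080808080ull;
    if (!flags) return -1;               // no null among these 8 bytes
    return __builtin_ctzll(flags) / 8;   // bit position -> byte index
}
```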
[^4] When we mean "future proof" we mean that the discernible tendencies agree with our approach: Fundamental physics is impeding further miniaturization of transistors, I speculate that, for example, the execution units dedicated to SIMD and "scalar" or General Purpose operations would have to be physically separated in the chip, moving data from one side to the other would be ever more expensive. Actually, in x86-64 similar penalties existed such as having the result of a floating point operation being an input to an integer operation, changing the "type" of value held, back when AVX and AVX2 were relatively new, exhibited a penalty that was not negligible. Because the execution units for FP and Integers were different. This changed, nowadays they use the same execution units, as evidenced in the port allocations for the microarchitectures, but I conjecture that "something has to give", I see that operations on 64 bit "scalar" integers and 512 bit `ZMM` values with internal structure of lanes are so different that "one type fits all" execution units would be less effective. We can foresee ever wider SIMD computation, ARM's design has an extension that would allow immense widths of 2048 bits of SIMD, it is called "Scalable Vector Extensions", RISC-V has widening baked into the very design, they call it "Vector Length Agnostic" with its acronym VLA, so that you program in a way that is not specific to a width... Just thinking about the extreme challenge of merely propagating signals in buses of 2048, that the wider they are they also have to be longer, that the speed of signal propagation, the speed of light, the maximum speed in the universe, has been a limiting factor that is ever more limiting; I really think that the round trip of SIMD boolean lane-wise test to mask in SIMD register to GPR and back to SIMD will be more significant, our SWAR library already avoids this. Yet, it is crucial to work with the result of boolean tests in a SIMD way, this is the key to branchless programming techniques, the only way to avoid control dependencies.
Take AVX512, for example, the latest and best of x86_64 SIMD: in AVX512, there are mask registers, of 64 bits, so they can map one bit to every byte-lane of a `ZMM` register. It recognizes the need to prevent computation on certain lanes, the ones that are "masked out" by the mask register, the operation in that lane is a no-op. However, this is not the same thing as having a bit in the lane itself that you can use in computation. To do that, you need to go to the GPRs moving a mask, do the processing of the mask with scalar instructions and back to the `ZMM` registers via a move followed by a complicated sequence of instructions to achieve the equivalent of a missing instruction of scattering the bits of the mask into the lanes of a register, just like in AVX/AVX2. The reason why apparently nobody else has "felt" this pain might be, I speculate, because SIMD is mostly used with floating point numbers. It wouldn't make sense to turn on a bit in lanes.
Our SWAR work demonstrates that there are plentiful opportunities to turn normally serial computation into ad-hoc parallelism, for this we use integers always, and we can really benefit from the lanes themselves having a bit in them, as opposed to predicating each lane by a mask. Our implementation of saturated addition is one example: we detect overflow (carry) as a `BooleanSWAR`, we know that is the most significant bit of each lane, we can copy down that bit by subtracting the MSB as the LSB of each lane; it is true that SIMD ISAs have saturation instructions, the point is that SIMD ISAs are seldom complete, actually not even consistent, so, you might need to modify slightly an instruction with additional processing, and if at any step you need a missing horizontal instruction, you then have to do the round trip SIMD to moving the mask to GPR to then back to SIMD from a mask: generally, a performance crater. |
into ad-hoc parallelism or just...parallelism?
s/actually not even consistent/or even consistent/
Second sentence is extremely run-on.
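The "copy the MSB down" step the footnote mentions, isolated into a sketch for 8-bit lanes: a shift, a subtraction, and an OR, with no excursion to a mask register or GPR (names illustrative):

```cpp
#include <cstdint>

uint64_t spreadMSBToWholeLane(uint64_t booleans) {  // eight 8-bit lanes
    constexpr uint64_t MSBs = 0x8080808080808080ull;
    booleans &= MSBs;               // a BooleanSWAR only carries lane MSBs
    uint64_t lsbs = booleans >> 7;  // each lane's MSB lands on its own LSB
    // 0x80 - 0x01 = 0x7F per flagged lane; OR the MSB back in: 0xFF.
    return booleans | (booleans - lsbs);
}
```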
When we mean "future proof" we mean that the discernible tendencies agree with our approach:
Fundamental physics is impeding further miniaturization of transistors, I speculate that, among other things, the execution units dedicated to SIMD and "scalar" or General Purpose operations would have to be physically separated in the chip, moving data from one side to the other would be ever more expensive. Actually, in x86_64 similar penalties existed such as having the result of a floating point operation being an input to an integer operation, or viceversa. I used to do this: I would have 64 bit or 32 bit lanes of integer processing, and to retrieve the most significant bit, the one that I cared about, I would use `MOVMSKPD` and `MOVMSKPS`, instructions that would extract the sign bit, the MSB of the corresponding "D" double-precision floating point (the 64-bit MSB) or the "S" single precision. I had to use the "floating point" gathering of mask bits because there was no integer equivalent for 32 or 64 bit lane size. The only "integer" operation is `movmskb` (with either/both `v` and `p` prefixes) that gathers the MSB of each byte... for further aggravation, x86_64, did not have 16-bit lane size gathering of MSB...
s/viceversa/vice versa/
I'd switch the ... to just make them periods.
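The workaround recounted above, in intrinsic form: `_mm_movemask_ps` collects the sign bit of each 32-bit lane, so reinterpreting integer lanes as floats borrows the gathering that the integer side lacks (SSE2; function name hypothetical):

```cpp
#include <immintrin.h>

int gatherMSBsOf32BitLanes(__m128i lanes) {
    // Bit k of the result is the MSB of 32-bit lane k; the cast is free,
    // only the register's "type" changes.
    return _mm_movemask_ps(_mm_castsi128_ps(lanes));
}
```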
```
A+B+C+D  B+C+D  C+D  D
The summation is in the most significant lane.
```
The issue with emulating missing SIMD horizontal operations is that you have to combine simple operations that diminish the gain factor of parallelism and require sophisticated mathematics, or what other people would call tricks. |
I'd throw the "or what other people would call tricks" in parenthesis or just drop it altogether.
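The quoted lane diagram corresponds to a prefix sum emulated by one multiplication; a sketch for four 8-bit lanes, assuming no lane sum overflows 8 bits:

```cpp
#include <cstdint>

// Input lanes, least significant first: D, C, B, A.
uint32_t prefixSumLanes(uint32_t x) {
    // Lane k of the product accumulates input lanes 0..k:
    // D, C+D, B+C+D, A+B+C+D -- the total lands in the top lane.
    return x * 0x01010101u;
}
// uint32_t total = prefixSumLanes(x) >> 24;  // extract A+B+C+D
```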
Up to when AVX and AVX2 were relatively new, changing the "type" of value had a non negligible penalty. Because the execution units for FP and Integers were different. This changed, nowadays they use the same execution units, as evidenced in the port allocations for the microarchitectures. Apparently SIMD ALUs could be made that worke well with both scalar and SIMD values; but I conjecture that "something has got to give", I see that operations on 64 bit "scalar" integers and 512 bit `ZMM` values with internal structure of lanes are so different that "one type fits all" execution units would be less effective. We can foresee ever wider SIMD computation, ARM's design has an extension that would allow immense widths of 2048 bits of SIMD, it is called "Scalable Vector Extensions", RISC-V has widening baked into the very design, they call it "Vector Length Agnostic" with its acronym VLA, so that you program in a way that is not specific to a width... Just thinking about the extreme challenge of merely propagating signals in buses of 2048, that the wider they are they also have to be longer, that the speed of signal propagation, the speed of light, the maximum speed in the universe, has been a limiting factor that is ever more limiting; I really think that the round trip of SIMD boolean lane-wise test to mask in SIMD register to GPR and back to SIMD will be more significant, our SWAR library already avoids this. Yet, it is crucial to work with the result of boolean tests in a SIMD way, this is the key to branchless programming techniques, the only way to avoid control dependencies that more than nullify the performance advantage of SIMD parallelism.
s/worke/work/
ellipsis to period after width.
penultimate sentence is seriously run-on.
Yeah, looks good now. Left a bunch of typo / nits/ slight rephrasings, and pointed out two new run on sentences that might work better if broken apart. Otherwise LGTM. |