
Evaluate i64.add_wide plus i64.add_wide3 #34

Open
alexcrichton opened this issue Mar 4, 2025 · 6 comments

Comments

@alexcrichton
Collaborator

I wanted to extract the proposal from here into a separate issue to avoid it getting lost in that thread. Specifically, it seems worthwhile to me to evaluate new addition/subtraction instructions:

i64.add_wide:  [i64, i64]      -> [i64, i64]  ;; a + b     -> (low, high=carry)
i64.add3_wide: [i64, i64, i64] -> [i64, i64]  ;; a + b + c -> (low, high=carry)
i64.sub_wide:  [i64, i64]      -> [i64, i64]  ;; a - b     -> (difference, borrow)
i64.sub2_wide: [i64, i64, i64] -> [i64, i64]  ;; a - b - c -> (difference, borrow)

(names bikesheddable over time of course)
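
For concreteness, here's a sketch of (my reading of) the intended semantics, written in Rust; nothing here is normative, and the function names are made up to match the signatures above:

// Reference semantics for the proposed instructions, sketched in Rust.
// Each returns (low 64 bits, carry/borrow) as two 64-bit results.
fn i64_add_wide(a: u64, b: u64) -> (u64, u64) {
    let (low, carry) = a.overflowing_add(b);
    (low, carry as u64)
}

fn i64_add3_wide(a: u64, b: u64, c: u64) -> (u64, u64) {
    // Two chained adds; the total carry-out is 0, 1, or 2.
    let (t, c1) = a.overflowing_add(b);
    let (low, c2) = t.overflowing_add(c);
    (low, c1 as u64 + c2 as u64)
}

fn i64_sub_wide(a: u64, b: u64) -> (u64, u64) {
    let (diff, borrow) = a.overflowing_sub(b);
    (diff, borrow as u64)
}

fn i64_sub2_wide(a: u64, b: u64, c: u64) -> (u64, u64) {
    // Two chained subtracts; the total borrow-out is 0, 1, or 2.
    let (t, b1) = a.overflowing_sub(b);
    let (diff, b2) = t.overflowing_sub(c);
    (diff, b1 as u64 + b2 as u64)
}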

My thinking for evaluating this would be:

  • Implement them in a fork of wasm-tools (not too bad).
  • Implement them in Wasmtime (probably not too bad).
  • Implement them in LLVM (pretty nontrivial for me) under a new flag temporarily.
  • Evaluate performance of instructions before+after, only before, only after, etc.

My guess is that much of the performance story will depend on how good the LLVM implementation is. To be determined!

@jakobkummerow

Back in the other thread, we discussed the feasibility of expressing an actual i128 addition with an add_wide+add3_wide pair at the Wasm level, and then having the engine recognize that pattern and internally replace it with an add128.

I wanted to find out whether that engine-side optimization is as feasible as I thought it should be, so I've toyed around with an implementation a bit, with mixed results.

I was able to implement a version of it in our baseline compiler (!) without too much effort. It detects the specific sequence add_wide, local.get, local.get, add3_wide, drop, and replaces that with an add128. It works, and generates exactly the code we'd want, but it's admittedly limited in its applicability: e.g. if we weren't looking at a standalone "add128" function, but rather the result of inlining that function into its caller, then it likely wouldn't trigger any more. Also, it's quite debatable whether a baseline compiler should engage in this kind of pattern-matching in the first place; I consider this more of an experiment than a plan for actual production code.
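
For illustration, here's a minimal Rust sketch of that peephole (hypothetical opcode names; not the actual baseline-compiler code). Note that the two local.gets survive the rewrite, since they now feed add128's upper operands, which keeps the operand stack consistent:

#[derive(Clone, Copy)]
enum Op { AddWide, Add3Wide, Add128, LocalGet(u32), Drop }

// Does this five-instruction window match the fusible pattern?
fn window_matches(w: &[Op]) -> Option<(u32, u32)> {
    match w {
        &[Op::AddWide, Op::LocalGet(c), Op::LocalGet(d), Op::Add3Wide, Op::Drop] => Some((c, d)),
        _ => None,
    }
}

// Stack effect before: [a, b] add_wide ; get c ; get d ; add3_wide ; drop -> [x, y]
// Stack effect after:  [a, b] get c ; get d ; add128                      -> [x, y]
fn fuse_add128(ops: &mut Vec<Op>) {
    let mut i = 0;
    while i + 5 <= ops.len() {
        if let Some((c, d)) = window_matches(&ops[i..i + 5]) {
            ops.splice(i..i + 5, [Op::LocalGet(c), Op::LocalGet(d), Op::Add128]);
        }
        i += 1;
    }
}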

Thinking about how to do it properly in the optimizing compiler, I understand better now what you mean when you say it's hard to optimize this. Fundamentally, we want to replace this:

          A    B
          ↓    ↓
        (add_wide)
          ↓    ↓    C    D
          X  carry  |    |
               ↓    ↓    ↓
               (add3_wide)
                 ↓     ↓
                 Y  (ignored)

with this:

           A B C D
           ↓ ↓ ↓ ↓
          (add128)
            ↓  ↓
            X  Y

and that's tricky in particular because there's no guarantee, in general, that C and D are available before X is used. In fact, it's quite conceivable that X might be a store to memory and is placed before C and D, which are loads from memory. While it would be possible for a sufficiently smart compiler to realize that these memory operations aren't aliasing, and then reorder them as needed, that's definitely not easy. The key benefit that add128 offers for this use case is the guarantee that all four inputs are available at the same time.
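
To make that concrete, here's the hazard sketched in Rust rather than wasm (hypothetical code; in Rust the borrow checker guarantees out doesn't alias a or b, but a wasm engine staring at raw linear-memory loads and stores has no such guarantee and would have to prove non-aliasing before reordering):

// X is consumed (stored to memory) before C and D are even loaded, so the
// engine cannot simply sink the store or hoist the loads to form an add128.
fn add128_limbs(out: &mut [u64; 2], a: &[u64; 2], b: &[u64; 2]) {
    let (x, carry) = a[0].overflowing_add(b[0]); // add_wide
    out[0] = x;                                  // X used here...
    let c = a[1];                                // ...before C and D exist
    let d = b[1];
    let (t, _) = c.overflowing_add(d);           // add3_wide (carry-out ignored)
    let (y, _) = t.overflowing_add(carry as u64);
    out[1] = y;
}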

In light of these insights, my current understanding is:

Use case "bigint, 64-bit chunks with propagated carry":

  • this is the expected majority use case
  • add_wide + add3_wide is the perfect fit for the desired functionality (but requires some pattern matching in toolchains)
  • add128 is likely easier to emit for toolchains (because human-written helper functions emulating add3_wide use it), and is easy enough to simplify for engines. It's a bit wasteful on the wire bytes though.

Use case "i128 addition, no overflow to 129th bit":

  • unclear if this is relevant in practice at all (chances are no: the vast majority of existing code probably either targets 64-bit integers because they fit in registers and are hence fast, or goes for arbitrary-width bigints right away)
  • add128 is the perfect fit
  • using add_wide and add3_wide instead of add128 is no problem as far as functionality goes (and easy to emit for module producers), but highly nontrivial for engines to generate optimal code for. (If engines don't want to go to that effort, the difference in generated code isn't huge though, so given the lack of known practical relevance it may be totally fine.)

So, if we can have only one set of operations, I'm warming up to the idea of add128 being that one. While it's not perfect for the primary use case, it's close enough, and efficiently supports the other use case too.
If the spec doesn't have to be all that frugal and we can have add_wide and add3_wide too, then that would allow more efficient support for the expected primary use case.
Having only add_wide and add3_wide (and no add128), as I previously suggested, would serve the expected primary use case well, but would admittedly not work so well for the "i128" use case.
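
To make the toolchain-side point above concrete, here's the shape of carry-propagating bigint addition as it's commonly hand-written today (a sketch): the u128 intermediate is exactly what add128 maps to directly, and what a pattern-matching toolchain would have to recognize in order to emit add_wide/add3_wide instead:

// Limb-by-limb bigint addition. The 128-bit intermediate holds
// 64 + 64 bits plus a carry, i.e. at most 65 significant bits.
fn bigint_add(out: &mut [u64], a: &[u64], b: &[u64]) {
    let mut carry = 0u64;
    for ((o, &x), &y) in out.iter_mut().zip(a).zip(b) {
        let wide = x as u128 + y as u128 + carry as u128;
        *o = wide as u64;            // low 64 bits
        carry = (wide >> 64) as u64; // high bits become the next carry
    }
}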

@alexcrichton
Collaborator Author

Thanks for experimenting with this! Personally I've been wary of requiring a very specific structure of wasm to optimize well, in the sense that AFAIK engines don't require that today. I wouldn't be opposed to it being necessary, but it sounds like you're roughly thinking that add128 might work well.

Nevertheless, what I hope to do when I get a chance is to game out what it would take in LLVM to pattern-match inputs and generate add_wide and add3_wide. I'm no LLVM expert, though, so it'll take a while. My hope is that once the instructions are in place, benchmarks can be run across the matrix of instruction combinations to get an idea of the concrete tradeoffs, if there are any.

@alexcrichton
Collaborator Author

I'm going to write some notes to my future self.

To the best of my knowledge, LLVM has syntax, via *.td files, for pattern-matching and easily converting from LLVM IR to machine-dependent IR. Also to the best of my knowledge, none of this works when the instructions in question have multiple outputs. As far as I'm aware, all lowerings in all of LLVM's backends (not just wasm but also x86, for example) that produce multi-result instructions are managed with handwritten C++ code.

Thus, to instruction-select add_wide and add3_wide, I think it's going to require a gob of handwritten C++. Basically, I probably shouldn't expect to be able to write some *.td syntax and "just" use these new instructions. Instead it'll be a wad of C++, and I don't immediately know how that will be done. The wide-arithmetic support currently in LLVM drew inspiration from other backends and various operations, and I figured out how to slice it for those ops, but that involved no pattern matching; it was purely "if you see a 128-bit add, go do this wasm thing instead".

@jakobkummerow

Your future self may find it instructive to find out (I don't know!) how mul_with_carry is being transformed into a Wasm i64.mul_wide_u (which I guess is working already?), and follow that example for transforming adc into a Wasm i64.add3_wide. I suppose it won't work exactly the same way, but I do presume it'll be inspirational :-)

@alexcrichton
Collaborator Author

Oh, that one's actually "super easy": LLVM already had a lowering "node" corresponding exactly to i64.mul_wide_*, which is what I implemented here. That means the support there was effectively "take this exact thing LLVM knows and translate it to the exact same thing in wasm". It's multi-result, so it requires C++ goop, but it's a simple translation with no pattern matching. (128-bit addition/subtraction fall into this category too.)
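
For reference, here's a sketch of the kind of source that exercises that path (assuming the frontend widens the multiply through 128 bits):

// A full 64x64 -> 128 multiply split into halves; LLVM represents this
// with its wide-multiply node, which the wasm backend can map straight
// to i64.mul_wide_u per the lowering described above.
fn mul_wide_u(a: u64, b: u64) -> (u64, u64) {
    let wide = (a as u128) * (b as u128);
    (wide as u64, (wide >> 64) as u64) // (low, high)
}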

I do remember seeing various legalizations/lowerings in x86/aarch64 for various things having to do with overflow flags, and I might be able to adapt those and their pattern-matching to add_wide and add3_wide.

@alexcrichton
Collaborator Author

I got some inspiration and started on this:

haven't evaluated/benchmarked anything yet though
