Evaluate i64.add_wide plus i64.add_wide3
#34
Comments
Back in the other thread, we discussed the feasibility of expressing an actual i128 addition with an
I wanted to find out whether that engine-side optimization is as feasible as I thought it should be, so I've toyed around with an implementation a bit, with mixed results.
I was able to implement a version of it in our baseline compiler (!) without too much effort. It detects the specific sequence
Thinking about how to do it properly in the optimizing compiler, I understand better now what you mean when you say it's hard to optimize this. Fundamentally, we want to replace this:
with this:
and that's tricky in particular because there's no guarantee, in general, that C and D are available before X is used. In fact, it's quite conceivable that X might be a store to memory and is placed before C and D, which are loads from memory. While it would be possible for a sufficiently smart compiler to realize that these memory operations aren't aliasing, and then reorder them as needed, that's definitely not easy. The key benefit that
In light of these insights, my current understanding summarizes as:
Use case "bigint, 64-bit chunks with propagated carry":
Use case "i128 addition, no overflow to 129th bit":
So, if we can have only one set of operations, I'm warming up to the idea of
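The two use cases named above can be sketched in Python. This is only a model under assumptions: the helper names `add_wide` and `add_wide3`, and their signatures (two or three unsigned 64-bit inputs, returning the full sum as a `(low, high)` pair), are guesses at the semantics being discussed, not the proposal's definition. `bigint_add` and `i128_add` are hypothetical names introduced for illustration.

```python
MASK64 = (1 << 64) - 1

def add_wide(a, b):
    # Assumed semantics: full sum of two u64 values as (low, high); high is 0 or 1.
    total = a + b
    return total & MASK64, total >> 64

def add_wide3(a, b, c):
    # Assumed semantics: full sum of three u64 values as (low, high); high is 0..2.
    total = a + b + c
    return total & MASK64, total >> 64

def bigint_add(xs, ys):
    # Use case "bigint, 64-bit chunks with propagated carry":
    # xs and ys are equal-length, little-endian lists of 64-bit limbs.
    carry = 0
    out = []
    for x, y in zip(xs, ys):
        # The carry-in is 0 or 1, so the carry-out is also 0 or 1.
        low, carry = add_wide3(x, y, carry)
        out.append(low)
    return out, carry

def i128_add(a_lo, a_hi, b_lo, b_hi):
    # Use case "i128 addition, no overflow to 129th bit":
    # the carry out of the high half is simply dropped (wrapping add).
    lo, carry = add_wide(a_lo, b_lo)
    hi, _ = add_wide3(a_hi, b_hi, carry)
    return lo, hi
```

In this model the bigint loop needs the carry-out of every step, while the i128 case only needs the carry between the low and high halves, which is where the two use cases differ.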
Thanks for experimenting with this! Personally I've been wary of requiring a very specific structure of wasm to optimize well, in the sense that AFAIK that's not the case today. I wouldn't be opposed to that if necessary, but it sounds like you're roughly thinking that
Nevertheless what I hope to do when I get a chance is to game out what it would take in LLVM to pattern-match inputs and generate
I'm going to write some notes to my future self. To the best of my knowledge LLVM has syntax for pattern-matching and easily converting from LLVM IR to machine-dependent IR through
Thus to instruction-select
Your future self may find it instructive to find out (I don't know!) how
Oh that one's actually "super easy". LLVM already had a lowering "node" corresponding to exactly
I do remember seeing various legalizations/lowerings in x86/aarch64 for various things having to do with overflow flags, and I might be able to adapt those and their pattern-matching to add_wide and add3_wide.
I got some inspiration and started on this: haven't evaluated/benchmarked anything yet, though.
I wanted to extract the proposal from here to a separate issue to avoid getting lost too much in that thread. Specifically it seems worthwhile to me to evaluate new addition/subtraction instructions:
(names bikesheddable over time of course)
My thinking for evaluating this would be:
My guess is that much of the performance here will depend on how good the LLVM implementation is. To be determined!
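As a point of comparison for such an evaluation, here is a minimal Python model of a 128-bit addition written with only plain 64-bit operations, where the carry is recovered through an unsigned comparison. The helpers `i64_add` and `i64_lt_u` are stand-ins that mimic the existing wasm instructions of the same names; `i128_add_baseline` is a hypothetical name, and whether this is the exact sequence toolchains emit today is an assumption.

```python
MASK64 = (1 << 64) - 1

def i64_add(a, b):
    # Wrapping 64-bit addition, modeling wasm's i64.add.
    return (a + b) & MASK64

def i64_lt_u(a, b):
    # Unsigned less-than returning 0 or 1, modeling wasm's i64.lt_u.
    return 1 if a < b else 0

def i128_add_baseline(a_lo, a_hi, b_lo, b_hi):
    # The low half wraps exactly when the result is smaller than an operand.
    lo = i64_add(a_lo, b_lo)
    carry = i64_lt_u(lo, a_lo)
    # The high half absorbs the carry; overflow past bit 127 is discarded.
    hi = i64_add(i64_add(a_hi, b_hi), carry)
    return lo, hi
```

The compare-and-add step for the carry is the part a wide-add instruction would make explicit, so the engine would not have to rediscover it.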