Will simd be introduced? #60
I took a look at portable SIMD. While it does define the packed SIMD types, some of the instructions used by the algs presented in the paper are target-feature specific (such as the SSE4.2 `cmpistrm`).
Another consideration is that the vectorized versions (at least for the two I've looked at, OR and AND) are not safe to use in-place. A nice feature we have, one that Rust makes trivial, is the ability to take ownership of another container's allocation. Vectorization and container recycling are seemingly at odds: when performing an AND, should we use the faster vectorized alg that must allocate, or recycle the allocation and go with the slower one? Or is there another option? Thinking out loud: we could keep a thread-local buffer around, always use the vectorized version (when it's possible, and given it's the best choice), then swap the pointers. The output vector would then never alias the same memory as either of the two input vectors.
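The thread-local idea above could be sketched like this in safe Rust. This is a sketch with a stand-in scalar merge kernel; `and_assign_via_scratch` and the kernel body are hypothetical names, not the actual crate API:

```rust
use std::cell::RefCell;

thread_local! {
    // Hypothetical per-thread scratch buffer, reused across ops.
    static SCRATCH: RefCell<Vec<u16>> = RefCell::new(Vec::new());
}

/// Sketch: AND two sorted u16 arrays into the thread-local scratch buffer,
/// then swap so `lhs` owns the result and the scratch keeps `lhs`'s old
/// allocation for the next call. The real code would call the vectorized
/// kernel; the key point is that `out` never aliases `lhs` or `rhs`.
fn and_assign_via_scratch(lhs: &mut Vec<u16>, rhs: &[u16]) {
    SCRATCH.with(|buf| {
        let mut out = buf.borrow_mut();
        out.clear();
        let (mut i, mut j) = (0, 0);
        while i < lhs.len() && j < rhs.len() {
            match lhs[i].cmp(&rhs[j]) {
                std::cmp::Ordering::Less => i += 1,
                std::cmp::Ordering::Greater => j += 1,
                std::cmp::Ordering::Equal => {
                    out.push(lhs[i]);
                    i += 1;
                    j += 1;
                }
            }
        }
        // Swap pointers: no copy back, and no aliasing at any point.
        std::mem::swap(lhs, &mut *out);
    });
}
```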
Ported the vectorized intersection alg from CRoaring, and tried using it for fun. I haven't convinced myself it's a good idea, but it sure is fast. Given the SIMD alg requires specific x86 features, I'm not at all sure how to go about testing. Curious what you think @Kerollmops, @Nemo157
The SIMD stuff is in my very messy, very experimental branch.
Wow, is it me or is it way faster than the C equivalent? 🚤
This is one of the reasons why I am not sure either: do we really want to bring raw SIMD algorithms into this crate? Can't we rely on the soon-to-be-released std SIMD module?
That's impressive, the speed is quite good and, if I'm not wrong, even faster than C, but... have you seen the complexity of it? The amount of code we will need to maintain just for array-array operations, and on the x86 platform only. Do we want to maintain these 1095 lines of raw SIMD functions? Maybe that's fair and we effectively want/need that 😄
AFAICT the std simd crate will only provide data type definitions and ops to load, store, mask, and swizzle. I called out the lack of some x86-specific instructions (like `cmpistrm`) above.
Have you tried just always allocating the buffer instead of using a thread-local variable? I would be glad if we could avoid using that; I am sure that malloc is fast enough to generate an uninitialized memory area for us.
Maybe we can keep the work you have done and make sure that it doesn't impact compilation on other platforms, e.g. ARM; I am sure that's what you thought of! The thing that makes MeiliSearch use RoaringBitmap and not the CRoaring crate is that it compiles on all the platforms I tried. I would at least like it to continue in this way. A good way to be sure that the codebase is portable could be to compile it for other targets in CI.
I don't think this is in the "spirit" of recycling. If we're going to alloc/free a bunch of temporary vecs, what makes it different from:

```rust
let a = RoaringBitmap::arbitrary();
let b = RoaringBitmap::arbitrary();
let a = a & &b;
```
Also: memory fragmentation.
I suppose in the case of intersection we could use a stack buffer:

```rust
pub fn intersect_assign_vector_stack(lhs: &mut Vec<u16>, rhs: &[u16]) {
    // An array container holds at most 4096 values, and an intersection
    // can never be larger than either input, so a fixed stack buffer fits.
    let mut buf: [u16; 4096] = [0; 4096];
    let len = unsafe {
        intersect_vector16(
            lhs.as_ptr(),
            lhs.len(),
            rhs.as_ptr(),
            rhs.len(),
            buf.as_mut_ptr(),
        )
    };
    // len <= lhs.len(), so this copy back into lhs is in bounds.
    lhs.as_mut_slice()[..len].copy_from_slice(&buf[..len]);
    lhs.truncate(len);
}
```
No. It's dropped after use. It won't fragment memory. 😄
In my informal testing, comparing runtime: malloc < stack+memcpy < thread local. Thinking out loud some more: we don't generally operate on containers in isolation, so it could be a worthwhile future optimization to "weave" an array container through the op over many containers, swapping them as we go. :) Also, I wonder why CRoaring isn't doing this.
To be fair, most of it is LUTs. Also, this is only for 2 of the 4 ops.
Yes; although this experimental code is wildly unsafe and assumes the platform and required features, it's not too much effort to compile it conditionally. With a medium level of effort, we could also add runtime detection; otherwise the required target features have to be specified at compile time.
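For runtime detection, stable Rust's `is_x86_feature_detected!` macro can gate the unsafe kernel behind a CPU check, falling back to scalar code everywhere else. A sketch, where the function names are illustrative and the actual SIMD call site is left as a comment:

```rust
/// Sketch of runtime dispatch: use the vectorized kernel only when the CPU
/// actually supports the required features; otherwise fall back to scalar.
fn intersect_dispatch(lhs: &[u16], rhs: &[u16]) -> Vec<u16> {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("sse4.2") {
            // A real implementation would call the unsafe SIMD kernel here,
            // e.g. a safe wrapper around intersect_vector16.
        }
    }
    intersect_scalar(lhs, rhs)
}

/// Plain merge-style intersection of two sorted u16 slices.
fn intersect_scalar(lhs: &[u16], rhs: &[u16]) -> Vec<u16> {
    let mut out = Vec::new();
    let (mut i, mut j) = (0, 0);
    while i < lhs.len() && j < rhs.len() {
        if lhs[i] < rhs[j] {
            i += 1;
        } else if lhs[i] > rhs[j] {
            j += 1;
        } else {
            out.push(lhs[i]);
            i += 1;
            j += 1;
        }
    }
    out
}
```

The `cfg` guard keeps the macro out of non-x86 builds entirely, so the same source compiles on ARM and wasm.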
Do you have some benchmark graphs, just to see the impact of your stack+memcpy technique compared to the thread-local one? About the thread-local technique: I am worried about platform requirements, i.e. wasm, and this is why I would prefer we use simple and common techniques to make sure the crate compiles everywhere. As you can see in the documentation of the `thread_local!` macro.
I like that idea, the only downside of it is code complexity.
What do you mean by LUTs, here?
Hum... you are right! If we can find a way to abort the operation when it detects that there are too many values, we can keep a small stack size and continue with a bitmap, indeed. But that would be a brand-new algorithm to create. Maybe we can start by using this approach.
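That abort-and-fall-back idea could be sketched like this, assuming the usual roaring invariant that an array container holds at most 4096 values (the function name is hypothetical):

```rust
const ARRAY_MAX: usize = 4096; // array containers hold at most 4096 values

/// Sketch: union two sorted u16 arrays into a fixed stack buffer.
/// Returns Some(result) if it fits, or None to signal the caller to
/// redo the op with a bitmap container instead.
fn try_union_on_stack(lhs: &[u16], rhs: &[u16]) -> Option<Vec<u16>> {
    let mut buf = [0u16; ARRAY_MAX];
    let mut len = 0;
    let (mut i, mut j) = (0, 0);
    loop {
        // Pick the next value of the merged sequence, de-duplicating equals.
        let next = match (lhs.get(i), rhs.get(j)) {
            (Some(&a), Some(&b)) if a < b => { i += 1; a }
            (Some(&a), Some(&b)) if a > b => { j += 1; b }
            (Some(&a), Some(_)) => { i += 1; j += 1; a }
            (Some(&a), None) => { i += 1; a }
            (None, Some(&b)) => { j += 1; b }
            (None, None) => break,
        };
        if len == ARRAY_MAX {
            return None; // too many values: abort, continue with a bitmap
        }
        buf[len] = next;
        len += 1;
    }
    Some(buf[..len].to_vec())
}
```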
< being better, meaning less runtime. In other words: you are correct, malloc/free is faster! I should have known better than to try to outsmart jemalloc.
We would effectively be putting bounds checks back in, albeit once per loop iteration rather than on every memory access. I'm not convinced it would be worth the complexity for the small incremental gain.
Look-up tables. It's about 325 LoC for the union and intersect algs, and ~700 lines of LUTs.
I looked into what it would take to translate the vectorized ops (Or, Xor) to std simd. There are some instructions that are not supported by it. [1] Here is an example of one. Once we have these basic building blocks it would be relatively easy to translate CRoaring's And and Sub.
Summary
Discussion: Would we like to start a discussion re: whether or not we want to add SIMD to the project, plus which set of SIMD primitives we should be coding against? Or should we wait until we have a better understanding of what's required for And and Sub?
Got it, implemented. It should be easy now to replace the rest of the alg with std simd. The rest is just the "machinery" that does the looping, loads, and stores.
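That looping/load/store "machinery" can be sketched in stable Rust without the actual SIMD types: process full lane-width chunks, then finish the tail with scalar code. The lane count and the kernel here are placeholders, not the crate's real op:

```rust
const LANES: usize = 8; // e.g. a 128-bit vector of u16

/// Sketch of the loop machinery: a stand-in kernel is applied to each full
/// LANES-wide chunk, and the remainder is handled by a scalar tail loop.
fn add_one_all(input: &[u16]) -> Vec<u16> {
    let mut out = Vec::with_capacity(input.len());
    let mut chunks = input.chunks_exact(LANES);
    for chunk in &mut chunks {
        // A real implementation would do a SIMD load, the op, and a store.
        out.extend(chunk.iter().map(|x| x.wrapping_add(1)));
    }
    // Scalar tail for the last len % LANES elements.
    out.extend(chunks.remainder().iter().map(|x| x.wrapping_add(1)));
    out
}
```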
Wow! You are such a beast! That's amazing 🎉 🏎️ According to the benchmarks you sent, we are even with the C and
Does the
I think we, and more specifically you, found the best compromise: using std simd.
It's actually slightly slower than the x86 intrinsics. I think it's worth it, given it runs on x86, ARM, and WASM.
Yes. No thread local in this bench.
Rough overview:
What's fun with computers is the different results you can get when benchmarking on different versions of a CPU. Here is what I see when benchmarking on my iMac 2019 (Intel Core i9, 8 cores, 3.6 GHz). I don't know why, but the C version doesn't seem to find the desired extensions to compile properly. I set the
This is great! This is so good!
I looked at your code, that's indeed very good that with
My translation workflow goes something like this
TL;DR: Be aware that in this branch, for any given commit, "cur" might be anything, as it calls the default op, which I might have pointed at any implementation for testing. I've not gated the x86 unsafe code with compile-time checks; your compiler might emit any garbage. I hope it does not eat your lunch. :)
Super interesting that the Rust beats C for scalar ARM. "opt_unsafe" is a pretty straightforward port of the CRoaring scalar code. My hypothesis: we have those two different implementations of intersect_skewed, and Rust can statically guarantee to LLVM that the mutable ref will not alias the two immutable refs, which enables LLVM to do optimizations that it can't with C.
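For reference, a skewed intersect can be sketched like this (the real roaring-rs/CRoaring code differs). The point is the signature: the output is an exclusive `&mut` borrow that the compiler can prove never aliases the two shared inputs, which is exactly the noalias guarantee C pointers lack:

```rust
/// Sketch of a skewed intersect: `small` is much shorter than `large`,
/// so each element of `small` is binary-searched in the sorted `large`.
/// `out` is an exclusive borrow, so rustc can tell LLVM it never aliases
/// `small` or `large`.
fn intersect_skewed(small: &[u16], large: &[u16], out: &mut Vec<u16>) {
    out.clear();
    for &value in small {
        if large.binary_search(&value).is_ok() {
            out.push(value);
        }
    }
}
```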
That sounds about right. My high-level plan:
What I'm not at all sure of is how to integrate testing and CI. Qemu? 🤔
Also undecided on bounds checks: the classic safety/performance/maintainability things in contention. My intuition is to leave bounds checks in; later we can profile, look at the call stack, and remove them from only the hottest of hot code paths. Or perhaps we could come up with ways to express the algs in terms of iterators and bury the unsafe code in them. I suspect there is still some lower-hanging fruit to optimize without sacrificing safety. I'll open some more issues. :)
I think that we should indeed use as much safe code as possible. I myself have fallen into the trap of trying to use too much unsafe and losing performance because of it; you can read the small story of my implementing it. Sometimes, even putting
I looked at how the RoaringBitmap/CRoaring library was testing this in their CI, and it seems like they test it on a Windows ARM machine. Is it a custom one? I didn't check further. Maybe @lemire can help us on this subject?
OK. Done with the initial impl + added proptests for xor. Going to run proptests on a generator that only creates array stores for a few hours. 🤞 😬 🤞
I forgot to mention, I stripped out all the unsafe code around SIMD (prior to test and bench). It survived 4 hours of proptest without any failures or crashes. Here is the full report of the op benchmarks. Note that unlike the Array-Array only benchmarks I shared earlier, this is for the full dataset and includes Array-Bitmap and Bitmap-Bitmap operations. It's surprisingly comparable to CRoaring in my environment. For reference I've got an Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz on a MacBook Pro. I'm going to clean it all up and open a PR in the next few days.
I just forgot about that 🤦 but the current CI only checks that the program works on the stable channel of Rust; however, you introduced the use of a nightly feature. I would like to be able to keep compiling on stable, gating the SIMD code behind a feature:

```rust
// At the top of the lib.rs
#![cfg_attr(feature = "simd", feature(portable_simd))]
```
Yup, it's impressive how hard it is now to see the performance difference between the C and Rust code! 🏎️ You've done such an impressive amount of work! Could you send me an email, please? I can't find a contact channel on your profile.
Thanks. Sent you an email. =) I made a few discoveries today. Good news first:
That means we don't need to have any fallbacks to intrinsics or scalar code! Bad news: the generator I used for the proptests would generate some roaring bitmaps with disjoint container sets.
@Kerollmops Could you run tests and bench on your hardware on 0027530
Hey @saik0, All the tests are passing on my M1 🎉 I didn't run proptest though, but I can. I just don't remember how 😄 Here are the benchmarks on my MacBook Pro M1, which I carefully plugged in to make sure performance is good.
Do you mean that no more functions are using scalar implementations?
What do you mean by that? Do you mean that we can simply compile it on ARM and it will just ignore that part of the code and only compile the aarch64 part?
They are part of the normal suite. 🎉 You may wish to set
I do! The jump table idea is a truly terrible one, but it does work. We will need to keep it until LLVM exposes a dynamic swizzle and std simd updates to use it. Then I can atone for my sin 😅 Briefly: to call it, a stack frame must be pushed and popped inside the hot vectorized loop, and inlining it would be huge instruction bloat (on my HW it's about 5% faster if we force inline anyway).
Yes. 🤞 The x86 stuff won't make it into a PR, but it is useful for comparing std simd directly to using x86 intrinsics. There are too many other variables when comparing it to C.
Wow. Not sure what's going on with
I am also impressed, especially because the kind of flamegraph that we are trying to optimize relies a lot on the
Do you mean that the C's codebase hasn't been ported to ARM/aarch64? Do you think that different algorithms could be used to fit the ARM instructions extensions and that that's the reason they didn't port it to this arch, yet?
I just ran them with
Yes, Arm has a very different suite of LUT-style operations. It doesn't have quite the same specialized string instructions (which can wind up being inferior to more "straightforward" SIMD implementations if used incorrectly, anyway).
As far as "use intrinsics or not" goes, though: there are conversions to and from the
38 comments later: @TheLudlows — yes, SIMD is introduced. @Kerollmops, this can be closed.
The work you have done on that side of the library is awesome! It brings the library to another level 🚀 Thanks to @workingjubilee too for your help on this subject, hope this project helps you prioritize features for the
151: Move array_store module to a directory r=Kerollmops a=saik0 This is to make room in the `array_store` module for incoming submodules that specialize array-array ops. See RoaringBitmap#60 & RoaringBitmap#133 Co-authored-by: saik0 <[email protected]>
Hello, will SIMD be introduced?