SIMD-enabled utf-8 validation #68455
Comments
This does not do any validation; it just calls to …
@CryZe ah, my bad. Yeah I was confused about the exact output, so for good measure I also copied over std's algorithm into godbolt (third link) to see what would happen. Thanks for clarifying what's actually going on!
UTF-8 validation is hopefully not where the compiler spends its time :) However, I could imagine this having some impact on "smallest possible" compile times (e.g., UI tests, hello world). My recommendation is to replace the algorithm in core::str::from_utf8 (or wherever it is in core) with direct use of AVX2 or some similar instruction set, and we can then run that by perf.rust-lang.org as a loose benchmark. That's likely not tenable in reality (we would need to conditionally, likely at runtime, gate the use of SIMD instructions), but I believe it would give the best possible performance wins, since it would apply to all uses of from_utf8.
Here's pretty much a 1:1 port: …
Going to add the benchmarks from isutf8 (run on my machine): …
I re-read some of the Lemire algorithm and there are some key differences which might make it not that suitable for general string validation. The two key points are:
I think ™️ that might make them slower for small payloads.
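To make that tradeoff concrete, a common mitigation is a length-based cutover between a scalar and a SIMD path. A minimal sketch — the threshold value and the function names here are illustrative assumptions, not taken from any of the crates discussed:

```rust
/// Illustrative cutover point; real code would pick this via benchmarking.
const SIMD_CUTOVER: usize = 64;

fn validate_utf8(input: &[u8]) -> bool {
    if input.len() < SIMD_CUTOVER {
        // Short inputs: the scalar validator has less fixed setup cost.
        std::str::from_utf8(input).is_ok()
    } else {
        // Long inputs: amortize the SIMD setup over many bytes.
        validate_utf8_simd(input)
    }
}

// Stand-in for a real SIMD validator (an assumption, not an existing API).
fn validate_utf8_simd(input: &[u8]) -> bool {
    std::str::from_utf8(input).is_ok()
}
```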
I'm at work now, so I will read this thread later in more detail, but here are some things I want to say:
Indeed. I created the crate originally to be included in the Rust internals. I still feel that some of the algorithms need a little bit of refactoring to make them more idiomatic before being included. I also want to add @zwegner's algorithm (https://github.com/zwegner/faster-utf8-validator), which seems to be the fastest algorithm invented to date.
If the build fails, can you open an issue? :) I would appreciate that.
Also: You misspelled my name ;)
Is that the AVX version or something? The SSE version just allocates 16 bytes on the stack for the remaining <16 bytes.
I don't even know what this means. Are we talking about this one?

```c
struct processed_utf_bytes {
    __m128i rawbytes;
    __m128i high_nibbles;
    __m128i carried_continuations;
};
```

This one gets completely optimized away.
Sorry, a bit tired, bit not byte :)
Good point.
Since this would have to go into core, can core even use runtime checks for target features yet? (It seems we need at least SSE 4.1.)
In reply to @Licenser:
I have thought about this and we have a few options:
We would need to think more about this.
It can be compile-time only for now. Those who self-compile (such as myself) would see an improvement.
Hey, thanks for the shout out. I want to note that I have made several improvements to my code that aren't present in the master branch (but are in the … branch).

When not memory-bandwidth bound, my algorithm is now over 2x faster (in my tests, on AVX2) than the one in Daniel Lemire's original repo (the one described in the blog post), and my SSE4 path is faster than the AVX2 path of that repo. The algorithm used in simdjson has made some improvements, but last I checked I think my algorithm is still faster.

I still need to finish writing up documentation for the new algorithm. The code's definitely more hairy, from dealing with more architectures (AVX-512 and NEON), handling pointer alignment, generating lookup tables with a Python script, etc. But I'm happy to help out if anyone wants to use it/port it.
@Licenser Are you still working on this?
I ported the validator (https://github.com/simd-lite/faster-utf8-validator-rs), but from what I understand this all falls apart because the stdlib can't use CPU-native features under the hood?
It would need to be gated under runtime CPUID checks, or left disabled by default and gated under a compile-time flag. ifuncs or similar mechanisms might also be an option on certain platforms, but they're not very portable and have obscure and hard-to-satisfy constraints. Manually emulating them with function pointers initialized on first call might have more overhead, not sure.
My understanding is that benchmarks can be misleading for AVX (and possibly SSE) in general purpose code, because:
Curious what others think about the suitability. Does it make sense only for sufficiently large strings?
This article goes into detail on when it makes sense to use AVX (AVX-512 especially). The most relevant parts:
Some older CPUs downclock all cores when any of them uses AVX2 instructions, but on newer ones the effects are mostly limited to the core running them. Also, the SIMD instructions used for string validation would fall into the "light" category, as they don't involve the floating-point unit. As @bprosnitz mentioned, microbenchmarks should be taken with a grain of salt, but they certainly imply there is something to be gained from an accelerated validator.
Note that we have an updated validator called lookup 4... It is going to be really hard to beat.
Firstly, let us set aside AVX-512 and "heavy" (numerical) AVX2 instructions. They have their uses (e.g., in machine learning, simulation), but that's probably not what you have in mind. This being said...

Regarding power usage, it is generally true that faster code is code that uses less energy. So if you can multiply your speed using NEON, SSE, or AVX, go ahead; you'll come out on top. It is a bit like being concerned with climate change and observing that buses and trains use more energy than cars. They use more energy in total, but less energy per work done. So you have to hold the work constant if you are going to make comparisons. Does it take more energy to do 4 additions, or to use one instruction that does 4 additions at once? SIMD instructions are the public transportation of computing: they are green and should be used as much as possible. (Again, I am setting aside AVX-512 and numerical AVX2 instructions, which are more controversial.)

Regarding the fear that SIMD instructions are somehow exotic and rare, and that if you ever use them you will trigger a chain reaction of slowness... You are using AVX all the time. Read this commit, where the committer identified that the hot function in his benchmark was __memmove_avx_unaligned_erms; you can bet that this function is AVX-based. The Golang runtime uses AVX, glibc uses AVX, LLVM uses AVX, Java, and so forth. Even PHP uses SIMD instructions for some string algorithms. And yes, Rust programs use AVX or other SIMD instructions.
I don't think so. We have a well-tested approach. It is as simple as that...

```cpp
internal::atomic_ptr<const implementation> active_implementation{&detect_best_supported_implementation_on_first_use_singleton};

const implementation *detect_best_supported_implementation_on_first_use::set_best() const noexcept {
  return active_implementation = available_implementations.detect_best_supported();
}
```

So you just need an atomic function pointer. Obviously, you do not get inlining, but that's about the only cost. Loading an atomic pointer is no more expensive, really, than loading an ordinary pointer. So this is pretty much free... except for the first time. You pay a price on first use, but that's not a fundamental limitation: you could set the best function at any time, including at startup.
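For reference, a rough Rust analogue of this detect-once-then-dispatch pattern could look like the sketch below. This is not libcore code; `validate_avx2` is a hypothetical SIMD path with a placeholder body, and `OnceLock` stands in for simdjson's atomic pointer:

```rust
use std::sync::OnceLock;

type ValidateFn = fn(&[u8]) -> bool;

// Scalar fallback: defer to the standard library's validator.
fn validate_scalar(input: &[u8]) -> bool {
    std::str::from_utf8(input).is_ok()
}

// Hypothetical AVX2 path; a real one would wrap an unsafe
// #[target_feature(enable = "avx2")] implementation.
#[allow(dead_code)]
fn validate_avx2(input: &[u8]) -> bool {
    std::str::from_utf8(input).is_ok() // placeholder body
}

static VALIDATE: OnceLock<ValidateFn> = OnceLock::new();

pub fn validate_utf8(input: &[u8]) -> bool {
    // Detection runs once; every later call is a single indirect call
    // through the cached function pointer.
    let f = VALIDATE.get_or_init(|| -> ValidateFn {
        #[cfg(target_arch = "x86_64")]
        {
            if is_x86_feature_detected!("avx2") {
                return validate_avx2;
            }
        }
        validate_scalar
    });
    f(input)
}
```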
I mentioned this off-hand on Twitter, but to clarify: in my benchmarks of pure UTF-8 validation, my code with scalar continuation-byte checks and unaligned loads is still faster than lookup4 (or at least my translation of lookup4 to C). The difference depends on the compiler (on my Haswell, testing 1M of UTF-8, GCC gives .227 (me) vs .319 (L4) cycles/byte, while LLVM gives .240 (me) vs .266 (L4)). The picture gets more complicated in the context of simdjson, where plenty of other code is competing for the same execution ports (and the best algorithm is less clear), but I think in the case of Rust, the pure UTF-8 microbenchmarks are probably more representative.
@lemire to clarify, my benchmarks are already using the lookup4 implementation in simdjson (from the pull request's branch).
@milkey-mouse Fantastic. cc @jkeiser
Blog post: The cost of runtime dispatch.
From a packaging standpoint, we would not want to ship many binaries, each usable only on the particular processors certain people have; instead of creating one binary we would have to create hundreds. This is also why Linux distributions build with only a very general SIMD baseline that almost every CPU has. With runtime detection, we can keep a single binary at the cost of a check when the application starts, and still use the faster version, without many drawbacks. I believe the performance cost is insignificant compared to the cost of building for very specific CPUs and keeping all the binaries for each of them. This is also how ripgrep does SIMD: via runtime detection. Try discussing it with a package distributor and you will understand better why this is the case.
To follow up with what @pickfire said:
And ripgrep is in good company. glibc does the same thing. ripgrep does detection via its dependencies: …
Perhaps my view is just too narrow, as I'm used to dealing only with fixed server platforms (targeting one specific CPU) where we build everything destined for the system, and my local macOS machine, which has a constrained CPU instruction set (and fat binaries for ARM vs. x86_64). Either way, I think compile-time selection should never be excluded as an option if people wish to use it. Thanks for taking the time to explain.
@mlindner Yeah, indeed. If I only had to deploy to a fixed set of server platforms where I knew exactly which ISA extensions were needed, then I'd totally skip the complexity of runtime CPU feature detection and just rely on compile-time selection.
So I guess feature detection will be done multiple times? Wouldn't it be better to only pay the cost of feature detection once across all crates?
Feature detection is cached in …
If you use …
That never worked for me (Godbolt).
Looks like the …. I opened rust-lang/stdarch#1135 for this.
The compat API in simdutf8 v0.1.1 is now as fast on small valid inputs and up to 22 times faster on large valid non-ASCII strings (three times as fast on ASCII) on x86-64 with AVX2 support. If there is a way forward to get this into std or core, I would be happy to work on it.
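For anyone wanting to try it, usage of the compat API looks roughly like this — a sketch based on the crate's documented std-mirroring interface, so treat the details as assumptions:

```rust
use simdutf8::compat::from_utf8;

fn main() {
    let data = "hello, wörld".as_bytes();
    match from_utf8(data) {
        // The compat API mirrors std: same Ok type and a Utf8Error with
        // valid_up_to(), so it can act as a drop-in replacement.
        Ok(s) => println!("valid: {s}"),
        Err(e) => println!("invalid utf-8 after {} bytes", e.valid_up_to()),
    }
}
```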
@hkratz What about ARM and CPUs without SIMD? Will there be a regression?
On ARM we would just use the current standard library implementation.
@m-ou-se Having read the entire conversation in this thread, it's not entirely clear to me what is blocking this from progressing. Would this set a precedent for runtime feature detection, or is it something else? I'd appreciate it if you, or someone else from the libs team, could help me and others like @hkratz understand what would be needed to get this into libcore.
@Voultapher The main problem is how to detect which target features are supported from inside libcore. Except on x86, all platforms require asking the OS which target features are supported by the CPU. libcore by definition doesn't know anything about the OS and as such can't know which target features the CPU supports. It is not safe to use target features before checking that they are supported: the results of doing so can range from crashes to unexpected instructions executing, and in any case it will not produce correct results.
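For code outside libcore (i.e., where std's detection macro is available), the usual safe-wrapper pattern around this looks like the following sketch; the validator body is a placeholder, not a real SIMD UTF-8 check:

```rust
// Compiling the body with AVX2 enabled is only sound if the caller has
// verified CPU support first — hence this fn is unsafe.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn validate_avx2(input: &[u8]) -> bool {
    // Placeholder body; real code would use core::arch AVX2 intrinsics.
    std::str::from_utf8(input).is_ok()
}

pub fn validate(input: &[u8]) -> bool {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // Safe: we just confirmed the CPU supports AVX2.
            return unsafe { validate_avx2(input) };
        }
    }
    // Fallback when detection fails or on other architectures.
    std::str::from_utf8(input).is_ok()
}
```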
The Rust runtime does rely on runtime feature detection, but, as far as I can tell, only indirectly (e.g., by calling memcpy). Languages like Go, Java, and C/C++ quite often use runtime feature detection in their core libraries. A sensible way forward would be to add runtime CPU feature detection directly to the Rust core.
I agree, there are enough examples demonstrating that it can be done within the larger constraints of the Rust language. A nice potential side effect would be efficient and correct feature detection in the standard library; there are enough libraries and tools (ripgrep, memchr, etc.) that have to hand-roll it today. And tbh, intrinsics support feels incomplete without feature detection.
CPU feature detection is supported as part of libstd, as that is the part of the standard library that is allowed to interface with the OS, but libcore is not. How should CPU feature detection work when compiling a kernel against libcore? That would require some way to pass the supported features to libcore. What if a kernel works right now with stable rustc and doesn't provide the supported features to libcore, as there is currently no such mechanism? How can target feature detection support be added to libcore without breaking such a kernel?
To clarify, I poorly worded my earlier comment. I'm thinking about both feature detection and the corresponding dispatch, for example implemented with an atomic function pointer overwrite. I'm not deep enough into the Rust standard library to fully understand the implications of what you said; I'm looking at this from a user perspective. Many Rust applications use the standard library, not only libcore, and parts of the standard library interface directly and transitively perform UTF-8 validation. Somewhere in this stack of abstractions, feature detection and dispatch could happen so that more efficient versions are used. Naively, if it can't easily be added to libcore, why not add it to the stdlib only, as a first step?
One way would be to tell the core library that SIMD implementations are supported (…). For aarch64, detection would not even be needed, because ASIMD/NEON is enabled by default on nightly Rust, so LLVM already autovectorizes to ASIMD/NEON code.
With LLVM/GCC/System V having recently introduced the concept of x86-64-v1/2/3/4, there's a reasonable chance we'll get actual targets for those. Also, there's always the possibility of using build-std for a specific set of features / target CPU. So I'd say for an initial implementation we wouldn't need any runtime detection at all.
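Under such a compile-time-only approach, the selection could be done entirely with `cfg`, as in this sketch (the SIMD body is a placeholder; the assumption is that the build baseline, e.g. `-C target-cpu=x86-64-v3` or a build-std configuration, enables AVX2):

```rust
// Selected at compile time when the target baseline includes AVX2
// (x86-64-v3 implies it). No runtime dispatch, full inlining.
#[cfg(all(target_arch = "x86_64", target_feature = "avx2"))]
fn validate(input: &[u8]) -> bool {
    // Placeholder: a real build would call an AVX2 validator directly.
    std::str::from_utf8(input).is_ok()
}

// Fallback for builds without AVX2 in the target baseline.
#[cfg(not(all(target_arch = "x86_64", target_feature = "avx2")))]
fn validate(input: &[u8]) -> bool {
    std::str::from_utf8(input).is_ok()
}
```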
It seems this issue is suffering a bit from https://xkcd.com/2368/. How can we get to an MVP, even one as small as: you only get this feature if you compile std with a feature flag, or only on x86-64, etc.? Given that it's clear that:
All that remains is breaking the problem into smaller chunks and doing them.
I am currently working on speeding up SIMD UTF-8 validation for strings < 128 bytes. Afterwards, I plan to work on a PR adding SIMD UTF-8 support in core for aarch64, since that is compiled with SIMD support anyway. Following that, we can look at the tradeoffs and performance in real-world projects, and then maybe implement x86-64 support with runtime detection in core (checking CPUID and CR4.OSXSAVE should work on all operating systems, and if not, it could be enabled with a flag std sets when compiling core as part of std).
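A simplified sketch of such a check using the `core::arch` CPUID intrinsics (bit positions per the Intel manuals; a production check would additionally read XCR0 via XGETBV to confirm the OS saves the YMM registers):

```rust
#[cfg(target_arch = "x86_64")]
fn avx2_usable() -> bool {
    use core::arch::x86_64::{__cpuid, __cpuid_count};

    // CPUID leaf 1, ECX bit 27: OSXSAVE, i.e. the OS has enabled XSAVE,
    // which reflects CR4.OSXSAVE.
    let leaf1 = unsafe { __cpuid(1) };
    if leaf1.ecx & (1 << 27) == 0 {
        return false;
    }
    // CPUID leaf 7 (subleaf 0), EBX bit 5: AVX2 supported by the CPU.
    let leaf7 = unsafe { __cpuid_count(7, 0) };
    leaf7.ebx & (1 << 5) != 0
}
```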
It doesn't work inside an SGX secure enclave: https://github.com/rust-lang/stdarch/blob/ec989330959b31348d182767a0d40291519ba9d2/crates/core_arch/src/x86/cpuid.rs#L101-L104
But we can just check for …
Yes, or check …
@hkratz How did that go?
Introduction
The "Parsing Gigabytes of JSON per second" post (ArXiv - langdale, lemire) proposes a novel approach for parsing JSON that is fast enough that on many systems it moves the bottleneck to the disk and network instead of the parser. This is done through the clever use of SIMD instructions.
Something that stood out to me is that JSON is required to be valid utf-8, and the authors came up with new algorithms to validate utf-8 using SIMD instructions that run much faster than conventional approaches.
Since rustc does a lot of utf-8 validation (each `.rs` source file needs to be valid utf-8), it got me curious about what rustc currently does. Validation seems to be done by the following routine:
rust/src/libcore/str/mod.rs, lines 1500 to 1618 (at 2f688ac)
This doesn't appear to use SIMD anywhere, not even conditionally. But it's run a lot, so it might be worth using a more efficient algorithm here.
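For context, the scalar routine is not purely byte-at-a-time: it has a word-at-a-time ASCII fast path. A simplified sketch of that idea (not the actual libcore source):

```rust
// A machine word contains a non-ASCII byte iff any byte has its high bit set.
const NONASCII_MASK: usize =
    usize::from_ne_bytes([0x80; std::mem::size_of::<usize>()]);

fn contains_nonascii(word: usize) -> bool {
    (word & NONASCII_MASK) != 0
}

// Usage idea: read aligned words from the input and skip ahead while
// contains_nonascii(word) is false, dropping to the byte-by-byte state
// machine only for non-ASCII sections.
```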
Performance improvements
The post "Validating UTF-8 strings using as little as 0.7 cycles per byte" shows about an order of magnitude performance improvement on validating utf-8, going from
8
cycles per byte parsed to0.7
cycles per byte parsed.When passing Rust's validation code through the godbolt decompiler,
from_utf8_unchecked
outputs 7 instructions, andfrom_utf8
outputs 57 instructions. In the case offrom_utf8
most instructions seem to occur inside a loop. Which makes it likely we'll be able to observe a performance improvement by using a SIMD-enabled utf-8 parsing algorithm. Especially since economies of scale would apply here -- it's not uncommon for the compiler to parse several million bytes of input in a run. Any improvements here would quickly add up.All examples linked have been compiled with
-O -C target-cpu=native
.Also ecosystem libraries such as
serde_json
perform utf-8 validation in several locations, so would likely also benefit from performance improvements to Rust's utf-8 validation routines.Implementation
There are two known Rust implementations of Lemire's algorithm available today:
The latter even includes benchmarks against the compiler's algorithm (which makes it probable I'm not the first person to think of this). But I haven't been able to successfully compile the benches, so I don't know how they stack up against the current implementation.
I'm not overly familiar with rustc's internals, but it seems we would likely want to keep the current algorithm and enable SIMD algorithms through feature detection. The `simdjson` library has different algorithms for different architectures, but we could probably start with instructions that are widely available and supported on tier-1 targets (such as `AVX2`).

These changes wouldn't require an RFC because no APIs would change. The only outcome would be a performance improvement.
Future work
Lemire's post also covers parsing ASCII in as little as 0.1 cycles per byte parsed. Rust's current ASCII validation algorithm validates bytes one at a time, and could likely benefit from similar optimizations:
rust/src/libcore/str/mod.rs, lines 4136 to 4141 (at 2f688ac)
Speeding this up would likely have ecosystem implications as well. For example, HTTP headers must be valid ASCII and are often performance-sensitive. If the stdlib sped up ASCII validation, the wider ecosystem would likely benefit as well.
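To illustrate what a vectorized ASCII check can look like, here is a minimal sketch using SSE2 (part of the x86-64 baseline, so no runtime detection is needed there). This is an illustration of the technique, not a proposed std implementation:

```rust
#[cfg(target_arch = "x86_64")]
fn is_ascii_sse2(input: &[u8]) -> bool {
    use core::arch::x86_64::{__m128i, _mm_loadu_si128, _mm_movemask_epi8};

    let mut chunks = input.chunks_exact(16);
    for chunk in &mut chunks {
        // SSE2 is guaranteed on x86-64, so calling these intrinsics
        // unconditionally on this architecture is sound.
        let mask = unsafe {
            let v = _mm_loadu_si128(chunk.as_ptr() as *const __m128i);
            // movemask collects each byte's high bit; a non-zero result
            // means a non-ASCII byte somewhere in the 16-byte chunk.
            _mm_movemask_epi8(v)
        };
        if mask != 0 {
            return false;
        }
    }
    // Handle the (up to 15-byte) tail with the scalar check.
    chunks.remainder().iter().all(|&b| b < 0x80)
}
```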
Conclusion
In this issue I propose using a SIMD-enabled algorithm for utf-8 validation in rustc. This seems like an interesting avenue to explore, since there's a reasonable chance it might yield a performance improvement for many Rust programs.
I'm somewhat excited to have stumbled upon this, but was also surprised no issue had been filed for it yet. I'm a bit self-conscious posting this since I'm not a rustc compiler engineer, but I hope it proves useful!
cc/ @jonas-schievink @nnethercote
References