Add optimized slide_hash for Power processors #457
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Hi,
During performance tests, we noticed that slide_hash consumes considerable CPU during compression on Power processors. This PR introduces an optimized version using VSX vector instructions to make it faster. The main difference is that it slides 8 elements at a time, instead of just one as the standard code does.
The implementation uses GNU indirect function (ifunc) feature to choose the correct function version to be used on the first call during runtime. Later calls will all go directly to the selected function. This way, the same binary can be used for all Power processor versions. The ifunc helper code, however, is not limited to Power, and can be reused by other archs if wanted, so it was placed under
contrib/gcc
.I tried to make as few changes as possible to top-level files (
deflate.c
), and instead place most Power-specific code undercontrib/power
.To measure the performance improvement, we used Chromium's zlib_bench.cc with input files from jsnell/zlib-bench.
The results below show compression throughput in MB/s using RAW deflate, for all compression levels:
jpeg
pngpixels
executable
html