Add optimized slide_hash for Power processors #457

mscastanho · 2019-12-10T14:34:43Z

Hi,

During performance tests, we noticed that slide_hash consumes a considerable share of CPU time during compression on Power processors. This PR introduces an optimized version that uses VSX vector instructions to make it faster. The main difference is that it slides 8 hash elements at a time, instead of just one as the standard code does.
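To illustrate the idea for reviewers, here is a minimal portable sketch of sliding one element at a time versus eight at a time. It is not the code in this PR: the function names are hypothetical, and the inner loop over 8 lanes is plain C standing in for a single 128-bit vector operation (a saturating unsigned subtract such as AltiVec's `vec_subs` would be one way to do it, which is an assumption about the approach rather than a description of the patch).

```c
#include <stddef.h>
#include <stdint.h>

typedef uint16_t Pos;  /* 16-bit hash-table entry, as in zlib's Posf */

/* Scalar reference: one entry per iteration. Entries that would fall
 * below zero after the slide are clamped to 0 (no longer in the window). */
static void slide_hash_scalar(Pos *table, size_t n, unsigned wsize)
{
    for (size_t i = 0; i < n; i++) {
        unsigned m = table[i];
        table[i] = (Pos)(m >= wsize ? m - wsize : 0);
    }
}

/* Chunked sketch: eight 16-bit entries per outer iteration, which is
 * what fits in one 128-bit vector register. The lane loop is a stand-in
 * for one saturating vector subtract. */
static void slide_hash_by8(Pos *table, size_t n, unsigned wsize)
{
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        for (int lane = 0; lane < 8; lane++) {
            unsigned m = table[i + lane];
            table[i + lane] = (Pos)(m >= wsize ? m - wsize : 0);
        }
    }
    /* Tail for sizes that are not a multiple of 8. */
    for (; i < n; i++) {
        unsigned m = table[i];
        table[i] = (Pos)(m >= wsize ? m - wsize : 0);
    }
}
```

Both functions produce the same table contents; only the amount of work per iteration differs.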

The implementation uses the GNU indirect function (ifunc) feature to choose the correct function version at runtime, on the first call. All later calls go directly to the selected function, so the same binary can be used on all Power processor versions. The ifunc helper code is not limited to Power and can be reused by other architectures if desired, so it was placed under contrib/gcc.
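For readers who have not used ifunc before, the following is a rough sketch of the dispatch pattern, assuming GCC (or Clang) with glibc on an ELF target. It does not reproduce the helper macro from this PR: `my_slide_hash`, `resolve_slide`, `cpu_has_vsx`, and the two implementations are hypothetical names, and the "VSX" variant simply forwards to the generic one so the sketch builds anywhere. Where ifunc is not available, the sketch falls back to a function pointer that rebinds itself on the first call.

```c
#include <stddef.h>
#include <stdint.h>  /* on glibc this also makes __GLIBC__ visible */

typedef void (*slide_fn)(uint16_t *table, size_t n, unsigned wsize);

/* Generic implementation, usable on any processor. */
static void slide_generic(uint16_t *table, size_t n, unsigned wsize)
{
    for (size_t i = 0; i < n; i++)
        table[i] = (uint16_t)(table[i] >= wsize ? table[i] - wsize : 0);
}

/* Placeholder for a vectorized variant (hypothetical). */
static void slide_vsx_standin(uint16_t *table, size_t n, unsigned wsize)
{
    slide_generic(table, n, wsize);
}

/* Hypothetical feature probe; returns 0 on non-Power builds. */
static int cpu_has_vsx(void)
{
#if defined(__powerpc64__) && defined(__GNUC__) && defined(__GLIBC__)
    return __builtin_cpu_supports("vsx");
#else
    return 0;
#endif
}

#if defined(__GNUC__) && defined(__GLIBC__) && defined(__ELF__)
/* The dynamic linker runs the resolver once and binds the symbol
 * to whichever implementation it returns. */
static slide_fn resolve_slide(void)
{
    return cpu_has_vsx() ? slide_vsx_standin : slide_generic;
}

void my_slide_hash(uint16_t *table, size_t n, unsigned wsize)
    __attribute__((ifunc("resolve_slide")));
#else
/* Fallback without ifunc: the pointer starts at a trampoline that
 * picks an implementation on the first call and then rebinds itself. */
static void slide_first_call(uint16_t *table, size_t n, unsigned wsize);
static slide_fn slide_impl = slide_first_call;

static void slide_first_call(uint16_t *table, size_t n, unsigned wsize)
{
    slide_impl = cpu_has_vsx() ? slide_vsx_standin : slide_generic;
    slide_impl(table, n, wsize);
}

void my_slide_hash(uint16_t *table, size_t n, unsigned wsize)
{
    slide_impl(table, n, wsize);
}
#endif
```

Callers just call `my_slide_hash(table, n, wsize)`; they never see which variant was chosen.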

I tried to make as few changes as possible to top-level files (deflate.c), placing most Power-specific code under contrib/power instead.

To measure the performance improvement, we used Chromium's zlib_bench.cc with input files from jsnell/zlib-bench.

The results below show compression throughput in MB/s using RAW deflate, for all compression levels:

jpeg

comp lvl	default	optimized	gain
1	20.4	27.4	+34.31%
2	20.2	26.4	+30.69%
3	20.2	27.1	+34.16%
4	20.3	27.3	+34.48%
5	20.3	27.3	+34.48%
6	20.3	27.3	+34.48%
7	20.3	27.3	+34.48%
8	20.3	27.3	+34.48%
9	20.3	27.3	+34.48%

pngpixels

comp lvl	default	optimized	gain
1	67.0	98.6	+47.16%
2	58.7	79.8	+35.95%
3	38.8	46.7	+20.36%
4	42.1	48.8	+15.91%
5	26.6	29.2	+9.77%
6	13.8	14.5	+5.07%
7	8.9	9.2	+3.37%
8	2.8	2.8	+0.00%
9	1.3	1.3	+0.00%

executable

comp lvl	default	optimized	gain
1	41.3	57.6	+39.47%
2	37.9	50.9	+34.30%
3	29.0	36.1	+24.48%
4	28.4	34.8	+22.54%
5	20.2	23.2	+14.85%
6	12.5	13.7	+9.60%
7	9.5	10.1	+6.32%
8	5.4	5.6	+3.70%
9	4.1	4.2	+2.44%

html

comp lvl	default	optimized	gain
1	43.1	59.3	+37.59%
2	38.6	50.7	+31.35%
3	27.8	33.8	+21.58%
4	28.3	33.1	+16.96%
5	18.1	20.1	+11.05%
6	12.2	13.0	+6.56%
7	10.6	11.2	+5.66%
8	8.0	8.4	+5.00%
9	7.9	8.3	+5.06%
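For anyone re-running the benchmark, the gain column appears to be the relative throughput increase over the default build. A small sketch of that arithmetic, with a hypothetical helper name:

```c
/* Relative throughput gain in percent: (optimized / base - 1) * 100.
 * Both arguments are throughputs in MB/s. */
static double gain_percent(double base, double optimized)
{
    return (optimized / base - 1.0) * 100.0;
}
```

For example, `gain_percent(20.4, 27.4)` reproduces the jpeg level 1 row, about 34.31%.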

mscastanho · 2020-03-10T20:57:33Z

Force-pushed to add changes to feature detection in configure.

Optimized functions for Power will make use of GNU indirect functions, an extension to support different implementations of the same function, which can be selected during runtime. This will be used to provide optimized functions for different processor versions. Since this is a GNU extension, we placed the definition of the Z_IFUNC macro under `contrib/gcc`. This can be reused by other archs as well.

Author: Matheus Castanho
Author: Rogerio Alves

Considerable time is spent on deflate.c:slide_hash() during deflate. This commit introduces a new slide_hash function that uses VSX vector instructions to slide 8 hash elements at a time, instead of just one as the standard code does. The choice between the optimized and default versions is made only on the first call to the function, enabling a fallback to standard behavior if the host processor does not support VSX instructions, so the same binary can be used for multiple Power processor versions.

Author: Matheus Castanho

This was referenced Dec 10, 2019

Add optimization for Adler32 checksum for Power processors #458

Open

Add optimized longest_match for Power processors #459

Open

nmoinvaz mentioned this pull request Jan 17, 2020

Add AltiVec-optimized adler32 and slide_hash for PowerPC zlib-ng/zlib-ng#109

Merged

mscastanho mentioned this pull request Feb 3, 2020

Adding CPU features detection code #468

Open

mscastanho force-pushed the slide-hash-power branch 2 times, most recently from 5a93654 to d62e658 Compare March 10, 2020 20:55

mscastanho force-pushed the slide-hash-power branch from 751c961 to 862cc11 Compare April 6, 2022 13:07

mscastanho force-pushed the slide-hash-power branch from 862cc11 to 505b2fd Compare June 13, 2022 17:03

Neustradamus mentioned this pull request Aug 23, 2023

IBM Power Processors and Zlib #847

Open

Neustradamus mentioned this pull request Jan 1, 2025

CMake and Zlib #831

Closed

fneddy mentioned this pull request Feb 25, 2025

IBM S390X contrib cleanup #1050

Open
