
Optimize varint parsing #337

Merged
merged 9 commits into from
Jun 20, 2021
Conversation

saethlin
Contributor

@ZoeyR After this meets with your approval I'd like the chance to verify that it fixes the performance issue we're seeing before you merge.

@codecov-commenter

codecov-commenter commented Jun 25, 2020

Codecov Report

Merging #337 into master will decrease coverage by 0.05%.
The diff coverage is 50.00%.


@@            Coverage Diff             @@
##           master     #337      +/-   ##
==========================================
- Coverage   61.40%   61.35%   -0.06%     
==========================================
  Files          11       11              
  Lines         824      828       +4     
==========================================
+ Hits          506      508       +2     
- Misses        318      320       +2     
Impacted Files Coverage Δ
src/de/mod.rs 77.77% <ø> (ø)
src/de/read.rs 53.57% <50.00%> (-0.28%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 1b3c88a...a85958d.

@saethlin saethlin changed the title add provided functions to BincodeRead to customize reading of literals Add provided functions to BincodeRead to customize reading of literals Jun 25, 2020
@ZoeyR
Collaborator

ZoeyR commented Jun 25, 2020

Looking at this, I'm not sure how it can increase performance, since it's just calling the same methods we already called before. The main reason BincodeRead allowed for performance increases was that the slice implementation specialized its behavior.

@saethlin
Contributor Author

added the waiting on author label

Oh hi, that's me. Sorry, this got put at the end of the queue when I became busy and never made its way back up.

The short answer to "how can this increase performance" is that the code paths in byteorder all use the std::io::Read buffer-based interface to move data around. Since compilers aren't perfect, that means we end up with a lot of calls to __memcpy_avx_unaligned just to move 1, 2, 4, or 8 bytes. It looks like the optimizer also falls down in other ways. I have an internal benchmark of our deserialization code that sees a 20% improvement from cleaning up this code path.

To actually realize this improvement, an implementor of BincodeRead needs to bypass the memcpy-like code path. On my end I'm doing this by pulling bytes directly out of a hacked-up std::io::BufReader. If we had access to specialization, I think this means we could make bincode deserialization from a std::io::BufReader faster by default, but alas.
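To illustrate the two code paths being contrasted above, here is a hedged sketch (the function names are hypothetical, not bincode's actual code): the generic path funnels every integer through Read::read_exact, which can lower to a memcpy call for tiny sizes, while the buffer-aware path slices bytes directly out of memory the compiler can see.

```rust
// Hypothetical sketch contrasting the two code paths; not bincode's API.
use std::convert::TryInto;

// Generic Read-based path: copies into a stack buffer via read_exact,
// which for an opaque reader can compile down to a memcpy-like call.
fn read_u32_via_read<R: std::io::Read>(r: &mut R) -> std::io::Result<u32> {
    let mut buf = [0u8; 4];
    r.read_exact(&mut buf)?;
    Ok(u32::from_le_bytes(buf))
}

// Buffer-aware path: when the bytes are already in memory, slice them
// directly; the fixed length is visible to the optimizer, so no call.
fn read_u32_from_slice(buf: &mut &[u8]) -> Option<u32> {
    let (head, rest) = buf.split_at(4.min(buf.len()));
    let bytes: [u8; 4] = head.try_into().ok()?; // None if fewer than 4 bytes
    *buf = rest;
    Some(u32::from_le_bytes(bytes))
}

fn main() {
    let data = [0x78, 0x56, 0x34, 0x12];
    let mut cursor: &[u8] = &data;
    assert_eq!(read_u32_from_slice(&mut cursor), Some(0x1234_5678));
    let mut rdr: &[u8] = &data;
    assert_eq!(read_u32_via_read(&mut rdr).unwrap(), 0x1234_5678);
}
```

Both functions decode the same little-endian u32; the difference the comment describes is purely in what the optimizer can prove about the source of the bytes.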

@codecov-io

Codecov Report

Merging #337 (7236cce) into master (c393171) will increase coverage by 3.04%.
The diff coverage is 100.00%.


@@            Coverage Diff             @@
##           master     #337      +/-   ##
==========================================
+ Coverage   61.40%   64.44%   +3.04%     
==========================================
  Files          11       11              
  Lines         824      827       +3     
==========================================
+ Hits          506      533      +27     
+ Misses        318      294      -24     
Impacted Files Coverage Δ
src/de/mod.rs 78.43% <100.00%> (+0.65%) ⬆️
src/de/read.rs 61.29% <100.00%> (+7.44%) ⬆️
src/config/int.rs 57.14% <0.00%> (+2.04%) ⬆️
src/ser/mod.rs 84.40% <0.00%> (+6.42%) ⬆️
src/config/limit.rs 54.54% <0.00%> (+9.09%) ⬆️
src/config/trailing.rs 71.42% <0.00%> (+14.28%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update c393171...7236cce.

Base automatically changed from master to trunk February 23, 2021 14:27
@VictorKoenders
Contributor

Looks good to me, @saethlin do you still want to check if this fixes your performance issues?

@saethlin
Contributor Author

saethlin commented Apr 8, 2021

@VictorKoenders Thanks for the attention. I'd like to assemble a microbenchmark or two that we can add to this repo to back up these changes. I'm a bit busy at the moment but I should get to this in the next few days.

@codecov-commenter

Codecov Report

❗ No coverage uploaded for pull request base (trunk@922d193).
The diff coverage is 29.62%.


@@           Coverage Diff            @@
##             trunk     #337   +/-   ##
========================================
  Coverage         ?   62.45%           
========================================
  Files            ?       10           
  Lines            ?      871           
  Branches         ?        0           
========================================
  Hits             ?      544           
  Misses           ?      327           
  Partials         ?        0           
Impacted Files Coverage Δ
src/config/int.rs 54.82% <0.00%> (ø)
src/de/read.rs 38.92% <30.92%> (ø)
src/de/mod.rs 77.77% <100.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 922d193...712256c.

@saethlin
Contributor Author

I've force-pushed a new set of commits, which start with adding benchmarks. Anyone interested should be able to check out each commit and run cargo bench to verify that each change, by itself or in combination with other commits, is backed up by the benchmarks. Here's the output on my desktop between the first and last commits (I've deleted the outlier reports):

slice_varint_u8         time:   [11.660 us 11.712 us 11.751 us]
                        change: [-11.588% -10.242% -8.8491%] (p = 0.00 < 0.05)
                        Performance has improved.

slice_varint_u16        time:   [23.902 us 23.904 us 23.906 us]
                        change: [-46.843% -46.831% -46.821%] (p = 0.00 < 0.05)
                        Performance has improved.

slice_varint_u32        time:   [23.496 us 23.498 us 23.499 us]
                        change: [-52.080% -52.075% -52.069%] (p = 0.00 < 0.05)
                        Performance has improved.

slice_varint_u64        time:   [11.782 us 11.784 us 11.785 us]
                        change: [-55.459% -55.447% -55.432%] (p = 0.00 < 0.05)
                        Performance has improved.

bufreader_varint_u8     time:   [25.808 us 25.809 us 25.811 us]
                        change: [-30.269% -30.255% -30.241%] (p = 0.00 < 0.05)
                        Performance has improved.

bufreader_varint_u16    time:   [56.217 us 56.225 us 56.236 us]
                        change: [-30.338% -30.317% -30.296%] (p = 0.00 < 0.05)
                        Performance has improved.

bufreader_varint_u32    time:   [56.507 us 56.514 us 56.521 us]
                        change: [-29.953% -29.936% -29.919%] (p = 0.00 < 0.05)
                        Performance has improved.

bufreader_varint_u64    time:   [56.436 us 56.440 us 56.443 us]
                        change: [-15.982% -15.962% -15.944%] (p = 0.00 < 0.05)
                        Performance has improved.

There's definitely more work that could be done here, but I'm personally wary of scope creep. In the BufReader benchmarks, deserialize_varint is not inlined, probably due to its code size. But its body is basically the same code three times, so with some cleverness it could probably be collapsed enough to become a good candidate for inlining.
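To make the "same code three times" observation concrete, here is a hedged sketch of the shape such a collapsed decoder could take. The discriminant constants follow bincode 1.x's varint scheme as far as I understand it (values through 250 inline, then marker bytes for wider integers); both the constants and the `take` helper are illustrative, not bincode's actual code.

```rust
// Illustrative sketch of a varint decoder collapsed via a generic helper;
// constants assumed from bincode 1.x's varint scheme, names hypothetical.
use std::convert::TryInto;

// Pull exactly N bytes off the front of the slice, if present.
fn take<const N: usize>(buf: &mut &[u8]) -> Option<[u8; N]> {
    if buf.len() < N {
        return None;
    }
    let (head, rest) = buf.split_at(N);
    *buf = rest;
    head.try_into().ok()
}

fn deserialize_varint_u64(buf: &mut &[u8]) -> Option<u64> {
    // First read: the discriminant byte.
    let (&first, rest) = buf.split_first()?;
    *buf = rest;
    // Second read (sometimes): the wider value, one arm per width.
    match first {
        0..=250 => Some(u64::from(first)), // small values stored inline
        251 => take::<2>(buf).map(|b| u64::from(u16::from_le_bytes(b))),
        252 => take::<4>(buf).map(|b| u64::from(u32::from_le_bytes(b))),
        253 => take::<8>(buf).map(u64::from_le_bytes),
        _ => None, // u128 marker and reserved values, out of scope here
    }
}

fn main() {
    assert_eq!(deserialize_varint_u64(&mut &[5u8][..]), Some(5));
    assert_eq!(deserialize_varint_u64(&mut &[251u8, 0x34, 0x12][..]), Some(0x1234));
}
```

The three wider arms differ only in N and the integer type, which is the kind of duplication that a generic helper like this could fold down toward an inlinable size.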

@saethlin saethlin marked this pull request as draft April 23, 2021 00:11
@saethlin
Contributor Author

Based on the observation that deserialize_varint is slow primarily because it tends to do two reads, I've drafted an alternate implementation, which rethinks BincodeRead as a trait that gives bincode generic access to an internal buffer, in the style of BufRead. This is now considerably faster for BufReader, though I'm not totally clear on the slice behavior yet. (It's unclear whether my re-baselining or the latest commit is at fault.)

slice_varint_u8         time:   [12.333 us 12.335 us 12.337 us]
                        change: [+29.307% +29.349% +29.391%] (p = 0.00 < 0.05)
                        Performance has regressed.

slice_varint_u16        time:   [23.022 us 23.027 us 23.033 us]
                        change: [-45.572% -45.528% -45.481%] (p = 0.00 < 0.05)
                        Performance has improved.

slice_varint_u32        time:   [22.865 us 22.875 us 22.886 us]
                        change: [-48.292% -48.236% -48.160%] (p = 0.00 < 0.05)
                        Performance has improved.

slice_varint_u64        time:   [14.270 us 14.272 us 14.273 us]
                        change: [-48.914% -48.906% -48.897%] (p = 0.00 < 0.05)
                        Performance has improved.

bufreader_varint_u8     time:   [16.583 us 16.584 us 16.585 us]
                        change: [-55.812% -55.758% -55.709%] (p = 0.00 < 0.05)
                        Performance has improved.

bufreader_varint_u16    time:   [31.291 us 31.295 us 31.301 us]
                        change: [-62.946% -62.924% -62.904%] (p = 0.00 < 0.05)
                        Performance has improved.

bufreader_varint_u32    time:   [30.902 us 30.904 us 30.906 us]
                        change: [-62.075% -62.061% -62.049%] (p = 0.00 < 0.05)
                        Performance has improved.

bufreader_varint_u64    time:   [31.816 us 31.819 us 31.822 us]
                        change: [-53.799% -53.792% -53.786%] (p = 0.00 < 0.05)
                        Performance has improved.

@saethlin saethlin marked this pull request as ready for review April 24, 2021 03:48
@saethlin
Contributor Author

(unclear if it's my re-baselining or latest commit that is at fault)

The regression is real, but I truly have no idea why. I cannot find any semantic difference between the instructions generated for the benchmark between the two commits. At a guess, it's a code layout issue. I can get the faster codegen back by turning on LTO, which supports that guess: there's no reason symbol visibility should matter here, but running optimization passes in a different order often produces incidental improvements and regressions.

@saethlin
Contributor Author

saethlin commented Apr 24, 2021

With the latest commit, all the benchmarks inline everything into one blob, so we've basically peaked; improvements from here on are almost certainly marginal. Unless anyone has ideas about why slice_varint_u8 regresses (that is, if anyone even cares about that use case), I'm done tuning.

slice_varint_u8         time:   [12.344 us 12.344 us 12.345 us]
                        change: [+29.028% +29.046% +29.064%] (p = 0.00 < 0.05)
                        Performance has regressed.

slice_varint_u16        time:   [23.889 us 23.890 us 23.892 us]
                        change: [-44.029% -44.023% -44.017%] (p = 0.00 < 0.05)
                        Performance has improved.

slice_varint_u32        time:   [25.904 us 25.907 us 25.909 us]
                        change: [-42.067% -42.055% -42.042%] (p = 0.00 < 0.05)
                        Performance has improved.

slice_varint_u64        time:   [15.165 us 15.169 us 15.173 us]
                        change: [-45.931% -45.911% -45.889%] (p = 0.00 < 0.05)
                        Performance has improved.

bufreader_varint_u8     time:   [13.744 us 13.747 us 13.751 us]
                        change: [-63.480% -63.463% -63.448%] (p = 0.00 < 0.05)
                        Performance has improved.

bufreader_varint_u16    time:   [28.180 us 28.398 us 28.564 us]
                        change: [-67.873% -67.604% -67.385%] (p = 0.00 < 0.05)
                        Performance has improved.

bufreader_varint_u32    time:   [24.455 us 24.457 us 24.459 us]
                        change: [-70.190% -70.187% -70.184%] (p = 0.00 < 0.05)
                        Performance has improved.

bufreader_varint_u64    time:   [20.501 us 20.507 us 20.514 us]
                        change: [-70.600% -70.592% -70.584%] (p = 0.00 < 0.05)
                        Performance has improved.

@saethlin saethlin changed the title Add provided functions to BincodeRead to customize reading of literals Optimize varint parsing May 18, 2021
@ZoeyR
Collaborator

ZoeyR commented Jun 20, 2021

Sorry this has taken so long! This all looks great.

@ZoeyR ZoeyR merged commit e0ac324 into bincode-org:trunk Jun 20, 2021