
Optimize varint parsing #337

Merged
merged 9 commits into from
Jun 20, 2021
Conversation

saethlin
Contributor

@ZoeyR After this meets with your approval I'd like the chance to verify that it fixes the performance issue we're seeing before you merge.

@codecov-commenter

codecov-commenter commented Jun 25, 2020

Codecov Report

Merging #337 into master will decrease coverage by 0.05%.
The diff coverage is 50.00%.


@@            Coverage Diff             @@
##           master     #337      +/-   ##
==========================================
- Coverage   61.40%   61.35%   -0.06%     
==========================================
  Files          11       11              
  Lines         824      828       +4     
==========================================
+ Hits          506      508       +2     
- Misses        318      320       +2     
Impacted Files Coverage Δ
src/de/mod.rs 77.77% <ø> (ø)
src/de/read.rs 53.57% <50.00%> (-0.28%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 1b3c88a...a85958d.

@saethlin saethlin changed the title add provided functions to BincodeRead to customize reading of literals Add provided functions to BincodeRead to customize reading of literals Jun 25, 2020
@ZoeyR
Collaborator

ZoeyR commented Jun 25, 2020

Looking at this, I'm not sure how it can increase performance, since it's just calling the same methods we already called before. The main reason BincodeRead allowed for performance increases was that the slice implementation specialized its behavior.

@saethlin
Contributor Author

added the waiting on author label

Oh hi, that's me. Sorry, this got put at the end of the queue when I became busy and never made its way back up.

The short answer to "how can this increase performance" is that the code paths in byteorder all use the std::io::Read buffer-based interface to move data around. Since compilers aren't perfect, that means we end up with a lot of calls to __memcpy_avx_unaligned just to move 1, 2, 4, or 8 bytes. It looks like the optimizer also falls down in other ways. I have an internal benchmark of our deserialization code that sees a 20% improvement from cleaning up this code path.

To actually realize this improvement, an implementor of BincodeRead needs to bypass the memcpy-like code path. On my end I'm doing this by pulling bytes directly out of a hacked-up std::io::BufReader. If we had access to specialization, I think this means we could make bincode deserialization from a std::io::BufReader faster by default, but alas.
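To illustrate the two code paths being contrasted above, here is a hedged sketch (the function names are hypothetical, not bincode's actual code): the generic path funnels every integer through Read::read_exact, which can lower to a memcpy call for tiny sizes, while the buffer-aware path slices bytes directly out of memory the compiler can see.

```rust
// Hypothetical sketch contrasting the two code paths; not bincode's API.
use std::convert::TryInto;

// Generic Read-based path: copies into a stack buffer via read_exact,
// which for an opaque reader can compile down to a memcpy-like call.
fn read_u32_via_read<R: std::io::Read>(r: &mut R) -> std::io::Result<u32> {
    let mut buf = [0u8; 4];
    r.read_exact(&mut buf)?;
    Ok(u32::from_le_bytes(buf))
}

// Buffer-aware path: when the bytes are already in memory, slice them
// directly; the fixed length is visible to the optimizer, so no call.
fn read_u32_from_slice(buf: &mut &[u8]) -> Option<u32> {
    let (head, rest) = buf.split_at(4.min(buf.len()));
    let bytes: [u8; 4] = head.try_into().ok()?; // None if fewer than 4 bytes
    *buf = rest;
    Some(u32::from_le_bytes(bytes))
}

fn main() {
    let data = [0x78, 0x56, 0x34, 0x12];
    let mut cursor: &[u8] = &data;
    assert_eq!(read_u32_from_slice(&mut cursor), Some(0x1234_5678));
    let mut rdr: &[u8] = &data;
    assert_eq!(read_u32_via_read(&mut rdr).unwrap(), 0x1234_5678);
}
```

Both functions decode the same little-endian u32; the difference the comment describes is purely in what the optimizer can prove about the source of the bytes.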

@codecov-io

Codecov Report

Merging #337 (7236cce) into master (c393171) will increase coverage by 3.04%.
The diff coverage is 100.00%.


@@            Coverage Diff             @@
##           master     #337      +/-   ##
==========================================
+ Coverage   61.40%   64.44%   +3.04%     
==========================================
  Files          11       11              
  Lines         824      827       +3     
==========================================
+ Hits          506      533      +27     
+ Misses        318      294      -24     
Impacted Files Coverage Δ
src/de/mod.rs 78.43% <100.00%> (+0.65%) ⬆️
src/de/read.rs 61.29% <100.00%> (+7.44%) ⬆️
src/config/int.rs 57.14% <0.00%> (+2.04%) ⬆️
src/ser/mod.rs 84.40% <0.00%> (+6.42%) ⬆️
src/config/limit.rs 54.54% <0.00%> (+9.09%) ⬆️
src/config/trailing.rs 71.42% <0.00%> (+14.28%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update c393171...7236cce.

Base automatically changed from master to trunk February 23, 2021 14:27
@VictorKoenders
Contributor

Looks good to me, @saethlin do you still want to check if this fixes your performance issues?

@saethlin
Contributor Author

saethlin commented Apr 8, 2021

@VictorKoenders Thanks for the attention. I'd like to assemble a microbenchmark or two that we can add to this repo to back up these changes. I'm a bit busy at the moment but I should get to this in the next few days.

@codecov-commenter

Codecov Report

❗ No coverage uploaded for pull request base (trunk@922d193).
The diff coverage is 29.62%.


@@           Coverage Diff            @@
##             trunk     #337   +/-   ##
========================================
  Coverage         ?   62.45%           
========================================
  Files            ?       10           
  Lines            ?      871           
  Branches         ?        0           
========================================
  Hits             ?      544           
  Misses           ?      327           
  Partials         ?        0           
Impacted Files Coverage Δ
src/config/int.rs 54.82% <0.00%> (ø)
src/de/read.rs 38.92% <30.92%> (ø)
src/de/mod.rs 77.77% <100.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 922d193...712256c.

@saethlin
Contributor Author

I've force-pushed a new set of commits, which start with adding benchmarks. Anyone interested should be able to check out each commit and run cargo bench to verify that each change, by itself or in combination with other commits, is backed up by the benchmarks. Here's the output on my desktop between the first and last commits (I've deleted the outlier reports):

slice_varint_u8         time:   [11.660 us 11.712 us 11.751 us]
                        change: [-11.588% -10.242% -8.8491%] (p = 0.00 < 0.05)
                        Performance has improved.

slice_varint_u16        time:   [23.902 us 23.904 us 23.906 us]
                        change: [-46.843% -46.831% -46.821%] (p = 0.00 < 0.05)
                        Performance has improved.

slice_varint_u32        time:   [23.496 us 23.498 us 23.499 us]
                        change: [-52.080% -52.075% -52.069%] (p = 0.00 < 0.05)
                        Performance has improved.

slice_varint_u64        time:   [11.782 us 11.784 us 11.785 us]
                        change: [-55.459% -55.447% -55.432%] (p = 0.00 < 0.05)
                        Performance has improved.

bufreader_varint_u8     time:   [25.808 us 25.809 us 25.811 us]
                        change: [-30.269% -30.255% -30.241%] (p = 0.00 < 0.05)
                        Performance has improved.

bufreader_varint_u16    time:   [56.217 us 56.225 us 56.236 us]
                        change: [-30.338% -30.317% -30.296%] (p = 0.00 < 0.05)
                        Performance has improved.

bufreader_varint_u32    time:   [56.507 us 56.514 us 56.521 us]
                        change: [-29.953% -29.936% -29.919%] (p = 0.00 < 0.05)
                        Performance has improved.

bufreader_varint_u64    time:   [56.436 us 56.440 us 56.443 us]
                        change: [-15.982% -15.962% -15.944%] (p = 0.00 < 0.05)
                        Performance has improved.

There's definitely more work that could be done here, but I'm personally wary of scope creep. In the BufReader benchmarks, deserialize_varint is not inlined, probably due to its code size. But its body is basically the same code three times, so with some cleverness it could probably be collapsed enough to become a good candidate for inlining.
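To make the "same code three times" observation concrete, here is a hedged sketch of the shape such a collapsed decoder could take. The discriminant constants follow bincode 1.x's varint scheme as far as I understand it (values through 250 inline, then marker bytes for wider integers); both the constants and the `take` helper are illustrative, not bincode's actual code.

```rust
// Illustrative sketch of a varint decoder collapsed via a generic helper;
// constants assumed from bincode 1.x's varint scheme, names hypothetical.
use std::convert::TryInto;

// Pull exactly N bytes off the front of the slice, if present.
fn take<const N: usize>(buf: &mut &[u8]) -> Option<[u8; N]> {
    if buf.len() < N {
        return None;
    }
    let (head, rest) = buf.split_at(N);
    *buf = rest;
    head.try_into().ok()
}

fn deserialize_varint_u64(buf: &mut &[u8]) -> Option<u64> {
    // First read: the discriminant byte.
    let (&first, rest) = buf.split_first()?;
    *buf = rest;
    // Second read (sometimes): the wider value, one arm per width.
    match first {
        0..=250 => Some(u64::from(first)), // small values stored inline
        251 => take::<2>(buf).map(|b| u64::from(u16::from_le_bytes(b))),
        252 => take::<4>(buf).map(|b| u64::from(u32::from_le_bytes(b))),
        253 => take::<8>(buf).map(u64::from_le_bytes),
        _ => None, // u128 marker and reserved values, out of scope here
    }
}

fn main() {
    assert_eq!(deserialize_varint_u64(&mut &[5u8][..]), Some(5));
    assert_eq!(deserialize_varint_u64(&mut &[251u8, 0x34, 0x12][..]), Some(0x1234));
}
```

The three wider arms differ only in N and the integer type, which is the kind of duplication that a generic helper like this could fold down toward an inlinable size.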

@saethlin saethlin marked this pull request as draft April 23, 2021 00:11
@saethlin
Contributor Author

Based on the observation that deserialize_varint is slow primarily because it tends to do two reads, I've drafted an alternate implementation, which rethinks BincodeRead as a trait that gives bincode generic access to an internal buffer, in the style of BufRead. This is now considerably faster for BufReader, though I'm not totally clear on the slice behavior yet. (It's unclear whether my re-baselining or the latest commit is at fault.)

slice_varint_u8         time:   [12.333 us 12.335 us 12.337 us]
                        change: [+29.307% +29.349% +29.391%] (p = 0.00 < 0.05)
                        Performance has regressed.

slice_varint_u16        time:   [23.022 us 23.027 us 23.033 us]
                        change: [-45.572% -45.528% -45.481%] (p = 0.00 < 0.05)
                        Performance has improved.

slice_varint_u32        time:   [22.865 us 22.875 us 22.886 us]
                        change: [-48.292% -48.236% -48.160%] (p = 0.00 < 0.05)
                        Performance has improved.

slice_varint_u64        time:   [14.270 us 14.272 us 14.273 us]
                        change: [-48.914% -48.906% -48.897%] (p = 0.00 < 0.05)
                        Performance has improved.

bufreader_varint_u8     time:   [16.583 us 16.584 us 16.585 us]
                        change: [-55.812% -55.758% -55.709%] (p = 0.00 < 0.05)
                        Performance has improved.

bufreader_varint_u16    time:   [31.291 us 31.295 us 31.301 us]
                        change: [-62.946% -62.924% -62.904%] (p = 0.00 < 0.05)
                        Performance has improved.

bufreader_varint_u32    time:   [30.902 us 30.904 us 30.906 us]
                        change: [-62.075% -62.061% -62.049%] (p = 0.00 < 0.05)
                        Performance has improved.

bufreader_varint_u64    time:   [31.816 us 31.819 us 31.822 us]
                        change: [-53.799% -53.792% -53.786%] (p = 0.00 < 0.05)
                        Performance has improved.

@saethlin saethlin marked this pull request as ready for review April 24, 2021 03:48
@saethlin
Contributor Author

(unclear if it's my re-baselining or latest commit that is at fault)

The regression is real, but I truly have no idea why. I cannot find any semantic difference between the instructions generated for the benchmark between the two commits. At a guess, it's a code layout issue. I can get the faster codegen back by turning on LTO, which supports that guess: there's no reason symbol visibility should matter here, but running optimization passes in a different order often produces incidental improvements and regressions.

@saethlin
Contributor Author

saethlin commented Apr 24, 2021

With the latest commit, all the benchmarks inline everything into one blob, so we've basically peaked; improvements from here on are almost certainly marginal. Unless anyone has ideas about why slice_varint_u8 regresses (that is, if anyone even cares about that use case), I'm done tuning.

slice_varint_u8         time:   [12.344 us 12.344 us 12.345 us]
                        change: [+29.028% +29.046% +29.064%] (p = 0.00 < 0.05)
                        Performance has regressed.

slice_varint_u16        time:   [23.889 us 23.890 us 23.892 us]
                        change: [-44.029% -44.023% -44.017%] (p = 0.00 < 0.05)
                        Performance has improved.

slice_varint_u32        time:   [25.904 us 25.907 us 25.909 us]
                        change: [-42.067% -42.055% -42.042%] (p = 0.00 < 0.05)
                        Performance has improved.

slice_varint_u64        time:   [15.165 us 15.169 us 15.173 us]
                        change: [-45.931% -45.911% -45.889%] (p = 0.00 < 0.05)
                        Performance has improved.

bufreader_varint_u8     time:   [13.744 us 13.747 us 13.751 us]
                        change: [-63.480% -63.463% -63.448%] (p = 0.00 < 0.05)
                        Performance has improved.

bufreader_varint_u16    time:   [28.180 us 28.398 us 28.564 us]
                        change: [-67.873% -67.604% -67.385%] (p = 0.00 < 0.05)
                        Performance has improved.

bufreader_varint_u32    time:   [24.455 us 24.457 us 24.459 us]
                        change: [-70.190% -70.187% -70.184%] (p = 0.00 < 0.05)
                        Performance has improved.

bufreader_varint_u64    time:   [20.501 us 20.507 us 20.514 us]
                        change: [-70.600% -70.592% -70.584%] (p = 0.00 < 0.05)
                        Performance has improved.

@saethlin saethlin changed the title Add provided functions to BincodeRead to customize reading of literals Optimize varint parsing May 18, 2021
@ZoeyR
Collaborator

ZoeyR commented Jun 20, 2021

Sorry this has taken so long! This all looks great.

@ZoeyR ZoeyR merged commit e0ac324 into bincode-org:trunk Jun 20, 2021