-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Backport prevector optimizations #3132
Backport prevector optimizations #3132
Conversation
8abb590
to
5798eb6
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
utACK
Seeing like 15-20x in prevector benchmarks 👍
Tests also got a little bit faster on Travis: |
…rations much faster 5aad635 Use memset() to optimize prevector::resize() (Evan Klitzke) e46be25 Reduce redundant code of prevector and speed it up (Akio Nakamura) f0e7aa7 Add new prevector benchmarks. (Evan Klitzke) Pull request description: This branch optimizes various `prevector` operations, especially resizing vectors. While profiling the `loadblk` thread I noticed that a lot of time was being spent in `prevector::resize()` which led to this work. I have some data here indicating that it takes up **37%** of the time in `ReadBlockFromDisk()`: https://monad.io/readblockfromdisk.svg This branch improves things significantly. For trivial types, the new results for the prevector benchmark are: * `PrevectorClearTrivial` which tests `prevector::clear()` becomes 24.6x faster * `PrevectorDestructorTrivial` which tests `prevector::~prevector()` becomes 20.5x faster * `PrevectorResizeTrivial` which tests `prevector::resize()` becomes 20.3x faster Note that in practice it looks like the prevector is only used to contain `unsigned char` types, which is a trivial type. The benchmarks are testing a bit of an extreme case, but the changes here are motivated by the profiling data for `ReadBlockFromDisk()` I linked to above. The pull request here consists of a series of three commits: * The first adds new benchmarks but does not change the prevector code. * The second is from @AkioNak , and merges some prevector optimizations he submitted in bitcoin#11988 * The third optimizes `prevector::resize()` to use `memset()` when the prevector contains trivially constructible types Tree-SHA512: 28f7cbb91a19f9f43b6a5942781d7eb2e3197389186b666f086b69df12bee37773140f765426d715bfb8ebff79cb27a5f1206d0325b54b4aa65598b50fb18368
86b47fa speed up Unserialize_impl for prevector (Akio Nakamura) Pull request description: The unserializer for prevector uses `resize()` for reserve the area, but it's prefer to use `reserve()` because `resize()` have overhead to call its constructor many times. However, `reserve()` does not change the value of `_size` (a private member of prevector). This PR make the logic of read from stream to callback function, and prevector handles initilizing new values with that call-back and ajust the value of `_size`. The changes are as follows: 1. prevector.h Add a public member function named 'append'. This function has 2 params, number of elemenst to append and call-back function that initilizing new appended values. 2. serialize.h In the following two function: - `Unserialize_impl(Stream& is, prevector<N, T>& v, const unsigned char&)` - `Unserialize_impl(Stream& is, prevector<N, T>& v, const V&)` Make a callback function from each original logic of reading values from stream, and call prevector's `append()`. 3. test/prevector_tests.cpp Add a test for `append()`. ## A benchmark result is following: [Machine] MacBook Pro (macOS 10.13.3/i7 2.2GHz/mem 16GB/SSD) [result] DeserializeAndCheckBlockTest => 22% faster DeserializeBlockTest => 29% faster [before PR] # Benchmark, evals, iterations, total, min, max, median DeserializeAndCheckBlockTest, 60, 160, 94.4901, 0.0094644, 0.0104715, 0.0098339 DeserializeBlockTest, 60, 130, 65.0964, 0.00800362, 0.00895134, 0.00824187 [After PR] # Benchmark, evals, iterations, total, min, max, median DeserializeAndCheckBlockTest, 60, 160, 77.1597, 0.00767013, 0.00858959, 0.00805757 DeserializeBlockTest, 60, 130, 49.9443, 0.00613926, 0.00691187, 0.00635527 ACKs for top commit: laanwj: utACK 86b47fa Tree-SHA512: 62ea121ccd45a306fefc67485a1b03a853435af762607dae2426a87b15a3033d802c8556e1923727ddd1023a1837d0e5f6720c2c77b38196907e750e15fbb902
Until the necessary backports for the benchmark system are backported.
5798eb6
to
bdfc303
Compare
Rebased on develop to get Travis green |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
re-utACK
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
utACK
While debugging and profiling tests I've seen that in some tests (e.g. p2p-fullblocktest.py), 50% of the time can be spent in prevector initialization and copying. These backported PRs should reduce this to near zero time spent.