argon2: add parallelism #547

Open
wants to merge 4 commits into
base: master

Conversation

jonasmalacofilho

@jonasmalacofilho jonasmalacofilho commented Jan 13, 2025

Adds a default-enabled parallel feature, with an otherwise optional dependency on rayon, and parallelizes the filling of blocks using the memory views described below.
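As a rough illustration of how a feature-gated parallel path can coexist with a serial fallback, here is a minimal sketch. The names `fill_lane`, `for_each_lane` and `fill_all_lanes` are hypothetical and are not taken from this PR; a plain `u64` stands in for an Argon2 block, and the rayon path assumes `rayon` is declared as an optional dependency enabled by the `parallel` feature.

```rust
// Hypothetical sketch, not the PR's code: names and types are illustrative.
#[cfg(feature = "parallel")]
use rayon::prelude::*;

/// Stand-in for the per-lane work; here it only writes the lane index.
fn fill_lane(lane: usize, segment: &mut [u64]) {
    for word in segment.iter_mut() {
        *word = lane as u64;
    }
}

/// Parallel variant, compiled only when the `parallel` feature is enabled.
#[cfg(feature = "parallel")]
fn for_each_lane(memory: &mut [u64], lane_len: usize) {
    memory
        .par_chunks_mut(lane_len)
        .enumerate()
        .for_each(|(lane, segment)| fill_lane(lane, segment));
}

/// Serial fallback with the same signature and the same result.
#[cfg(not(feature = "parallel"))]
fn for_each_lane(memory: &mut [u64], lane_len: usize) {
    memory
        .chunks_mut(lane_len)
        .enumerate()
        .for_each(|(lane, segment)| fill_lane(lane, segment));
}

/// Callers use one entry point regardless of the feature selection.
pub fn fill_all_lanes(memory: &mut [u64], lane_len: usize) {
    for_each_lane(memory, lane_len);
}
```

Keeping both variants behind one function means the rest of the code does not need to know which one was compiled in.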

Coordinated shared access in the memory blocks is implemented with a SegmentViewIter iterator, which implements either rayon::iter::ParallelIterator or core::iter::Iterator and returns SegmentView views into the Argon2 blocks memory that are safe to be used in parallel.

The views alias in the regions that are read-only, but are disjoint in the regions where mutation happens. Effectively, they implement, with a combination of mutable borrowing and runtime checking, the cooperative contract outlined in RFC 9106. This is similar to what was suggested in #380.

To avoid aliasing mutable references into the entire buffer of blocks (which would be UB), pointers are used up to the moment where a reference (shared or mutable) into a specific block is returned. At that point, aliasing is no longer possible.
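The idea can be sketched in a simplified form as follows. This is not the PR's `SegmentView` implementation: the struct layout, the method signatures and the use of a single `u64` per block are assumptions made to keep the example short. It only illustrates holding a raw pointer and creating a reference to one block after a runtime check.

```rust
use std::marker::PhantomData;

const SLICES: usize = 4; // RFC 9106 divides each lane into 4 slices.

/// Hypothetical simplified view: one `u64` stands in for a 1 KiB block.
pub struct SegmentView<'a> {
    base: *mut u64,
    lane: usize,
    slice: usize,
    lane_len: usize,
    lanes: usize,
    _borrow: PhantomData<&'a mut [u64]>,
}

// SAFETY: a view only hands out references that pass the checks below,
// so moving it to another thread cannot introduce a data race.
unsafe impl Send for SegmentView<'_> {}

impl SegmentView<'_> {
    pub fn lane(&self) -> usize {
        self.lane
    }

    fn in_current_slice(&self, index: usize) -> bool {
        index / (self.lane_len / SLICES) == self.slice
    }

    /// Shared access: own lane, or other lanes outside the slice being written.
    pub fn get(&self, lane: usize, index: usize) -> Option<&u64> {
        if lane >= self.lanes || index >= self.lane_len {
            return None;
        }
        if lane != self.lane && self.in_current_slice(index) {
            return None; // another view may be mutating this block
        }
        // SAFETY: in bounds, and no other view can hold `&mut` to this block.
        Some(unsafe { &*self.base.add(lane * self.lane_len + index) })
    }

    /// Exclusive access: only blocks inside this view's own segment.
    pub fn get_mut(&mut self, index: usize) -> Option<&mut u64> {
        if index >= self.lane_len || !self.in_current_slice(index) {
            return None;
        }
        // SAFETY: in bounds; other views refuse to touch this segment, and
        // `&mut self` rules out overlapping `get` calls on this view.
        Some(unsafe { &mut *self.base.add(self.lane * self.lane_len + index) })
    }
}

/// Creates exactly one view per lane for the given slice.
pub fn segment_views(memory: &mut [u64], lanes: usize, slice: usize) -> Vec<SegmentView<'_>> {
    assert!(lanes > 0 && slice < SLICES);
    assert!(!memory.is_empty() && memory.len() % (lanes * SLICES) == 0);
    let lane_len = memory.len() / lanes;
    let base = memory.as_mut_ptr();
    (0..lanes)
        .map(|lane| SegmentView { base, lane, slice, lane_len, lanes, _borrow: PhantomData })
        .collect()
}
```

In this sketch the borrow checker handles conflicts within a single view (`&self` versus `&mut self`), while the runtime checks handle conflicts between views of different lanes.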

The following tests have been run under Miri and pass (modulo unrelated warnings):

reference_argon2i_v0x13_2_8_2
reference_argon2id_v0x13_2_8_2

(Running these in Miri is quite slow, taking ~5 minutes each, so I only ran the most obviously relevant tests for now).

Finally, the alignment of Blocks increases to 128 bytes to better prevent false sharing on modern platforms. The new value is based on the notes in crossbeam-utils::CachePadded.
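For reference, raising the alignment of a type in Rust is done with `#[repr(align(N))]`. The `Block` below is a hypothetical stand-in (128 `u64` words, i.e. 1024 bytes, matching the Argon2 block size) and not the crate's actual definition.

```rust
/// Hypothetical stand-in for an Argon2 block: 128 x 8 bytes = 1024 bytes.
#[derive(Clone, Copy)]
#[repr(align(128))]
pub struct Block(pub [u64; 128]);

/// Returns true if every block in the slice starts on a 128-byte boundary.
pub fn all_aligned(blocks: &[Block]) -> bool {
    blocks
        .iter()
        .all(|block| (block as *const Block as usize) % 128 == 0)
}
```

Since 1024 is a multiple of 128, the larger alignment does not add padding between consecutive blocks; it only constrains where the buffer may start.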


I also took some inspiration from an intermediate snapshot of #247, before the parallel implementation was removed, as well as from an implementation without any safe abstractions I just worked on for the rust-argon2 crate (sru-systems/rust-argon2#56).

@newpavlov
Member

Could you benchmark the parallel implementation and compare it against the single threaded one?

@jonasmalacofilho

This comment was marked as outdated.

Comment on lines +345 to +400
memory_blocks
    .segment_views(slice, lanes)
    .for_each(|mut memory_view| {
        let lane = memory_view.lane();
Author

@jonasmalacofilho jonasmalacofilho Jan 13, 2025


Please note that this fill_blocks diff is very noisy, due to a necessary indentation change + rustfmt + diff getting somewhat lost.

The only changes to this function are the use of the segment view iterator (here), accessing memory through the segment view API instead of indexing the memory_blocks slice (below), and changing memory_blocks to be mutable (above).

@jonasmalacofilho
Author

jonasmalacofilho commented Jan 16, 2025

Some benchmarks:

Note: these are outdated since the removal of 018c3e9 due to #568 (comment).

Benchmarking master...HEAD with parallel feature
argon2i V0x10           time:   [21.324 ms 21.344 ms 21.371 ms]                          
                        change: [-0.3322% -0.1068% +0.0761%] (p = 0.34 > 0.05)
                        No change in performance detected.

argon2i V0x13           time:   [21.429 ms 21.447 ms 21.471 ms]                          
                        change: [+0.0329% +0.2197% +0.3896%] (p = 0.01 < 0.05)
                        Change within noise threshold.

argon2d V0x10           time:   [21.302 ms 21.322 ms 21.348 ms]                          
                        change: [+0.6139% +0.8010% +0.9679%] (p = 0.00 < 0.05)
                        Change within noise threshold.

argon2d V0x13           time:   [21.367 ms 21.384 ms 21.408 ms]                          
                        change: [+1.8140% +1.9978% +2.1628%] (p = 0.00 < 0.05)
                        Performance has regressed.

argon2id V0x10          time:   [21.361 ms 21.379 ms 21.405 ms]                           
                        change: [+1.2980% +1.4700% +1.6321%] (p = 0.00 < 0.05)
                        Performance has regressed.

argon2id V0x13          time:   [21.303 ms 21.320 ms 21.342 ms]                           
                        change: [+0.9147% +1.1631% +1.3556%] (p = 0.00 < 0.05)
                        Change within noise threshold.

argon2id V0x13 m=2048 t=4 p=4                                                                             
                        time:   [1.6939 ms 1.6979 ms 1.7026 ms]
                        change: [-58.795% -58.661% -58.490%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=16384 t=4 p=4                                                                            
                        time:   [11.230 ms 11.309 ms 11.391 ms]
                        change: [-67.907% -67.695% -67.447%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=65536 t=4 p=4                                                                            
                        time:   [44.778 ms 45.122 ms 45.489 ms]
                        change: [-71.067% -70.867% -70.621%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=262144 t=4 p=4                                                                            
                        time:   [172.61 ms 173.58 ms 174.61 ms]
                        change: [-72.478% -72.337% -72.127%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=32768 t=2 p=4                                                                            
                        time:   [11.964 ms 12.047 ms 12.132 ms]
                        change: [-69.521% -69.311% -69.093%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=32768 t=8 p=4                                                                            
                        time:   [45.011 ms 45.311 ms 45.623 ms]
                        change: [-69.838% -69.634% -69.434%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=32768 t=16 p=4                                                                            
                        time:   [88.879 ms 89.461 ms 90.061 ms]
                        change: [-69.861% -69.687% -69.482%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=32768 t=24 p=4                                                                            
                        time:   [133.26 ms 134.09 ms 134.93 ms]
                        change: [-69.816% -69.628% -69.446%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=2048 t=8 p=1                                                                            
                        time:   [8.1242 ms 8.1254 ms 8.1268 ms]
                        change: [+1.4099% +1.4320% +1.4529%] (p = 0.00 < 0.05)
                        Performance has regressed.

argon2id V0x13 m=2048 t=8 p=2                                                                             
                        time:   [4.8775 ms 4.9057 ms 4.9336 ms]
                        change: [-39.640% -39.331% -38.984%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=2048 t=8 p=4                                                                             
                        time:   [3.2967 ms 3.3045 ms 3.3137 ms]
                        change: [-59.213% -59.105% -58.995%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=2048 t=8 p=6                                                                             
                        time:   [2.5706 ms 2.5757 ms 2.5827 ms]
                        change: [-68.446% -68.385% -68.301%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=2048 t=8 p=8                                                                             
                        time:   [2.1205 ms 2.1339 ms 2.1500 ms]
                        change: [-73.975% -73.809% -73.631%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=2048 t=8 p=12                                                                             
                        time:   [1.8220 ms 1.8515 ms 1.8819 ms]
                        change: [-77.377% -76.954% -76.482%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=2048 t=8 p=16                                                                             
                        time:   [2.2035 ms 2.2221 ms 2.2437 ms]
                        change: [-73.287% -73.088% -72.841%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=2048 t=8 p=64                                                                             
                        time:   [2.2370 ms 2.2553 ms 2.2788 ms]
                        change: [-74.567% -74.380% -74.087%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=32768 t=4 p=1                                                                            
                        time:   [74.181 ms 74.228 ms 74.292 ms]
                        change: [-0.8519% -0.7318% -0.6115%] (p = 0.00 < 0.05)
                        Change within noise threshold.

argon2id V0x13 m=32768 t=4 p=2                                                                            
                        time:   [39.565 ms 39.759 ms 39.980 ms]
                        change: [-47.750% -47.455% -47.143%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=32768 t=4 p=4                                                                            
                        time:   [23.032 ms 23.199 ms 23.368 ms]
                        change: [-69.607% -69.389% -69.150%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=32768 t=4 p=6                                                                            
                        time:   [18.127 ms 18.171 ms 18.214 ms]
                        change: [-75.369% -75.303% -75.239%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=32768 t=4 p=8                                                                            
                        time:   [14.412 ms 14.439 ms 14.471 ms]
                        change: [-80.442% -80.403% -80.360%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=32768 t=4 p=12                                                                            
                        time:   [11.878 ms 12.021 ms 12.200 ms]
                        change: [-83.827% -83.654% -83.390%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=32768 t=4 p=16                                                                            
                        time:   [14.359 ms 14.388 ms 14.423 ms]
                        change: [-80.504% -80.462% -80.415%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=32768 t=4 p=64                                                                            
                        time:   [12.239 ms 12.285 ms 12.343 ms]
                        change: [-83.542% -83.480% -83.391%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=1048576 t=1 p=1                                                                            
                        time:   [652.11 ms 652.26 ms 652.40 ms]
                        change: [-6.4332% -6.4049% -6.3769%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=1048576 t=1 p=2                                                                            
                        time:   [337.65 ms 338.01 ms 338.40 ms]
                        change: [-51.454% -51.401% -51.345%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=1048576 t=1 p=4                                                                            
                        time:   [178.52 ms 179.41 ms 180.40 ms]
                        change: [-74.218% -74.087% -73.947%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=1048576 t=1 p=6                                                                            
                        time:   [137.57 ms 139.27 ms 141.00 ms]
                        change: [-80.074% -79.832% -79.558%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=1048576 t=1 p=8                                                                            
                        time:   [136.21 ms 136.41 ms 136.64 ms]
                        change: [-80.298% -80.265% -80.231%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=1048576 t=1 p=12                                                                            
                        time:   [119.20 ms 120.03 ms 121.02 ms]
                        change: [-82.675% -82.535% -82.391%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=1048576 t=1 p=16                                                                            
                        time:   [146.64 ms 147.06 ms 147.47 ms]
                        change: [-78.611% -78.557% -78.499%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=1048576 t=1 p=64                                                                            
                        time:   [131.18 ms 131.41 ms 131.64 ms]
                        change: [-80.804% -80.771% -80.735%] (p = 0.00 < 0.05)
                        Performance has improved.

Note: 6-core CPU with SMT.
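To read the numbers above as speedups: Criterion's "change" is the relative difference in time against the baseline, so a change of c percent corresponds to a speedup factor of 1 / (1 + c/100). A small helper (the name `speedup_from_change` is made up for this illustration) makes the conversion explicit.

```rust
/// Converts a Criterion relative change in time (in percent, negative when
/// faster) into a speedup factor over the baseline.
pub fn speedup_from_change(change_percent: f64) -> f64 {
    1.0 / (1.0 + change_percent / 100.0)
}
```

For example, the -80.403% median reported for m=32768 t=4 p=8 works out to roughly a 5.1x speedup, and a -50% change would be exactly 2x.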

Also:

Benchmarking master...HEAD without parallel feature, default param tests only
argon2i V0x10           time:   [21.365 ms 21.390 ms 21.419 ms]                          
                        change: [-0.9417% -0.7019% -0.4585%] (p = 0.00 < 0.05)
                        Change within noise threshold.

argon2i V0x13           time:   [21.523 ms 21.548 ms 21.574 ms]                          
                        change: [+0.0241% +0.2325% +0.4389%] (p = 0.03 < 0.05)
                        Change within noise threshold.

argon2d V0x10           time:   [21.201 ms 21.220 ms 21.243 ms]                          
                        change: [-0.6101% -0.4179% -0.2436%] (p = 0.00 < 0.05)
                        Change within noise threshold.

argon2d V0x13           time:   [21.403 ms 21.426 ms 21.453 ms]                          
                        change: [+0.3981% +0.6366% +0.8608%] (p = 0.00 < 0.05)
                        Change within noise threshold.

argon2id V0x10          time:   [21.241 ms 21.258 ms 21.279 ms]                           
                        change: [-1.7410% -1.5319% -1.3262%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13          time:   [21.319 ms 21.335 ms 21.355 ms]                           
                        change: [-0.9682% -0.7757% -0.5904%] (p = 0.00 < 0.05)
                        Change within noise threshold.

@tarcieri
Member

@jonasmalacofilho if you can rebase I added cargo careful in #553 which should help spot issues in unsafe code

@jonasmalacofilho jonasmalacofilho force-pushed the add-parallelism-to-argon2 branch from 264821d to 018c3e9 Compare January 21, 2025 18:32
@jonasmalacofilho
Author

@tarcieri oh, i forgot about that one. Rebased, and thanks for pointing it out!

That said, we should probably also try to add the very cheapest of tests and have it run in Miri in CI:

> That said, there is a lot of Undefined Behavior that is not detected by cargo careful; check out Miri if you want to be more exhaustively covered. The advantage of cargo careful over Miri is that it works on all code, supports using arbitrary system and C FFI functions, and is much faster.

@jonasmalacofilho
Author

By the way, I think there are some things I can improve in the code, but I would really appreciate a review first. And so I've kept edits to a minimum for now, so that you can actually review it.

@tarcieri
Member

Yeah, Miri is tricky specifically because you can't do anything computationally expensive under it. I think we could potentially gate expensive tests under #[cfg(not(miri))] perhaps?

@jonasmalacofilho
Author

> Yeah, Miri is tricky specifically because you can't do anything computationally expensive under it. I think we could potentially gate expensive tests under #[cfg(not(miri))] perhaps?

I think the 2_8_2 (t=2,m=2,p=2) tests are the cheapest in the crate, and still quite expensive... I could try adding t=1,m=8,p=2 tests and see if they execute in acceptable time in CI.

Additionally, maybe a few unit tests ensuring that allowed borrows pass in Miri, and that some known invalid borrow patterns are either impossible at compile time or caught at runtime.
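A minimal sketch of what such gating could look like, assuming a cheap test that runs everywhere and an expensive one that is compiled out under Miri. The `work` function is a hypothetical stand-in for a hashing call whose cost grows with `t * m`; the test names are made up for this illustration.

```rust
/// Hypothetical stand-in for a hashing call whose cost grows with `t * m`.
pub fn work(t: u32, m: u32) -> u64 {
    (0..u64::from(t) * u64::from(m)).sum()
}

#[cfg(test)]
mod tests {
    use super::work;

    // Cheap enough to run everywhere, including under Miri.
    #[test]
    fn cheap_vector() {
        assert_eq!(work(1, 8), 28);
    }

    // Compiled out when the tests are run by Miri, which sets `cfg(miri)`.
    #[cfg(not(miri))]
    #[test]
    fn expensive_vector() {
        assert_eq!(work(2, 8), 120);
    }
}
```

An alternative is `#[cfg_attr(miri, ignore)]`, which keeps the test compiled but reports it as ignored under Miri.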

@jonasmalacofilho
Author

jonasmalacofilho commented Mar 4, 2025

Quick update: I ended up getting stuck trying to remove the (apparently) unrelated warnings from Miri (a warning in crossbeam and a leak due to rayon), and then I couldn't get to this PR for a few weeks.


EDIT: (easily) running the tests in Miri is not currently possible due to crossbeam-rs/crossbeam#1181. Once that fix is released, it's possible that only Tree Borrows may work due to crossbeam-rs/crossbeam#545.

Coordinated shared access in the memory blocks is implemented with
`SegmentViewIter` and associated types, which provide views into Argon2
memory that can be processed in parallel.

These views alias in the regions that are read-only, but are disjoint in
the regions where mutation happens. Effectively, they implement, with a
combination of mutable borrowing and runtime checking, the cooperative
contract outlined in RFC 9106.

To avoid aliasing mutable references into the entire buffer of blocks
(which would be UB), pointers are used up to the moment where a
reference (shared or mutable) into a specific block is returned. At that
point, aliasing is no longer possible, as argued in SAFETY comments
and/or checked at runtime.

Finally, add a `parallel` feature and parallelize filling the blocks
using the memory views mentioned above and rayon.

This was caused by having multiple different versions of criterion, and
therefore the trait, in use: we specified ^0.4, but pprof 0.14.0 already
required ^0.5.

Additionally, use a set instead of trying to avoid repeating a
particular set of params by hand.
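To illustrate the set-based approach: when several parameter sweeps share a common point, collecting them into a set removes the duplicates automatically. The sweep values below are partly borrowed from the benchmark labels posted earlier in this thread and are otherwise assumptions; `bench_params` is a hypothetical name, not the benchmark's actual code.

```rust
use std::collections::BTreeSet;

/// Hypothetical parameter grid: unique (m_cost, t_cost, p_cost) tuples.
pub fn bench_params() -> BTreeSet<(u32, u32, u32)> {
    let mut params = BTreeSet::new();
    // Memory sweep at t=4, p=4.
    for m in [2048, 16384, 32768, 65536, 262144] {
        params.insert((m, 4, 4));
    }
    // Time sweep at m=32768, p=4.
    for t in [2, 4, 8, 16, 24] {
        params.insert((32768, t, 4));
    }
    // Parallelism sweep at m=32768, t=4.
    for p in [1, 2, 4, 6, 8, 12, 16, 64] {
        params.insert((32768, 4, p));
    }
    params
}
```

All three sweeps contain (32768, 4, 4), but the set keeps it once, so each combination is benchmarked a single time. A `BTreeSet` also yields the combinations in a stable, sorted order.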
@jonasmalacofilho jonasmalacofilho force-pushed the add-parallelism-to-argon2 branch from 31cecde to 0f5355a Compare March 8, 2025 17:48
@jonasmalacofilho
Author

I removed the conflict, rebased the PR, fixed/updated the benchmarks and did some other minor cleanup.

Crossbeam-epoch doesn't currently work in Miri (see my edited comment above). Between that and the fact that even the most minimal Miri test would be super slow on GitHub free runners, I just don't think they are worth it for now. (It should still be possible to get an older toolchain and Miri and run some specific tests locally.)

Is there something else you would like me to add here?

@tarcieri
Member

tarcieri commented Mar 9, 2025

@jonasmalacofilho still need to go through it in detail, but there's nothing I see that's an immediate blocker


4 participants