Monty-related performance improvements #777

fjarri · 2025-02-23T23:23:22Z

Trying to figure out massive slowdowns crypto-primes experiences for boxed uints (up to 4x). Could be the reason of the slowdowns in RSA as well.

Public changes:

Added Monty::div_by_2_assign() (with a blanket impl).
Added BoxedUint::inv_mod2k_vartime().
Made BoxedUint::inv_mod2k() public.
Added Monty::Multiplier associated type and Monty::copy_montgomery_from() to assist with tight loops (specifically, Lucas test in crypto-primes).
Cleaned up AMM, added comments and references, and reduced the size of the internal buffer to N from 2N. Also made it const fn. Closes Almost Montgomery Multiplication #782

Note: the multiplier for Uint is called DynMontyMultiplier. Not happy with the name, but we already have MontyMultiplier as a trait, and it clashes.

Note: the exact way MontyMultiplier is exposed and the naming I'm not sure about, also not sure how hazmat do we want to make them. Potentially AMM can be exposed too, but it would be good to wrap the results in some struct that will propagate the "reduction level". Not for this PR, I need to finalize the minimum viable solution.

Fixes:

Fixed a bug in BoxedUnsatInt::to_uint() which created a 64-bit number instead of a 32-bit one on 32-bit targets

Internal:

Added tests for BoxedUint::inv_mod2k() and inv_mod2k_vartime().
Removed allocations inside the loop in BoxedUint::inv_mod2k().
Used and inv_mod2k_vartime() in BoxedMontyParams::new_vartime() and new() - since it's only vartime in the k, which is fixed.
new_vartime() can be made even faster (~15% for Uint, 25% for Boxed) if we make a variant of inv_mod2k that is vartime in both arguments. Currently added in the commit as inv_mod2k_full_vartime() (crate-private). Can be removed if that's too much detail.
Removed an unnecessary allocation in Add/SubAssign of BoxedMontgomeryForm.

Performance notes:

BoxedUint::div_by_2() uses div_by_2_assign() because it is faster and does not allocate.
Uint::div_by_2() uses the same approach, gets rid of one addition and one shr1(), so it is marginally faster (~10%).
As expected, because of inv_mod2k_vartime() usage MontyParams::new/_vartime() became massively faster (~10x for Uint, ~15x for Boxed, 4096 bits).
I tried using AMM in Uint, but it leads to performance degradation for smaller uints (U256). So for now we'll keep the status quo with Uint using multiply + reduce. Worth investigating later.

fjarri · 2025-02-25T01:05:57Z

@tarcieri I'm still working on this PR, but meanwhile, it seems that I'll need to expose the following in the public API:

Multiplication/squaring of Monty with the provided buffer (since it seems to be impossible to do Montgomery multiplication fully in place). Currently it's mul_into() and square_into().
Copying one uint into another. Currently it's fill_with().

The buffer can be the same size as the arguments if we use interleaved Montgomery multiplication (I added both CIOS and FIOS ones, seem to have the same performance, but CIOS is simpler). For some reason the interleaved form is a couple percent slower than multiplication + reduction, even though it's supposed to have less operations.

Another tight loop where a lot of allocations are happening is pow_bounded_exp(). I think I can reduce them there as well.

tarcieri · 2025-02-25T07:47:40Z

I’m fine with exposing those methods in the public API.

Also getting rid of allocations would hopefully benefit rsa which is quite slow now, so that sounds great.

…uctor

src/modular/boxed_monty_form.rs

src/modular/monty_form.rs

src/uint/boxed/inv_mod.rs

tarcieri · 2025-03-05T22:07:51Z

Curious what the performance impact on rsa will be /cc @dignifiedquire

fjarri · 2025-03-05T22:13:18Z

I am actually curious too, was planning to test it. crypto-primes with this PR reaches parity between Uint and BoxedUint for 1024 bits, but it needs some changes to make use of the newly exposed multiplier.

dignifiedquire · 2025-03-05T22:25:02Z

do I need to update the usage, or does it make sense to test it directly?

fjarri · 2025-03-05T22:27:13Z

do I need to update the usage, or does it make sense to test it directly?

You might. Some calls are non-allocating automatically, but you may have to use the explicit multiplier object (that encapsulates the scratch buffer) if you do any multiplications in tight loops. Also crypto-primes needs some unpublished changes (to be made into a PR later today).

I'm testing it right now, and it seems that the changes in crypto-primes and crypto-bigint are not enough.

fjarri force-pushed the perf branch 9 times, most recently from 1db359c to 0b401e3 Compare February 26, 2025 03:31

fjarri force-pushed the perf branch 2 times, most recently from 5986947 to 8a46714 Compare March 5, 2025 20:53

fjarri added 12 commits March 5, 2025 12:55

Fix a bug in BoxedUnsatInt::to_uint()

c580dae

Add Monty::div_by_2_assign()

46013b9

More inv_mod2k methods and using them in MontyParams constructors

dca8c0b

Make MontyForm constructors const

c820291

Use sub_assign_mod_with_carry() in montgomery_reduction_boxed_mut()

0abfc4b

Get rid of allocations in Add/SubAssign for BoxedMontyForm

0cc9826

Use AMM for BoxedMontyForm multiplication

354efe6

Use AMM for conversion to Montgomery and in BoxedMontyParams constr…

c1dd823

…uctor

Use AMM in BoxedMontyForm::retrieve

d62e139

Make AMM functions const

6529b75

Use canonical CIOS with N-sized buffer, expand comments.

a3d464c

Expose MontyMultiplier

8dc0d4a

fjarri force-pushed the perf branch from 8a46714 to 8dc0d4a Compare March 5, 2025 20:56

fjarri marked this pull request as ready for review March 5, 2025 21:09

fjarri requested a review from tarcieri March 5, 2025 21:09

tarcieri reviewed Mar 5, 2025

View reviewed changes

src/modular/boxed_monty_form.rs Show resolved Hide resolved

tarcieri reviewed Mar 5, 2025

View reviewed changes

src/modular/monty_form.rs Show resolved Hide resolved

tarcieri reviewed Mar 5, 2025

View reviewed changes

src/uint/boxed/inv_mod.rs Show resolved Hide resolved

tarcieri approved these changes Mar 5, 2025

View reviewed changes

tarcieri merged commit 2734f18 into RustCrypto:master Mar 5, 2025
18 checks passed

fjarri mentioned this pull request Mar 5, 2025

Slowdown after switch to crypto-bigint RustCrypto/RSA#490

Open

fjarri deleted the perf branch March 5, 2025 22:56

fjarri mentioned this pull request Mar 5, 2025

Use inplace ops in Lucas test entropyxyz/crypto-primes#78

Draft

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Monty-related performance improvements #777

Monty-related performance improvements #777

fjarri commented Feb 23, 2025 •

edited

Loading

fjarri commented Feb 25, 2025

tarcieri commented Feb 25, 2025

tarcieri commented Mar 5, 2025

fjarri commented Mar 5, 2025

dignifiedquire commented Mar 5, 2025

fjarri commented Mar 5, 2025

Monty-related performance improvements #777

Monty-related performance improvements #777

Conversation

fjarri commented Feb 23, 2025 • edited Loading

fjarri commented Feb 25, 2025

tarcieri commented Feb 25, 2025

tarcieri commented Mar 5, 2025

fjarri commented Mar 5, 2025

dignifiedquire commented Mar 5, 2025

fjarri commented Mar 5, 2025

fjarri commented Feb 23, 2025 •

edited

Loading