Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Monty-related performance improvements #777

Merged
merged 12 commits into from
Mar 5, 2025
Merged

Conversation

fjarri
Copy link
Contributor

@fjarri fjarri commented Feb 23, 2025

Trying to figure out massive slowdowns crypto-primes experiences for boxed uints (up to 4x). Could be the reason of the slowdowns in RSA as well.

Public changes:

  • Added Monty::div_by_2_assign() (with a blanket impl).
  • Added BoxedUint::inv_mod2k_vartime().
  • Made BoxedUint::inv_mod2k() public.
  • Added Monty::Multiplier associated type and Monty::copy_montgomery_from() to assist with tight loops (specifically, Lucas test in crypto-primes).
  • Cleaned up AMM, added comments and references, and reduced the size of the internal buffer to N from 2N. Also made it const fn. Closes Almost Montgomery Multiplication #782

Note: the multiplier for Uint is called DynMontyMultiplier. Not happy with the name, but we already have MontyMultiplier as a trait, and it clashes.

Note: the exact way MontyMultiplier is exposed and the naming I'm not sure about, also not sure how hazmat do we want to make them. Potentially AMM can be exposed too, but it would be good to wrap the results in some struct that will propagate the "reduction level". Not for this PR, I need to finalize the minimum viable solution.

Fixes:

  • Fixed a bug in BoxedUnsatInt::to_uint() which created a 64-bit number instead of a 32-bit one on 32-bit targets

Internal:

  • Added tests for BoxedUint::inv_mod2k() and inv_mod2k_vartime().
  • Removed allocations inside the loop in BoxedUint::inv_mod2k().
  • Used and inv_mod2k_vartime() in BoxedMontyParams::new_vartime() and new() - since it's only vartime in the k, which is fixed.
  • new_vartime() can be made even faster (~15% for Uint, 25% for Boxed) if we make a variant of inv_mod2k that is vartime in both arguments. Currently added in the commit as inv_mod2k_full_vartime() (crate-private). Can be removed if that's too much detail.
  • Removed an unnecessary allocation in Add/SubAssign of BoxedMontgomeryForm.

Performance notes:

  • BoxedUint::div_by_2() uses div_by_2_assign() because it is faster and does not allocate.
  • Uint::div_by_2() uses the same approach, gets rid of one addition and one shr1(), so it is marginally faster (~10%).
  • As expected, because of inv_mod2k_vartime() usage MontyParams::new/_vartime() became massively faster (~10x for Uint, ~15x for Boxed, 4096 bits).
  • I tried using AMM in Uint, but it leads to performance degradation for smaller uints (U256). So for now we'll keep the status quo with Uint using multiply + reduce. Worth investigating later.

@fjarri
Copy link
Contributor Author

fjarri commented Feb 25, 2025

@tarcieri I'm still working on this PR, but meanwhile, it seems that I'll need to expose the following in the public API:

  • Multiplication/squaring of Monty with the provided buffer (since it seems to be impossible to do Montgomery multiplication fully in place). Currently it's mul_into() and square_into().
  • Copying one uint into another. Currently it's fill_with().

The buffer can be the same size as the arguments if we use interleaved Montgomery multiplication (I added both CIOS and FIOS ones, seem to have the same performance, but CIOS is simpler). For some reason the interleaved form is a couple percent slower than multiplication + reduction, even though it's supposed to have less operations.

Another tight loop where a lot of allocations are happening is pow_bounded_exp(). I think I can reduce them there as well.

@tarcieri
Copy link
Member

I’m fine with exposing those methods in the public API.

Also getting rid of allocations would hopefully benefit rsa which is quite slow now, so that sounds great.

@fjarri fjarri force-pushed the perf branch 9 times, most recently from 1db359c to 0b401e3 Compare February 26, 2025 03:31
@fjarri fjarri force-pushed the perf branch 2 times, most recently from 5986947 to 8a46714 Compare March 5, 2025 20:53
@fjarri fjarri marked this pull request as ready for review March 5, 2025 21:09
@fjarri fjarri requested a review from tarcieri March 5, 2025 21:09
@tarcieri
Copy link
Member

tarcieri commented Mar 5, 2025

Curious what the performance impact on rsa will be /cc @dignifiedquire

@fjarri
Copy link
Contributor Author

fjarri commented Mar 5, 2025

I am actually curious too, was planning to test it. crypto-primes with this PR reaches parity between Uint and BoxedUint for 1024 bits, but it needs some changes to make use of the newly exposed multiplier.

@tarcieri tarcieri merged commit 2734f18 into RustCrypto:master Mar 5, 2025
18 checks passed
@dignifiedquire
Copy link
Member

do I need to update the usage, or does it make sense to test it directly?

@fjarri
Copy link
Contributor Author

fjarri commented Mar 5, 2025

do I need to update the usage, or does it make sense to test it directly?

You might. Some calls are non-allocating automatically, but you may have to use the explicit multiplier object (that encapsulates the scratch buffer) if you do any multiplications in tight loops. Also crypto-primes needs some unpublished changes (to be made into a PR later today).

I'm testing it right now, and it seems that the changes in crypto-primes and crypto-bigint are not enough.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Almost Montgomery Multiplication
3 participants