Monty-related performance improvements #777
@tarcieri I'm still working on this PR, but meanwhile, it seems that I'll need to expose the following in the public API:
The buffer can be the same size as the arguments if we use interleaved Montgomery multiplication (I added both the CIOS and FIOS variants; they seem to have the same performance, but CIOS is simpler). For some reason the interleaved form is a couple of percent slower than multiplication + reduction, even though it's supposed to have fewer operations. Another tight loop where a lot of allocations are happening is |
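For readers unfamiliar with the interleaved form, here is a minimal sketch of CIOS (Coarsely Integrated Operand Scanning) Montgomery multiplication on 64-bit limbs. It is not the code from this PR: the names `mont_mul_cios` and `neg_inv_u64` are hypothetical, the final subtraction is variable-time, and limbs are assumed to be little-endian `u64`. It does show why the output buffer can be the same size as the arguments: the two extra carry limbs live in local variables.

```rust
use std::cmp::Ordering;

/// Returns -m0^{-1} mod 2^64 for an odd `m0` (hypothetical helper).
fn neg_inv_u64(m0: u64) -> u64 {
    // Newton iteration: each step doubles the number of correct low bits.
    let mut inv: u64 = 1; // correct modulo 2
    for _ in 0..6 {
        inv = inv.wrapping_mul(2u64.wrapping_sub(m0.wrapping_mul(inv)));
    }
    inv.wrapping_neg()
}

/// Computes a * b * R^{-1} mod m with R = 2^(64 * n), little-endian limbs.
/// Requires an odd `m`, and `a`, `b` < `m`. Variable-time sketch.
fn mont_mul_cios(a: &[u64], b: &[u64], m: &[u64], m0_neg_inv: u64, out: &mut [u64]) {
    let n = m.len();
    assert!(n > 0 && a.len() == n && b.len() == n && out.len() == n);
    out.fill(0);
    let mut hi: u64 = 0; // limb `n` of the accumulator, kept outside `out`

    for i in 0..n {
        // Multiplication step: accumulator += a * b[i].
        let mut carry: u64 = 0;
        for j in 0..n {
            let t = out[j] as u128 + (a[j] as u128) * (b[i] as u128) + carry as u128;
            out[j] = t as u64;
            carry = (t >> 64) as u64;
        }
        let t = hi as u128 + carry as u128;
        let top = t as u64;
        let top2 = (t >> 64) as u64; // limb `n + 1`, at most 1

        // Reduction step: add q * m so the low limb becomes zero, then shift
        // the accumulator down by one limb.
        let q = out[0].wrapping_mul(m0_neg_inv);
        let t = out[0] as u128 + (q as u128) * (m[0] as u128);
        let mut carry = (t >> 64) as u64;
        for j in 1..n {
            let t = out[j] as u128 + (q as u128) * (m[j] as u128) + carry as u128;
            out[j - 1] = t as u64;
            carry = (t >> 64) as u64;
        }
        let t = top as u128 + carry as u128;
        out[n - 1] = t as u64;
        hi = top2 + (t >> 64) as u64;
    }

    // The accumulator is below 2m, so one conditional subtraction suffices.
    let ge = hi != 0 || out.iter().rev().cmp(m.iter().rev()) != Ordering::Less;
    if ge {
        let mut borrow = false;
        for j in 0..n {
            let (d, b1) = out[j].overflowing_sub(m[j]);
            let (d, b2) = d.overflowing_sub(borrow as u64);
            out[j] = d;
            borrow = b1 || b2;
        }
    }
}
```

A quick sanity check is to multiply by `R mod m`, which must return the other operand unchanged; for `m = 2^128 - 159` that value is simply 159.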
I’m fine with exposing those methods in the public API. Also getting rid of allocations would hopefully benefit |
Curious what the performance impact on |
I am actually curious too, was planning to test it. |
do I need to update the usage, or does it make sense to test it directly? |
You might. Some calls are non-allocating automatically, but you may have to use the explicit multiplier object (that encapsulates the scratch buffer) if you do any multiplications in tight loops. Also I'm testing it right now, and it seems that the changes in |
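To illustrate the idea of a multiplier object that encapsulates its scratch buffer, here is a rough sketch. `ScratchMultiplier` and `mul_wide` are hypothetical names, not the crate's API, and the sketch performs a plain schoolbook wide multiplication rather than Montgomery multiplication. The point is only that the buffer is allocated once in `new` and reused on every call, so a tight loop performs no further allocations.

```rust
/// Hypothetical multiplier that owns its scratch space (little-endian u64 limbs).
struct ScratchMultiplier {
    scratch: Vec<u64>,
}

impl ScratchMultiplier {
    /// Allocates the scratch buffer once, sized for two `limbs`-limb operands.
    fn new(limbs: usize) -> Self {
        Self { scratch: vec![0; 2 * limbs] }
    }

    /// Schoolbook product of `a` and `b`, written into the reused buffer.
    fn mul_wide(&mut self, a: &[u64], b: &[u64]) -> &[u64] {
        assert_eq!(a.len() + b.len(), self.scratch.len());
        self.scratch.fill(0);
        for (i, &bi) in b.iter().enumerate() {
            let mut carry = 0u64;
            for (j, &aj) in a.iter().enumerate() {
                let t = self.scratch[i + j] as u128
                    + (aj as u128) * (bi as u128)
                    + carry as u128;
                self.scratch[i + j] = t as u64;
                carry = (t >> 64) as u64;
            }
            // This limb has not been written yet in the current pass.
            self.scratch[i + a.len()] = carry;
        }
        &self.scratch
    }
}
```

In a loop, the caller would construct the multiplier once outside the loop and call `mul_wide` inside it.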
Trying to figure out massive slowdowns crypto-primes experiences for boxed uints (up to 4x). Could be the reason for the slowdowns in RSA as well.

Public changes:
- Monty::div_by_2_assign() (with a blanket impl).
- BoxedUint::inv_mod2k_vartime().
- BoxedUint::inv_mod2k() public.
- Monty::Multiplier associated type and Monty::copy_montgomery_from() to assist with tight loops (specifically, Lucas test in crypto-primes).
- const fn. Closes Almost Montgomery Multiplication #782

Note: the multiplier for Uint is called DynMontyMultiplier. Not happy with the name, but we already have MontyMultiplier as a trait, and it clashes.

Note: the exact way MontyMultiplier is exposed and the naming I'm not sure about, also not sure how hazmat do we want to make them. Potentially AMM can be exposed too, but it would be good to wrap the results in some struct that will propagate the "reduction level". Not for this PR, I need to finalize the minimum viable solution.
Fixes:
- BoxedUnsatInt::to_uint() which created a 64-bit number instead of a 32-bit one on 32-bit targets

Internal:
- BoxedUint::inv_mod2k() and inv_mod2k_vartime().
- BoxedUint::inv_mod2k().
- inv_mod2k_vartime() in BoxedMontyParams::new_vartime() and new() - since it's only vartime in the k, which is fixed. new_vartime() can be made even faster (~15% for Uint, 25% for Boxed) if we make a variant of inv_mod2k that is vartime in both arguments. Currently added in the commit as inv_mod2k_full_vartime() (crate-private). Can be removed if that's too much detail.
- BoxedMontgomeryForm
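
As background for the inv_mod2k items, this is a minimal sketch of inverting an odd value modulo 2^k on a single `u64`, using Newton (Hensel) lifting. It is an illustration only and not necessarily the algorithm the crate uses; `inv_mod2k_u64` is a hypothetical name. Note that the number of iterations depends only on `k`, not on the value being inverted, which is the sense in which a routine can be "vartime in k" only.

```rust
/// Inverse of an odd `x` modulo 2^k, for 1 <= k <= 64 (hypothetical helper).
fn inv_mod2k_u64(x: u64, k: u32) -> u64 {
    assert!(x & 1 == 1, "only odd values are invertible modulo 2^k");
    assert!((1..=64).contains(&k));

    // `inv` is correct modulo 2^bits; every Newton step doubles `bits`.
    let mut inv: u64 = 1;
    let mut bits: u32 = 1;
    while bits < k {
        inv = inv.wrapping_mul(2u64.wrapping_sub(x.wrapping_mul(inv)));
        bits *= 2;
    }

    // Keep only the low k bits.
    if k == 64 {
        inv
    } else {
        inv & ((1u64 << k) - 1)
    }
}
```

For example, `inv_mod2k_u64(7, 8)` yields 183, since 7 * 183 = 1281 = 5 * 256 + 1.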
Performance notes:
- BoxedUint::div_by_2() uses div_by_2_assign() because it is faster and does not allocate.
- Uint::div_by_2() uses the same approach, gets rid of one addition and one shr1(), so it is marginally faster (~10%).
- inv_mod2k_vartime() usage MontyParams::new/_vartime() became massively faster (~10x for Uint, ~15x for Boxed, 4096 bits).
- Uint, but it leads to performance degradation for smaller uints (U256). So for now we'll keep the status quo with Uint using multiply + reduce. Worth investigating later.
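
For context on the div_by_2 notes, here is a small sketch of halving modulo an odd modulus on a single `u64`. It is an assumption about the general technique, not a copy of the crate's implementation, and `half_mod` is a hypothetical name. Instead of computing `x + m` and then shifting, it shifts `x` first and conditionally adds the precomputable constant `(m >> 1) + 1`, which avoids overflow. The same operation is valid on values in Montgomery form, because that representation is linear.

```rust
/// Returns x / 2 mod m for an odd modulus `m` and `x < m` (hypothetical helper).
fn half_mod(x: u64, m: u64) -> u64 {
    debug_assert!(m & 1 == 1 && x < m);
    // All ones if x is odd, all zeros otherwise (branch-free selection).
    let mask = (x & 1).wrapping_neg();
    // For odd x: (x + m) / 2 == (x >> 1) + (m >> 1) + 1, with no overflow.
    (x >> 1) + (mask & ((m >> 1) + 1))
}
```

For example, with m = 13, `half_mod(7, 13)` is 10, and doubling 10 modulo 13 gives back 7.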