Argon2::hash_password_into
should use fallible memory allocations
#566
Comments
Your claim that it "crashes on 32-bit architectures" seems misleading, in that we do run CI on 32-bit architectures successfully. It seems the real issue is that you're attempting to allocate 2 GiB of RAM on a system which doesn't have it available. Yes, it could potentially use fallible allocations; however, there is no stable API in Rust for allocating memory fallibly, and I am not sure there is a 3rd-party crate for this which is both trustworthy and portable. In the meantime you can use whatever fallible allocation solution you want in conjunction with |
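For reference, the argon2 crate does expose `Argon2::hash_password_into_with_memory`, which hashes into a caller-provided block buffer and so leaves the allocation strategy entirely to the caller. Below is a minimal sketch of pairing it with `Vec::try_reserve_exact`; the parameter values and error handling are illustrative only, not a recommendation from this thread.

```rust
use argon2::{Algorithm, Argon2, Block, Params, Version};

fn hash_without_aborting(pwd: &[u8], salt: &[u8], out: &mut [u8]) -> Result<(), String> {
    // First RECOMMENDED option from RFC 9106: m = 2 GiB (2^21 KiB), t = 1, p = 4.
    let params = Params::new(1 << 21, 1, 4, Some(out.len())).map_err(|e| e.to_string())?;
    let ctx = Argon2::new(Algorithm::Argon2id, Version::V0x13, params.clone());

    // Each Block is 1 KiB, so the block count equals m_cost (which is given in KiB).
    let block_count = params.m_cost() as usize;

    // Fallible allocation: returns Err instead of aborting the process on failure.
    let mut blocks: Vec<Block> = Vec::new();
    blocks
        .try_reserve_exact(block_count)
        .map_err(|e| e.to_string())?;
    blocks.resize(block_count, Block::default());

    ctx.hash_password_into_with_memory(pwd, salt, out, &mut blocks)
        .map_err(|e| e.to_string())
}
```

As noted further down in the thread, even a successful fallible allocation does not protect against the Linux OOM killer.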
I should also note that fallible allocations don't always work the way you expect. Linux has an OOM killer, and will terminate other processes to satisfy requests for large amounts of memory on low-memory systems. So unfortunately simply adding fallible allocations may not be "safe" in the way you expect. It would definitely be better to use less memory on low-resource systems to begin with, by selecting parameters with a lower memory cost. |
I reproduced the issue that was reported to me by running a 32-bit build of Debian using Docker, with plenty of memory available. I'm pretty sure that I'm looking at address space exhaustion here, as in, there is no contiguous block of unused address space of the required size. Original bug report, "sequoia-openpgp v2.0.0-alpha.2 test failure on i686 (32-bit x86)": https://gitlab.com/sequoia-pgp/sequoia/-/issues/1157 The test tries to unlock the locked sample key from RFC 9580, using the most obvious API offered by this crate. My feedback was that the obvious invocation of Argon2, with recommended parameter choices, leads to a panic, which I consider very surprising, and this kind of surprise isn't great.
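For context, a rough back-of-the-envelope of the address-space pressure involved; the numbers follow RFC 9106's first recommended option and nothing here is specific to this crate:

```rust
fn main() {
    // RFC 9106, first RECOMMENDED option: m = 2 GiB, i.e. 2^21 blocks of 1 KiB each.
    let blocks: u64 = 1 << 21;
    let buffer_bytes = blocks * 1024; // 2^31 bytes = 2 GiB
    // The full 32-bit address space is 4 GiB, of which far less is typically
    // available as a single contiguous hole in a running process.
    let full_32bit_address_space: u64 = 1 << 32;
    println!(
        "Argon2 buffer: {buffer_bytes} bytes = {}% of a 32-bit address space",
        buffer_bytes * 100 / full_32bit_address_space
    );
}
```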
That is what I went with. I ended up using
I am aware.
As a consumer of OpenPGP artifacts, I don't have the choice of parameters; the sender chose them. Now, maybe the sender chose poorly, and I won't be able to decrypt it; that is an okay outcome. Panicking the application is not. |
Please provide a complete reproduction of the problem |
Well, then you should also be aware that if you request too much memory, the OOM killer can kill your process. You need to detect how much memory is available, calculate how much memory the parameters will use, and if it's too much, then fail yourself, or you're still at risk of the process being terminated by the kernel. |
Vec::try_reserve is stable fwiw: https://doc.rust-lang.org/std/vec/struct.Vec.html#method.try_reserve |
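A tiny illustration of the difference in failure mode; on a 32-bit target the request below exceeds the maximum `Vec` capacity and returns an error instead of panicking or aborting:

```rust
fn main() {
    let mut buf: Vec<u8> = Vec::new();
    // 2 GiB: on 32-bit targets this exceeds isize::MAX bytes and fails cleanly,
    // whereas an infallible allocation of the same size would panic or abort.
    match buf.try_reserve_exact(2 * 1024 * 1024 * 1024) {
        Ok(()) => println!("reserved 2 GiB"),
        Err(e) => println!("allocation failed: {e}"),
    }
}
```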
A better option could be to use the |
For posterity, this is the gymnastics that hashbrown goes through: https://github.com/rust-lang/hashbrown/blob/master/src/raw/alloc.rs |
@newpavlov what API are you talking about that's stable? e.g. the |
Relaying another suggestion: we could take an optional memory limit parameter, and return an error if the given parameters require more memory than the limit |
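Such a limit is only a suggestion at this point, not an existing argon2 API; a caller-side sketch of the same idea (the helper name and the 256 MiB cap are made up for illustration):

```rust
use argon2::Params;

/// Hypothetical helper: reject parameter sets whose block buffer would exceed
/// a caller-chosen byte limit, before attempting any allocation.
fn check_memory_limit(params: &Params, limit_bytes: u64) -> Result<(), String> {
    // Argon2 allocates m_cost blocks of 1 KiB each.
    let required = u64::from(params.m_cost()) * 1024;
    if required > limit_bytes {
        return Err(format!(
            "parameters require {required} bytes, limit is {limit_bytes} bytes"
        ));
    }
    Ok(())
}

fn main() {
    // First RECOMMENDED option from RFC 9106: m = 2 GiB (2^21 KiB), t = 1, p = 4.
    let params = Params::new(1 << 21, 1, 4, None).unwrap();
    // e.g. cap the buffer at 256 MiB on memory-constrained targets
    println!("{:?}", check_memory_limit(&params, 256 * 1024 * 1024));
}
```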
Why do you need the |
I wasn’t aware, hence why I was asking. What makes that better than |
It's a more fundamental API. We don't need anything from |
It seems like a lot more work to integrate due to having to deal with uninitialized memory, and it might also complicate #547. @newpavlov are you going to do the work to integrate it? |
Note that we also have the
I could try it a bit later. It should be easy enough to replace this line (IIUC it's the only place which performs heap allocation). |
One problem is that (I think this is what @tarcieri meant by "[fallible allocation] might also complicate #547"). That said, I don't see why the workaround of allocating bytes (and transmuting into an aligned slice of blocks) could not be implemented on top of fallible allocation. But could you also consider that case, besides what's in the master branch, and maybe test and benchmark how it would look?
|
I thought that for big enough allocations (e.g. more than 16 pages) allocators usually just
But with |
Overallocating with alloc_zeroed and subslicing to the correct alignment does seem like the best option here |
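A sketch of the over-allocate-and-subslice idea, using a fallible `Vec<u8>` of zeroed bytes rather than raw `alloc_zeroed` to keep it short; the `Block` below is a stand-in with the same size and alignment as argon2's, and the zero-fill keeps the example clear of uninitialized memory:

```rust
/// Stand-in with the same layout as argon2's 1 KiB block type.
#[repr(align(64))]
#[derive(Clone, Copy)]
struct Block([u64; 128]);

/// Over-allocate a byte buffer fallibly, then hand out a correctly aligned
/// sub-slice of `count` blocks.
fn with_aligned_blocks<R>(count: usize, f: impl FnOnce(&mut [Block]) -> R) -> Option<R> {
    let size = count.checked_mul(core::mem::size_of::<Block>())?;
    // Pad by (align - 1) bytes so an aligned run of `count` blocks must fit.
    let padded = size.checked_add(core::mem::align_of::<Block>() - 1)?;

    // Fallible allocation of plain bytes (no alignment requirement beyond 1).
    let mut bytes: Vec<u8> = Vec::new();
    bytes.try_reserve_exact(padded).ok()?;
    bytes.resize(padded, 0);

    // SAFETY: Block is plain old data (just u64s), so reinterpreting zeroed,
    // properly aligned bytes as Blocks is sound.
    let (_prefix, blocks, _suffix) = unsafe { bytes.align_to_mut::<Block>() };
    debug_assert!(blocks.len() >= count);
    Some(f(&mut blocks[..count]))
}

fn main() {
    // 2^21 blocks = the 2 GiB buffer from RFC 9106's first recommended option.
    match with_aligned_blocks(1 << 21, |blocks| blocks.len()) {
        Some(n) => println!("got {n} aligned blocks"),
        None => println!("allocation failed"),
    }
}
```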
This is weird. You mean that I don't see any meaningful difference with the snippet like this (i.e. changing

```rust
extern crate alloc;
use alloc::alloc::{dealloc, alloc_zeroed, Layout};

const SIZE: usize = 1 << 20;
const ALIGN: usize = 64;

// Use extern fn to prevent compiler optimizations
unsafe extern "C" {
    fn use_mem(p: *mut u8, val: u32);
}

mod inner {
    #[unsafe(no_mangle)]
    #[inline(never)]
    unsafe extern "C" fn use_mem(p: *mut u8, val: u32) {
        unsafe { core::ptr::write_bytes(p, val as u8, super::SIZE); }
    }
}

fn main() {
    let l = Layout::from_size_align(SIZE, ALIGN).unwrap();
    let t = std::time::Instant::now();
    for i in 0..1_000_000_000 {
        unsafe {
            let p = alloc_zeroed(l);
            if p.is_null() {
                panic!()
            }
            use_mem(p, i);
            dealloc(p, l);
        }
    }
    println!("{:?}", t.elapsed());
}
```

I understand that this "benchmark" is problematic, so I would appreciate it if you could provide a small reproduction of your issue. |
At least on my apple silicon, what I am observing is that I cannot speak for @jonasmalacofilho's previous benchmarks though. Perhaps the interleaving of block hashes and page mappings performs better than mapping all-at-once. |
This may be an implementation detail of the used allocator. I think we should prefer code which directly conveys the desired intent, rather than trying to work around potentially weird allocator issues. We also could directly call |
(Before I forget again, I'm testing/looking at x86_64 Linux).
That's documented for glibc malloc, at an even lower threshold than that. However, the problem isn't always the allocation per se; it can also have to do with ensuring the memory is zeroed. For small alignments So for larger alignments,
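To make the alignment effect concrete, a small timing harness (assuming glibc on x86_64 Linux; other allocators may behave differently): the 16-byte-aligned request can be served by calloc with pages that are already zero, while the 64-byte-aligned one typically pays an explicit memset inside `alloc_zeroed`.

```rust
use std::alloc::{alloc_zeroed, dealloc, Layout};
use std::time::Instant;

// Time `alloc_zeroed` + `dealloc` for a 1 GiB buffer at two alignments.
fn time_alloc_zeroed(align: usize) {
    const SIZE: usize = 1 << 30; // 1 GiB
    let layout = Layout::from_size_align(SIZE, align).unwrap();
    let start = Instant::now();
    unsafe {
        let ptr = alloc_zeroed(layout);
        assert!(!ptr.is_null());
        // Touch one byte per page so the mapping is actually materialized.
        for offset in (0..SIZE).step_by(4096) {
            std::ptr::write_volatile(ptr.add(offset), 1);
        }
        dealloc(ptr, layout);
    }
    println!("align {align:>2}: {:?}", start.elapsed());
}

fn main() {
    time_alloc_zeroed(16);
    time_alloc_zeroed(64);
}
```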
Yeah, my bad. I should have pointed out that the benchmarks in that commit message were representative of what I saw with
Well, I think that has to do with the fact that you're only allocating 1 MiB. If I increase your

Here's the higher-level benchmark I used back then:

```rust
#![allow(dead_code)]
#![deny(unsafe_op_in_unsafe_fn)]

use std::alloc::{self, Layout};
use std::hint::black_box;

use criterion::{criterion_group, criterion_main, Criterion, Throughput};

#[derive(Debug, Clone, Copy)]
#[repr(align(64))]
struct Block([u64; 128]);

impl Default for Block {
    fn default() -> Self {
        Self([0; 128])
    }
}

const COUNT: usize = 1024 * 1024;
const SIZE: usize = size_of::<Block>();

fn bench(c: &mut Criterion) {
    let mut group = c.benchmark_group("bench");
    group.throughput(Throughput::Bytes((SIZE * COUNT) as u64)); // 1 GiB

    group.bench_function("bytes with vec! macro", |b| {
        b.iter(|| vec![0u8; black_box(SIZE * COUNT)])
    });

    group.bench_function("42u8 with vec! macro", |b| {
        b.iter(|| vec![42u8; black_box(SIZE * COUNT)])
    });

    group.bench_function("Blocks with vec! macro", |b| {
        b.iter(|| vec![Block::default(); black_box(COUNT)])
    });

    group.bench_function(
        "Blocks with alloc::alloc_zeroed then Vec::from_raw_parts",
        |b| {
            b.iter(|| {
                let count = black_box(COUNT);
                let layout = Layout::array::<Block>(count).unwrap();
                assert_eq!(layout.size(), SIZE * COUNT);
                let ptr = unsafe { alloc::alloc_zeroed(layout) };
                if ptr.is_null() {
                    alloc::handle_alloc_error(layout);
                }
                let vec: Vec<Block> = unsafe { Vec::from_raw_parts(ptr.cast(), count, count) };
                vec
            })
        },
    );

    group.finish();
}

criterion_group!(benches, bench);
criterion_main!(benches);
```

|
Note that the benchmark in my previous comment is also flawed, of course. For one, it doesn't take into account overcommit, which is probably why But the underlying point is that not carefully allocating the buffer results in one extra single-threaded pass over memory, which is significant when the algorithm we're interested in basically only does a few more passes over it.
I get your point, but a 27.5% performance hit on m=1GiB p=4 is significant. And it gets relatively worse as memory and/or lane counts increase. |
IIUC on a long-running system we need the zeroization pass either way. The difference is only whether it will be done by the OS in kernel space (if we For now I think we should merge the
Alternatively, we could try using uninit memory. But it certainly should be implemented as a separate feature. |
I think there's an issue with that argument: if the OS does it, it will continue to do it regardless of whether we do it too (unless we find a way to explicitly opt out of that, if that's even possible). So we're still talking about one extra memory pass, and not the same amount of work.
I'm ok with that.
I think an argon2 slice in the first pass can sometimes read (zeroes) from later slices? I'm not sure, and either way it seems tricky to guarantee that uninit memory is never possibly read. |
Well, there is I was talking about a slightly different thing. A common optimization for allocators is to do the following:
The Existence of
I am not sure. If it's true, then we are out of luck. |
Yes, that could happen, given the right allocator and circumstances. But I think you're overestimating how common that scenario is. Note that I haven't yet been able to trigger it with a benchmark (either the one above or the full argon2 benchmarks I ran when trying different allocation strategies for #547). And a benchmark is basically the ideal case to observe what you described: we allocate and free the exact same (large) size over and over again. This is on a fairly common platform using a fairly common allocator (glibc on x86_64 Linux).

Your argument also doesn't apply to use cases where it's desirable for argon2 to be as fast as possible with high memory sizes, but it isn't called over and over again. Like a password manager, for example, where one-off latency limits the choice of argon2 parameters. So I just want to ask you to keep the ugly, but (at least sometimes) effective, oversized-buffer-of-bytes-into-aligned-slice workaround on the table, at least when thinking about #547. |
I am open to adding this workaround, but I think we should do it in a separate PR after additional investigation. |
Hi, reporter of the original issue here. From what I can tell, the actual issue here is that the So with the argon2 parameters requested in sequoia-openpgp ( It might be better to use the recommended argon2 parameters for "resource constrained systems" on 32-bit architectures instead of hard-erroring on use of argon2 on those systems? |
I don't think we should change default algorithm parameters depending on target, since it may cause portability issues. But it may be reasonable to change the default for all targets. Please open a separate issue for that. |
That would mean an allocation of 2^42 bytes, no? Much larger than
|
I was talking about this line in the
As you can see here it works as documented. |
@newpavlov, sorry, I misunderstood the point you were trying to make. @decathorpe didn't suggest that |
Ah, I also slightly misread this comment. I thought that here:
"Layout calculation" was referring to the |
Argon2::hash_password_into crashes on 32-bit architectures when using the FIRST RECOMMENDED parameter option (see Section 4 of RFC 9106), because it tries to infallibly allocate the 2 GiB buffer. We believe that straightforward, easy-to-use interfaces should be safe by default. Following that, Argon2::hash_password_into should make a fallible allocation, returning allocation errors instead of crashing.