
Argon2::hash_password_into should use fallible memory allocations #566

Closed
teythoon opened this issue Mar 3, 2025 · 35 comments · Fixed by #568
Comments

@teythoon

teythoon commented Mar 3, 2025

Argon2::hash_password_into crashes on 32-bit architectures when using the FIRST RECOMMENDED parameter option (see Section 4 of RFC 9106) because it tries to infallibly allocate the 2 GiB buffer.

We believe that straightforward, easy-to-use interfaces should be safe by default. Following that, Argon2::hash_password_into should make a fallible allocation, returning allocation errors instead of crashing.

@tarcieri tarcieri changed the title Argon2::hash_password_into is not safe Argon2::hash_password_into should use fallible memory allocations Mar 3, 2025
@tarcieri
Member

tarcieri commented Mar 3, 2025

Your claim that it "crashes on 32-bit architectures" seems misleading, in that we do run CI on 32-bit architectures successfully:

https://github.com/RustCrypto/password-hashes/blob/master/.github/workflows/argon2.yml#L88-L105

It seems the real issue is you're attempting to allocate 2GiB of RAM on a system which doesn't have it available.

Yes, it could potentially use fallible allocations; however, there is no stable API in Rust for allocating memory fallibly, as the Allocator trait is currently unstable.

Until the Allocator trait is stabilized, I am not sure there is a 3rd-party crate for this which is both trustworthy and portable. We also generally dislike using 3rd-party crates unless absolutely necessary.

In the meantime you can use whatever fallible allocation solution you want in conjunction with hash_password_into_with_memory.
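For example, something along these lines should work (a rough sketch only; it assumes the std feature is enabled and that Params::block_count is available to size the buffer):

use argon2::{Algorithm, Argon2, Block, Params, Version};

fn hash_with_fallible_alloc(
    password: &[u8],
    salt: &[u8],
    params: Params,
    out: &mut [u8],
) -> Result<(), Box<dyn std::error::Error>> {
    let argon2 = Argon2::new(Algorithm::Argon2id, Version::V0x13, params.clone());

    // Reserve the block buffer fallibly instead of letting `vec![...]` abort on failure.
    let block_count = params.block_count();
    let mut blocks: Vec<Block> = Vec::new();
    blocks.try_reserve_exact(block_count)?;
    blocks.resize(block_count, Block::default());

    argon2.hash_password_into_with_memory(password, salt, out, blocks.as_mut_slice())?;
    Ok(())
}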

@tarcieri
Member

tarcieri commented Mar 3, 2025

I should also note that fallible allocations don't always work the way you expect. Linux has an OOM killer, and will terminate other processes to satisfy requests for large amounts of memory on low-memory systems. So unfortunately simply adding fallible allocations may not be "safe" in the way you expect.

It would definitely be better to use less memory on low-resource systems to begin with, by selecting parameters which use less memory.

@teythoon
Author

teythoon commented Mar 3, 2025

Your claim it "crashes on 32-bit architectures" seems misleading in that we do run CI on 32-bit architectures successfully:
It seems the real issue is you're attempting to allocate 2GiB of RAM on a system which doesn't have it available.

I reproduced the issue that was reported to me by running a 32-bit build of Debian using Docker, with plenty of memory available. I'm pretty sure that I'm looking at address space exhaustion here, as in, there is no contiguous block of unused address space with the required size.

Original bug report "sequoia-openpgp v2.0.0-alpha.2 test failure on i686 (32-bit x86)": https://gitlab.com/sequoia-pgp/sequoia/-/issues/1157

The test tries to unlock the locked sample key from RFC 9580, using the most obvious API offered by this crate. My feedback was that the obvious invocation of Argon2, with recommended parameter choices, leads to a panic, which I consider very surprising, and this kind of surprise isn't great.

In the meantime you can use whatever fallible allocation solution you want in conjunction with hash_password_into_with_memory

That is what I went with. I ended up using std::alloc::alloc: https://gitlab.com/sequoia-pgp/sequoia/-/commit/3325c74c241654bc34859d95092f0b4f9b280f2e

I should also note that fallible allocations don't always work the way you expect. Linux has an OOM killer, and will terminate other processes to satisfy requests for large amounts of memory on low-memory systems. So unfortunately simply adding fallible allocations may not be "safe" in the way you expect.

I am aware.

It would definitely be better to use less memory on low-resource systems to begin with, by selecting parameters which use less memory.

As a consumer of OpenPGP artifacts, I don't have the choice of parameters, the sender chose. Now, maybe the sender chose poorly, and I won't be able to decrypt it, that is an okay outcome. Panic'ing the application is not.

@tarcieri
Member

tarcieri commented Mar 3, 2025

Please provide a complete reproduction of the problem

@tarcieri
Member

tarcieri commented Mar 3, 2025

I am aware.

Well, then you should also be aware that if you request too much memory, the OOM killer can kill your process.

You need to detect how much memory is available, calculate how much memory the parameters will use, and if it's too much, then fail yourself, or you're still at risk of the process being terminated by the kernel.

@conradludgate
Contributor

Vec::try_reserve is stable fwiw: https://doc.rust-lang.org/std/vec/struct.Vec.html#method.try_reserve

@newpavlov
Member

A better option could be to use the alloc API directly. It would involve a bit of unsafe code, but it should be fairly straightforward and we already have unsafe code in argon2.

@conradludgate
Contributor

For posterity, this is the gymnastics that hashbrown goes through: https://github.com/rust-lang/hashbrown/blob/master/src/raw/alloc.rs

@tarcieri
Member

tarcieri commented Mar 3, 2025

@newpavlov what API are you talking about that's stable? e.g. the Allocator trait is unstable

@tarcieri
Member

tarcieri commented Mar 3, 2025

Relaying another suggestion: we could take an optional memory limit parameter, and return an error if the given parameters require more memory than the limit.
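For instance (a hypothetical check only; it assumes Params::block_count is available and uses the fixed 1 KiB Argon2 block size):

use argon2::Params;

/// Hypothetical pre-flight check: fail early if the given parameters would need
/// more memory than the caller-supplied limit (in bytes).
fn check_memory_limit(params: &Params, limit: usize) -> Result<(), &'static str> {
    // Each Argon2 block is 1 KiB.
    let needed = params
        .block_count()
        .checked_mul(1024)
        .ok_or("memory requirement overflows usize")?;
    if needed > limit {
        return Err("Argon2 parameters exceed the configured memory limit");
    }
    Ok(())
}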

@newpavlov
Member

Why do you need the Allocator trait here? I was talking about the alloc function.

@tarcieri
Member

tarcieri commented Mar 3, 2025

I wasn’t aware, hence why I was asking.

What makes that better than try_reserve though?

@newpavlov
Member

It's a more fundamental API. We don't need anything from Vec and know the desired memory size right away.

@tarcieri
Member

tarcieri commented Mar 3, 2025

It seems like a lot more work to integrate due to having to deal with uninitialized memory, and it might also complicate #547.

@newpavlov are you going to do the work to integrate it?

@newpavlov
Member

newpavlov commented Mar 3, 2025

Note that we also have the alloc_zeroed function.

are you going to do the work to integrate it?

I could try it a bit later. It should be easy enough to replace this line (IIUC it's the only place which performs heap allocation).
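Roughly, the replacement could look like this (just a sketch; the real code would map the failure into the crate's Error type):

use std::alloc::{alloc_zeroed, Layout};

use argon2::Block;

/// Sketch: fallibly allocate `count` zeroed Blocks, as a fallible counterpart
/// of `vec![Block::default(); count]`.
fn alloc_blocks(count: usize) -> Result<Box<[Block]>, &'static str> {
    if count == 0 {
        return Ok(Vec::new().into_boxed_slice());
    }
    let layout = Layout::array::<Block>(count).map_err(|_| "layout overflow")?;
    let ptr = unsafe { alloc_zeroed(layout) };
    if ptr.is_null() {
        return Err("allocation failure");
    }
    // SAFETY: `ptr` was just allocated with the layout of `[Block; count]` and is
    // zero-initialized; an all-zero byte pattern is a valid Block.
    let blocks = core::ptr::slice_from_raw_parts_mut(ptr.cast::<Block>(), count);
    Ok(unsafe { Box::from_raw(blocks) })
}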

@jonasmalacofilho

@newpavlov,

One problem is that alloc/alloc_zeroed are slow, due to the alignment requirement for Block, which forces Rust into a slow path. And this makes a noticeable difference when parallelism is enabled (#547) and large memory sizes are used (which is already assumed in this discussion). Please see 018c3e9 for some benchmark results, and the workaround I came up with for #547 (see footnote 1).

(I think this is what @tarcieri meant by "[fallible allocation] might also complicate #547").

That said, I don't see why the workaround of allocating bytes (and transmuting into an aligned slice of blocks) could not be implemented on top of fallible allocation. But could you also consider that case, besides what's in the master branch, and maybe test and benchmark how it would look?

Footnotes

  1. In fact, alloc_zeroed was one of the things I tried first, which is how I know it's (as) slow (as allocating a large Vec of blocks).

@newpavlov
Member

newpavlov commented Mar 4, 2025

I thought that for big enough allocations (e.g. more than 16 pages) allocators usually just mmap the requested memory (they could also reuse a previously mmapped chunk marked with MADV_FREE, but that's not relevant to this discussion).

The issue lies in vec![Block::default(); ...], which clones the supplied block.

But with alloc_zeroed we don't need any cloning, no? We just allocate the zeroed memory and cast it into effectively Box<[Block]>.

@conradludgate
Contributor

Overallocating with alloc_zeroed and subslicing to the correct alignment does seem like the best option here
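A rough sketch of what that could look like (with a stand-in Block type; the resulting slice would be handed to hash_password_into_with_memory, and it would still need to be benchmarked against #547):

use std::alloc::{alloc_zeroed, dealloc, Layout};
use std::mem::{align_of, size_of};

#[repr(align(64))]
struct Block([u64; 128]); // stand-in for argon2::Block (1 KiB, 64-byte aligned)

/// Owns an over-allocated, byte-aligned buffer and exposes a properly aligned
/// slice of zeroed Blocks carved out of it.
struct BlockBuffer {
    base: *mut u8,
    layout: Layout,
    blocks: *mut Block,
    count: usize,
}

impl BlockBuffer {
    fn new(count: usize) -> Option<Self> {
        // Request `count` blocks plus enough slack to realign, with alignment 1,
        // so the allocator can stay on its calloc fast path.
        let size = count
            .checked_mul(size_of::<Block>())?
            .checked_add(align_of::<Block>() - 1)?;
        let layout = Layout::from_size_align(size, 1).ok()?;
        let base = unsafe { alloc_zeroed(layout) };
        if base.is_null() {
            return None; // fallible: report failure instead of aborting
        }
        // Round the start of the Block slice up to the next 64-byte boundary.
        let blocks = unsafe { base.add(base.align_offset(align_of::<Block>())) }.cast::<Block>();
        Some(Self { base, layout, blocks, count })
    }

    fn as_mut_slice(&mut self) -> &mut [Block] {
        // SAFETY: `blocks` points at `count` properly aligned, zero-initialized Blocks.
        unsafe { core::slice::from_raw_parts_mut(self.blocks, self.count) }
    }
}

impl Drop for BlockBuffer {
    fn drop(&mut self) {
        // SAFETY: `base` was allocated with `layout` above.
        unsafe { dealloc(self.base, self.layout) }
    }
}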

@newpavlov
Member

This is weird. You mean that alloc_zeroed is faster with alignment of 1 than with alignment of 64?

I don't see any meaningful difference with a snippet like this (i.e. changing ALIGN to 1 has no measurable performance impact):

extern crate alloc;

use alloc::alloc::{dealloc, alloc_zeroed, Layout};

const SIZE: usize = 1 << 20;
const ALIGN: usize = 64;

// Use extern fn to prevent compiler optimizations
unsafe extern "C" {
    fn use_mem(p: *mut u8, val: u32);
}


mod inner {
    #[unsafe(no_mangle)]
    #[inline(never)]
    unsafe extern "C" fn use_mem(p: *mut u8, val: u32) {
        unsafe { core::ptr::write_bytes(p, val as u8, super::SIZE); }
    }
}

fn main() {
    let l = Layout::from_size_align(SIZE, ALIGN).unwrap();

    let t = std::time::Instant::now();
    for i in 0..1_000_000_000 {
        unsafe {
            let p = alloc_zeroed(l);
            if p.is_null() {
                panic!()
            }
            use_mem(p, i);
            dealloc(p, l);
        }
    }
    println!("{:?}", t.elapsed());
}

I understand that this "benchmark" is problematic, so I would appreciate it if you could provide a small reproduction of your issue.

@conradludgate
Contributor

At least on my Apple Silicon machine, what I am observing is that alloc_zeroed when aligned is actually mapping the pages, whereas when unaligned it is seemingly unmapped until used. This makes unaligned look faster to allocate, but when actually touching the pages, the slowdown is equal.

I cannot speak for @jonasmalacofilho's previous benchmarks though. Perhaps the interleaving of block hashes and page mappings performs better than mapping all-at-once.

@newpavlov
Member

newpavlov commented Mar 4, 2025

This may be an implementation detail of the allocator used. I think we should prefer code which directly conveys the desired intent, rather than trying to work around potential weird allocator issues.

We could also directly call mmap on some systems to skip the allocator completely. Though it may be less efficient in the end, since the allocator may reuse previously mapped memory without doing any syscalls.

@jonasmalacofilho

(Before I forget again, I'm testing/looking at x86_64 Linux).

I thought that for big enough allocations (e.g. more than 16 pages) allocators usually just mmap the requested memory (...).

That's documented for glibc malloc, at an even lower threshold than that. However, the problem isn't always the allocation per se; it can also have to do with ensuring the memory is zeroed.

For small alignments alloc_zeroed on Unix/Linux calls calloc, which ensures the memory is both aligned and zeroed. But calloc's guarantees of alignment are limited to what C calls "fundamental alignment", which is basically size_of::<usize>().

So for larger alignments, alloc_zeroed falls back to allocating with the requested alignment, then zeroing the memory. And the latter is the slow part, when compared to the calloc fast path, because, on Linux, calloc can guarantee that the returned memory is zeroed without generally doing any work at all (at least at these large allocation sizes).

But with alloc_zeroed we don't need any cloning, no? We just allocate the zeroed memory and cast it into effectively Box<[Block]>.

Yeah, my bad. I should have pointed out that the benchmarks in that commit message were representative of what I saw with alloc_zeroed. But you're right, of course, that cloning isn't involved with alloc_zeroed.

I don't see any meaningful difference with the snippet like this (i.e. changing ALIGN to 1 has no measurable performance impact):

Well, I think that has to do with the fact that you're only allocating 1 MiB. If I increase your SIZE parameter to 1 GiB, I see roughly a 25% drop in performance between 1 and 64 byte alignment: with 100 iterations, from 7.9s to 10.1s.


Here's the higher-level benchmark I used back then:

#![allow(dead_code)]
#![deny(unsafe_op_in_unsafe_fn)]

use std::alloc::{self, Layout};
use std::hint::black_box;

use criterion::{criterion_group, criterion_main, Criterion, Throughput};

#[derive(Debug, Clone, Copy)]
#[repr(align(64))]
struct Block([u64; 128]);

impl Default for Block {
    fn default() -> Self {
        Self([0; 128])
    }
}

const COUNT: usize = 1024 * 1024;
const SIZE: usize = size_of::<Block>();

fn bench(c: &mut Criterion) {
    let mut group = c.benchmark_group("bench");
    group.throughput(Throughput::Bytes((SIZE * COUNT) as u64)); // 1 GiB

    group.bench_function("bytes with vec! macro", |b| {
        b.iter(|| vec![0u8; black_box(SIZE * COUNT)])
    });

    group.bench_function("42u8 with vec! macro", |b| {
        b.iter(|| vec![42u8; black_box(SIZE * COUNT)])
    });

    group.bench_function("Blocks with vec! macro", |b| {
        b.iter(|| vec![Block::default(); black_box(COUNT)])
    });

    group.bench_function(
        "Blocks with alloc::alloc_zeroed then Vec::from_raw_parts",
        |b| {
            b.iter(|| {
                let count = black_box(COUNT);
                let layout = Layout::array::<Block>(count).unwrap();
                assert_eq!(layout.size(), SIZE * COUNT);
                let ptr = unsafe { alloc::alloc_zeroed(layout) };
                if ptr.is_null() {
                    alloc::handle_alloc_error(layout);
                }
                let vec: Vec<Block> = unsafe { Vec::from_raw_parts(ptr.cast(), count, count) };
                vec
            })
        },
    );

    group.finish();
}

criterion_group!(benches, bench);
criterion_main!(benches);
bench/bytes with vec! macro                                                                             
                        time:   [9.3414 µs 9.3510 µs 9.3612 µs]
                        thrpt:  [106824 GiB/s 106940 GiB/s 107051 GiB/s]
bench/42u8 with vec! macro                                                                            
                        time:   [78.610 ms 78.696 ms 78.788 ms]
                        thrpt:  [12.692 GiB/s 12.707 GiB/s 12.721 GiB/s]
bench/Blocks with vec! macro                                                                            
                        time:   [64.976 ms 65.099 ms 65.217 ms]
                        thrpt:  [15.333 GiB/s 15.361 GiB/s 15.390 GiB/s]
bench/Blocks with alloc::alloc_zeroed then Vec::from_raw_parts                                                                            
                        time:   [78.990 ms 79.199 ms 79.406 ms]
                        thrpt:  [12.594 GiB/s 12.626 GiB/s 12.660 GiB/s]

@jonasmalacofilho

Note that the benchmark in my previous comment is also flawed, of course. For one, it doesn't take into account overcommit, which is probably why vec![0u8; ...] was so suspiciously fast.

But the underlying point is that not carefully allocating the buffer results in one extra single-threaded pass over memory, which is significant when the algorithm we're interested in basically only does a few more passes over it.

This is may be an implementation detail of the used allocator. I think we should prefer code which directly conveys the desired intent, than trying to work around potential weird allocator issues.

I get your point, but a 27.5% performance hit on m=1GiB p=4 is significant. And it gets relatively worse as memory and/or lane counts increase.

@newpavlov
Member

newpavlov commented Mar 4, 2025

IIUC on a long-running system we need the zeroization pass either way. The difference is only whether it will be done by the OS in kernel space (if we mmap new memory) or by an allocator in user space (if we reuse previously mapped memory). In the former case, this zeroization may not be reflected as time spent in your program, but the system as a whole still performs the same amount of work.

For now I think we should merge the alloc_zeroed PR and continue discussion of potential improvements in a separate PR/issue.

@newpavlov
Member

Alternatively, we could try using uninit memory. But it certainly should be implemented as a separate feature.

@jonasmalacofilho

IIUC on a long-running system we need the zeroization pass either way. The difference is only whether it will be done by the OS in kernel space or by an allocator in user space. In the former case, this zeroization may not be reflected as time spent in your program, but the system as a whole still performs the same amount of work.

I think there's an issue with that argument: if the OS does it, it will continue to do it regardless of whether we do it too (unless we find a way to explicitly opt out of that, if that's even possible). So we're still talking about one extra memory pass, and not the same amount of work.

For now I think we should merge the alloc_zeroed PR and continue discussion of potential improvements in a separate PR/issue.

I'm ok with that.

Alternatively, we could try using uninit memory. But it certainly should be implemented as a separate feature.

I think an argon2 slice in the first pass can sometimes read (zeroes) from later slices? I'm not sure, and either way it seems tricky to guarantee that uninit memory is never possibly read.

@newpavlov
Member

newpavlov commented Mar 4, 2025

I think there's an issue with that argument: if the OS does it, it will continue to do it regardless of whether we do it too (unless we find a way to explicitly opt out of that, if that's even possible).

Well, there is MAP_UNINITIALIZED, but it's not used outside of embedded devices.

I was talking about a slightly different thing. A common optimization for allocators is to do the following:

  • mmap a huge chunk of memory with MAP_ANONYMOUS (the pages get zeroized by the OS either immediately, or on the first page touch).
  • On deallocation, keep the mapped memory, but mark it with MADV_FREE.
  • On a future non-zeroizing allocation, hand out the MADV_FREE memory instead of mapping new memory.

The MADV_FREE memory is semantically "uninitialized". Not only may it contain garbage data from the previous allocation but, even worse, until we write to it its contents may change to zeros "under our feet" (i.e. the OS may reclaim the physical page and replace it with the "zero page").

The existence of MADV_FREE is the classic argument for why properly managing uninit memory is hard in Rust.

I think an argon2 slice in the first pass can sometimes read (zeroes ) from later slices?

I am not sure. If it's true, then we are out of luck.

@jonasmalacofilho

Yes, that could happen, given the right allocator and circumstances. But I think you're overestimating how common that scenario is.

Note that I haven't yet been able to trigger it with a benchmark (either the one above or the full argon2 benchmarks I ran when trying different allocation strategies for #547). And a benchmark is basically the ideal case to observe what you described: we allocate and free the exact same (large) size over and over again. This on a fairly common platform using a fairly common allocator (glibc on x86_64 Linux).

Your argument also doesn't apply to use cases where it's desirable for argon2 to be as fast as possible with high memory sizes, but it isn't called over and over again. Like a password manager, for example, where one-off latency limits the choice of argon2 parameters.

So I just want to ask you to keep the ugly, but (at least sometimes) effective, oversized-buffer-of-bytes-into-aligned-slice workaround on the table, at least when thinking about #547.

@newpavlov
Member

I am open to adding this workaround, but I think we should do it in a separate PR after additional investigation.

@decathorpe

Hi, reporter of the original issue here.

From what I can tell, the actual issue here is that the vec![] call that causes this (here) fails not because the memory is not available, but because the requested allocation causes an integer overflow in Layout calculation on 32-bit architectures (the size in self.size.checked_mul(n) in Layout::repeat_packed is a usize).

So with the argon2 parameters requested in sequoia-openpgp (=2.0.0-alpha.2), the Layout calculation unconditionally panic!d the program. The workaround in the 2.0.0 release was to introduce code to fallibly allocate memory, but - if I understand this correctly - that should still unconditionally fail on 32-bit architectures for exactly the same reason - because it still needs to allocate the same amount of memory. It now just returns an error instead of hard-panic!ing. Which is better, but doesn't really solve the underlying issue.

It might be better to use the recommended argon2 parameters for "resource constrained systems" on 32-bit architectures instead of hard-erroring on use of argon2 on those systems?

@newpavlov
Member

Layout::array::<Block>(len).ok()?; should handle big buffer sizes without panicking, and IIUC Params::block_count returns a number which is always smaller than 2^32 (i.e. computations inside it cannot overflow).

It might be better to use the recommended argon2 parameters for "resource constrained systems" on 32-bit architectures instead of hard-erroring on use of argon2 on those systems?

I don't think we should change default algorithm parameters depending on target, since it may cause portability issues. But it may be reasonable to change the default for all targets. Please open a separate issue for that.

@jonasmalacofilho

jonasmalacofilho commented Mar 5, 2025

Layout::array::<Block>(len).ok()?; should handle big buffer sizes without panicking and IIUC Params::block_count returns a number which is always smaller than 2^32 (i.e. computations inside it can not overflow).

That would mean an allocation of 2^42 bytes, no? Much larger than isize::MAX on 32-bit platforms (see footnote 1).

Footnotes

  1. Also, isn't isize::MAX, not usize::MAX, the maximum allocation size?

@newpavlov
Member

I was talking about this line in the Layout::array docs:

On arithmetic overflow or when the total size would exceed isize::MAX, returns LayoutError.

As you can see here it works as documented.
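For illustration (using a stand-in Block with the same size and alignment), a check like this returns Err on a 32-bit target because the total size overflows, and Ok on 64-bit:

use std::alloc::Layout;

#[repr(align(64))]
struct Block([u64; 128]); // 1 KiB, stand-in for argon2::Block

fn main() {
    // Roughly the largest block count the parameters can describe.
    let block_count = u32::MAX as usize;
    // Err(LayoutError) on 32-bit targets (the size exceeds usize/isize::MAX), Ok on 64-bit.
    println!("{:?}", Layout::array::<Block>(block_count));
}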

@jonasmalacofilho

@newpavlov, sorry, I misunderstood the point you were trying to make. @decathorpe didn't suggest that Layout::array could panic; in fact, the workaround he mentioned uses it. So I failed to consider that you were just confirming it.

@newpavlov
Member

Ah, I also slightly misread this comment. I thought that here:

the Layout calculation unconditionally panic!d the program

"Layout calculation" was referring to the Layout::array part.
