
Argon2::hash_password_into should use fallible memory allocations #566

Closed
teythoon opened this issue Mar 3, 2025 · 35 comments · Fixed by #568
Comments

@teythoon

teythoon commented Mar 3, 2025

Argon2::hash_password_into crashes on 32-bit architectures when using the FIRST RECOMMENDED parameter option (see Section 4 of RFC 9106) because it tries to infallibly allocate the 2 GiB buffer.

We believe that straightforward, easy-to-use interfaces should be safe by default. Following that, Argon2::hash_password_into should make a fallible allocation, returning allocation errors instead of crashing.

@tarcieri tarcieri changed the title Argon2::hash_password_into is not safe Argon2::hash_password_into should use fallible memory allocations Mar 3, 2025
@tarcieri
Member

tarcieri commented Mar 3, 2025

Your claim that it "crashes on 32-bit architectures" seems misleading, in that we do run CI on 32-bit architectures successfully:

https://github.com/RustCrypto/password-hashes/blob/master/.github/workflows/argon2.yml#L88-L105

It seems the real issue is you're attempting to allocate 2GiB of RAM on a system which doesn't have it available.

Yes, it could potentially use fallible allocations; however, there is no stable API in Rust for allocating memory fallibly, as the Allocator trait is currently unstable.

Until the Allocator trait is stabilized, I am not sure there is a 3rd-party crate for this which is both trustworthy and portable. We also generally dislike using 3rd-party crates unless absolutely necessary.

In the meantime you can use whatever fallible allocation solution you want in conjunction with hash_password_into_with_memory.
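For example, something along these lines should work (a rough sketch only; it assumes the std feature is enabled and that Params::block_count is available to size the buffer):

use argon2::{Algorithm, Argon2, Block, Params, Version};

fn hash_with_fallible_alloc(
    password: &[u8],
    salt: &[u8],
    params: Params,
    out: &mut [u8],
) -> Result<(), Box<dyn std::error::Error>> {
    let argon2 = Argon2::new(Algorithm::Argon2id, Version::V0x13, params.clone());

    // Reserve the block buffer fallibly instead of letting `vec![...]` abort on failure.
    let block_count = params.block_count();
    let mut blocks: Vec<Block> = Vec::new();
    blocks.try_reserve_exact(block_count)?;
    blocks.resize(block_count, Block::default());

    argon2.hash_password_into_with_memory(password, salt, out, blocks.as_mut_slice())?;
    Ok(())
}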

@tarcieri
Member

tarcieri commented Mar 3, 2025

I should also note that fallible allocations don't always work the way you expect. Linux has an OOM killer, and will terminate other processes to satisfy requests for large amounts of memory on low-memory systems. So unfortunately simply adding fallible allocations may not be "safe" in the way you expect.

It would definitely be better to use less memory on low-resource systems to begin with, by selecting parameters which use less memory.

@teythoon
Author

teythoon commented Mar 3, 2025

Your claim it "crashes on 32-bit architectures" seems misleading in that we do run CI on 32-bit architectures successfully:
It seems the real issue is you're attempting to allocate 2GiB of RAM on a system which doesn't have it available.

I reproduced the issue that was reported to me by running a 32-bit build of Debian using Docker, with plenty of memory available. I'm pretty sure that I'm looking at address space exhaustion here, as in, there is no contiguous block of unused address space with the required size.

Original bug report "sequoia-openpgp v2.0.0-alpha.2 test failure on i686 (32-bit x86)": https://gitlab.com/sequoia-pgp/sequoia/-/issues/1157

The test tries to unlock the locked sample key from RFC 9580, using the most obvious API offered by this crate. My feedback was that the obvious invocation of Argon2, with recommended parameter choices, leads to a panic, which I consider very surprising, and this kind of surprise isn't great.

In the meantime you can use whatever fallible allocation solution you want in conjunction with hash_password_into_with_memory

That is what I went with. I ended up using std::alloc::alloc: https://gitlab.com/sequoia-pgp/sequoia/-/commit/3325c74c241654bc34859d95092f0b4f9b280f2e

I should also note that fallible allocations don't always work the way you expect. Linux has an OOM killer, and will terminate other processes to satisfy requests for large amounts of memory on low-memory systems. So unfortunately simply adding fallible allocations may not be "safe" in the way you expect.

I am aware.

It would definitely be better to use less memory on low-resource systems to begin with, by selecting parameters which use less memory.

As a consumer of OpenPGP artifacts, I don't have the choice of parameters, the sender chose. Now, maybe the sender chose poorly, and I won't be able to decrypt it, that is an okay outcome. Panic'ing the application is not.

@tarcieri
Member

tarcieri commented Mar 3, 2025

Please provide a complete reproduction of the problem

@tarcieri
Member

tarcieri commented Mar 3, 2025

I am aware.

Well, then you should also be aware that if you request too much memory, the OOM killer can kill your process.

You need to detect how much memory is available, calculate how much memory the parameters will use, and if it's too much, then fail yourself, or you're still at risk of the process being terminated by the kernel.

@conradludgate
Contributor

Vec::try_reserve is stable fwiw: https://doc.rust-lang.org/std/vec/struct.Vec.html#method.try_reserve

@newpavlov
Member

A better option could be to use the alloc API directly. It would involve a bit of unsafe code, but it should be fairly straightforward and we already have unsafe code in argon2.

@conradludgate
Contributor

For posterity, this is the gymnastics that hashbrown goes through: https://github.com/rust-lang/hashbrown/blob/master/src/raw/alloc.rs

@tarcieri
Member

tarcieri commented Mar 3, 2025

@newpavlov what API are you talking about that's stable? e.g. the Allocator trait is unstable

@tarcieri
Member

tarcieri commented Mar 3, 2025

Relaying another suggestion: we could take an optional memory limit parameter, and return an error if the given parameters require more memory than the limit.
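For instance (a hypothetical check only; it assumes Params::block_count is available and uses the fixed 1 KiB Argon2 block size):

use argon2::Params;

/// Hypothetical pre-flight check: fail early if the given parameters would need
/// more memory than the caller-supplied limit (in bytes).
fn check_memory_limit(params: &Params, limit: usize) -> Result<(), &'static str> {
    // Each Argon2 block is 1 KiB.
    let needed = params
        .block_count()
        .checked_mul(1024)
        .ok_or("memory requirement overflows usize")?;
    if needed > limit {
        return Err("Argon2 parameters exceed the configured memory limit");
    }
    Ok(())
}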

@newpavlov
Member

Why do you need the Allocator trait here? I was talking about the alloc function.

@tarcieri
Member

tarcieri commented Mar 3, 2025

I wasn’t aware, hence why I was asking.

What makes that better than try_reserve though?

@newpavlov
Member

It's a more fundamental API. We don't need anything from Vec and know the desired memory size right away.

@tarcieri
Member

tarcieri commented Mar 3, 2025

It seems like a lot more work to integrate due to having to deal with uninitialized memory, and it might also complicate #547.

@newpavlov are you going to do the work to integrate it?

@newpavlov
Member

newpavlov commented Mar 3, 2025

Note that we also have the alloc_zeroed function.

are you going to do the work to integrate it?

I could try it a bit later. It should be easy enough to replace this line (IIUC it's the only place which performs heap allocation).
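Roughly, the replacement could look like this (just a sketch; the real code would map the failure into the crate's Error type):

use std::alloc::{alloc_zeroed, Layout};

use argon2::Block;

/// Sketch: fallibly allocate `count` zeroed Blocks, as a fallible counterpart
/// of `vec![Block::default(); count]`.
fn alloc_blocks(count: usize) -> Result<Box<[Block]>, &'static str> {
    if count == 0 {
        return Ok(Vec::new().into_boxed_slice());
    }
    let layout = Layout::array::<Block>(count).map_err(|_| "layout overflow")?;
    let ptr = unsafe { alloc_zeroed(layout) };
    if ptr.is_null() {
        return Err("allocation failure");
    }
    // SAFETY: `ptr` was just allocated with the layout of `[Block; count]` and is
    // zero-initialized; an all-zero byte pattern is a valid Block.
    let blocks = core::ptr::slice_from_raw_parts_mut(ptr.cast::<Block>(), count);
    Ok(unsafe { Box::from_raw(blocks) })
}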

@jonasmalacofilho

@newpavlov,

One problem is that alloc/alloc_zeroed are slow, due to the alignment requirement for Block, which forces Rust into a slow path. And this makes a noticeable difference when parallelism is enabled (#547) and large memory sizes are used (which is already assumed in this discussion). Please see 018c3e9 for some benchmark results, and the workaround I came up with for #547 (see footnote 1).

(I think this is what @tarcieri meant by "[fallible allocation] might also complicate #547").

That said, I don't see why the workaround of allocating bytes (and transmuting into an aligned slice of blocks) could not be implemented on top of fallible allocation. But could you also consider that case, besides what's in the master branch, and maybe test and benchmark how it would look?

Footnotes

  1. In fact, alloc_zeroed was one of the things I tried first, which is how I know it's (as) slow (as allocating a large Vec of blocks).

@newpavlov
Member

newpavlov commented Mar 4, 2025

I thought that for big enough allocations (e.g. more than 16 pages) allocators usually just mmap the requested memory (they could also reuse a previously mmapped chunk marked with MADV_FREE, but that's not relevant to this discussion).

The issue lies in vec![Block::default(); ...], which clones the supplied block.

But with alloc_zeroed we don't need any cloning, no? We just allocate the zeroed memory and cast it into effectively Box<[Block]>.

@conradludgate
Contributor

Overallocating with alloc_zeroed and subslicing to the correct alignment does seem like the best option here
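A rough sketch of what that could look like (with a stand-in Block type; the resulting slice would be handed to hash_password_into_with_memory, and it would still need to be benchmarked against #547):

use std::alloc::{alloc_zeroed, dealloc, Layout};
use std::mem::{align_of, size_of};

#[repr(align(64))]
struct Block([u64; 128]); // stand-in for argon2::Block (1 KiB, 64-byte aligned)

/// Owns an over-allocated, byte-aligned buffer and exposes a properly aligned
/// slice of zeroed Blocks carved out of it.
struct BlockBuffer {
    base: *mut u8,
    layout: Layout,
    blocks: *mut Block,
    count: usize,
}

impl BlockBuffer {
    fn new(count: usize) -> Option<Self> {
        // Request `count` blocks plus enough slack to realign, with alignment 1,
        // so the allocator can stay on its calloc fast path.
        let size = count
            .checked_mul(size_of::<Block>())?
            .checked_add(align_of::<Block>() - 1)?;
        let layout = Layout::from_size_align(size, 1).ok()?;
        let base = unsafe { alloc_zeroed(layout) };
        if base.is_null() {
            return None; // fallible: report failure instead of aborting
        }
        // Round the start of the Block slice up to the next 64-byte boundary.
        let blocks = unsafe { base.add(base.align_offset(align_of::<Block>())) }.cast::<Block>();
        Some(Self { base, layout, blocks, count })
    }

    fn as_mut_slice(&mut self) -> &mut [Block] {
        // SAFETY: `blocks` points at `count` properly aligned, zero-initialized Blocks.
        unsafe { core::slice::from_raw_parts_mut(self.blocks, self.count) }
    }
}

impl Drop for BlockBuffer {
    fn drop(&mut self) {
        // SAFETY: `base` was allocated with `layout` above.
        unsafe { dealloc(self.base, self.layout) }
    }
}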

@newpavlov
Member

This is weird. You mean that alloc_zeroed is faster with alignment of 1 than with alignment of 64?

I don't see any meaningful difference with a snippet like this (i.e. changing ALIGN to 1 has no measurable performance impact):

extern crate alloc;

use alloc::alloc::{dealloc, alloc_zeroed, Layout};

const SIZE: usize = 1 << 20;
const ALIGN: usize = 64;

// Use extern fn to prevent compiler optimizations
unsafe extern "C" {
    fn use_mem(p: *mut u8, val: u32);
}


mod inner {
    #[unsafe(no_mangle)]
    #[inline(never)]
    unsafe extern "C" fn use_mem(p: *mut u8, val: u32) {
        unsafe { core::ptr::write_bytes(p, val as u8, super::SIZE); }
    }
}

fn main() {
    let l = Layout::from_size_align(SIZE, ALIGN).unwrap();

    let t = std::time::Instant::now();
    for i in 0..1_000_000_000 {
        unsafe {
            let p = alloc_zeroed(l);
            if p.is_null() {
                panic!()
            }
            use_mem(p, i);
            dealloc(p, l);
        }
    }
    println!("{:?}", t.elapsed());
}

I understand that this "benchmark" is problematic, so I would appreciate it if you could provide a small reproduction of your issue.

@conradludgate
Contributor

At least on my Apple Silicon machine, what I am observing is that alloc_zeroed when aligned is actually mapping the pages, whereas when unaligned it is seemingly unmapped until used. This makes unaligned look faster to allocate, but when actually touching the pages, the slowdown is equal.

I cannot speak for @jonasmalacofilho's previous benchmarks though. Perhaps the interleaving of block hashes and page mappings performs better than mapping all-at-once.

@newpavlov
Member

newpavlov commented Mar 4, 2025

This may be an implementation detail of the allocator used. I think we should prefer code which directly conveys the desired intent, rather than trying to work around potential weird allocator issues.

We could also directly call mmap on some systems to skip the allocator completely. Though it may be less efficient in the end, since the allocator may reuse previously mapped memory without doing any syscalls.

@jonasmalacofilho

(Before I forget again, I'm testing/looking at x86_64 Linux).

I thought that for big enough allocations (e.g. more than 16 pages) allocators usually just mmap the requested memory (...).

That's documented for glibc malloc, at an even lower threshold than that. However, the problem isn't always the allocation per se; it can also have to do with ensuring the memory is zeroed.

For small alignments alloc_zeroed on Unix/Linux calls calloc, which ensures the memory is both aligned and zeroed. But calloc's guarantees of alignment are limited to what C calls "fundamental alignment", which is basically size_of::<usize>().

So for larger alignments, alloc_zeroed falls back to allocating with the requested alignment, then zeroing the memory. And the latter is the slow part, when compared to the calloc fast path, because, on Linux, calloc can guarantee that the returned memory is zeroed without generally doing any work at all (at least at these large allocation sizes).

But with alloc_zeroed we don't need any cloning, no? We just allocate the zeroed memory and cast it into effectively Box<[Block]>.

Yeah, my bad. I should have pointed out that the benchmarks in that commit message were representative of what I saw with alloc_zeroed. But you're right, of course, that cloning isn't involved with alloc_zeroed.

I don't see any meaningful difference with the snippet like this (i.e. changing ALIGN to 1 has no measurable performance impact):

Well, I think that has to do with the fact that you're only allocating 1 MiB. If I increase your SIZE parameter to 1 GiB, I see roughly a 25% drop in performance between 1 and 64 byte alignment: with 100 iterations, from 7.9s to 10.1s.


Here's the higher-level benchmark I used back then:

#![allow(dead_code)]
#![deny(unsafe_op_in_unsafe_fn)]

use std::alloc::{self, Layout};
use std::hint::black_box;

use criterion::{criterion_group, criterion_main, Criterion, Throughput};

#[derive(Debug, Clone, Copy)]
#[repr(align(64))]
struct Block([u64; 128]);

impl Default for Block {
    fn default() -> Self {
        Self([0; 128])
    }
}

const COUNT: usize = 1024 * 1024;
const SIZE: usize = size_of::<Block>();

fn bench(c: &mut Criterion) {
    let mut group = c.benchmark_group("bench");
    group.throughput(Throughput::Bytes((SIZE * COUNT) as u64)); // 1 GiB

    group.bench_function("bytes with vec! macro", |b| {
        b.iter(|| vec![0u8; black_box(SIZE * COUNT)])
    });

    group.bench_function("42u8 with vec! macro", |b| {
        b.iter(|| vec![42u8; black_box(SIZE * COUNT)])
    });

    group.bench_function("Blocks with vec! macro", |b| {
        b.iter(|| vec![Block::default(); black_box(COUNT)])
    });

    group.bench_function(
        "Blocks with alloc::alloc_zeroed then Vec::from_raw_parts",
        |b| {
            b.iter(|| {
                let count = black_box(COUNT);
                let layout = Layout::array::<Block>(count).unwrap();
                assert_eq!(layout.size(), SIZE * COUNT);
                let ptr = unsafe { alloc::alloc_zeroed(layout) };
                if ptr.is_null() {
                    alloc::handle_alloc_error(layout);
                }
                let vec: Vec<Block> = unsafe { Vec::from_raw_parts(ptr.cast(), count, count) };
                vec
            })
        },
    );

    group.finish();
}

criterion_group!(benches, bench);
criterion_main!(benches);
bench/bytes with vec! macro                                                                             
                        time:   [9.3414 µs 9.3510 µs 9.3612 µs]
                        thrpt:  [106824 GiB/s 106940 GiB/s 107051 GiB/s]
bench/42u8 with vec! macro                                                                            
                        time:   [78.610 ms 78.696 ms 78.788 ms]
                        thrpt:  [12.692 GiB/s 12.707 GiB/s 12.721 GiB/s]
bench/Blocks with vec! macro                                                                            
                        time:   [64.976 ms 65.099 ms 65.217 ms]
                        thrpt:  [15.333 GiB/s 15.361 GiB/s 15.390 GiB/s]
bench/Blocks with alloc::alloc_zeroed then Vec::from_raw_parts                                                                            
                        time:   [78.990 ms 79.199 ms 79.406 ms]
                        thrpt:  [12.594 GiB/s 12.626 GiB/s 12.660 GiB/s]

@jonasmalacofilho

Note that the benchmark in my previous comment is also flawed, of course. For one, it doesn't take into account overcommit, which is probably why vec![0u8; ...] was so suspiciously fast.

But the underlying point is that not carefully allocating the buffer results in one extra single-threaded pass over memory, which is significant when the algorithm we're interested in basically only does a few more passes over it.

This is may be an implementation detail of the used allocator. I think we should prefer code which directly conveys the desired intent, than trying to work around potential weird allocator issues.

I get your point, but a 27.5% performance hit on m=1GiB p=4 is significant. And it gets relatively worse as memory and/or lane counts increase.

@newpavlov
Member

newpavlov commented Mar 4, 2025

IIUC on a long-running system we need the zeroization pass either way. The difference is only whether it will be done by the OS in kernel space (if we mmap new memory) or by an allocator in user space (if we reuse previously mapped memory). In the former case, this zeroization may not be reflected as time spent in your program, but the system as a whole still performs the same amount of work.

For now I think we should merge the alloc_zeroed PR and continue discussion of potential improvements in a separate PR/issue.

@newpavlov
Member

Alternatively, we could try using uninit memory. But it certainly should be implemented as a separate feature.

@jonasmalacofilho

IIUC on a long-running system we need the zeroization pass either way. The difference is only whether it will be done by the OS in kernel space or by an allocator in user space. In the former case, this zeroization may not be reflected as time spent in your program, but the system as a whole still performs the same amount of work.

I think there's an issue with that argument: if the OS does it, it will continue to do it regardless of whether we do it too (unless we find a way to explicitly opt out of that, if that's even possible). So we're still talking about one extra memory pass, and not the same amount of work.

For now I think we should merge the alloc_zeroed PR and continue discussion of potential improvements in a separate PR/issue.

I'm ok with that.

Alternatively, we could try using uninit memory. But it certainly should be implemented as a separate feature.

I think an argon2 slice in the first pass can sometimes read (zeroes) from later slices? I'm not sure, and either way it seems tricky to guarantee that uninit memory is never possibly read.

@newpavlov
Member

newpavlov commented Mar 4, 2025

I think there's an issue with that argument: if the OS does it, it will continue to do it regardless of whether we do it too (unless we find a way to explicitly opt out of that, if that's even possible).

Well, there is MAP_UNINITIALIZED, but it's not used outside of embedded devices.

I was talking about a slightly different thing. A common optimization for allocators is to do the following:

  • mmap a huge chunk of memory with MAP_ANONYMOUS (the pages get zeroized by the OS either immediately, or on the first page touch).
  • On deallocation, keep the mapped memory, but mark it with MADV_FREE.
  • On a future non-zeroizing allocation, hand out the MADV_FREE memory instead of mapping new memory.

The MADV_FREE memory is semantically "uninitialized". Not only may it contain garbage data from the previous allocation but, even worse, until we write to it its contents may change to zeros "under our feet" (i.e. the OS may reclaim the physical page and replace it with the "zero page").

The existence of MADV_FREE is the classic argument for why properly managing uninit memory is hard in Rust.

I think an argon2 slice in the first pass can sometimes read (zeroes ) from later slices?

I am not sure. If it's true, then we are out of luck.

@jonasmalacofilho

Yes, that could happen, given the right allocator and circumstances. But I think you're overestimating how common that scenario is.

Note that I haven't yet been able to trigger it with a benchmark (either the one above or the full argon2 benchmarks I ran when trying different allocation strategies for #547). And a benchmark is basically the ideal case to observe what you described: we allocate and free the exact same (large) size over and over again. This on a fairly common platform using a fairly common allocator (glibc on x86_64 Linux).

Your argument also doesn't apply to use cases where it's desirable for argon2 to be as fast as possible with high memory sizes, but it isn't called over and over again. Like a password manager, for example, where one-off latency limits the choice of argon2 parameters.

So I just want to ask you to keep the ugly, but (at least sometimes) effective, oversized-buffer-of-bytes-into-aligned-slice workaround on the table, at least when thinking about #547.

@newpavlov
Member

I am open to adding this workaround, but I think we should do it in a separate PR after additional investigation.

@decathorpe

Hi, reporter of the original issue here.

From what I can tell, the actual issue here is that the vec![] call that causes this (here) fails not because the memory is not available, but because the requested allocation causes an integer overflow in Layout calculation on 32-bit architectures (the size in self.size.checked_mul(n) in Layout::repeat_packed is a usize).

So with the argon2 parameters requested in sequoia-openpgp (=2.0.0-alpha.2), the Layout calculation unconditionally panic!d the program. The workaround in the 2.0.0 release was to introduce code to fallibly allocate memory, but - if I understand this correctly - that should still unconditionally fail on 32-bit architectures for exactly the same reason - because it still needs to allocate the same amount of memory. It now just returns an error instead of hard-panic!ing. Which is better, but doesn't really solve the underlying issue.

It might be better to use the recommended argon2 parameters for "resource constrained systems" on 32-bit architectures instead of hard-erroring on use of argon2 on those systems?

@newpavlov
Member

Layout::array::<Block>(len).ok()?; should handle big buffer sizes without panicking, and IIUC Params::block_count returns a number which is always smaller than 2^32 (i.e. computations inside it cannot overflow).

It might be better to use the recommended argon2 parameters for "resource constrained systems" on 32-bit architectures instead of hard-erroring on use of argon2 on those systems?

I don't think we should change default algorithm parameters depending on target, since it may cause portability issues. But it may be reasonable to change the default for all targets. Please open a separate issue for that.

@jonasmalacofilho

jonasmalacofilho commented Mar 5, 2025

Layout::array::<Block>(len).ok()?; should handle big buffer sizes without panicking and IIUC Params::block_count returns a number which is always smaller than 2^32 (i.e. computations inside it can not overflow).

That would mean an allocation of 2^42 bytes, no? Much larger than isize::MAX on 32-bit platforms (see footnote 1).

Footnotes

  1. Also, isn't isize::MAX, not usize::MAX, the maximum allocation size?

@newpavlov
Member

I was talking about this line in the Layout::array docs:

On arithmetic overflow or when the total size would exceed isize::MAX, returns LayoutError.

As you can see here it works as documented.
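For illustration (using a stand-in Block with the same size and alignment), a check like this returns Err on a 32-bit target because the total size overflows, and Ok on 64-bit:

use std::alloc::Layout;

#[repr(align(64))]
struct Block([u64; 128]); // 1 KiB, stand-in for argon2::Block

fn main() {
    // Roughly the largest block count the parameters can describe.
    let block_count = u32::MAX as usize;
    // Err(LayoutError) on 32-bit targets (the size exceeds usize/isize::MAX), Ok on 64-bit.
    println!("{:?}", Layout::array::<Block>(block_count));
}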

@jonasmalacofilho

@newpavlov, sorry, I misunderstood the point you were trying to make. @decathorpe didn't suggest that Layout::array could panic; in fact, the workaround he mentioned uses it. So I failed to consider that you were just confirming it.

@newpavlov
Member

Ah, I also slightly misread this comment. I thought that here:

the Layout calculation unconditionally panic!d the program

"Layout calculation" was referring to the Layout::array part.
