
Tracking Issue for algebraic floating point methods #136469

Open
2 of 6 tasks
calder opened this issue Feb 3, 2025 · 9 comments
Labels
A-floating-point Area: Floating point numbers and arithmetic C-tracking-issue Category: An issue tracking the progress of sth. like the implementation of an RFC T-lang Relevant to the language team, which will review and decide on the PR/issue. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.

Comments

@calder

calder commented Feb 3, 2025

Feature gate: #![feature(float_algebraic)]

This is a tracking issue for exposing core::intrinsics::f*_algebraic in stable Rust.

Public API

// core::num::f16

impl f16 {
    pub fn algebraic_add(self, rhs: f16) -> f16;
    pub fn algebraic_sub(self, rhs: f16) -> f16;
    pub fn algebraic_mul(self, rhs: f16) -> f16;
    pub fn algebraic_div(self, rhs: f16) -> f16;
    pub fn algebraic_rem(self, rhs: f16) -> f16;
}
// core::num::f32

impl f32 {
    pub fn algebraic_add(self, rhs: f32) -> f32;
    pub fn algebraic_sub(self, rhs: f32) -> f32;
    pub fn algebraic_mul(self, rhs: f32) -> f32;
    pub fn algebraic_div(self, rhs: f32) -> f32;
    pub fn algebraic_rem(self, rhs: f32) -> f32;
}
// core::num::f64

impl f64 {
    pub fn algebraic_add(self, rhs: f64) -> f64;
    pub fn algebraic_sub(self, rhs: f64) -> f64;
    pub fn algebraic_mul(self, rhs: f64) -> f64;
    pub fn algebraic_div(self, rhs: f64) -> f64;
    pub fn algebraic_rem(self, rhs: f64) -> f64;
}
// core::num::f128

impl f128 {
    pub fn algebraic_add(self, rhs: f128) -> f128;
    pub fn algebraic_sub(self, rhs: f128) -> f128;
    pub fn algebraic_mul(self, rhs: f128) -> f128;
    pub fn algebraic_div(self, rhs: f128) -> f128;
    pub fn algebraic_rem(self, rhs: f128) -> f128;
}
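The `algebraic_*` methods let the optimizer treat these operations as if they obeyed the laws of real arithmetic, which IEEE 754 operations do not. A minimal sketch of why reordering changes results, using only stable operators (the methods above are still unstable behind `float_algebraic`):

```rust
fn main() {
    // IEEE 754 addition is not associative, which is why the compiler may not
    // reorder it by default; `algebraic_add` and friends opt in to reordering
    // at the cost of non-deterministic results.
    let (a, b, c) = (1.0e30_f32, -1.0e30_f32, 1.0_f32);
    assert_eq!((a + b) + c, 1.0); // the large terms cancel first, then + 1.0
    assert_eq!(a + (b + c), 0.0); // 1.0 is absorbed into -1.0e30, then cancels
    println!("f32 addition is not associative");
}
```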

Steps / History

Unresolved Questions

References

cc @rust-lang/lang @rust-lang/libs-api

Footnotes

  1. https://std-dev-guide.rust-lang.org/feature-lifecycle/stabilization.html

@calder calder added C-tracking-issue Category: An issue tracking the progress of sth. like the implementation of an RFC T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. labels Feb 3, 2025
@Noratrieb Noratrieb marked this as a duplicate of #136468 Feb 3, 2025
@jieyouxu jieyouxu added the A-floating-point Area: Floating point numbers and arithmetic label Feb 3, 2025
@tgross35 tgross35 added the T-lang Relevant to the language team, which will review and decide on the PR/issue. label Feb 5, 2025
@tgross35
Contributor

tgross35 commented Feb 5, 2025

Added lang as requested in rust-lang/libs-team#532 (comment). That comment also mentions algebraic_mul_add, which could be added after the initial PR if it makes sense.

@RalfJung
Member

These operations are non-deterministic. @nikic do you know if LLVM scalar evolution handles that properly? We had codegen issues in the past when SE assumed that an operation was deterministic when it actually was not.

@RalfJung
Member

RalfJung commented Feb 17, 2025

That comment also mentions algebraic_mul_add, which could be added after the initial PR if it makes sense.

We also have the "may or may not fuse" intrinsics added in #124874, which so far have not been exposed in any way that has a path to stabilization. Would `algebraic_mul_add` use those intrinsics, or would it replace them, since we likely want some of the algebraic fast-math flags on that operation as well (i.e., we want to give the compiler more freedom than just "fuse or don't fuse")?
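For context on why "fuse or don't fuse" already changes results: a fused multiply-add rounds once, while a separate multiply and add round twice. A small sketch using the stable `f64::mul_add` (which always fuses, unlike the may-fuse intrinsics discussed here):

```rust
fn main() {
    // x*x needs more than 52 mantissa bits, so the standalone product rounds.
    let x = 1.0_f64 + 2.0_f64.powi(-27);
    let y = x * x; // rounded here...
    let separate = x * x - y;     // ...so this difference is exactly 0.0
    let fused = x.mul_add(x, -y); // fused: x*x - y with a single rounding
    assert_eq!(separate, 0.0);
    assert_eq!(fused, 2.0_f64.powi(-54)); // recovers the rounding error
    println!("fma differs from mul-then-add");
}
```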

@nikic
Contributor

nikic commented Feb 17, 2025

These operations are non-deterministic. @nikic do you know if LLVM scalar evolution handles that properly? We had codegen issues in the past when SE assumed that an operation was deterministic and then actually it was not.

Are these "algebraic" in the sense that they have reassoc FMF? If so, then yes, SE should be treating them as non-deterministic already: https://github.com/llvm/llvm-project/blob/6684a5970e74b8b4c0c83361a90e25dae9646db0/llvm/lib/Analysis/ConstantFolding.cpp#L1437-L1444

@RalfJung
Member

reassoc and a few more, yeah. Seems like that is accounted for, thanks. :)

@tgross35 tgross35 changed the title Tracking Issue for exposing algebraic floating point intrinsics Tracking Issue for algebraic floating point methods Feb 17, 2025
@tgross35
Contributor

On that note, I added an unresolved question about naming, since `algebraic` isn't the clearest indicator of what is going on.

@tgross35
Contributor

In theory the flags could also be represented via const generics, which would allow more fine-tuned control and more flags without a method explosion. Something like:

#[derive(Clone, Copy, Debug, Default, ...)]
#[non_exhaustive]
struct FpArithOps {
    reassociate: bool,
    contract: bool,
    reciprocal: bool,
    no_signed_zeros: bool,
    ftz: bool,
    daz: bool,
    poison_nan: bool,
    poison_inf: bool,
}

impl FpArithOps {
    // Flag set matching the current `algebraic_*` methods
    const ALGEBRAIC: Self = Self {
        reassociate: true,
        contract: true,
        reciprocal: true,
        no_signed_zeros: true,
        ftz: false,
        daz: false,
        poison_nan: false,
        poison_inf: false,
    };
}

// Panics if any `poison_*` flag is set
fn add_with_ops<const OPS: FpArithOps>(self, y: Self) -> Self;

// Allows `poison_*`
unsafe fn add_with_ops_unchecked<const OPS: FpArithOps>(self, y: Self) -> Self;

// Same as `f32::algebraic_div`
let x = 1.0f32.div_with_ops::<{ FpArithOps::ALGEBRAIC }>(y);

(Using a struct needs the next step of const generic support, or some way to bless types in std)

@hanna-kruppe
Contributor

As I wrote in the ACP:

While this goes in the direction of having multiple combinations of LLVM “fast math flags” exposed, I don’t think that there’s more than two or three sets that are broadly useful enough and well-behaved enough to be worth exposing. And one of those is just “let the hardware do whatever is fast w.r.t. subnormals” which is something that really wants to be applied to entire regions of code and not to individual operations, so it may need an entirely different design. (It’s also a function attribute rather than an instruction flag in LLVM.)

I don’t think that we should expose the power set of LLVM’s flags, nor so many ad-hoc combinations that “method explosion” becomes a realistic problem. I’m not even sure if it’s a good idea to ever expose any of FMFs that can make an operation return poison (has anyone ever used those soundly while still getting a speedup?). And FTZ/DAZ shouldn’t be modeled as per-operation flags because you don’t want to toggle the CPU control registers for that constantly.

Zalathar added a commit to Zalathar/rust that referenced this issue Feb 18, 2025
Expose algebraic floating point intrinsics

# Problem

A stable Rust implementation of a simple dot product is 8x slower than C++ on modern x86-64 CPUs. The root cause is an inability to let the compiler reorder floating point operations for better vectorization.

See https://github.com/calder/dot-bench for benchmarks. Measurements below were performed on an i7-10875H.

### C++: 10us ✅

With Clang 18.1.3 and `-O2 -march=haswell`:
<table>
<tr>
    <th>C++</th>
    <th>Assembly</th>
</tr>
<tr>
<td>
<pre lang="cc">
float dot(float *a, float *b, size_t len) {
    #pragma clang fp reassociate(on)
    float sum = 0.0;
    for (size_t i = 0; i < len; ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}
</pre>
</td>
<td>
<img src="https://github.com/user-attachments/assets/739573c0-380a-4d84-9fd9-141343ce7e68" />
</td>
</tr>
</table>

### Nightly Rust: 10us ✅

With rustc 1.86.0-nightly (8239a37) and `-C opt-level=3 -C target-feature=+avx2,+fma`:
<table>
<tr>
    <th>Rust</th>
    <th>Assembly</th>
</tr>
<tr>
<td>
<pre lang="rust">
fn dot(a: &[f32], b: &[f32]) -> f32 {
    let mut sum = 0.0;
    for i in 0..a.len() {
        sum = fadd_algebraic(sum, fmul_algebraic(a[i], b[i]));
    }
    sum
}
</pre>
</td>
<td>
<img src="https://github.com/user-attachments/assets/9dcf953a-2cd7-42f3-bc34-7117de4c5fb9" />
</td>
</tr>
</table>

### Stable Rust: 84us ❌

With rustc 1.84.1 (e71f9a9) and `-C opt-level=3 -C target-feature=+avx2,+fma`:
<table>
<tr>
    <th>Rust</th>
    <th>Assembly</th>
</tr>
<tr>
<td>
<pre lang="rust">
fn dot(a: &[f32], b: &[f32]) -> f32 {
    let mut sum = 0.0;
    for i in 0..a.len() {
        sum += a[i] * b[i];
    }
    sum
}
</pre>
</td>
<td>
<img src="https://github.com/user-attachments/assets/936a1f7e-33e4-4ff8-a732-c3cdfe068dca" />
</td>
</tr>
</table>

# Proposed Change

Add `core::intrinsics::f*_algebraic` wrappers to `f16`, `f32`, `f64`, and `f128` gated on a new `float_algebraic` feature.

# Alternatives Considered

rust-lang#21690 has a lot of good discussion of various options for supporting fast math in Rust, but it is still open a decade later because any choice that opts in anything more than individual operations is ultimately contrary to Rust's design principles.

In the meantime, processors have evolved and we're leaving major performance on the table by not supporting vectorization. We shouldn't make users choose between an unstable compiler and an 8x performance hit.

# References

* rust-lang#21690
* rust-lang/libs-team#532
* rust-lang#136469
* https://github.com/calder/dot-bench
* https://www.felixcloutier.com/x86/vfmadd132ps:vfmadd213ps:vfmadd231ps
@zroug

zroug commented Feb 18, 2025

Should @llvm.arithmetic.fence be exposed? Sometimes it would be nice to reason about the effects of algebraic operations locally. Currently, there is no clean way to prevent reassociation, etc. with operations outside your local context, because you can't know if the caller is using algebraic operations as well.
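Until something like `@llvm.arithmetic.fence` is exposed, one coarse workaround available on stable Rust is `std::hint::black_box`, which makes a value opaque to the optimizer so reassociation cannot cross it. It is a much blunter tool than a dedicated arithmetic fence (it blocks other optimizations too), so this is only a sketch of the idea:

```rust
use std::hint::black_box;

// Sums a slice, then places an optimization barrier on the result so that a
// caller using algebraic operations cannot reassociate across this boundary.
fn locally_ordered_sum(parts: &[f32]) -> f32 {
    let partial: f32 = parts.iter().sum();
    black_box(partial)
}

fn main() {
    assert_eq!(locally_ordered_sum(&[1.0, 2.0, 3.0]), 6.0);
}
```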

bors added a commit to rust-lang-ci/rust that referenced this issue Feb 19, 2025
Expose algebraic floating point intrinsics

bors added a commit to rust-lang-ci/rust that referenced this issue Feb 26, 2025
Expose algebraic floating point intrinsics

fmease added a commit to fmease/rust that referenced this issue Feb 26, 2025
Expose algebraic floating point intrinsics

bors added a commit to rust-lang-ci/rust that referenced this issue Feb 26, 2025
Expose algebraic floating point intrinsics
