
Tracking Issue for algebraic floating point methods #136469

Open
2 of 6 tasks
calder opened this issue Feb 3, 2025 · 9 comments
Labels
A-floating-point Area: Floating point numbers and arithmetic C-tracking-issue Category: An issue tracking the progress of sth. like the implementation of an RFC T-lang Relevant to the language team, which will review and decide on the PR/issue. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.

Comments

@calder

calder commented Feb 3, 2025

Feature gate: #![feature(float_algebraic)]

This is a tracking issue for exposing core::intrinsics::f*_algebraic in stable Rust.

Public API

// core::num::f16

impl f16 {
    pub fn algebraic_add(self, rhs: f16) -> f16;
    pub fn algebraic_sub(self, rhs: f16) -> f16;
    pub fn algebraic_mul(self, rhs: f16) -> f16;
    pub fn algebraic_div(self, rhs: f16) -> f16;
    pub fn algebraic_rem(self, rhs: f16) -> f16;
}
// core::num::f32

impl f32 {
    pub fn algebraic_add(self, rhs: f32) -> f32;
    pub fn algebraic_sub(self, rhs: f32) -> f32;
    pub fn algebraic_mul(self, rhs: f32) -> f32;
    pub fn algebraic_div(self, rhs: f32) -> f32;
    pub fn algebraic_rem(self, rhs: f32) -> f32;
}
// core::num::f64

impl f64 {
    pub fn algebraic_add(self, rhs: f64) -> f64;
    pub fn algebraic_sub(self, rhs: f64) -> f64;
    pub fn algebraic_mul(self, rhs: f64) -> f64;
    pub fn algebraic_div(self, rhs: f64) -> f64;
    pub fn algebraic_rem(self, rhs: f64) -> f64;
}
// core::num::f128

impl f128 {
    pub fn algebraic_add(self, rhs: f128) -> f128;
    pub fn algebraic_sub(self, rhs: f128) -> f128;
    pub fn algebraic_mul(self, rhs: f128) -> f128;
    pub fn algebraic_div(self, rhs: f128) -> f128;
    pub fn algebraic_rem(self, rhs: f128) -> f128;
}
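The `algebraic_*` methods let the optimizer treat these operations as if they obeyed the laws of real arithmetic, which IEEE 754 operations do not. A minimal sketch of why reordering changes results, using only stable operators (the methods above are still unstable behind `float_algebraic`):

```rust
fn main() {
    // IEEE 754 addition is not associative, which is why the compiler may not
    // reorder it by default; `algebraic_add` and friends opt in to reordering
    // at the cost of non-deterministic results.
    let (a, b, c) = (1.0e30_f32, -1.0e30_f32, 1.0_f32);
    assert_eq!((a + b) + c, 1.0); // the large terms cancel first, then + 1.0
    assert_eq!(a + (b + c), 0.0); // 1.0 is absorbed into -1.0e30, then cancels
    println!("f32 addition is not associative");
}
```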

Steps / History

Unresolved Questions

References

cc @rust-lang/lang @rust-lang/libs-api

Footnotes

  1. https://std-dev-guide.rust-lang.org/feature-lifecycle/stabilization.html

@calder calder added C-tracking-issue Category: An issue tracking the progress of sth. like the implementation of an RFC T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. labels Feb 3, 2025
@Noratrieb Noratrieb marked this as a duplicate of #136468 Feb 3, 2025
@jieyouxu jieyouxu added the A-floating-point Area: Floating point numbers and arithmetic label Feb 3, 2025
@tgross35 tgross35 added the T-lang Relevant to the language team, which will review and decide on the PR/issue. label Feb 5, 2025
@tgross35
Contributor

tgross35 commented Feb 5, 2025

Added lang as requested in rust-lang/libs-team#532 (comment). That comment also mentions algebraic_mul_add, which could be added after the initial PR if it makes sense.

@RalfJung
Member

These operations are non-deterministic. @nikic do you know if LLVM scalar evolution handles that properly? We had codegen issues in the past when SE assumed that an operation was deterministic when it actually was not.

@RalfJung
Member

RalfJung commented Feb 17, 2025

That comment also mentions algebraic_mul_add, which could be added after the initial PR if it makes sense.

We also have the "may or may not fuse" intrinsics added in #124874, which so far have not been exposed in any way that has a path to stabilization. Would `algebraic_mul_add` use those intrinsics, or would it replace them, since we likely want some of the algebraic fast-math flags on that operation as well (i.e., we want to give the compiler more freedom than just "fuse or don't fuse")?
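For context on why "fuse or don't fuse" already changes results: a fused multiply-add rounds once, while a separate multiply and add round twice. A small sketch using the stable `f64::mul_add` (which always fuses, unlike the may-fuse intrinsics discussed here):

```rust
fn main() {
    // x*x needs more than 52 mantissa bits, so the standalone product rounds.
    let x = 1.0_f64 + 2.0_f64.powi(-27);
    let y = x * x; // rounded here...
    let separate = x * x - y;     // ...so this difference is exactly 0.0
    let fused = x.mul_add(x, -y); // fused: x*x - y with a single rounding
    assert_eq!(separate, 0.0);
    assert_eq!(fused, 2.0_f64.powi(-54)); // recovers the rounding error
    println!("fma differs from mul-then-add");
}
```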

@nikic
Contributor

nikic commented Feb 17, 2025

These operations are non-deterministic. @nikic do you know if LLVM scalar evolution handles that properly? We had codegen issues in the past when SE assumed that an operation was deterministic and then actually it was not.

Are these "algebraic" in the sense that they have reassoc FMF? If so, then yes, SE should be treating them as non-deterministic already: https://github.com/llvm/llvm-project/blob/6684a5970e74b8b4c0c83361a90e25dae9646db0/llvm/lib/Analysis/ConstantFolding.cpp#L1437-L1444

@RalfJung
Member

reassoc and a few more, yeah. Seems like that is accounted for, thanks. :)

@tgross35 tgross35 changed the title Tracking Issue for exposing algebraic floating point intrinsics Tracking Issue for algebraic floating point methods Feb 17, 2025
@tgross35
Contributor

On that note, I added an unresolved question about naming, since `algebraic` isn't the clearest indicator of what is going on.

@tgross35
Contributor

In theory the flags could also be represented via const generics, which would allow more fine-tuned control and more flags without a method explosion. Something like:

#[derive(Clone, Copy, Debug, Default, ...)]
#[non_exhaustive]
struct FpArithOps {
    reassociate: bool,
    contract: bool,
    reciprocal: bool,
    no_signed_zeros: bool,
    ftz: bool,
    daz: bool,
    poison_nan: bool,
    poison_inf: bool,
}

impl FpArithOps {
    // Flag set matching the current `algebraic_*` methods
    const ALGEBRAIC: Self = Self {
        reassociate: true,
        contract: true,
        reciprocal: true,
        no_signed_zeros: true,
        ftz: false,
        daz: false,
        poison_nan: false,
        poison_inf: false,
    };
}

// Panics if any `poison_*` flag is set
fn add_with_ops<const OPS: FpArithOps>(self, y: Self) -> Self;

// Allows `poison_*`
unsafe fn add_with_ops_unchecked<const OPS: FpArithOps>(self, y: Self) -> Self;

// Same as `f32::algebraic_div`
let x = 1.0f32.div_with_ops::<{ FpArithOps::ALGEBRAIC }>(y);

(Using a struct needs the next step of const generic support, or some way to bless types in std)

@hanna-kruppe
Contributor

As I wrote in the ACP:

While this goes in the direction of having multiple combinations of LLVM “fast math flags” exposed, I don’t think that there’s more than two or three sets that are broadly useful enough and well-behaved enough to be worth exposing. And one of those is just “let the hardware do whatever is fast w.r.t. subnormals” which is something that really wants to be applied to entire regions of code and not to individual operations, so it may need an entirely different design. (It’s also a function attribute rather than an instruction flag in LLVM.)

I don’t think that we should expose the power set of LLVM’s flags, nor so many ad-hoc combinations that “method explosion” becomes a realistic problem. I’m not even sure if it’s a good idea to ever expose any of FMFs that can make an operation return poison (has anyone ever used those soundly while still getting a speedup?). And FTZ/DAZ shouldn’t be modeled as per-operation flags because you don’t want to toggle the CPU control registers for that constantly.

Zalathar added a commit to Zalathar/rust that referenced this issue Feb 18, 2025
Expose algebraic floating point intrinsics

# Problem

A stable Rust implementation of a simple dot product is 8x slower than C++ on modern x86-64 CPUs. The root cause is an inability to let the compiler reorder floating point operations for better vectorization.

See https://github.com/calder/dot-bench for benchmarks. Measurements below were performed on an i7-10875H.

### C++: 10us ✅

With Clang 18.1.3 and `-O2 -march=haswell`:
<table>
<tr>
    <th>C++</th>
    <th>Assembly</th>
</tr>
<tr>
<td>
<pre lang="cc">
float dot(float *a, float *b, size_t len) {
    #pragma clang fp reassociate(on)
    float sum = 0.0;
    for (size_t i = 0; i < len; ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}
</pre>
</td>
<td>
<img src="https://github.com/user-attachments/assets/739573c0-380a-4d84-9fd9-141343ce7e68" />
</td>
</tr>
</table>

### Nightly Rust: 10us ✅

With rustc 1.86.0-nightly (8239a37) and `-C opt-level=3 -C target-feature=+avx2,+fma`:
<table>
<tr>
    <th>Rust</th>
    <th>Assembly</th>
</tr>
<tr>
<td>
<pre lang="rust">
fn dot(a: &[f32], b: &[f32]) -> f32 {
    let mut sum = 0.0;
    for i in 0..a.len() {
        sum = fadd_algebraic(sum, fmul_algebraic(a[i], b[i]));
    }
    sum
}
</pre>
</td>
<td>
<img src="https://github.com/user-attachments/assets/9dcf953a-2cd7-42f3-bc34-7117de4c5fb9" />
</td>
</tr>
</table>

### Stable Rust: 84us ❌

With rustc 1.84.1 (e71f9a9) and `-C opt-level=3 -C target-feature=+avx2,+fma`:
<table>
<tr>
    <th>Rust</th>
    <th>Assembly</th>
</tr>
<tr>
<td>
<pre lang="rust">
fn dot(a: &[f32], b: &[f32]) -> f32 {
    let mut sum = 0.0;
    for i in 0..a.len() {
        sum += a[i] * b[i];
    }
    sum
}
</pre>
</td>
<td>
<img src="https://github.com/user-attachments/assets/936a1f7e-33e4-4ff8-a732-c3cdfe068dca" />
</td>
</tr>
</table>

# Proposed Change

Add `core::intrinsics::f*_algebraic` wrappers to `f16`, `f32`, `f64`, and `f128` gated on a new `float_algebraic` feature.

# Alternatives Considered

rust-lang#21690 has a lot of good discussion of various options for supporting fast math in Rust, but it is still open a decade later because any choice that opts in anything more than individual operations is ultimately contrary to Rust's design principles.

In the meantime, processors have evolved and we're leaving major performance on the table by not supporting vectorization. We shouldn't make users choose between an unstable compiler and an 8x performance hit.

# References

* rust-lang#21690
* rust-lang/libs-team#532
* rust-lang#136469
* https://github.com/calder/dot-bench
* https://www.felixcloutier.com/x86/vfmadd132ps:vfmadd213ps:vfmadd231ps
@zroug

zroug commented Feb 18, 2025

Should @llvm.arithmetic.fence be exposed? Sometimes it would be nice to reason about the effects of algebraic operations locally. Currently, there is no clean way to prevent reassociation, etc. with operations outside your local context, because you can't know if the caller is using algebraic operations as well.
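Until something like `@llvm.arithmetic.fence` is exposed, one coarse workaround available on stable Rust is `std::hint::black_box`, which makes a value opaque to the optimizer so reassociation cannot cross it. It is a much blunter tool than a dedicated arithmetic fence (it blocks other optimizations too), so this is only a sketch of the idea:

```rust
use std::hint::black_box;

// Sums a slice, then places an optimization barrier on the result so that a
// caller using algebraic operations cannot reassociate across this boundary.
fn locally_ordered_sum(parts: &[f32]) -> f32 {
    let partial: f32 = parts.iter().sum();
    black_box(partial)
}

fn main() {
    assert_eq!(locally_ordered_sum(&[1.0, 2.0, 3.0]), 6.0);
}
```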

bors added a commit to rust-lang-ci/rust that referenced this issue Feb 19, 2025
Expose algebraic floating point intrinsics

bors added a commit to rust-lang-ci/rust that referenced this issue Feb 26, 2025
Expose algebraic floating point intrinsics

fmease added a commit to fmease/rust that referenced this issue Feb 26, 2025
Expose algebraic floating point intrinsics

bors added a commit to rust-lang-ci/rust that referenced this issue Feb 26, 2025
Expose algebraic floating point intrinsics
