
Improve precision for Sinh and Cosh TP routines #99763

Closed

Conversation

eiriktsarpalis
Member

The AMD algorithm for cosh falls back to scalar for values exceeding a specified threshold. This fallback was missed in the TP transcription, resulting in a loss of precision for large inputs.

Contributes to #98861.

private const float Single_LOGV = 0.693161f;
private const float Single_HALFV = 1.0000138f;
private const float Single_INVV2 = 0.24999309f;

private const ulong Double_MAX_VECTORIZED_VALUE = 0x41600000000000ul;
Member Author


NB: there's no double implementation in the AMD library. I derived this value from upper bounds used elsewhere and made it small enough that all our tests pass.

Member


This is not a good value: it represents 1.9330329145781312E-307, which is so small that essentially no real-world input will be vectorized.

I'd expect a more realistic value to at least match float and be 0x4056_5A9F_8000_0000, although in practice I'd expect it to be quite a bit higher. If you look at the scalar cosh algorithm, you'll note that double uses 0x4086_33CE_8FB9_F87E, which is the bitwise representation of 710.475860073944.
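For reference, both bit patterns can be sanity-checked by reinterpreting the bits as an IEEE 754 double (a quick illustrative sketch, not code from the PR):

```python
import struct

def bits_to_double(bits: int) -> float:
    """Reinterpret a 64-bit integer bit pattern as an IEEE 754 double."""
    return struct.unpack("<d", struct.pack("<Q", bits))[0]

# Threshold proposed in the PR diff: a tiny value near 1.93e-307.
print(bits_to_double(0x0041600000000000))
# Float cutoff widened to double bits: ~89.416.
print(bits_to_double(0x40565A9F80000000))
# Scalar double cosh cutoff cited above: 710.475860073944.
print(bits_to_double(0x408633CE8FB9F87E))
```

Running this confirms the values quoted in the review comment.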


In general, this is a case where porting the scalar kernel is likely overall better. The biggest "blocker" to that is that the double version currently uses a lookup table for the tail adjustment, which tends to not be trivially or efficiently vectorizable. The float version does not and uses more branches instead.

There are options here, just not "trivial" ones, and the PR would be more involved. So for the time being, I think simply matching the "correct" arg max is the more desirable behavior.

Member Author

@eiriktsarpalis Mar 15, 2024


That's my bad: I was toying around with the binary representation of the upper bound to find something that passes and hadn't realized it ended up being so small! After more careful experimentation, it turns out the maximal cut-off within which we hit our precision targets for double is ~16.5, much lower than for the float algorithm. Assuming we don't change the algorithm, I see two options here:

  1. Use the reduced cut-off of 16.5, or
  2. Increase or remove the cut-offs and increase testing tolerances for the hyperbolic methods.

Member


I think we really need to use the normal cutoff here, otherwise users will end up seeing major perf slowdowns for somewhat typical inputs.

The main reason for the precision issue comes from LOGV, HALFV, and INVV2 being "exact" (or nearest-representable) values. I had given a rough breakdown of where the values came from in #97874 (comment), but didn't end up having time to figure out the right amount of adjustment to give to the values for double.
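As a sketch of that breakdown (my reconstruction, not the linked comment): picking some v slightly above 2 with HALFV = v/2, LOGV = ln(v), and INVV2 = 1/v², the rewrite cosh(x) = HALFV * (e^(x-LOGV) + INVV2 * e^-(x-LOGV)) is algebraically identical to (e^x + e^-x)/2 when the constants are exact; with the float-rounded constants from the PR, it only agrees to roughly single precision:

```python
import math

# Constants quoted from the PR diff; the derivation from v ~= 2.0000276
# (v = 2 * HALFV) is my assumption, not stated in the thread.
LOGV = 0.693161       # ~= ln(v)
HALFV = 1.0000138     # ~= v / 2
INVV2 = 0.24999309    # ~= 1 / v**2

def cosh_reconstructed(x: float) -> float:
    # HALFV * e^(x-LOGV)         = e^x / 2   (when LOGV == ln(2*HALFV))
    # HALFV * INVV2 * e^-(x-LOGV) = e^-x / 2
    z = x - LOGV
    return HALFV * (math.exp(z) + INVV2 * math.exp(-z))

print(cosh_reconstructed(3.0), math.cosh(3.0))
```

The rounded constants satisfy ln(2 * HALFV) ≈ LOGV and 1 / (2 * HALFV)² ≈ INVV2 to about 1e-7, which is why the identity holds to float precision but not to double precision.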

If you look at https://github.com/amd/aocl-libm-ose/blob/master/src/optimized/cosh.c, you'll note that values over 20 get simplified because the negative exponential in (e^x + e^-x) / 2 is small enough to no longer matter (we just do exp(|x|) / 2, with the relevant sign handling from x), while values under this range use the head/tail split to account for imprecision.

Given the above and given that we need to call exp regardless, we should be able to conditionalize the work to keep it accurate and efficient.
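A scalar sketch of that conditionalization (the cutoff of 20 follows the AOCL source quoted above; everything else is illustrative, not the actual vector kernel):

```python
import math

LARGE_CUTOFF = 20.0  # per the AOCL scalar kernel; above this, e^-|x| is negligible
_LN2 = math.log(2.0)

def cosh_conditional(x: float) -> float:
    ax = abs(x)
    if ax > LARGE_CUTOFF:
        # e^-ax is below double precision relative to e^ax, so drop it.
        # Folding the /2 into the exponent keeps arguments up to
        # ~710.475 from overflowing inside exp.
        return math.exp(ax - _LN2)
    # Small-argument path; the real kernel adds a head/tail split here.
    return (math.exp(ax) + math.exp(-ax)) / 2.0
```

In a vectorized port, the branch would become a lane mask selecting between the two results, with exp computed once on the shared |x| - ln(2) style argument.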

Member Author


Ok, in that case I'd be inclined to close this PR.

Contributor

Tagging subscribers to this area: @dotnet/area-system-numerics
See info in area-owners.md if you want to be subscribed.

@stephentoub
Member

The AMD algorithm for cosh falls back to scalar for values exceeding a specified threshold. This fallback was missed in the TP transcription

It wasn't "missed"... it was removed in 0e6a327. But I don't remember why.

@eiriktsarpalis
Member Author

But I don't remember why.

Maybe because it was missing a double counterpart at the time? That's where the precision issues manifest in our testing.

@github-actions github-actions bot locked and limited conversation to collaborators Apr 15, 2024