[WIP] added definitions for lp-distances. Faster than just using norm #1234

Draft
wants to merge 4 commits into
base: master
from

Conversation

ajinkya-k
Contributor

@ajinkya-k ajinkya-k commented Mar 7, 2025

This PR adds functions to compute the $L_p$ distance between two vectors. They are faster than calling norm(x .- y).

Some notes (I am going to write up the benchmarks and post the details on my website soon):

  • Separate methods for integer p and real p. The integer versions (e.g. supplying p = 3) are an order of magnitude faster than the corresponding real versions (e.g. p = 3.0).
  • Special versions are provided for p = 0, 1, 2 (provided as integers) and for p = Inf.
  • The l1 and l2 distances can be made faster using sum(x .- y), but that allocates memory, and the allocation can be quite large for large vectors. If there is a non-allocating version that does not involve looping, please let me know and I can try it! (mapreduce also allocates and is slow.)
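
One candidate for a non-allocating version without an explicit loop is a lazy generator over zip. This is a minimal sketch and has not been benchmarked against the loop in this PR. The names l1dist_lazy and l2dist_lazy are hypothetical, and the sketch assumes non-empty inputs of equal length with scalar entries.

```julia
# Hypothetical helper: L1 distance via a lazy generator, so the temporary
# array `x .- y` is never materialized.
function l1dist_lazy(x::AbstractVector, y::AbstractVector)
    length(x) == length(y) || throw(DimensionMismatch("x and y must have equal length"))
    return sum(abs(a - b) for (a, b) in zip(x, y))
end

# Same idea for the L2 distance (no protection against overflow).
function l2dist_lazy(x::AbstractVector, y::AbstractVector)
    length(x) == length(y) || throw(DimensionMismatch("x and y must have equal length"))
    return sqrt(sum(abs2(a - b) for (a, b) in zip(x, y)))
end
```

Whether this matches the speed of the explicit loop would need to be measured, for example with BenchmarkTools.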

TODOs:

  • add compat notice (?)
  • add tests
  • use (x - y < tol) for $\ell_0$? or remove $\ell_0$ altogether?
  • add comment to mention the StatsBase implementation which is the source for the loop sum version

Any feedback is welcome!

@ajinkya-k
Contributor Author

@jishnub @StefanKarpinski Please let me know what you think, and whether there is any way I can make it more efficient.

codecov bot commented Mar 7, 2025

Codecov Report

Attention: Patch coverage is 97.36842% with 1 line in your changes missing coverage. Please review.

Project coverage is 91.98%. Comparing base (8cc4216) to head (11df60f).

Files with missing lines Patch % Lines
src/generic.jl 97.36% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1234      +/-   ##
==========================================
- Coverage   91.99%   91.98%   -0.02%     
==========================================
  Files          34       34              
  Lines       15480    15514      +34     
==========================================
+ Hits        14241    14270      +29     
- Misses       1239     1244       +5     

@RoyiAvital

RoyiAvital commented Mar 7, 2025

I'd make the last step, applying ^(1/p), optional with an opt-out.
In many calculations applying the power is not needed (though the result is then not a real norm / distance).
As an example, many use the squared Euclidean distance (which is not really a distance).
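
A minimal sketch of what such an opt-out could look like. The function name lpdist_sketch and the keyword root are hypothetical and are not taken from this PR; the sketch only handles finite p > 0 and scalar entries, and it does not guard against overflow.

```julia
# Hypothetical sketch: `root = false` skips the final ^(1/p) step and returns
# the plain sum of p-th powers (e.g. the squared Euclidean distance for p = 2).
function lpdist_sketch(x::AbstractVector, y::AbstractVector, p::Real; root::Bool=true)
    length(x) == length(y) || throw(DimensionMismatch("x and y must have equal length"))
    (p > 0 && isfinite(p)) || throw(ArgumentError("this sketch only handles finite p > 0"))
    r = 0.0
    for (a, b) in zip(x, y)
        r += abs(a - b)^p
    end
    return root ? r^(1 / p) : r
end
```

With this shape, lpdist_sketch(x, y, 2; root=false) would give the squared Euclidean distance without a separate function.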

@jishnub
Member

jishnub commented Mar 7, 2025

I wonder if this should be a method of norm rather than a separate function that does something similar?

@fredrikekre
Member

I think all of this exists in https://github.com/JuliaStats/Distances.jl ?

@ajinkya-k
Contributor Author

@RoyiAvital great idea.

function _lpunnorm(x::T, y::T, fn) where {T}
    r = 0.0
    for i in eachindex(x,y)
        @inbounds r += fn(x[i] - y[i])
Member

@stevengj stevengj Mar 7, 2025

Note that this algorithm suffers from spurious overflow if the elements of x or y are very large.

e.g. lpdist([1e300],[0.0]) should return 1e300, but this will return Inf.
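
The overflow can be reproduced without the PR's code. The snippet below is only an illustration: it uses a plain accumulate-then-root formula as a stand-in for the loop under review and compares it with norm from LinearAlgebra.

```julia
using LinearAlgebra

x, y = [1e300], [0.0]

# Squaring the difference exceeds floatmax(Float64), so the intermediate
# sum is Inf and the square root of it stays Inf.
naive = sqrt(sum(abs2, x .- y))

# `norm` avoids the spurious overflow and returns a finite value.
safe = norm(x .- y)
```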

Contributor Author

Thanks for the catch! Any pointers on how I can fix this?

Member

@jishnub jishnub Mar 7, 2025

Perhaps divide each vector by the max of the maximum absolute value of each vector before computing the distance, and rescale the result later?
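
A minimal sketch of that rescaling idea for the L2 case, restricted to real vectors. The name l2dist_scaled is hypothetical, and this is not how LinearAlgebra implements norm; it costs extra passes over the data and does not address underflow for very small entries.

```julia
# Hypothetical sketch: scale both vectors by the largest absolute entry,
# accumulate the squared differences, then undo the scaling at the end.
function l2dist_scaled(x::AbstractVector{<:Real}, y::AbstractVector{<:Real})
    length(x) == length(y) || throw(DimensionMismatch("x and y must have equal length"))
    isempty(x) && return 0.0
    s = float(max(maximum(abs, x), maximum(abs, y)))
    # All-zero input, or Inf/NaN entries: the scale itself is the answer.
    (iszero(s) || !isfinite(s)) && return s
    r = 0.0
    for (a, b) in zip(x, y)
        r += abs2(a / s - b / s)
    end
    return s * sqrt(r)
end
```

For x = [1e300] and y = [0.0], the scaled difference is 1.0, so the result is 1e300 rather than Inf.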

Contributor Author

Sounds good.

Member

Note that rescaling is not needed for the L1 norm or the maximum norm.

Note also that you should be using norm, not abs, everywhere, in order to handle vectors whose entries are themselves vectors. In fact, arguably you should be calling the distance function recursively for such cases.

In general, I would study the generic_norm implementation, which already deals with such issues.

Contributor Author

Okay. I intended this to be an implementation for vectors of reals (and maybe complex numbers) only. Would it be better to specialise for real/complex vectors and have a separate implementation for recursive structures?

Member

@stevengj stevengj Mar 10, 2025

See how it's done for norm. There's no need to have two implementations, since norm_sqr just calls abs2 for scalars, and norm for scalars just calls abs.

You should really study the norm implementation and build off that, as otherwise you're re-inventing the wheel here.
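
A minimal sketch of why a single implementation can suffice: because norm of a scalar is its absolute value, calling norm on each elementwise difference covers real, complex, and vector-valued entries with the same code. The name l2dist_generic is hypothetical, the sketch uses only the exported norm, and it omits the overflow handling discussed above.

```julia
using LinearAlgebra

# Hypothetical sketch: `norm(a - b)` reduces to `abs` for scalar entries and
# recurses into the entry when it is itself a vector.
function l2dist_generic(x::AbstractVector, y::AbstractVector)
    length(x) == length(y) || throw(DimensionMismatch("x and y must have equal length"))
    r = 0.0
    for (a, b) in zip(x, y)
        r += norm(a - b)^2
    end
    return sqrt(r)
end
```

The same function then accepts [3.0, 0.0], [3.0 + 4.0im], and [[3.0, 0.0]] as inputs, whereas a version built on abs would throw a MethodError for the nested case.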

@stevengj
Member

stevengj commented Mar 7, 2025

Has there been discussion in the past of whether Distances.jl functionality belongs in the stdlib?

@ViralBShah
Member

I don't recollect such a discussion.

@ajinkya-k
Contributor Author

This came up in a Slack thread. I did mention that l2dist was available in StatsBase, but there seemed to be some interest in having it in LinearAlgebra.

Comment on lines +922 to +924
fn = abs
elseif p == 2
fn = abs2
Member

Should be norm and norm_sqr, respectively.

6 participants