-
-
Notifications
You must be signed in to change notification settings - Fork 14
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP] added definitions for lp-distances. Faster than just using norm #1234
base: master
Are you sure you want to change the base?
Conversation
@jishnub @StefanKarpinski please let me know what you think, and if there is any way i can make it more efficient. |
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #1234 +/- ##
==========================================
- Coverage 91.99% 91.98% -0.02%
==========================================
Files 34 34
Lines 15480 15514 +34
==========================================
+ Hits 14241 14270 +29
- Misses 1239 1244 +5 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
I'd make the last step of applying |
I wonder if this should be a method of |
I think all of this exist in https://github.com/JuliaStats/Distances.jl ? |
@RoyiAvital great idea. |
function _lpunnorm(x::T, y::T, fn) where {T} | ||
r = 0.0 | ||
for i in eachindex(x,y) | ||
@inbounds r += fn(x[i] - y[i]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that this algorithm suffers from spurious overflow if the elements of x
or y
are very large.
e.g. lpdist([1e300],[0.0])
should return 1e300
, but this will return Inf
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the catch! Any pointers on how I can fix this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps divide each vector by the max
of the maximum absolute value of each vector before computing the distance, and rescale the result later?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that rescaling is not needed for the L1 norm or the maximum norm.
Note also that you should be using norm
, not abs
, everywhere, in order to handle vectors whose entries are themselves vectors. In fact, arguably you should be calling the distance function recursively for such cases.
In general, I would study the generic_norm
implementation, which already deals with such issues.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay. I intended this to be an implementation of vectors of reals (and maybe complex) only. Would it be better to specialise for real/complex vectors and have a separate implementation for recursive structures?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See how it's done for norm
. There's no need to have two implementations, since norm_sqr
just calls abs2
for scalars, and norm
for scalars just calls abs
.
You should really study the norm
implementation and build off that, as otherwise you're re-inventing the wheel here.
Has there been discussion in the past of whether Distances.jl functionality belongs in the stdlib? |
I don't recollect such a discussion. |
This came up in a slack thread and I did mention that the l2dist was available in statsbase but there seemed to be some interest in having the l2dist in LinearAlgebra |
fn = abs | ||
elseif p == 2 | ||
fn = abs2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be norm
and norm_sqr
, respectively.
This PR adds functions to compute$L_p$ distance between two vectors. This is faster than just calling
norm(x .- y)
Some notes (I am going to write up the benchmarks and post the details on my website soon):
p
and realp
. Integer versions (e.g. supplyingp = 3
) are an order of magnitude faster than the corresponding real version (egp = 3.0
)p = 0, 1, 2
(provided as integers) and forp = Inf
.sum(x .- y)
but that allocates memory, and the allocation can be quite large for large vectors, if there is any non-allocating version that does not involve looping please let me know I can try it! (mapreduce
also allocates and is slow)TODOs:
(x - y < tol)
forStatsBase
implementation which is the source for the loop sum versionAny feedback is welcome!