Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Which algorithm? Or incorrect results? #22

Closed
orbisvicis opened this issue Dec 18, 2021 · 2 comments
Closed

Which algorithm? Or incorrect results? #22

orbisvicis opened this issue Dec 18, 2021 · 2 comments

Comments

@orbisvicis
Copy link

orbisvicis commented Dec 18, 2021

Which algorithm is this? The heuristics Jenks, the total minimum distance Jenks, or the optimized and total minimum distance Fischer-Jenks? Yours deviates from my pure-python Jenks implementation and I suspect off-by-one errors:

import statistics

def distance(l, m):
    if not l:
        raise ValueError
    return sum((i-m)**2 for i in l)

ls = [0, 1, -2]
b = 2
jpc = break_list(ls, jenkspy.jenks_breaks(ls, b))
jmc = jenks(ls, b)[1]
jpd = [distance(c, statistics.mean(c)) for c in jpc]
jmd = [distance(c, statistics.mean(c)) for c in jmc]
jps = sum(jpd)
jms = sum(jmd)

print("jenkspy:")
print(f"\tsum distance: {jps}")
print(f"\tdistances:    {jpd}")
print(f"\tclusters:     {[sorted(c) for c in jpc]}")
print()
print("pure-python:")
print(f"\tsum distance: {jms}")
print(f"\tdistances:    {jmd}")
print(f"\tclusters:     {[sorted(c) for c in jmc]}")
jenkspy:
        sum distance: 2
        distances:    [2, 0]
        clusters:     [[-2, 0], [1]]

pure-python:
        sum distance: 0.5
        distances:    [0, 0.5]
        clusters:     [[-2], [0, 1]]

Even intuitively the jenkspy result is incorrect. My implementation and unit tests are here, see jenks and test_jenks_breaks.

Resources:

[1] Is Jenks' Natural Breaks deterministic?
[2] Fisher's Natural Breaks Classification complexity proof
[3] Re: Jenk's Optimization for Natural Breaks Classification

@mthh
Copy link
Owner

mthh commented Aug 16, 2022

Thanks a lot for the report, the links and the test cases.

I'm working on it and already fixed some errors in my implementation in 30abfe4 and 886c567 (as such my jenks_breaks function now returns the correct bounds for your example).

@mthh
Copy link
Owner

mthh commented Aug 18, 2022

I updated the README, the docstrings and the package metadata to be more precise about the algorithm in use. The returned results should now be correct.

I am preparing a notebook with comparaison of various implementations, i will post the link here if you are interested.

@mthh mthh closed this as completed Aug 18, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants