stats function triage #2200

Closed
StefanKarpinski opened this issue Feb 6, 2013 · 12 comments
Labels
breaking This change will break code

@StefanKarpinski
Member

Went through base/statistics.jl yesterday, trying to figure out what should stay in base and what should go into a hypothetical new Stats.jl package for basic statistics functionality that doesn't go into base. The reason for this is twofold: some stats stuff probably just doesn't belong in base, and putting it in a package allows us to continue to evolve the stats API with more flexibility while still stabilizing the core language. Here's what I came up with:

keep:
    mean
    median
    var
    std
    hist
    autocor
    cov (pearson; delete the alias)
    cor (pearson; delete the alias)
    quantile
    percentile
    quartile
    quintile
    decile
    iqr

rename:
    dist => distmatrix?
    histc => hist (unify)

move into Stats.jl:
    mad
    skewness
    kurtosis
    tiedrank
    cov_spearman
    cor_spearman
    rle
    irle

Wanted to get this out there so that people more knowledgeable about basic stats usage than me can comment on it.

cc: @dmbates, @johnmyleswhite, @HarlanH.

@andreasnoack
Member

I find the "keep" list reasonable except that I think the number of quantile functions should be limited to quantile only.

In addition to removing some functions, I think we should try to get consistent behaviour of mean, var and median. Right now we have:

julia> X=randn(4,2);

julia> mean(X)
0.4105754895893154

julia> mean(X, 1)
1x2 Float64 Array:
 0.563512  0.257639

julia> mean(X, 2)
4x1 Float64 Array:
 -0.159016
  0.99812 
  0.916519
 -0.113322

julia> var(X)
0.4116875679884982

julia> var(X, 1)
0.8087404291003555

julia> var(X, 2)
3.298853595753349

julia> median(X)
ERROR: no method median!(Array{Float64,2},)
 in median at statistics.jl:25

julia> median(X, 1)
ERROR: no method median(Array{Float64,2},Int64)

I think I prefer R-style apply functions for running these functions along axes, but I am not sure.
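To make the "apply along an axis" idea concrete, here is a minimal sketch. It assumes a current Julia release, where the dimension is passed with the `dims` keyword rather than positionally as in the session above, and where `mean`, `var` and `median` live in the `Statistics` standard library. The helper name `along` is hypothetical and not part of Base.

```julia
using Statistics  # standard library in current Julia releases

# Hypothetical helper (not a Base function): apply `f` to each slice of `X` along `dim`.
along(f, X, dim) = mapslices(f, X, dims=dim)

X = [1.0 2.0; 3.0 4.0; 5.0 6.0; 7.0 8.0]   # 4x2, same shape as the example above

col_means   = along(mean, X, 1)     # 1x2: one value per column
col_vars    = along(var, X, 1)      # 1x2: one value per column
row_medians = along(median, X, 2)   # 4x1: one value per row
```

With a single generic helper like this, every reducing function gets the same shape convention for free, which is the consistency the session above is missing.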

@StefanKarpinski
Member Author

That was actually how I had the list, but Jeff thought it was weird to have quantile in base and then have the one-line definitions of percentile, quintile, quartile, etc. in stats.

@StefanKarpinski
Member Author

We definitely do need to regularize these behaviors too.

@ViralBShah
Member

Definitely get rid of histc.

@HarlanH
Contributor

HarlanH commented Feb 6, 2013

Mostly looks good to me. Does autocor belong in base, though?

I do think that the various quantile functions should remain in base, fwiw.


@StefanKarpinski
Member Author

You're probably right that autocor should move out too. I should mention that moving things to a package can be reversed: we can always put something there for now and then move it back into base once its behavior and API are solidified.

@milktrader
Contributor

mad includes both mean absolute deviation and median absolute deviation. Any chance to create an alias somewhere for madd(x) = mad(x,1)?

@milktrader
Contributor

Uhh, disregard the last post. I just figured out I can do this assignment myself. Cool.

@johnmyleswhite
Member

My thoughts:

  • Yes, let's move autocor() into Stats.jl
  • Let's rename dist() => distances() rather than distmatrix.
  • Allow either row or column distances using distances(M, 1) and distances(M, 2)
  • Keep inverse_rle as irle. We don't call - i+ after all.
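A rough sketch of what the proposed distances(M, 1) and distances(M, 2) could look like follows. Everything here is an assumption for illustration: the thread does not fix the metric or which dimension means what, so this version uses Euclidean distance, treats dim = 1 as "compare rows" and dim = 2 as "compare columns", and is written against current Julia rather than the 2013 API.

```julia
# Hypothetical sketch of the proposed distances(M, dim); not an existing Base function.
function distances(M::AbstractMatrix, dim::Integer)
    dim in (1, 2) || throw(ArgumentError("dim must be 1 or 2"))
    # Assumed convention: dim = 1 compares rows, dim = 2 compares columns.
    items = dim == 1 ? collect(eachrow(M)) : collect(eachcol(M))
    n = length(items)
    D = zeros(Float64, n, n)
    for i in 1:n, j in 1:n
        # Euclidean distance between the i-th and j-th row (or column).
        D[i, j] = sqrt(sum(abs2, items[i] .- items[j]))
    end
    return D
end

M = [0.0 0.0; 3.0 4.0]
row_dists = distances(M, 1)   # 2x2 matrix of distances between rows
col_dists = distances(M, 2)   # 2x2 matrix of distances between columns
```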

I personally agree with Andreas that we should only have quantile in Base, but we have iqr, so there's no harm in other specialized calls to quantile being there as well. It is important, though, that these functions all produce identical values: quantile(v, [0.25, 0.75]) must be iqr(v).
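The consistency requirement is easiest to meet if every specialized function is a one-line wrapper around quantile, as in the sketch below. It assumes current Julia with the `Statistics` standard library. The `_sketch` names are hypothetical, chosen so they do not clash with any real function, and `iqr_sketch` follows the convention stated in this thread (returning the two quartiles) rather than their difference.

```julia
using Statistics  # provides quantile in current Julia releases

# Hypothetical one-line wrappers; each one delegates to quantile,
# so they cannot drift apart from it.
quartile_sketch(v) = quantile(v, [0.25, 0.5, 0.75])
decile_sketch(v)   = quantile(v, 0.1:0.1:0.9)
iqr_sketch(v)      = quantile(v, [0.25, 0.75])   # convention used in this thread

v = collect(1.0:9.0)
iqr_sketch(v) == quantile(v, [0.25, 0.75])   # true by construction
```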

We are also badly missing several really important functions:

  • range ([min(x), max(x)])
  • ecdf (The inverse of quantile)

I believe we have the appropriate definition of range in DataFrames that can be copied over. ecdf is tricky because you need to define a step function type first and use some strategy for interpolation, but I find that I use the ecdf function more often than the quantile function.
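For reference, the range function described above is tiny. The sketch below is not the DataFrames definition, only an illustration in current Julia, where the reductions are spelled `minimum` and `maximum` and where `range` itself now names a different Base function; hence the hypothetical name `valuerange`.

```julia
# Hypothetical name: the thread calls this `range`, which means something else in Base today.
valuerange(x) = [minimum(x), maximum(x)]

x = [3.5, -1.0, 2.0]
valuerange(x)          # two-element vector: smallest and largest value
collect(extrema(x))    # the same pair via Base's single-pass extrema
```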

@StefanKarpinski
Member Author

ecdf is definitely very important; @timholy's Grid package would be relevant, but seems a bit too big to require just to get ecdf. Note that adding functionality post-0.1 is ok, we just need to do the renaming and deleting now.

@johnmyleswhite
Member

I agree: ecdf isn't a 0.1 issue. Personally, I would say that range is, but I'm biased.

@andreasnoack
Member

I did an ecdf in #2238. I think it is most common not to interpolate in ecdfs, so I didn't do that in the implementation, which is therefore very simple (but at least fast, I think).
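A non-interpolating ecdf can be sketched in a few lines. This is not the code from #2238, only an illustration of the step-function idea under the assumption of current Julia: sort the sample once, then answer each query with a binary search. The name `ecdf_sketch` is hypothetical.

```julia
# Hypothetical sketch of a non-interpolating ecdf; not the implementation from #2238.
function ecdf_sketch(sample)
    sorted = sort(sample)
    n = length(sorted)
    # Return a closure: the fraction of observations less than or equal to x.
    # searchsortedlast gives the index of the last element <= x (0 if there is none).
    return x -> searchsortedlast(sorted, x) / n
end

F = ecdf_sketch([3.0, 1.0, 2.0, 2.0])
F(0.5)    # below every observation
F(2.0)    # three of the four observations are <= 2.0
F(10.0)   # above every observation
```

Sorting costs O(n log n) once and each evaluation is O(log n), which matches the "simple but fast" trade-off described above.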
