stats function triage #2200

StefanKarpinski · 2013-02-06T15:41:31Z

I went through base/statistics.jl yesterday, trying to figure out what should stay in base and what should go into a hypothetical new Stats.jl package for basic statistics functionality that doesn't go into base. The reason for this is twofold: some stats stuff probably just doesn't belong in base, and putting it in a package allows us to continue to evolve the stats API with more flexibility while still stabilizing the core language. Here's what I came up with:

keep:
    mean
    median
    var
    std
    hist
    autocor
    cov (pearson; delete the alias)
    cor (pearson; delete the alias)
    quantile
    percentile
    quartile
    quintile
    decile
    iqr

rename:
    dist => distmatrix?
    histc => hist (unify)

move into Stats.jl:
    mad
    skewness
    kurtosis
    tiedrank
    cov_spearman
    cor_spearman
    rle
    irle

Wanted to get this out there so that people more knowledgeable about basic stats usage than me can comment on it.

cc: @dmbates, @johnmyleswhite, @HarlanH.


andreasnoack · 2013-02-06T16:10:16Z

I find the "keep" list reasonable except that I think the number of quantile functions should be limited to quantile only.

In addition to removing some functions, I think we should try to get consistent behaviour of mean, var and median. Right now we have:

julia> X=randn(4,2);

julia> mean(X)
0.4105754895893154

julia> mean(X, 1)
1x2 Float64 Array:
 0.563512  0.257639

julia> mean(X, 2)
4x1 Float64 Array:
 -0.159016
  0.99812 
  0.916519
 -0.113322

julia> var(X)
0.4116875679884982

julia> var(X, 1)
0.8087404291003555

julia> var(X, 2)
3.298853595753349

julia> median(X)
ERROR: no method median!(Array{Float64,2},)
 in median at statistics.jl:25

julia> median(X, 1)
ERROR: no method median(Array{Float64,2},Int64)

I think I prefer apply-style functions, in the spirit of R, for running functions along axes, but I am not sure.
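To make the apply-style option concrete, here is a minimal sketch. It assumes current Julia syntax (the `dims` keyword and the `Statistics` standard library) rather than the 2013 syntax shown in the session above, and the helper name `along` is hypothetical. The point is only that one generic helper would give mean, var and median the same shape conventions.

```julia
using Statistics

# Fixed 4x2 matrix so the results are easy to check by hand.
X = [1.0 2.0; 3.0 4.0; 5.0 6.0; 7.0 8.0]

# Hypothetical helper: apply any reducer f along dimension d of A.
# mapslices keeps the reduced dimension with length 1.
along(f, A, d) = mapslices(f, A; dims=d)

col_means   = along(mean, X, 1)    # 1x2 matrix
col_vars    = along(var, X, 1)     # 1x2 matrix
col_medians = along(median, X, 1)  # 1x2 matrix
row_medians = along(median, X, 2)  # 4x1 matrix
```

With this approach every reducer returns a 1xN result for dimension 1 and an Nx1 result for dimension 2, which is the consistency that var and median lack above.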

StefanKarpinski · 2013-02-06T16:33:49Z

That was actually how I had the list, but Jeff thought it was weird to have quantile in base and then have the one-line definitions of percentile, quintile, quartile, etc. in stats.
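For reference, the one-line definitions in question could look roughly like the sketch below. This assumes current Julia with the `Statistics` standard library, and the `my_` names are hypothetical, chosen so they don't clash with anything that already exists. It is not a copy of what is in base/statistics.jl.

```julia
using Statistics

# Hypothetical one-liners; each is just quantile at a fixed set of probabilities.
my_percentile(v) = quantile(v, (1:99) ./ 100)
my_quartile(v)   = quantile(v, [0.25, 0.5, 0.75])
my_quintile(v)   = quantile(v, (1:4) ./ 5)
my_decile(v)     = quantile(v, (1:9) ./ 10)

v = collect(1.0:9.0)
q = my_quartile(v)  # cut points at 25%, 50% and 75%
```

Since each definition is a thin wrapper, they can live wherever quantile lives without duplicating any logic.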

StefanKarpinski · 2013-02-06T16:36:54Z

We definitely do need to regularize these behaviors too.

ViralBShah · 2013-02-06T16:44:02Z

Definitely get rid of histc.

HarlanH · 2013-02-06T17:14:31Z

Mostly looks good to me. Does autocor belong in base, though?

I do think that the various quantile functions should remain in base, fwiw.


StefanKarpinski · 2013-02-06T17:23:42Z

You're probably right that autocor should move out too. I should mention that moving things to a package can be reversed: we can always put something there for now and then move it back into base once the behavior and API are solidified.

milktrader · 2013-02-06T18:13:20Z

mad includes both mean absolute deviation and median absolute deviation. Any chance to create an alias somewhere for madd(x) = mad(x,1)?

milktrader · 2013-02-06T18:23:34Z

Uhh, disregard the last post. I just figured out I can do this assignment myself. Cool.

johnmyleswhite · 2013-02-07T16:48:27Z

My thoughts:

Yes, let's move autocor() into Stats.jl
Let's rename dist() => distances() rather than distmatrix.
Allow either row or column distances using distances(M, 1) and distances(M, 2)
Keep inverse_rle as irle. We don't call "-" "i+", after all.

I personally agree with Andreas that we should only have quantile in Base, but we have iqr, so there's no harm in other specialized calls to quantile being there as well. It is important, though, that these functions all produce identical values: quantile(v, [0.25, 0.75]) must be iqr(v).

We are also badly missing several really important functions:

range ([min(x), max(x)])
ecdf (The inverse of quantile)

I believe we have the appropriate definition of range in DataFrames that can be copied over. ecdf is tricky because you need to define a step function type first and use some strategy for interpolation, but I find that I use the ecdf function more often than the quantile function.
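As a rough illustration of the range definition above (not the DataFrames code), a sketch in current Julia might be a single pass over the data. The name `minmax_range` is hypothetical; `extrema` is a Base function in current Julia.

```julia
# Hypothetical name; returns [min(x), max(x)] as a two-element vector.
# extrema computes both in one pass and returns a (min, max) tuple.
minmax_range(x) = collect(extrema(x))

r = minmax_range([3, 1, 4, 1, 5])
```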

StefanKarpinski · 2013-02-07T16:57:22Z

ecdf is definitely very important; @timholy's Grid package would be relevant, but seems a bit too big to require just to get ecdf. Note that adding functionality post-0.1 is ok, we just need to do the renaming and deleting now.

johnmyleswhite · 2013-02-07T17:07:05Z

I agree: ecdf isn't a 0.1 issue. Personally, I would say that range is, but I'm biased.

andreasnoack · 2013-02-08T13:37:22Z

I did an ecdf in #2238. I think it is most common not to interpolate in ecdfs, so I didn't do that. The implementation is therefore very simple (but at least fast, I think).
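To show what a non-interpolating ecdf amounts to, here is a minimal sketch in current Julia. It is not the code from #2238, and the name `step_ecdf` is hypothetical. It returns a closure over the sorted data, so no separate step function type is needed.

```julia
# Hypothetical helper: build a step-function ECDF without interpolation.
function step_ecdf(data::AbstractVector{<:Real})
    sorted = sort(data)
    n = length(sorted)
    # searchsortedlast gives the index of the last value <= x,
    # which equals the count of observations <= x (0 if there are none).
    return x -> searchsortedlast(sorted, x) / n
end

F = step_ecdf([3.0, 1.0, 2.0, 2.0])
below  = F(0.5)   # no observations are <= 0.5
middle = F(2.0)   # three of the four observations are <= 2.0
above  = F(10.0)  # all observations are <= 10.0
```

Each evaluation is a binary search on the sorted copy, so the one-time sort is the only expensive step.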

ghost assigned StefanKarpinski Feb 9, 2013

StefanKarpinski closed this as completed in 441b3cb Feb 10, 2013

andreasnoack mentioned this issue Feb 11, 2013

mean function on dimension #2265

Closed
