Add groupby to the standard lib #32331
Comments
See https://github.com/JuliaData/SplitApplyCombine.jl. I'm not sure it adds anything to have it in Base: it's better to have more complete packages with a series of related functions. |
Yes, this is unlikely to be added to Julia itself. |
Natural languages and computer languages alike follow a power law: a big head and a long tail (20/80, Pareto, exponential, etc.). 80% of use cases are covered by just 20% of features, and the split is probably even more extreme for programming languages.
That's why there's no need for a "more complete package": those "series of related functions" are the long tail that you never need. And 36 stars on that package |
What would you propose the Base A major concern is that you want to use the name |
Do you mean a for loop like that? Or is there some shorter version?
Yes, that, or some similar data structure like
Here is how it actually looked. I needed to group the list, and typed There are 2 things that I don't understand with that reasoning:
|
Reopening for discussion. The first point about Python doesn’t make sense to me: there’s absolutely no reason Python needs data frames to implement an operation that takes a collection and returns a dictionary of arrays (or lists as Python calls them). Searching “Python groupby” gives “pandas.DataFrame.groupby” and there is no built-in group by operation. So we are in exactly the same state of affairs as Python, which suggests that it’s not that unreasonable. That said, groupby is a pretty handy operation and slightly annoying to write out, so maybe it would be reasonable to have it.
Not at all. The core should implement what is necessary to have in the core and hard or awkward to have in packages. There must be a good reason to be in the core language. Any complex functionality that can be removed into a package should be. Being stuck in the core language is where things go to fossilize. However, the proposed groupby functionality is clear and simple enough that it might be acceptable to fossilize it.
That depends on if they implement compatible signatures and concepts that are truly the same. If the DataFrames.groupby operation ever wanted to have a method that took an arbitrary collection and produced DataFrames, then that would directly conflict with the proposed groupby function. So multiple dispatch doesn’t necessarily solve this problem and one does have to think—generally pretty carefully—about what a generic function really means. |
FWIW groupby is in the Python stdlib: https://docs.python.org/3/library/itertools.html#itertools.groupby. Though I'm not sure the Python stdlib is the muse |
I see your point, makes sense. I guess we have slightly different goals. I prefer a unified solution that is ready out of the box, and you prefer something more like a construction kit for assembling the custom solution you want. |
I think it would be OK for DataFrames to import and override Cc: @andyferris |
IterTools.jl has it In general |
Julia's way of placing arguments follows the reading of the function's name: |
FWIW I've wanted this quite often. An argument for having it in Base is that this would emphasize having a common syntax and intuition. I think exactly the argument that it is challenging to come up with a general API is a good reason for trying to do so, aligning the API across the many (IterTools, Query, DataFrames, SplitApplyCombine) that have it. |
Yes, I think having a single obvious way of grouping data would be better than having a smattering of implementations with slightly different semantics. But what should that shared semantic be? understand the semantics in e.g. the OP and I did think there may be a path to combine the semantics of But ignoring completely tables for a minute, I think obvious and common operations on/between arrays and dictionaries totally belong in |
The groupby operation in Query.jl returns an iterator of |
It would be great to find a common interface, but I think it should first live in a package, and only when everybody is relatively OK with it should it be moved to Base (if at all). |
groupby in other languages (Rust, Python, Ruby, etc.) is one of the least harmonized definitions among higher-order functions. From datastore to datastore, there are some mismatches too. Not so simple. Edited: after a bit of investigation as suggested by oxinabox, a consensus may have emerged in favor of SQL. Good god. |
Itertools groupby only groups consecutive values eg. is rather useless. |
I wouldn't say it is useless
idk about Ruby's. As you said, this is one of the least harmonized definitions. |
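To make the two semantics being contrasted here concrete, the following is a minimal sketch using only Base. The names `consecutive_groups` and `total_groups` are hypothetical and exist only for this illustration; neither is an API of Base, IterTools.jl, or Python.

```julia
# Hypothetical helper: group only adjacent elements that share a key,
# in the style of Python's itertools.groupby.
function consecutive_groups(f, itr)
    out = Pair{Any,Vector{Any}}[]
    for x in itr
        k = f(x)
        if !isempty(out) && isequal(out[end].first, k)
            push!(out[end].second, x)   # same key as the previous element: extend the run
        else
            push!(out, k => Any[x])     # key changed: start a new run
        end
    end
    return out
end

# Hypothetical helper: group all elements that share a key, wherever they occur.
function total_groups(f, itr)
    out = Dict{Any,Vector{Any}}()
    for x in itr
        push!(get!(() -> Any[], out, f(x)), x)
    end
    return out
end

xs = [1, 3, 2, 4, 5]
runs = consecutive_groups(isodd, xs)   # three runs: odd, even, odd
groups = total_groups(isodd, xs)       # two groups, keyed by true and false
```

The consecutive version needs no lookup structure and can work on a stream, while the total version has to remember every key it has seen, which is the trade-off the comments above are circling.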
I vote +1 and -1 to your post :) First, let's forget itertools, not a stdlib IMHO. Consecutive grouping caps
Total grouping caps According to this sample, a consensus seems to emerge in favor of grouping caps. The problem, however, is that group-by forces materialization of the data, even requiring intermediate collections. There will always be someone who needs another version of array. |
The main issue at this point is harmonizing with the DataFrames definition in such a way that we don't block DataFrames from using the same name. In other words, we need a generic meaning for groupby that: (a) can be implemented in base, operating on generic iterables, returning a dict of vectors; and (b) agrees with the DataFrames definition of groupby, so that it's legitimate for DataFrames to extend base's groupby function—i.e. they fundamentally do the same thing. |
It is not clear to me that the version that operates on iterables should return a dict. I think the Query/LINQ approach that returns a |
That's good feedback. In that case this should probably not be in Base. |
Yes, DataFrames and Query both return an iterable of groups, which has the advantage that you can aggregate it without necessarily allocating a copy of all grouped rows. For simple reductions like summing, this lazy operation can improve performance a lot as you can even avoid computing the permutation that sorts entries by groups, and instead compute the group sums in one pass over the data. I wonder to what extent that approach makes sense for single vectors, as the cost of allocating copies is lower compared to a multi-column table -- and the vector of group indices takes as much space as the original data. |
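The one-pass reduction described above can be sketched as follows. This is only an illustration under stated assumptions: `groupreduce` is a hypothetical name, not a function from Base, DataFrames, or Query, and the `Dict{Any,Any}` accumulator is chosen for brevity rather than performance.

```julia
# Hypothetical helper: fold each group with `op` in a single pass,
# without sorting the input or storing the members of any group.
function groupreduce(by, op, itr)
    acc = Dict{Any,Any}()
    for x in itr
        k = by(x)
        # First element of a group seeds the accumulator; later ones are folded in.
        acc[k] = haskey(acc, k) ? op(acc[k], x) : x
    end
    return acc
end

data = [3, 10, 9, 4, 6, 3, 2, 9, 8, 5]
sums = groupreduce(x -> mod(x, 3), +, data)
```

Only one accumulator per key is kept, so memory grows with the number of groups rather than with the length of the input.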
I commented it in JuliaData/DataFrames.jl#1256 (comment) but I find quite useful sometimes that Also, I think using non-lazy group vectors is necessary for unindexable iterators. In particular, the input iterator may be stateful (e.g.,
I suppose this may not be the case for vectors of big named tuples or structs? Due to these points, I think |
FWIW I think it would be great to have a Here's one implementation I've been playing around with. It's heavily inspired by @andyferris's SplitApplyCombine, but returns a It's also implemented in terms of
using IterTools: IterTools
"""
    groupby(by=identity, mapf=identity, groupf=Pair, x)
Group `x` by `by.(x)`, applying `mapf` to each element within the group, and then
calling `groupf(key, values)` on each group.
Example:
    julia> data = [
        (foo=1, bar=2, baz=3),
        (foo=2, bar=1, baz=2),
        (foo=1, bar=10, baz=10)]
    julia> gs = groupby(r->r.foo, r->r.baz, data)
    2-element Array{Pair{Int64,Array{Int64,1}},1}:
     1 => [3, 10]
     2 => [2]
"""
@inline groupby(x) = groupby(identity, x)
@inline groupby(by, x) = groupby(by, identity, x)
@inline groupby(by, mapf, x) = groupby(by, mapf, Pair, x)
function groupby(by, mapf, groupf, x)
    groups = IterTools.groupby(by, sort(x; by=by))
    map(g->groupf(by(g[1]), map(mapf, g)), groups)
end |
@ssfrr I have relied on I see good reason to add To put something into Base means that Base has been .. in that specific manner .. lacking or wrongly unsupportive of _ and is only best served by the inclusion of _. Is |
Yeah, you're right, a stdlib is better, I didn't think that through very carefully. In my mind it's closely related to |
By the way, if we were to add
If not returning a dictionary, I suggest returning a lazy iterator and making the lifetime and ownership of each element clear in the API, to leave room for optimizations (i.e., we can return a view or reuse the buffer/output vector in the
const partitionby = IterTools.groupby
groupby(by, x) = (by(g[1]) => g for g in partitionby(by, sort(x, by = by)))
Even though this is a one-liner, it can be optimized more (e.g., specialize Alternatively, maybe it's better to use
struct Groups
    by
    partitions
end
iterate(gs::Groups, ...) = ... # forward to partitions
groupby(by, x) = Groups(by, partitionby(by, sort(x, by = by)))
pairs(gs::Groups) = (by(g[1]) => g for g in gs.partitions)
I think |
I think it would be more reasonable to start shipping an implementation in a package. Then if everybody agrees it's a good API, it could be moved to Base or the stdlib. |
Given that this is the topic of the entire thread, with many good arguments above, this seems quite thin. Could you phrase your point so it addresses the arguments above? |
It's just that there's already one implementation in SplitApplyCombine, and that other implementations are being proposed here, so it's already not clear which API is best nor whether it will suit all needs. And even @tkf's nice proposal above seems uncertain about the exact type to return. Anyway, if people want to use that function soon, it had better live in a package, otherwise you won't be able to use it until Julia 1.5 at best. |
I agree that clojure's It seems that a non-sorted (or otherwise indexed) collection can't be grouped with no memory overhead so should be dealt with outside of And in case this wasn't clear - everyone is welcome to treat As for the discussion above about whether returning a dictionary or an array or something simpler is best, or something that iterates groups or key-group pairs - that seems very interesting to me! I always felt we should double-down on
Yes, I strongly agree, it seems essential that whatever we do that there is a way of doing reductions over the groups without sorting or excess allocations, etc. |
The problem there is that it could be 1/ very enviable at first, 2/ then very badly implemented in the long run. One example: the (10 different) filter semantics of pandas vs the (one and universal) All this stuff is a matter of relational calculus / relational algebra, not data structure. Particularly Fail to rely on that strong foundational work, and too much time (e.g. years) may be, and will be, spent examining cosmetic and/or dysfunctional options. |
I have been looking for this for the past hour. I went through the same steps as @al6x:
Much has been said already but (unless I've missed it) this one's new on the table: We already have |
|
I didn't imply that it wasn't; and in fact (or at least my opinion), longer the |
Saying that it's odd that it doesn't exist in Base doesn't really get us anywhere without addressing any of the questions about what its definition should be further up in the issue, which still aren't addressed. What should |
Sorry, I thought it would help you out on deciding where to put it (=Base, contrary to your comment from ~1.5 years ago) once the rest is settled. I also thought that it could give this issue a kick, present a motivation to actually proceed with it. As for your question, I don't have any _should_s, but just personal _opinion_s: It could mirror the
julia> unique(x -> mod(x, 0:2), [3, 10, 9, 4, 6, 3, 2, 9, 8, 5])
3-element Vector{Int64}:
3
10
2
julia> groupby(x -> mod(x, 0:2), [3, 10, 9, 4, 6, 3, 2, 9, 8, 5])
3-element Vector{Vector{Int64}}:
[3, 9, 6, 3, 9]
[10, 4]
[2, 8, 5]
As a consequence, |
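A minimal sketch of a grouping that mirrors `unique(f, itr)` in this way is given below. It assumes groups are returned in order of first appearance and that elements keep their input order within each group; the name `groupby_ordered` is hypothetical and chosen so as not to clash with any existing `groupby`.

```julia
# Hypothetical helper: return a vector of groups, ordered by first appearance of each key.
function groupby_ordered(f, itr)
    index = Dict{Any,Int}()                  # key => position of its group in `groups`
    groups = Vector{Vector{eltype(itr)}}()
    for x in itr
        i = get!(index, f(x)) do
            push!(groups, eltype(itr)[])     # first time this key is seen: open a new group
            length(groups)
        end
        push!(groups[i], x)
    end
    return groups
end

data = [3, 10, 9, 4, 6, 3, 2, 9, 8, 5]
groups = groupby_ordered(x -> mod(x, 3), data)

# The first element of each group is exactly what `unique` keeps for the same key function.
map(first, groups) == unique(x -> mod(x, 3), data)
```

The auxiliary `index` dictionary is what buys the first-appearance ordering; a plain `Dict` of groups would lose it.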
That seems like a sane behavior for groupby on a vector. Then the question becomes: does it match what |
As far as I can tell, there isn't a
julia> using DataFrames
julia> df = DataFrame((;a=mod(b, 0:2), b) for b=[3, 10, 9, 4, 6, 3, 2, 9, 8, 5])
10×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 0 3
2 │ 1 10
3 │ 0 9
4 │ 1 4
5 │ 0 6
6 │ 0 3
7 │ 2 2
8 │ 0 9
9 │ 2 8
10 │ 2 5
julia> map(df -> df[:, :b], collect(groupby(df, :a)))
3-element Vector{Vector{Int64}}:
[3, 9, 6, 3, 9]
[10, 4]
[2, 8, 5]
or perhaps
julia> map(df -> df.b, collect(groupby(df, :a)))
3-element Vector{SubArray{Int64, 1, Vector{Int64}, Tuple{Vector{Int64}}, false}}:
[3, 9, 6, 3, 9]
[10, 4]
[2, 8, 5]
I think the definition in #32331 (comment) is what I would like, but I do not see a good way to reconcile it with |
Grouping is something that makes sense - and is extremely useful - in the absence of dataframes. I think it should be in To me it's important that you can look up the groups by key, like in some kind of dictionary or dictionary-like container. Here's a quick-and-dirty implementation for the earlier example:
julia> function groupby(f, itr)
           out = Dict{Core.Compiler.return_type(f, Tuple{eltype(itr)}), Vector{eltype(itr)}}()
           for x in itr
               key = f(x)
               group = get!(Vector{eltype(itr)}, out, key)
               push!(group, x)
           end
           return out
       end
groupby (generic function with 1 method)
julia> groupby(x -> mod(x, 0:2), [3, 10, 9, 4, 6, 3, 2, 9, 8, 5])
Dict{Int64, Vector{Int64}} with 3 entries:
0 => [3, 9, 6, 3, 9]
2 => [2, 8, 5]
1 => [10, 4]
(To be clear I am specifically proposing we add a cleaned-up version of the above to I'm not sure that the interface should require that you return an actual |
Part of the problem with @ThoAppelsin's version is efficiency. If you happen to want to look up groups by keys, you need to call There's lots of other interesting efficiency-related stuff in this space (fast group reductions, speedups for when the input is already sorted by the grouping key, etc). |
In DataFrames, keys are DataFrames doesn't currently allow specifying a transformation to apply to rows before grouping, but that should be a relatively natural extension using the Regarding the implementation, I'm not sure returning a |
@andyferris I
julia> unique(x -> mod(x, 0:2), [3, 10, 9, 4, 6, 3, 2, 9, 8, 5])
Dict{Int64, Int64} with 3 entries:
0 => 3
2 => 2
1 => 10
I personally am fine with it, though. |
@ThoAppelsin |
There's nothing that says that
[ first(vals) for (key, vals) in groupby(f, itr) ] == unique(f, itr)
The only argument that needs to be made here is that one typically wants the |
A request from DataFrames.jl is. If this method is added to Base with
(this is probably easy to ensure, but I wanted to be explicit) To make it what @nalimilan said complete: we return
|
This is true - though we have the same problem with composing I'm not aware of a lazy |
Serious work will require first specifying your expectations about the sortedness and uniqueness of your data, and, better still, defining an index on them. That may be too much for a first implementation in Base. +1 for an eager family and a lazy one in Iterators. |
Agreed. Also column type and element type should be taken into account. However this can be done by the packages. E.g. PooledArrays.jl could opt-in to optimally handle grouping by it. Practice shows that sortedness has a relatively greater impact on performance than uniqueness. Fortunately in typical cases checking |
I'm not completely sure that what I'm describing qualifies as "lazy" in the
What DataFrames does is to compute eagerly only the index of the group to which each observation belongs. Then depending on what you do, it populates more fields lazily when they are accessed (that includes a |
Indeed, in order to use that API you have to sort by unique key first, which turns an O(n) hash based algorithm into an O(n log n) sorting based algorithm. Oops. |
And in general keys do not have to be sortable. |
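As a small illustration of the last two points, the sketch below groups by keys that can be hashed but not ordered. `hash_groups` is a hypothetical name for a plain dictionary-based grouping, and complex numbers are used only because Base defines `hash` for them but not `isless`.

```julia
# Hypothetical helper: hash-based grouping, one pass, no ordering required on the keys.
function hash_groups(by, itr)
    out = Dict{Any,Vector{eltype(itr)}}()
    for x in itr
        push!(get!(() -> eltype(itr)[], out, by(x)), x)
    end
    return out
end

zs = [1 + 0im, 0 + 1im, -1 + 0im, 2 + 0im]
groups = hash_groups(z -> z^2, zs)     # works: complex keys can be hashed

# The sort-then-partition route needs `isless` on the keys, which Complex lacks,
# so the same key function cannot be used for sorting.
sort_fails = try
    sort(zs; by = z -> z^2)
    false
catch err
    err isa MethodError
end
```

So a sort-based definition would both cost more and accept fewer key types than a hash-based one.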
Yeah I was going to bring up this point, too -
Haha, I was familiar with these and Clojure's
julia> Iterators.partition
partition (generic function with 1 method)
help?> Iterators.partition
partition(collection, n)
Iterate over a collection n elements at a time.
Examples
≡≡≡≡≡≡≡≡≡≡
julia> collect(Iterators.partition([1,2,3,4,5], 2))
3-element Vector{SubArray{Int64, 1, Vector{Int64}, Tuple{UnitRange{Int64}}, true}}:
[1, 2]
[3, 4]
[5]
So, should we also consider a
julia> collect.(Iterators.partition(iseven, [1,3,5,2,4]))
2-element Vector{Vector{Int64}}:
[1, 3, 5]
[2, 4] |
As a newcomer to Julia, I was quite surprised to not see this function directly available. It is as useful as |
Was surprised it's missing, really useful function, something like