Deserialization as Vector{SubArray} breaks push! on DataFrame #506
Comments
The workaround is to ask DataFrames to copy the columns: DataFrame(Arrow.Table("/tmp/test.arrow"); copycols=true)
The reason for the current behavior is:
(not saying it is ideal, just how/why we got here)
From the perspective of Arrow, a [...] Now, if it actually used that, the [...] Btw, if you're interested in a fully systematic way of dealing with Arrow-like schemas, https://github.com/JuliaHEP/AwkwardArray.jl is something we're prototyping.
I don't think that's really accurate; the issue isn't the in-memory layout, it's that
When there's compression involved it won't be purely mmapped. In general I agree; I'm saying that if the resultant table used that, it would have worked. But it's likely immutable out of the gate, however we implement it.
Right, I'm not saying it's always mmap'd, that was an example, but I'm saying
Thanks for the quick comments!
Hmm, I don't see any effect of that here:
julia> typeof(df.foo)
Vector{Vector{Int64}} (alias for Array{Array{Int64, 1}, 1})
julia> Arrow.write("/tmp/test.arrow", df);
julia> df2 = DataFrame(Arrow.Table("/tmp/test.arrow"); copycols=true);
julia> typeof(df2.foo)
Vector{SubArray{Int64, 1, Primitive{Int64, Vector{Int64}}, Tuple{UnitRange{Int64}}, true}} (alias for Array{SubArray{Int64, 1, Arrow.Primitive{Int64, Array{Int64, 1}}, Tuple{UnitRange{Int64}}, true}, 1})
The snippet you posted is a little ambiguous, but additionally calling DataFrame or copy with copycols=true does not change the result:
julia> df2 = DataFrame(DataFrame(Arrow.Table("/tmp/test.arrow")); copycols=true);
julia> typeof(df2.foo)
Vector{SubArray{Int64, 1, Primitive{Int64, Vector{Int64}}, Tuple{UnitRange{Int64}}, true}} (alias for Array{SubArray{Int64, 1, Arrow.Primitive{Int64, Array{Int64, 1}}, Tuple{UnitRange{Int64}}, true}, 1})
julia> df2 = copy(DataFrame(Arrow.Table("/tmp/test.arrow")); copycols=true);
julia> typeof(df2.foo)
Vector{SubArray{Int64, 1, Primitive{Int64, Vector{Int64}}, Tuple{UnitRange{Int64}}, true}} (alias for Array{SubArray{Int64, 1, Arrow.Primitive{Int64, Array{Int64, 1}}, Tuple{UnitRange{Int64}}, true}, 1})
Oh, I misunderstood: it's inside a nested vector. I guess copying those would do it?
df = DataFrame(Arrow.Table("/tmp/test.arrow"); copycols=true);
transform!(df, :foo => ByRow(copy) => :foo)
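A minimal sketch of the per-row copy suggested above, without involving Arrow at all. It assumes that plain views into an ordinary Vector are a reasonable stand-in for the SubArray elements Arrow.jl produces; the buffer name backing and the sample values are hypothetical, and only the column name foo comes from the thread.

```julia
using DataFrames

# Hypothetical stand-in for an Arrow-backed column: each row is a view
# (SubArray) into one shared buffer rather than an independent Vector.
backing = collect(1:6)
df = DataFrame(foo = [view(backing, 1:3), view(backing, 4:6)])
@show eltype(df.foo) <: SubArray

# Materialize every row: copy on a SubArray allocates a plain Array.
transform!(df, :foo => ByRow(copy) => :foo)
@show eltype(df.foo)

# The column now holds ordinary vectors, so an ordinary vector can be appended.
push!(df, (foo = [7, 8, 9],))
@show nrow(df)
```

For a table read through Arrow.Table, the same transform! call would be applied after constructing the DataFrame. The thread does not confirm whether this resolves the reported push! failure.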
I'm using Arrow v2.7.2 with DataFrames v1.6.1 on Julia 1.10, and am running into an issue that seems to stem from Arrow.jl deserializing my Vector{Vector{T}} columns as Vector{SubArray{...}}:
This breaks certain push!es on the dataframe, which I haven't been able to reproduce in isolation, but which looks as follows:
It's possible I'm doing something wrong; first time Arrow.jl user here.