Deeply nested structs cause long compilation times #503
Comments
Seconding this. We see extremely high latency when running CI: https://github.com/JuliaHEP/UnROOT.jl/actions/runs/8317241942/job/22757761446#step:6:548
There are some epic types getting compiled here. For example, here is the show method:
That line is 234870 characters.
Perhaps composite Arrow arrays should not carry the types of the underlying data arrays in their own type signature?
They kind of have to: the type tree has to exist somewhere, and in Julia it's natural to let the type tree exist in Julia's type system. But we could have precompiled more in Arrow (for common compositions of primitive types and containers).
You don't have to create an exact Julia type for composite fields. After all, you do not create a Julia type for the table as a whole.
Right, the alternative to using Julia's type system is to use values. Conceptually, instead of a NamedTuple, you just use something like a Dict. In that case people may start to complain that Arrow is slow or type-unstable. I don't know; we need to try it to see the tradeoffs.
That's just not true for a column-oriented data format like Arrow. The whole point of column layout is to make processing fast without knowing the exact data layout during compilation. Otherwise, statically compiled languages like C or Rust wouldn't be able to process Arrow data fast.
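The tradeoff described here can be seen in a minimal sketch using only Base Julia. It assumes the value-based alternative is something like a `Dict{Symbol,Any}`; that choice, and the names `row_typed`, `row_dict`, `get_typed` and `get_dict`, are illustrative and not taken from Arrow.jl.

```julia
# Schema carried in the type: field names and element types are type parameters.
row_typed = (id = 1, energy = 2.5)

# Schema carried in values: one container type for every possible schema.
# Dict{Symbol,Any} is an assumed stand-in for the value-based alternative.
row_dict = Dict{Symbol,Any}(:id => 1, :energy => 2.5)

get_typed(r) = r.energy
get_dict(r) = r[:energy]

# Ask the compiler what it can infer for each accessor.
typed_ret = Base.return_types(get_typed, (typeof(row_typed),))
dict_ret = Base.return_types(get_dict, (typeof(row_dict),))

println(typeof(row_typed))  # the full schema is spelled out in the type
println(typed_ret)          # inferred concretely
println(dict_ret)           # only known to be Any
```

The typed row gives the compiler everything it needs, at the price of a distinct type (and fresh compilation) per schema. The Dict row has one type for all schemas, but each field access is inferred as `Any`.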
It's similar to why naive In C or Rust, you need to know the column type at compile time, you CAN'T just do:
but you can do the moral equivalent of this in Julia. Fundamentally, the type information (schema) of the Arrow table lives as "values" (bytes) on disk; you have to somehow process that info and generate a type tree at run time. Whether that uses Julia's type system or a Dict is an implementation detail. Rust and C still do this kind of thing, but in a more staged fashion and probably more efficiently, whereas Arrow.jl, I think, leans heavily on Julia's type system and composes everything using Julia type composition, which, judging from the result, may not have been smart.
@Moelf, you made a point that IO with Arrow requires materializing exact field types, otherwise it would be slow. This is not so. Rust is capable of reading and writing Arrow data of arbitrary form, and can do it fast despite being a statically compiled language. People write SQL engines over Arrow in Rust, and they certainly don't know the type of each table and each column in advance.

They can do it because in a column-oriented layout any data is ultimately stored as a collection of arrays of primitive types. By reducing any operation on composite data to operations on the underlying primitive arrays, they can process data effectively even if they don't know the exact layout of the composite fields during compilation.

The DataFrames API does it exactly right by making the step of materializing the row type explicit. The user can request it and pay the compilation cost if they want to process a DataFrame as a vector of tuples, but that's their choice. In fact, Arrow.jl works exactly the same, but only with top-level Table objects, and not with fields of composite types. Which is a problem, as my bug report demonstrates.

What I would propose... support two variants of each composite array type: transparent and opaque. The transparent is the current one while the opaque would hide the type of its content. The current definitions almost work, if we represent, say, an opaque Struct array as
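To make the transparent/opaque distinction concrete, here is a minimal sketch in plain Julia. `TransparentStruct`, `OpaqueStruct`, `nest_transparent`, `nest_opaque` and `leafsum` are hypothetical names invented for illustration; they are not Arrow.jl types or functions, and this is not the representation proposed above, whose details are cut off.

```julia
# Hypothetical illustration only; these are not Arrow.jl types.

# Transparent: the child column types are type parameters, so each level
# of nesting yields a new, longer concrete type for the compiler to handle.
struct TransparentStruct{C<:Tuple}
    names::Vector{Symbol}
    columns::C
end

# Opaque: children are stored behind an untyped vector, so the container
# has the same concrete type no matter how deep the nesting is.
struct OpaqueStruct
    names::Vector{Symbol}
    columns::Vector{Any}
end

# Build a struct-of-struct-of-... wrapper around one primitive column.
nest_transparent(col, depth) =
    depth == 0 ? col : TransparentStruct([:x], (nest_transparent(col, depth - 1),))
nest_opaque(col, depth) =
    depth == 0 ? col : OpaqueStruct([:x], Any[nest_opaque(col, depth - 1)])

# Reduce an operation on composite data to the primitive leaf arrays.
# The method for primitive vectors acts as a function barrier: it is
# compiled once per element type, independent of the nesting around it.
leafsum(v::AbstractVector{<:Number}) = sum(v)
leafsum(s::Union{TransparentStruct,OpaqueStruct}) = sum(leafsum, s.columns)

col = [1.0, 2.0, 3.0]
t = nest_transparent(col, 4)
o = nest_opaque(col, 4)

println(typeof(t))   # grows with nesting depth
println(typeof(o))   # same type at any depth
println(leafsum(t) == leafsum(o))
```

In the opaque variant each child lookup costs one dynamic dispatch, but that cost is paid once per column rather than once per element, since the work on the primitive array happens inside the specialized `leafsum` method.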
Arrow tables with deeply nested lists and structs can make Arrow.jl spend too much time in compilation.
I uploaded an empty Arrow table that can trigger this behavior: observation.empty.arrow.gz
A simple program that reads this table and writes it again can take several minutes:
Using Julia 1.10.2 and Arrow 2.7.1.