Releases: Qbeast-io/qbeast-spark
v0.1.0
Welcome to the very first release of Qbeast Open Source Format! 😄
Qbeast-Spark is the reference implementation of the Qbeast Format. Currently built on top of Delta Lake, qbeast-spark
adds an extra layer that indexes multiple columns of your dataset and lets you extract valuable information through a sampling pushdown operator.
This is the first alpha release, presenting all the main functionalities:
- Documentation for users and developers.
- Writing to your favorite data storage through Spark.
- Extension of the Delta Commit Log to add custom metadata to the file log information.
- Reading the dataset and performing sampling pushdown on the qbeast format.
API
Write your data in Qbeast Format:
df.write.mode("overwrite").format("qbeast").option("columnsToIndex", "col_a,col_b").save(tmp_dir)
Load the newly indexed dataset:
val qbeast_df = spark.read.format("qbeast").load(tmp_dir)
Notice how the sampler is converted into filters and pushed down to the source:
qbeast_df.sample(0.1).explain(true)
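The idea behind the pushdown can be sketched outside Spark. Assume (this is a simplification, not the exact qbeast-spark internals) that every row carries a pseudo-random `Int` weight and that a sample fraction is translated into a threshold on that weight axis; a file whose `minWeight` tag already exceeds the threshold can then be skipped without being read. The names `fractionToWeight` and `canSkipFile` below are illustrative, not part of the library API:

```scala
// Sketch (assumption): a sample fraction f in [0, 1] maps linearly onto the
// Int weight axis, so sampling becomes a "weight <= threshold" filter that
// can prune whole files using the minWeight tag from the commit log.
object SamplePushdownSketch {
  // Size of the Int range used as the weight axis: 2^32 - 1.
  val WeightRange: Long = Int.MaxValue.toLong - Int.MinValue.toLong

  // Map a fraction in [0, 1] to a weight threshold.
  def fractionToWeight(fraction: Double): Int =
    (Int.MinValue + (fraction * WeightRange)).toInt

  // A file can be pruned if even its smallest weight is above the threshold.
  def canSkipFile(minWeight: Int, fraction: Double): Boolean =
    minWeight > fractionToWeight(fraction)

  def main(args: Array[String]): Unit = {
    println(s"sample(0.1) -> weight <= ${fractionToWeight(0.1)}")
    // With minWeight = Int.MinValue, as in the AddFile example of the
    // Protocol section, the file always has to be scanned:
    println(canSkipFile(Int.MinValue, 0.1)) // false
  }
}
```

Under this scheme `sample(0.1)` never reads files whose data is known (from the log tags alone) to fall outside the sampled weight range, which is why the sampler shows up as a pushed-down filter in the plan.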
Protocol
What you can expect to find in the Delta Commit Log for this first version is the following AddFile
information:
{
  "add" : {
    "..." : {},
    "tags" : {
      "cube" : "A",
      "indexedColumns" : "ss_sales_price,ss_ticket_number",
      "maxWeight" : "462168771",
      "minWeight" : "-2147483648",
      "rowCount" : "508765",
      "space" : "{\"timestamp\":1631692406506,\"transformations\":[{\"min\":-99.76,\"max\":299.28000000000003,\"scale\":0.0025060144346431435},{\"min\":-119998.5,\"max\":359999.5,\"scale\":2.083342013925058E-6}]}",
      "state" : "FLOODED"
    }
  }
}
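Reading the `space` tag, each transformation appears to describe a linear map of an indexed column into the unit interval, with `scale = 1 / (max - min)` (for the first entry, `1 / (299.28 - (-99.76))` gives the stored `0.0025060144346431435`). A minimal sketch of that reading, with an illustrative `normalize` helper that is not part of the library API:

```scala
// Sketch (assumption): a "space" transformation linearly rescales an indexed
// column into [0, 1], with scale precomputed as 1 / (max - min).
object SpaceTransformationSketch {
  final case class LinearTransformation(min: Double, max: Double) {
    val scale: Double = 1.0 / (max - min)
    // Normalize a raw column value into the unit interval.
    def normalize(value: Double): Double = (value - min) * scale
  }

  def main(args: Array[String]): Unit = {
    // min/max taken from the ss_sales_price entry in the AddFile tags above.
    val salesPrice = LinearTransformation(-99.76, 299.28000000000003)
    println(salesPrice.scale)             // matches the stored "scale" value
    println(salesPrice.normalize(-99.76)) // 0.0, the lower bound of the space
  }
}
```

This would let a reader reconstruct the indexed space of a revision directly from the commit log, without scanning any data files.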