Releases: Qbeast-io/qbeast-spark

v0.1.0

31 Mar 11:03
26f49ab
v0.1.0 Pre-release

Welcome to the very first release of Qbeast Open Source Format! 😄

Qbeast-Spark is the reference implementation of the Qbeast Format. Currently built on top of Delta Lake, it adds an extra layer for indexing multiple columns of your dataset and for extracting valuable information through a sampling pushdown operator.

This is the first alpha release, presenting all the main features:

  • Documentation for users and developers.
  • Writing to your favorite data storage through Spark.
  • An extension of the Delta Commit Log that adds custom metadata to the file log information.
  • Reading the dataset and performing sampling pushdown on Qbeast Format.

API

Write your data in Qbeast Format:

df.write.mode("overwrite").format("qbeast").option("columnsToIndex", "col_a,col_b").save(tmp_dir)

Load the newly indexed dataset:

val qbeast_df = spark.read.format("qbeast").load(tmp_dir)

Notice how the sampler is converted into filters and pushed down to the source:

qbeast_df.sample(0.1).explain(true)
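
To give an intuition for what such a pushed-down filter could look like, here is a minimal sketch that is not taken from the qbeast-spark code. It assumes that a sample fraction maps linearly onto the signed 32-bit weight range (suggested by the minWeight value of -2147483648 in the Protocol section) and that a file can be skipped when its minWeight lies above the resulting threshold. The names FileWeights, fractionToWeight and filesToRead, as well as the file names, are hypothetical.

```scala
object SamplePushdownSketch {
  // Hypothetical view of the per-file weight tags; not a qbeast-spark class.
  final case class FileWeights(path: String, minWeight: Int, maxWeight: Int)

  // Assumption: a fraction in [0, 1] maps linearly onto the signed 32-bit range.
  def fractionToWeight(fraction: Double): Int = {
    require(fraction >= 0.0 && fraction <= 1.0, "fraction must be in [0, 1]")
    val range = Int.MaxValue.toDouble - Int.MinValue.toDouble
    (Int.MinValue.toDouble + fraction * range).toInt
  }

  // Assumption: a file is relevant when its weight range starts at or below the threshold.
  def filesToRead(files: Seq[FileWeights], fraction: Double): Seq[FileWeights] = {
    val threshold = fractionToWeight(fraction)
    files.filter(_.minWeight <= threshold)
  }

  def main(args: Array[String]): Unit = {
    // Made-up files; the first reuses the weights from the Protocol example.
    val files = Seq(
      FileWeights("root.parquet", Int.MinValue, 462168771),
      FileWeights("child.parquet", 462168771, Int.MaxValue)
    )
    filesToRead(files, 0.1).foreach(f => println(f.path))
  }
}
```

Under these assumptions, a 10% sample only needs the first file, because the second file's minWeight is above the threshold derived from the fraction.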

Protocol

In this first version, you can expect to find the following AddFile information in the Delta Commit Log:

{
  "add" : {
    "..." : {},
    "tags" : {
      "cube" : "A",
      "indexedColumns" : "ss_sales_price,ss_ticket_number",
      "maxWeight" : "462168771",
      "minWeight" : "-2147483648",
      "rowCount" : "508765",
      "space" : "{\"timestamp\":1631692406506,\"transformations\":[{\"min\":-99.76,\"max\":299.28000000000003,\"scale\":0.0025060144346431435},{\"min\":-119998.5,\"max\":359999.5,\"scale\":2.083342013925058E-6}]}",
      "state" : "FLOODED"
    }
  }
}
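
The numbers in the "space" tag can be checked by hand. In the example above, each stored scale equals 1 / (max - min) up to floating-point rounding, which suggests (as an assumption, not a statement of the format specification) that each transformation normalizes a column value linearly into the [0, 1] range. The sketch below uses a hypothetical LinearTransformation class with the values copied from the tag.

```scala
object SpaceTagSketch {
  // Hypothetical mirror of one entry in the "transformations" array.
  final case class LinearTransformation(min: Double, max: Double, scale: Double) {
    // Assumed normalization: shift by min, then multiply by scale.
    def normalize(value: Double): Double = (value - min) * scale
  }

  // Values copied from the "space" tag in the AddFile example.
  val price: LinearTransformation =
    LinearTransformation(-99.76, 299.28000000000003, 0.0025060144346431435)
  val ticket: LinearTransformation =
    LinearTransformation(-119998.5, 359999.5, 2.083342013925058e-6)

  def main(args: Array[String]): Unit = {
    for (t <- Seq(price, ticket)) {
      // Compare the stored scale with the one derived from min and max.
      val derived = 1.0 / (t.max - t.min)
      println(s"stored=${t.scale} derived=$derived")
      // The bounds should land on (approximately) 0 and 1.
      println(s"min -> ${t.normalize(t.min)}, max -> ${t.normalize(t.max)}")
    }
  }
}
```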