Welcome to the very first release of Qbeast Open Source Format! 😄

Qbeast-Spark is the reference implementation of Qbeast Format. Currently built on top of Delta Lake, Qbeast-Spark adds an extra layer for indexing multiple columns of your dataset and extracting valuable information through a sampling pushdown operator.

This is the first alpha release, presenting all the main functionalities:

Documentation for users and developers.
Writing to your favorite data storage through Spark.
Extension of the Delta Commit Log for adding custom metadata to the file log information.
Reading the dataset and performing sampling pushdown on Qbeast Format.

API

Write your data in Qbeast Format:

df.write.mode("overwrite").format("qbeast").option("columnsToIndex", "col_a,col_b").save(tmp_dir)

Load the newly indexed dataset:

val qbeast_df = spark.read.format("qbeast").load(tmp_dir)

Notice how the sampler is converted into filters and pushed down to the source:

qbeast_df.sample(0.1).explain(true)
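
To give an intuition for what such a pushed-down filter could look like, here is a minimal sketch in plain Scala with no Spark dependency. It is not the qbeast-spark implementation. It assumes that every record carries an integer weight spread over the full Int range, and that a sample fraction translates into an upper bound on that weight. The names SamplePushdownSketch, maxWeightFor and mayContainSample are hypothetical and exist only for this illustration.

```scala
object SamplePushdownSketch {

  // Hypothetical helper (assumption, not the qbeast-spark API):
  // maps a sample fraction in [0, 1] to an upper weight bound,
  // assuming weights are spread uniformly over the whole Int range.
  def maxWeightFor(fraction: Double): Int = {
    require(fraction >= 0.0 && fraction <= 1.0, "fraction must be in [0, 1]")
    val range = Int.MaxValue.toLong - Int.MinValue.toLong
    (Int.MinValue.toLong + math.round(fraction * range)).toInt
  }

  // Under the same assumption, a file whose smallest weight is above
  // the bound holds no sampled record and could be skipped entirely.
  def mayContainSample(fileMinWeight: Int, fraction: Double): Boolean =
    fileMinWeight <= maxWeightFor(fraction)
}
```

Under these assumptions, a file whose minimum weight is Int.MinValue would be read for any fraction, while a file whose minimum weight lies above the bound would never be opened.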

Protocol

For this first version, you can expect to find the following AddFile information in the Delta Commit Log:

{
  "add" : {
    "..." : {},
    "tags" : {
      "cube" : "A",
      "indexedColumns" : "ss_sales_price,ss_ticket_number",
      "maxWeight" : "462168771",
      "minWeight" : "-2147483648",
      "rowCount" : "508765",
      "space" : "{\"timestamp\":1631692406506,\"transformations\":[{\"min\":-99.76,\"max\":299.28000000000003,\"scale\":0.0025060144346431435},{\"min\":-119998.5,\"max\":359999.5,\"scale\":2.083342013925058E-6}]}",
      "state" : "FLOODED"
    }
  }
}
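
The numbers in the "space" tag above are consistent with scale = 1 / (max - min) for each transformation. The sketch below illustrates one possible reading of them: a value is mapped to the unit interval by subtracting min and multiplying by scale. This is an assumption made for illustration, not a statement of how qbeast-spark applies the transformations. The names SpaceTransformSketch, LinearTransformation and normalize are hypothetical, and so is the pairing of the two transformations with the two indexed columns in order.

```scala
object SpaceTransformSketch {

  // One entry of the "transformations" array in the "space" tag.
  final case class LinearTransformation(min: Double, max: Double, scale: Double) {
    // Assumption: normalize by shifting by min and multiplying by scale,
    // so that min maps to 0.0 and max maps to roughly 1.0.
    def normalize(value: Double): Double = (value - min) * scale
  }

  // Values copied from the AddFile example; the column pairing is assumed.
  val salesPrice: LinearTransformation =
    LinearTransformation(-99.76, 299.28000000000003, 0.0025060144346431435)
  val ticketNumber: LinearTransformation =
    LinearTransformation(-119998.5, 359999.5, 2.083342013925058e-6)
}
```

For example, SpaceTransformSketch.salesPrice.normalize(-99.76) returns 0.0, since the value equals min.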

Releases: Qbeast-io/qbeast-spark

v0.1.0