MobileTeleSystems/data-rentgen

What is Data.Rentgen?

Data.Rentgen is a Data Motion Lineage service compatible with the OpenLineage specification.

Note: the service is under active development and is not ready for use yet.

Goals

  • Collect lineage events produced by OpenLineage clients & integrations.
  • Store operation-grained events for finer detail (unlike Marquez, which is job-grained).
  • Provide an API for fetching job/run ↔ dataset lineage, rather than the dataset ↔ dataset lineage provided by Datahub and OpenMetadata.
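
For orientation, the events being collected are OpenLineage run events. The sketch below builds a minimal one as a plain Python dict. The top-level field names follow the OpenLineage specification; the namespaces, job name, dataset names and producer URL are made-up placeholders, and the helper `make_run_event` is hypothetical, not part of Data.Rentgen or any OpenLineage client.

```python
import json
import uuid
from datetime import datetime, timezone


def make_run_event(event_type, job_name, inputs, outputs):
    """Build a minimal OpenLineage-style run event (illustrative values only)."""
    return {
        "eventType": event_type,  # e.g. START, RUNNING, COMPLETE, FAIL
        "eventTime": datetime.now(timezone.utc).isoformat(),
        # Placeholder producer URL, an assumption for this sketch.
        "producer": "https://example.com/hypothetical-producer",
        # Spec version in this URL is an assumption; use the one your client emits.
        "schemaURL": "https://openlineage.io/spec/2-0-2/OpenLineage.json#/$defs/RunEvent",
        "run": {"runId": str(uuid.uuid4())},
        "job": {"namespace": "example-namespace", "name": job_name},
        "inputs": [{"namespace": "example-storage", "name": n} for n in inputs],
        "outputs": [{"namespace": "example-storage", "name": n} for n in outputs],
    }


event = make_run_event("COMPLETE", "daily_orders_job", ["raw.orders"], ["mart.orders"])
print(json.dumps(event, indent=2))
```

Each such event links one run of a job to the datasets it read and wrote, which is the job/run ↔ dataset relation mentioned in the goals above.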

Features

  • Supports consuming large volumes of lineage events, using Apache Kafka as an event buffer.
  • Stores data in tables partitioned by event timestamp to speed up lineage graph resolution.
  • Lineage graph is built within user-specified time boundaries (unlike Marquez, where lineage is built only for the last job run).
  • Lineage graph can be built with different granularity, e.g. by merging all individual Spark operations into a Spark applicationId or Spark applicationName.
  • Column-level lineage support.
  • Authentication support.
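
The time-boundary and granularity features can be illustrated with a small in-memory sketch. This is not Data.Rentgen's implementation or API: the `Operation` record, the `lineage` function and the granularity names are all hypothetical, chosen only to show how operation-level records could be filtered by time and merged into coarser nodes.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Operation:
    """One operation-level lineage record (hypothetical shape)."""
    application_name: str
    application_id: str
    name: str
    started_at: datetime
    inputs: frozenset
    outputs: frozenset


def lineage(operations, since, until, granularity="operation"):
    """Return {node: (inputs, outputs)} for operations started in [since, until)."""
    key_of = {
        "operation": lambda op: (op.application_id, op.name),
        "application_id": lambda op: op.application_id,
        "application_name": lambda op: op.application_name,
    }[granularity]
    merged = {}
    for op in operations:
        if not (since <= op.started_at < until):
            continue  # outside the user-specified time boundaries
        ins, outs = merged.setdefault(key_of(op), (set(), set()))
        ins.update(op.inputs)
        outs.update(op.outputs)
    return merged


ops = [
    Operation("etl", "app-0", "read_orders", datetime(2023, 12, 1, 10),
              frozenset({"raw.legacy"}), frozenset({"stg.orders"})),
    Operation("etl", "app-1", "read_orders", datetime(2024, 1, 1, 10),
              frozenset({"raw.orders"}), frozenset({"stg.orders"})),
    Operation("etl", "app-1", "write_mart", datetime(2024, 1, 1, 11),
              frozenset({"stg.orders"}), frozenset({"mart.orders"})),
    Operation("etl", "app-2", "read_orders", datetime(2024, 1, 2, 10),
              frozenset({"raw.orders"}), frozenset({"stg.orders"})),
]

since, until = datetime(2024, 1, 1), datetime(2024, 1, 3)
by_operation = lineage(ops, since, until, "operation")
by_app_id = lineage(ops, since, until, "application_id")
by_app_name = lineage(ops, since, until, "application_name")
```

With this sample data, the December run falls outside the requested window and is ignored; the remaining three operations collapse into two nodes at `application_id` granularity and into a single node at `application_name` granularity.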

Non-goals

  • This is not a Data Catalog: Data.Rentgen doesn't track dataset schema changes, owners and so on. Use Datahub or OpenMetadata instead.
  • Static Data Lineage, such as view → table, is not supported.

Limitations

  • For now, only Apache Spark and Apache Airflow are supported as lineage event sources. OpenLineage also supports Apache Flink, dbt, Trino and others; Data.Rentgen support for them may be added later.
  • Unlike Marquez, Data.Rentgen parses only a limited set of the facets sent by OpenLineage and doesn't store custom facets. This may change in the future.

Documentation

See https://data-rentgen.readthedocs.io/