Data.Rentgen is a Data Motion Lineage service, compatible with the OpenLineage specification.
Note: the service is under active development and is not ready for use yet.
- Collect lineage events produced by OpenLineage clients & integrations.
- Store operation-grained events for finer detail (instead of job-grained events, as in Marquez).
- Provide an API for fetching job/run ↔ dataset lineage, not dataset ↔ dataset lineage (like Datahub and OpenMetadata).
- Support consuming large volumes of lineage events, using Apache Kafka as an event buffer.
- Store data in tables partitioned by event timestamp, to speed up lineage graph resolution.
- Lineage graph is built with user-specified time boundaries (unlike Marquez, where lineage is built only for the last job run).
- Lineage graph can be built with different granularity, e.g. by merging all individual Spark operations into a Spark applicationId or Spark applicationName.
- Column-level lineage support.
- Authentication support.
- This is not a Data Catalog. DataRentgen doesn't track dataset schema changes, owners and so on. Use Datahub or OpenMetadata instead.
- Static Data Lineage like view → table is not supported.
- For now, only Apache Spark and Apache Airflow are supported as lineage event sources. OpenLineage also supports Apache Flink, dbt, Trino and others. Support for these may be added to DataRentgen later.
- Unlike Marquez, DataRentgen parses only a limited set of the facets sent by OpenLineage, and doesn't store custom facets. This may change in the future.
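
The time-boundary and granularity points above can be illustrated with a small sketch. This is not the DataRentgen API or storage schema: the `OperationEvent` record, the `build_lineage` function and the sample values are all hypothetical, and events are kept in a plain in-memory list rather than partitioned tables. It only shows how the same operation-level events might collapse into fewer job/run ↔ dataset edges as granularity gets coarser.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class OperationEvent:
    # Hypothetical flattened record, NOT the DataRentgen schema.
    application_name: str  # e.g. Spark applicationName
    application_id: str    # e.g. Spark applicationId
    operation: str         # individual operation within the application
    dataset: str
    direction: str         # "input" or "output"
    event_time: datetime


def build_lineage(events, since, until, granularity="operation"):
    """Return the set of (node, direction, dataset) edges for events
    with since <= event_time < until, grouped at the given granularity."""
    node_key = {
        "operation": lambda e: f"{e.application_id}/{e.operation}",
        "application_id": lambda e: e.application_id,
        "application_name": lambda e: e.application_name,
    }[granularity]
    return {
        (node_key(e), e.direction, e.dataset)
        for e in events
        if since <= e.event_time < until
    }


# Illustrative sample data: two runs of the same application on different days.
EVENTS = [
    OperationEvent("etl", "app-1", "op-1", "hdfs://raw/users", "input", datetime(2024, 1, 1, 10, 0)),
    OperationEvent("etl", "app-1", "op-2", "hive://dwh.users", "output", datetime(2024, 1, 1, 10, 5)),
    OperationEvent("etl", "app-2", "op-1", "hdfs://raw/users", "input", datetime(2024, 1, 2, 10, 0)),
    OperationEvent("etl", "app-2", "op-1", "hive://dwh.users", "output", datetime(2024, 1, 2, 10, 5)),
]

if __name__ == "__main__":
    window = (datetime(2024, 1, 1), datetime(2024, 1, 3))
    for level in ("operation", "application_id", "application_name"):
        print(level, sorted(build_lineage(EVENTS, *window, granularity=level)))
```

Narrowing the window to a single day keeps only that day's run, while switching to `application_name` merges both runs into one node; in a real deployment the time filter would presumably map onto the timestamp-partitioned tables mentioned above.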