Changelog

Added

Fixed

  • Spark: databricks improvements to send better events #1330 @pawel-big-lebowski
    Filters unwanted events and provides a meaningful job name.
  • Python: validate eventTime field in python client #1355 @pawel-big-lebowski
    Validates the eventTime of a RunEvent within the client library.
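
As a rough illustration of the kind of check #1355 describes, here is a minimal sketch that assumes eventTime is an ISO 8601 string. The helper name `is_valid_event_time` is hypothetical and this is not the client's actual code, which may apply different or stricter rules.

```python
from datetime import datetime


def is_valid_event_time(event_time) -> bool:
    """Return True if event_time parses as an ISO 8601 timestamp.

    Hypothetical helper for illustration only.
    """
    try:
        # Older Python versions of fromisoformat() reject a trailing "Z",
        # so normalize it to an explicit UTC offset before parsing.
        datetime.fromisoformat(event_time.replace("Z", "+00:00"))
        return True
    except (ValueError, AttributeError):
        # ValueError: not a parseable timestamp; AttributeError: not a string.
        return False
```

A caller could reject or log a RunEvent before sending it whenever this check returns False.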

0.17.0 - 2022-11-16

Added

  • Spark: support latest Spark 3.3.1 #1183 @pawel-big-lebowski
    Adds support for the latest Spark 3.3.1 version.
  • Spark: add Kinesis Transport and support config Kinesis in Spark integration #1200 @yogayang
    Adds support for sending to Kinesis from the Spark integration.
  • Spark: Disable specified facets #1271 @pawel-big-lebowski
    Adds the ability to disable specified facets from generated OpenLineage events.
  • Python: add facets implementation to Python client #1233 @pawel-big-lebowski
    Adds missing facets to the Python client.
  • SQL: add Rust parser interface #1172 @StarostaGit @mobuchowski
    Implements a Java interface in the Rust SQL parser, including a build script, native library loading mechanism, CI support and build fixes.
  • Proxy: add helm chart for the proxy backend #1068 @wslulciuc
    Adds a helm chart for deploying the proxy backend on Kubernetes.
  • Spec: include possible facets usage in spec #1249 @pawel-big-lebowski
    Extends the facets definition with a list of available facets.
  • Website: publish YML version of spec to website #1300 @rossturk
    Adds the configuration necessary to make the OpenLineage website auto-generate OpenAPI docs when the spec is published there.
  • Docs: update language on nominating new committers #1270 @rossturk
    Updates the governance language to reflect the new policy on nominating committers.

Changed

  • Website: publish spec into new website repo location #1295 @rossturk
    Creates a new deploy key, adds it to CircleCI & GitHub, and makes the necessary changes to the release.sh script.
  • Airflow: change how pip installs packages in tox environments #1302 @JDarDagran
    Use deprecated resolver and constraints files provided by Airflow to avoid potential issues caused by pip's new resolver.

Fixed

  • Airflow: fix README for running integration test #1238 @sekikn
    Updates the README for consistency with supported Airflow versions.
  • Airflow: add task_instance argument to get_openlineage_facets_on_complete #1269 @JDarDagran
    Adds the task_instance argument to DefaultExtractor.
  • Java client: fix up all artifactory paths #1290 @harels
    Not all artifactory paths were changed in the build CI script in a previous PR.
  • Python client: fix Mypy errors and adjust to PEP 484 #1264 @JDarDagran
    Adds a --no-namespace-packages argument to the Mypy command and adjusts code to PEP 484.
  • Website: release all specs since last_spec_commit_id, not just HEAD~1 #1298 @rossturk
    The script now ships all specs that have changed since .last_spec_commit_id.

Removed

  • Deprecate HttpTransport.Builder in favor of HttpConfig #1287 @collado-mike
    Deprecates the Builder in favor of HttpConfig only and replaces the existing Builder implementation by delegating to the HttpConfig.

0.16.1 - 2022-11-03

Added

  • Airflow: add dag_run information to Airflow version run facet #1133 @fm100
    Adds the Airflow DAG run ID to the taskInfo facet, making this additional information available to the integration.
  • Airflow: add LoggingMixin to extractors #1149 @JDarDagran
    Adds a LoggingMixin class to the custom extractor to make the output consistent with general Airflow and OpenLineage logging settings.
  • Airflow: add default extractor #1162 @mobuchowski
    Adds a DefaultExtractor to support the default implementation of OpenLineage for external operators without the need for custom extractors.
  • Airflow: add on_complete argument in DefaultExtractor #1188 @JDarDagran
    Adds support for running another method on extract_on_complete.
  • SQL: reorganize the library into multiple packages #1167 @StarostaGit @mobuchowski
    Splits the SQL library into a Rust implementation and foreign language bindings, easing the process of adding language interfaces. Also contains a CI fix.

Changed

  • Airflow: move get_connection_uri as extractor's classmethod #1169 @JDarDagran
    The get_connection_uri method allowed for too many params, resulting in unnecessarily long URIs. This changes the logic to whitelisting per extractor.
  • Airflow: change get_openlineage_facets_on_start/complete behavior #1201 @JDarDagran
    Splits up the method for greater legibility and easier maintenance.
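
The whitelisting idea behind the get_connection_uri change (#1169) can be sketched with the standard library. This is an assumption-laden illustration, not the integration's implementation: the function name `filter_query_params` and the example parameter names are hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def filter_query_params(uri: str, allowed: set) -> str:
    """Return uri with only the whitelisted query parameters kept.

    Hypothetical helper: each extractor would supply its own `allowed` set.
    """
    parts = urlsplit(uri)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key in allowed
    ]
    # Rebuild the URI with the reduced query string; other parts are unchanged.
    return urlunsplit(parts._replace(query=urlencode(kept)))
```

Whitelisting drops unknown parameters by default, so a newly introduced secret-bearing parameter is not exposed unless someone explicitly allows it. Note that this sketch only touches the query string and leaves any credentials in the authority part of the URI as they are.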

Fixed

  • Airflow: always send SQL in SqlJobFacet as a string #1143 @mobuchowski
    Changes the data type of query from array to string to fix an error in the RedshiftSQLOperator.
  • Airflow: include __extra__ case when filtering URI query params #1144 @JDarDagran
    Includes the conn.EXTRA_KEY in the get_connection_uri method to avoid exposing secrets in URIs via the __extra__ key.
  • Airflow: enforce column casing in SQLCheckExtractors #1159 @denimalpaca
    Uses the parent extractor's _is_uppercase_names property to determine if the column should be upper cased in the SQLColumnCheckExtractor's _get_input_facets() method.
  • Spark: prevent exception when no schema provided #1180 @pawel-big-lebowski
    Prevents evaluation of column lineage when the schemFacet is null.
  • Great Expectations: add V3 API compatibility #1194 @denimalpaca
    Fixes the Pandas datasource to make it V3 API-compatible.

Removed

  • Airflow: remove support for Airflow 1.10 #1128 @mobuchowski
    Removes the code structures and tests enabling support for Airflow 1.10.

0.15.1 - 2022-10-05

Added

  • Airflow: improve development experience #1101 @JDarDagran
    Adds an interactive development environment to the Airflow integration and improves integration testing.
  • Spark: add description for URL parameters in readme, change overwriteName to appName #1130 @tnazarew
    Adds more information about passing arguments with spark.openlineage.url and changes overwriteName to appName for clarity.
  • Documentation: update issue templates for proposal & add new integration template #1116 @rossturk
    Adds a YAML issue template for new integrations and fixes a bug in the proposal template.

Changed

  • Airflow: lazy load BigQuery client #1119 @mobuchowski
    Moves import of the BigQuery client from top level to local level to decrease DAG import time.
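
The pattern behind this change is a function-local import. The sketch below uses the standard-library `sqlite3` module as a stand-in for a heavy dependency such as the BigQuery client; the function name `get_client` is hypothetical.

```python
def get_client():
    """Create a client, importing its module only when first needed."""
    # Importing here instead of at module top level means the import cost
    # is paid on the first call, not when the file (e.g. a DAG) is parsed.
    import sqlite3  # stand-in for a heavy dependency

    return sqlite3.connect(":memory:")
```

Python caches imported modules, so repeated calls do not pay the import cost again.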

Fixed

  • Airflow: fix UUID generation conflict for Airflow DAGs with same name #1056 @collado-mike
    Adds a namespace to the UUID calculation to avoid conflicts caused by DAGs having the same name in different namespaces in Airflow deployments.
  • Spark/BigQuery: fix issue with spark-bigquery-connector >=0.25.0 #1111 @pawel-big-lebowski
    Makes the Spark integration compatible with the latest connector.
  • Spark: fix column lineage #1069 @pawel-big-lebowski
    Fixes a null pointer exception error and an error when openlineage.timeout is not provided.
  • Spark: set log level of Init OpenLineageContext to DEBUG #1064 @varuntestaz
    Prevents sensitive information from being logged unless debug mode is used.
  • Java client: update version of SnakeYAML #1090 @TheSpeedding
    Bumps the SnakeYAML library version to include a key bug fix.
  • dbt: remove requirement for OPENLINEAGE_URL to be set #1107 @mobuchowski
    Removes erroneous check for OPENLINEAGE_URL in the dbt integration.
  • Python client: remove potentially cyclic import #1126 @mobuchowski
    Hides imports to remove a potentially cyclic import.
  • CI: build macos release package on medium resource class #1131 @mobuchowski
    Fixes failing build due to resource class being too large.
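
The UUID conflict fix for Airflow DAGs (#1056) above comes down to including the namespace in the input of a deterministic UUID. A minimal sketch, assuming a name-based UUID over a joined string; the function name `run_id`, its parameters, and the choice of `uuid3` are illustrative assumptions rather than the integration's actual code.

```python
import uuid


def run_id(namespace: str, dag_id: str, execution_date: str) -> str:
    """Derive a deterministic run ID that differs across namespaces.

    Hypothetical helper: the real inputs and hashing scheme may differ.
    """
    # Name-based UUIDs are stable for identical inputs, so adding the
    # namespace keeps same-named DAGs in different deployments apart.
    name = f"{namespace}.{dag_id}.{execution_date}"
    return str(uuid.uuid3(uuid.NAMESPACE_URL, name))
```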

0.14.1 - 2022-09-07

Fixed

  • Fix Spark integration issues including error when no openlineage.timeout #1069 @pawel-big-lebowski
    OpenlineageSparkListener was failing when no openlineage.timeout was provided.

0.14.0 - 2022-09-06

Added

  • Support ABFSS and Hadoop Logical Relation in Column-level lineage #1008 @wjohnson
    Introduces an extractDatasetIdentifier that uses similar logic to InsertIntoHadoopFsRelationVisitor to pull out the path on the HDFS compliant file system; tested on ABFSS and DBFS (Databricks FileSystem) to prove that lineage could be extracted using non-SQL commands.
  • Add Kusto relation visitor #939 @hmoazam
    Implements a KustoRelationVisitor to support lineage for Azure Kusto's Spark connector.
  • Add ColumnLevelLineage facet doc #1020 @julienledem
    Adds documentation for the Column-level lineage facet.
  • Include symlinks dataset facet #935 @pawel-big-lebowski
    Includes the recently introduced SymlinkDatasetFacet in generated OpenLineage events.
  • Add support for dbt 1.3 beta's metadata changes #1051 @mobuchowski
    Makes projects that are composed of only SQL models work on 1.3 beta (dbt 1.3 renamed the compiled_sql field to compiled_code to support Python models). Does not provide support for dbt's Python models.
  • Support Flink 1.15 #1009 @mzareba382
    Adds support for Flink 1.15.
  • Add Redshift dialect to the SQL integration #1066 @mobuchowski
    Adds support for Redshift's SQL dialect in OpenLineage's SQL parser, including quirks such as the use of square brackets in JSON paths. (Note, this does not add support for all of Redshift's custom syntax.)

Changed

  • Make the timeout configurable in the Spark integration #1050 @tnazarew
    Makes timeout configurable by the user. (In some cases, the time needed to send events was longer than 5 seconds, which exceeded the timeout value.)

Fixed

  • Add a dialect parameter to Great Expectations SQL parser calls #1049 @collado-mike
    Specifies the dialect name from the SQL engine.
  • Fix Delta 2.1.0 with Spark 3.3.0 #1065 @pawel-big-lebowski
    Allows Delta support for Spark 3.3 and fixes potential issues. (The OpenLineage integration for Spark 3.3 was turned on without Delta support, as Delta did not support Spark 3.3 at that time.)

0.13.1 - 2022-08-25

Fixed

  • Rename all parentRun occurrences to parent in Airflow integration #1037 @fm100
    Changes the parentRun property name to parent in the Airflow integration to match the spec.
  • Do not change task instance during on_running event #1028 @JDarDagran
    Fixes an issue in the Airflow integration with the on_running hook, which was changing the TaskInstance object along with the task attribute.

0.13.0 - 2022-08-22

Added

  • Add BigQuery check support #960 @denimalpaca
    Adds logic and support for proper dynamic class inheritance for BigQuery-style operators. (BigQuery's extractor needed additional logic to support the forthcoming BigQueryColumnCheckOperator and BigQueryTableCheckOperator.)
  • Add RUNNING EventType in spec and Python client #972 @mzareba382
    Introduces a RUNNING event state in the OpenLineage spec to indicate a running task and adds a RUNNING event type in the Python API.
  • Use databases & schemas in SQL Extractors #974 @JDarDagran
    Allows the Airflow integration to differentiate between databases and schemas. (There was no notion of databases and schemas when querying and parsing results from information_schema tables.)
  • Implement Event forwarding feature via HTTP protocol #995 @howardyoo
    Adds HttpLineageStream to forward a given OpenLineage event to any HTTP endpoint.
  • Introduce SymlinksDatasetFacet to spec #936 @pawel-big-lebowski
    Creates a new facet, the SymlinksDatasetFacet, to support the storing of alternative dataset names.
  • Add Azure Cosmos Handler to Spark integration #983 @hmoazam
    Defines a new interface, the RelationHandler, to support Spark data sources that do not have TableCatalog, Identifier, or TableProperties set, as is the case with the Azure Cosmos DB Spark connector.
  • Support OL Datasets in manual lineage inputs/outputs #1015 @conorbev
    Allows Airflow users to create OpenLineage Dataset classes directly in DAGs with no conversion necessary. (Manual lineage definition required users to create an airflow.lineage.entities.Table, which was then converted to an OpenLineage Dataset.)
  • Create ownership facets #996 @julienledem
    Adds an ownership facet to both Dataset and Job in the OpenLineage spec to capture ownership of jobs and datasets.

Changed

  • Use RUNNING EventType in Flink integration for currently running jobs #985 @mzareba382
    Makes use of the new RUNNING event type in the Flink integration, changing events sent by Flink jobs from OTHER to this new type.
  • Convert task objects to JSON-encodable objects when creating custom Airflow version facets #1018 @fm100
    Implements a to_json_encodable function in the Airflow integration to make task objects JSON-encodable.
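
To show what making an arbitrary object JSON-encodable can involve, here is a minimal recursive sketch. It is not the integration's to_json_encodable function; the name `make_json_encodable` and the conversion rules are assumptions chosen for illustration.

```python
from datetime import datetime


def make_json_encodable(obj):
    """Recursively convert obj into a structure json.dumps accepts.

    Hypothetical sketch; it does not guard against cyclic references.
    """
    if obj is None or isinstance(obj, (str, int, float, bool)):
        return obj
    if isinstance(obj, datetime):
        return obj.isoformat()
    if isinstance(obj, dict):
        return {str(key): make_json_encodable(value) for key, value in obj.items()}
    if isinstance(obj, (list, tuple, set)):
        return [make_json_encodable(item) for item in obj]
    if hasattr(obj, "__dict__"):
        # Plain objects are encoded through their instance attributes.
        return make_json_encodable(vars(obj))
    # Fall back to the string representation for anything else.
    return str(obj)
```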

Fixed

  • Add support for custom SQL queries in v3 Great Expectations API #1025 @collado-mike
    Fixes support for custom SQL statements in the Great Expectations provider. (The Great Expectations custom SQL datasource was not applied to the support for the V3 checkpoints API.)

0.12.0 - 2022-08-01

Added

Changed

Fixed

0.11.0 - 2022-07-07

Added

Changed

  • When testing extractors in the Airflow integration, make the extractor length assertion dynamic #882 @denimalpaca
  • Render templates as start of integration tests for TaskListener in the Airflow integration #870 @mobuchowski

Fixed

0.10.0 - 2022-06-24

Added

Changed

Fixed

Added

Fixed

Added

Fixed

  • PostgresOperator fails to retrieve host and conn during extraction (#705) @sekikn
  • SQL parser accepts lists of sql statements (#734) @mobuchowski
  • Missing schema when writing to Delta tables in Databricks (#748) @collado-mike

Added

Fixed

  • GreatExpectations: Fixed bug when invoking GreatExpectations using v3 API (#683) @collado-mike

Added

Fixed

Added

Fixed

Fixed

  • Catch possible failures when emitting events and log them @mobuchowski
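
The general pattern for this fix is to wrap the emit call so that a failure is logged instead of propagating into the job being observed. A minimal sketch under that assumption; the wrapper name `safe_emit` and its signature are hypothetical, not the project's API.

```python
import logging

log = logging.getLogger(__name__)


def safe_emit(emit, event) -> bool:
    """Call emit(event); log any failure and report success as a bool.

    Hypothetical wrapper: `emit` is any callable that sends an event.
    """
    try:
        emit(event)
        return True
    except Exception:
        # log.exception records the traceback without re-raising.
        log.exception("Failed to emit OpenLineage event")
        return False
```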

Fixed

  • dbt: jinja2 code using do extensions does not crash @mobuchowski

Added

  • Extract source code of PythonOperator code similar to SQL facet @mobuchowski
  • Add DatasetLifecycleStateDatasetFacet to spec @pawel-big-lebowski
  • Airflow: extract source code from BashOperator @mobuchowski
  • Add generic facet to collect environmental properties (EnvironmentFacet) @harishsune
  • OpenLineage sensor for OpenLineage-Dagster integration @dalinkim
  • Java-client: make generator generate enums as well @pawel-big-lebowski
  • Added UnknownOperatorAttributeRunFacet to Airflow integration to record operators that don't produce lineage @collado-mike

Fixed

  • Airflow: increase import timeout in tests, fix exit from integration @mobuchowski
  • Reduce logging level for import errors to info @rossturk
  • Remove AWS secret keys and extraneous Snowflake parameters from connection uri @collado-mike
  • Convert to LifecycleStateChangeDatasetFacet @pawel-big-lebowski

Added

  • Proxy backend example using Kafka @wslulciuc
  • Support Databricks Delta Catalog naming convention with DatabricksDeltaHandler @wjohnson
  • Add javadoc as part of build task @mobuchowski
  • Include TableStateChangeFacet in non V2 commands for Spark @mr-yusupov
  • Support for SqlDWRelation on Databricks' Azure Synapse/SQL DW Connector @wjohnson
  • Implement input visitors for v2 commands @pawel-big-lebowski
  • Enabled SparkListenerJobStart events to trigger open lineage events @collado-mike

Fixed

  • dbt: job namespaces for given dbt run match each other @mobuchowski
  • Fix Breaking SnowflakeOperator Changes from OSS Airflow @denimalpaca
  • Made corrections to account for DeltaDataSource handling @collado-mike

Added

Fixed

  • airflow: fix import failures when dependencies for bigquery, dbt, great_expectations extractors are missing @lukaszlaszko
  • Fixed openlineage-spark jar to correctly rename bundled dependencies @collado-mike

0.4.0 - 2021-12-13

Added

Fixed

  • dbt: column descriptions are properly filled from metadata.json @mobuchowski
  • dbt: allow parsing artifacts with version higher than officially supported @mobuchowski
  • dbt: dbt build command is supported @mobuchowski
  • dbt: fix crash when build command is used with seeds in dbt 1.0.0rc3 @mobuchowski
  • spark: increase logical plan visitor coverage @mobuchowski
  • spark: fix logical serialization recursion issue @OleksandrDvornik
  • Use URL#getFile to fix build on Windows @mobuchowski

0.3.1 - 2021-10-21

Fixed

0.3.0 - 2021-10-21

Added

Fixed

0.2.3 - 2021-10-07

Fixed

0.2.2 - 2021-09-08

Added

  • Implement OpenLineageValidationAction for Great Expectations @collado-mike
  • facet: add expectations assertions facet @mobuchowski

Fixed

  • airflow: pendulum formatting fix, add tests @mobuchowski
  • dbt: do not emit events if run_result file was not updated @mobuchowski

0.2.1 - 2021-08-27

Fixed

  • Default --project-dir argument to current directory in dbt-ol script @mobuchowski

0.2.0 - 2021-08-23

Added

  • Parse dbt command line arguments when invoking dbt-ol @mobuchowski. For example:

    $ dbt-ol run --project-dir path/to/dir
    
  • Set UnknownFacet for spark (captures metadata about unvisited nodes from spark plan not yet supported) @OleksandrDvornik

Changed

Fixed

  • Remove instance references to extractors from DAG and avoid copying log property for serializability @collado-mike

0.1.0 - 2021-08-12

OpenLineage is an Open Standard for lineage metadata collection designed to record metadata for a job in execution. The initial public release includes:

  • An initial specification. The initial version 1-0-0 of the OpenLineage specification defines the core model and facets.
  • Integrations that collect lineage metadata as OpenLineage events:
  • Clients that send OpenLineage events to an HTTP backend. Both Java and Python are initially supported.