
fix(databricks): fixes profile median #12856

Merged
merged 3 commits into from
Mar 13, 2025

Conversation

sgomezvillamor
Contributor

Fixes a couple of exceptions:

OFFSET expression passed as BIGINT instead of INT

            "depop_production.datalake_silver.tracking_product_view_by_product_id_v2.event_timestamp <class 'sqlalchemy.exc.DatabaseError'>: (databricks.sql.exc.ServerOperationError) [INVALID_LIMIT_LIKE_EXPRESSION.DATA_TYPE] The limit like expression \"10005558678\" is invalid. The offset expression must be integer type, but got \"BIGINT\". SQLSTATE: 42K0E; line 4 pos 16\n[SQL: SELECT event_timestamp \nFROM depop_production.datalake_silver.tracking_product_view_by_product_id_v2 \nWHERE event_timestamp IS NOT NULL ORDER BY event_timestamp\n LIMIT %(param_1)s OFFSET %(param_2)s]\n[parameters: {'param_1': 2, 'param_2': 10005558678}]\n(Background on this error at: https://sqlalche.me/e/14/4xp6)",

ORDER BY may consume too many resources here (Java heap space)

               'context': ["depop_production.lakehouse_product.user_segments.user_id <class 'sqlalchemy.exc.DatabaseError'>: "
                           '(databricks.sql.exc.ServerOperationError) Java heap space\n'
                           '[SQL: SELECT user_id \n'
                           'FROM depop_production.lakehouse_product.user_segments \n'
                           'WHERE user_id IS NOT NULL ORDER BY user_id\n'
                           ' LIMIT %(param_1)s OFFSET %(param_2)s]\n'
                           "[parameters: {'param_1': 2, 'param_2': 1065063579}]\n"
                           '(Background on this error at: https://sqlalche.me/e/14/4xp6)',

In both cases, the exception is raised from the fallback implementation we rely on for the median, which is computed by the SqlAlchemyDataset in Great Expectations as follows:

    def get_column_median(self, column):
        # AWS Athena and Presto have a special function that can be used to retrieve the median
        if (
            self.sql_engine_dialect.name.lower() == GXSqlDialect.AWSATHENA
            or self.sql_engine_dialect.name.lower() == GXSqlDialect.TRINO
        ):
            element_values = self.engine.execute(
                f"SELECT approx_percentile({column},  0.5) FROM {self._table}"
            )
            return convert_to_json_serializable(element_values.fetchone()[0])
        else:

            nonnull_count = self.get_column_nonnull_count(column)
            element_values = self.engine.execute(
                sa.select([sa.column(column)])
                .order_by(sa.column(column))
                .where(sa.column(column) != None)
                .offset(max(nonnull_count // 2 - 1, 0))
                .limit(2)
                .select_from(self._table)
            )
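The fallback above computes the median without any native aggregate: it sorts the non-null values and fetches the two rows at OFFSET max(nonnull_count // 2 - 1, 0) with LIMIT 2, then derives the median from them. A minimal sketch of that same logic in plain Python (the helper name is illustrative, not part of GX):

```python
def median_from_sorted(values):
    # Mirror of the GX fallback: sort the non-null values, then take
    # up to two elements at offset max(n // 2 - 1, 0) -- the SQL
    # equivalent of LIMIT 2 OFFSET max(n // 2 - 1, 0).
    values = sorted(v for v in values if v is not None)
    n = len(values)
    window = values[max(n // 2 - 1, 0):][:2]
    if n % 2 == 1:
        # Odd count: the median is the last of the fetched rows
        # (or the single row when n == 1).
        return window[-1]
    # Even count: average the two middle values.
    return (window[0] + window[1]) / 2

print(median_from_sorted([3, 1, 2]))        # → 2
print(median_from_sorted([1, 2, 3, 4]))     # → 2.5
print(median_from_sorted([5, None, 1]))     # → 3.0
```

This is why the OFFSET value can grow as large as the table itself (hence the BIGINT overflow on a ~10B-row table) and why the full ORDER BY can exhaust executor memory.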

The fix consists of using the median function provided by Databricks (https://docs.databricks.com/aws/en/sql/language-manual/functions/median), so the GX fallback is no longer used.
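Conceptually, the change swaps the two-row LIMIT/OFFSET query for a single aggregate pushed down to Databricks. A minimal sketch of the SQL shape the fix produces (the helper name is illustrative; the actual change goes through SQLAlchemy inside the profiler):

```python
def databricks_median_sql(table, column):
    # Illustrative only: build the single-aggregate query shape that
    # replaces the LIMIT/OFFSET fallback. No sorting, no offset --
    # Databricks' built-in median() does the work server-side.
    return f"SELECT median({column}) AS median_1 FROM {table}"

print(databricks_median_sql("acryl_demo.salesforce.account", "annual_revenue"))
# → SELECT median(annual_revenue) AS median_1 FROM acryl_demo.salesforce.account
```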

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
  • For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

@github-actions github-actions bot added the ingestion PR or Issue related to the ingestion of metadata label Mar 12, 2025
@sgomezvillamor sgomezvillamor marked this pull request as ready for review March 12, 2025 15:23
@datahub-cyborg datahub-cyborg bot added the needs-review Label for PRs that need review from a maintainer. label Mar 12, 2025

codecov bot commented Mar 12, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

✅ All tests successful. No failed tests found.


@datahub-cyborg datahub-cyborg bot added pending-submitter-merge and removed needs-review Label for PRs that need review from a maintainer. labels Mar 12, 2025
@sgomezvillamor
Contributor Author

I have tested this locally. This is the query we were generating before:

SELECT annual_revenue
FROM acryl_demo.salesforce.account
WHERE annual_revenue IS NOT NULL ORDER BY annual_revenue
LIMIT :param_1 OFFSET :param_2

and this is the one we generate with the fix:

SELECT median(annual_revenue) AS median_1
FROM acryl_demo.salesforce.account

@sgomezvillamor sgomezvillamor merged commit 30719ac into master Mar 13, 2025
76 checks passed
@sgomezvillamor sgomezvillamor deleted the feature/cus-3105/fix-databricks-median branch March 13, 2025 11:39