
feat(ingest/powerbi): PowerBI source updates #12857

Open

mminichino wants to merge 10 commits into master
Conversation

@mminichino commented Mar 12, 2025

  • Fixes dataset column lineage
  • Adds MySQL support
  • Adds incremental lineage
  • Adds ability to filter workspaces by name
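
As a rough illustration, a recipe exercising the new options might look like the sketch below. This is not taken from the PR itself: workspace_name_pattern comes from the diff discussed later in this thread, while the incremental_lineage flag name and all credential values are assumptions for illustration only.

# Hypothetical PowerBI ingestion recipe exercising the new options.
# workspace_name_pattern is the field added in this PR; incremental_lineage is an
# assumed flag name and the credentials are placeholders.
from datahub.ingestion.run.pipeline import Pipeline

pipeline = Pipeline.create(
    {
        "source": {
            "type": "powerbi",
            "config": {
                "tenant_id": "<tenant-id>",
                "client_id": "<client-id>",
                "client_secret": "<client-secret>",
                # Filter workspaces by display name (added in this PR).
                "workspace_name_pattern": {"allow": ["^Finance.*"]},
                # Assumed name for the incremental lineage toggle added here.
                "incremental_lineage": True,
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "http://localhost:8080"},
        },
    }
)
pipeline.run()
pipeline.raise_from_status()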

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
  • For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

@github-actions bot added the ingestion label (PR or Issue related to the ingestion of metadata) Mar 12, 2025

codecov bot commented Mar 12, 2025

❌ 2 Tests Failed:

Tests completed: 3774 | Failed: 2 | Passed: 3772 | Skipped: 68
View the full list of 2 ❄️ flaky tests
tests.entity_versioning.test_versioning::test_link_unlink_three_versions_unlink_middle_and_latest

Flake rate in main: 27.94% (Passed 49 times, Failed 19 times)

Stack Traces | 9.04s run time
graph_client = DataHubGraph: configured to talk to http://localhost:8080 with token: eyJh**********y_oA

    @pytest.fixture(scope="function", autouse=True)
    def ingest_cleanup_data(graph_client: DataHubGraph):
        try:
            for urn in ENTITY_URN_OBJS:
                graph_client.emit_mcp(
                    MetadataChangeProposalWrapper(
                        entityUrn=urn.urn(),
                        aspect=DatasetKeyClass(
                            platform=urn.platform, name=urn.name, origin=urn.env
                        ),
                    )
                )
            for i in [2, 1, 0, 0, 1, 2]:
>               graph_client.unlink_asset_from_version_set(ENTITY_URNS[i])

tests/entity_versioning/test_versioning.py:31: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
...../ingestion/graph/entity_versioning.py:164: in unlink_asset_from_version_set
    response = self.execute_graphql(self.UNLINK_VERSION_MUTATION, variables)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = DataHubGraph: configured to talk to http://localhost:8080 with token: eyJh**********y_oA
query = '\n        mutation($input: UnlinkVersionInput!) {\n            unlinkAssetVersion(input: $input) {\n                urn\n            }\n        }\n    '
variables = {'input': {'unlinkedEntity': 'urn:li:dataset:(urn:li:dataPlatform:snowflake,versioning_0,PROD)', 'versionSet': 'urn:li:versionSet:(12345678910,dataset)'}}
operation_name = None, format_exception = True

    def execute_graphql(
        self,
        query: str,
        variables: Optional[Dict] = None,
        operation_name: Optional[str] = None,
        format_exception: bool = True,
    ) -> Dict:
        url = f"{self.config.server}/api/graphql"
    
        body: Dict = {
            "query": query,
        }
        if variables:
            body["variables"] = variables
        if operation_name:
            body["operationName"] = operation_name
    
        logger.debug(
            f"Executing {operation_name or ''} graphql query: {query} with variables: {json.dumps(variables)}"
        )
        result = self._post_generic(url, body)
        if result.get("errors"):
            if format_exception:
>               raise GraphError(f"Error executing graphql query: {result['errors']}")
E               datahub.configuration.common.GraphError: Error executing graphql query: [{'message': 'An unknown error occurred.', 'locations': [{'line': 3, 'column': 13}], 'path': ['unlinkAssetVersion'], 'extensions': {'code': 500, 'type': 'SERVER_ERROR', 'classification': 'DataFetchingException'}}]

...../ingestion/graph/client.py:1188: GraphError
tests.entity_versioning.test_versioning::test_link_unlink_three_versions_unlink_and_relink

Flake rate in main: 27.94% (Passed 49 times, Failed 19 times)

Stack Traces | 29.8s run time
graph_client = DataHubGraph: configured to talk to http://localhost:8080 with token: eyJh**********y_oA

    @pytest.fixture(scope="function", autouse=True)
    def ingest_cleanup_data(graph_client: DataHubGraph):
        try:
            for urn in ENTITY_URN_OBJS:
                graph_client.emit_mcp(
                    MetadataChangeProposalWrapper(
                        entityUrn=urn.urn(),
                        aspect=DatasetKeyClass(
                            platform=urn.platform, name=urn.name, origin=urn.env
                        ),
                    )
                )
            for i in [2, 1, 0, 0, 1, 2]:
                graph_client.unlink_asset_from_version_set(ENTITY_URNS[i])
            yield
        finally:
            for i in [2, 1, 0, 0, 1, 2]:
>               graph_client.unlink_asset_from_version_set(ENTITY_URNS[i])

tests/entity_versioning/test_versioning.py:35: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
...../ingestion/graph/entity_versioning.py:164: in unlink_asset_from_version_set
    response = self.execute_graphql(self.UNLINK_VERSION_MUTATION, variables)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = DataHubGraph: configured to talk to http://localhost:8080 with token: eyJh**********y_oA
query = '\n        mutation($input: UnlinkVersionInput!) {\n            unlinkAssetVersion(input: $input) {\n                urn\n            }\n        }\n    '
variables = {'input': {'unlinkedEntity': 'urn:li:dataset:(urn:li:dataPlatform:snowflake,versioning_1,PROD)', 'versionSet': 'urn:li:versionSet:(12345678910,dataset)'}}
operation_name = None, format_exception = True

    def execute_graphql(
        self,
        query: str,
        variables: Optional[Dict] = None,
        operation_name: Optional[str] = None,
        format_exception: bool = True,
    ) -> Dict:
        url = f"{self.config.server}/api/graphql"
    
        body: Dict = {
            "query": query,
        }
        if variables:
            body["variables"] = variables
        if operation_name:
            body["operationName"] = operation_name
    
        logger.debug(
            f"Executing {operation_name or ''} graphql query: {query} with variables: {json.dumps(variables)}"
        )
        result = self._post_generic(url, body)
        if result.get("errors"):
            if format_exception:
>               raise GraphError(f"Error executing graphql query: {result['errors']}")
E               datahub.configuration.common.GraphError: Error executing graphql query: [{'message': 'An unknown error occurred.', 'locations': [{'line': 3, 'column': 13}], 'path': ['unlinkAssetVersion'], 'extensions': {'code': 500, 'type': 'SERVER_ERROR', 'classification': 'DataFetchingException'}}]

...../ingestion/graph/client.py:1188: GraphError

To view more test analytics, go to the Test Analytics Dashboard

Comment on lines 312 to 315
workspace_name_pattern: AllowDenyPattern = pydantic.Field(
default=AllowDenyPattern.allow_all(),
description="Regex patterns to filter PowerBI workspaces in ingestion."
" Note: This field works in conjunction with 'workspace_type_filter' and both must be considered when filtering workspaces.",
Contributor

To ensure a workspace is not filtered out, both the workspace ID and name must be allowed. Is that correct?
Similarly, if either of them is denied, the workspace will be filtered out, right?

If so, we may add such a precise explanation in the description.

IMO this confusion comes from workspace_id_pattern, which should never have been a pattern. Given that a workspace ID is UUID-like, having regex patterns there is not practical. Instead, it should be workspace_ids: List[str], the list of workspace IDs to include; only if that is not given would workspace_name_pattern be considered.

We can skip the workspace_id_pattern deprecation and just refine the docs to avoid confusion.
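
For reference, a minimal sketch of how a single AllowDenyPattern check behaves, assuming the standard datahub.configuration.common.AllowDenyPattern semantics (deny patterns reject first, and allow defaults to ".*"); the example values are illustrative, and how the ID and name patterns are combined in the PowerBI source is exactly what the replies below discuss:

from datahub.configuration.common import AllowDenyPattern

# A value passes if it matches no deny regex and at least one allow regex.
name_pattern = AllowDenyPattern(allow=["^Finance.*"], deny=[".*-archived$"])

print(name_pattern.allowed("Finance Reports"))   # True: matches allow, no deny
print(name_pattern.allowed("Finance-archived"))  # False: matches a deny regex
print(name_pattern.allowed("Marketing"))         # False: matches no allow regex

# The default pattern allows everything, since allow defaults to [".*"].
print(AllowDenyPattern.allow_all().allowed("anything"))  # True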

Author

No, the filter is an OR condition. You can specify just a name filter, an ID filter, or both. So if you only have a name filter, all IDs will be allowed. I'll update the docs.

Author

Docs updated.

Contributor

In this scenario, and assuming name2 is not the name of the workspace with ID id1:

workspace_id_pattern:
  allow:
  - id1
workspace_name_pattern:
  allow:
  - name2

No workspace will be allowed, right? That's why I see this as an AND. Of course, if we have allow-all (the default) in either of the two, then everything depends on the other.

And in the same scenario with deny patterns:

workspace_id_pattern:
  deny:
  - id1
workspace_name_pattern:
  deny:
  - name2

Everything will be allowed except the workspace with ID id1 and the workspace named name2. So an OR in this case.

Author

If you allow one workspace by ID and another by name, both will be allowed. For the second scenario, if you deny one by ID and another by name, both will be denied.

@@ -373,7 +387,7 @@ class PowerBiDashboardSourceConfig(
)
# Enable/Disable extracting dataset schema
extract_dataset_schema: bool = pydantic.Field(
default=False,
default=True,
Contributor

This is sort of a breaking change. Why?

Author

This is actually not a breaking change. The API call is hard-coded to return all the possible attributes, including dataset schema info, so this should have always been set to true.

Contributor

Not sure about the API call, but whether or not the aspect is emitted depends on that config flag:

if self.__config.extract_dataset_schema:
dataset_mcps.extend(self.extract_dataset_schema(table, ds_urn))

Author

Correct, though the problem is that this should never have been an option, given the way the source works. It is confusing at best. You can set extract_column_level_lineage, but if you don't extract the schema data from the API scan result, you won't get column-level lineage. However, the dataset schema data, if present, will always be returned, since the scan request hard-codes these parameters:

            params={
                "datasetExpressions": True,
                "datasetSchema": True,
                "datasourceDetails": True,
                "getArtifactUsers": True,
                "lineage": True,
            },

@datahub-cyborg bot added the pending-submitter-response label and removed the needs-review label Mar 14, 2025
Contributor

@sgomezvillamor left a comment

Overall looks good.
I'm just missing some tests.

Have you considered mocking some tests in metadata-ingestion/tests/integration/powerbi/test_powerbi.py? Or has this been end-to-end tested locally somehow?
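
For what it's worth, a rough sketch of the mocking approach (not taken from the actual test file): register a scan-result payload that carries table/column data with requests_mock, so the source has schema information to turn into schemaMetadata and column lineage. The endpoint URL and payload shape below are illustrative assumptions.

import requests
import requests_mock

# Assumed admin scanResult endpoint and scan id, for illustration only.
SCAN_RESULT_URL = (
    "https://api.powerbi.com/v1.0/myorg/admin/workspaces/scanResult/scan-1"
)

# Illustrative scan result: one workspace, one dataset, one table with columns.
MOCK_SCAN_RESULT = {
    "workspaces": [
        {
            "id": "ws-1",
            "name": "demo-workspace",
            "datasets": [
                {
                    "id": "ds-1",
                    "name": "library-dataset",
                    "tables": [
                        {
                            "name": "articles",
                            "columns": [{"name": "id", "dataType": "Int64"}],
                        }
                    ],
                }
            ],
        }
    ]
}


def test_mock_scan_result_has_schema() -> None:
    with requests_mock.Mocker() as m:
        m.get(SCAN_RESULT_URL, json=MOCK_SCAN_RESULT)
        payload = requests.get(SCAN_RESULT_URL).json()

    tables = payload["workspaces"][0]["datasets"][0]["tables"]
    # Column data must be present for schemaMetadata / column lineage to be emitted.
    assert tables[0]["columns"]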

@sgomezvillamor requested a review from hsheth2 on March 14, 2025 10:36
@datahub-cyborg bot added the needs-review label and removed the pending-submitter-response label Mar 14, 2025
@mminichino
Author

My updates have been tested locally.

@mminichino changed the title from "fix(ingest/powerbi): Fix column lineage" to "feat(ingest/powerbi): PowerBI source updates" Mar 14, 2025
@sgomezvillamor
Contributor

CI failure

FAILED tests/integration/powerbi/test_profiling.py::test_profiling - Failed: Metadata files differ (use `pytest --update-golden-files` to update):
Urn changed, urn:li:dataset:(urn:li:dataPlatform:powerbi,library-dataset.articles,PROD):
<schemaMetadata> added
= 1 failed, 286 passed, 21 skipped, 1676 deselected, 196 warnings in 1166.63s (0:19:26) =

> Task :metadata-ingestion:testIntegrationBatch0 FAILED

I would say this is related to the breaking change in the extract_dataset_schema config

@datahub-cyborg bot added the pending-submitter-response label and removed the needs-review label Mar 14, 2025
@mminichino
Author

mminichino commented Mar 14, 2025

The mocked API scan result does not match what the API actually returns, but the test was wired for a specific input and output. I should just need to either add the setting or update the golden file. I'll look at it.

@datahub-cyborg bot added the needs-review label and removed the pending-submitter-response label Mar 14, 2025
@mminichino
Author

mminichino commented Mar 14, 2025

Yes, that golden file does not include field data. However, the documentation says that schema metadata is enabled by default, so in my view that golden file was incorrect from the start. That exact issue has been raised by several customers.

@hsheth2
Collaborator

hsheth2 commented Mar 14, 2025

@mminichino sounds like we need to update the mock to include the fields, and then update/regenerate the golden file accordingly

@mminichino
Author

Yes, I am in the process of doing that

@mminichino
Author

Golden files updated. Integration test passes.
