-
Notifications
You must be signed in to change notification settings - Fork 3.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(ingest/powerbi): PowerBI source updates #12857
base: master
Are you sure you want to change the base?
Conversation
❌ 2 Tests Failed:
View the full list of 2 ❄️ flaky tests
To view more test analytics, go to the Test Analytics Dashboard |
workspace_name_pattern: AllowDenyPattern = pydantic.Field( | ||
default=AllowDenyPattern.allow_all(), | ||
description="Regex patterns to filter PowerBI workspaces in ingestion." | ||
" Note: This field works in conjunction with 'workspace_type_filter' and both must be considered when filtering workspaces.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To ensure a workspace is not filtered out, both the workspace ID and name must be allowed. Is that correct?
Similarly, if either of them is denied, the workspace will be filtered out, right?
If so, we may add such a precise explanation in the description.
IMO this confusion comes from the workspace_id_pattern
, which should have never been a pattern. Given workspace id is a uuid-like, having regex patterns there is not practical. Instead, it should be workspace_ids: List[str]
, with the list of workspaces ids to be included, and then, only if not given, then the workspace_name_pattern
would be considered.
We can skip workspace_id_pattern
deprecation and just refine docs to avoid confusion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, the filter is an OR condition. You can just specify a name filter, or an ID filter or both. Do if you just have a name filer, all IDs will be allowed. I'll update the docs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Docs updated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this scenario and assuming name2
is not the name of the workspace id id1
workspace_id_pattern:
allow:
- id1
workspace_name_pattern:
allow:
- name2
No workspace will be allowed, right? that's why I see this as an AND. Of course, if we have allow all (the default) in any of the two, then everything depends on the other
And the same scenario:
workspace_id_pattern:
deny:
- id1
workspace_name_pattern:
deny:
- name2
Everything will be allowed but workspace id id1
and workspace name name2
. So an OR in this case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you allow one by ID and one by name both will be allowed. For the second scenario, if you deny one by ID and another by name, both will be denied.
@@ -373,7 +387,7 @@ class PowerBiDashboardSourceConfig( | |||
) | |||
# Enable/Disable extracting dataset schema | |||
extract_dataset_schema: bool = pydantic.Field( | |||
default=False, | |||
default=True, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a sort of breaking change, why?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is actually not a breaking change. The API call is hard coded to return all the possible attributes, including dataset schema info so this should have been always set to true.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure about the API call, but emitting or not the aspect depends on that config flag
datahub/metadata-ingestion/src/datahub/ingestion/source/powerbi/powerbi.py
Lines 445 to 446 in 730f1b8
if self.__config.extract_dataset_schema: | |
dataset_mcps.extend(self.extract_dataset_schema(table, ds_urn)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct, though the problem is this should have never been an option based on the way the source works. It is confusing at best. You can set extract_column_level_lineage
but if you don't extract the schema data from the API scan result, you won't get column level lineage. However, the dataset schema data, if present, will always be returned.
params={
"datasetExpressions": True,
"datasetSchema": True,
"datasourceDetails": True,
"getArtifactUsers": True,
"lineage": True,
},
metadata-ingestion/src/datahub/ingestion/source/powerbi/m_query/pattern_handler.py
Show resolved
Hide resolved
metadata-ingestion/src/datahub/ingestion/source/powerbi/m_query/pattern_handler.py
Show resolved
Hide resolved
metadata-ingestion/src/datahub/ingestion/source/powerbi/m_query/pattern_handler.py
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looks good
I just miss some testing
Have you considered mocking some test in metadata-ingestion/tests/integration/powerbi/test_powerbi.py
?
or has this been end to end tested locally somehow?
My updates have been tested locally. |
CI failure
I would say this is related to the breaking change in the |
The mocked API scan result does not match what the API is actually returning. But the test was wired for a specific input and output. I should just need to either add the setting or update the golden. I'll look at it. |
Yes, that golden does not include field data. However, in the documentation we say that Schema Metadata is enabled by default. So, in my view, that golden was incorrect from the start. That exact issue has been raised by several customers. |
@mminichino sounds like we need to update the mock to include the fields, and then update/regenerate the golden file accordingly |
Yes, I am in the process of doing that |
Golden files updated. Integration test passes. |
Checklist