
Scoping "Seen Tables". Improving table scan performance. #3741

Open

FastLee wants to merge 17 commits into main from experimental/scoped_seen_tables
Conversation

FastLee
Contributor

@FastLee FastLee commented Feb 24, 2025

Scopes down the table scan for migrated tables:

  1. When upgrading tables, looks only at the mapped tables.
  2. Skips Delta Sharing catalogs and the information schema.
  3. Parallelizes the scan.
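The parallelized scan in point 3 can be sketched with a thread pool. This is an illustrative sketch, not the PR's actual code; `list_tables`, `scan`, and `MAX_THREADS` are hypothetical names:

```python
from concurrent.futures import ThreadPoolExecutor

MAX_THREADS = 8  # number of worker threads


def list_tables(schema: str) -> list[str]:
    # Stand-in for a per-schema metastore listing call.
    return [f"{schema}.t1", f"{schema}.t2"]


def scan(schemas: list[str]) -> list[str]:
    # Fan the per-schema listings out over a thread pool, then flatten
    # the per-schema results into one list of tables.
    with ThreadPoolExecutor(max_workers=MAX_THREADS) as pool:
        results = pool.map(list_tables, schemas)
    return [table for tables in results for table in tables]
```

`ThreadPoolExecutor.map` preserves input order, so the flattened result is deterministic even though the listings run concurrently.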

@FastLee FastLee requested a review from a team as a code owner February 24, 2025 15:17
@FastLee FastLee force-pushed the experimental/scoped_seen_tables branch from 6836326 to 6257fd3 Compare February 24, 2025 15:22
@FastLee FastLee requested a review from JCZuurmond February 24, 2025 15:40

github-actions bot commented Feb 24, 2025

❌ 54/55 passed, 1 failed, 5 skipped, 4h37m11s total

❌ test_all_grant_types: AssertionError: assert {('ANONYMOUS ...dummy_twu2b')} == {('ANONYMOUS ..._fzcf6'), ...} (12m12.2s)
AssertionError: assert {('ANONYMOUS ...dummy_twu2b')} == {('ANONYMOUS ..._fzcf6'), ...}
  
  Extra items in the right set:
  ('ANY FILE', None)
  
  Full diff:
    {
        (
            'ANONYMOUS FUNCTION',
  -         None,
  -     ),
  -     (
  -         'ANY FILE',
            None,
        ),
        (
            'CATALOG',
            'hive_metastore',
        ),
        (
            'DATABASE',
            'hive_metastore.dummy_si8om',
        ),
        (
            'TABLE',
            'hive_metastore.dummy_si8om.dummy_tdo9r',
        ),
        (
            'UDF',
            'hive_metastore.dummy_si8om.dummy_fzcf6',
        ),
        (
            'VIEW',
            'hive_metastore.dummy_si8om.dummy_twu2b',
        ),
    }
20:52 INFO [databricks.labs.ucx.install] Creating ucx schemas...
[gw4] linux -- Python 3.10.16 /home/runner/work/ucx/ucx/.venv/bin/python
20:52 INFO [databricks.labs.ucx.install] Creating ucx schemas...
20:53 DEBUG [databricks.labs.ucx.framework.crawlers] [hive_metastore.dummy_scnlh.grants] fetching grants inventory
20:53 DEBUG [databricks.labs.ucx.framework.crawlers] [hive_metastore.dummy_scnlh.grants] crawling new set of snapshot data for grants
20:53 DEBUG [databricks.labs.ucx.framework.crawlers] [hive_metastore.dummy_scnlh.tables] fetching tables inventory
20:53 DEBUG [databricks.labs.ucx.framework.crawlers] [hive_metastore.dummy_scnlh.tables] crawling new set of snapshot data for tables
20:53 DEBUG [databricks.labs.ucx.hive_metastore.tables] [hive_metastore.dummy_si8om] listing tables and views
20:53 DEBUG [databricks.labs.ucx.hive_metastore.tables] [hive_metastore.dummy_si8om.dummy_tdo9r] fetching table metadata
20:53 DEBUG [databricks.labs.ucx.hive_metastore.tables] [hive_metastore.dummy_si8om.dummy_twu2b] fetching table metadata
20:53 DEBUG [databricks.labs.ucx.framework.crawlers] [hive_metastore.dummy_scnlh.tables] found 2 new records for tables
20:53 DEBUG [databricks.labs.ucx.framework.crawlers] [hive_metastore.dummy_scnlh.udfs] fetching udfs inventory
20:53 DEBUG [databricks.labs.ucx.framework.crawlers] [hive_metastore.dummy_scnlh.udfs] crawling new set of snapshot data for udfs
20:53 DEBUG [databricks.labs.ucx.hive_metastore.udfs] [hive_metastore.dummy_si8om] listing udfs
20:53 DEBUG [databricks.labs.ucx.hive_metastore.udfs] [hive_metastore.dummy_si8om.dummy_fzcf6] fetching udf metadata
20:53 DEBUG [databricks.labs.ucx.framework.crawlers] [hive_metastore.dummy_scnlh.udfs] found 1 new records for udfs
20:53 WARNING [databricks.labs.ucx.hive_metastore.grants] Schema hive_metastore.dummy_se77b no longer existed
20:53 WARNING [databricks.labs.ucx.hive_metastore.grants] Schema hive_metastore.dummy_sxij9 no longer existed
21:04 ERROR [databricks.labs.ucx.hive_metastore.grants] Couldn't fetch grants for object ANY FILE : TEMPORARILY_UNAVAILABLE: The service at /api/2.0/sql-acl/get-permissions is taking too long to process your request. Please try again later or try a faster operation. [TraceId: 00-5b189c0d1441e7dcfed2ec0982f45743-7d55609841dcf75c-00]
21:04 DEBUG [databricks.labs.ucx.framework.crawlers] [hive_metastore.dummy_scnlh.grants] found 33 new records for grants

Running from acceptance #8385

Partially verified

This commit is signed with the committer’s verified signature.
JCZuurmond’s contribution has been verified via GPG key.
We cannot verify signatures from co-authors, and some of the co-authors attributed to this commit require their commits to be signed.

@FastLee FastLee force-pushed the experimental/scoped_seen_tables branch from 6257fd3 to c51ee78 Compare February 24, 2025 20:51
Member

@JCZuurmond JCZuurmond left a comment


@FastLee: I have to think about the choice to add a scope for a bit.

It feels redundant to pass the in-scope tables, extract the schema names from those tables, and then filter for those schemas while iterating over all schemas. Why not loop directly over the in-scope schemas?

A similar approach is repeated for the table names, which makes the schema loop even more redundant. Why not go directly to the in-scope tables?

Moreover, we typically provide the scope using an `include_` argument at initialization. That might not work here, but I have to think about that.
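One way to read this suggestion (a hypothetical sketch, not the PR's code): group the mapped tables by schema up front, so each in-scope schema is visited exactly once rather than filtered out of a loop over all schemas. The table names below are made up for illustration:

```python
from collections import defaultdict

# Hypothetical mapped-table names for illustration.
mapped_tables = [
    "hive_metastore.sales.orders",
    "hive_metastore.sales.items",
    "hive_metastore.hr.people",
]

# Group table names by their schema once, up front.
tables_by_schema: dict[str, list[str]] = defaultdict(list)
for full_name in mapped_tables:
    schema, _, table = full_name.rpartition(".")
    tables_by_schema[schema].append(table)

# Iterate only the in-scope schemas and their tables; no global
# schema loop and no skip-filter needed.
for schema, tables in tables_by_schema.items():
    pass  # crawl this schema's tables directly
```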

@@ -171,7 +191,10 @@ def _iter_catalogs(self) -> Iterable[CatalogInfo]:

def _iter_schemas(self) -> Iterable[SchemaInfo]:
for catalog in self._iter_catalogs():
if catalog.name is None:
if catalog.name is None or catalog.catalog_type in (
Member


This probably has the same effect as the filter on line 186. If not, move the filter there so that catalog filtering happens inside `_iter_catalogs`.

if (
schema.catalog_name is None
or schema.name is None
or schema.name == "information_schema"
Member


Move this into `_iter_schemas`.

tables: list = []
logger.info(f"Scanning {len(tasks)} schemas for tables")
table_lists = Threads.gather("list tables", tasks, self.API_LIMIT)
# Combine tuple of lists to a list
Member


Use itertools.chain instead
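For reference, a minimal example of the suggested flattening, with made-up per-schema table lists standing in for the gathered task results:

```python
import itertools

# Per-schema table lists, as the gathered tasks might return them
# (illustrative values, including an empty schema).
table_lists = [["a.t1", "a.t2"], ["b.t1"], []]

# One-step flatten, replacing the explicit extend loop.
tables = list(itertools.chain.from_iterable(table_lists))
```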

for table_list in table_lists[0]:
tables.extend(table_list)
for table in tables:
if not isinstance(table, TableInfo):
Member


Is this supposed to happen? Based on the type hints it should not, right?

)
tables: list = []
logger.info(f"Scanning {len(tasks)} schemas for tables")
table_lists = Threads.gather("list tables", tasks, self.API_LIMIT)
Member


It is not an API limit, but the number of threads.
