Scoping "Seen Tables". Improving table scan performance. #3741
❌ 54/55 passed, 1 failed, 5 skipped, 4h37m11s total
❌ test_all_grant_types: AssertionError: assert {('ANONYMOUS ...dummy_twu2b')} == {('ANONYMOUS ..._fzcf6'), ...} (12m12.2s)
Running from acceptance #8385
@FastLee: I have to think about the choice to add a scope for a bit.
It feels redundant to pass the in-scope tables, extract the schema names from those tables, then skip the in-scope schemas when iterating over all schemas. Why not loop directly over the in-scope schemas?
A similar approach is repeated for the table names, which makes the schema loop even more redundant. Why not go directly to the in-scope tables?
Moreover, we typically provide the scope using an include_ at initialization, which might not work here, but I have to think about that.
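A minimal sketch of the "loop directly over the in-scope schemas" idea raised above. The `TableRef` class and the `schemas_in_scope` helper are hypothetical stand-ins for illustration, not names from this PR; the point is only that the distinct schemas can be derived once from the scoped tables instead of filtering every schema in the metastore.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class TableRef:
    """Hypothetical stand-in for an in-scope table reference."""
    catalog: str
    schema: str
    name: str


def schemas_in_scope(tables: Iterable[TableRef]) -> list[tuple[str, str]]:
    """Return the distinct (catalog, schema) pairs, in first-seen order."""
    # dict.fromkeys de-duplicates while preserving insertion order.
    return list(dict.fromkeys((t.catalog, t.schema) for t in tables))


scope = [
    TableRef("main", "sales", "orders"),
    TableRef("main", "sales", "items"),
    TableRef("main", "hr", "staff"),
]
in_scope_schemas = schemas_in_scope(scope)
```

With this shape, only the schemas in `in_scope_schemas` would need to be listed, and no skip check inside a loop over all schemas is required.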
@@ -171,7 +191,10 @@ def _iter_catalogs(self) -> Iterable[CatalogInfo]:

     def _iter_schemas(self) -> Iterable[SchemaInfo]:
         for catalog in self._iter_catalogs():
-            if catalog.name is None:
+            if catalog.name is None or catalog.catalog_type in (
This probably has the same effect as the filter on line 186. If not, move the filter there so that catalog filtering happens inside _iter_catalogs.
if (
    schema.catalog_name is None
    or schema.name is None
    or schema.name == "information_schema"
Move this into _iter_schemas.
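A rough sketch of what keeping the filter inside the iterator could look like, so callers only ever see usable schemas. The `Schema` dataclass and the free function `iter_schemas` are simplified, hypothetical stand-ins; the three conditions are the ones visible in the diff fragment above.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass
class Schema:
    """Hypothetical, simplified stand-in for a schema record."""
    catalog_name: Optional[str]
    name: Optional[str]


def iter_schemas(raw: Iterable[Schema]) -> Iterator[Schema]:
    """Yield only schemas that are safe to scan; the filtering lives here."""
    for schema in raw:
        if (
            schema.catalog_name is None
            or schema.name is None
            or schema.name == "information_schema"
        ):
            continue
        yield schema


raw = [
    Schema("main", "sales"),
    Schema("main", "information_schema"),
    Schema(None, "orphan"),
    Schema("main", None),
]
kept = [(s.catalog_name, s.name) for s in iter_schemas(raw)]
```

The same pattern would apply to the catalog filter: the generator skips unwanted entries, and the calling loop stays free of repeated guard clauses.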
tables: list = []
logger.info(f"Scanning {len(tasks)} schemas for tables")
table_lists = Threads.gather("list tables", tasks, self.API_LIMIT)
# Combine tuple of lists to a list
Use itertools.chain instead.
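For illustration, flattening with `itertools.chain.from_iterable` might look like the sketch below. The data is made up, and it assumes (as the `table_lists[0]` indexing in the diff suggests) that the first element of the gathered result holds one list of tables per schema.

```python
from itertools import chain

# Made-up stand-in for the gathered result: (per-schema lists, errors).
gathered = ([["orders", "items"], [], ["staff"]], [])

# Flatten the per-schema lists into a single list without a manual
# extend loop; empty lists simply contribute nothing.
tables = list(chain.from_iterable(gathered[0]))
```

`chain.from_iterable` is lazy, so the `list(...)` call can be dropped if the tables are only iterated once afterwards.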
for table_list in table_lists[0]:
    tables.extend(table_list)
for table in tables:
    if not isinstance(table, TableInfo):
Is this supposed to happen? It should not based on the type hinting, right?
)
tables: list = []
logger.info(f"Scanning {len(tasks)} schemas for tables")
table_lists = Threads.gather("list tables", tasks, self.API_LIMIT)
This is not an API limit but the number of threads.
Scoped down table scan for migrated tables.