feat(ingest/bigquery): improve profiler to support multiple partition columns and support external table profiling #12825

Draft
wants to merge 28 commits into
base: master
64e48a4
Bq multi partitions (#12824)
acrylJonny Mar 10, 2025
48f41b2
updates for missed partition tables
acrylJonny Mar 10, 2025
9d7a83e
upping timeout to 5m
acrylJonny Mar 10, 2025
a025b85
improvements to pick up non-null partitions
acrylJonny Mar 10, 2025
023365c
testing updates
acrylJonny Mar 11, 2025
3078e25
Update profiler.py
acrylJonny Mar 11, 2025
3f3151a
removing timeouts
acrylJonny Mar 11, 2025
664b217
updating logic ordering
acrylJonny Mar 11, 2025
4020201
trying cheaper sampling methods
acrylJonny Mar 11, 2025
bad6a87
Merge branch 'master' into bq-multi-partition-profiling
acrylJonny Mar 11, 2025
d462564
efficiency improvements
acrylJonny Mar 11, 2025
e95d9df
Update profiler.py
acrylJonny Mar 11, 2025
2af9681
improvements
acrylJonny Mar 13, 2025
cd9eac3
linting
acrylJonny Mar 13, 2025
72cb157
Merge branch 'master' into bq-multi-partition-profiling
acrylJonny Mar 13, 2025
910c7e3
Merge branch 'master' into bq-multi-partition-profiling
acrylJonny Mar 14, 2025
20acb75
Update profiler.py
acrylJonny Mar 14, 2025
b177908
Update profiler.py
acrylJonny Mar 14, 2025
18379ea
Update profiler.py
acrylJonny Mar 14, 2025
bc18e1b
Update profiler.py
acrylJonny Mar 14, 2025
62bafec
improved sql queries
acrylJonny Mar 14, 2025
d30a5ac
Update profiler.py
acrylJonny Mar 14, 2025
3394b49
Update profiler.py
acrylJonny Mar 14, 2025
c4b4ac1
Update profiler.py
acrylJonny Mar 14, 2025
7d93a00
Update profiler.py
acrylJonny Mar 14, 2025
c65aa97
Merge branch 'master' into bq-multi-partition-profiling
acrylJonny Mar 14, 2025
2613db5
Update profiler.py
acrylJonny Mar 14, 2025
0470262
updates
acrylJonny Mar 14, 2025
@@ -62,20 +62,19 @@ class BigqueryTableConstraint:

 @dataclass
 class PartitionInfo:
-    field: str
-    # Data type is optional as we not have it when we set it from TimePartitioning
-    column: Optional[BigqueryColumn] = None
+    fields: List[str]  # Changed from single field to list
+    columns: Optional[List[BigqueryColumn]] = None
     type: str = TimePartitioningType.DAY
     expiration_ms: Optional[int] = None
     require_partition_filter: bool = False

-    # TimePartitioning field doesn't provide data_type so we have to add it afterwards
     @classmethod
     def from_time_partitioning(
         cls, time_partitioning: TimePartitioning
     ) -> "PartitionInfo":
+        """Convert BigQuery time partitioning to PartitionInfo."""
         return cls(
-            field=time_partitioning.field or "_PARTITIONTIME",
+            fields=[time_partitioning.field or "_PARTITIONTIME"],  # Now a list
             type=time_partitioning.type_,
             expiration_ms=time_partitioning.expiration_ms,
             require_partition_filter=time_partitioning.require_partition_filter,
@@ -90,7 +89,7 @@ def from_range_partitioning(
             return None

         return cls(
-            field=field,
+            fields=[field],
             type=RANGE_PARTITION_NAME,
         )
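
To make the new shape of `PartitionInfo` easier to review, here is a minimal, self-contained sketch of how the dataclass might be constructed after this change. It is not the DataHub source: `BigqueryColumn` is reduced to a hypothetical two-attribute stand-in, the default `TimePartitioningType.DAY` is replaced by the literal `"DAY"`, and the column names in the multi-column case are made up for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class BigqueryColumn:
    # Hypothetical stand-in; the real class carries more metadata.
    name: str
    data_type: str


@dataclass
class PartitionInfo:
    # Mirrors the fields shown in the diff above.
    fields: List[str]
    columns: Optional[List[BigqueryColumn]] = None
    type: str = "DAY"  # assumption: stands in for TimePartitioningType.DAY
    expiration_ms: Optional[int] = None
    require_partition_filter: bool = False


# Single-column partitioning still works; the name is simply wrapped in a list.
single = PartitionInfo(fields=["_PARTITIONTIME"])

# Illustrative multi-column case (column names are invented).
multi = PartitionInfo(
    fields=["event_date", "region"],
    require_partition_filter=True,
)
```

Because `fields` has no default, every caller that previously passed `field=` has to be updated to pass a list, which is what the two constructor changes in this file do.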

@@ -630,15 +630,13 @@ def _process_table(
         )

         # If table has time partitioning, set the data type of the partitioning field
-        if table.partition_info:
-            table.partition_info.column = next(
-                (
-                    column
-                    for column in columns
-                    if column.name == table.partition_info.field
-                ),
-                None,
-            )
+        if table.partition_info and table.partition_info.fields:
+            matching_columns = [
+                column
+                for column in columns
+                if column.name in table.partition_info.fields
+            ]
+            table.partition_info.columns = matching_columns
         yield from self.gen_table_dataset_workunits(
             table, columns, project_id, dataset_name
         )
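
The column-matching step in this hunk can be sketched in isolation to show two behaviours worth checking in review. The helper below is hypothetical (`Column` and `match_partition_columns` are not part of the PR); it uses the same list-comprehension filter as the diff and additionally reports partition fields that have no matching schema column.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Column:
    # Hypothetical stand-in for BigqueryColumn.
    name: str
    data_type: str


def match_partition_columns(
    columns: List[Column], fields: List[str]
) -> Tuple[List[Column], List[str]]:
    """Return (columns named in fields, fields with no matching column)."""
    wanted = set(fields)
    # Same filter as the diff: results follow schema order, not fields order.
    matched = [column for column in columns if column.name in wanted]
    found = {column.name for column in matched}
    missing = [field for field in fields if field not in found]
    return matched, missing


schema = [
    Column("id", "INT64"),
    Column("region", "STRING"),
    Column("event_date", "DATE"),
]
matched, missing = match_partition_columns(
    schema, ["event_date", "region", "_PARTITIONTIME"]
)
only_pseudo, _ = match_partition_columns(schema, ["_PARTITIONTIME"])
```

In this sketch `matched` comes back in schema order (`region` before `event_date`), and a pseudo-column such as `_PARTITIONTIME` produces no match. With the filter as written in the diff, a table partitioned only on a pseudo-column would therefore end up with `columns == []`, whereas the old code set `column` to `None`; downstream code that tests `is None` may need to treat an empty list the same way.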