Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve Iceberg table properties building #25183

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

jinyangli34
Copy link
Contributor

Description

When building Iceberg table properties with many files under one partition, process this partition only once

Additional context and related issues

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
( ) Release notes are required, with the following suggested text:

## Section
* Fix some things. ({issue}`issuenumber`)

@cla-bot cla-bot bot added the cla-signed label Feb 27, 2025
@github-actions github-actions bot added the iceberg Iceberg connector label Feb 27, 2025
@jinyangli34 jinyangli34 force-pushed the jinyang-dedup_partition_values branch from efea506 to 46c61e3 Compare February 28, 2025 04:38
@ebyhr ebyhr requested a review from raunaqmorarka February 28, 2025 05:03
Iterable<FileScanTask> files = () -> lazyFiles.get().iterator();
Iterable<PartitionStruct> partitionStructs = () -> lazyFiles.get().stream()
.map(PartitionStruct::new)
.distinct()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

StructLike does not implement equals and hashCode. I think we need to use StructLikeWrapperWithFieldIdToIndex instead.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will use a LinkedHashSet internally that also uses PartitionSpec for de-dup and holds on to that object.
Since checking the partition values is sufficient, it would be better to explicitly create a HashSet on StructLikeWrapperWithFieldIdToIndex and write something similar to io.trino.plugin.iceberg.PartitionsTable#getStatisticsByPartition

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 updated with StructLikeWrapperWithFieldIdToIndex

@jinyangli34 jinyangli34 force-pushed the jinyang-dedup_partition_values branch 2 times, most recently from b3f40bd to b6f21f6 Compare March 3, 2025 23:27
When building Iceberg table properties with many files under one partition,
process this partition only once
@jinyangli34 jinyangli34 force-pushed the jinyang-dedup_partition_values branch from b6f21f6 to 90bfe5a Compare March 6, 2025 04:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla-signed iceberg Iceberg connector
Development

Successfully merging this pull request may close these issues.

3 participants