Spark-3.5: Add spark action to compute partition stats #12450

ajantha-bhat · 2025-03-04T06:22:13Z

No description provided.

ajantha-bhat · 2025-03-04T06:24:54Z

@huaxingao, @karuppayya, @flyrain, @aokolnychyi, @deniskuzZ, @gszadovszky, @pvary: tagging people who might be interested in reviewing this feature.

I will add a call procedure on top of it once this PR is merged.

ajantha-bhat · 2025-03-07T07:12:02Z

...5/spark/src/main/java/org/apache/iceberg/spark/actions/ComputePartitionStatsSparkAction.java

+    LOG.info("Computing partition stats for {} (snapshot {})", table.name(), snapshot.snapshotId());
+    PartitionStatisticsFile statisticsFile;
+    try {
+      statisticsFile = PartitionStatsHandler.computeAndWriteStatsFile(table, snapshot.snapshotId());


This uses the local algorithm instead of the distributed one. We did a POC and ran JMH benchmarks, and found that the serialization cost is high for the distributed approach, so we went with the local algorithm.

More info: #9437 (comment) and #9437 (comment)
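
For readers unfamiliar with the trade-off, here is a rough sketch of what a "local" aggregation looks like in general: one process folds file entries into an in-memory map keyed by partition, so no per-entry serialization across executors is involved. This is not the code in `PartitionStatsHandler`; the `FileEntry` and `Stats` classes, their fields, and the `aggregate` method are hypothetical names made up for illustration only.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: NOT the Iceberg implementation.
// All class, field, and method names here are hypothetical.
class LocalPartitionStatsSketch {

  /** Hypothetical stand-in for one data file entry read from a manifest. */
  public static final class FileEntry {
    final String partition;
    final long recordCount;
    final long sizeInBytes;

    public FileEntry(String partition, long recordCount, long sizeInBytes) {
      this.partition = partition;
      this.recordCount = recordCount;
      this.sizeInBytes = sizeInBytes;
    }
  }

  /** Hypothetical running totals kept for a single partition. */
  public static final class Stats {
    long fileCount;
    long recordCount;
    long totalSizeInBytes;
  }

  /**
   * Folds all entries into one in-memory map on the calling JVM.
   * Nothing is shuffled or serialized between workers.
   */
  public static Map<String, Stats> aggregate(List<FileEntry> entries) {
    Map<String, Stats> statsByPartition = new LinkedHashMap<>();
    for (FileEntry entry : entries) {
      // Create the partition's accumulator on first sight, then update it in place.
      Stats stats = statsByPartition.computeIfAbsent(entry.partition, key -> new Stats());
      stats.fileCount += 1;
      stats.recordCount += entry.recordCount;
      stats.totalSizeInBytes += entry.sizeInBytes;
    }
    return statsByPartition;
  }
}
```

A distributed variant would have to ship each entry (or partial result) between executors before merging, which is where the serialization cost mentioned above comes from.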

github-actions bot added API spark core labels Mar 4, 2025

ajantha-bhat mentioned this pull request Feb 28, 2025

Partition stats task tracker #8450

Open

12 tasks

ajantha-bhat requested a review from aokolnychyi March 4, 2025 07:22

ajantha-bhat mentioned this pull request Mar 4, 2025

Spark-3.5: Add procedure to compute partition stats #12451

Draft

jbonofre added this to the Iceberg 1.9.0 milestone Mar 6, 2025

Spark-3.5: Add spark action to compute partition stats

b04dddf

ajantha-bhat force-pushed the stats_action branch from a98d3cb to b04dddf Compare March 7, 2025 06:45
