Spark-3.5: Add procedure to compute partition stats #12451

ajantha-bhat · 2025-03-04T08:36:03Z

Depends on #12450, hence kept in draft.

singhpk234 · 2025-03-04T14:28:17Z

...5/spark/src/main/java/org/apache/iceberg/spark/actions/ComputePartitionStatsSparkAction.java

+    LOG.info("Computing partition stats for {} (branch {})", table.name(), branch);
+    PartitionStatisticsFile statisticsFile;
+    try {
+      statisticsFile = PartitionStatsHandler.computeAndWriteStatsFile(table, branch);


[doubt] Wouldn't it be better to do the stats computation on the executors, rather than computing it only on the driver in a multi-threaded way?

We ran benchmarks, and the local algorithm turned out to be more performant than the distributed one.
#9437 (comment)

So, going with the non-distributed approach.

That makes sense, thank you for sharing the benchmark. Just for my understanding, what machine configuration was used to run it?

These were JMH benchmarks run on my machine (MacBook M2 Max). Anton also ran a similar test and commented on that PR.
#9437 (comment)
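
For readers following the thread, here is a minimal sketch of what "local, multi-threaded on the driver" aggregation can look like. It is not the code in `PartitionStatsHandler`; the class `LocalPartitionStatsSketch` and the types `FileEntry` and `Stats` are hypothetical stand-ins invented for illustration, and partitions are modeled as plain strings. Each manifest is aggregated on a local thread pool and the partial maps are merged on the calling thread.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch; none of these types are Iceberg classes.
final class LocalPartitionStatsSketch {

  // Stand-in for one data file entry read from a manifest.
  static final class FileEntry {
    final String partition;
    final long recordCount;
    final long sizeInBytes;

    FileEntry(String partition, long recordCount, long sizeInBytes) {
      this.partition = partition;
      this.recordCount = recordCount;
      this.sizeInBytes = sizeInBytes;
    }
  }

  // Stand-in for the per-partition aggregate.
  static final class Stats {
    final long fileCount;
    final long recordCount;
    final long sizeInBytes;

    Stats(long fileCount, long recordCount, long sizeInBytes) {
      this.fileCount = fileCount;
      this.recordCount = recordCount;
      this.sizeInBytes = sizeInBytes;
    }

    // Combines two aggregates for the same partition.
    Stats plus(Stats other) {
      return new Stats(
          fileCount + other.fileCount,
          recordCount + other.recordCount,
          sizeInBytes + other.sizeInBytes);
    }
  }

  // Aggregates the entries of a single manifest into a partition -> stats map.
  static Map<String, Stats> aggregateManifest(List<FileEntry> entries) {
    Map<String, Stats> result = new HashMap<>();
    for (FileEntry e : entries) {
      result.merge(e.partition, new Stats(1, e.recordCount, e.sizeInBytes), Stats::plus);
    }
    return result;
  }

  // Submits one task per manifest to a local pool, then merges the partial
  // results on the calling thread. No cluster executors are involved.
  static Map<String, Stats> aggregate(List<List<FileEntry>> manifests, int threads)
      throws InterruptedException, ExecutionException {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    try {
      List<Future<Map<String, Stats>>> partials = new ArrayList<>();
      for (List<FileEntry> manifest : manifests) {
        partials.add(pool.submit(() -> aggregateManifest(manifest)));
      }
      Map<String, Stats> merged = new HashMap<>();
      for (Future<Map<String, Stats>> partial : partials) {
        partial.get().forEach((partition, stats) -> merged.merge(partition, stats, Stats::plus));
      }
      return merged;
    } finally {
      pool.shutdown();
    }
  }
}
```

Calling `LocalPartitionStatsSketch.aggregate(manifests, 2)` with two manifests that both contain files for the same partition returns a single merged entry for that partition. The trade-off discussed above is that this avoids job scheduling and shuffle overhead, at the cost of being bounded by the driver's cores and memory.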

api/src/main/java/org/apache/iceberg/actions/ComputePartitionStats.java

github-actions bot added the API label Mar 4, 2025

ajantha-bhat changed the title ~~Procedure~~ Spark-3.5: Add procedure to compute partition stats Mar 4, 2025

github-actions bot added spark core labels Mar 4, 2025

ajantha-bhat marked this pull request as draft March 4, 2025 08:36

ajantha-bhat force-pushed the procedure branch from 68d3683 to 05671b4 Compare March 4, 2025 10:40

singhpk234 reviewed Mar 4, 2025


ajantha-bhat mentioned this pull request Mar 6, 2025

Data: Expose snapshot-id instead of branch for computing partition stats #12464

Merged

ajantha-bhat added 2 commits March 7, 2025 12:15

Spark-3.5: Add spark action to compute partition stats

b04dddf

Spark-3.5: Add procedure to compute partition stats

54136ec

ajantha-bhat force-pushed the procedure branch from 05671b4 to 54136ec Compare March 7, 2025 10:28

ajantha-bhat mentioned this pull request Mar 7, 2025

Partition stats task tracker #8450

Open

12 tasks
