Spark-3.5: Add procedure to compute partition stats #13480

ajantha-bhat · 2025-07-07T09:41:46Z

Wrapper on the Spark action PR that got merged: #12450

Just call:

CALL catalog_name.system.compute_partition_stats('db.sample');

Then observe the stats files registered to the table:

table.partitionStatisticsFiles()

Fixes: #10106

ajantha-bhat · 2025-07-07T09:43:23Z

Also cc: @karuppayya

ajantha-bhat · 2025-07-09T00:53:24Z

Thanks @nastra and @hussein-awala for the review.

@amogh-jahagirdar, @RussellSpitzer, or @szehon-ho: would any of you also like to review? If not, we will go ahead with these changes.

szehon-ho · 2025-07-09T04:44:39Z

.../spark/src/main/java/org/apache/iceberg/spark/procedures/ComputePartitionStatsProcedure.java

+            table
+                .partitionStatisticsFiles()
+                .forEach(file -> updateStats.removePartitionStatistics(file.snapshotId()));
+            updateStats.commit();


Are these two separate transactions? Is there no way to make it all or nothing (to prevent the statistics from being removed if the compute fails)?

Yes. When we designed the original core API, we didn't provide a full refresh option because users rarely need it (only when the stats are corrupted). Even then, when the incremental compute can't read the previous stats file, it falls back to a full compute. So if users need a full refresh for any other reason, they have to unregister the stats and call compute again.

If two separate transactions are a bad idea, I can remove this option and we can add it in the future if required (if any new use case comes up). WDYT?

Hm, it would be better if the underlying action supported a full refresh. What do you think?

Yeah, maybe we can remove it from the procedure for now.

Agreed. I have removed the full refresh option now.

Since the underlying action automatically falls back to a full compute if the stats are corrupted, we don't have a use case for full refresh yet. If new use cases come up in the future, we can support it in the underlying API to make it atomic, instead of removing the stats and then calling compute.

PR is ready now.
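
The atomicity concern in this thread can be illustrated with a toy model. The sketch below is not Iceberg code: `FullRefreshSketch`, `refreshInTwoCommits`, and `refreshInOneCommit` are hypothetical names, and an in-memory map of snapshot ID to file path stands in for the table's registered partition stats files. It only shows why "remove, commit, then compute" can lose the old stats when the compute fails, while "compute, then swap in one step" cannot.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Toy model only: a map of snapshotId -> stats file path stands in for table metadata.
// None of these names are Iceberg APIs.
class FullRefreshSketch {

  // Registered partition stats files, keyed by snapshot ID.
  private final Map<Long, String> statsFiles = new HashMap<>();

  FullRefreshSketch(Map<Long, String> initial) {
    statsFiles.putAll(initial);
  }

  // Returns an immutable copy of the currently registered stats files.
  Map<Long, String> statsFiles() {
    return Map.copyOf(statsFiles);
  }

  // Two commits: the removal is visible before the compute runs.
  // If the compute throws, the old stats are already gone.
  void refreshInTwoCommits(long snapshotId, Supplier<String> compute) {
    statsFiles.clear();                  // commit 1: unregister existing stats
    String newFile = compute.get();      // may throw
    statsFiles.put(snapshotId, newFile); // commit 2: register the new stats
  }

  // One commit: compute first, then replace old with new in a single step.
  // If the compute throws, the registered stats are untouched.
  void refreshInOneCommit(long snapshotId, Supplier<String> compute) {
    String newFile = compute.get();      // may throw, nothing has changed yet
    statsFiles.clear();
    statsFiles.put(snapshotId, newFile); // single visible change
  }

  public static void main(String[] args) {
    Supplier<String> failing = () -> {
      throw new IllegalStateException("compute failed");
    };

    FullRefreshSketch twoCommits = new FullRefreshSketch(Map.of(1L, "stats-1.parquet"));
    try {
      twoCommits.refreshInTwoCommits(2L, failing);
    } catch (IllegalStateException e) {
      // expected in this demo
    }
    System.out.println("two commits, after failure: " + twoCommits.statsFiles());

    FullRefreshSketch oneCommit = new FullRefreshSketch(Map.of(1L, "stats-1.parquet"));
    try {
      oneCommit.refreshInOneCommit(2L, failing);
    } catch (IllegalStateException e) {
      // expected in this demo
    }
    System.out.println("one commit, after failure: " + oneCommit.statsFiles());
  }
}
```

In this model the two-commit variant ends with no stats registered after a failed compute, while the one-commit variant keeps the original file. That is the reason given above for supporting a full refresh in the underlying API rather than in the procedure.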

szehon-ho

Looks good, some nit comments about the comment and the procedure description.

If we make the comment clearer and more formal, we can reuse that wording in the doc.

szehon-ho · 2025-07-11T00:28:52Z

.../spark/src/main/java/org/apache/iceberg/spark/procedures/ComputePartitionStatsProcedure.java

+
+/**
+ * A procedure that computes the stats incrementally after the snapshot that has partition stats
+ * file till the given snapshot (uses current snapshot if not specified) and writes the combined


Nit: till => until (formal)

szehon-ho · 2025-07-11T00:30:51Z

.../spark/src/main/java/org/apache/iceberg/spark/procedures/ComputePartitionStatsProcedure.java

+import org.apache.spark.unsafe.types.UTF8String;
+
+/**
+ * A procedure that computes the stats incrementally after the snapshot that has partition stats


Nit:

A procedure that computes partition stats incrementally from the last snapshot that has partition stats ..

It would be clearer. The last snapshot with partition stats is the starting point, correct?

szehon-ho · 2025-07-11T00:32:29Z

.../spark/src/main/java/org/apache/iceberg/spark/procedures/ComputePartitionStatsProcedure.java

+/**
+ * A procedure that computes the stats incrementally after the snapshot that has partition stats
+ * file till the given snapshot (uses current snapshot if not specified) and writes the combined
+ * result into a {@link PartitionStatisticsFile} after merging the stats for a given snapshot. Does


'merging the stats for a given snapshot' is confusing. It makes me think that the 'given snapshot' already has stats, but earlier we imply it doesn't. How about just 'merging the partition stats'?
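
The behaviour the javadoc is trying to describe (start from the last snapshot that has partition stats, merge the changes up to the target snapshot, and fall back to a full compute when the previous stats cannot be read) can be sketched with a toy model. Everything below is an assumption made for illustration: `IncrementalStatsSketch`, `Snapshot`, and `compute` are hypothetical names, stats are reduced to a record count per partition, and an empty `Optional` stands in for a registered but unreadable stats file. It is not the action's actual implementation.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.TreeMap;

// Toy model only; none of these names are Iceberg APIs.
class IncrementalStatsSketch {

  // Hypothetical snapshot: an ID plus the record-count change it made per partition.
  static final class Snapshot {
    final long id;
    final Map<String, Long> recordDelta;

    Snapshot(long id, Map<String, Long> recordDelta) {
      this.id = id;
      this.recordDelta = recordDelta;
    }
  }

  // history is ordered oldest to newest and must contain targetId.
  // statsBySnapshot: key absent = no stats file registered for that snapshot;
  // Optional.empty() = a file is registered but cannot be read (for example, corrupted).
  static Map<String, Long> compute(
      List<Snapshot> history,
      long targetId,
      Map<Long, Optional<Map<String, Long>>> statsBySnapshot) {

    int target = -1;
    for (int i = 0; i < history.size(); i++) {
      if (history.get(i).id == targetId) {
        target = i;
      }
    }
    if (target < 0) {
      throw new IllegalArgumentException("Unknown snapshot: " + targetId);
    }

    final Map<String, Long> merged = new TreeMap<>();
    int start = 0; // default: full compute from the first snapshot

    // Find the last snapshot at or before the target that has a stats file registered.
    for (int i = target; i >= 0; i--) {
      Optional<Map<String, Long>> previous = statsBySnapshot.get(history.get(i).id);
      if (previous != null) {
        if (previous.isPresent()) {
          merged.putAll(previous.get()); // incremental: reuse the previous stats
          start = i + 1;
        }
        // Unreadable file: keep start = 0, which is the full-compute fallback.
        break;
      }
    }

    // Merge the changes of every snapshot after the starting point, up to the target.
    for (int i = start; i <= target; i++) {
      history.get(i).recordDelta.forEach(
          (partition, delta) -> merged.merge(partition, delta, Long::sum));
    }
    return merged;
  }

  public static void main(String[] args) {
    List<Snapshot> history = List.of(
        new Snapshot(1L, Map.of("a", 10L)),
        new Snapshot(2L, Map.of("a", 5L, "b", 7L)),
        new Snapshot(3L, Map.of("b", 3L)));

    Map<Long, Optional<Map<String, Long>>> readable = Map.of(1L, Optional.of(Map.of("a", 10L)));
    Map<Long, Optional<Map<String, Long>>> corrupted = Map.of(1L, Optional.empty());

    System.out.println("incremental: " + compute(history, 3L, readable));
    System.out.println("fallback:    " + compute(history, 3L, corrupted));
  }
}
```

Both calls in `main` arrive at the same counts, which is the point of the fallback: a corrupted stats file changes how much work is done, not the result.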

szehon-ho · 2025-07-11T02:00:56Z

Btw, we usually start with Spark 4.0. Do you have a PR for it?

ajantha-bhat · 2025-07-11T02:13:46Z

> btw, we usually start with spark 4.0. do you have pr for it?

I know. We started this work in March: #12451
Spark 4 was not available at that time, so this is the same work revived.
Since the PR is approved, I will open a PR to port it to Spark 4 today.

Also, the build has passed now. We can merge this.

ajantha-bhat · 2025-07-11T05:43:59Z

PR for spark-4.0: #13523

Spark-3.5: Add procedure to compute partition stats

bbb2b14

github-actions bot added the spark label Jul 7, 2025

ajantha-bhat mentioned this pull request Jul 7, 2025

Spark-3.5: Add procedure to compute partition stats #12451

Closed

ajantha-bhat requested review from amogh-jahagirdar and nastra July 7, 2025 09:42

ajantha-bhat mentioned this pull request Jul 7, 2025

Partition stats task tracker #8450

Closed

13 tasks

nastra approved these changes Jul 7, 2025


ajantha-bhat requested review from RussellSpitzer and szehon-ho July 8, 2025 11:11

hussein-awala approved these changes Jul 8, 2025


szehon-ho reviewed Jul 9, 2025


ajantha-bhat requested a review from pvary July 9, 2025 09:08

remove full_refresh option

9e44b17

ajantha-bhat force-pushed the procedure branch from 40f907a to 9e44b17 Compare July 10, 2025 13:26

szehon-ho reviewed Jul 11, 2025


Reword javadoc

5bfc352

szehon-ho approved these changes Jul 11, 2025


ajantha-bhat mentioned this pull request Jul 11, 2025

Spark 4.0: Add procedure to compute partition stats #13523

Merged

nastra approved these changes Jul 11, 2025


nastra merged commit e9da855 into apache:main Jul 11, 2025
27 checks passed

slfan1989 mentioned this pull request Sep 9, 2025

Spark 3.4: Backport: Add procedure and action to compute partition stats. #14034

Merged
