Conversation

@ajantha-bhat (Member) commented Mar 4, 2025

This is a Spark wrapper for PartitionStatsHandler.computeAndWriteStatsFile(). It is a foundation for Spark users to compute partition stats with just SQL (a call procedure).

Even though Spark may not have plans to use it for CBO at the moment [1], partition stats can be useful for Spark users to:
a) list all the partitions of the table without reading all the manifest files;
b) figure out which partitions are affected by the latest snapshot, so users can refresh their pipelines.

[1] IMO, Spark CBO is not mature yet. It doesn't have an interface for partition-level statistics; the only exposed interface keeps table-level statistics, as can be seen here:
https://github.com/apache/iceberg/blob/6c98a9e470d44f77e1939a1a333bf197a57949a6/sp[…]rk/src/main/java/org/apache/iceberg/spark/source/SparkScan.java
In the future, we may need to add an interface in Spark and integrate it with Iceberg.
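
As a concrete illustration, here is a minimal sketch of how the action might be invoked from Spark. This is a guess at the surface, assuming a computePartitionStats(table) entry point on SparkActions (by analogy with the existing computeTableStats) and a Result#statisticsFile() accessor; the merged API may differ.

```java
import org.apache.iceberg.PartitionStatisticsFile;
import org.apache.iceberg.Table;
import org.apache.iceberg.actions.ComputePartitionStats;
import org.apache.iceberg.spark.Spark3Util;
import org.apache.iceberg.spark.actions.SparkActions;
import org.apache.spark.sql.SparkSession;

class PartitionStatsExample {
  // Hypothetical usage; the entry-point name is assumed, not confirmed by this PR.
  void computeStats(SparkSession spark) throws Exception {
    Table table = Spark3Util.loadIcebergTable(spark, "db.tbl");
    ComputePartitionStats.Result result =
        SparkActions.get(spark)
            .computePartitionStats(table) // assumed entry point
            .execute();
    // The stats file registered for the snapshot (may be null for an empty table).
    PartitionStatisticsFile statsFile = result.statisticsFile();
  }
}
```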

@ajantha-bhat (Member Author)

@huaxingao, @karuppayya, @flyrain, @aokolnychyi, @deniskuzZ, @gszadovszky, @pvary: Tagging people who might be interested in this feature and its review.

I will add a call procedure on top of it once this PR is merged.

LOG.info("Computing partition stats for {} (snapshot {})", table.name(), snapshot.snapshotId());
PartitionStatisticsFile statisticsFile;
try {
statisticsFile = PartitionStatsHandler.computeAndWriteStatsFile(table, snapshot.snapshotId());
@ajantha-bhat (Member Author)

We're using the local algorithm instead of the distributed algorithm: we did a POC and JMH benchmarks and found that the serialization cost was high for the distributed approach. Hence, we went with the local algorithm.

more info:
#9437 (comment)

and #9437 (comment)

@ajantha-bhat (Member Author)

Moving out of 1.9.0, since no one reviewed this :(

@github-actions

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions.

@github-actions github-actions bot added the stale label Apr 19, 2025
@github-actions

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

@ajantha-bhat (Member Author) commented May 14, 2025

@huaxingao, @pvary, @szehon-ho, @amogh-jahagirdar, @nastra, @aokolnychyi :
Can I please get a review on this?

This is a Spark wrapper for PartitionStatsHandler.computeAndWriteStatsFile(). It is a foundation for Spark users to compute partition stats with just SQL (a call procedure).

Even though Spark may not have plans to use it for CBO at the moment, partition stats can be useful for Spark users to:
a) list all the partitions of the table without reading all the manifest files (sometimes, reading thousands of manifests takes a lot of time);
b) figure out which partitions are affected by the latest snapshot, so users can refresh their pipeline jobs.

We have some users asking for this in the Iceberg Slack workspace, and I will add a call procedure on top of it once this is merged.

@ajantha-bhat (Member Author) commented May 15, 2025

Just rebased, as the Spark 4.0 PR was merged. I will backport the changes to Spark 4.0 and Spark 3.4 once this is merged.

@pvary (Contributor) commented May 15, 2025

@ajantha-bhat: I can only give you a soft +1. I am not an expert on the Spark code yet. Maybe someone else with more Spark expertise could review this.

Thanks for all the work you put into the partition stats!

public interface ComputePartitionStats
extends Action<ComputePartitionStats, ComputePartitionStats.Result> {
/**
* Choose the table snapshot to compute partition stats, by default the current snapshot is used.
(Contributor)

The "by default the current snapshot is used" part is rather confusing IMO. The API requires the caller to pass the snapshot ID, so I don't see where the default would be applied here.

@ajantha-bhat (Member Author)

The design is similar to the existing table stats (Puffin). I followed the same javadoc as table stats:

ComputeTableStats snapshot(long snapshotId);

The implementation uses table.currentSnapshot() in the constructor. Do you think we should remove it from the javadoc here?
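
For illustration, a minimal sketch of that constructor default, with field names assumed from the snippets quoted in this review (mirroring ComputeTableStatsSparkAction rather than quoting the merged code):

```java
// Sketch (assumed shape): the snapshot is seeded from the table in the
// constructor and is only replaced if snapshot(long) is called explicitly.
ComputePartitionStatsSparkAction(SparkSession spark, Table table) {
  super(spark);
  this.table = table;
  this.snapshot = table.currentSnapshot(); // the "default" the javadoc refers to
}
```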

(Contributor)

to me this rather feels like something that should be in the javadoc of the interface itself and not at the method level that actually takes a snapshotId as a parameter

public class TestComputePartitionStatsAction extends CatalogTestBase {

@TestTemplate
public void testEmptyTable() {
(Contributor)

Since we're adding new tests, I don't think we need the test prefix everywhere; it's a legacy thing.

@ajantha-bhat (Member Author) May 15, 2025

Removed the test prefix from all the test case names.
Should we document this in CONTRIBUTING.md? I didn't know about it.

@xxubai (Contributor) left a comment

LGTM. Just a couple of small notes for you.


@Override
public ComputePartitionStats snapshot(long newSnapshotId) {
Snapshot newSnapshot = table.snapshot(newSnapshotId);
(Contributor)

Before fetching the table metadata, we can check whether the new snapshot ID is equal to the initial snapshot ID

@ajantha-bhat (Member Author)

Can you elaborate? What do you mean by initial snapshot ID?

Users can pass any snapshot ID, and stats will be computed and committed for that particular snapshot. If the user has not set a snapshot, it uses table.currentSnapshot() from the constructor.

This design is the same as the existing ComputeTableStatsSparkAction.

(Contributor)

The 'initial snapshot' I referred to is obtained when constructing the ComputePartitionStatsSparkAction class:

this.snapshot = table.currentSnapshot();

When the newly passed snapshot is identical to this initial one, we can avoid fetching the Iceberg metadata file - a small optimization :)
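
A sketch of the suggested short-circuit; the field names are assumed from snippets quoted in this review, not copied from the merged code:

```java
@Override
public ComputePartitionStats snapshot(long newSnapshotId) {
  // Suggested micro-optimization: skip the metadata lookup when the requested
  // snapshot is already the one captured in the constructor.
  if (snapshot != null && snapshot.snapshotId() == newSnapshotId) {
    return this;
  }

  Snapshot newSnapshot = table.snapshot(newSnapshotId);
  Preconditions.checkArgument(
      newSnapshot != null, "Snapshot not found: %s", newSnapshotId);
  this.snapshot = newSnapshot;
  return this;
}
```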

}

private Result doExecute() {
LOG.info("Computing partition stats for {} (snapshot {})", table.name(), snapshot.snapshotId());
(Contributor)

nit

Suggested change
LOG.info("Computing partition stats for {} (snapshot {})", table.name(), snapshot.snapshotId());
LOG.info("Computing partition stats for {} (snapshot={})", table.name(), snapshot.snapshotId());

(Contributor)

I think I'd prefer it without = and we can probably avoid the nested parentheses in a log statement.

"Computing partition stats for table {} snapshot {}" is sufficient info IMO.

@karuppayya (Contributor)

I will take a look today, @ajantha-bhat.

@karuppayya (Contributor) left a comment

Thanks @ajantha-bhat for the change; left some comments.

return EMPTY_RESULT;
}

table.updatePartitionStatistics().setPartitionStatistics(statisticsFile).commit();
(Contributor)

If the statistics for a particular snapshot were already computed and the user triggers the action a second time, what is the behavior?

  1. Does the action recompute, or does it return the existing statistics file?
  2. If it returns the existing file from the first computation, how does the user know whether the stats in the result are from a prior computation? Do we need a field in the result to denote this, like CREATED (when computed fresh) and EXISTING (when it returns an already computed one)?
  3. Related to 2: if we are returning an existing stats file, do we need a way to force the computation?

@ajantha-bhat (Member Author)

If the statistics for a particular snapshot were already computed and the user triggers the action a second time, what is the behavior?

Currently, it reads the stats file (of the same snapshot) and writes it into a new stats file without any additional compute (no manifest reading). I will change it to a no-op that returns the same file, along with an info log that a stats file already exists for the snapshot.

do we need a way to force the computation

The user can unregister the stats from the table using table.updatePartitionStatistics().removePartitionStatistics(snapshotId).commit(); this forces a recompute when the compute API is called again.
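
Spelled out, the workaround uses only existing core table APIs:

```java
// Unregister the stats file for the snapshot (existing core API); with no
// registered stats, running the compute action again recomputes from scratch.
table.updatePartitionStatistics()
    .removePartitionStatistics(snapshotId)
    .commit();
```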

(Contributor)

change it to no-op

+1. I will let you take a call on whether we need to return whether it was created or an existing file, or do it in a subsequent PR.

For a SQL user, there is no way to figure out why the command completes faster (when returning an existing file) vs. taking time (when computing a new one) without a status.

Although this might not be an issue when using the APIs, since the user can check whether a stats file is already available via the table API (org.apache.iceberg.Table#partitionStatisticsFiles).

@ajantha-bhat (Member Author)

For a SQL user, there is no way to figure out why the command completes faster (when returning an existing file) vs. taking time (when computing a new one) without a status.

Maybe we need to provide a metadata table to list the statistics files (both table and partition) in the future.

we need to return whether it was created or an existing file

I think adding a status is overkill; then users start thinking about freshness, etc. I will make it a no-op and log why it is a no-op.

@ajantha-bhat (Member Author)

Added the no-op and logs, plus a test case.

}

if (statisticsFile == null) {
return EMPTY_RESULT;
(Contributor)

What happens when you truncate a table: do we return an empty stats file or null? Also, add a test.

@ajantha-bhat (Member Author)

Do you mean truncating a whole table? Is there a quick Spark API to do that? Since Iceberg uses DELETE FROM, actual files are not deleted during truncate. Stats will still be computed with delete counters, so it will be similar to the existing delete tests.

(Contributor)

What happens after a major compaction? Is an empty stats file generated, or is no file created? It would be good to call this out.

@ajantha-bhat (Member Author) May 16, 2025

There are test cases at the API level for compaction. I didn't add the same test cases here since this is just a wrapper and the core functionality already tests this.

Compaction writes new stats, and the data file count per partition changes after compaction.

Rewrite manifests is also covered by a test.

Check the core API PR for test coverage: #12629

/**
* Choose the table snapshot to compute partition stats.
*
* @param snapshotId long ID of the snapshot for which stats need to be computed
(Contributor)

nit

Suggested change
* @param snapshotId long ID of the snapshot for which stats need to be computed
* @param snapshotId long ID of the snapshot for which stats need to be computed; by default the current snapshot is used

And remove it from the class javadoc; that way it will be in sync with org.apache.iceberg.actions.ComputeTableStats#snapshot.

@ajantha-bhat (Member Author)

It was like that before, but I got an explicit comment from @nastra to move that away from the snapshot API.

More info: #12450 (comment)

* @param snapshotId long ID of the snapshot for which stats need to be computed
* @return this for method chaining
*/
ComputePartitionStats snapshot(long snapshotId);
(Contributor)

Do we need an API to force the computation, in case the existing file is corrupted?

@ajantha-bhat (Member Author)

The user can unregister the stats from the table using table.updatePartitionStatistics().removePartitionStatistics(snapshotId).commit(); this forces a recompute when the compute API is called again.

(Contributor)

As an end user invoking a SQL procedure, I would expect this to be available in the procedure, instead of having to use the Java API to achieve it.
We could do the same (remove and compute) in the Spark action when forced (and eventually add it to the procedure)?

@ajantha-bhat (Member Author)

The Spark action is similar to the API, so users can do it themselves (unregister and call again).

For the call procedure, I will add an option to force refresh.

(Contributor)

I think the action is the delegate for the procedure: the procedure does input validation and converts the result (from the action's execution) to an internal row. The business logic should remain in the action. In the procedure, we would set the boolean to force refresh on the action, and the action decides how to use it.

@ajantha-bhat (Member Author)

I think the action is the delegate for the procedure

I don't think that is a strict guideline. Procedures are just a way to provide functionality from SQL; they can combine one or more APIs.

Added the force compute. I agree that a SQL user needs a way to clear stats for a force compute in case of corruption (even though it is very rare) without depending on the APIs.

I was neutral about adding it to the Spark action, as users of the Spark action already have API access. But I want to move forward with this work, so I have added it.

(Contributor)

Sorry to go back and forth on this. Since the probability of the statistics file going corrupt is low, handling it in the action introduces additional complexity (making more commits), and there is a workaround of removing the stats file using the table API, feel free to not add this to this change.


private void createPartitionTable() {
sql(
"CREATE TABLE %s (c1 int, c2 string, c3 string) USING iceberg PARTITIONED BY (c2, c3) TBLPROPERTIES('write.delete.mode'='merge-on-read')",
(Contributor)

TBLPROPERTIES('write.delete.mode'='merge-on-read')",

When I removed this line, partitionStatsComputeOnLatestSnapshot started failing. Why is that? I don't see any validations for this in the action.

@ajantha-bhat (Member Author)

There are validations: there are counters for positional delete file stats that are created only from merge-on-read deletes, here:

https://github.com/apache/iceberg/pull/12450/files/368c125f64811b2be17159d292b803cf7ce3c231#diff-3084a2178209406a462c6e2240155cb8e8532b1d56749e9c190be16b5c0fa339R133
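
To make the dependency concrete, here is a hedged sketch (not a test from this PR) of why the table property matters, assuming the sql() helper and tableName field available in CatalogTestBase-style tests: only merge-on-read DELETEs produce positional delete files, so only they populate the positional-delete counters the test asserts on.

```java
sql(
    "CREATE TABLE %s (c1 int, c2 string, c3 string) USING iceberg "
        + "PARTITIONED BY (c2, c3) "
        + "TBLPROPERTIES('write.delete.mode'='merge-on-read')",
    tableName);
sql("INSERT INTO %s VALUES (1, 'a', 'x'), (2, 'b', 'y')", tableName);
// Writes a positional delete file; with copy-on-write the data file would be
// rewritten instead and the positional-delete counters would stay zero.
sql("DELETE FROM %s WHERE c1 = 1", tableName);
```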

(Contributor)

Thanks for clarifying. Do we need tests for v1?

@ajantha-bhat (Member Author)

It doesn't really need v1-specific tests. I changed one test case to use a v1 table now, as you asked.

@ajantha-bhat (Member Author) commented May 16, 2025

Thanks @karuppayya for the review. I have replied to the comments and addressed the ones that needed addressing.

Force refresh will be provided for the SQL call procedure, as users may not want to use the API in that case. For the Spark action, users can use the Java APIs to force a refresh by unregistering the stats.

if (statisticsFile.snapshotId() == snapshotId) {
// no-op
LOG.info(
"Snapshot {} already has statistics file computed. Returning the same file as result.",
(Contributor)

"Using existing statistics file for snapshot {}" probably suffices.

@ajantha-bhat (Member Author)

updated.


Comment on lines +27 to +28
public interface ComputePartitionStats
extends Action<ComputePartitionStats, ComputePartitionStats.Result> {
@amogh-jahagirdar (Contributor) May 16, 2025

I'm late to this PR, so my apologies if this has already been discussed, but have we considered just adding the ability to compute partition stats to the existing ComputeTableStats action? I'm a bit wary of adding too many actions, especially in this case where an external Spark user has to know that there are two separate actions for "stats" in general.

In my head, something like "compute table stats" plus a partition stats option API on it seems better.

I know the actual stats files are of course different, but I'm mainly thinking from an action and API perspective: not exposing too many different concepts. The action implementation can hide the complexity of the "how" for both types of stats. The result type of the existing procedure would be both types of files. It may not be a big deal for sophisticated users who know when to use which, but I think many people will just want to generate stats, and then the sophisticated users can drill down into which ones they want, when to run it, etc.

Of course, if we do add flags to the existing procedure, we have to think through the default behavior when not specified (preserving compatibility) and what additional modes we have.

Thoughts @ajantha-bhat @karuppayya @nastra @aokolnychyi @rdblue ?

@ajantha-bhat (Member Author)

I did analyze that while adding this Spark action.

First, the name ComputeTableStatsProcedure clearly says it is table-level statistics; plugging partition-level statistics in there didn't seem logical. It also takes column arguments to specify which columns need table-level stats, which is currently not applicable to partition stats.

Partition stats support incremental compute and may need an option like force refresh in the call procedure. So the syntax and input arguments differ with respect to the functionality (one needs column names, the other a boolean).

Lastly, as you said, they are independent stats, so I feel having an action to compute each independently is better.
Spark CBO is only integrated with table-level stats and doesn't have an interface to use partition-level stats. This action is mainly for the use cases mentioned in the PR description.

(Contributor)

Thanks @amogh-jahagirdar for bringing this up.
Though we don't need to follow other engines, I can see examples where ANALYZE TABLE is used for both table-level and partition-level stats:

  1. Hive
  2. Spark
  3. Trino

I think it would be good to follow a similar pattern for Iceberg actions and procedures, so that users don't have to think about gathering Iceberg stats differently than in other engines. But as highlighted, this comes with the complexity of adding conditional code.

@ajantha-bhat, I think the force refresh is something that would be useful for the current table stats action as well.
Also, eventually we would need to support incremental stats with the current table stats.

@ajantha-bhat (Member Author)

@karuppayya, if you look at Spark, Hive, and Trino, they have unified stats in one compute.

In Iceberg, table statistics compute just NDV. There is no column-level stats aggregate; you need to read manifests for that. Also, partition stats should eventually include column-level stats, so later we can aggregate them into table-level stats. @deniskuzZ has a proposal for that.

Currently, table stats are not complete, and partition stats are also very different, so I prefer keeping them separate.
In 2.0, if we unify them, we can have a single procedure and deprecate these.

(Contributor)

I am not sure completeness of functionality should determine this.
I am mainly coming from the end-user perspective of having to use different actions for stats, whereas it is different in most other engines/formats.
But I don't have a strong opinion on this; I am fine either way.
It would be good to get @aokolnychyi's thoughts here, since he raised it earlier.

@amogh-jahagirdar (Contributor) Jul 2, 2025

@ajantha-bhat

Did we think about what happens if one of the computes failed? If table stats compute failed but partition stats compute succeeded, and vice versa.

Yeah, my take is that stats computation should be best effort; so let's say someone configures both table and partition stats to run, and whichever one runs first fails, we just move on to the next. Of course, at the end, if there's any kind of failure we should clearly indicate that to the user, but I think the partial-failure case you're referencing can be handled by treating all stats generation as best effort in the implementation.

We have to add a lot of validations, like what if the user configures columns and calls withOnlyPartitionStats

I guess I'm not understanding why the only additional validation would not just be Preconditions.checkArgument(columns.isEmpty()) in withOnlyPartitionStats? Are there any validations beyond this?

@RussellSpitzer
My breakdown of the pros/cons; @ajantha-bhat, feel free to add anything I'm missing.

Combined action with different options:

Pros:

  1. I think this better aligns with what exists in other engines (Hive, Spark, Trino), where column stats and partition stats are effectively collected through the same SQL. I don't really see a distinction here between the procedure and the action; I feel we should generally try to keep the action API surface area reduced, just as we do with the procedure. It's just that the procedure is SQL-oriented instead of programmatic.
  2. In the long run, many of the options that apply to partition stats will also apply to normal table stats; e.g., we will probably also want a force refresh and incremental computation there. If we know many of the same options will apply at some point, it seems better to just coalesce the API here.

Cons:

  1. There's a bit of complexity in validating multiple options on the API that won't work together (like partitionStatsOnly and specifying columns). I'm doubtful how complex this really is; it just seems like a simple preconditions check, but I could be missing something.

Separate actions:

Pros:
The main argument I can think of for separate procedures is that for people familiar with the details of the statistics and their granularities, having separate procedures is clearer, since at least as of today they are different.

Cons:
The main argument I have against separate procedures is that if the options converge in the long run, it seems compelling to keep them combined (going back to keeping the API surface area as minimal as possible). I think it'd be better to do this from the start rather than separating and then undoing it later.

In the end, if folks feel strongly about keeping it separate, I think I'm OK; I just want to make sure we're being thoughtful about the implications.

@RussellSpitzer (Member)

I think the argument I feel strongest about here is the mixing of options that are invalid, which we can only determine at runtime. IMHO, we should aim to have APIs with as few gotchas as possible. If the API can't hide the fact that there are two very different things happening under the hood, I don't think they should be combined.

So I think we could keep them in the same action, but only if the APIs are a little better segregated.

Perhaps instead of "columns" you have something like .withTableLevelNDVs("columns").withPartitionStats().

These are breaking changes, I know, but they make clear that there are two different, mutually exclusive things happening.

@ajantha-bhat (Member Author)

Thanks @RussellSpitzer.

I don't want to introduce breaking changes right now. Plus, partition stats can evolve; they could even have column-level stats per partition in the future. So once everything is finalized, we can have a unified API if required and deprecate the existing ones. Until then, I think it's good to keep separate APIs.

I agree on:

If the API can't hide the fact that there are two very different things happening under the hood, I don't think they should be combined.

@amogh-jahagirdar: Thoughts?

@amogh-jahagirdar (Contributor)

If the API can't hide the fact that there are two very different things happening under the hood, I don't think they should be combined.

I definitely agree with this statement, but I do think it should be quite possible to hide the fact that there are two different things happening under the hood.

I think I'm generally OK (+0.5) with moving ahead as it is laid out here, and I also want to avoid breaking changes. @RussellSpitzer @ajantha-bhat, if we want to move forward with this PR, feel free!

@ajantha-bhat (Member Author)

Thanks @amogh-jahagirdar and @RussellSpitzer for the discussion and helping to conclude this. I understand that we have a chance to unify this in the future, once partition stats mature, by providing a new unified action (to avoid breaking changes).

For now, please approve the PR if you are OK with it; we will merge it today. I just rebased it.

@ajantha-bhat (Member Author)

@karuppayya and @amogh-jahagirdar: Any more comments, or do we need more discussion? I would like to move forward with this work.

@karuppayya (Contributor) left a comment

Left some comments
(for some reason I am not able to resolve conversations)




"Clearing the existing partition statistics for all snapshots to enforce force refresh");
UpdatePartitionStatistics clearStats = table.updatePartitionStatistics();
table
.partitionStatisticsFiles()
(Contributor)

Should we remove the statistics file for the given snapshot and initiate a fresh (instead of incremental) statistics collection for that snapshot?
Do we also need to clear the stats of other snapshots?

@ajantha-bhat (Member Author)

We don't have an underlying API to do a full compute, to avoid too many public interfaces.

If we clear the stats, the same API does a full compute. We need to clear all the stats because it does an incremental compute whenever previous stats exist.

Also, the chance of needing a full refresh is very rare (only the corruption scenario); hence, there is no separate API support in the core module.

More details: #12629 (comment)
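
Combined with the snippet quoted above, a sketch of the clear-and-recompute flow under those constraints. Note it makes two commits (one for the removals, one to register the fresh file), which the javadoc discussion below also calls out; the flow here is an assumed shape, not the merged code:

```java
// Clear every registered partition stats file so the next compute is full,
// not incremental (incremental kicks in whenever previous stats exist).
UpdatePartitionStatistics clearStats = table.updatePartitionStatistics();
table.partitionStatisticsFiles()
    .forEach(file -> clearStats.removePartitionStatistics(file.snapshotId()));
clearStats.commit(); // commit 1: removals

// Full recompute for the current snapshot, then register the fresh file.
PartitionStatisticsFile fresh =
    PartitionStatsHandler.computeAndWriteStatsFile(
        table, table.currentSnapshot().snapshotId());
table.updatePartitionStatistics().setPartitionStatistics(fresh).commit(); // commit 2
```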

ComputePartitionStats snapshot(long snapshotId);

/**
* Clears existing partition statistics for all snapshots, allowing the action to recompute them
(Contributor)

Do we want to highlight that this will create more than one commit (one for the removal and one for the update)?

@ajantha-bhat (Member Author)

added

@ajantha-bhat ajantha-bhat force-pushed the stats_action branch 2 times, most recently from 13421bd to f8e5cdd Compare May 21, 2025 17:17
@ajantha-bhat (Member Author) commented May 29, 2025

@amogh-jahagirdar: Maybe you can discuss offline with your team and let me know the conclusion here, as no one responded to #12450 (comment). It has been more than two weeks!

I still prefer separate actions for the reasons mentioned in that thread. In the future we can unify them if both stats become similar; they are not similar right now.

I would like to conclude this and get it into 1.10.0; I don't want to miss the release train.

@ajantha-bhat (Member Author)

Today another Spark user asked for this feature: #13233

Last month, a user in Slack also wanted this feature in Spark: https://apache-iceberg.slack.com/archives/C03LG1D563F/p1745253837302079

cc: @amogh-jahagirdar

@ajantha-bhat (Member Author)

We are good to go on this PR.
Details: #12450 (comment)

I just rebased it.

@amogh-jahagirdar (Contributor) left a comment

@amogh-jahagirdar amogh-jahagirdar merged commit 4b68527 into apache:main Jul 3, 2025
43 checks passed
@ajantha-bhat (Member Author)

Thanks everyone for the review. I will have a PR to port this to Spark 4.0 soon.
