Conversation

@ajantha-bhat (Member) commented Jul 7, 2025

This is a wrapper on the Spark action PR that was merged: #12450

Just a

CALL catalog_name.system.compute_partition_stats('db.sample');

And observe the stats files registered to the table:

table.partitionStatisticsFiles()
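
For example, a minimal sketch (the catalog and table names are placeholders, and Spark3Util.loadIcebergTable is just one convenient way to obtain the Iceberg Table handle):

import org.apache.iceberg.PartitionStatisticsFile;
import org.apache.iceberg.Table;
import org.apache.iceberg.spark.Spark3Util;
import org.apache.spark.sql.SparkSession;

public class ComputePartitionStatsExample {
  public static void main(String[] args) throws Exception {
    // Assumes a Spark session already configured with an Iceberg catalog named "catalog_name".
    SparkSession spark = SparkSession.builder().getOrCreate();

    // Run the procedure; it computes partition stats and registers the resulting file on the table.
    spark.sql("CALL catalog_name.system.compute_partition_stats('db.sample')");

    // Inspect the partition statistics files now registered on the table.
    Table table = Spark3Util.loadIcebergTable(spark, "catalog_name.db.sample");
    for (PartitionStatisticsFile file : table.partitionStatisticsFiles()) {
      System.out.println(file.snapshotId() + " -> " + file.path());
    }
  }
}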

Fixes: #10106

@ajantha-bhat (Member Author)

Also cc: @karuppayya

@ajantha-bhat ajantha-bhat mentioned this pull request Jul 7, 2025

@ajantha-bhat (Member Author)

Thanks @nastra and @hussein-awala for the review.

@amogh-jahagirdar or @RussellSpitzer or @szehon-ho: Would any of you also like to do a review? If not, we will go ahead with these changes.

// Full refresh: unregister every existing partition statistics file before recomputing.
table
    .partitionStatisticsFiles()
    .forEach(file -> updateStats.removePartitionStatistics(file.snapshotId()));
updateStats.commit();

Member

Are these two separate transactions? Is there no way to have it all-or-nothing (to prevent the statistics from being removed if the compute fails)?

Member Author

Yes. When we designed the original core API, we didn't provide an option for a full refresh, as it is a very rare case for a user to need it (only during corruption), and even then, when the incremental compute can't read the previous stats file it falls back to a full compute. So, if users need a full refresh for any other reason, they have to unregister the stats and call compute again.

If two separate transactions are a bad idea, I can remove this option and we can add it in the future if required (if any new use case comes up). WDYT?
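
For reference, a minimal sketch of that unregister-and-recompute workaround, which is two independent commits (assuming the Table.updatePartitionStatistics() API that the snippet above uses; the catalog and table names are placeholders):

import org.apache.iceberg.Table;
import org.apache.iceberg.UpdatePartitionStatistics;
import org.apache.spark.sql.SparkSession;

static void fullRefresh(SparkSession spark, Table table) {
  // Commit 1: unregister every existing partition statistics file.
  UpdatePartitionStatistics updateStats = table.updatePartitionStatistics();
  table.partitionStatisticsFiles()
      .forEach(file -> updateStats.removePartitionStatistics(file.snapshotId()));
  updateStats.commit();

  // Commit 2: recompute from scratch via the procedure. A failure here leaves the
  // table without partition stats, which is why the two steps are not all-or-nothing.
  spark.sql("CALL catalog_name.system.compute_partition_stats('db.sample')");
}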

Member

Hm, it would be better if the underlying action supported a full refresh? What do you think?

Yeah, maybe we can remove it from the procedure for now.

Member Author

Agree. I just removed the full refresh option now.

Since the underlying action will automatically fall back to a full compute if the stats are corrupted, we don't have a use case for a full refresh yet. If a new use case comes up in the future, we can support it from the underlying API to make it atomic, instead of using the remove-stats and compute methods.

PR is ready now.

@ajantha-bhat ajantha-bhat requested a review from pvary July 9, 2025 09:08

@szehon-ho szehon-ho left a comment


looks good, some nit comments about the comment and procedure description.

If we make the comment clearer/more formal, we can just use that wording in the doc.


/**
* A procedure that computes the stats incrementally after the snapshot that has partition stats
* file till the given snapshot (uses current snapshot if not specified) and writes the combined

Member

Nit: till => until (formal)

import org.apache.spark.unsafe.types.UTF8String;

/**
* A procedure that computes the stats incrementally after the snapshot that has partition stats

Member

Nit:

A procedure that computes partition stats incrementally from the last snapshot that has partition stats ..

It would be clearer. The last snapshot with partition stats is the starting point, correct?

/**
* A procedure that computes the stats incrementally after the snapshot that has partition stats
* file till the given snapshot (uses current snapshot if not specified) and writes the combined
* result into a {@link PartitionStatisticsFile} after merging the stats for a given snapshot. Does

Member

'merging the stats for a given snapshot' is confusing. It makes me think that the 'given snapshot' already has stats, but earlier we imply it doesn't. How about just 'merging the partition stats'?

@szehon-ho (Member)

Btw, we usually start with Spark 4.0. Do you have a PR for it?

@ajantha-bhat (Member Author)

Btw, we usually start with Spark 4.0. Do you have a PR for it?

I know. We started this work in March: #12451
Spark 4 was not available at that time; this is the same work revived.
Since the PR is approved, I will open a PR to port it to Spark 4 today.

Also, the build has passed now. We can merge this.

@ajantha-bhat (Member Author)

PR for spark-4.0: #13523

@nastra nastra merged commit e9da855 into apache:main Jul 11, 2025
27 checks passed