-
Notifications
You must be signed in to change notification settings - Fork 2.9k
Spark-3.5: Add procedure to compute partition stats #13480
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Also cc: @karuppayya |
|
Thanks @nastra and @hussein-awala for the review. @amogh-jahagirdar or @RussellSpitzer or @szehon-ho: Anyone of you also wants to do a review? If not, we will go ahead with these changes. |
| table | ||
| .partitionStatisticsFiles() | ||
| .forEach(file -> updateStats.removePartitionStatistics(file.snapshotId())); | ||
| updateStats.commit(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this two separate trasnactions? There's no way to have it all or nothing? (to prevent the statistic from being removed if the compute fail)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. When we designed the original core API, we didn't provide an option to full refresh as it is very rare case for user to use it (only during corruption), that too when incremental compute can't read the previous stats file it fallback to full compute. So, if at all for other reasons users need a full refresh, they need to unregister and call compute again.
If two separate transaction is a bad idea. I can remove this option and we can add it in the future if required (if any new use case comes up). WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hm it would be better if the underlying action supported full? what do you think?
Yea maybe we can remove it for now in procedure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree. I just removed the full refresh option now.
Since the underlaying action will fallback to full compute if stats are corrupted automatically, we don't have use case for full refresh yet. If there are new use case in the future, we can support it from underlaying API to make it atomic instead of using remove stats and compute method.
PR is ready now.
szehon-ho
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good, some nit comments about the comment and procedure description.
If we make the comment more clear/formal, we can just use that wording in the doc
|
|
||
| /** | ||
| * A procedure that computes the stats incrementally after the snapshot that has partition stats | ||
| * file till the given snapshot (uses current snapshot if not specified) and writes the combined |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: till => until (formal)
| import org.apache.spark.unsafe.types.UTF8String; | ||
|
|
||
| /** | ||
| * A procedure that computes the stats incrementally after the snapshot that has partition stats |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit:
A procedure that computes partition stats incrementally from the last snapshot that has partition stats ..
It would be more clear. Last snapshot with partition stats is the starting point , correct?
| /** | ||
| * A procedure that computes the stats incrementally after the snapshot that has partition stats | ||
| * file till the given snapshot (uses current snapshot if not specified) and writes the combined | ||
| * result into a {@link PartitionStatisticsFile} after merging the stats for a given snapshot. Does |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
'merging the stats for a given snapshot' is confusing. It makes me think that the 'given snapshot' already has stats but earlier we imply it doesnt. How about just, merging the partition stats?
|
btw, we usually start with spark 4.0. do you have pr for it? |
I know. We started this work on march: #12451 Also, build has passed now. We can merge this. |
|
PR for spark-4.0: #13523 |
Wrapper on the Spark action PR that got merged: #12450
Just a
CALL catalog_name.system.compute_partition_stats('db.sample');And observe the stats files registered to table
table.partitionStatisticsFiles()Fixes: #10106