[SPARK-40708][SQL] Auto update table statistics based on write metrics #38496
cc @wangyum
Hm, we should consider partition statistics here. If we overwrite only some of the partitions, we would get wrong table statistics.
Hi @jackylee-ch, thanks for your review. It seems we can only update stats when overwriting a non-partitioned table.
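To illustrate the partition concern with concrete numbers, here is a minimal Scala sketch. The partition names and byte counts are made up for illustration, and nothing here uses Spark's actual classes; it only shows why treating the write metrics of a partial overwrite as the whole table's size would be wrong.

```scala
object PartitionOverwriteExample {
  // Hypothetical per-partition sizes in bytes before the write.
  val before: Map[String, Long] =
    Map("dt=1" -> 100L, "dt=2" -> 200L, "dt=3" -> 300L)

  // Assume the job overwrites only partition dt=2 and reports 50 bytes written.
  val overwritten: String = "dt=2"
  val bytesWritten: Long = 50L

  // Naive update: take the write metrics as the size of the whole table.
  val naiveTableSize: Long = bytesWritten

  // Actual size: untouched partitions keep their size, dt=2 is replaced.
  val actualTableSize: Long =
    before.updated(overwritten, bytesWritten).values.sum
}
```

In this example the naive value covers only the rewritten partition, while the real table still contains the data in dt=1 and dt=3.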
Hmm... so when overwriting a non-partitioned table, if autoSizeUpdateEnabled is true, we cannot use wroteStats to update the statistics?
@LuciferYang Good idea. I updated the code to prefer the write stats when updating non-partitioned table statistics, where possible.
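The preference discussed in this thread could be sketched roughly as follows. This is not the code from the PR: the types `TableStats`, `WriteMetrics`, and `StatsAction`, and the function `decide`, are hypothetical names invented for this sketch. Only the idea (prefer the written stats for an overwrite of a non-partitioned table, otherwise fall back to the existing behaviour) comes from the conversation.

```scala
// Hypothetical, simplified model; these are NOT Spark's classes.
final case class TableStats(sizeInBytes: BigInt, rowCount: Option[BigInt])
final case class WriteMetrics(numBytes: Long, numRows: Long)

sealed trait StatsAction
// Use the metrics reported by the write job directly.
final case class UseWriteMetrics(stats: TableStats) extends StatsAction
// Recompute the table size the slow way (assumed meaning of autoSizeUpdateEnabled).
case object RecomputeSize extends StatsAction
// Leave the decision to whatever the pre-existing code path does.
case object ExistingBehaviour extends StatsAction

object StatsDecision {
  def decide(
      isPartitioned: Boolean,
      isOverwrite: Boolean,
      autoSizeUpdateEnabled: Boolean,
      wroteStats: Option[WriteMetrics]): StatsAction =
    wroteStats match {
      // Full overwrite of a non-partitioned table: the write metrics
      // describe the entire table, so they can be used as-is.
      case Some(m) if isOverwrite && !isPartitioned =>
        UseWriteMetrics(TableStats(BigInt(m.numBytes), Some(BigInt(m.numRows))))
      // Otherwise the metrics cover only part of the data.
      case _ if autoSizeUpdateEnabled => RecomputeSize
      case _ => ExistingBehaviour
    }
}
```

Checking the write metrics before `autoSizeUpdateEnabled` is what "prefer" means in this sketch: when both are available, the cheaper metrics-based update wins.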
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala (outdated, resolved)
sql/core/src/main/scala/org/apache/spark/sql/execution/command/CommandUtils.scala (outdated, resolved)
b646993 to cd3d69d
Retest this please
1f17cb7 to b0ed310
Will partition statistics be supported?
@melin I'm working on supporting partition statistics updates; it relies on workers returning detailed partition statistics.
Could the table or partition statistics be published so that users can listen for them and display them conveniently? Until now, getting these statistics has required hacky code, which is not very standard.
Hi @jackylee-ch @melin, any update?
@LuciferYang @jackylee-ch Could you help review this PR again?
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
Is it done? |

What changes were proposed in this pull request?
Update the table size and rowCount based on Spark write metrics.
Why are the changes needed?
To automatically update table stats after a write job finishes.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Added unit tests.