Conversation

@wankunde
Contributor

@wankunde wankunde commented Nov 3, 2022

What changes were proposed in this pull request?

Update the table size and rowCount based on Spark write metrics.

Why are the changes needed?

Automatically update table statistics after a write job finishes.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added unit tests.

@github-actions github-actions bot added the SQL label Nov 3, 2022
@wankunde wankunde changed the title from "[SPARK-40708] Auto update table statistics based on write metrics" to "[SPARK-40708][SQL] Auto update table statistics based on write metrics" Nov 3, 2022
@wankunde wankunde changed the title from "[SPARK-40708][SQL] Auto update table statistics based on write metrics" to "[WIP][SPARK-40708][SQL] Auto update table statistics based on write metrics" Nov 4, 2022
@LuciferYang
Contributor

cc @wangyum

Contributor

Hm, we should consider partition statistics here. If we overwrite only some of the partitions, we would get wrong table statistics.
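
To make the concern concrete, here is a minimal sketch with made-up numbers. The `Stats` class, the partition names, and the values are all hypothetical and are not Spark APIs; the sketch only assumes that write metrics describe the data written by the job, not the whole table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stats:
    size_in_bytes: int
    row_count: int


def table_total(parts):
    """Sum per-partition statistics into table-level statistics."""
    return Stats(
        sum(p.size_in_bytes for p in parts.values()),
        sum(p.row_count for p in parts.values()),
    )


# Hypothetical per-partition statistics before the write.
partitions = {
    "dt=2022-11-01": Stats(1000, 10),
    "dt=2022-11-02": Stats(2000, 20),
    "dt=2022-11-03": Stats(3000, 30),
}

# The job overwrites only dt=2022-11-02, so the write metrics
# (assumed here) cover that single partition.
write_metrics = Stats(500, 5)
after_write = {**partitions, "dt=2022-11-02": write_metrics}

# Using the write metrics directly as table statistics ignores the
# two partitions that were not touched.
naive_table_stats = write_metrics
actual_table_stats = table_total(after_write)
```

In this example the naive value covers one partition only, while the real table still contains the two untouched partitions.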

Contributor Author

Hi @jackylee-ch, thanks for your review. It seems we can only update stats when overwriting a non-partitioned table.

Contributor

Hmm... so when overwriting a non-partitioned table, if autoSizeUpdateEnabled is true, can we not use wroteStats to update the statistics?

Contributor Author

@LuciferYang Good idea. I have updated the code to prefer the written stats when updating non-partitioned table statistics, if possible.
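
The rule agreed on in this exchange can be sketched roughly as follows. This is not the code in the PR: `choose_stats_update`, `StatsUpdate`, and the fallback branches are hypothetical names and assumptions chosen for illustration; only the preference for written stats on non-partitioned overwrites comes from the discussion.

```python
from enum import Enum
from typing import Optional, Tuple


class StatsUpdate(Enum):
    USE_WRITE_METRICS = "use_write_metrics"
    RECALCULATE_SIZE = "recalculate_size"
    NO_AUTO_UPDATE = "no_auto_update"


def choose_stats_update(
    is_partitioned: bool,
    is_overwrite: bool,
    wrote_stats: Optional[Tuple[int, int]],  # (size in bytes, row count), if reported
    auto_size_update_enabled: bool,
) -> StatsUpdate:
    # Write metrics describe the whole table only when a non-partitioned
    # table is fully overwritten, so prefer them in that case.
    if is_overwrite and not is_partitioned and wrote_stats is not None:
        return StatsUpdate.USE_WRITE_METRICS
    # Assumed fallback: recalculate the size when auto size update is on.
    if auto_size_update_enabled:
        return StatsUpdate.RECALCULATE_SIZE
    return StatsUpdate.NO_AUTO_UPDATE
```

Partitioned tables never take the first branch here, which mirrors the concern raised above about partial partition overwrites.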

@wankunde wankunde changed the title from "[WIP][SPARK-40708][SQL] Auto update table statistics based on write metrics" to "[SPARK-40708][SQL] Auto update table statistics based on write metrics" Nov 9, 2022
@wankunde wankunde force-pushed the writeStats branch 3 times, most recently from b646993 to cd3d69d November 11, 2022 04:23
@wankunde
Contributor Author

Retest this please

@wankunde wankunde force-pushed the writeStats branch 2 times, most recently from 1f17cb7 to b0ed310 November 23, 2022 07:41
@wankunde wankunde requested review from LuciferYang and jackylee-ch and removed request for LuciferYang and jackylee-ch November 24, 2022 09:34
@melin

melin commented Dec 5, 2022

Support partition statistics?

@jackylee-ch
Contributor

Support partition statistics?

@melin I'm working on supporting partition statistics updates; it relies on the workers returning detailed partition statistics.

@melin

melin commented Dec 6, 2022

Support partition statistics?

@melin I'm working on supporting partition statistics updates; it relies on the workers returning detailed partition statistics.

Could you consider publishing the table or partition statistics so that users can listen for them and display them conveniently? Until now we have had to use hacky code to obtain these statistics, which is not very standard.

@wankunde
Contributor Author

Hi @jackylee-ch @melin, any update?

@wankunde
Contributor Author

@LuciferYang @jackylee-ch Could you help review this PR again?

@LuciferYang
Contributor

@wankunde could you resolve the conflicts?

@wangyum Does this feature need to be finished in Spark 3.4.0?

@github-actions

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Apr 25, 2023
@melin

melin commented Apr 25, 2023

Support partition statistics?

@melin I'm working on supporting partition statistics updates; it relies on the workers returning detailed partition statistics.

Is it done?

@jackylee-ch
Contributor

@melin I opened another PR, #39114, for this, but unfortunately it was closed because it received no feedback for a long time. If necessary, I can consider reopening it.

@melin

melin commented Apr 25, 2023

@melin I opened another PR, #39114, for this, but unfortunately it was closed because it received no feedback for a long time. If necessary, I can consider reopening it.

Please reopen it, thank you.

@github-actions github-actions bot closed this Apr 26, 2023

5 participants