Spark: Test custom metric for number of deletes applied, in code path that use streaming delete filter #5742
|
@flyrain @RussellSpitzer this is a follow-up to #4588. In that change, one code path is not tested: counting positional deletes when a streaming delete filter is used. I manually tested that code path by temporarily changing the threshold to use a streaming filter in The logic behind this change is as follows: |
- super(filePath, deletes, table.schema(), expectedSchema, counter);
+ super(
+     filePath,
+     deletes,
+     table.schema(),
+     expectedSchema,
+     DeleteFilter.DEFAULT_STREAM_FILTER_THRESHOLD,
+     counter);
This is the only change I make to spark/v3.2 here, which is needed because the superclass, DeleteFilter, now takes the additional parameter.
Once this PR is merged, I can port the spark/v3.3 changes to spark/v3.2.
|
Hi @wypoon, thanks for the PR. I don't see a strong reason to expose the threshold to users. Instead, it's better to hide it from users. Here are reasons:
What do you think? |
I don't have a strong opinion on whether to expose this threshold to the user. We do expose various optimizations to the user, with sensible defaults, so users who are not interested or have no need to tune them do not need to. So even though this particular setting may not be of interest to most users, I don't see much harm in it. My main interest, though, is in allowing a way to set this threshold easily for testing the code path I mention. If you have a good suggestion for another way to set the threshold, I'm happy to consider it. A hacky way would be to allow the threshold to be set in |
|
@wypoon, we can put the new test case in the class like |
@flyrain I don't think |
This enables us to set the threshold to a low number (2), to exercise the streaming filter code path when counting the number of positional deletes applied.
21e1280 to
41490b3
Compare
|
I have enabled setting the threshold via a system property, and I have updated the PR title and description. |
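For readers following along, a threshold that can be overridden via a system property could look roughly like the sketch below. This is only an illustration under stated assumptions: the property name `example.delete.stream-filter-threshold`, the class name, and the default value are hypothetical and are not taken from this PR.

```java
// Minimal sketch: resolve a threshold from a JVM system property, with a fallback default.
// The property name and the default value are hypothetical placeholders.
final class StreamFilterThresholdConfig {
  static final String PROPERTY = "example.delete.stream-filter-threshold";
  static final long DEFAULT_THRESHOLD = 100_000L;

  private StreamFilterThresholdConfig() {}

  // Returns the configured threshold, or the default when the property is
  // unset, not a number, or not positive.
  static long resolve() {
    String raw = System.getProperty(PROPERTY);
    if (raw == null) {
      return DEFAULT_THRESHOLD;
    }
    try {
      long value = Long.parseLong(raw.trim());
      return value > 0 ? value : DEFAULT_THRESHOLD;
    } catch (NumberFormatException e) {
      return DEFAULT_THRESHOLD;
    }
  }
}
```

With something like this, a test could call `System.setProperty(StreamFilterThresholdConfig.PROPERTY, "2")` before the code under test resolves the threshold, and clear the property afterwards so other tests see the default.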
|
The change in this PR is now quite small. I can add the spark/v3.2 change here as well, or in a separate follow-up PR if preferred. |
  new Schema(MetadataColumns.DELETE_FILE_PATH, MetadataColumns.DELETE_FILE_POS);

- private final long setFilterThreshold;
+ private final long streamFilterThreshold;
I think calling this the streamFilterThreshold is more appropriate, since it is the threshold at which we use the streaming delete filter.
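To make the role of the threshold concrete, here is a small sketch of how a filter strategy might be chosen from it. The class, enum, and method names are hypothetical, and the direction of the comparison (streaming once the delete count reaches the threshold) is an assumption made for illustration, not a statement of what `DeleteFilter` does.

```java
// Minimal sketch: pick a delete-filter strategy from the number of position deletes.
// Names are hypothetical; the comparison direction is an assumption.
final class DeleteFilterChoice {
  enum Strategy { IN_MEMORY_SET, STREAMING }

  private DeleteFilterChoice() {}

  // Below the threshold, load delete positions into an in-memory set;
  // at or above it, stream the deletes instead.
  static Strategy choose(long totalPositionDeletes, long streamFilterThreshold) {
    return totalPositionDeletes < streamFilterThreshold
        ? Strategy.IN_MEMORY_SET
        : Strategy.STREAMING;
  }
}
```

Under these assumptions, a threshold of 2 means that any test data with at least two position deletes would take the streaming path.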
|
Hi @wypoon, sorry, I may not have been clear in my last comment. Let me explain a bit more.
|
|
The code paths that go through the streaming filters -- are called in In order to avoid double-counting, it checks if the row has already been marked deleted. This happens to come into play in |
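The double-counting guard described above can be illustrated with a small, self-contained sketch. It is not the Iceberg implementation: `Row`, `Counter`, and `markDeleted` are hypothetical stand-ins, and the sketch assumes that both the rows and the delete positions arrive sorted in ascending order.

```java
import java.util.Iterator;
import java.util.List;

// Minimal sketch: stream sorted delete positions over sorted rows, marking rows
// deleted and counting each row at most once. All types here are hypothetical.
final class DeleteCountingSketch {
  static final class Row {
    final long pos;
    boolean deleted;

    Row(long pos) {
      this.pos = pos;
    }
  }

  static final class Counter {
    long value;

    void increment() {
      value++;
    }
  }

  private DeleteCountingSketch() {}

  static void markDeleted(List<Row> rows, Iterator<Long> sortedDeletes, Counter counter) {
    Long current = sortedDeletes.hasNext() ? sortedDeletes.next() : null;
    for (Row row : rows) {
      // Skip delete positions that fall before this row.
      while (current != null && current < row.pos) {
        current = sortedDeletes.hasNext() ? sortedDeletes.next() : null;
      }
      if (current == null) {
        break; // no deletes left
      }
      // Count only if the row was not already marked by an earlier pass.
      if (current == row.pos && !row.deleted) {
        row.deleted = true;
        counter.increment();
      }
    }
  }
}
```

Running `markDeleted` twice over the same rows with overlapping delete positions increments the counter only for rows that were not already marked, which is the behavior the check is meant to guarantee.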
|
Even supposing we determined through careful tracing of the code paths that it is sufficient to test calling |
|
This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions. |
|
This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time. |
We enable setting the threshold for using a streaming delete filter to a low number (2), in order to exercise the streaming filter code path when counting the number of positional deletes applied. This is a follow-up to #4588.