add garbage_collect_dictionary to arrow-select
#7716
arrow-select/src/dictionary.rs
| // Create a new values array with the masked values | ||
| let values = filter(dictionary.values(), &BooleanArray::new(mask, None))?; |
This use of filter is how I avoid casting to a concrete value type.
There is also an Interner which is used in other kernels for dictionary deduplication, but I chose not to use it because:
- it doesn't support all data types, and
- this implementation is more consistent with GenericByteViewArray::gc
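As a rough illustration of the mask-then-filter approach described above, here is a minimal std-only sketch. `ToyDictionary` and `garbage_collect` are hypothetical names for this example, not the arrow-rs types or the PR's implementation; the real code builds a boolean mask and passes it to the `filter` kernel so it never has to downcast the values array.

```rust
/// A toy dictionary-encoded column: each key indexes into `values`,
/// and `None` is a null slot. Hypothetical type, not arrow-rs's DictionaryArray.
#[derive(Debug, PartialEq)]
pub struct ToyDictionary {
    pub keys: Vec<Option<usize>>,
    pub values: Vec<String>,
}

/// Drop values that no key references and remap keys onto the compacted values.
/// Panics if a key is out of bounds for `values`.
pub fn garbage_collect(dict: &ToyDictionary) -> ToyDictionary {
    // Step 1: build a boolean mask marking every value referenced by some key.
    let mut mask = vec![false; dict.values.len()];
    for key in dict.keys.iter().flatten() {
        mask[*key] = true;
    }

    // Step 2: for each surviving old index, record its index after filtering.
    let mut remap = vec![0usize; dict.values.len()];
    let mut next = 0;
    for (old, used) in mask.iter().enumerate() {
        if *used {
            remap[old] = next;
            next += 1;
        }
    }

    // Step 3: keep only the masked values (the role played by `filter` in the PR).
    let values = dict
        .values
        .iter()
        .zip(&mask)
        .filter(|(_, used)| **used)
        .map(|(value, _)| value.clone())
        .collect();

    // Step 4: rewrite every non-null key to point at the compacted values.
    let keys = dict
        .keys
        .iter()
        .map(|&key| key.map(|old| remap[old]))
        .collect();

    ToyDictionary { keys, values }
}
```

Because the values are only ever touched through a mask, the same shape works for any value type, which is the property the comment above is after.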
|
Thanks, those suggestions have led to a much better implementation. |
|
Looks like a legitimate MSRV bug, will fix Monday. |
| // FIXME: this is a workaround for MSRV Rust versions below 1.86 where trait upcasting is not stable. | ||
| // From 1.86 onward, `&dyn AnyDictionaryArray` can be directly passed to `downcast_dictionary_array!`. | ||
| let dictionary = &*dictionary.slice(0, dictionary.len()); |
This fixed the MSRV build without materially changing the PR. I'm open to alternatives, like just removing garbage_collect_any_dictionary for now.
It seems fine to me
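For readers unfamiliar with the MSRV issue, the workaround in the snippet above can be sketched with std-only stand-ins. `BaseArray`, `AnyDict`, `ToyDict`, `describe`, and `describe_any_dict` are all hypothetical names for this example and are not arrow-rs APIs. The idea is that before Rust 1.86 a `&dyn Subtrait` cannot be coerced to `&dyn Supertrait`, but a supertrait method that already returns the base trait object sidesteps the coercion.

```rust
use std::any::Any;
use std::sync::Arc;

/// Stand-in for a base array trait (hypothetical, loosely modelled on `Array`).
pub trait BaseArray {
    fn as_any(&self) -> &dyn Any;
    fn len(&self) -> usize;
    /// Returns a new view typed as the base trait object.
    fn slice(&self, offset: usize, len: usize) -> Arc<dyn BaseArray>;
}

/// Stand-in for a subtrait such as `AnyDictionaryArray`.
pub trait AnyDict: BaseArray {
    fn num_values(&self) -> usize;
}

pub struct ToyDict {
    pub keys: Vec<usize>,
    pub values: Vec<String>,
}

impl BaseArray for ToyDict {
    fn as_any(&self) -> &dyn Any {
        self
    }
    fn len(&self) -> usize {
        self.keys.len()
    }
    fn slice(&self, offset: usize, len: usize) -> Arc<dyn BaseArray> {
        // This toy version copies the data; a real array slice would share buffers.
        Arc::new(ToyDict {
            keys: self.keys[offset..offset + len].to_vec(),
            values: self.values.clone(),
        })
    }
}

impl AnyDict for ToyDict {
    fn num_values(&self) -> usize {
        self.values.len()
    }
}

/// A function that only accepts the base trait object and downcasts it.
pub fn describe(array: &dyn BaseArray) -> String {
    match array.as_any().downcast_ref::<ToyDict>() {
        Some(dict) => format!("dictionary with {} keys", dict.keys.len()),
        None => "unknown array".to_string(),
    }
}

pub fn describe_any_dict(dict: &dyn AnyDict) -> String {
    // With trait upcasting (stable from Rust 1.86), `describe(dict)` would compile directly.
    // On older compilers, a full-length slice yields an `Arc<dyn BaseArray>` instead.
    let base = dict.slice(0, dict.len());
    describe(&*base)
}
```

The cost is one extra allocation for the `Arc` per call, which is why it reads as a temporary workaround rather than a permanent design.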
|
Merged main which should hopefully fix the |
alamb left a comment
Thank you @davidhewitt -- I think this looks great.
I had some comment suggestions and questions about pub(crate) but I don't think they are required to merge.
Can you please merge up from main to get a clean CI run on this PR?
Thanks for your patience
| /// | ||
| /// `len` is the total length of the merged output | ||
| pub fn should_merge_dictionary_values<K: ArrowDictionaryKeyType>( | ||
| pub(crate) fn should_merge_dictionary_values<K: ArrowDictionaryKeyType>( |
Is this change still needed? I didn't see should_merge_dictionary_values used anywhere in this PR
Yes - I made dictionary.rs a pub mod so that arrow_select::dictionary::garbage_collect_dictionary can be the public API path.
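The visibility reasoning here can be sketched in a few lines. The module and function names below are hypothetical and only illustrate the pattern: once a module becomes `pub mod`, every `pub` item inside it becomes part of the crate's public API, so helpers that should stay internal need to be narrowed to `pub(crate)`.

```rust
pub mod dictionary {
    /// Intended public entry point, reachable from outside the crate
    /// as `crate_name::dictionary::count_used_values`.
    pub fn count_used_values(mask: &[bool]) -> usize {
        count_true(mask)
    }

    /// Internal helper: still usable anywhere inside this crate,
    /// but not exported now that the module itself is public.
    pub(crate) fn count_true(mask: &[bool]) -> usize {
        mask.iter().filter(|used| **used).count()
    }
}
```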
| } | ||
|
|
||
| pub struct MergedDictionaries<K: ArrowDictionaryKeyType> { | ||
| pub(crate) struct MergedDictionaries<K: ArrowDictionaryKeyType> { |
Is this change needed? I didn't see MergedDictionaries used anywhere else
Same as #7716 (comment)
Co-authored-by: Andrew Lamb <[email protected]>
|
Thanks, I think this should be good to go now 👍 |
|
Looks like there is a small |
|
Done 👍 |
|
Wow CI has not been kind here! Merged main again. |
|
(Tested locally with latest nightly, looks like those docs should build ok 🤔) |
|
Ungh that was user error on my part - ran |
|
(and that was a fat finger error 🙃 - what a PR!) |
|
I think main had a CI failure on docs that @viirya fixed in Hopefully this one gets a clean(enough) run and I'll merge it in |
|
The docs CI failure https://github.com/apache/arrow-rs/actions/runs/16218327533/job/45824214545?pr=7716 is what was fixed. I think we can merge this PR in safely and it will not fail on main |
|
🚀 |
|
Thanks for your patience @davidhewitt |
|
All good, likewise and thanks for the merge! |
| pub fn garbage_collect_any_dictionary( | ||
| dictionary: &dyn AnyDictionaryArray, | ||
| ) -> Result<ArrayRef, ArrowError> { |
Any particular reason we didn't add this as a method on AnyDictionaryArray itself or closer to it? I guess that requires garbage_collect_dictionary to also move, but I'm also not sure why that one is in arrow-select. It would be nice if these APIs were the same as the view type APIs (array.gc()).
The implementation uses the arrow_select::filter kernels, and it would be undesirable to move those.
Which issue does this PR close?
Closes #7683
What changes are included in this PR?
I add arrow_select::dictionary::{garbage_collect_dictionary, garbage_collect_any_dictionary}. The latter is not strictly necessary, but I expect it will be helpful to users.
Are there any user-facing changes?
New APIs, documented.