Skip to content

Conversation

@davidhewitt
Copy link
Contributor

Which issue does this PR close?

Closes #7683

What changes are included in this PR?

I add arrow_select::dictionary::{garbage_collect_dictionary, garbage_collect_any_dictionary}.

The latter is not strictly necessary but I expect it will be helpful to users.

Are there any user-facing changes?

New APIs, documented.

@github-actions github-actions bot added the arrow Changes to the arrow crate label Jun 19, 2025
Comment on lines 86 to 87
// Create a new values array with the masked values
let values = filter(dictionary.values(), &BooleanArray::new(mask, None))?;
Copy link
Contributor Author

@davidhewitt davidhewitt Jun 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This use of filter is how I avoid casting to a concrete value type.

There is also an Interner which is used in other kernels for dictionary deduplication, but I chose not to use it because:

  • it doesn't support all data types, and
  • this implementation is more consistent with GenericByteViewArray::gc

@davidhewitt
Copy link
Contributor Author

Thanks, those suggestions have led to a much better implementation.

@davidhewitt
Copy link
Contributor Author

Looks like legitimate MSRV bug, will fix monday.

Comment on lines +79 to +81
// FIXME: this is a workaround for MSRV Rust versions below 1.86 where trait upcasting is not stable.
// From 1.86 onward, `&dyn AnyDictionaryArray` can be directly passed to `downcast_dictionary_array!`.
let dictionary = &*dictionary.slice(0, dictionary.len());
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This fixed MSRV without materially changing the PR, I'm open to alternatives like just removing garbage_collect_any_dictionary for now.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems fine to me

@davidhewitt davidhewitt requested a review from tustvold June 30, 2025 09:26
@davidhewitt
Copy link
Contributor Author

ping @alamb @tustvold just wanted to make sure this doesn't get lost in the backlog please 🙏

@davidhewitt
Copy link
Contributor Author

Merged main which should hopefully fix the clippy failures.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @davidhewitt -- I think this looks great.

I had some comment suggestions and questions about pub(crate) but I don't think they are required to merge.

Can you please merge up from main to get a clean CI run on this PR?

Thanks for your patience

Comment on lines +79 to +81
// FIXME: this is a workaround for MSRV Rust versions below 1.86 where trait upcasting is not stable.
// From 1.86 onward, `&dyn AnyDictionaryArray` can be directly passed to `downcast_dictionary_array!`.
let dictionary = &*dictionary.slice(0, dictionary.len());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems fine to me

///
/// `len` is the total length of the merged output
pub fn should_merge_dictionary_values<K: ArrowDictionaryKeyType>(
pub(crate) fn should_merge_dictionary_values<K: ArrowDictionaryKeyType>(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Iis this change still needed? I didn't see should_merge_dictionary_values used anyhwere in this PR

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes - I made dictionary.rs a pub mod so that arrow_select::dictionary::garbage_collect_dictionary can be the public API path.

}

pub struct MergedDictionaries<K: ArrowDictionaryKeyType> {
pub(crate) struct MergedDictionaries<K: ArrowDictionaryKeyType> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this change needed? I didn't see MergedDictionaries used anywhere else

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as #7716 (comment)

@davidhewitt
Copy link
Contributor Author

Thansk, I think should be good to go now 👍

@alamb
Copy link
Contributor

alamb commented Jul 10, 2025

Looks like there is a small cargo fmt issue to fix

@davidhewitt
Copy link
Contributor Author

Done 👍

@davidhewitt
Copy link
Contributor Author

Wow CI has not been kind here! Merged main again.

@davidhewitt
Copy link
Contributor Author

(Tested locally with latest nightly, looks like those docs should build ok 🤔)

@davidhewitt
Copy link
Contributor Author

Ungh that was user error on my part - ran cargo doc instead of cargo +nightly doc 🤦 . I repro this locally, will either send a PR myself to fixup later or maybe someone else sends one first.

@davidhewitt davidhewitt reopened this Jul 11, 2025
@davidhewitt
Copy link
Contributor Author

(and that was a fat finger error 🙃 - what a PR!)

@alamb
Copy link
Contributor

alamb commented Jul 11, 2025

I think main had a CI failure on docs that @viirya fixed in

Hopefully this one gets a clean(enough) run and I'll merge it in

@alamb
Copy link
Contributor

alamb commented Jul 11, 2025

The docs CI failure https://github.com/apache/arrow-rs/actions/runs/16218327533/job/45824214545?pr=7716 is what was fixed. It think we can merge this PR in safely and it will not fail on main

@alamb
Copy link
Contributor

alamb commented Jul 11, 2025

🚀

@alamb alamb merged commit 6d11232 into apache:main Jul 11, 2025
49 of 51 checks passed
@alamb
Copy link
Contributor

alamb commented Jul 11, 2025

Thanks for you patience @davidhewitt

@davidhewitt
Copy link
Contributor Author

All good, likewise and thanks for the merge!

Comment on lines +82 to +84
pub fn garbage_collect_any_dictionary(
dictionary: &dyn AnyDictionaryArray,
) -> Result<ArrayRef, ArrowError> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any particular reason we didn't ad this as a method on AnyDictionaryArray itself or closer to it? I guess that requires garbage_collect_dictionary to also move, but I'm also not sure why that one is in arrow-select. It would be nice if these APIs were the same as the view type APIs (array.gc()).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The implementation uses arrow_select::filter kernels, it would be undesirable to move those.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

arrow Changes to the arrow crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add DictionaryArray::gc method

4 participants