ARROW-10449 [Rust] Make Dictionary::keys be an array #8561
|
@vertexclique , I remember that you were using |
|
LGTM, this should also fix ARROW-10298. I believe at Signavio we also use the keys array instead of the |
    // This removes NULL values from the keys, but
    // they're encoded by the levels, so that's fine.
    keys.into_iter()
        .flatten()
        .map(|key| {
            key.to_usize()
                .unwrap_or_else(|| panic!("key {:?} does not fit in usize", key))
        })
        .map(|key| values.value(key))
        .map(ByteArray::from)
        .collect()
This is coming from parquet code @carols10cents wrote. Any particular reason this is landing here with Materialize trait?
I know Q's not for me, but I don't understand. Is the change not only removing the macro in favour of a generic impl?
What @nevi-me wrote: this is only a simplification (a consequence of the fact that the keys now allow us to write this as a generic).
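The generic lookup discussed in this thread can be sketched without the arrow or parquet crates. This is only an illustration under stated assumptions: `materialize` is a hypothetical name, and plain slices stand in for the keys `PrimitiveArray` and the values array. It shows how a single generic function over the key type can replace one macro expansion per integer type.

```rust
use std::convert::TryInto;
use std::fmt::Debug;

/// Resolve dictionary keys to their values, skipping null keys.
/// Hypothetical helper: slices stand in for arrow's arrays.
fn materialize<'a, K>(keys: &[Option<K>], values: &[&'a str]) -> Vec<&'a str>
where
    K: Copy + Debug + TryInto<usize>,
{
    keys.iter()
        .copied()
        .flatten() // drop `None` (null) keys
        .map(|key| {
            // Convert the key to an index, panicking if it does not fit.
            let index: usize = key
                .try_into()
                .unwrap_or_else(|_| panic!("key {:?} does not fit in usize", key));
            index
        })
        .map(|index| values[index])
        .collect()
}
```

Because `K` is only constrained by `TryInto<usize>`, the same function accepts `i8`, `u16`, `i64` and other integer keys.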
    if let DataType::Dictionary(_, _) = dtype {
    if let DataType::Dictionary(key_data_type, _) = data.data_type() {
        if key_data_type.as_ref() != &T::DATA_TYPE {
            panic!("DictionaryArray's data type must match.")
    - panic!("DictionaryArray's data type must match.")
    + unreachable!("DictionaryArray's data type must match.")
The former is good for defensive programming, but it doesn't convey the idea.
Thanks for the suggestions.
Isn't unreachable used when the program arrives at an inconsistent state?
IMO, in this case, we are checking user input (this function is public) and ensuring that we will not reach an inconsistent state (in the same way assert_eq does). assert_eq calls panic!, which is why I also used panic! here.
I don't think that unreachable is only for inconsistent states, but we can leave it as panic here. I was also unsure about the panicking behavior of user-facing APIs, especially in array methods with forced asserts. We should prefer Result over direct asserts, but you know... That's yet another topic.
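For readers weighing the `Result` alternative raised here, the following is a minimal sketch of the same key-type check returning an error instead of panicking. Everything in it is hypothetical: `DataType` is a cut-down stand-in for arrow's enum, and `DictionaryError` and `check_key_type` are illustrative names, not part of this PR, which keeps `panic!`.

```rust
use std::fmt;

// Hypothetical, simplified stand-in for arrow's `DataType`.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int8,
    Int32,
    Utf8,
    Dictionary(Box<DataType>, Box<DataType>),
}

// Hypothetical error type used only for this sketch.
#[derive(Debug, PartialEq)]
enum DictionaryError {
    NotADictionary(DataType),
    KeyTypeMismatch { expected: DataType, found: DataType },
}

impl fmt::Display for DictionaryError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            DictionaryError::NotADictionary(found) => {
                write!(f, "expected a dictionary type, found {:?}", found)
            }
            DictionaryError::KeyTypeMismatch { expected, found } => {
                write!(f, "dictionary key type must be {:?}, found {:?}", expected, found)
            }
        }
    }
}

impl std::error::Error for DictionaryError {}

/// Validate that `data_type` is a dictionary whose key type is `expected_key`.
fn check_key_type(data_type: &DataType, expected_key: &DataType) -> Result<(), DictionaryError> {
    match data_type {
        DataType::Dictionary(key_type, _) if **key_type == *expected_key => Ok(()),
        DataType::Dictionary(key_type, _) => Err(DictionaryError::KeyTypeMismatch {
            expected: expected_key.clone(),
            found: (**key_type).clone(),
        }),
        other => Err(DictionaryError::NotADictionary(other.clone())),
    }
}
```

The trade-off is that the caller then has to handle the error, whereas a `From<ArrayData>`-style conversion has no place to return one.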
rust/arrow/src/array/array.rs
Outdated
    // Since both `keys` and `values` derive (are references from) `data`, we only account for `data`.
    self.data.get_array_memory_size()
This part doesn't look right. Am I missing something? Since the keys array is a primitive array and the values are a string array, buffer_memory_size should be called on both arrays and the results summed.
The same applies to the array memory size. Since zero-copy is not truly zero-copy at the physical memory level, it would be nice to add them explicitly in the array size calculation too.
data contains the keys in buffers[0] and the values as child_data[0]. self.values is only another wrapper array, pointing to the same child_data[0].
I think that @vertexclique 's point is that we should still use buffer from keys and values here, and in the array_size, use size_of itself. I think it is a small amount of memory difference, but to keep things consistent, I should change it.
@vertexclique , I modified this code. Do you think it is correct now?
Voila! Yes.
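The accounting question in this thread can be illustrated with a small model. This is a sketch under assumptions, not the arrow implementation: `ArrayData` and `Dictionary` here are hypothetical stand-ins, with the keys in `buffers[0]` and the values in `child_data[0]` as described above. It shows that counting `data` alone and counting the keys buffers plus the values give the same number of bytes, because `values` points at the same allocation as `child_data[0]`.

```rust
use std::sync::Arc;

// Hypothetical stand-in for arrow's `ArrayData`.
struct ArrayData {
    buffers: Vec<Arc<Vec<u8>>>,
    child_data: Vec<Arc<ArrayData>>,
}

impl ArrayData {
    /// Bytes held by this array's own buffers (children excluded).
    fn own_buffer_bytes(&self) -> usize {
        self.buffers.iter().map(|b| b.len()).sum()
    }

    /// Bytes held by this array and, recursively, by its children.
    fn buffer_memory_size(&self) -> usize {
        let children: usize = self.child_data.iter().map(|c| c.buffer_memory_size()).sum();
        self.own_buffer_bytes() + children
    }
}

// Hypothetical stand-in for `DictionaryArray`.
struct Dictionary {
    data: Arc<ArrayData>,
    values: Arc<ArrayData>, // shares the allocation of `data.child_data[0]`
}

impl Dictionary {
    fn new(data: Arc<ArrayData>) -> Self {
        let values = Arc::clone(&data.child_data[0]);
        Dictionary { data, values }
    }

    /// Count everything through `data` only.
    fn size_via_data(&self) -> usize {
        self.data.buffer_memory_size()
    }

    /// Count the keys buffers and the values separately, then sum.
    fn size_via_parts(&self) -> usize {
        self.data.own_buffer_bytes() + self.values.buffer_memory_size()
    }
}
```

Adding `data.buffer_memory_size()` and `values.buffer_memory_size()` together would count the values twice, which is the pitfall both approaches avoid.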
nevi-me
left a comment
I like the simplification, I don't have any questions, so LGTM.
            len: self.data.len(),
            draining: Draining::Ready,
        }
    pub fn keys(&self) -> &PrimitiveArray<K> {
Thanks Jorge, I like this option more
me too. that's way better.
    f,
    "DictionaryArray {{keys: {:?}{} values: {:?}}}",
    keys, elipsis, self.values
    "DictionaryArray {{keys: {:?} values: {:?}}}",
Nice, I'm assuming that we're relying on the keys formatting now that it's a PrimitiveArray
This puts it in line with `array::NullIter`. This PR is complementary to #8561 (which removes `NullIter`), so that the PrimitiveArrayIterator can be reversed in the same way `NullIter` can. Together with #8561, it re-allows Dictionary keys to be iterated backwards.
Closes #8562 from jorgecarleitao/double_ended
Authored-by: Jorge C. Leitao <[email protected]>
Signed-off-by: Neville Dipale <[email protected]>
This PR:

- Makes `DictionaryArray::keys` be a `PrimitiveArray`.
- Removes `NullIter` and much of the `unsafe` code that it contained.
- […] `DoubleEndedIterator`, on which the implementation was not following the spec with respect to `end == current` (compare with the implementation of `DoubleEndedIterator` for std's `Vec`'s `IntoIter`).
- […] `data` and `values` (`values` is a reference to `data`).
- […] `T::DATA_TYPE`.

Since `NullIter` was not being directly tested, no tests were removed.

This is backward incompatible, and the migration requires replacing `dict.keys()` by `dict.keys().into_iter()` whenever the user's intention is to use `keys()` as an iterator.
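The `end == current` rule and the `into_iter()` migration can both be seen in a small self-contained sketch. It does not use the arrow crate: `Keys` and `KeysIter` are hypothetical stand-ins for the keys array and its iterator, assumed here to hold `Option<i32>` entries. The point is that `next` and `next_back` share one pair of cursors, and both return `None` once the cursors meet.

```rust
// Hypothetical stand-in for the keys array.
struct Keys {
    values: Vec<Option<i32>>,
}

// Iterator over `Keys`, tracking a front cursor and a back cursor.
struct KeysIter<'a> {
    keys: &'a Keys,
    current: usize, // index of the next item from the front
    end: usize,     // one past the next item from the back
}

impl<'a> Iterator for KeysIter<'a> {
    type Item = Option<i32>;

    fn next(&mut self) -> Option<Self::Item> {
        if self.current == self.end {
            None // cursors met: nothing left from either side
        } else {
            let item = self.keys.values[self.current];
            self.current += 1;
            Some(item)
        }
    }

    fn size_hint(&self) -> (usize, Option<usize>) {
        let remaining = self.end - self.current;
        (remaining, Some(remaining))
    }
}

impl DoubleEndedIterator for KeysIter<'_> {
    fn next_back(&mut self) -> Option<Self::Item> {
        if self.end == self.current {
            None // same exhaustion check as `next`
        } else {
            self.end -= 1;
            Some(self.keys.values[self.end])
        }
    }
}

impl<'a> IntoIterator for &'a Keys {
    type Item = Option<i32>;
    type IntoIter = KeysIter<'a>;

    fn into_iter(self) -> KeysIter<'a> {
        KeysIter { keys: self, current: 0, end: self.values.len() }
    }
}
```

With a type shaped like this, code that used to iterate the keys directly would call `into_iter()` on the reference first, and `.rev()` then works because the iterator is double-ended.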