Skip to content

Conversation

@jorgecarleitao
Copy link
Member

This PR:

  1. Makes DictionaryArray::keys be an PrimitiveArray.
  2. Removes NullIter and many of the unsafe code that it contained
  3. Simplifies the parquet writer implementation around dictionaries
  4. Indirectly removes a bug on NullIter's DoubleEndedIterator on which the implementation was not following the spec with respect to end == current (compare with implementation of DoubleEndedIterator for std's Vec's IntoIter).
  5. Fixes error in computing the size of a dictionary, which was double-counting data and values (values is a reference to data)
  6. Adds check that the dictionary's ArrayData's datatype matches T::DATA_TYPE.

Since NullIter was not being directly tested, no tests were removed.

This is backward incompatible and the migration requires replacing dict.keys() by dict.keys().into_iter() whenever the user's intention is to use keys() as an iterator.

@github-actions
Copy link

@jorgecarleitao
Copy link
Member Author

@vertexclique , I remember that you were using DoubleEndedIterator implementation of the NullIter. This PR removes that possibility, but I fielded #8562 to re-add it to the PrimitiveArrayIter, so that that functionality is recovered and supported more generally.

@jhorstmann
Copy link
Contributor

LGTM, this should also fix ARROW-10298. I believe at Signavio we also use the keys array instead of the DoubleEndedIterator now, so removing that should not be a blocker for merging.

Comment on lines +305 to +315
// This removes NULL values from the keys, but
// they're encoded by the levels, so that's fine.
keys.into_iter()
.flatten()
.map(|key| {
key.to_usize()
.unwrap_or_else(|| panic!("key {:?} does not fit in usize", key))
})
.map(|key| values.value(key))
.map(ByteArray::from)
.collect()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is coming from parquet code @carols10cents wrote. Any particular reason this is landing here with Materialize trait?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know Q's not for me, but I don't understand. Is the change not only removing the macro in favour of a generic impl?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what @nevi-me wrote: this is only a simplification (consequent of the fact that the keys now allow us to write this as a generic).

if let DataType::Dictionary(_, _) = dtype {
if let DataType::Dictionary(key_data_type, _) = data.data_type() {
if key_data_type.as_ref() != &T::DATA_TYPE {
panic!("DictionaryArray's data type must match.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
panic!("DictionaryArray's data type must match.")
unreachable!("DictionaryArray's data type must match.")

Since the former is good for defensive programming but doesn't convey the idea.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestions.

Isn't unreachable used when the program arrives at an inconsistent state?

IMO, in this case, we are checking user input (this function is public) and ensure that we will not reach an inconsistent state (in the same way assert_eq does). assert_eq calls panic!, which is why I also used panic! here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that unreachable is only for the inconsistent state. But we can leave it as panic here. I was also unsure about the user-facing API's panicking behavior. Especially in array methods with forced asserts. We should prefer Result than direct asserts, but you know... That's also yet another topic.

Comment on lines 2314 to 2315
// Since both `keys` and `values` derive (are references from) `data`, we only account for `data`.
self.data.get_array_memory_size()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This part doesn't look true. Am I missing something? Since keys array is also a primitive array, and values are string array both of the arrays' methods for buffer_memory_size should be called and summed up respectively.

Same applies for the array memory size too. Since zerocopy is not that much of a zero-copy in the physical memory level it would be nice to explicitly add them in array size calculation too.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

data contains the keys in buffers[0] and the values as child_data[0]. self.values is only another wrapper array, pointing to the same child_data[0].

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that @vertexclique 's point is that we should still use buffer from keys and values here, and in the array_size, use size_of itself. I think it is a small amount of memory difference, but to keep things consistent, I should change it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vertexclique , I modified this code. Do you think it is correct now?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Voila! Yes.

Copy link
Contributor

@nevi-me nevi-me left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the simplification, I don't have any questions, so LGTM.

len: self.data.len(),
draining: Draining::Ready,
}
pub fn keys(&self) -> &PrimitiveArray<K> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Jorge, I like this option more

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

me too. that's way better.

f,
"DictionaryArray {{keys: {:?}{} values: {:?}}}",
keys, elipsis, self.values
"DictionaryArray {{keys: {:?} values: {:?}}}",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, I'm assuming that we're relying on the keys formatting now that it's a PrimitiveArray

Comment on lines +305 to +315
// This removes NULL values from the keys, but
// they're encoded by the levels, so that's fine.
keys.into_iter()
.flatten()
.map(|key| {
key.to_usize()
.unwrap_or_else(|| panic!("key {:?} does not fit in usize", key))
})
.map(|key| values.value(key))
.map(ByteArray::from)
.collect()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know Q's not for me, but I don't understand. Is the change not only removing the macro in favour of a generic impl?

@nevi-me nevi-me closed this in 2e284f4 Nov 7, 2020
nevi-me pushed a commit that referenced this pull request Nov 7, 2020
This puts it in line with `array::NullIter`. This PR is complementary to #8561 (which removes `NullIter`), so that the PrimitiveArrayIterator can be reversed in the same way `NullIter` can. Together with #8561, it re-allows Dictionary keys to be iterated backwards.

Closes #8562 from jorgecarleitao/double_ended

Authored-by: Jorge C. Leitao <[email protected]>
Signed-off-by: Neville Dipale <[email protected]>
@jorgecarleitao jorgecarleitao deleted the dictionary_clean branch December 14, 2020 07:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants