ARROW-10449 [Rust] Make Dictionary::keys be an array #8561

jorgecarleitao · 2020-10-31T07:25:19Z

This PR:

Makes DictionaryArray::keys be a PrimitiveArray.
Removes NullIter and much of the unsafe code that it contained.
Simplifies the parquet writer implementation around dictionaries.
Indirectly removes a bug in NullIter's DoubleEndedIterator implementation, which did not follow the spec when end == current (compare with the implementation of DoubleEndedIterator for std's Vec IntoIter).
Fixes an error in computing the size of a dictionary, which was double-counting data and values (values is a reference to data).
Adds a check that the datatype of the dictionary's ArrayData matches T::DATA_TYPE.
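
To illustrate the end == current point, here is a minimal std-only sketch. `OptIter` is a hypothetical type made up for this example; it is neither the removed NullIter nor arrow's PrimitiveArrayIter. It only shows the behaviour the DoubleEndedIterator contract asks for: both ends share one half-open range, and once they meet both `next` and `next_back` return None.

```rust
/// Hypothetical iterator over a slice of optional values.
/// It tracks a half-open range `[current, end)` shared by both ends.
pub struct OptIter<'a, T> {
    data: &'a [Option<T>],
    current: usize,
    end: usize,
}

impl<'a, T: Copy> OptIter<'a, T> {
    pub fn new(data: &'a [Option<T>]) -> Self {
        Self { data, current: 0, end: data.len() }
    }
}

impl<'a, T: Copy> Iterator for OptIter<'a, T> {
    type Item = Option<T>;

    fn next(&mut self) -> Option<Self::Item> {
        // When the two ends have met, the range is empty.
        if self.current == self.end {
            return None;
        }
        let item = self.data[self.current];
        self.current += 1;
        Some(item)
    }
}

impl<'a, T: Copy> DoubleEndedIterator for OptIter<'a, T> {
    fn next_back(&mut self) -> Option<Self::Item> {
        // Same emptiness check as `next`, so no item is yielded twice.
        if self.current == self.end {
            return None;
        }
        self.end -= 1;
        Some(self.data[self.end])
    }
}
```

Mixing calls from both ends over `[Some(1), None, Some(3)]` yields each slot exactly once, after which both ends report exhaustion.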

Since NullIter was not being directly tested, no tests were removed.

This is backward incompatible: migrating requires replacing dict.keys() with dict.keys().into_iter() wherever the user's intention is to use keys() as an iterator.

github-actions · 2020-10-31T07:31:31Z

https://issues.apache.org/jira/browse/ARROW-10449

jorgecarleitao · 2020-10-31T07:38:47Z

@vertexclique, I remember that you were using the DoubleEndedIterator implementation of NullIter. This PR removes that possibility, but I filed #8562 to re-add it to PrimitiveArrayIter, so that the functionality is recovered and supported more generally.

jhorstmann · 2020-11-02T15:58:06Z

LGTM, this should also fix ARROW-10298. I believe at Signavio we also use the keys array instead of the DoubleEndedIterator now, so removing that should not be a blocker for merging.

vertexclique · 2020-11-03T10:00:19Z

rust/parquet/src/arrow/arrow_writer.rs

+        // This removes NULL values from the keys, but
+        // they're encoded by the levels, so that's fine.
+        keys.into_iter()
+            .flatten()
+            .map(|key| {
+                key.to_usize()
+                    .unwrap_or_else(|| panic!("key {:?} does not fit in usize", key))
+            })
+            .map(|key| values.value(key))
+            .map(ByteArray::from)
+            .collect()


This is coming from parquet code that @carols10cents wrote. Is there any particular reason this is landing here with the Materialize trait?

I know Q's not for me, but I don't understand. Is the change not only removing the macro in favour of a generic impl?

As @nevi-me wrote: this is only a simplification (a consequence of the fact that the keys now allow us to write this as a generic).
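
For readers without the arrow types at hand, the shape of that generic can be sketched with std only. This is an assumption-laden stand-in, not the parquet writer code: `materialize` is a hypothetical free function, keys are modelled as `&[Option<i32>]` and values as a slice of string slices.

```rust
use std::convert::TryFrom;

/// Hypothetical stand-in for the writer's key-to-value lookup.
/// Null keys are dropped; in the real writer they are carried by the levels.
pub fn materialize<'a>(keys: &[Option<i32>], values: &[&'a str]) -> Vec<&'a str> {
    keys.iter()
        .copied()
        // `flatten` skips `None` entries, i.e. null keys.
        .flatten()
        .map(|key| {
            usize::try_from(key)
                .unwrap_or_else(|_| panic!("key {:?} does not fit in usize", key))
        })
        // Look each surviving key up in the dictionary values.
        .map(|key| values[key])
        .collect()
}
```

With keys `[Some(1), None, Some(0), Some(1)]` and values `["a", "b"]`, the null slot is skipped and the remaining keys resolve in order.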

vertexclique · 2020-11-03T10:06:58Z

rust/arrow/src/array/array.rs

-        if let DataType::Dictionary(_, _) = dtype {
+        if let DataType::Dictionary(key_data_type, _) = data.data_type() {
+            if key_data_type.as_ref() != &T::DATA_TYPE {
+                panic!("DictionaryArray's data type must match.")


Suggested change

-                panic!("DictionaryArray's data type must match.")
+                unreachable!("DictionaryArray's data type must match.")

panic! is fine for defensive programming, but it doesn't convey the idea.

Thanks for the suggestions.

Isn't unreachable used when the program arrives at an inconsistent state?

IMO, in this case, we are checking user input (this function is public) and ensuring that we will not reach an inconsistent state (in the same way assert_eq does). assert_eq calls panic!, which is why I also used panic! here.

I don't think that unreachable is only for the inconsistent state. But we can leave it as panic here. I was also unsure about the panicking behavior of the user-facing API, especially in array methods with forced asserts. We should prefer Result over direct asserts, but you know... That's yet another topic.
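
The two options discussed here can be put side by side in a small std-only sketch. Everything in it is hypothetical (`KeyType` and both function names are invented for illustration and are not arrow APIs); it only contrasts a panicking check on user input with a Result-returning one.

```rust
/// Hypothetical stand-in for the key part of a dictionary data type.
#[derive(Debug, PartialEq, Clone)]
pub enum KeyType {
    Int8,
    Int32,
}

/// Panicking variant: mirrors what `assert_eq!` would do on a mismatch.
pub fn check_key_type_or_panic(expected: &KeyType, actual: &KeyType) {
    if actual != expected {
        panic!("DictionaryArray's data type must match.")
    }
}

/// Result variant: the caller decides how to handle a mismatch.
pub fn check_key_type(expected: &KeyType, actual: &KeyType) -> Result<(), String> {
    if actual == expected {
        Ok(())
    } else {
        Err(format!("expected key type {:?}, got {:?}", expected, actual))
    }
}
```

The panicking form keeps the constructor signature simple, while the Result form pushes the decision to the caller at the cost of a fallible constructor.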

vertexclique · 2020-11-03T10:20:23Z

rust/arrow/src/array/array.rs

+        // Since both `keys` and `values` derive (are references from) `data`, we only account for `data`.
+        self.data.get_array_memory_size()


This part doesn't look true. Am I missing something? Since the keys array is also a primitive array, and the values are a string array, the buffer_memory_size methods of both arrays should be called and summed.

The same applies to the array memory size. Since zero-copy is not really zero-copy at the physical memory level, it would be nice to explicitly add them in the array size calculation too.

data contains the keys in buffers[0] and the values as child_data[0]. self.values is only another wrapper array, pointing to the same child_data[0].
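
The aliasing described in this reply can be modelled with std only. The sketch below is not the arrow layout or the code that was eventually merged; `Data` and `Dict` are hypothetical types in which an `Arc` plays the role of the shared child data, just to show why adding the values on top of the data counts the same bytes twice.

```rust
use std::mem::size_of;
use std::sync::Arc;

/// Hypothetical stand-in for ArrayData: owns the keys buffer and the child.
pub struct Data {
    keys: Vec<i32>,       // plays the role of buffers[0]
    child: Arc<Vec<u8>>,  // plays the role of child_data[0]
}

impl Data {
    fn buffer_size(&self) -> usize {
        self.keys.len() * size_of::<i32>() + self.child.len()
    }
}

/// Hypothetical dictionary: `values` is only another handle to `data.child`.
pub struct Dict {
    data: Data,
    values: Arc<Vec<u8>>,
}

impl Dict {
    pub fn new(keys: Vec<i32>, values: Vec<u8>) -> Self {
        let child = Arc::new(values);
        let values = Arc::clone(&child);
        Dict { data: Data { keys, child }, values }
    }

    /// True when `values` points at the same allocation as the child data.
    pub fn values_alias_data(&self) -> bool {
        Arc::ptr_eq(&self.values, &self.data.child)
    }

    /// Counts the value bytes twice: once via `data`, once via `values`.
    pub fn double_counted_size(&self) -> usize {
        self.data.buffer_size() + self.values.len()
    }

    /// Counts every byte once by only walking `data`.
    pub fn size(&self) -> usize {
        self.data.buffer_size()
    }
}
```

For three `i32` keys and five value bytes, walking `data` alone gives 17 bytes, while also adding `values` gives 22.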

I think that @vertexclique 's point is that we should still use buffer from keys and values here, and in the array_size, use size_of itself. I think it is a small amount of memory difference, but to keep things consistent, I should change it.

@vertexclique , I modified this code. Do you think it is correct now?

Voila! Yes.

nevi-me

I like the simplification, I don't have any questions, so LGTM.

nevi-me · 2020-11-03T20:01:07Z

rust/arrow/src/array/array.rs

-            len: self.data.len(),
-            draining: Draining::Ready,
-        }
+    pub fn keys(&self) -> &PrimitiveArray<K> {


Thanks Jorge, I like this option more

me too. that's way better.

nevi-me · 2020-11-03T20:08:11Z

rust/arrow/src/array/array.rs

            f,
-            "DictionaryArray {{keys: {:?}{} values: {:?}}}",
-            keys, elipsis, self.values
+            "DictionaryArray {{keys: {:?} values: {:?}}}",


Nice, I'm assuming that we're relying on the keys formatting now that it's a PrimitiveArray

rust/arrow/src/array/builder.rs

nevi-me · 2020-11-03T20:14:38Z

rust/parquet/src/arrow/arrow_writer.rs

I know Q's not for me, but I don't understand. Is the change not only removing the macro in favour of a generic impl?

This puts it in line with `array::NullIter`.

This PR is complementary to #8561 (which removes `NullIter`), so that the PrimitiveArrayIterator can be reversed in the same way `NullIter` can. Together with #8561, it re-allows Dictionary keys to be iterated backwards.

Closes #8562 from jorgecarleitao/double_ended

Authored-by: Jorge C. Leitao <[email protected]>
Signed-off-by: Neville Dipale <[email protected]>

jorgecarleitao added 2 commits October 31, 2020 08:24

Fixed error in dictionary slice.

c71677c

Migrated parquet to new keys.

b95ddc9

github-actions bot added the Component: Rust label Oct 31, 2020

jorgecarleitao mentioned this pull request Oct 31, 2020

ARROW-10445: [Rust] Added doubleEnded iterator to PrimitiveArrayIter #8562

Closed

vertexclique reviewed Nov 3, 2020

nevi-me approved these changes Nov 3, 2020

Addressed review comments.

e4644db

vertexclique approved these changes Nov 4, 2020

nevi-me closed this in 2e284f4 Nov 7, 2020

jorgecarleitao deleted the dictionary_clean branch December 14, 2020 07:35

asfimport mentioned this pull request Nov 7, 2020

[Rust] Make dictionary keys be a PrimitiveArray #26426

Closed
