ARROW-10174: [Java] Fix reading/writing dict structs #8363
@liyafan82 do you have time to review?
@emkornfield Sure. I will take a look in one or two days.
java/vector/src/main/java/org/apache/arrow/vector/util/DictionaryUtility.java
Maybe we need to verify the encoded vector?
No. Encoding of struct vectors is tested in TestDictionaryVector#testEncodeStruct. The encoded vector should be fine.
java/vector/src/test/java/org/apache/arrow/vector/ipc/TestArrowReaderWriter.java
Should we validate the read dictionary vector?
Yes. I added a check.
java/vector/src/test/java/org/apache/arrow/vector/ipc/TestArrowReaderWriter.java
Do we need to consider the null values in v?
Yes, why not. I added an if check for them.
Maybe we also need to set the value count for the child vector?
No. The struct vector sets the value count of its children in #setValueCount.
c6abf28 to 113ea45
I implemented the changes and force pushed them. Thank you for the advice with the
Should we call setIndexDefined only if v.get(i) != null?
I am not sure.
The current implementation does not allow setting any struct element to null. If we only call #setIndexDefined when the child element is not null, the implementation would not allow (non-null) structs with all child values being null. I am okay with both, but I think we should keep the API simple because this method is just there to make tests shorter.
For a StructVector, we can set an element to null. For its super class (NonNullableStructVector), we cannot set an element to null.
Here, I generally prefer setting an element to null if all sub-elements are null, because ValueVectorDataPopulator is a general-purpose class, and we may use it to test scenarios with null elements.
I see. I changed it.
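The rule settled on in this thread can be illustrated with a minimal plain-Java sketch. It does not use the Arrow API: `StructNullRule` and `definedRows` are hypothetical names, and the `boolean[]` only stands in for the struct's validity information. A row counts as defined when at least one child has a non-null value at that index, so a row whose children are all null ends up as a null struct element.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class StructNullRule {

  /**
   * Hypothetical model: returns one flag per row, true when at least one
   * child column holds a non-null value at that row.
   */
  static boolean[] definedRows(Map<String, List<Integer>> children) {
    // The row count is the length of the longest child column.
    int rowCount = 0;
    for (List<Integer> values : children.values()) {
      rowCount = Math.max(rowCount, values.size());
    }
    boolean[] defined = new boolean[rowCount];
    for (List<Integer> values : children.values()) {
      for (int i = 0; i < values.size(); i++) {
        if (values.get(i) != null) {
          defined[i] = true; // mark the struct row only for non-null children
        }
      }
    }
    return defined;
  }

  public static void main(String[] args) {
    Map<String, List<Integer>> children = new LinkedHashMap<>();
    children.put("a", Arrays.asList(1, null, 3));
    children.put("b", Arrays.asList(10, null, null));
    // Row 1 has only null children, so it stays undefined.
    System.out.println(Arrays.toString(definedRows(children))); // prints [true, false, true]
  }
}
```

The trade-off raised earlier in the thread still applies: under this rule a non-null struct whose children are all null cannot be expressed through the helper.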
Would it be beneficial to close them through a try-with-resources statement to avoid resource leaks?
Yes, I changed it. I did not do it before because I did not want the deep nesting.
I see. Deep nesting is not good-looking.
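One way to limit the nesting is to declare several resources in a single try-with-resources statement. The sketch below is plain Java and only illustrates the language feature: `Resource` is a hypothetical stand-in for whatever is being closed in the test (allocators, vectors, writers), not an Arrow class.

```java
import java.util.ArrayList;
import java.util.List;

public class TryWithResourcesSketch {
  // Records open/use/close events so the ordering is visible.
  static final List<String> LOG = new ArrayList<>();

  /** Hypothetical stand-in for a closeable test resource. */
  static final class Resource implements AutoCloseable {
    private final String name;

    Resource(String name) {
      this.name = name;
      LOG.add("open " + name);
    }

    @Override
    public void close() {
      LOG.add("close " + name);
    }
  }

  static void run() {
    // Three resources, one level of nesting; they are closed in reverse order.
    try (Resource allocator = new Resource("allocator");
         Resource vector = new Resource("vector");
         Resource writer = new Resource("writer")) {
      LOG.add("use");
    }
  }

  public static void main(String[] args) {
    run();
    // prints [open allocator, open vector, open writer, use,
    //         close writer, close vector, close allocator]
    System.out.println(LOG);
  }
}
```

Resources declared later may depend on earlier ones, since the later ones are closed first.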
113ea45 to 2d270a5
When translating between the memory FieldType and message FieldType for dictionary encoded vectors the children of the dictionary field were not handled correctly.
* When going from memory format to message format the Field must have the children of the dictionary field.
* When going from message format to memory format the Field must have no children but the dictionary must have the mapped children
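The two directions described above can be modeled with a small plain-Java sketch. This is not the code in DictionaryUtility and does not use Arrow classes: `SimpleField`, `toMessageFormat`, `toMemoryFormat`, the string type names, the fixed `Int(32)` index type, and the `"DICT" + id` naming are all assumptions made for illustration. The sketch also does not recurse into nested children.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DictFieldSketch {

  /** Hypothetical, simplified field: name, type name, optional dictionary id, children. */
  static final class SimpleField {
    final String name;
    final String type;
    final Long dictionaryId; // null when the field is not dictionary encoded
    final List<SimpleField> children;

    SimpleField(String name, String type, Long dictionaryId, List<SimpleField> children) {
      this.name = name;
      this.type = type;
      this.dictionaryId = dictionaryId;
      this.children = Collections.unmodifiableList(new ArrayList<>(children));
    }
  }

  /** Memory to message: the encoded field takes the dictionary's type and children. */
  static SimpleField toMessageFormat(SimpleField field, Map<Long, SimpleField> dictionaries) {
    if (field.dictionaryId == null) {
      return field;
    }
    SimpleField dict = dictionaries.get(field.dictionaryId);
    return new SimpleField(field.name, dict.type, field.dictionaryId, dict.children);
  }

  /** Message to memory: the field loses its children; the dictionary field receives them. */
  static SimpleField toMemoryFormat(SimpleField field, Map<Long, SimpleField> dictionaries) {
    if (field.dictionaryId == null) {
      return field;
    }
    dictionaries.put(field.dictionaryId,
        new SimpleField("DICT" + field.dictionaryId, field.type, null, field.children));
    // Assumption: the index type is fixed here; a real schema would carry it explicitly.
    return new SimpleField(field.name, "Int(32)", field.dictionaryId,
        Collections.<SimpleField>emptyList());
  }

  public static void main(String[] args) {
    List<SimpleField> structChildren = Arrays.asList(
        new SimpleField("a", "Int(32)", null, Collections.<SimpleField>emptyList()),
        new SimpleField("b", "Int(32)", null, Collections.<SimpleField>emptyList()));
    Map<Long, SimpleField> dictionaries = new HashMap<>();
    dictionaries.put(1L, new SimpleField("DICT1", "Struct", null, structChildren));

    SimpleField memory = new SimpleField("col", "Int(32)", 1L,
        Collections.<SimpleField>emptyList());
    SimpleField message = toMessageFormat(memory, dictionaries);
    System.out.println(message.type + " with " + message.children.size() + " children");

    Map<Long, SimpleField> restored = new HashMap<>();
    SimpleField back = toMemoryFormat(message, restored);
    System.out.println(back.type + " with " + back.children.size() + " children");
  }
}
```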
2d270a5 to 7026c25
liyafan82 left a comment
LGTM, will merge soon if there are no more comments.
Merging. Thanks for your effort. @HedgehogCode
When translating between the memory FieldType and message FieldType for dictionary encoded vectors the children of the dictionary field were not handled correctly.
* When going from memory format to message format the Field must have the children of the dictionary field.
* When going from message format to memory format the Field must have no children but the dictionary must have the mapped children

Closes #8363 from HedgehogCode/bug/ARROW-10174-dict-structs

Authored-by: Benjamin Wilhelm <[email protected]>
Signed-off-by: liyafan82 <[email protected]>