Skip to content

Conversation

@HedgehogCode
Copy link
Contributor

When translating between the memory FieldType and message FieldType for
dictionary encoded vectors the children of the dictionary field were not
handled correctly.

  • When going from memory format to message format the Field must have the
    children of the dictionary field.
  • When going from message format to memory format the Field must have no
    children but the dictionary must have the mapped children

@github-actions
Copy link

github-actions bot commented Oct 6, 2020

@emkornfield
Copy link
Contributor

@liyafan82 do you have time to review?

@liyafan82
Copy link
Contributor

@liyafan82 do you have time to review?

@emkornfield Sure. I will take a look in one or two days.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we need to verify the encoded vector?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. Encoding of struct vectors is tested in TestDictionaryVector#testEncodeStruct. The encoded vector should be fine.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we validate the read dictionary vector?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. I added a check.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to consider the null values in v?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, why not. I added an if switch for them.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we also need to set the value count for the child vector

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. The struct vector sets the value count of its children in #setValueCount.

@HedgehogCode HedgehogCode force-pushed the bug/ARROW-10174-dict-structs branch from c6abf28 to 113ea45 Compare October 9, 2020 13:34
@HedgehogCode
Copy link
Contributor Author

I implemented the changes and force pushed them.

Thank you for the advice with the RangeEqualsVisitor. Using it I discovered another bug in DictionaryUtility. The field for the Dictionary was always created with nullable=false but this is incorrect because dictionaries can be nullable. Therefore, I replaced this with the nullability read from the schema.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should call setIndexDefined only if v.get(i) != null?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure.
The current implementation does not allow setting any struct element to null. If we only call #setIndexDefined if the child element is not null the implementation would not allow (non null) structs with all child values beeing null. I am okay with both but I think we should keep the API simple because this method is just there to make tests shorter.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For a StructVector, we can set an element to null. For its super class (NonNullableStructVector), we cannot set an element to null.

Here, I generally prefer setting an element to null if all sub-elements are null, because ValueVectorDataPopulator is a generally-purpose class, and we may use it to test scenarios with null elements.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. I changed it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it would be beneficial to close them through a try-with-resource clause to avoid resource leak?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I changed it. I did not do it before because I did not want the deep nesting.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. Deep nesting is not good-looking.

@HedgehogCode HedgehogCode force-pushed the bug/ARROW-10174-dict-structs branch from 113ea45 to 2d270a5 Compare October 12, 2020 08:50
When translating between the memory FieldType and message FieldType for
dictionary encoded vectors the children of the dictionary field were not
handled correctly.
* When going from memory format to message format the Field must have the
  children of the dictionary field.
* When going from message format to memory format the Field must have no
  children but the dictionary must have the mapped children
@HedgehogCode HedgehogCode force-pushed the bug/ARROW-10174-dict-structs branch from 2d270a5 to 7026c25 Compare October 12, 2020 12:37
Copy link
Contributor

@liyafan82 liyafan82 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, will merge soon if there are no more comments.

@liyafan82
Copy link
Contributor

Merging. Thanks for your effort. @HedgehogCode

@liyafan82 liyafan82 closed this in 35ace39 Oct 16, 2020
kszucs pushed a commit that referenced this pull request Oct 19, 2020
When translating between the memory FieldType and message FieldType for
dictionary encoded vectors the children of the dictionary field were not
handled correctly.
* When going from memory format to message format the Field must have the
  children of the dictionary field.
* When going from message format to memory format the Field must have no
  children but the dictionary must have the mapped children

Closes #8363 from HedgehogCode/bug/ARROW-10174-dict-structs

Authored-by: Benjamin Wilhelm <[email protected]>
Signed-off-by: liyafan82 <[email protected]>
pribor pushed a commit to GlobalWebIndex/arrow that referenced this pull request Oct 24, 2025
When translating between the memory FieldType and message FieldType for
dictionary encoded vectors the children of the dictionary field were not
handled correctly.
* When going from memory format to message format the Field must have the
  children of the dictionary field.
* When going from message format to memory format the Field must have no
  children but the dictionary must have the mapped children

Closes apache#8363 from HedgehogCode/bug/ARROW-10174-dict-structs

Authored-by: Benjamin Wilhelm <[email protected]>
Signed-off-by: liyafan82 <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants