Skip to content

Fix New Parquet Reader reading array<struct<type:string>>#10849

Closed
zhenxiao wants to merge 1 commit intoprestodb:masterfrom
zhenxiao:parquet-fix
Closed

Fix New Parquet Reader reading array<struct<type:string>>#10849
zhenxiao wants to merge 1 commit intoprestodb:masterfrom
zhenxiao:parquet-fix

Conversation

@zhenxiao
Copy link
Collaborator

Hi @dain @electrum @nezihyigitbasi @kgalieva
found a bug when running New Parquet reader for our production traffic (seems like the only bug)

when reading array<structtype:string>, columnIO is primitive for the inside struct

With this fix, all of our highly nested Parquet queries are good

@kgalieva
Copy link
Contributor

@zhenxiao Thank you for your commit and for catching this issue! I've spent some time investigating the root cause of the problem and realised that the issue is a little bit broader than parsing array<struct<primitive>> structure. Arrays with nested struct with single element(array<struct<struct<...>>) are also affected.

It's my fault, I missed one of the backward-compatibility rules from parquet spec.
It states:
"If the repeated field is a group with one field and is named either array or uses the LIST-annotated group's name with _tuple appended then the repeated type is the element type and elements are required." (https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#lists)

This means that:

  optional group list (LIST) {
    repeated group array_element {
      optional binary test (UTF8);
    }
  }

must be parsed as array<string>

Whereas

  optional group list (LIST) {
    repeated group array {
      optional binary test (UTF8);
    }
  }

must be parsed as array<struct<test:string>>.

I've implemented another fix, it covers all affected cases. Additionally, I had to refactor parquet writer used for tests, so that it could be used for writing arrays with and without middle group level.

Thanks again for catching this issue. Will really appreciate if you review the fix #10852. Hope it solves all backward-compatibility issues!

@zhenxiao
Copy link
Collaborator Author

thank you @kgalieva
#10852 looks good. I will close this one

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants