Fixing presto native parquet read path for parquet types with repetit…#9110
Fixing presto native parquet read path for parquet types with repetit…#9110Parth-Brahmbhatt wants to merge 3 commits intoprestodb:masterfrom
Conversation
|
Thank you for your pull request and welcome to our community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. In order for us to review and merge your code, please sign up at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need the corporate CLA signed. If you have received this in error or have any questions, please contact us at cla@fb.com. Thanks! |
|
Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Facebook open source project. Thanks! |
|
@Parth-Brahmbhatt I will take a look, thanks! We have several reports about incorrect complex type (and also null) handling so hopefully this will get those fixed. Please also add those tests to this PR. @zhenxiao please also take a look. |
|
@Parth-Brahmbhatt @nezihyigitbasi thanks for looking at it |
|
@nezihyigitbasi - here is my parquet file. I tried the PR but it still fails on my parquet queries. cc: @zhenxiao |
|
@dotcomputercraft what is the query that you are trying to run against this file? |
|
@nezihyigitbasi - I ran this query -- select * from hive.parquethdfs.superman_member_price_memberprice_v0 limit 100; --- here is the error: java.lang.IndexOutOfBoundsException: Invalid position 0 in block with 3 positions |
|
@nezihyigitbasi - this type of type struct generates the following error as well: --- com.facebook.presto.spi.PrestoException: length of field blocks differ: field 0: 1024, block 27: 1022 |
|
@Parth-Brahmbhatt any ideas why this PR fails with this input & query? |
|
When I tried to explore the file's schema I get the following exception. parquet meta returns the structure. I will try and debug through it sometime this week. |
|
@Parth-Brahmbhatt - how can I help debug this parquet file. Is there a bootstrap project I can use to help me get started? |
|
I was trying to verify that this is a valid parquet file, which is why I first tried to just cat it using parquet CLI. Given it failed to do so I was first going to check why Parquet CLI can't read it. If it just turned out to be an issue with Parquet CLI and file was indeed valid I was going to use ParquetTester and modify it to read through this file. You can use https://github.com/apache/parquet-mr/tree/master/parquet-cli or ParquetTester.java used in this PR as bootstrap to just strep though code. |
|
@Parth-Brahmbhatt - Thank you. Please let me know what you find. I will try to figure out how to run ParquetTester.java locally. |
|
@Parth-Brahmbhatt - Can you share your modified ParquetTester.java file when make your changes to it? I really want to help test out the parquet functionality. I also have another parquet file that generates another exception. |
|
@Parth-Brahmbhatt - Any updates on the input file? |
|
@dotcomputercraft From the looks of it the issue is you have 3 unions each defining a column named
|
|
@Parth-Brahmbhatt - Thank you for looking into this. I will add this PR to my local presto cluster as well. Can you share with me how you debugged the input file? I love learn how to do this as well to contribute to the presto code base in the future. I also have another parquet file that I need to evaluate. |
|
I did not really got to the presto part as parquet-cli failed to read. I just ran
and attached a debugger to it to see why it was failing. For the presto part I am relying on the Parquet tests that are existing. It is simple enough to add a unit test and just run it locally using your ide or maven. |
|
Archive.zip @Parth-Brahmbhatt - these are the input files that are generating this error -- com.facebook.presto.spi.PrestoException: length of field blocks differ: field 0: 1024, block 28: 1023 |
|
I used parquet-1.8.1 not the master branch.
|
|
To be clear we also have our internal version of parquet that has parquet CLI built and I used that to test but I was able to build parquet using the steps mentioned above and you can use parquet-tools after that build goes through. |
|
@Parth-Brahmbhatt - Thank you for the amazing tips and support. I will follow your steps. Thanks Parth-Brahmbhatt. Have an amazing weekend. |
Please add those tests to this PR and let me know as I want to start reviewing this shortly. |
…ion level != 0, i.e. map, arrays, structs
…only works for non nested complex types even with these changes, i.e. list of list still won't work.
888ad93 to
ec13bf1
Compare
|
@nezihyigitbasi Added the test cases for row type and array type. Even after these changes I don't think the native read path will be able to support nested complex types in the current state. |
|
Don't intend to disable |
|
@Parth-Brahmbhatt did you give #9156 a try for nesting support? |
|
@nezihyigitbasi I did not see that change. I will take a look sometime this week. |
|
@Parth-Brahmbhatt can you please run your internal tests with #9156? That would give us more confidence. |
…ion level != 0, i.e. map, arrays, structs
The issue turned out to be more involved than what I originally thought. Parquet native read path essentially failed to read any type with repetition level !=0 correctly, which is basically all non primitive types. I added a test for maptypes as that was the original issue but I plan to add couple more test cases, specifically for array type and row type to ensure future releases can identify these issues early on. Let me know if you would rather have those tests as part of this PR as oppose to in a separate PR.