
Fixing presto native parquet read path for parquet types with repetition level != 0 #9110

Closed
Parth-Brahmbhatt wants to merge 3 commits into prestodb:master from Parth-Brahmbhatt:issue-8709

Conversation

@Parth-Brahmbhatt (Contributor) commented Oct 6, 2017

Fixing presto native parquet read path for parquet types with repetition level != 0, i.e. map, arrays, structs

The issue turned out to be more involved than I originally thought. The Parquet native read path essentially failed to correctly read any type with repetition level != 0, which is basically every non-primitive type. I added a test for map types, as that was the original issue, but I plan to add a couple more test cases, specifically for array type and row type, to ensure future releases can identify these issues early on. Let me know if you would rather have those tests as part of this PR as opposed to in a separate PR.
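For context, here is a toy sketch (illustrative only, not Presto's reader) of why repetition level != 0 covers exactly the non-primitive types: in Parquet's Dremel-style encoding, a column's max repetition level is the number of repeated fields on its path, and its max definition level is the number of optional-or-repeated fields. The schema in the comment below is a hypothetical example.

```java
import java.util.List;

// Toy model of Parquet's Dremel-style nesting levels. Any column nested
// under a map, array, or struct element sits below a repeated field, so
// its max repetition level is != 0 — the class of columns this PR fixes.
class RepetitionLevels {
    enum Repetition { REQUIRED, OPTIONAL, REPEATED }

    // Max repetition level: count of REPEATED fields along the column path.
    static int maxRepetitionLevel(List<Repetition> path) {
        int level = 0;
        for (Repetition r : path) {
            if (r == Repetition.REPEATED) {
                level++;
            }
        }
        return level;
    }

    // Max definition level: count of non-REQUIRED fields along the path.
    static int maxDefinitionLevel(List<Repetition> path) {
        int level = 0;
        for (Repetition r : path) {
            if (r != Repetition.REQUIRED) {
                level++;
            }
        }
        return level;
    }

    public static void main(String[] args) {
        // Path to the element column of a hypothetical list schema:
        // message m { optional group ids (LIST) { repeated group list { required int32 element; } } }
        List<Repetition> path =
                List.of(Repetition.OPTIONAL, Repetition.REPEATED, Repetition.REQUIRED);
        System.out.println(maxRepetitionLevel(path)); // 1: a reader assuming level 0 mishandles this column
        System.out.println(maxDefinitionLevel(path)); // 2
    }
}
```

This mirrors the semantics of parquet-mr's `MessageType.getMaxRepetitionLevel`/`getMaxDefinitionLevel`, but is a self-contained model rather than the library code.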

@facebook-github-bot
Collaborator

Thank you for your pull request and welcome to our community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. In order for us to review and merge your code, please sign up at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need the corporate CLA signed.

If you have received this in error or have any questions, please contact us at cla@fb.com. Thanks!

@facebook-github-bot
Collaborator

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Facebook open source project. Thanks!

nezihyigitbasi self-assigned this Oct 6, 2017
@nezihyigitbasi
Contributor

@Parth-Brahmbhatt I will take a look, thanks! We have several reports about incorrect complex type (and also null) handling so hopefully this will get those fixed. Please also add those tests to this PR.

@zhenxiao please also take a look.

@zhenxiao (Collaborator) commented Oct 6, 2017

@Parth-Brahmbhatt @nezihyigitbasi thanks for looking at it. I will take a look soon.

@dotcomputercraft

@nezihyigitbasi - here is my parquet file. I tried the PR but it still fails on my parquet queries. cc: @zhenxiao
part-00024.parquet.zip

@nezihyigitbasi
Contributor

@dotcomputercraft what is the query that you are trying to run against this file?

@dotcomputercraft

@nezihyigitbasi - I ran this query:

select * from hive.parquethdfs.superman_member_price_memberprice_v0 limit 100;

Here is the error:

java.lang.IndexOutOfBoundsException: Invalid position 0 in block with 3 positions
at com.facebook.presto.spi.block.AbstractRowBlock.getRegionSizeInBytes(AbstractRowBlock.java:104)
at com.facebook.presto.spi.block.ArrayBlock.calculateSize(ArrayBlock.java:91)
at com.facebook.presto.spi.block.ArrayBlock.getSizeInBytes(ArrayBlock.java:82)
at com.facebook.presto.spi.Page.getSizeInBytes(Page.java:66)
at com.facebook.presto.operator.OperatorContext.recordGetOutput(OperatorContext.java:202)
at com.facebook.presto.operator.Driver.processInternal(Driver.java:338)
at com.facebook.presto.operator.Driver.lambda$processFor$6(Driver.java:241)
at com.facebook.presto.operator.Driver.tryWithLock(Driver.java:614)
at com.facebook.presto.operator.Driver.processFor(Driver.java:235)
at com.facebook.presto.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:622)
at com.facebook.presto.execution.executor.PrioritizedSplitRunner.process(PrioritizedSplitRunner.java:163)
at com.facebook.presto.execution.executor.LegacyPrioritizedSplitRunner.process(LegacyPrioritizedSplitRunner.java:23)
at com.facebook.presto.execution.executor.TaskExecutor$TaskRunner.run(TaskExecutor.java:485)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

Here is the description of the table:

presto> desc hive.parquethdfs.superman_member_price_memberprice_v0;
Column            | Type
------------------+---------------------------------------------------------------
autokillthreshold | row(categoryid varchar, dollarmargin bigint, entityid varchar, jetdirectspecific boolean, percentmargin double, pricerangehigh bigint, pricerangelow bigint, relativedollarmargin big
dateutc | varchar
id | varchar
memberpricedata | row(affiliatepricedata row(factor double, inventory bigint, nodeid varchar, packquantity bigint, price double, sourceskufactor bigint, sourceskuid varchar), brand varchar, competito
offers | array(row(basecommission double, competitiveguardrails row(ceiling row(factortype varchar, multiplier double, price double), floor row(factortype varchar, multiplier double, price d
pricecontrol | row(allothersresult row(policystrategyresult row(policystrategy varchar, sourceskuid varchar), pricestrategyresult row(pricestrategy varchar, sourceskuid varchar)), brandoverride ro
pricingstrategy | row(id bigint, name varchar)
qcpricebounds | row(lowerbound bigint, upperbound varchar)
retailsku | row(id varchar, jetnodeid varchar, jetskuid varchar, packquantity bigint, referencepriceoverride double, strategyid bigint)
retailskuid | varchar
savingsstrategy | row(id bigint)
sourceskuid | varchar
status | bigint
statusreason | varchar
timestamp | varchar
event_type | varchar
year | integer
month | integer
day | integer
hour | integer

@dotcomputercraft

@nezihyigitbasi - this type of struct generates the following error as well:

com.facebook.presto.spi.PrestoException: length of field blocks differ: field 0: 1024, block 27: 1022
at com.facebook.presto.hive.parquet.ParquetPageSource.getNextPage(ParquetPageSource.java:214)
at com.facebook.presto.hive.HivePageSource.getNextPage(HivePageSource.java:197)
at com.facebook.presto.operator.TableScanOperator.getOutput(TableScanOperator.java:262)
at com.facebook.presto.operator.Driver.processInternal(Driver.java:337)
at com.facebook.presto.operator.Driver.lambda$processFor$6(Driver.java:241)
at com.facebook.presto.operator.Driver.tryWithLock(Driver.java:614)
at com.facebook.presto.operator.Driver.processFor(Driver.java:235)
at com.facebook.presto.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:622)
at com.facebook.presto.execution.executor.PrioritizedSplitRunner.process(PrioritizedSplitRunner.java:163)
at com.facebook.presto.execution.executor.LegacyPrioritizedSplitRunner.process(LegacyPrioritizedSplitRunner.java:23)
at com.facebook.presto.execution.executor.TaskExecutor$TaskRunner.run(TaskExecutor.java:485)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.IllegalArgumentException: length of field blocks differ: field 0: 1024, block 27: 1022
at com.facebook.presto.spi.block.RowBlock.<init>(RowBlock.java:51)
at com.facebook.presto.hive.parquet.reader.ParquetReader.readStruct(ParquetReader.java:239)
at com.facebook.presto.hive.parquet.reader.ParquetReader.readStruct(ParquetReader.java:218)
at com.facebook.presto.hive.parquet.ParquetPageSource.getNextPage(ParquetPageSource.java:187)
... 13 more
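For readers hitting the same trace: the check that throws here enforces that every field block backing a row block covers the same number of positions. A minimal sketch of that invariant (a hypothetical helper modeled on the exception message, not Presto's actual `RowBlock` code):

```java
// Hypothetical sketch of the invariant behind "length of field blocks differ":
// all field blocks of a row block must expose the same position count. A
// nested-column reader that advances one field by fewer values than its
// siblings (e.g. by mishandling repetition levels) trips this check.
class FieldBlockLengthCheck {
    static void checkFieldBlockLengths(int[] positionCounts) {
        int expected = positionCounts[0];
        for (int i = 1; i < positionCounts.length; i++) {
            if (positionCounts[i] != expected) {
                throw new IllegalArgumentException(
                        "length of field blocks differ: field 0: " + expected
                                + ", block " + i + ": " + positionCounts[i]);
            }
        }
    }

    public static void main(String[] args) {
        checkFieldBlockLengths(new int[] {1024, 1024, 1024}); // equal counts: fine
        try {
            checkFieldBlockLengths(new int[] {1024, 1022});   // mirrors the report above
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```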

@nezihyigitbasi
Contributor

@Parth-Brahmbhatt any ideas why this PR fails with this input & query?

@Parth-Brahmbhatt
Contributor Author

When I tried to explore the file's schema I got the following exception:

parquet cat part-00024.parquet

Unknown error
shaded.org.apache.avro.SchemaParseException: Can't redefine: policystrategyresult
at shaded.org.apache.avro.Schema$Names.put(Schema.java:1127)
at shaded.org.apache.avro.Schema$NamedSchema.writeNameRef(Schema.java:561)
at shaded.org.apache.avro.Schema$RecordSchema.toJson(Schema.java:689)
at shaded.org.apache.avro.Schema$UnionSchema.toJson(Schema.java:881)
at shaded.org.apache.avro.Schema$RecordSchema.fieldsToJson(Schema.java:715)
at shaded.org.apache.avro.Schema$RecordSchema.toJson(Schema.java:700)
at shaded.org.apache.avro.Schema$UnionSchema.toJson(Schema.java:881)
at shaded.org.apache.avro.Schema$RecordSchema.fieldsToJson(Schema.java:715)
at shaded.org.apache.avro.Schema$RecordSchema.toJson(Schema.java:700)
at shaded.org.apache.avro.Schema$UnionSchema.toJson(Schema.java:881)
at shaded.org.apache.avro.Schema$RecordSchema.fieldsToJson(Schema.java:715)
at shaded.org.apache.avro.Schema$RecordSchema.toJson(Schema.java:700)
at shaded.org.apache.avro.Schema.toString(Schema.java:323)
at shaded.org.apache.avro.Schema.toString(Schema.java:313)
at org.apache.parquet.avro.AvroReadSupport.setRequestedProjection(AvroReadSupport.java:56)
at org.apache.parquet.cli.BaseCommand.openDataFile(BaseCommand.java:290)
at org.apache.parquet.cli.commands.CatCommand.run(CatCommand.java:66)
at org.apache.parquet.cli.Main.run(Main.java:142)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.parquet.cli.Main.main(Main.java:172)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)

parquet meta returns the structure. I will try to debug through it sometime this week.

@dotcomputercraft

@Parth-Brahmbhatt - how can I help debug this parquet file? Is there a bootstrap project I can use to help me get started?

@Parth-Brahmbhatt
Contributor Author

I was trying to verify that this is a valid parquet file, which is why I first tried to just cat it using the parquet CLI. Given that it failed to do so, I was first going to check why the parquet CLI can't read it.

If it turns out to be just an issue with the parquet CLI and the file is indeed valid, I was going to modify ParquetTester to read through this file.

You can use https://github.com/apache/parquet-mr/tree/master/parquet-cli or the ParquetTester.java used in this PR as a bootstrap to step through the code.

@dotcomputercraft

@Parth-Brahmbhatt - Thank you. Please let me know what you find. I will try to figure out how to run ParquetTester.java locally.

@dotcomputercraft

@Parth-Brahmbhatt - Can you share your modified ParquetTester.java file once you make your changes to it? I really want to help test out the parquet functionality. I also have another parquet file that generates a different exception.

@dotcomputercraft

@Parth-Brahmbhatt - Any updates on the input file?

@Parth-Brahmbhatt (Contributor Author) commented Oct 13, 2017

@dotcomputercraft From the looks of it, the issue is that you have 3 unions, each defining a column named policystrategyresult, and Avro's code does not seem to namespace them based on the union they are part of. See this avro issue and the proposed fix that uses Avro namespaces. Try generating a file with a different name for this column and check if it works. This is unrelated to this PR; it just seems like an issue in Avro, and there seems to be a workaround if you want to keep the same names.

optional group pricecontrol {
  optional group allothersresult {
    optional group policystrategyresult {
      optional binary policystrategy (UTF8);
      optional binary sourceskuid (UTF8);
    }
    optional group pricestrategyresult {
      optional binary pricestrategy (UTF8);
      optional binary sourceskuid (UTF8);
    }
  }
  optional group brandoverride {
    optional binary brandid (UTF8);
    optional int64 jetnodeid;
    optional group policystrategyresult {
      optional boolean fromauthorizedretailer;
      optional int64 policy;
      optional binary policystrategy (UTF8);
      optional binary sourceskuid (UTF8);
    }
    optional group pricestrategyresult {
      optional boolean fromauthorizedretailer;
      optional binary pricestrategy (UTF8);
      optional binary sourceskuid (UTF8);
    }
  }
  optional group categoryoverride {
    optional int64 jetnodeid;
    optional group policystrategyresult {
      optional boolean fromauthorizedretailer;
      optional binary policystrategy (UTF8);
      optional binary sourceskuid (UTF8);
    }
    optional group pricestrategyresult {
      optional boolean fromauthorizedretailer;
      optional binary pricestrategy (UTF8);
      optional binary sourceskuid (UTF8);
    }
  }
}
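One shape the rename workaround could take (hypothetical names, shown only for the first union; the other two unions would be renamed similarly so that each record name is globally unique when the schema is converted to Avro):

```
optional group allothersresult {
  optional group allotherspolicystrategyresult {
    optional binary policystrategy (UTF8);
    optional binary sourceskuid (UTF8);
  }
  optional group allotherspricestrategyresult {
    optional binary pricestrategy (UTF8);
    optional binary sourceskuid (UTF8);
  }
}
```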

@dotcomputercraft

@Parth-Brahmbhatt - Thank you for looking into this. I will add this PR to my local presto cluster as well. Can you share with me how you debugged the input file? I'd love to learn how to do this as well, so I can contribute to the presto code base in the future. I also have another parquet file that I need to evaluate.

@Parth-Brahmbhatt
Contributor Author

I did not really get to the presto part, as parquet-cli failed to read the file. I just ran

parquet cat your-parquet-file

and attached a debugger to it to see why it was failing.

For the presto part I am relying on the existing Parquet tests. It is simple enough to add a unit test and just run it locally using your IDE or maven.

@dotcomputercraft commented Oct 13, 2017

Archive.zip

@Parth-Brahmbhatt - these are the input files that are generating this error:

com.facebook.presto.spi.PrestoException: length of field blocks differ: field 0: 1024, block 28: 1023
at com.facebook.presto.hive.parquet.ParquetPageSource.getNextPage(ParquetPageSource.java:214)
at com.facebook.presto.hive.HivePageSource.getNextPage(HivePageSource.java:197)
at com.facebook.presto.operator.TableScanOperator.getOutput(TableScanOperator.java:262)
at com.facebook.presto.operator.Driver.processInternal(Driver.java:337)
at com.facebook.presto.operator.Driver.lambda$processFor$6(Driver.java:241)
at com.facebook.presto.operator.Driver.tryWithLock(Driver.java:614)
at com.facebook.presto.operator.Driver.processFor(Driver.java:235)
at com.facebook.presto.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:622)
at com.facebook.presto.execution.executor.PrioritizedSplitRunner.process(PrioritizedSplitRunner.java:163)
at com.facebook.presto.execution.executor.LegacyPrioritizedSplitRunner.process(LegacyPrioritizedSplitRunner.java:23)
at com.facebook.presto.execution.executor.TaskExecutor$TaskRunner.run(TaskExecutor.java:485)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.IllegalArgumentException: length of field blocks differ: field 0: 1024, block 28: 1023
at com.facebook.presto.spi.block.RowBlock.<init>(RowBlock.java:51)
at com.facebook.presto.hive.parquet.reader.ParquetReader.readStruct(ParquetReader.java:239)
at com.facebook.presto.hive.parquet.reader.ParquetReader.readStruct(ParquetReader.java:218)
at com.facebook.presto.hive.parquet.ParquetPageSource.getNextPage(ParquetPageSource.java:187)
... 13 more
I did not have any luck compiling parquet-cli, which is why I asked how you debugged the issue. Which project is the parquet command you describe part of?

@Parth-Brahmbhatt
Contributor Author

I used parquet-1.8.1, not the master branch.
I had to use Java 7 to avoid hitting the velocity issue, and I had to follow the steps mentioned here to install protobuf and thrift.
Finally, I built it at the root level with

LC_ALL=C mvn clean install -Drat.skip=true -Dmaven.test.skip=true

@Parth-Brahmbhatt
Contributor Author

To be clear, we also have our internal version of parquet with the parquet CLI built, and that is what I used to test. But I was able to build parquet using the steps mentioned above, and you can use parquet-tools after that build goes through.

@dotcomputercraft

@Parth-Brahmbhatt - Thank you for the amazing tips and support. I will follow your steps. Thanks Parth-Brahmbhatt. Have an amazing weekend.

@nezihyigitbasi
Contributor

@Parth-Brahmbhatt

Let me know if you would rather have those tests as part of this PR as opposed to in a separate PR.

Please add those tests to this PR and let me know as I want to start reviewing this shortly.

…only works for non-nested complex types even with these changes, i.e. list of list still won't work.
@Parth-Brahmbhatt
Contributor Author

@nezihyigitbasi Added the test cases for row type and array type. Even after these changes, I don't think the native read path will be able to support nested complex types in its current state.

@Parth-Brahmbhatt
Contributor Author

I don't intend to disable com/google/common/base/Function from modernizer. I will remove it.

@nezihyigitbasi
Contributor

@Parth-Brahmbhatt did you give #9156 a try for nesting support?

@Parth-Brahmbhatt
Contributor Author

@nezihyigitbasi I did not see that change. I will take a look sometime this week.

@Parth-Brahmbhatt
Contributor Author

After looking at the changes in #9156, I think that is a much better and wider change. I am going to close this PR; I think we should focus on getting #9156 in.

@nezihyigitbasi
Contributor

@Parth-Brahmbhatt can you please run your internal tests with #9156? That would give us more confidence.
