
cleanup fix for type mismatch in Parquet predicate pushdown #11118

Closed
nishantrayan wants to merge 1 commit into prestodb:master from lyft:predicate_fix_cleanup

Conversation

@nishantrayan
Contributor

Follow-up from #9975 to clean up the logic.

@nishantrayan
Contributor Author

CC @nezihyigitbasi

Contributor

@nezihyigitbasi left a comment

I took a quick pass. I will take a detailed look after you address the comments.

Contributor

As far as I can see in the code, it's possible to pass this in the constructor, so please do so: mark hiveType as final and mark RichColumnDescriptor as immutable. Also, please update the comment in this class.
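A minimal sketch of the suggested shape (hypothetical: plain String fields stand in for the real ColumnDescriptor state and HiveType, which are not reproduced here): the value arrives through the constructor instead of a setter, so the field can be final and the class immutable.

```java
import static java.util.Objects.requireNonNull;

// Hypothetical simplified stand-in for RichColumnDescriptor: hiveType is
// injected via the constructor (no setter), so the field is final and the
// class as a whole is immutable.
final class ImmutableDescriptorSketch
{
    private final String path;     // stands in for the ColumnDescriptor state
    private final String hiveType; // stands in for the real HiveType

    ImmutableDescriptorSketch(String path, String hiveType)
    {
        this.path = requireNonNull(path, "path is null");
        this.hiveType = requireNonNull(hiveType, "hiveType is null");
    }

    String getPath()
    {
        return path;
    }

    String getHiveType()
    {
        return hiveType;
    }
}
```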

Contributor

This stream expression is complex, please use a regular foreach loop instead.
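As an illustration of the suggestion (hypothetical data and names, not the PR's actual pipeline), a filter-and-collect stream rewritten as a plain for-each loop:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical example: a chained stream().filter().collect() pipeline
// replaced with an explicit loop, which is easier to scan and debug.
class ForEachExample
{
    // Stream version for comparison:
    //   names.stream().filter(n -> !n.isEmpty())
    //        .collect(toMap(n -> n, String::length));
    static Map<String, Integer> lengthsOfNonEmpty(List<String> names)
    {
        Map<String, Integer> result = new HashMap<>();
        for (String name : names) {
            if (!name.isEmpty()) {
                result.put(name, name.length());
            }
        }
        return result;
    }
}
```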

Contributor

ditto (use foreach to make this easier to read)

Contributor

I think you can change this logic to pass the hive type to the rich column descriptor constructor instead of setting it via a setter. Please see my comment below for RichColumnDescriptor.

Contributor

use instanceof

Contributor

unnecessary else.

@nishantrayan nishantrayan force-pushed the predicate_fix_cleanup branch from a5a4661 to 9ddd30e Compare July 28, 2018 01:08
@nishantrayan
Contributor Author

@nezihyigitbasi I made the requested cleanups.

Contributor

@nezihyigitbasi left a comment

Thanks, I left some more comments and this is getting pretty close.

Contributor

ImmutableList.copyOf(typeColumns.keySet())

Contributor

ImmutableList.copyOf(typeColumns.keySet())

Contributor

We can put one parameter per line as this line is long:

    public static Map<List<String>, RichColumnDescriptor> getDescriptors(
            MessageType fileSchema,
            MessageType requestedSchema,
            Map<parquet.schema.Type, HiveColumnHandle> typeColumns)
    {

Contributor

We can simplify this a bit:

Optional<RichColumnDescriptor> richColumnDescriptor = getDescriptor(columns, columnPath, hiveColumnHandle);
richColumnDescriptor.ifPresent(descriptor -> descriptorsByPath.put(columnPath, descriptor));

Contributor

Put it on the same line:

HiveType hiveType = hiveColumnHandle == null ? null : hiveColumnHandle.getHiveType();
return Optional.of(new RichColumnDescriptor(columnIO.getColumnDescriptor(), columnIO.getType().asPrimitiveType(), hiveType));

Contributor

remove this else, it's unnecessary.

Contributor

// RichColumnDescriptor extends ColumnDescriptor and exposes the PrimitiveType and HiveType information.

Contributor

this.hiveType = requireNonNull(hiveType, "hiveType is null");

It would be nice if you add non-null checks to other fields too (preferably in a separate commit).

Contributor Author

I think we need hiveType to be optional, looking at a couple of places where RichColumnDescriptor is initialized: https://github.com/prestodb/presto/pull/11118/files#diff-0f09736fc6c9cfc0691776d4365d409bL233

Contributor

Then please update the constructor to accept Optional<HiveType> and add the non-null check.
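A hedged sketch of that constructor shape (hypothetical: Optional<String> stands in for Optional<HiveType>, which is not reproduced here): the Optional reference itself is rejected if null via requireNonNull, while an absent type is modeled by Optional.empty().

```java
import java.util.Optional;

import static java.util.Objects.requireNonNull;

// Hypothetical stand-in: the constructor takes an Optional and rejects a
// null Optional reference, while Optional.empty() means "no hive type".
final class OptionalTypeDescriptorSketch
{
    private final Optional<String> hiveType;

    OptionalTypeDescriptorSketch(Optional<String> hiveType)
    {
        this.hiveType = requireNonNull(hiveType, "hiveType is null");
    }

    Optional<String> getHiveType()
    {
        return hiveType;
    }
}
```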

Contributor

Map<Type, HiveColumnHandle> typeColumns = ImmutableMap.of(getParquetType(columnHandle, fileSchema, true), columnHandle);

Please also update the changes below.

@nishantrayan nishantrayan force-pushed the predicate_fix_cleanup branch 2 times, most recently from df137a6 to f345023 Compare July 31, 2018 18:50
@nishantrayan
Contributor Author

Need 👀 @nezihyigitbasi
Also having trouble with failing CI; I tried to repro locally without any success.

@nezihyigitbasi
Contributor

@nishantrayan I can reproduce the failures if I run TestFullParquetReader locally. One issue that's causing test failures is in ParquetTypeUtils::getDescriptor() on L139: Optional.of(hiveType). hiveType can be null, so that call can throw; it should be Optional.ofNullable(hiveType). Please also rebase.
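The difference is easy to demonstrate in isolation with plain JDK Optional (this is a standalone illustration, not the Presto code itself):

```java
import java.util.Optional;

// Demonstrates the failure mode described above: Optional.of(null) throws
// NullPointerException, while Optional.ofNullable(null) returns an empty
// Optional instead.
class OfNullableDemo
{
    static Optional<String> wrapSafely(String hiveType)
    {
        // Safe even when hiveType is null.
        return Optional.ofNullable(hiveType);
    }
}
```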

@nishantrayan nishantrayan force-pushed the predicate_fix_cleanup branch from f345023 to 60f8564 Compare August 5, 2018 23:14
@nezihyigitbasi
Contributor

There are test failures.

@nishantrayan
Contributor Author

👀 into the failures

@nishantrayan
Contributor Author

@nezihyigitbasi I am having trouble finding the actual failure: the integration test console doesn't give me much info, and local tests are passing. Is there any way I can run / reproduce the failure locally?

@nishantrayan
Contributor Author

rebased with latest master

@nishantrayan
Contributor Author

@nezihyigitbasi wondering if you have some pointers on how to find the actual failure reason and get to the bottom of this.

@nishantrayan nishantrayan force-pushed the predicate_fix_cleanup branch 2 times, most recently from 191623b to efd3093 Compare October 11, 2018 03:31
if (descriptor.getHiveType().isPresent()) {
    TypeInfo typeInfo = descriptor.getHiveType().get().getTypeInfo();
    switch (typeInfo.getTypeName()) {
        case StandardTypes.SMALLINT:
Contributor

Does tinyint need to be handled here? In #8243 (comment) they mention getting the error Mismatched Domain types: tinyint vs integer.

Also, maybe verify that dictionary filtering works for these smaller int types. It looks like min/max filtering would work here, but dictionary filtering doesn't check for those types (e.g. here), although that's probably a less common use case.

.map(column -> getParquetType(column, fileSchema, useParquetColumnNames))
.filter(Objects::nonNull)
.collect(toList());
Map<parquet.schema.Type, HiveColumnHandle> typeColumns = new HashMap<>();
Contributor

The nonNull filter got removed from this, but the copy of this logic in ParquetPageSourceFactory still does it. Possibly just consolidate this logic into ParquetTypeUtils and call it from both places.

@findepi
Contributor

findepi commented Mar 4, 2019

@nishantrayan
I changed the Parquet predicate pushdown code in trinodb/trino#131.
I was aware of your PR in this space, but it appeared I could fix this in a simpler way.
This change was released in Presto 302 (about a month ago).
(@nezihyigitbasi is backporting the fix in #12408, so it will be available in this repo as well.)

@puneetjaiswal puneetjaiswal restored the predicate_fix_cleanup branch April 8, 2019 17:21
@aweisberg
Contributor

Seems like Nezih backported the PrestoSQL fix for this? #12408

Closing. Please reopen if I am incorrect.

@aweisberg aweisberg closed this Jun 26, 2019