Fix type mismatch in Parquet predicate pushdown#9975

Closed
nishantrayan wants to merge 1 commit into prestodb:master from lyft:fix_parquet_predicate_pushdown

Conversation

@nishantrayan
Contributor

The Parquet predicate pushdown logic was refactored at the beginning of last year to allow nested pushdown, but it seems one minor yet annoying bug is still present in the latest version; an issue was opened here: #9084
We use Presto with Hive and can reliably reproduce this with a schema containing a fixed-length varchar column (say varchar(255)).
The effectivePredicate carries Hive's column type, which is varchar(255), but the type header of the Parquet chunk gets translated to plain varchar, and because of this mismatch we fail the type check that happens later.

Caused by: java.lang.IllegalArgumentException: Mismatched Domain types: varchar(255) vs varchar
	at com.facebook.presto.spi.predicate.Domain.checkCompatibility(Domain.java:228)
	at com.facebook.presto.spi.predicate.Domain.intersect(Domain.java:183)
	at com.facebook.presto.spi.predicate.TupleDomain.intersect(TupleDomain.java:196)
	at com.facebook.presto.spi.predicate.TupleDomain.overlaps(TupleDomain.java:299)
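
As a minimal, self-contained model of what goes wrong (these are simplified stand-ins, not the actual classes in com.facebook.presto.spi.predicate), the compatibility check requires both sides of the intersection to carry exactly the same type, so varchar(255) vs varchar fails:

```java
// Hypothetical, simplified model of Domain.checkCompatibility: both sides of
// an intersection must carry the same type, or an exception is thrown.
public class DomainTypeMismatch
{
    static String checkCompatibility(String predicateType, String columnType)
    {
        if (!predicateType.equals(columnType)) {
            throw new IllegalArgumentException(
                    "Mismatched Domain types: " + predicateType + " vs " + columnType);
        }
        return predicateType;
    }

    public static void main(String[] args)
    {
        try {
            // Hive schema type vs. the type derived from Parquet's BINARY header
            checkCompatibility("varchar(255)", "varchar");
        }
        catch (IllegalArgumentException e) {
            System.out.println(e.getMessage()); // Mismatched Domain types: varchar(255) vs varchar
        }
    }
}
```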

Please have a look at this @nezihyigitbasi @zhenxiao @dain

Contributor

@nezihyigitbasi left a comment

Please squash all the commits and make sure the commit title follows our convention: https://chris.beams.io/posts/git-commit/

domains.put(column, domain);
}
TupleDomain<ColumnDescriptor> stripeDomain = TupleDomain.withColumnDomains(domains.build());

Contributor

Unnecessary change.


private Type getType(RichColumnDescriptor column)
{
Optional<Map<ColumnDescriptor, Domain>> predicateDomains = effectivePredicate.getDomains();
Contributor

Please add a comment here about why we first look at effectivePredicate.


@Test
public void testMatchesWithStatistics()
throws Exception
Contributor

Unnecessary throws clause.

Contributor

You can simplify this method as follows (variables renamed, unnecessary line breaks removed, methods statically imported):

public void testMatchesWithStatistics()
{
    RichColumnDescriptor column = getColumn();
    String value = "Test";
    TupleDomain<ColumnDescriptor> effectivePredicate = getEffectivePredicate(column, createVarcharType(255), value);
    List<RichColumnDescriptor> columns = singletonList(column);
    TupleDomainParquetPredicate predicate = new TupleDomainParquetPredicate(effectivePredicate, columns);
    Statistics stats = getStatsBasedOnType(column.getType());
    stats.setNumNulls(1L);
    stats.setMinMaxFromBytes(value.getBytes(), value.getBytes());
    assertTrue(predicate.matches(2, singletonMap(column, stats)));
}

return TupleDomain.withColumnDomains(predicateColumns);
}

private RichColumnDescriptor getTableColumn()
Contributor

rename as getColumn


private RichColumnDescriptor getTableColumn()
{
PrimitiveType.PrimitiveTypeName typeName = PrimitiveType.PrimitiveTypeName.BINARY;
Contributor

import PrimitiveTypeName

Contributor

you can simplify this method as

private RichColumnDescriptor getColumn()
{
    PrimitiveType type = new PrimitiveType(Repetition.OPTIONAL, PrimitiveTypeName.BINARY, "Test column");
    return new RichColumnDescriptor(new String[] {"path"}, type, 0, 0);
}

new ParquetDictionaryDescriptor(tableColumn, Optional.of(page)))));
}

private TupleDomain<ColumnDescriptor> getEffectivePredicate(RichColumnDescriptor tableColumn,
Contributor

you can simplify this method as

private TupleDomain<ColumnDescriptor> getEffectivePredicate(RichColumnDescriptor column, VarcharType type, String value)
{
    ColumnDescriptor predicateColumn = new ColumnDescriptor(column.getPath(), column.getType(), 0, 0);
    Domain predicateDomain = Domain.singleValue(type, utf8Slice(value));
    Map<ColumnDescriptor, Domain> predicateColumns = singletonMap(predicateColumn, predicateDomain);
    return TupleDomain.withColumnDomains(predicateColumns);
}


@Test
public void testMatchesWithDescriptors()
throws Exception
Contributor

unnecessary throws clause.

Contributor

you can simplify this method as

public void testMatchesWithDescriptors()
{
    RichColumnDescriptor column = getColumn();
    String value = "Test";
    TupleDomain<ColumnDescriptor> effectivePredicate = getEffectivePredicate(column, createVarcharType(255), value);
    List<RichColumnDescriptor> columns = singletonList(column);
    TupleDomainParquetPredicate predicate = new TupleDomainParquetPredicate(effectivePredicate, columns);
    ParquetDictionaryPage page = new ParquetDictionaryPage(utf8Slice(value), 2, PLAIN_DICTIONARY);
    assertTrue(predicate.matches(singletonMap(column, new ParquetDictionaryDescriptor(column, Optional.of(page)))));
}

import static org.testng.Assert.assertEquals;
import static org.testng.Assert.assertTrue;

public class TestTupleDomainParquetPredicate
Contributor

I proposed some changes to the test code to rename variables, static import some methods, remove throws, etc. Here is the entire test class with these changes: https://gist.github.com/nezihyigitbasi/6c60c0fc163fe03eb82bd61d5545130a

I also added comments below regarding those changes.

@nishantrayan force-pushed the fix_parquet_predicate_pushdown branch from a85cd63 to 977fcd7 (February 21, 2018 00:26)
@nishantrayan changed the title from "Fixing bug in parquet predicate pushdown for varchar column type" to "Fix parquet predicate pushdown type mismatch bug" (Feb 21, 2018)
@nishantrayan force-pushed the fix_parquet_predicate_pushdown branch from 977fcd7 to 36b2447 (February 22, 2018 02:59)
@nishantrayan
Contributor Author

@nezihyigitbasi Thanks for the prompt feedback on the PR. I have made the changes you requested. I was also able to deploy this fix in our internal cluster and verify that it's working great.

Contributor

@nezihyigitbasi left a comment

Can you update the commit message to Fix type mismatch in Parquet predicate pushdown?

{
// we look at effective predicate domain because it more accurately matches the hive column
// than the type available in the parquet metadata passed here as RichColumnDescriptor
// for example varchar(len) hive column is translated to binary and then to varchar type using parquet metadata
Contributor

for example varchar(len) hive column is translated to binary.

Can you elaborate please?

Contributor Author

I took a stab at breaking it down in as much detail as I can.

  1. During predicate matching, we need the domain types of both the column in the predicate (if specified) and the column from the schema to match. See TupleDomainParquetPredicate#matches.
  2. For varchar(fixedLen), the effectivePredicate's domain aligns with the type specified in the Hive schema.
  3. However, the Hive schema columns use RichColumnDescriptor, which extends Parquet's ColumnDescriptor. This class cannot capture varchar(fixedLen) the same way, and it doesn't seem straightforward to change that.

Looking at my comment above, I can see that I made the mistake of saying "from parquet metadata". Let me know if the explanation above makes sense and whether you have any suggestions to reword that comment.
Again, thanks for offering feedback on this. I think this might be affecting a lot more users than just us.

@nishantrayan changed the title from "Fix parquet predicate pushdown type mismatch bug" to "Fix type mismatch in Parquet predicate pushdown" (Feb 23, 2018)
@nezihyigitbasi
Contributor

I feel like this is a bit hacky as we get the type of the column from the predicate instead of the column descriptor (which currently does not have that information). So, I think a more proper approach is to add the Presto/Hive type info to RichColumnDescriptor.

In ParquetPageSourceFactory::createParquetPageSource() we have the HiveTypes as part of List<HiveColumnHandle> columns (from HiveType you can get the type information you want), and we can keep this type information in the RichColumnDescriptor when we create them. Then in ParquetTypeUtils::getPrestoType() we can consult that type info to get the Presto types. What do you think?
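
A rough sketch of that idea, using simplified stand-in types rather than Presto's actual classes (the ColumnDescriptor record and getPrestoType method here are hypothetical mock-ups, not the real RichColumnDescriptor or ParquetTypeUtils API):

```java
import java.util.Optional;

// Hypothetical, simplified model of the proposal: carry the Hive/Presto type
// on the descriptor itself, so type resolution no longer needs the predicate.
public class RichDescriptorSketch
{
    // Stand-in for RichColumnDescriptor, extended with the optional Hive type.
    record ColumnDescriptor(String name, String parquetType, Optional<String> hiveType) {}

    // Stand-in for ParquetTypeUtils::getPrestoType: prefer the Hive type when
    // present, otherwise fall back to mapping the Parquet physical type.
    static String getPrestoType(ColumnDescriptor column)
    {
        return column.hiveType().orElseGet(() ->
                column.parquetType().equals("BINARY") ? "varchar" : column.parquetType().toLowerCase());
    }

    public static void main(String[] args)
    {
        ColumnDescriptor withHiveType = new ColumnDescriptor("name", "BINARY", Optional.of("varchar(255)"));
        ColumnDescriptor withoutHiveType = new ColumnDescriptor("name", "BINARY", Optional.empty());
        System.out.println(getPrestoType(withHiveType));    // varchar(255)
        System.out.println(getPrestoType(withoutHiveType)); // varchar
    }
}
```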

@nishantrayan
Contributor Author

Hey @nezihyigitbasi, really good point. I think it makes sense. Let me take a stab at it. If it works out, shall I open a separate PR, since the new approach might be significantly different from my initial one?

@nezihyigitbasi
Contributor

Yes, please open a follow-up PR and link/refer to this one.

@tooptoop4

Can this be included in the .206 release? It's a real blocker for us, as it essentially means Presto can't read Parquet data from Hive. Similar issues: #10957, #9084

@nezihyigitbasi
Contributor

@nishantrayan are you still working on this issue? Do you have plans to open a follow up PR as discussed above?

@nishantrayan
Contributor Author

Hi @nezihyigitbasi, sorry for the silence on this one. Yes, I will open a follow-up PR soon.

@nezihyigitbasi
Contributor

@nishantrayan For now I am merging this PR with some changes, as several people are asking for it, and we are in the process of removing the old Parquet reader, so we want to resolve the issues in the new/optimized reader. Please open a follow-up PR if you want to make the changes we talked about before.

The changes I made to this PR are:

  • The test testMatchesWithDescriptors was broken; it wasn't really testing the new code path (you can easily verify this by reverting your change and running the test: it still passes). If you put a breakpoint at TupleDomainParquetPredicate::getDomain() you will see that the dictionary page used in the test throws an exception on the line dictionary = dictionaryPage.get().getEncoding().initDictionary(columnDescriptor, dictionaryPage.get());. I had to craft an empty dictionary page carefully to test the new code path.

  • I moved the change from TupleDomainParquetPredicate to ParquetTypeUtils and localized the change to only varchar type handling, because that's the only type that's affected.
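
To illustrate what localizing the change to varchar means, here is a hedged, self-contained sketch (simplified stand-ins, not the actual ParquetTypeUtils code): only when the Parquet physical type would map to plain varchar do we consult the effective predicate for a more precise type such as varchar(255).

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical, simplified model of localizing the fix to varchar handling:
// types other than varchar are mapped directly from the Parquet physical type,
// and only the varchar case consults the predicate's (Hive-derived) type.
public class VarcharTypeFix
{
    static String getPrestoType(String parquetPhysicalType, String columnName,
            Map<String, String> predicateTypes)
    {
        if (parquetPhysicalType.equals("BINARY")) {
            // Prefer the predicate's type (e.g. varchar(255)) over plain varchar.
            return Optional.ofNullable(predicateTypes.get(columnName)).orElse("varchar");
        }
        return parquetPhysicalType.toLowerCase();
    }

    public static void main(String[] args)
    {
        Map<String, String> predicateTypes = Map.of("name", "varchar(255)");
        System.out.println(getPrestoType("BINARY", "name", predicateTypes));  // varchar(255)
        System.out.println(getPrestoType("BINARY", "other", predicateTypes)); // varchar
    }
}
```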

Thanks for your contribution!

@nishantrayan
Contributor Author

@nezihyigitbasi thanks for merging. The changes you made make sense.
I also made an attempt to clean up based on your suggestion in a separate PR: #11118
Please have a look at it when you get a chance.
