Fix type mismatch in Parquet predicate pushdown#9975

Closed

nishantrayan wants to merge 1 commit intoprestodb:masterfrom

lyft:fix_parquet_predicate_pushdown

Contributor

nishantrayan commented Feb 15, 2018

The parquet predicate pushdown logic was refactored to allow nested pushdown in the beginning of last year and seems like there was one minor but annoying bug that is still in the latest version and issue was opened here: #9084
We use presto with hive and can reliably reproduce this with the schema containing column of fixed varchar (say varchar(255))
The effectivePredicate seems to have the hive's column type which is varchar(255) however the type header of the parquet chunk gets translated to varchar and because of this mismatch we fail the type check that happens later.

Caused by: java.lang.IllegalArgumentException: Mismatched Domain types: varchar(255) vs varchar
	at com.facebook.presto.spi.predicate.Domain.checkCompatibility(Domain.java:228)
	at com.facebook.presto.spi.predicate.Domain.intersect(Domain.java:183)
	at com.facebook.presto.spi.predicate.TupleDomain.intersect(TupleDomain.java:196)
	at com.facebook.presto.spi.predicate.TupleDomain.overlaps(TupleDomain.java:299)

Please have a look at this @nezihyigitbasi @zhenxiao @dain

facebook-github-bot added the CLA Signed label

nezihyigitbasi requested changes

View reviewed changes

Contributor

nezihyigitbasi left a comment

Please squash all the commits and make sure the commit title follows our convention: https://chris.beams.io/posts/git-commit/

...ve/src/main/java/com/facebook/presto/hive/parquet/predicate/TupleDomainParquetPredicate.java

                           domains.put(column, domain);
                       }
                       TupleDomain<ColumnDescriptor> stripeDomain = TupleDomain.withColumnDomains(domains.build());

Contributor

nezihyigitbasi Feb 16, 2018

Unnecessary change.

...ve/src/main/java/com/facebook/presto/hive/parquet/predicate/TupleDomainParquetPredicate.java

+                  private Type getType(RichColumnDescriptor column)
+                  {
+                      Optional<Map<ColumnDescriptor, Domain>> predicateDomains = effectivePredicate.getDomains();

Contributor

nezihyigitbasi Feb 16, 2018

Please add a comment here about why we first look at effectivePredicate here.

presto-hive/src/test/java/com/facebook/presto/hive/parquet/TestTupleDomainParquetPredicate.java Outdated

+                  @Test
+                  public void testMatchesWithStatistics()
+                          throws Exception

Contributor

nezihyigitbasi Feb 16, 2018

Unnecessary throws clause.

Contributor

nezihyigitbasi Feb 16, 2018

You can simplify this method as (variables are renamed, unnecessary line breaks removed, methods static imported)

public void testMatchesWithStatistics()
{
    RichColumnDescriptor column = getColumn();
    String value = "Test";
    TupleDomain<ColumnDescriptor> effectivePredicate = getEffectivePredicate(column, createVarcharType(255), value);
    List<RichColumnDescriptor> columns = singletonList(column);
    TupleDomainParquetPredicate predicate = new TupleDomainParquetPredicate(effectivePredicate, columns);
    Statistics stats = getStatsBasedOnType(column.getType());
    stats.setNumNulls(1L);
    stats.setMinMaxFromBytes(value.getBytes(), value.getBytes());
    assertTrue(predicate.matches(2, singletonMap(column, stats)));
}

presto-hive/src/test/java/com/facebook/presto/hive/parquet/TestTupleDomainParquetPredicate.java Outdated

+                      return TupleDomain.withColumnDomains(predicateColumns);
+                  }
+                  private RichColumnDescriptor getTableColumn()

Contributor

nezihyigitbasi Feb 16, 2018

rename as getColumn

presto-hive/src/test/java/com/facebook/presto/hive/parquet/TestTupleDomainParquetPredicate.java Outdated

+                  private RichColumnDescriptor getTableColumn()
+                  {
+                      PrimitiveType.PrimitiveTypeName typeName = PrimitiveType.PrimitiveTypeName.BINARY;

Contributor

nezihyigitbasi Feb 16, 2018

import PrimitiveTypeName

Contributor

nezihyigitbasi Feb 16, 2018

you can simplify this method as

private RichColumnDescriptor getColumn()
{
    PrimitiveType type = new PrimitiveType(Repetition.OPTIONAL, PrimitiveTypeName.BINARY, "Test column");
    return new RichColumnDescriptor(new String[] {"path"}, type, 0, 0);
}

presto-hive/src/test/java/com/facebook/presto/hive/parquet/TestTupleDomainParquetPredicate.java Outdated

+                              new ParquetDictionaryDescriptor(tableColumn, Optional.of(page)))));
+                  }
+                  private TupleDomain<ColumnDescriptor> getEffectivePredicate(RichColumnDescriptor tableColumn,

Contributor

nezihyigitbasi Feb 16, 2018

you can simplify this method as

private TupleDomain<ColumnDescriptor> getEffectivePredicate(RichColumnDescriptor column, VarcharType type, String value)
{
    ColumnDescriptor predicateColumn = new ColumnDescriptor(column.getPath(), column.getType(), 0, 0);
    Domain predicateDomain = Domain.singleValue(type, utf8Slice(value));
    Map<ColumnDescriptor, Domain> predicateColumns = singletonMap(predicateColumn, predicateDomain);
    return TupleDomain.withColumnDomains(predicateColumns);
}

presto-hive/src/test/java/com/facebook/presto/hive/parquet/TestTupleDomainParquetPredicate.java Outdated

+                  @Test
+                  public void testMatchesWithDescriptors()
+                          throws Exception

Contributor

nezihyigitbasi Feb 16, 2018

unnecessary throws clause.

Contributor

nezihyigitbasi Feb 16, 2018

you can simplify this method as

public void testMatchesWithDescriptors()
{
    RichColumnDescriptor column = getColumn();
    String value = "Test";
    TupleDomain<ColumnDescriptor> effectivePredicate = getEffectivePredicate(column, createVarcharType(255), value);
    List<RichColumnDescriptor> columns = singletonList(column);
    TupleDomainParquetPredicate predicate = new TupleDomainParquetPredicate(effectivePredicate, columns);
    ParquetDictionaryPage page = new ParquetDictionaryPage(utf8Slice(value), 2, PLAIN_DICTIONARY);
    assertTrue(predicate.matches(singletonMap(column, new ParquetDictionaryDescriptor(column, Optional.of(page)))));
}

presto-hive/src/test/java/com/facebook/presto/hive/parquet/TestTupleDomainParquetPredicate.java

               import static org.testng.Assert.assertEquals;
+              import static org.testng.Assert.assertTrue;
               public class TestTupleDomainParquetPredicate

Contributor

nezihyigitbasi Feb 16, 2018

I proposed some changes to the test code to rename variables, static import some methods, remove throws, etc. Here is the entire test class with these changes: https://gist.github.com/nezihyigitbasi/6c60c0fc163fe03eb82bd61d5545130a

I also added comments below regarding those changes.

nishantrayan force-pushed the fix_parquet_predicate_pushdown branch from a85cd63 to 977fcd7 Compare

February 21, 2018 00:26

nishantrayan changed the title ~~Fixing bug in parquet predicate pushdown for varchar column type~~ Fix parquet predicate pushdown type mismatch bug


          Fix parquet predicate pushdown type mismatch bug

36b2447

nishantrayan force-pushed the fix_parquet_predicate_pushdown branch from 977fcd7 to 36b2447 Compare

February 22, 2018 02:59

Contributor Author

nishantrayan commented Feb 22, 2018

@nezihyigitbasi Thanks for the prompt feedback on the PR. I have made the changes you requested. I was also able to deploy this fix in our internal cluster and verify that its working great.

nezihyigitbasi requested changes

View reviewed changes

Contributor

nezihyigitbasi left a comment

Can you update the commit message as Fix type mismatch in Parquet predicate pushdown

...ve/src/main/java/com/facebook/presto/hive/parquet/predicate/TupleDomainParquetPredicate.java

+                  {
+                      // we look at effective predicate domain because it more accurately matches the hive column
+                      // than the type available in the parquet metadata passed here as RichColumnDescriptor
+                      // for example varchar(len) hive column is translated to binary and then to varchar type using parquet metadata

Contributor

nezihyigitbasi Feb 23, 2018

for example varchar(len) hive column is translated to binary.

Can you elaborate please?

Contributor Author

nishantrayan Feb 25, 2018

I took a stab at breaking it down to as much detail as I can.

During predicate match we need the domain type of both the column in the predicate (if specified) and column from schema to match. TupleDomainParquetPredicate#matches
Now for varchar(fixedLen)effectivePredicate's domain aligns with the type specified in the hive schema.
However the hive schema columns uses RichColumnDescriptor which extends ColumnDescriptor from hive. This class cannot capture the varchar(fixLen) the same way and it doesn't seem straightforward to change that.

Looking at my comment above I can see that I made the mistake of saying from parquet metadata. Let me know if the explanation above make sense and if you have any suggestion to reword that comment.
Again thanks for offering feedback on this. I think this might be affecting a lot more users than just us.

nishantrayan changed the title ~~Fix parquet predicate pushdown type mismatch bug~~ Fix type mismatch in Parquet predicate pushdown

luohao mentioned this pull request

Fix type mismatch in Parquet predicate pushdown #10008

Closed

Contributor

nezihyigitbasi commented Feb 27, 2018

I feel like this is a bit hacky as we get the type of the column from the predicate instead of the column descriptor (which currently does not have that information). So, I think a more proper approach is to add the Presto/Hive type info to RichColumnDescriptor.

InParquetPageSourceFactory::createParquetPageSource() we have the HiveTypes as part of List<HiveColumnHandle> columns (from HiveType you can get the type information you want) and we can keep this type information in the RichColumnDescriptor when we create them. Then in ParquetTypeUtils::getPrestoType() we can consult that type info to get the Presto types. What do you think?

Contributor Author

nishantrayan commented Mar 5, 2018

hey @nezihyigitbasi really good point. I think it makes sense. Let me take a stab at it. If it works out shall I open a separate PR since the new way might be significantly different from my initial one.

Contributor

nezihyigitbasi commented Mar 5, 2018

Yes, please open a follow-up PR and link/refer to this one.

tooptoop4 commented Jul 6, 2018

can this be included in .206 release? its real blocker for us as essentially means presto can't read parquet data from hive. Similar issues: #10957, #9084

tooptoop4 mentioned this pull request

Mismatched Domain types: varchar(30) vs varchar #10957

Closed

Contributor

nezihyigitbasi commented Jul 9, 2018

@nishantrayan are you still working on this issue? Do you have plans to open a follow up PR as discussed above?

Contributor Author

nishantrayan commented Jul 9, 2018

Hi @nezihyigitbasi sorry for the silence on this one. Yes. I will open a follow up PR soon on this one.

Contributor

nezihyigitbasi commented Jul 23, 2018

@nishantrayan For now I am merging this PR with several changes as several people are asking for it and we are in the process of removing the old Parquet reader, so we want to resolve the issues in the new/optimized reader. Please open a follow up PR if you want to do the changes we talked about before.

The changes I made to this PR are:

The test testMatchesWithDescriptors was broken, it wasn't really testing the new code path (you can easily verify this by reverting your change and running this test, you will see that it still passes without your change). If you put a breakpoint to TupleDomainParquetPredicate::getDomain() you will see that the dictionary page used in the test throws an exception on line dictionary = dictionaryPage.get().getEncoding().initDictionary(columnDescriptor, dictionaryPage.get());. I had to craft an empty dictionary page carefully to test the new code path.
I moved the change from TupleDomainParquetPredicate to ParquetTypeUtils and localized the change to only varchar type handling, because that's the only type that's affected.

Thanks for your contribution!

nezihyigitbasi closed this

nishantrayan mentioned this pull request

cleanup fix for type mismatch in Parquet predicate pushdown #11118

Closed

Contributor Author

nishantrayan commented Jul 24, 2018

@nezihyigitbasi thanks for merging. The changes you made makes sense.
I also made an attempt to cleanup based on your suggestion in a separate PR : #11118
please have a look at it when you get a chance.

msdotnetclr mentioned this pull request

Queries getting QUEUED after upgrading to 0.203-e.0.2 starburstdata/facebook-presto#51

Closed

shubhamtagra mentioned this pull request

Query failures due to disparity in Presto type deduction logic from Hive schema and from Parquet column descriptor #9084

Closed

puneetjaiswal deleted the fix_parquet_predicate_pushdown branch

February 8, 2019 22:21

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet