Fix type mismatch in Parquet predicate pushdown #9975
nishantrayan wants to merge 1 commit into prestodb:master
nezihyigitbasi
left a comment
Please squash all the commits and make sure the commit title follows our convention: https://chris.beams.io/posts/git-commit/
            domains.put(column, domain);
        }
        TupleDomain<ColumnDescriptor> stripeDomain = TupleDomain.withColumnDomains(domains.build());
Unnecessary change.
    private Type getType(RichColumnDescriptor column)
    {
        Optional<Map<ColumnDescriptor, Domain>> predicateDomains = effectivePredicate.getDomains();
Please add a comment here about why we first look at effectivePredicate here.
    @Test
    public void testMatchesWithStatistics()
            throws Exception
Unnecessary throws clause.
You can simplify this method as (variables are renamed, unnecessary line breaks removed, methods static imported)
    public void testMatchesWithStatistics()
    {
        RichColumnDescriptor column = getColumn();
        String value = "Test";
        TupleDomain<ColumnDescriptor> effectivePredicate = getEffectivePredicate(column, createVarcharType(255), value);
        List<RichColumnDescriptor> columns = singletonList(column);
        TupleDomainParquetPredicate predicate = new TupleDomainParquetPredicate(effectivePredicate, columns);
        Statistics stats = getStatsBasedOnType(column.getType());
        stats.setNumNulls(1L);
        stats.setMinMaxFromBytes(value.getBytes(), value.getBytes());
        assertTrue(predicate.matches(2, singletonMap(column, stats)));
    }
        return TupleDomain.withColumnDomains(predicateColumns);
    }
    private RichColumnDescriptor getTableColumn()
Rename to getColumn.
    private RichColumnDescriptor getTableColumn()
    {
        PrimitiveType.PrimitiveTypeName typeName = PrimitiveType.PrimitiveTypeName.BINARY;
import PrimitiveTypeName
you can simplify this method as
    private RichColumnDescriptor getColumn()
    {
        PrimitiveType type = new PrimitiveType(Repetition.OPTIONAL, PrimitiveTypeName.BINARY, "Test column");
        return new RichColumnDescriptor(new String[] {"path"}, type, 0, 0);
    }
                new ParquetDictionaryDescriptor(tableColumn, Optional.of(page)))));
    }
    private TupleDomain<ColumnDescriptor> getEffectivePredicate(RichColumnDescriptor tableColumn,
you can simplify this method as
    private TupleDomain<ColumnDescriptor> getEffectivePredicate(RichColumnDescriptor column, VarcharType type, String value)
    {
        ColumnDescriptor predicateColumn = new ColumnDescriptor(column.getPath(), column.getType(), 0, 0);
        Domain predicateDomain = Domain.singleValue(type, utf8Slice(value));
        Map<ColumnDescriptor, Domain> predicateColumns = singletonMap(predicateColumn, predicateDomain);
        return TupleDomain.withColumnDomains(predicateColumns);
    }
    @Test
    public void testMatchesWithDescriptors()
            throws Exception
unnecessary throws clause.
you can simplify this method as
    public void testMatchesWithDescriptors()
    {
        RichColumnDescriptor column = getColumn();
        String value = "Test";
        TupleDomain<ColumnDescriptor> effectivePredicate = getEffectivePredicate(column, createVarcharType(255), value);
        List<RichColumnDescriptor> columns = singletonList(column);
        TupleDomainParquetPredicate predicate = new TupleDomainParquetPredicate(effectivePredicate, columns);
        ParquetDictionaryPage page = new ParquetDictionaryPage(utf8Slice(value), 2, PLAIN_DICTIONARY);
        assertTrue(predicate.matches(singletonMap(column, new ParquetDictionaryDescriptor(column, Optional.of(page)))));
    }
    import static org.testng.Assert.assertEquals;
    import static org.testng.Assert.assertTrue;
    public class TestTupleDomainParquetPredicate
I proposed some changes to the test code to rename variables, static import some methods, remove throws, etc. Here is the entire test class with these changes: https://gist.github.com/nezihyigitbasi/6c60c0fc163fe03eb82bd61d5545130a
I also added comments below regarding those changes.
a85cd63 to 977fcd7
977fcd7 to 36b2447
@nezihyigitbasi Thanks for the prompt feedback on the PR. I have made the changes you requested. I was also able to deploy this fix in our internal cluster and verify that it's working great.
nezihyigitbasi
left a comment
Can you update the commit message to "Fix type mismatch in Parquet predicate pushdown"?
    {
        // we look at effective predicate domain because it more accurately matches the hive column
        // than the type available in the parquet metadata passed here as RichColumnDescriptor
        // for example varchar(len) hive column is translated to binary and then to varchar type using parquet metadata
"for example varchar(len) hive column is translated to binary."
Can you elaborate, please?
I took a stab at breaking it down in as much detail as I can.
- During the predicate match, the domain type of the column in the predicate (if specified) and the type of the column from the schema need to match. See TupleDomainParquetPredicate#matches.
- For varchar(fixedLen), effectivePredicate's domain aligns with the type specified in the hive schema.
- However, the hive schema columns use RichColumnDescriptor, which extends ColumnDescriptor from hive. This class cannot capture varchar(fixedLen) the same way, and it doesn't seem straightforward to change that.
Looking at my comment above, I can see that I made the mistake of saying "from parquet metadata". Let me know if the explanation above makes sense and if you have any suggestions for rewording that comment.
Again, thanks for offering feedback on this. I think this might be affecting a lot more users than just us.
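For readers following the thread, the lookup order described above (consult the predicate's domains first, fall back to the type derived from the column descriptor) can be sketched roughly as follows. This is only an illustration under stated assumptions: `PredicateTypeLookupSketch` and `resolveType` are hypothetical names, and plain strings stand in for Presto's `Type`, `Domain`, and `ColumnDescriptor` classes. It is not the code from this PR.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch; strings stand in for Presto's column and type classes.
class PredicateTypeLookupSketch
{
    // Prefer the type recorded in the predicate's domains for this column;
    // otherwise fall back to the type derived from the column descriptor.
    static String resolveType(
            String columnPath,
            Optional<Map<String, String>> predicateDomainTypes,
            String descriptorType)
    {
        return predicateDomainTypes
                // Optional.map yields an empty Optional when the lookup returns null
                .map(types -> types.get(columnPath))
                .orElse(descriptorType);
    }

    public static void main(String[] args)
    {
        Map<String, String> domains = Collections.singletonMap("path", "varchar(255)");

        // Column appears in the predicate: the predicate's type is used
        System.out.println(resolveType("path", Optional.of(domains), "varchar"));
        // Column is absent from the predicate: the descriptor-derived type is used
        System.out.println(resolveType("other", Optional.of(domains), "varchar"));
        // Predicate has no domains at all: the descriptor-derived type is used
        System.out.println(resolveType("path", Optional.empty(), "varchar"));
    }
}
```

The fallback matters because a column that is read but not filtered has no entry in the predicate, so the descriptor-derived type is the only one available for it.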
I feel like this is a bit hacky as we get the type of the column from the predicate instead of the column descriptor (which currently does not have that information). So, I think a more proper approach is to add the Presto/Hive type info to In
Hey @nezihyigitbasi, really good point. I think it makes sense. Let me take a stab at it. If it works out, shall I open a separate PR, since the new approach might be significantly different from my initial one?
Yes, please open a follow-up PR and link/refer to this one.
@nishantrayan are you still working on this issue? Do you have plans to open a follow-up PR as discussed above?
Hi @nezihyigitbasi, sorry for the silence on this one. Yes, I will open a follow-up PR soon.
@nishantrayan For now I am merging this PR with several changes, as several people are asking for it and we are in the process of removing the old Parquet reader, so we want to resolve the issues in the new/optimized reader. Please open a follow-up PR if you want to make the changes we talked about before. The changes I made to this PR are:
Thanks for your contribution!
@nezihyigitbasi thanks for merging. The changes you made make sense.
The Parquet predicate pushdown logic was refactored at the beginning of last year to allow nested pushdown, and it seems one minor but annoying bug is still present in the latest version. An issue was opened for it here: #9084
We use Presto with Hive and can reliably reproduce this with a schema containing a fixed-length varchar column (say, varchar(255)).
The effectivePredicate seems to carry Hive's column type, which is varchar(255); however, the type header of the Parquet chunk gets translated to varchar, and because of this mismatch we fail the type check that happens later.
Please have a look at this @nezihyigitbasi @zhenxiao @dain
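To make the mismatch concrete, here is a minimal, self-contained sketch of why a strict type comparison fails when one side is varchar(255) and the other is unbounded varchar. `SketchVarchar` and `TypeMismatchSketch` are hypothetical stand-ins written for this illustration; they are not Presto's `VarcharType` or the actual check in `TupleDomainParquetPredicate`, and the real check may differ in detail.

```java
import java.util.OptionalInt;

class TypeMismatchSketch
{
    // Models a strict type check: the two types must be exactly equal.
    static boolean sameType(SketchVarchar predicateType, SketchVarchar chunkType)
    {
        return predicateType.equals(chunkType);
    }

    public static void main(String[] args)
    {
        SketchVarchar fromHiveSchema = SketchVarchar.bounded(255);  // varchar(255)
        SketchVarchar fromParquetChunk = SketchVarchar.unbounded(); // varchar
        System.out.println(fromHiveSchema + " vs " + fromParquetChunk
                + ": same type? " + sameType(fromHiveSchema, fromParquetChunk));
    }
}

// Hypothetical stand-in for a varchar type; NOT Presto's VarcharType.
final class SketchVarchar
{
    private final OptionalInt length; // empty means unbounded varchar

    private SketchVarchar(OptionalInt length)
    {
        this.length = length;
    }

    static SketchVarchar bounded(int length)
    {
        return new SketchVarchar(OptionalInt.of(length));
    }

    static SketchVarchar unbounded()
    {
        return new SketchVarchar(OptionalInt.empty());
    }

    @Override
    public boolean equals(Object other)
    {
        return other instanceof SketchVarchar
                && length.equals(((SketchVarchar) other).length);
    }

    @Override
    public int hashCode()
    {
        return length.hashCode();
    }

    @Override
    public String toString()
    {
        return length.isPresent() ? "varchar(" + length.getAsInt() + ")" : "varchar";
    }
}
```

In this model the length is part of the type's identity, so the bounded and unbounded variants never compare equal even though both describe string data.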