DRILL-6353: Upgrade Parquet MR dependencies #1259
Conversation
  <dep.guava.version>18.0</dep.guava.version>
  <forkCount>2</forkCount>
- <parquet.version>1.8.1-drill-r0</parquet.version>
+ <parquet.version>1.10.0</parquet.version>
1.10? Is it safer to upgrade to 1.8.3 and then test out 1.9/1.10 before upgrading to it?
Also, with this change the Hive parquet version is 1.8.2. I wonder what impact that might have on compatibility?
1.8.3, as well as 1.8.1-drill-r0, is supposed to be a patch release on top of 1.8.0. Unfortunately, parquet-mr does not properly follow semantic versioning and introduces functional and API-level changes in patch releases. On top of that, both 1.9.0 and 1.10.0 are not backward compatible with 1.8.0 and/or 1.8.3, so upgrading to 1.8.3 will not help with a later upgrade to 1.10.0 or make it safer. I'd suggest paying the price once. Additionally, the latest parquet version may provide functionality that can be used in filter pushdown. @arina-ielchiieva what is your take?
The Parquet libraries used by Hive are shaded within drill-hive-exec-shaded, so Hive is guarded from the parquet-mr library upgrade.
Well, I need this upgrade to implement varchar filter push down, since the underlying issue has been fixed in 1.10.0 but not in 1.8.3. I think that if all unit tests pass, along with the Functional & Advanced suites, we are safe to go.
Fair enough.
@vrozov Please fix Travis failures.

I am still working on fixing the unit and functional tests. The PR is open to initiate a discussion on the upgrade.
parthchandra left a comment:

LGTM
@vrozov Sounds like we are all in consensus on upgrading to the latest parquet version. So when all tests pass, please ping us and we'll finish the code review.

@arina-ielchiieva Please review.
    }
  }
+ @Ignore
Could you please explain the reason why these tests should be ignored?
@vvysotskyi no worries, work is still in progress.
I do not plan to re-enable the tests as part of this PR. The test relies on wrong statistics and needs to be fixed/modified for the new parquet library. As I am not familiar with the functionality it tests, I'll file a JIRA to work on re-enabling those tests.
In this case, I don't think it's a good idea to disable unit tests. You can consider asking for help to resolve the unit test failures, but not disabling them.
The test needs to be fixed as part of a separate JIRA/PR (another option is to remove the check for the filter, but IMO it is even less desirable).
The same applies to testIntervalYearPartitionPruning: statistics for col_intrvl_yr are also unavailable, for the same reason:
{
"encodingStats" : null,
"dictionaryPageOffset" : 0,
"valueCount" : 6,
"totalSize" : 81,
"totalUncompressedSize" : 91,
"statistics" : {
"max" : null,
"min" : null,
"maxBytes" : null,
"minBytes" : null,
"empty" : true,
"numNulls" : -1,
"numNullsSet" : false
},
"firstDataPageOffset" : 451,
"type" : "FIXED_LEN_BYTE_ARRAY",
"path" : [ "col_intrvl_yr" ],
"primitiveType" : {
"name" : "col_intrvl_yr",
"repetition" : "OPTIONAL",
"originalType" : "INTERVAL",
"id" : null,
"primitive" : true,
"primitiveTypeName" : "FIXED_LEN_BYTE_ARRAY",
"decimalMetadata" : null,
"typeLength" : 12
},
"codec" : "SNAPPY",
"encodings" : [ "RLE", "BIT_PACKED", "PLAIN" ],
"startingPos" : 451
}
And for testDecimalPartitionPruning, statistics for MANAGER_ID are not available either:
{
"encodingStats" : null,
"dictionaryPageOffset" : 0,
"valueCount" : 107,
"totalSize" : 168,
"totalUncompressedSize" : 363,
"statistics" : {
"max" : null,
"min" : null,
"maxBytes" : null,
"minBytes" : null,
"empty" : true,
"numNulls" : -1,
"numNullsSet" : false
},
"firstDataPageOffset" : 5550,
"type" : "FIXED_LEN_BYTE_ARRAY",
"path" : [ "MANAGER_ID" ],
"codec" : "SNAPPY",
"primitiveType" : {
"name" : "MANAGER_ID",
"repetition" : "OPTIONAL",
"originalType" : "DECIMAL",
"id" : null,
"primitive" : true,
"primitiveTypeName" : "FIXED_LEN_BYTE_ARRAY",
"decimalMetadata" : {
"precision" : 6,
"scale" : 0
},
"typeLength" : 3
},
"encodings" : [ "PLAIN", "BIT_PACKED", "RLE" ],
"startingPos" : 5550
}
Parquet library behavior for DECIMAL statistics was changed in PARQUET-686 (see Parquet PR #367). I filed PARQUET-1322 to track statistics availability for DECIMAL types.
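For reference, the per-column footer metadata shown above can be dumped with the parquet-mr footer API. This is a minimal sketch, not the code Drill itself uses; the class name and the command-line path argument are illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
import org.apache.parquet.hadoop.util.HadoopInputFile;

public class FooterStatsDump {
  public static void main(String[] args) throws Exception {
    // Placeholder path: point it at any parquet file from the test data set.
    Path file = new Path(args[0]);
    try (ParquetFileReader reader =
        ParquetFileReader.open(HadoopInputFile.fromPath(file, new Configuration()))) {
      for (BlockMetaData block : reader.getFooter().getBlocks()) {
        for (ColumnChunkMetaData column : block.getColumns()) {
          // For the DECIMAL and INTERVAL columns discussed above, the
          // statistics object comes back empty ("empty" : true).
          System.out.println(column.getPath() + " -> " + column.getStatistics());
        }
      }
    }
  }
}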
Vlad, thanks for investigating the issue. Since it's a Parquet problem, we can leave the tests ignored; just please add a comment in each of them to indicate the root cause. @parthchandra are you ok with this approach?
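For illustration, a minimal sketch of what such an annotated test could look like; the wrapper class and the message wording are assumptions, while testDecimalPartitionPruning is one of the affected tests mentioned above:

import org.junit.Ignore;
import org.junit.Test;

public class TestParquetMetadataCacheSketch {
  // Hedged sketch of the suggested root-cause comment, not the actual Drill code.
  @Ignore("PARQUET-1322: parquet-mr 1.10.0 does not expose statistics for "
      + "DECIMAL columns, so partition pruning cannot be verified")
  @Test
  public void testDecimalPartitionPruning() {
    // body elided; re-enable once the upstream Parquet issue is resolved
  }
}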
I also had an offline chat with Vlad on this one. The problem is that Parquet has changed its behaviour and will not give us the stats for Decimal when we read footers.
We therefore have no way of knowing whether the Decimal stats are correct (even if they are) unless we try to hack something in Parquet. Hacking something in Parquet is not an option, since that is exactly what this PR is trying to fix!
Also, we have never supported Decimal in Drill, so we do not have to consider backward compatibility. There are some users using Decimal (based on posts to the mailing list), but the old implementation never worked reliably, so this will be an overall improvement for all parties.
+1. And thanks Vlad, Arina for pursuing this one to the end :)
The fix for the stats was part of a big commit to add support for ByteBuffers in Parquet (PARQUET-77 <https://issues.apache.org/jira/browse/PARQUET-77>; commit apache/parquet-java@6b605a4). See the included commit apache/parquet-java@7bc2a4d, which was to fix the overwriting of stats.
On Thu, May 24, 2018 at 6:55 PM, Vlad Rozov commented on this pull request, in exec/java-exec/src/test/java/org/apache/drill/exec/store/parquet/TestParquetMetadataCache.java:

> It will be good if you can point to the JIRA with the fix that Drill uses to correct statistics. Without a JIRA it is not clear what particular fix Drill uses to work around bugs in how the parquet library handles statistics, and for what data types.
The fix for PARQUET-77 is included in 1.10.0, 1.9.0, and 1.8.3, as it is not specific to Apache Drill.

@parthchandra @arina-ielchiieva The PR is ready for the final review.

@parthchandra @arina-ielchiieva Please review.

@vrozov So what are we going to do with the ignored tests? :) Do you have an explanation why they fail?

@arina-ielchiieva Please see my comment.

Based on Arina's analysis, I don't think it is ok to ignore this test failure.
-   if (parentColumnReader.columnDescriptor.getMaxDefinitionLevel() != 0) {
-     throw new UnsupportedOperationException("Unsupoorted Operation");
-   }
+   Preconditions.checkState(parentColumnReader.columnDescriptor.getMaxDefinitionLevel() == 1);
@sachouche Please review fix for DRILL-6447
I believe that the max definition level can be either zero or one; zero if all values are null.
FYI, did you also include the fix that I made in "VarLenBulkPageReader.java"?
In case getMaxDefinitionLevel() is zero, resetDefinitionLevelReader() should not be called, as it unconditionally reads definition levels, which is only valid when getMaxDefinitionLevel() is not zero.
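A minimal, self-contained sketch of that guard using the parquet-mr schema API; the class and the print stub are illustrative stand-ins, not Drill's actual reader:

import java.util.Arrays;
import org.apache.parquet.column.ColumnDescriptor;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

public class DefinitionLevelGuard {
  // Stand-in for Drill's resetDefinitionLevelReader(), which repositions the
  // definition-level decoder on the current page.
  static void resetDefinitionLevelReader(ColumnDescriptor column) {
    System.out.println("reading definition levels for " + Arrays.toString(column.getPath()));
  }

  public static void main(String[] args) {
    // For a flat schema the max definition level is 0 for required columns
    // and 1 for optional ones, hence the checkState(... == 1) in the fix.
    MessageType schema = MessageTypeParser.parseMessageType(
        "message m { required int32 req_col; optional int32 opt_col; }");
    for (ColumnDescriptor column : schema.getColumns()) {
      // Definition levels are encoded only for nullable columns, so skip
      // the reset when the max definition level is zero.
      if (column.getMaxDefinitionLevel() > 0) {
        resetDefinitionLevelReader(column);
      }
    }
  }
}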
Yes, I included the other fix as well. Please review changes to VarLenBulkPageReader.java.
Thank you Vlad!
LGTM
Does this mean that, once the Parquet MR dependencies are upgraded, filter push down can support the varchar type?

@xiexingguang No, support for varchar push down is not fully implemented yet.

@arina-ielchiieva The varchar push down feature seems important for us; do you have a plan to support it fully? If you do, please inform us. Thanks.

@xiexingguang Please create a new JIRA; a PR is the place to discuss code modifications.
@parthchandra Please review