DRILL-6353: Upgrade Parquet MR dependencies #1259
Conversation
  <dep.guava.version>18.0</dep.guava.version>
  <forkCount>2</forkCount>
- <parquet.version>1.8.1-drill-r0</parquet.version>
+ <parquet.version>1.10.0</parquet.version>
1.10? Is it safer to upgrade to 1.8.3 and then test out 1.9/1.10 before upgrading to it?
Also, with this change the Hive parquet version is 1.8.2. I wonder what impact that might have on compatibility?
1.8.3, as well as 1.8.1-drill-r0, is supposed to be a patch release on top of 1.8.0. Unfortunately, parquet-mr does not properly follow semantic versioning and introduces functional and API-level changes in patch releases. On top of that, both 1.9.0 and 1.10.0 are not backward compatible with 1.8.0 and/or 1.8.3, so upgrading to 1.8.3 will not help with a later upgrade to 1.10.0 or make it safer. I'd suggest paying the price once. Additionally, the latest parquet version may provide functionality that can be used in filter pushdown. @arina-ielchiieva what is your take?
The Parquet libraries used by Hive are shaded within drill-hive-exec-shaded, so Hive is guarded from the parquet-mr library upgrade.
Well, I need this upgrade to implement varchar filter push down, since the underlying issue has been fixed in 1.10.0 but not in 1.8.3. I think that if all unit tests pass, along with the Functional & Advanced suites, we are safe to go.
Fair enough.
@vrozov Please fix Travis failures.

I am still working on fixing the unit and functional tests. The PR is open to initiate a discussion on the upgrade.
parthchandra left a comment:

LGTM
@vrozov Sounds like we are all in consensus on upgrading to the latest parquet version. So when all tests pass, please ping us and we'll finish the code review.

@arina-ielchiieva Please review.
    }
  }
+ @Ignore
Could you please explain the reason why these tests should be ignored?
@vvysotskyi no worries, work is still in progress.
I do not plan to re-enable the tests as part of this PR. The test relies on wrong statistics and needs to be fixed/modified for the new parquet library. As I am not familiar with the functionality it tests, I'll file a JIRA to work on re-enabling those tests.
In this case, I don't think it's a good idea to disable unit tests. You can consider asking for help to resolve the unit test failures, but not disabling them.
The test needs to be fixed as part of a separate JIRA/PR (another option is to remove the check for the filter, but IMO it is even less desirable).
The same applies to testIntervalYearPartitionPruning: statistics for col_intrvl_yr are also unavailable, for the same reason:
{
"encodingStats" : null,
"dictionaryPageOffset" : 0,
"valueCount" : 6,
"totalSize" : 81,
"totalUncompressedSize" : 91,
"statistics" : {
"max" : null,
"min" : null,
"maxBytes" : null,
"minBytes" : null,
"empty" : true,
"numNulls" : -1,
"numNullsSet" : false
},
"firstDataPageOffset" : 451,
"type" : "FIXED_LEN_BYTE_ARRAY",
"path" : [ "col_intrvl_yr" ],
"primitiveType" : {
"name" : "col_intrvl_yr",
"repetition" : "OPTIONAL",
"originalType" : "INTERVAL",
"id" : null,
"primitive" : true,
"primitiveTypeName" : "FIXED_LEN_BYTE_ARRAY",
"decimalMetadata" : null,
"typeLength" : 12
},
"codec" : "SNAPPY",
"encodings" : [ "RLE", "BIT_PACKED", "PLAIN" ],
"startingPos" : 451
}
And for testDecimalPartitionPruning, statistics for MANAGER_ID are not available either:
{
"encodingStats" : null,
"dictionaryPageOffset" : 0,
"valueCount" : 107,
"totalSize" : 168,
"totalUncompressedSize" : 363,
"statistics" : {
"max" : null,
"min" : null,
"maxBytes" : null,
"minBytes" : null,
"empty" : true,
"numNulls" : -1,
"numNullsSet" : false
},
"firstDataPageOffset" : 5550,
"type" : "FIXED_LEN_BYTE_ARRAY",
"path" : [ "MANAGER_ID" ],
"codec" : "SNAPPY",
"primitiveType" : {
"name" : "MANAGER_ID",
"repetition" : "OPTIONAL",
"originalType" : "DECIMAL",
"id" : null,
"primitive" : true,
"primitiveTypeName" : "FIXED_LEN_BYTE_ARRAY",
"decimalMetadata" : {
"precision" : 6,
"scale" : 0
},
"typeLength" : 3
},
"encodings" : [ "PLAIN", "BIT_PACKED", "RLE" ],
"startingPos" : 5550
}
Parquet library behavior for DECIMAL statistics was changed in PARQUET-686 (see Parquet PR #367). I filed PARQUET-1322 to track statistics availability for DECIMAL types.
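For reference, the per-column footer metadata shown above can be dumped with the parquet-mr footer API. This is a minimal sketch, not the code Drill itself uses; the class name and the command-line path argument are illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
import org.apache.parquet.hadoop.util.HadoopInputFile;

public class FooterStatsDump {
  public static void main(String[] args) throws Exception {
    // Placeholder path: point it at any parquet file from the test data set.
    Path file = new Path(args[0]);
    try (ParquetFileReader reader =
        ParquetFileReader.open(HadoopInputFile.fromPath(file, new Configuration()))) {
      for (BlockMetaData block : reader.getFooter().getBlocks()) {
        for (ColumnChunkMetaData column : block.getColumns()) {
          // For the DECIMAL and INTERVAL columns discussed above, the
          // statistics object comes back empty ("empty" : true).
          System.out.println(column.getPath() + " -> " + column.getStatistics());
        }
      }
    }
  }
}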
Vlad, thanks for investigating the issue. Since it's a Parquet problem, we can leave the tests ignored; just please add a comment in each of them to indicate the root cause. @parthchandra are you ok with this approach?
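For illustration, a minimal sketch of what such an annotated test could look like; the wrapper class and the message wording are assumptions, while testDecimalPartitionPruning is one of the affected tests mentioned above:

import org.junit.Ignore;
import org.junit.Test;

public class TestParquetMetadataCacheSketch {
  // Hedged sketch of the suggested root-cause comment, not the actual Drill code.
  @Ignore("PARQUET-1322: parquet-mr 1.10.0 does not expose statistics for "
      + "DECIMAL columns, so partition pruning cannot be verified")
  @Test
  public void testDecimalPartitionPruning() {
    // body elided; re-enable once the upstream Parquet issue is resolved
  }
}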
I also had an offline chat with Vlad on this one. The problem is that Parquet has changed its behaviour and will not give us the stats for Decimal when we read footers.
We therefore have no way of knowing whether the Decimal stats are correct (even if they are) unless we try to hack something in Parquet. Hacking something in Parquet is not an option, since that is exactly what this PR is trying to fix!
Also, we have never supported Decimal in Drill, so we do not have to consider backward compatibility. There are some users using Decimal (based on posts to the mailing list), but the old implementation never worked reliably, so this will be an overall improvement for all parties.
+1. And thanks Vlad, Arina for pursuing this one to the end :)
The fix for the stats was part of a big commit to add support for ByteBuffers in Parquet (PARQUET-77 <https://issues.apache.org/jira/browse/PARQUET-77>; commit apache/parquet-java@6b605a4). See the included commit apache/parquet-java@7bc2a4d, which was to fix the overwriting of stats.
On Thu, May 24, 2018 at 6:55 PM, Vlad Rozov commented on this pull request, in exec/java-exec/src/test/java/org/apache/drill/exec/store/parquet/TestParquetMetadataCache.java:

> It will be good if you can point to the JIRA with the fix that Drill uses to correct statistics. Without a JIRA it is not clear what particular fix Drill uses to work around bugs in how the parquet library handles statistics, and for what data types.
The fix for PARQUET-77 is included in 1.10.0, 1.9.0, and 1.8.3, as it is not specific to Apache Drill.

@parthchandra @arina-ielchiieva The PR is ready for the final review.

@parthchandra @arina-ielchiieva Please review.

@vrozov So what are we going to do with the ignored tests? :) Do you have an explanation why they fail?

@arina-ielchiieva Please see my comment.

Based on Arina's analysis, I don't think it is ok to ignore this test failure.
-   if (parentColumnReader.columnDescriptor.getMaxDefinitionLevel() != 0) {
-     throw new UnsupportedOperationException("Unsupoorted Operation");
-   }
+   Preconditions.checkState(parentColumnReader.columnDescriptor.getMaxDefinitionLevel() == 1);
@sachouche Please review fix for DRILL-6447
I believe that the max definition level can be either zero or one; zero if all values are null.
FYI, did you also include the fix that I made in "VarLenBulkPageReader.java"?
In case getMaxDefinitionLevel() is zero, resetDefinitionLevelReader() should not be called, as it unconditionally reads definition levels, which is only valid when getMaxDefinitionLevel() is not zero.
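A minimal, self-contained sketch of that guard using the parquet-mr schema API; the class and the print stub are illustrative stand-ins, not Drill's actual reader:

import java.util.Arrays;
import org.apache.parquet.column.ColumnDescriptor;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

public class DefinitionLevelGuard {
  // Stand-in for Drill's resetDefinitionLevelReader(), which repositions the
  // definition-level decoder on the current page.
  static void resetDefinitionLevelReader(ColumnDescriptor column) {
    System.out.println("reading definition levels for " + Arrays.toString(column.getPath()));
  }

  public static void main(String[] args) {
    // For a flat schema the max definition level is 0 for required columns
    // and 1 for optional ones, hence the checkState(... == 1) in the fix.
    MessageType schema = MessageTypeParser.parseMessageType(
        "message m { required int32 req_col; optional int32 opt_col; }");
    for (ColumnDescriptor column : schema.getColumns()) {
      // Definition levels are encoded only for nullable columns, so skip
      // the reset when the max definition level is zero.
      if (column.getMaxDefinitionLevel() > 0) {
        resetDefinitionLevelReader(column);
      }
    }
  }
}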
Yes, I included the other fix as well. Please review changes to VarLenBulkPageReader.java.
Thank you Vlad!
LGTM
Does this mean that, once the Parquet MR dependencies are upgraded, filter push down can support the varchar type?

@xiexingguang No, support for varchar push down is not fully implemented yet.

@arina-ielchiieva The varchar push down feature seems important for us; do you have a plan to support it fully? If you do, please inform us. Thanks.

@xiexingguang Please create a new JIRA; a PR is the place to discuss code modifications.
@parthchandra Please review