Conversation

@vrozov
Member

@vrozov vrozov commented May 14, 2018

@parthchandra Please review

<dep.guava.version>18.0</dep.guava.version>
<forkCount>2</forkCount>
<parquet.version>1.8.1-drill-r0</parquet.version>
<parquet.version>1.10.0</parquet.version>
Contributor

1.10? Is it safer to upgrade to 1.8.3 and then test out 1.9/1.10 before upgrading to it?
Also, with this change the Hive parquet version is 1.8.2. I wonder what impact that might have on compatibility?

Member Author

1.8.3 as well as 1.8.1-drill-r0 are supposed to be patch releases on top of 1.8.0. Unfortunately parquet-mr does not properly follow semantic versioning and has functional and API-level changes in patch releases. On top of that, both 1.9.0 and 1.10.0 are not backward compatible with 1.8.0 and/or 1.8.3, so upgrading to 1.8.3 will not help us with a later upgrade to 1.10.0 or make it safer. I'd suggest paying the price once. Additionally, the latest parquet version may have functionality to be used in filter pushdown. @arina-ielchiieva what is your take?

Parquet libraries used by Hive are shaded within drill-hive-exec-shaded, so Hive is guarded from the parquet-mr library upgrade.
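For readers unfamiliar with shading: relocation rewrites the parquet package names inside the shaded jar, so the bundled Hive classes resolve against their own copy of parquet rather than the version Drill puts on the classpath. A minimal maven-shade-plugin sketch of the idea (the shadedPattern prefix here is illustrative, not necessarily the one drill-hive-exec-shaded actually uses):

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <configuration>
    <relocations>
      <relocation>
        <!-- Rewrites org.apache.parquet.* references inside the shaded
             jar so they do not collide with Drill's parquet-mr 1.10.0 -->
        <pattern>org.apache.parquet</pattern>
        <shadedPattern>hive.org.apache.parquet</shadedPattern>
      </relocation>
    </relocations>
  </configuration>
</plugin>
```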

Member

@arina-ielchiieva arina-ielchiieva May 15, 2018

Well, I need this upgrade to implement varchar filter push down, since it has been fixed in 1.10.0 but not in 1.8.3. I think if all unit tests pass, along with Functional & Advanced, we are safe to go.

Contributor

Fair enough.

@ilooner
Contributor

ilooner commented May 15, 2018

@vrozov Please fix Travis failures

Tests in error: 
  TestCTASPartitionFilter.withoutDistribution » UserRemote SYSTEM ERROR: Illegal...
  TestCTASPartitionFilter>BaseTestQuery.closeClient:286 » Runtime Exception whil...

@vrozov
Member Author

vrozov commented May 15, 2018

I am still working on fixing the unit and functional tests. The PR is open to initiate a discussion on parquet-mr version upgrade.

Contributor

@parthchandra parthchandra left a comment

LGTM

@arina-ielchiieva
Member

@vrozov sounds like we are all in consensus on upgrading to the latest parquet version. So when all tests pass, please ping us and we'll finish the code review.

@vrozov
Member Author

vrozov commented May 17, 2018

@arina-ielchiieva Please review

}
}

@Ignore
Member

Could you please explain the reason why these tests should be ignored?

Member

@vvysotskyi no worries, work is still in progress.

Member Author

I do not plan to re-enable the tests as part of this PR. The tests rely on incorrect statistics and need to be fixed/modified for the new parquet library. As I am not familiar with the functionality they test, I'll file a JIRA to track re-enabling them.

Member

In this case, I don't think it's a good idea to disable unit tests. You can consider asking for help to resolve the unit test failures rather than disabling them.

Member Author

The test needs to be fixed as part of a separate JIRA/PR (another option is to remove the check for the filter, but IMO it is even less desirable).

Member Author

The same applies to testIntervalYearPartitionPruning: statistics for col_intrvl_yr are not available for the same reason:

{
  "encodingStats" : null,
  "dictionaryPageOffset" : 0,
  "valueCount" : 6,
  "totalSize" : 81,
  "totalUncompressedSize" : 91,
  "statistics" : {
    "max" : null,
    "min" : null,
    "maxBytes" : null,
    "minBytes" : null,
    "empty" : true,
    "numNulls" : -1,
    "numNullsSet" : false
  },
  "firstDataPageOffset" : 451,
  "type" : "FIXED_LEN_BYTE_ARRAY",
  "path" : [ "col_intrvl_yr" ],
  "primitiveType" : {
    "name" : "col_intrvl_yr",
    "repetition" : "OPTIONAL",
    "originalType" : "INTERVAL",
    "id" : null,
    "primitive" : true,
    "primitiveTypeName" : "FIXED_LEN_BYTE_ARRAY",
    "decimalMetadata" : null,
    "typeLength" : 12
  },
  "codec" : "SNAPPY",
  "encodings" : [ "RLE", "BIT_PACKED", "PLAIN" ],
  "startingPos" : 451
}
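The empty statistics above mean that min/max for the column are simply unknown, so any pruning logic has to fall back to reading the row group rather than treating "no stats" as "no match". A self-contained sketch of that rule, with hypothetical names (this is not Drill's actual pruning code):

```java
// Sketch: how a pruning decision should degrade when column statistics
// are empty, as with the FIXED_LEN_BYTE_ARRAY columns above.
// All names here are illustrative, not Drill's actual classes.
public class PruningSketch {

  // Minimal stand-in for per-column statistics from a parquet footer.
  static final class ColumnStats {
    final Long min;   // null when statistics are empty
    final Long max;
    ColumnStats(Long min, Long max) { this.min = min; this.max = max; }
    boolean isEmpty() { return min == null || max == null; }
  }

  // Returns true only when the row group can be safely skipped for the
  // predicate "value == target". Empty stats must keep the row group.
  static boolean canPrune(ColumnStats stats, long target) {
    if (stats.isEmpty()) {
      return false; // unknown range: must read the row group
    }
    return target < stats.min || target > stats.max;
  }

  public static void main(String[] args) {
    System.out.println(canPrune(new ColumnStats(10L, 20L), 5L));    // prunable
    System.out.println(canPrune(new ColumnStats(10L, 20L), 15L));   // in range
    System.out.println(canPrune(new ColumnStats(null, null), 15L)); // empty stats
  }
}
```

This is why the ignored tests fail: they expect pruning to happen, but with empty statistics the only correct answer is to scan.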

Member Author

And for testDecimalPartitionPruning, statistics for MANAGER_ID are not available either:

{
  "encodingStats" : null,
  "dictionaryPageOffset" : 0,
  "valueCount" : 107,
  "totalSize" : 168,
  "totalUncompressedSize" : 363,
  "statistics" : {
    "max" : null,
    "min" : null,
    "maxBytes" : null,
    "minBytes" : null,
    "empty" : true,
    "numNulls" : -1,
    "numNullsSet" : false
  },
  "firstDataPageOffset" : 5550,
  "type" : "FIXED_LEN_BYTE_ARRAY",
  "path" : [ "MANAGER_ID" ],
  "codec" : "SNAPPY",
  "primitiveType" : {
    "name" : "MANAGER_ID",
    "repetition" : "OPTIONAL",
    "originalType" : "DECIMAL",
    "id" : null,
    "primitive" : true,
    "primitiveTypeName" : "FIXED_LEN_BYTE_ARRAY",
    "decimalMetadata" : {
      "precision" : 6,
      "scale" : 0
    },
    "typeLength" : 3
  },
  "encodings" : [ "PLAIN", "BIT_PACKED", "RLE" ],
  "startingPos" : 5550
}

Member Author

Parquet library behavior for DECIMAL statistics was changed in PARQUET-686 (see Parquet PR #367). I filed PARQUET-1322 to track statistics availability for DECIMAL types.

Member

Vlad, thanks for investigating the issue. Since it's a Parquet problem, we can leave the tests ignored; just please add a comment to each of them to indicate the root cause. @parthchandra are you ok with this approach?

Contributor

I also had an offline chat with Vlad on this one. The problem is that Parquet has changed its behaviour and will not give us the stats for Decimal when we read footers.
We have, therefore, no way of knowing whether Decimal stats are correct or not (even if they are correct) unless we try to hack something in Parquet. Hacking something in Parquet is not an option since that is exactly what this PR is trying to fix!
Also, we have never supported Decimal in Drill, so we do not have to consider backward compatibility. There are some users using Decimal (based on posts to the mailing list), but the old implementation never worked reliably so this will be an overall improvement for all parties.

+1. And thanks Vlad, Arina for pursuing this one to the end :)

@parthchandra
Contributor

parthchandra commented May 29, 2018 via email

@vrozov
Member Author

vrozov commented May 29, 2018

The fix for PARQUET-77 is included in 1.10.0, 1.9.0, and 1.8.3 as it is not specific to Apache Drill.

@vrozov
Member Author

vrozov commented Jun 3, 2018

@parthchandra @arina-ielchiieva The PR is ready for the final review.

@vrozov
Member Author

vrozov commented Jun 5, 2018

@parthchandra @arina-ielchiieva Please review

@arina-ielchiieva
Member

@vrozov so what are we going to do with the ignored tests? :) Do you have an explanation for why they fail?

@vrozov
Member Author

vrozov commented Jun 5, 2018

@arina-ielchiieva Please see my comment

@parthchandra
Contributor

Based on Arina's analysis, I don't think it is ok to ignore this test failure.

@parthchandra parthchandra reopened this Jun 5, 2018
if (parentColumnReader.columnDescriptor.getMaxDefinitionLevel() != 0) {
  throw new UnsupportedOperationException("Unsupoorted Operation");
}
Preconditions.checkState(parentColumnReader.columnDescriptor.getMaxDefinitionLevel() == 1);
Member Author

@sachouche Please review fix for DRILL-6447

Contributor

@sachouche sachouche Jun 12, 2018

@vrozov,

I believe that max-definition can be either zero or one. Zero if all values are null.

FYI - Did you also include the fix that I made in "VarLenBulkPageReader.java"?

Member Author

In case getMaxDefinitionLevel() is zero, resetDefinitionLevelReader() should not be called, as it unconditionally reads definitionLevels, which is only valid when getMaxDefinitionLevel() is not zero.
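In other words, definition levels only exist on the page for optional columns (max definition level 1); for required columns (level 0) there are no levels to read at all. A self-contained sketch of that guard, with hypothetical names (not the actual Drill reader classes):

```java
// Sketch of the guard discussed above: only touch the definition-level
// reader when the column can actually contain nulls. Names are
// illustrative, not Drill's actual reader classes.
public class DefinitionLevelGuard {

  interface DefinitionLevelReader {
    int readLevel(); // reads one definition level from the page
  }

  // For a required column (maxDefinitionLevel == 0) every value is
  // defined and no levels are stored, so the reader must not be used.
  static boolean isValueDefined(int maxDefinitionLevel, DefinitionLevelReader reader) {
    if (maxDefinitionLevel == 0) {
      return true; // required column: no definition levels to read
    }
    // Optional column: a level equal to the max means the value is present.
    return reader.readLevel() == maxDefinitionLevel;
  }

  public static void main(String[] args) {
    // Required column: the reader would throw if touched; it never is.
    DefinitionLevelReader throwing = () -> { throw new IllegalStateException("no levels"); };
    System.out.println(isValueDefined(0, throwing));

    // Optional column: level 1 == max means defined, level 0 means null.
    System.out.println(isValueDefined(1, () -> 1));
    System.out.println(isValueDefined(1, () -> 0));
  }
}
```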

Member Author

Yes, I included the other fix as well. Please review changes to VarLenBulkPageReader.java.

Contributor

Thank you Vlad!
LGTM

ilooner pushed a commit to ilooner/drill that referenced this pull request Jun 13, 2018
@ilooner ilooner closed this in ac8e698 Jun 14, 2018
@vrozov vrozov deleted the DRILL-6353 branch June 20, 2018 20:37
@xiexingguang

Does this mean that once the parquet-mr dependencies are upgraded, filter push down can support the varchar type?

@arina-ielchiieva
Member

@xiexingguang no, support for varchar push down is not fully implemented yet.

@xiexingguang

@arina-ielchiieva the varchar push down function seems important for us. Do you have a plan to support it fully? If so, please let us know. Thanks.

@vrozov
Member Author

vrozov commented Jul 17, 2018

@xiexingguang Please create a new JIRA. A PR is the place to discuss code modifications.
