[SPARK-11103][SQL] Filter applied on Merged Parquet schema with new column fail #9327
Conversation
/cc @liancheng
ok to test
@HyukjinKwon Could you please add a test for this?
Test build #44578 has finished for PR 9327 at commit
@liancheng oh, right. I just added at
Test build #44664 has finished for PR 9327 at commit
Thanks! I'm merging this to master and branch-1.5.
Several nits here, but I'm going to merge this one first since 1.5.2rc2 is being cut soon.
- Please use `val` instead of `var` here.
- To construct the test DF, the following is preferable for better readability:
sqlContext.range(3).select('id as 'c, 'id cast StringType as 'b)
or
sqlContext.range(3).selectExpr("id AS c", "CAST(id AS STRING) AS b")
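For reference, here is a self-contained sketch of those two suggestions (a sketch only, assuming a SQLContext named sqlContext with its implicits in scope; this is not part of the diff):

```scala
import org.apache.spark.sql.types.StringType
import sqlContext.implicits._

// Three rows with the long `id` column renamed to `c` and its string cast as `b`;
// both forms should produce the same schema.
val viaColumns = sqlContext.range(3).select('id as 'c, ('id cast StringType) as 'b)
val viaExprs   = sqlContext.range(3).selectExpr("id AS c", "CAST(id AS STRING) AS b")
```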
var was used by mistake... Thanks for the comments!
[SPARK-11103][SQL] Filter applied on Merged Parquet schema with new column fail

When enabling mergedSchema and predicate filter, this fails since Parquet does not accept filters pushed down when the columns of the filters do not exist in the schema. This is related to a Parquet issue (https://issues.apache.org/jira/browse/PARQUET-389). For now, this PR simply disables predicate push-down when using a merged schema.

Author: hyukjinkwon <[email protected]>

Closes #9327 from HyukjinKwon/SPARK-11103.

(cherry picked from commit 59db9e9)
Signed-off-by: Cheng Lian <[email protected]>
This PR doesn't merge cleanly with branch-1.5; I manually resolved the conflicts while merging.
@liancheng that 1.5 cherry-pick picked up an unnecessary test. I will fix it.
Problem fixed with 6b10ea5.
It would be great if we could specify the columns for this kind of case, because the ordering of the columns can change.
I see. I just wonder if the inconsistent order is another issue. I think users might find it weird if they run the same script with SELECT * (using merged schemas) and the column order of the results differs between runs.
Could I open an issue for this if you think it is a separate issue?
@HyukjinKwon It would be weird if the column ordering of sqlContext.read.parquet(pathOne, pathTwo) were not deterministic. Can you try it out and see if that is the case?
I will try to check this. Thanks.
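A minimal way to try this out might look like the following (a sketch, assuming a SQLContext named sqlContext; the paths are hypothetical scratch directories):

```scala
// Write two Parquet datasets with overlapping but different schemas, then read
// them back with schema merging enabled and inspect the resulting column order.
val pathOne = "/tmp/spark-11500/one"
val pathTwo = "/tmp/spark-11500/two"

sqlContext.range(3).selectExpr("id AS a", "id AS b").write.parquet(pathOne)
sqlContext.range(3).selectExpr("id AS a", "id AS c").write.parquet(pathTwo)

val merged = sqlContext.read.option("mergeSchema", "true").parquet(pathOne, pathTwo)

// If the ordering were deterministic, this would print the same columns in the
// same order on every run.
println(merged.columns.mkString(", "))
```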
@yhuai
I investigated this; the order is not guaranteed.
This is because of FileStatusCache in HadoopFsRelation (which ParquetRelation extends, as you know). When FileStatusCache.listLeafFiles() is called, it returns a Set[FileStatus], which loses the order of the Array[FileStatus].
So, after retrieving the list of leaf files, including _metadata and _common_metadata, ParquetRelation.mergeSchemasInParallel() merges (separately, and only if necessary) the sets of _metadata, _common_metadata, and part-files, and the result can end up with a different column order, with the leading columns coming from whichever file happens to be first (columns the other files may not have).
I think this can be resolved by using a LinkedHashSet.
I will open an issue for this, and I would like to work on it if it is really an issue.
Filed here: https://issues.apache.org/jira/browse/SPARK-11500
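To illustrate the idea (a standalone sketch, not the actual HadoopFsRelation code): a plain hash set gives no iteration-order guarantee, while a LinkedHashSet iterates in insertion order, so building the leaf-file set with the latter would keep the file, and therefore column, ordering stable.

```scala
import scala.collection.mutable

// Stand-ins for the leaf file statuses, in the order they were listed.
val leafFiles = Seq("part-00000", "part-00001", "_metadata", "_common_metadata")

// A mutable.HashSet makes no guarantee about iteration order...
val unordered = mutable.HashSet(leafFiles: _*)

// ...whereas a mutable.LinkedHashSet iterates in insertion order, so any
// downstream schema merging would always see the files in the same order.
val ordered = mutable.LinkedHashSet(leafFiles: _*)

println(unordered.mkString(", ")) // order may vary
println(ordered.mkString(", "))   // always the original listing order
```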
I'm fixing the test case.
When enabling mergedSchema and predicate filter, this fails since Parquet does not accept filters pushed down when the columns of the filters do not exist in the schema.
This is related to a Parquet issue (https://issues.apache.org/jira/browse/PARQUET-389).
For now, this PR simply disables predicate push-down when using a merged schema.
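Conceptually, the change amounts to something like the following (a simplified sketch, not the actual ParquetRelation code; the names here are hypothetical):

```scala
// Only hand filters to Parquet for push-down when schema merging is off.
// With a merged schema, a pushed-down filter may reference a column that a
// given part-file does not contain, which Parquet rejects (PARQUET-389).
// Skipping push-down is safe because Spark still evaluates the filters on
// the rows returned by the scan.
def filtersToPushDown(
    candidateFilters: Seq[String], // hypothetical: filters already known to be convertible
    mergeSchemaEnabled: Boolean): Seq[String] = {
  if (mergeSchemaEnabled) Seq.empty else candidateFilters
}
```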