[SPARK-11103][SQL] Filter applied on Merged Parquet schema with new column fail #9327
Conversation
/cc @liancheng
ok to test
@HyukjinKwon Could you please add a test for this?
Test build #44578 has finished for PR 9327 at commit
@liancheng oh, right. I just added at
Test build #44664 has finished for PR 9327 at commit
Thanks! I'm merging this to master and branch-1.5.
Several nits here, but I'm going to merge this one first since 1.5.2rc2 is being cut soon.
- Please use `val` instead of `var` here.
- To construct the test DF, the following is preferable for better readability:
sqlContext.range(3).select('id as 'c, 'id cast StringType as 'b)
or
sqlContext.range(3).selectExpr("id AS c", "CAST(id AS STRING) AS b")
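For reference, here is a self-contained sketch of those two suggestions (a sketch only, assuming a SQLContext named sqlContext with its implicits in scope; this is not part of the diff):

```scala
import org.apache.spark.sql.types.StringType
import sqlContext.implicits._

// Three rows with the long `id` column renamed to `c` and its string cast as `b`;
// both forms should produce the same schema.
val viaColumns = sqlContext.range(3).select('id as 'c, ('id cast StringType) as 'b)
val viaExprs   = sqlContext.range(3).selectExpr("id AS c", "CAST(id AS STRING) AS b")
```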
var was used by mistake... Thanks for the comments!
[SPARK-11103][SQL] Filter applied on Merged Parquet schema with new column fail

When enabling mergedSchema and predicate filter, this fails since Parquet does not accept filters pushed down when the columns of the filters do not exist in the schema. This is related to a Parquet issue (https://issues.apache.org/jira/browse/PARQUET-389). For now, this PR simply disables predicate push-down when using a merged schema.

Author: hyukjinkwon <[email protected]>

Closes #9327 from HyukjinKwon/SPARK-11103.

(cherry picked from commit 59db9e9)
Signed-off-by: Cheng Lian <[email protected]>
This PR doesn't merge cleanly with branch-1.5; I manually resolved the conflicts while merging.
@liancheng that 1.5 cherry-pick picked up an unnecessary test. I will fix it.
Problem fixed with 6b10ea5.
It would be great if we could specify the columns for this kind of case, because the ordering of the columns can change.
I see. I just wonder if the inconsistent order is another issue. I think users might find it weird if they run the same script with SELECT * (using merged schemas) and the column order of the results differs between runs.
Could I open an issue for this if you think it is a separate issue?
@HyukjinKwon It would be weird if the column ordering of sqlContext.read.parquet(pathOne, pathTwo) were not deterministic. Can you try it out and see if that is the case?
I will try to check this. Thanks.
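A minimal way to try this out might look like the following (a sketch, assuming a SQLContext named sqlContext; the paths are hypothetical scratch directories):

```scala
// Write two Parquet datasets with overlapping but different schemas, then read
// them back with schema merging enabled and inspect the resulting column order.
val pathOne = "/tmp/spark-11500/one"
val pathTwo = "/tmp/spark-11500/two"

sqlContext.range(3).selectExpr("id AS a", "id AS b").write.parquet(pathOne)
sqlContext.range(3).selectExpr("id AS a", "id AS c").write.parquet(pathTwo)

val merged = sqlContext.read.option("mergeSchema", "true").parquet(pathOne, pathTwo)

// If the ordering were deterministic, this would print the same columns in the
// same order on every run.
println(merged.columns.mkString(", "))
```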
@yhuai
I investigated this; the order is not guaranteed.
This is because of FileStatusCache in HadoopFsRelation (which ParquetRelation extends, as you know). When FileStatusCache.listLeafFiles() is called, it returns a Set[FileStatus], which loses the order of the Array[FileStatus].
So, after retrieving the list of leaf files, including _metadata and _common_metadata, ParquetRelation.mergeSchemasInParallel() merges (separately, and only if necessary) the sets of _metadata, _common_metadata, and part-files, and the result can end up with a different column order, with the leading columns coming from whichever file happens to be first (columns the other files may not have).
I think this can be resolved by using a LinkedHashSet.
I will open an issue for this, and I would like to work on it if it is really an issue.
Filed here: https://issues.apache.org/jira/browse/SPARK-11500
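To illustrate the idea (a standalone sketch, not the actual HadoopFsRelation code): a plain hash set gives no iteration-order guarantee, while a LinkedHashSet iterates in insertion order, so building the leaf-file set with the latter would keep the file, and therefore column, ordering stable.

```scala
import scala.collection.mutable

// Stand-ins for the leaf file statuses, in the order they were listed.
val leafFiles = Seq("part-00000", "part-00001", "_metadata", "_common_metadata")

// A mutable.HashSet makes no guarantee about iteration order...
val unordered = mutable.HashSet(leafFiles: _*)

// ...whereas a mutable.LinkedHashSet iterates in insertion order, so any
// downstream schema merging would always see the files in the same order.
val ordered = mutable.LinkedHashSet(leafFiles: _*)

println(unordered.mkString(", ")) // order may vary
println(ordered.mkString(", "))   // always the original listing order
```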
I'm fixing the test case.
When enabling mergedSchema and predicate filter, this fails since Parquet does not accept filters pushed down when the columns of the filters do not exist in the schema.
This is related to a Parquet issue (https://issues.apache.org/jira/browse/PARQUET-389).
For now, this PR simply disables predicate push-down when using a merged schema.
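Conceptually, the change amounts to something like the following (a simplified sketch, not the actual ParquetRelation code; the names here are hypothetical):

```scala
// Only hand filters to Parquet for push-down when schema merging is off.
// With a merged schema, a pushed-down filter may reference a column that a
// given part-file does not contain, which Parquet rejects (PARQUET-389).
// Skipping push-down is safe because Spark still evaluates the filters on
// the rows returned by the scan.
def filtersToPushDown(
    candidateFilters: Seq[String], // hypothetical: filters already known to be convertible
    mergeSchemaEnabled: Boolean): Seq[String] = {
  if (mergeSchemaEnabled) Seq.empty else candidateFilters
}
```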