
Conversation

@HyukjinKwon
Member

When schema merging (mergeSchema) and a predicate filter are enabled together, the query fails because Parquet does not accept pushed-down filters whose columns do not exist in a file's schema.
This is related to a Parquet issue (https://issues.apache.org/jira/browse/PARQUET-389).

For now, this PR simply disables predicate push-down when a merged schema is used.
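
A minimal sketch of the idea, for illustration only (the helper name and signature below are hypothetical, not the actual patch): when schema merging is on, hand Parquet an empty filter list instead of the pushed-down predicates.

    import org.apache.spark.sql.sources.Filter

    // Hypothetical helper mirroring the fix: skip predicate push-down
    // entirely when schema merging is enabled, because some part-files
    // may lack a filtered column and parquet-mr rejects such filters
    // (PARQUET-389). Spark still evaluates the filter itself afterwards.
    def filtersToPushDown(mergeSchema: Boolean, filters: Seq[Filter]): Seq[Filter] =
      if (mergeSchema) Nil else filters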

@HyukjinKwon
Member Author

/cc @liancheng

@liancheng
Contributor

ok to test

@liancheng
Contributor

@HyukjinKwon Could you please add a test for this?

@SparkQA

SparkQA commented Oct 29, 2015

Test build #44578 has finished for PR 9327 at commit 85dadbc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

@liancheng Oh, right. I just added one to ParquetFilterSuite.
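
A rough sketch of this kind of regression test (paths and column names are illustrative, not the exact code in ParquetFilterSuite): write two Parquet tables with different schemas, read them back with schema merging enabled, and filter on a column that exists in only one of them.

    import org.apache.spark.sql.SQLContext

    def reproduce(sqlContext: SQLContext, dir: String): Unit = {
      import sqlContext.implicits._
      // Two tables with different schemas: (a, b) and (c, b).
      (1 to 3).map(i => (i, i.toString)).toDF("a", "b").write.parquet(s"$dir/t1")
      (1 to 3).map(i => (i, i.toString)).toDF("c", "b").write.parquet(s"$dir/t2")

      // Before the fix, the filter on "c" was pushed down to parquet-mr,
      // which failed on t1 because its footer has no column "c".
      sqlContext.read.option("mergeSchema", "true")
        .parquet(s"$dir/t1", s"$dir/t2")
        .filter("c = 1")
        .show()
    }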

@SparkQA

SparkQA commented Oct 30, 2015

Test build #44664 has finished for PR 9327 at commit 7007c21.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@liancheng
Contributor

Thanks! I'm merging this to master and branch-1.5.

@liancheng
Contributor

Several nits here, but I'm going to merge this one first since 1.5.2rc2 is being cut soon.

  • Please use val instead of var here.

  • To construct the test DF, the following is preferable for readability:

    sqlContext.range(3).select('id as 'c, 'id cast StringType as 'b)

    or

    sqlContext.range(3).selectExpr("id AS c", "CAST(id AS STRING) AS b")

@HyukjinKwon
Member Author

var was used by mistake... Thanks for the comments!

@asfgit closed this in 59db9e9 on Oct 30, 2015
asfgit pushed a commit that referenced this pull request Oct 30, 2015
…lumn fail

When schema merging (mergeSchema) and a predicate filter are enabled together, the query fails because Parquet does not accept pushed-down filters whose columns do not exist in a file's schema.
This is related to a Parquet issue (https://issues.apache.org/jira/browse/PARQUET-389).

For now, this PR simply disables predicate push-down when a merged schema is used.

Author: hyukjinkwon <[email protected]>

Closes #9327 from HyukjinKwon/SPARK-11103.

(cherry picked from commit 59db9e9)
Signed-off-by: Cheng Lian <[email protected]>
@liancheng
Contributor

This PR doesn't merge cleanly with branch-1.5; I manually resolved the conflicts while merging.

@yhuai
Contributor

yhuai commented Oct 30, 2015

@liancheng That 1.5 cherry-pick picked up an unnecessary test. I will fix it.

@yhuai
Contributor

yhuai commented Oct 30, 2015

Problem fixed with 6b10ea5.

@yhuai
Contributor

It would be great if we could specify the columns explicitly for cases like this, because the ordering of the columns can change.
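
One way to do that, as a hedged sketch (the column names c and b are illustrative): project the columns by name so the assertion no longer depends on the merged schema's order.

    import org.apache.spark.sql.{DataFrame, SQLContext}

    // Illustrative: pin the output column order with an explicit select.
    def readMerged(sqlContext: SQLContext, pathOne: String, pathTwo: String): DataFrame =
      sqlContext.read.option("mergeSchema", "true")
        .parquet(pathOne, pathTwo)
        .select("c", "b")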

@HyukjinKwon
Member Author

I see. I just wonder whether the inconsistent order is a separate issue. I think users would find it weird if they ran the same script with SELECT * (using schema merging) and the column order of the results differed between runs.

Could I open an issue for this if you think it is a separate one?

@yhuai
Contributor

@HyukjinKwon It would be weird if the column ordering of sqlContext.read.parquet(pathOne, pathTwo) were not deterministic. Can you try it out and see whether that is the case?
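
A quick way to check, as a sketch (assumes pathOne and pathTwo point at existing Parquet data with different schemas): read the same pair of paths several times and compare the column orders that come back.

    import org.apache.spark.sql.SQLContext

    def distinctColumnOrders(
        sqlContext: SQLContext, pathOne: String, pathTwo: String): Seq[Seq[String]] = {
      val orders = (1 to 5).map { _ =>
        sqlContext.read.parquet(pathOne, pathTwo).columns.toSeq
      }
      // More than one distinct ordering across runs means non-determinism.
      orders.distinct
    }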

@HyukjinKwon
Member Author

I will try to check this. Thanks.

@HyukjinKwon
Member Author

@yhuai
I investigated it; the order is not guaranteed.

This is because of FileStatusCache in HadoopFsRelation (which ParquetRelation extends, as you know). When FileStatusCache.listLeafFiles() is called, it returns a Set[FileStatus], which loses the ordering of the original Array[FileStatus].

So, after retrieving the list of leaf files (including _metadata and _common_metadata), ParquetRelation.mergeSchemasInParallel() merges the sets of _metadata, _common_metadata, and part-files (separately, and only if necessary), and can end up with a different column order: the leading columns come from whichever file happens to be first, even if the other files do not have them.

I think this can be resolved by using LinkedHashSet.
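
A tiny illustration of the difference, using plain Scala collections rather than the actual FileStatus values:

    import scala.collection.mutable

    val files = Seq("part-00000", "_metadata", "part-00001", "_common_metadata")

    // A default HashSet makes no ordering guarantee...
    println(mutable.HashSet(files: _*).toSeq)
    // ...while LinkedHashSet preserves insertion order.
    println(mutable.LinkedHashSet(files: _*).toSeq)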

I will open an issue for this, and I would like to work on it if it is really an issue.

Filed here https://issues.apache.org/jira/browse/SPARK-11500

@liancheng
Contributor

I'm fixing the test case.

@liancheng
Contributor

Actually @yhuai already opened #9387 to fix this.

@HyukjinKwon deleted the SPARK-11103 branch on September 23, 2016