Skip to content

[SPARK-30826][SQL] Respect reference case in StringStartsWith pushed down to parquet#27574

Closed
MaxGekk wants to merge 3 commits intoapache:masterfrom
MaxGekk:parquet-StringStartsWith-case-sens
Closed

[SPARK-30826][SQL] Respect reference case in StringStartsWith pushed down to parquet#27574
MaxGekk wants to merge 3 commits intoapache:masterfrom
MaxGekk:parquet-StringStartsWith-case-sens

Conversation

@MaxGekk
Copy link
Member

@MaxGekk MaxGekk commented Feb 14, 2020

What changes were proposed in this pull request?

In the PR, I propose to convert the attribute name of StringStartsWith pushed down to the Parquet datasource to column reference via the nameToParquetField map. Similar conversions are performed for other source filters pushed down to parquet.

Why are the changes needed?

This fixes the bug described in SPARK-30826. The query from an external table:

CREATE TABLE t1 (col STRING)
USING parquet
OPTIONS (path '$path')

created on top of written parquet files by Seq("42").toDF("COL").write.parquet(path) returns wrong empty result:

spark.sql("SELECT * FROM t1 WHERE col LIKE '4%'").show
+---+
|col|
+---+
+---+

Does this PR introduce any user-facing change?

Yes. After the changes the result is correct for the example above:

spark.sql("SELECT * FROM t1 WHERE col LIKE '4%'").show
+---+
|col|
+---+
| 42|
+---+

How was this patch tested?

Added a test to ParquetFilterSuite

@SparkQA
Copy link

SparkQA commented Feb 14, 2020

Test build #118396 has finished for PR 27574 at commit 5e632df.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk
Copy link
Member Author

MaxGekk commented Feb 14, 2020

jenkins, retest this, please

Copy link
Member

@Ngone51 Ngone51 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. It'd be better if we can reduce the scope of PR title to make it more compatible with the fix.

@MaxGekk MaxGekk changed the title [SPARK-30826][SQL] Fix wrong result after the StringStartsWith filter pushed down to parquet [SPARK-30826][SQL] Respect reference case in StringStartsWith pushed down to parquet Feb 14, 2020
if pushDownStartWith && canMakeFilterOn(name, prefix) =>
Option(prefix).map { v =>
FilterApi.userDefined(binaryColumn(name),
FilterApi.userDefined(binaryColumn(nameToParquetField(name).fieldName),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch!

@SparkQA
Copy link

SparkQA commented Feb 14, 2020

Test build #118401 has finished for PR 27574 at commit 5e632df.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

does 2.4 also have this bug?

@MaxGekk
Copy link
Member Author

MaxGekk commented Feb 14, 2020

@cloud-fan Yes, 2.4 has the bug too.

@cloud-fan cloud-fan closed this in 8b73b92 Feb 15, 2020
cloud-fan pushed a commit that referenced this pull request Feb 15, 2020
…d down to parquet

### What changes were proposed in this pull request?
In the PR, I propose to convert the attribute name of `StringStartsWith` pushed down to the Parquet datasource to column reference via the `nameToParquetField` map. Similar conversions are performed for other source filters pushed down to parquet.

### Why are the changes needed?
This fixes the bug described in [SPARK-30826](https://issues.apache.org/jira/browse/SPARK-30826). The query from an external table:
```sql
CREATE TABLE t1 (col STRING)
USING parquet
OPTIONS (path '$path')
```
created on top of written parquet files by `Seq("42").toDF("COL").write.parquet(path)` returns wrong empty result:
```scala
spark.sql("SELECT * FROM t1 WHERE col LIKE '4%'").show
+---+
|col|
+---+
+---+
```

### Does this PR introduce any user-facing change?
Yes. After the changes the result is correct for the example above:
```scala
spark.sql("SELECT * FROM t1 WHERE col LIKE '4%'").show
+---+
|col|
+---+
| 42|
+---+
```

### How was this patch tested?
Added a test to `ParquetFilterSuite`

Closes #27574 from MaxGekk/parquet-StringStartsWith-case-sens.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 8b73b92)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@cloud-fan
Copy link
Contributor

thanks, merging to master/3.0/2.4!

cloud-fan pushed a commit that referenced this pull request Feb 15, 2020
…d down to parquet

### What changes were proposed in this pull request?
In the PR, I propose to convert the attribute name of `StringStartsWith` pushed down to the Parquet datasource to column reference via the `nameToParquetField` map. Similar conversions are performed for other source filters pushed down to parquet.

### Why are the changes needed?
This fixes the bug described in [SPARK-30826](https://issues.apache.org/jira/browse/SPARK-30826). The query from an external table:
```sql
CREATE TABLE t1 (col STRING)
USING parquet
OPTIONS (path '$path')
```
created on top of written parquet files by `Seq("42").toDF("COL").write.parquet(path)` returns wrong empty result:
```scala
spark.sql("SELECT * FROM t1 WHERE col LIKE '4%'").show
+---+
|col|
+---+
+---+
```

### Does this PR introduce any user-facing change?
Yes. After the changes the result is correct for the example above:
```scala
spark.sql("SELECT * FROM t1 WHERE col LIKE '4%'").show
+---+
|col|
+---+
| 42|
+---+
```

### How was this patch tested?
Added a test to `ParquetFilterSuite`

Closes #27574 from MaxGekk/parquet-StringStartsWith-case-sens.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 8b73b92)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Copy link
Member

@HyukjinKwon HyukjinKwon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice. LGTM

sjincho pushed a commit to sjincho/spark that referenced this pull request Apr 15, 2020
…d down to parquet

### What changes were proposed in this pull request?
In the PR, I propose to convert the attribute name of `StringStartsWith` pushed down to the Parquet datasource to column reference via the `nameToParquetField` map. Similar conversions are performed for other source filters pushed down to parquet.

### Why are the changes needed?
This fixes the bug described in [SPARK-30826](https://issues.apache.org/jira/browse/SPARK-30826). The query from an external table:
```sql
CREATE TABLE t1 (col STRING)
USING parquet
OPTIONS (path '$path')
```
created on top of written parquet files by `Seq("42").toDF("COL").write.parquet(path)` returns wrong empty result:
```scala
spark.sql("SELECT * FROM t1 WHERE col LIKE '4%'").show
+---+
|col|
+---+
+---+
```

### Does this PR introduce any user-facing change?
Yes. After the changes the result is correct for the example above:
```scala
spark.sql("SELECT * FROM t1 WHERE col LIKE '4%'").show
+---+
|col|
+---+
| 42|
+---+
```

### How was this patch tested?
Added a test to `ParquetFilterSuite`

Closes apache#27574 from MaxGekk/parquet-StringStartsWith-case-sens.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@MaxGekk MaxGekk deleted the parquet-StringStartsWith-case-sens branch June 5, 2020 19:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants