Skip to content

Conversation

@gengliangwang
Copy link
Member

What changes were proposed in this pull request?

Allow casting complex types as string type in ANSI mode.

Why are the changes needed?

Currently, complex types are not allowed to cast as string type. This breaks the DataFrame.show() API. E.g

scala> sql(“select array(1, 2, 2)“).show(false)
org.apache.spark.sql.AnalysisException: cannot resolve ‘CAST(`array(1, 2, 2)` AS STRING)’ due to data type mismatch:
 cannot cast array<int> to string with ANSI mode on.

We should allow the conversion as the extension of the ANSI SQL standard, so that the DataFrame.show() still work in ANSI mode.

Does this PR introduce any user-facing change?

Yes, casting complex types as string type is now allowed in ANSI mode.

How was this patch tested?

Unit tests.

@SparkQA
Copy link

SparkQA commented Mar 24, 2021

Test build #136470 has started for PR 31954 at commit 31ab2b6.

@cloud-fan
Copy link
Contributor

ANSI explicit CAST is in 3.1 so this is a bug fix of ANSI mode for 3.1 (because df.show is not working with ANSI mode)?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this affect the coming year-month and day-time interval?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cloud-fan . Usually, we do the explicitly allowed-list approach in case of types. Is this change okay?
If this PR aims for complex type only, why don't we add them explicitly instead of doing this widely.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, cc @MaxGekk

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

df.show needs to cast the column to string, I think we need to support casting from all the data types here, otherwise df.show may still be broken under some cases.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So far, we don't support such casting. I opened the JIRAs for that: SPARK-34667 and SPARK-34668

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, @cloud-fan .
According to this test case, it's not a bug because it is designed like this.

ANSI explicit CAST is in 3.1 so this is a bug fix of ANSI mode for 3.1 (because df.show is not working with ANSI mode)?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea it is designed like this, but if df.show can't work, I think it's a bug in the design...

@dongjoon-hyun
Copy link
Member

For me, this is an improvement for Apache Spark 3.2.0, @gengliangwang .
Let's not backport this to branch-3.1.

@gengliangwang
Copy link
Member Author

@dongjoon-hyun I think putting it in branch-3.1 makes sense as well. It fixes the broken API DataFrame.show() under ANSI mode.
As the 3.2 release won't be available for months, it doesn't hunt to port this to branch-3.1.

@dongjoon-hyun
Copy link
Member

New improvements always deliver something which doesn't work before. According to the existing test case, this limitation is designed from the beginning at the Spark 3.1 implementation. For me, it's just a known limitation instead of a bug.

@dongjoon-hyun
Copy link
Member

dongjoon-hyun commented Mar 24, 2021

New features do not work properly with all combinations. Like SPARK-34827 (AQE + IO Encryption), this kind of efforts should be done in the new releases.

@cloud-fan
Copy link
Contributor

For SPARK-34827, it does expose a bug and we fixed it in 3.1, see #31898

I don't think failing in df.show is a limitation. It's obviously a bug. Failing with AQE + IO Encryption is also a bug and you merged that fix to 3.1. A limitation can be "ANSI mode is forcibly disabled in df.show", but it's kind of a weird behavior (df.collect and df.show behave differently) and I'd prefer to just fix this.

@dongjoon-hyun
Copy link
Member

dongjoon-hyun commented Mar 24, 2021

@cloud-fan . It's not fixed. The new implementation is disabled. :)
Please note that SPARK-34827 is different from SPARK-34790. SPARK-34827 is not fixed it in 3.1.

@dongjoon-hyun
Copy link
Member

Like we are doing in SPARK-33828 for AQE QA, I believe ANSI also needs more QA with umbrella JIRA issue, @gengliangwang and @cloud-fan .
Both AQE and ANSI is hidden by default since 3.0, we still are receiving more issues. We know that there is no bug-free status.

@cloud-fan
Copy link
Contributor

I don't understand what's the difference here. Both AQE and IO Encryption are disabled by default in 3.1 and we still merged #31898 to 3.1

I'm OK if you have different ideas to make df.show not fail when ANSI mode is turned on. If a user can trigger an unexpected failure with public APIs, that's a bug to me.

I agree we should have more QAs, but it doesn't mean we should stop backporting bug fixes. We have fixed many AQE issues and backported them to 3.0/3.1, although AQE is not turned on by default there. If you don't think this PR is the corrected fix, that's a different story and we can discuss more here.

@dongjoon-hyun
Copy link
Member

dongjoon-hyun commented Mar 24, 2021

This is a documented behavior, https://spark.apache.org/docs/latest/sql-ref-ansi-compliance.html.

If a user can trigger an unexpected failure with public APIs, that's a bug to me.

So, this is not an unexpected failure. For me, this is a well-documented limitation of Apache Spark 3.1.x with the explicit test coverage.

@cloud-fan
Copy link
Contributor

I don't see any document saying that df.show needs to cast the data to string, and it's unexpected that df.show fails with ANSI mode.

I agree that it's tricky to change a documented behavior, maybe it's arguable to change the ANSI explicit CAST behavior in 3.1. But df.show should be fixed as it's a bug. If you don't agree to change the ANSI explicit CAST behavior, we can try to find a different fix.

@dongjoon-hyun
Copy link
Member

+1 for that. I guess it will be a fix for df.show function which is applicable to master/3.1 together.

If you don't agree to change the ANSI explicit CAST behavior, we can try to find a different fix.

@SparkQA
Copy link

SparkQA commented Mar 24, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41054/

@SparkQA
Copy link

SparkQA commented Mar 24, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41054/

@gengliangwang
Copy link
Member Author

gengliangwang commented Mar 25, 2021

Talked to @cloud-fan offline. It's overkilling to find a different fix(e.g. create a new cast expression) in the implementation ofdf.show(). Let's use this simple solution and merge to master only.

@SparkQA
Copy link

SparkQA commented Mar 25, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41107/

@SparkQA
Copy link

SparkQA commented Mar 25, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41107/

@gengliangwang
Copy link
Member Author

Thanks for the review. Merging to master.

@dongjoon-hyun
Copy link
Member

Thank you for the decision, @gengliangwang and @cloud-fan .

@SparkQA
Copy link

SparkQA commented Mar 25, 2021

Test build #136524 has finished for PR 31954 at commit 1707885.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

srowen pushed a commit that referenced this pull request Mar 30, 2021
…FailureMessage

### What changes were proposed in this pull request?

After #31954, Array type is allowed to be cast as String type. So the customized conversion failure message branch from AnsiCast.typeCheckFailureMessage won't be reached anymore.
This PR is to remove the dead code.

### Why are the changes needed?

Code clean up.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Just removing dead code.

Closes #32004 from gengliangwang/SPARK-34856-followup.

Authored-by: Gengliang Wang <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
a0x8o added a commit to a0x8o/spark that referenced this pull request Mar 30, 2021
…FailureMessage

### What changes were proposed in this pull request?

After apache/spark#31954, Array type is allowed to be cast as String type. So the customized conversion failure message branch from AnsiCast.typeCheckFailureMessage won't be reached anymore.
This PR is to remove the dead code.

### Why are the changes needed?

Code clean up.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Just removing dead code.

Closes #32004 from gengliangwang/SPARK-34856-followup.

Authored-by: Gengliang Wang <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
xuanyuanking pushed a commit to xuanyuanking/spark that referenced this pull request Sep 29, 2021
### What changes were proposed in this pull request?

Allow casting complex types as string type in ANSI mode.

### Why are the changes needed?

Currently, complex types are not allowed to cast as string type. This breaks the DataFrame.show() API. E.g
```
scala> sql(“select array(1, 2, 2)“).show(false)
org.apache.spark.sql.AnalysisException: cannot resolve ‘CAST(`array(1, 2, 2)` AS STRING)’ due to data type mismatch:
 cannot cast array<int> to string with ANSI mode on.
```
We should allow the conversion as the extension of the ANSI SQL standard, so that the DataFrame.show() still work in ANSI mode.
### Does this PR introduce _any_ user-facing change?

Yes, casting complex types as string type is now allowed in ANSI mode.

### How was this patch tested?

Unit tests.

Closes apache#31954 from gengliangwang/fixExplicitCast.

Authored-by: Gengliang Wang <[email protected]>
Signed-off-by: Gengliang Wang <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants