Skip to content

[SPARK-34141][SQL] Remove side effect from ExtractGenerator#31213

Closed
tanelk wants to merge 4 commits intoapache:masterfrom
tanelk:SPARK-34141_extract_generator
Closed

[SPARK-34141][SQL] Remove side effect from ExtractGenerator#31213
tanelk wants to merge 4 commits intoapache:masterfrom
tanelk:SPARK-34141_extract_generator

Conversation

@tanelk
Copy link
Contributor

@tanelk tanelk commented Jan 17, 2021

What changes were proposed in this pull request?

Rewrote one ExtractGenerator case such that it would not rely on a side effect of the flatmap function.

Why are the changes needed?

With the dataframe api it is possible to have a lazy sequence as the output of a LogicalPlan. When exploding a column on this dataframe using the withColumn("newName", explode(col("name"))) method, the ExtractGenerator does not extract the generator and CheckAnalysis would throw an exception.

Does this PR introduce any user-facing change?

Bugfix
Before this, the work around was to put .select("*") before the explode.

How was this patch tested?

UT

@github-actions github-actions bot added the SQL label Jan 17, 2021
@tanelk
Copy link
Contributor Author

tanelk commented Jan 17, 2021

@HyukjinKwon , any chance that this bugfix could get to 3.1.1?

@SparkQA
Copy link

SparkQA commented Jan 17, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38737/

@SparkQA
Copy link

SparkQA commented Jan 17, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38737/

@SparkQA
Copy link

SparkQA commented Jan 17, 2021

Test build #134154 has finished for PR 31213 at commit 08e367c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jan 17, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38741/

@tanelk
Copy link
Contributor Author

tanelk commented Jan 17, 2021

@LuciferYang, you have been working with 2.13.
Any recommendations on how to get a lazy Seq on both 2.12 and 2.13.
On 2.13 the Seq.view does not return a Seq anymore.

@SparkQA
Copy link

SparkQA commented Jan 17, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38741/

@SparkQA
Copy link

SparkQA commented Jan 17, 2021

Test build #134158 has finished for PR 31213 at commit cefc79f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@LuciferYang
Copy link
Contributor

@tanelk In Scala 2.13, Views are collections whose transformation operations are non strict and they no longer extend their corresponding collection type, for example, an IndexedSeqView no longer extends IndexedSeq.

Why do we need a non strict collection here? Can it be a strict collection?
If it must be a non strict collection, may need to define the type of output as scala.collection.SeqView for version compatibility.

@tanelk
Copy link
Contributor Author

tanelk commented Jan 18, 2021

Well that is the test case I'm trying to cover.
in 2.12 the input can be lazy Seq, but the current code breaks then.
Perhaps it's enough to say that I manually tested it and not add the UT.

@LuciferYang
Copy link
Contributor

cc @srowen

@LuciferYang
Copy link
Contributor

Well that is the test case I'm trying to cover.
in 2.12 the input can be lazy Seq, but the current code breaks then.
Perhaps it's enough to say that I manually tested it and not add the UT.

New UT in this pr will break the compilation of Scala 2.13, @tanelk want to get a lazy Seq on both 2.12 and 2.13, but SeqView is not Seq in Scala 2.13, do you have any suggestions ? @srowen

@srowen
Copy link
Member

srowen commented Jan 19, 2021

Hm why does Seq vs SeqView matter here? what's the compile error?

@LuciferYang
Copy link
Contributor

image

@tanelk
Copy link
Contributor Author

tanelk commented Jan 19, 2021

@srowen
There is a correctness issue, when an lazy Seq is given as the project list (for example SeqView).
I did fix it, but the unit test does not compile on 2.13, because there SeqView does not implement Seq any more.

Would it be okay do not add the UT or is there some lazy Seq present on both 2.12 and 2.13.

@srowen
Copy link
Member

srowen commented Jan 19, 2021

Hm, tough call. Can you invoke the second constructor by adding :_* to the end of the argument? But then I suppose that's not testing what you want to test either. Can you pass a Scala Stream here maybe? not sure if that works. It may work to create a new test-suite in a 2.12-only src/test directory if that's not too much trouble. It sounds like this can't happen in 2.13 anyway so the test isn't relevant.

@SparkQA
Copy link

SparkQA commented Jan 26, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39096/

@SparkQA
Copy link

SparkQA commented Jan 26, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39096/

@tanelk
Copy link
Contributor Author

tanelk commented Jan 26, 2021

I went ahead and added version dependent tests. Thanks for the advice.

import org.apache.spark.sql.catalyst.plans.logical._
import org.apache.spark.sql.types._

class ExtractGeneratorSuite extends AnalysisTest {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd add comments in each test file noting that there is a parallel one in the other source tree, so people realize that both need to change.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you'll add the comments I'll merge it @tanelk

@SparkQA
Copy link

SparkQA commented Jan 26, 2021

Test build #134511 has finished for PR 31213 at commit 8dd6da0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class ExtractGeneratorSuite extends AnalysisTest
  • class ExtractGeneratorSuite extends AnalysisTest

@SparkQA
Copy link

SparkQA commented Jan 31, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39275/

@SparkQA
Copy link

SparkQA commented Jan 31, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39275/

@SparkQA
Copy link

SparkQA commented Jan 31, 2021

Test build #134689 has finished for PR 31213 at commit 82ca8f5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val explode = Alias(Explode(b), "c")()

// view is a lazy seq
val rel = LocalRelation(output = columns.view)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible for end users to get stuck on this issue? If possible, could you add end-2-end tests, too?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hit this issue while using spark in java. The step before was an join, where I used JavaConverters.collectionAsScalaIterable(Arrays.asList(columns)).toSeq() as the join condition. The JavaConverters helper returns an lazy collection.
I can take a look at on how to e2e test this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@srowen srowen closed this in c73f70b Feb 6, 2021
@srowen
Copy link
Member

srowen commented Feb 6, 2021

Merged to master. I'm satisfied with the motivation and test.

@HyukjinKwon
Copy link
Member

Oh, just saw it. LGTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants