Skip to content

Conversation

@hvanhovell
Copy link
Contributor

What changes were proposed in this pull request?

If you construct catalyst trees using scala.collection.immutable.Stream you can run into situations where valid transformations do not seem to have any effect. There are two causes for this behavior:

  • Stream is evaluated lazily. Note that default implementation will generally only evaluate a function for the first element (this makes testing a bit tricky).
  • TreeNode and QueryPlan use side effects to detect if a tree has changed. Mapping over a stream is lazy and does not need to trigger this side effect. If this happens the node will invalidly assume that it did not change and return itself instead if the newly created node (this is for GC reasons).

This PR fixes this issue by forcing materialization on streams in TreeNode and QueryPlan.

How was this patch tested?

Unit tests were added to TreeNodeSuite and LogicalPlanSuite. An integration test was added to the PlannerSuite

@hvanhovell
Copy link
Contributor Author

cc @bogdanrdc @cloud-fan @ssimeonov

@SparkQA
Copy link

SparkQA commented Jun 12, 2018

Test build #91703 has finished for PR 21539 at commit d5832f4.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@hvanhovell
Copy link
Contributor Author

retest this please

@ssimeonov
Copy link
Contributor

👍

@SparkQA
Copy link

SparkQA commented Jun 12, 2018

Test build #91708 has finished for PR 21539 at commit d5832f4.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

if (children.nonEmpty) {
var changed = false
def mapChild(child: Any): Any = child match {
case arg: TreeNode[_] if containsChild(arg) =>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shall we reuse these code in L326?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, so was about to do that but then I noticed that they handle different cases, mapChild handles TreeNode and (TreeNode, TreeNode), whereas L326-L349 handles TreeNode, Option[TreeNode] and Map[_, TreeNode]. I am not sure if combining them is useful, and if it is then I'd rather do it in a different PR.

val df = Union(Stream(
Range(1, 1, 1, 1),
Range(1, 2, 1, 1)))
df.queryExecution.executedPlan.execute()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does it throw exception before?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it would throw an UnsupportedOperationException before.

@cloud-fan
Copy link
Contributor

nice fix!

@cloud-fan
Copy link
Contributor

retest this please

@SparkQA
Copy link

SparkQA commented Jun 13, 2018

Test build #91735 has finished for PR 21539 at commit d5832f4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Copy link
Member

viirya commented Jun 13, 2018

Interesting and great catch!

@cloud-fan
Copy link
Contributor

thanks, merging to master!

@asfgit asfgit closed this in 299d297 Jun 13, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants