LBFGS CostFun used to send a dense vector of zeroes as a closure #7

eugene-kharitonov · 2016-10-17T13:09:45Z

What changes were proposed in this pull request?

LBFGS CostFun used to send a dense vector of zeroes as a closure in a treeAggregate call.
This vector can be of very high dimensionality, hence it's better to avoid this behaviour. We replace treeAggregate by a combination of mapPartition and treeReduce, creating a zero vector inside the mapPartition block in-place.

How was this patch tested?

Manual tests.

treeAggregate call. To avoid that, we replace treeAggregate by mapPartition + treeReduce, creating a zero vector inside the mapPartition block in-place.

AnthonyTruchet · 2016-10-17T15:38:28Z

As we are to contribute this patch upstream, please assert that the change is actually covered by the existing tests for L-BFGS and that they are running fine with it.

AnthonyTruchet · 2016-11-18T10:30:30Z

Replaced by #11

## What changes were proposed in this pull request? This PR aims to optimize GroupExpressions by removing repeating expressions. `RemoveRepetitionFromGroupExpressions` is added. **Before** ```scala scala> sql("select a+1 from values 1,2 T(a) group by a+1, 1+a, A+1, 1+A").explain() == Physical Plan == WholeStageCodegen : +- TungstenAggregate(key=[(a#0 + 1)criteo-forks#6,(1 + a#0)criteo-forks#7,(A#0 + 1)criteo-forks#8,(1 + A#0)criteo-forks#9], functions=[], output=[(a + 1)criteo-forks#5]) : +- INPUT +- Exchange hashpartitioning((a#0 + 1)criteo-forks#6, (1 + a#0)criteo-forks#7, (A#0 + 1)criteo-forks#8, (1 + A#0)criteo-forks#9, 200), None +- WholeStageCodegen : +- TungstenAggregate(key=[(a#0 + 1) AS (a#0 + 1)criteo-forks#6,(1 + a#0) AS (1 + a#0)criteo-forks#7,(A#0 + 1) AS (A#0 + 1)criteo-forks#8,(1 + A#0) AS (1 + A#0)criteo-forks#9], functions=[], output=[(a#0 + 1)criteo-forks#6,(1 + a#0)criteo-forks#7,(A#0 + 1)criteo-forks#8,(1 + A#0)criteo-forks#9]) : +- INPUT +- LocalTableScan [a#0], [[1],[2]] ``` **After** ```scala scala> sql("select a+1 from values 1,2 T(a) group by a+1, 1+a, A+1, 1+A").explain() == Physical Plan == WholeStageCodegen : +- TungstenAggregate(key=[(a#0 + 1)criteo-forks#6], functions=[], output=[(a + 1)criteo-forks#5]) : +- INPUT +- Exchange hashpartitioning((a#0 + 1)criteo-forks#6, 200), None +- WholeStageCodegen : +- TungstenAggregate(key=[(a#0 + 1) AS (a#0 + 1)criteo-forks#6], functions=[], output=[(a#0 + 1)criteo-forks#6]) : +- INPUT +- LocalTableScan [a#0], [[1],[2]] ``` ## How was this patch tested? Pass the Jenkins tests (with a new testcase) Author: Dongjoon Hyun <[email protected]> Closes apache#12590 from dongjoon-hyun/SPARK-14830. (cherry picked from commit 6e63201) Signed-off-by: Michael Armbrust <[email protected]>

…plan properly ### What changes were proposed in this pull request? Make `ResolveRelations` handle plan id properly cherry-pick bugfix apache#45214 to 3.5 ### Why are the changes needed? bug fix for Spark Connect, it won't affect classic Spark SQL before this PR: ``` from pyspark.sql import functions as sf spark.range(10).withColumn("value_1", sf.lit(1)).write.saveAsTable("test_table_1") spark.range(10).withColumnRenamed("id", "index").withColumn("value_2", sf.lit(2)).write.saveAsTable("test_table_2") df1 = spark.read.table("test_table_1") df2 = spark.read.table("test_table_2") df3 = spark.read.table("test_table_1") join1 = df1.join(df2, on=df1.id==df2.index).select(df2.index, df2.value_2) join2 = df3.join(join1, how="left", on=join1.index==df3.id) join2.schema ``` fails with ``` AnalysisException: [CANNOT_RESOLVE_DATAFRAME_COLUMN] Cannot resolve dataframe column "id". It's probably because of illegal references like `df1.select(df2.col("a"))`. SQLSTATE: 42704 ``` That is due to existing plan caching in `ResolveRelations` doesn't work with Spark Connect ``` === Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations === '[#12]Join LeftOuter, '`==`('index, 'id) '[#12]Join LeftOuter, '`==`('index, 'id) !:- '[#9]UnresolvedRelation [test_table_1], [], false :- '[#9]SubqueryAlias spark_catalog.default.test_table_1 !+- '[#11]Project ['index, 'value_2] : +- 'UnresolvedCatalogRelation `spark_catalog`.`default`.`test_table_1`, [], false ! +- '[#10]Join Inner, '`==`('id, 'index) +- '[#11]Project ['index, 'value_2] ! :- '[#7]UnresolvedRelation [test_table_1], [], false +- '[#10]Join Inner, '`==`('id, 'index) ! +- '[#8]UnresolvedRelation [test_table_2], [], false :- '[#9]SubqueryAlias spark_catalog.default.test_table_1 ! : +- 'UnresolvedCatalogRelation `spark_catalog`.`default`.`test_table_1`, [], false ! +- '[#8]SubqueryAlias spark_catalog.default.test_table_2 ! +- 'UnresolvedCatalogRelation `spark_catalog`.`default`.`test_table_2`, [], false Can not resolve 'id with plan 7 ``` `[#7]UnresolvedRelation [test_table_1], [], false` was wrongly resolved to the cached one ``` :- '[#9]SubqueryAlias spark_catalog.default.test_table_1 +- 'UnresolvedCatalogRelation `spark_catalog`.`default`.`test_table_1`, [], false ``` ### Does this PR introduce _any_ user-facing change? yes, bug fix ### How was this patch tested? added ut ### Was this patch authored or co-authored using generative AI tooling? ci Closes apache#46291 from zhengruifeng/connect_fix_read_join_35. Authored-by: Ruifeng Zheng <[email protected]> Signed-off-by: Ruifeng Zheng <[email protected]>

LBFGS CostFun used to send a dense vector of zeroes as a closure in a

91ed39a

treeAggregate call. To avoid that, we replace treeAggregate by mapPartition + treeReduce, creating a zero vector inside the mapPartition block in-place.

AnthonyTruchet mentioned this pull request Nov 18, 2016

[SPARK-18471][MLLIB] In LBFGS, avoid sending huge vectors of 0 #11

Merged

AnthonyTruchet closed this Nov 18, 2016

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

LBFGS CostFun used to send a dense vector of zeroes as a closure #7

LBFGS CostFun used to send a dense vector of zeroes as a closure #7

Uh oh!

eugene-kharitonov commented Oct 17, 2016

Uh oh!

AnthonyTruchet commented Oct 17, 2016 •

edited

Loading

Uh oh!

AnthonyTruchet commented Nov 18, 2016

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

LBFGS CostFun used to send a dense vector of zeroes as a closure #7

LBFGS CostFun used to send a dense vector of zeroes as a closure #7

Uh oh!

Conversation

eugene-kharitonov commented Oct 17, 2016

What changes were proposed in this pull request?

How was this patch tested?

Uh oh!

AnthonyTruchet commented Oct 17, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

AnthonyTruchet commented Nov 18, 2016

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

AnthonyTruchet commented Oct 17, 2016 •

edited

Loading