
Update upstream #19

Merged
GulajavaMinistudio merged 5 commits into GulajavaMinistudio:master from apache:master
Apr 16, 2017

Conversation

@GulajavaMinistudio
Owner

What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)

How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Please review http://spark.apache.org/contributing.html before opening a pull request.

cloud-fan and others added 5 commits April 16, 2017 11:14
…erializable

## What changes were proposed in this pull request?

In #17398 we introduced `UnresolvedMapObjects` as a placeholder for `MapObjects`. Unfortunately, `UnresolvedMapObjects` is not serializable, because its `function` may reference a Scala `Type`, which is not serializable.

Ideally this is fine, because we never serialize unresolved expressions and send them to executors. However, users may do this accidentally, e.g. by mistakenly referencing an encoder instance when implementing `Aggregator`. We should fix it so that this is only a performance issue (more network traffic) and does not fail the query.

## How was this patch tested?

N/A

Author: Wenchen Fan <wenchen@databricks.com>

Closes #17639 from cloud-fan/minor.
…minism of Hive UDF

### What changes were proposed in this pull request?
```JAVA
  /**
   * Certain optimizations should not be applied if UDF is not deterministic.
   * Deterministic UDF returns same result each time it is invoked with a
   * particular input. This determinism just needs to hold within the context of
   * a query.
   *
   * return true if the UDF is deterministic
   */
  boolean deterministic() default true;
```

Based on the definition of [UDFType](https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/UDFType.java#L42-L50), a Hive UDF is also non-deterministic when any of its children is non-deterministic.
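
The rule can be illustrated with a minimal sketch. The `Expr` class and its fields below are hypothetical stand-ins, not Spark's or Hive's actual classes; the sketch only shows the idea that a node counts as deterministic when both its own flag and every child's flag say so.

```python
class Expr:
    """Hypothetical expression node (an illustration, not Spark/Hive code)."""

    def __init__(self, name, self_deterministic=True, children=()):
        self.name = name
        # Stands in for the value of the UDFType `deterministic` annotation.
        self.self_deterministic = self_deterministic
        self.children = list(children)

    @property
    def deterministic(self):
        # A node is deterministic only if it and all of its children are.
        return self.self_deterministic and all(
            child.deterministic for child in self.children
        )


col = Expr("col")
rand = Expr("rand", self_deterministic=False)
udf_on_col = Expr("my_udf", children=[col])
udf_on_rand = Expr("my_udf", children=[rand])

print(udf_on_col.deterministic, udf_on_rand.deterministic)
```

Here `my_udf` is itself annotated as deterministic in both cases, yet the second instance is treated as non-deterministic because its input is.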

### How was this patch tested?
Added test cases.

Author: Xiao Li <gatorsmile@gmail.com>

Closes #17635 from gatorsmile/udfDeterministic.
…s into docker when running on mesos with docker containerizer

## What changes were proposed in this pull request?

Allow passing arbitrary parameters to Docker when launching Spark executors on Mesos with the Docker containerizer. cc tnachen
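
As a rough sketch of what passing such parameters through could involve, the helper below parses a comma-separated `key=value` string into ordered pairs and renders them as `--key=value` flags. The function names, the input format, and the flag rendering are all assumptions made for illustration; the commit message does not state the configuration key or format.

```python
from typing import List, Tuple


def parse_docker_parameters(spec: str) -> List[Tuple[str, str]]:
    """Parse 'k1=v1,k2=v2' into ordered (key, value) pairs (assumed format)."""
    pairs = []
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate empty segments such as a trailing comma
        key, sep, value = item.partition("=")
        if not sep or not key.strip():
            raise ValueError(f"expected key=value, got {item!r}")
        pairs.append((key.strip(), value.strip()))
    return pairs


def to_docker_flags(pairs: List[Tuple[str, str]]) -> List[str]:
    """Render pairs as '--key=value' flags (hypothetical rendering)."""
    return [f"--{key}={value}" for key, value in pairs]


params = parse_docker_parameters("user=nobody, memory-swap=1g")
print(to_docker_flags(params))
```

Keeping the pairs in a list rather than a dict preserves order and allows a key to be repeated, which some Docker options rely on.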

## How was this patch tested?

Built manually and tested with a passed-in parameter.

Author: Ji Yan <jiyan@Jis-MacBook-Air.local>

Closes #17109 from yanji84/ji/allow_set_docker_user.
… failure in SBT Hadoop 2.6 master on Jenkins

## What changes were proposed in this pull request?

This PR proposes to add

```
      <dependency>
        <groupId>org.apache.avro</groupId>
        <artifactId>avro</artifactId>
      </dependency>
```

to the core POM to see whether it resolves the build failure below:

```
[error] /home/jenkins/workspace/spark-master-test-sbt-hadoop-2.6/core/src/main/scala/org/apache/spark/serializer/GenericAvroSerializer.scala:123: value createDatumWriter is not a member of org.apache.avro.generic.GenericData
[error]     writerCache.getOrElseUpdate(schema, GenericData.get.createDatumWriter(schema))
[error]
```

https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.6/2770/consoleFull

## How was this patch tested?

I tried many approaches but was unable to reproduce this locally. Sean tried the same approach and was also unable to reproduce it.

Please refer to the comments in #17477 (comment)

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #17642 from HyukjinKwon/SPARK-20343.
…inst project's code style

## What changes were proposed in this pull request?

Currently, multi-dot separated variable names in R are not allowed. For example,

```diff
 setMethod("from_json", signature(x = "Column", schema = "structType"),
-          function(x, schema, asJsonArray = FALSE, ...) {
+          function(x, schema, as.json.array = FALSE, ...) {
             if (asJsonArray) {
               jschema <- callJStatic("org.apache.spark.sql.types.DataTypes",
                                      "createArrayType",
```

produces an error as below:

```
R/functions.R:2462:31: style: Words within variable and function names should be separated by '_' rather than '.'.
          function(x, schema, as.json.array = FALSE, ...) {
                              ^~~~~~~~~~~~~
```

This seems to go against https://google.github.io/styleguide/Rguide.xml#identifiers, which says

> The preferred form for variable names is all lower case letters and words separated with dots

This appears to be because lintr (https://github.com/jimhester/lintr) by default follows http://r-pkgs.had.co.nz/style.html, as stated in its README.md. A few rules seem to differ from Google's guide, described as "a few tweaks".

Per [SPARK-6813](https://issues.apache.org/jira/browse/SPARK-6813), we follow Google's R Style Guide, with a few exceptions: https://google.github.io/styleguide/Rguide.xml. This has also been merged into Spark's website: apache/spark-website#43

Also, it looks like we place no restriction on function names. According to the README.md, this rule also affects function names.

> `multiple_dots_linter`: check that function and variable names are separated by `_` rather than `.`.

## How was this patch tested?

Manually tested `./dev/lint-r` with the manual change below in `R/functions.R`:

```diff
 setMethod("from_json", signature(x = "Column", schema = "structType"),
-          function(x, schema, asJsonArray = FALSE, ...) {
+          function(x, schema, as.json.array = FALSE, ...) {
             if (asJsonArray) {
               jschema <- callJStatic("org.apache.spark.sql.types.DataTypes",
                                      "createArrayType",
```

**Before**

```R
R/functions.R:2462:31: style: Words within variable and function names should be separated by '_' rather than '.'.
          function(x, schema, as.json.array = FALSE, ...) {
                              ^~~~~~~~~~~~~
```

**After**

```
lintr checks passed.
```

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #17590 from HyukjinKwon/disable-dot-in-name.
@GulajavaMinistudio GulajavaMinistudio merged commit 9f7ba32 into GulajavaMinistudio:master Apr 16, 2017
GulajavaMinistudio pushed a commit that referenced this pull request Mar 12, 2022
…aceable

### What changes were proposed in this pull request?

This PR uses manual recursion to replace `RuntimeReplaceable` expressions instead of `transformAllExpressionsWithPruning`. The problem with `transformAllExpressionsWithPruning` is that it automatically makes the replacement expression inherit the function alias name from the parent node, which is quite misleading. For example, for `select date_part('month', c) from t`, the optimized plan in EXPLAIN before this PR is
```
Project [date_part(cast(c#18 as date)) AS date_part(month, c)#19]
+- Relation default.t[c#18] parquet
```
Now it's
```
Project [month(cast(c#9 as date)) AS date_part(month, c)#10]
+- Relation default.t[c#9] parquet
```
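
The idea of a manual recursion can be sketched as follows. This is only an illustration under assumed names: the `Expr` class, its `replacement` field, and `replace_all` are hypothetical and do not mirror Catalyst's API. The point is that a replaceable node is swapped for its replacement directly, so the replacement keeps its own name (`month`) instead of inheriting the original function's name.

```python
class Expr:
    """Hypothetical expression tree node (not Catalyst's Expression)."""

    def __init__(self, name, children=(), replacement=None):
        self.name = name
        self.children = list(children)
        # Set only for nodes that should be replaced before execution.
        self.replacement = replacement

    def __repr__(self):
        if not self.children:
            return self.name
        return f"{self.name}({', '.join(map(repr, self.children))})"


def replace_all(expr):
    """Recursively swap every replaceable node for its replacement."""
    if expr.replacement is not None:
        # The replacement may itself contain replaceable nodes, so recurse.
        return replace_all(expr.replacement)
    return Expr(expr.name, [replace_all(child) for child in expr.children])


c = Expr("c")
month = Expr("month", [Expr("cast_as_date", [c])])
date_part = Expr("date_part", [Expr("'month'"), c], replacement=month)
plan = Expr("alias", [date_part])

print(replace_all(plan))  # alias(month(cast_as_date(c)))
```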

### Why are the changes needed?

fix misleading EXPLAIN result

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

new test

Closes apache#35821 from cloud-fan/follow2.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
GulajavaMinistudio pushed a commit that referenced this pull request Aug 12, 2024
…eption

### What changes were proposed in this pull request?

This PR reworks grouping by map type to fix two issues:
- A "cannot bind reference" exception at runtime, because the attribute was wrapped by `MapSort` and we did not transform the plan with the new output
- The rule that adds `MapSort` should run before `PullOutGroupingExpressions` to avoid complex expressions in grouping keys
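
For background, a conceptual sketch of why grouping by a map benefits from a sort step: two maps with the same entries should land in the same group regardless of entry order, so each map is first put into a canonical, key-sorted form. This reflects an assumption about the purpose of `MapSort`, and the helpers `map_sort` and `group_by_map` are hypothetical names, not Spark functions.

```python
from collections import defaultdict


def map_sort(m):
    """Canonical form of a map: its entries sorted by key, as a hashable tuple."""
    return tuple(sorted(m.items()))


def group_by_map(rows):
    """Group row ids by the canonical form of their map column."""
    groups = defaultdict(list)
    for row_id, m in rows:
        groups[map_sort(m)].append(row_id)
    return dict(groups)


rows = [
    (0, {1: "a", 2: "b"}),
    (1, {2: "b", 1: "a"}),  # same entries as row 0, different insertion order
    (2, {1: "a"}),
]
result = group_by_map(rows)
print(result)
```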

### Why are the changes needed?

To fix issues.

for example:
```
select map(1, id) from range(10) group by map(1, id);

[INTERNAL_ERROR] Couldn't find _groupingexpression#18 in [mapsort(_groupingexpression#18)#19] SQLSTATE: XX000
org.apache.spark.SparkException: [INTERNAL_ERROR] Couldn't find _groupingexpression#18 in [mapsort(_groupingexpression#18)#19] SQLSTATE: XX000
	at org.apache.spark.SparkException$.internalError(SparkException.scala:92)
	at org.apache.spark.SparkException$.internalError(SparkException.scala:96)
	at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:81)
	at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:74)
	at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:470)
```

### Does this PR introduce _any_ user-facing change?

no, not released

### How was this patch tested?

improve the tests to add more cases

### Was this patch authored or co-authored using generative AI tooling?

no

Closes apache#47545 from ulysses-you/maptype.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: youxiduo <youxiduo@corp.netease.com>
GulajavaMinistudio pushed a commit that referenced this pull request Feb 3, 2025
…IN-subquery

### What changes were proposed in this pull request?

This PR adds code to `RewritePredicateSubquery#apply` to explicitly handle the case where an `Aggregate` node contains an aggregate expression in the left-hand operand of an IN-subquery expression. The explicit handler moves the IN-subquery expressions out of the `Aggregate` and into a parent `Project` node. The `Aggregate` will continue to perform the aggregations that were used as an operand to the IN-subquery expression, but will not include the IN-subquery expression itself. After pulling up IN-subquery expressions into a Project node, `RewritePredicateSubquery#apply` is called again to handle the `Project` as a `UnaryNode`. The `Join` will now be inserted between the `Project` and the `Aggregate` node, and the join condition will use an attribute rather than an aggregate expression, e.g.:
```
Project [col1#32, exists#42 AS (sum(col2) IN (listquery()))#40]
+- Join ExistenceJoin(exists#42), (sum(col2)#41L = c2#39L)
   :- Aggregate [col1#32], [col1#32, sum(col2#33) AS sum(col2)#41L]
   :  +- LocalRelation [col1#32, col2#33]
   +- LocalRelation [c2#39L]
```
`sum(col2)#41L` in the above join condition, despite how it looks, is the name of the attribute, not an aggregate expression.

### Why are the changes needed?

The following query fails:
```
create or replace temp view v1(c1, c2) as values (1, 2), (1, 3), (2, 2), (3, 7), (3, 1);
create or replace temp view v2(col1, col2) as values (1, 2), (1, 3), (2, 2), (3, 7), (3, 1);

select col1, sum(col2) in (select c2 from v1)
from v2 group by col1;
```
It fails with this error:
```
[INTERNAL_ERROR] Cannot generate code for expression: sum(input[1, int, false]) SQLSTATE: XX000
```
With SPARK_TESTING=1, it fails with this error:
```
[PLAN_VALIDATION_FAILED_RULE_IN_BATCH] Rule org.apache.spark.sql.catalyst.optimizer.RewritePredicateSubquery in batch RewriteSubquery generated an invalid plan: Special expressions are placed in the wrong plan:
Aggregate [col1#11], [col1#11, first(exists#20, false) AS (sum(col2) IN (listquery()))#19]
+- Join ExistenceJoin(exists#20), (sum(col2#12) = c2#18L)
   :- LocalRelation [col1#11, col2#12]
   +- LocalRelation [c2#18L]
```
The issue is that `RewritePredicateSubquery` builds a `Join` operator where the join condition contains an aggregate expression.

The bug is in the handler for `UnaryNode` in `RewritePredicateSubquery#apply`, which adds a `Join` below the `Aggregate` and assumes that the left-hand operand of IN-subquery can be used in the join condition. This works fine for most cases, but not when the left-hand operand is an aggregate expression.

This PR moves the offending IN-subqueries to a `Project` node, with the aggregates replaced by attributes referring to the aggregate expressions. The resulting join condition now uses those attributes rather than the actual aggregate expressions.
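
The shape of the rewritten plan can be mimicked with plain Python data, using the values from the failing query above. This is a sketch of the evaluation order only (aggregate first, then test membership on the aggregated attribute in a separate projection step); the function names are hypothetical, and SQL NULL semantics for `IN` are ignored because the sample data contains no NULLs.

```python
from collections import defaultdict

v1 = [(1, 2), (1, 3), (2, 2), (3, 7), (3, 1)]  # (c1, c2)
v2 = [(1, 2), (1, 3), (2, 2), (3, 7), (3, 1)]  # (col1, col2)


def aggregate(rows):
    """Aggregate step: sum(col2) grouped by col1."""
    sums = defaultdict(int)
    for col1, col2 in rows:
        sums[col1] += col2
    return sorted(sums.items())


def project_in_subquery(aggregated, subquery_values):
    """Project step: test the already-aggregated attribute for membership."""
    return [(col1, total in subquery_values) for col1, total in aggregated]


c2_values = {c2 for _, c2 in v1}
result = project_in_subquery(aggregate(v2), c2_values)
print(result)  # [(1, False), (2, True), (3, False)]
```

The membership test never sees the raw `col2` values, only the per-group sums, which corresponds to the join condition referring to an attribute rather than to an aggregate expression.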

### Does this PR introduce _any_ user-facing change?

No, other than allowing this type of query to succeed.

### How was this patch tested?

New unit tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#48627 from bersprockets/aggregate_in_set_issue.

Authored-by: Bruce Robbins <bersprockets@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
4 participants