Update upstream#34
Merged
GulajavaMinistudio merged 11 commits into GulajavaMinistudio:master on May 3, 2017
Conversation
…ady initialized after getting SQLException

## What changes were proposed in this pull request?
Avoid failing to initCause on JDBC exception with cause initialized to null

## How was this patch tested?
Existing tests

Author: Sean Owen <sowen@cloudera.com>

Closes #17800 from srowen/SPARK-20459.
## What changes were proposed in this pull request?
Add support for the SQL standard distinct predicate to Spark SQL.
```
<expression> IS [NOT] DISTINCT FROM <expression>
```

## How was this patch tested?
Tested using unit tests, integration tests, manual tests.

Author: ptkool <michael.styles@shopify.com>

Closes #17764 from ptkool/is_not_distinct_from.
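As a plain-Python sketch of the semantics (not Spark code): `a IS NOT DISTINCT FROM b` is null-safe equality, where two NULLs compare as equal and a NULL never equals a non-NULL value. `None` stands in for SQL NULL here.

```python
# Null-safe equality, modeling SQL's `IS [NOT] DISTINCT FROM`.
# Illustrative sketch only; None stands in for SQL NULL.

def is_not_distinct_from(a, b):
    # Two NULLs are "not distinct"; NULL vs. non-NULL is "distinct".
    if a is None and b is None:
        return True
    if a is None or b is None:
        return False
    return a == b

def is_distinct_from(a, b):
    return not is_not_distinct_from(a, b)
```

Note the contrast with plain `=`, which yields NULL (not TRUE) when either operand is NULL.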
…d class

## What changes were proposed in this pull request?
`newProductSeqEncoder with REPL defined class` in `ReplSuite` has been failing non-deterministically: https://spark-tests.appspot.com/failed-tests over the last few days. Disabling the test until a fix is in place.

https://spark.test.databricks.com/job/spark-master-test-sbt-hadoop-2.7/176/testReport/junit/org.apache.spark.repl/ReplSuite/newProductSeqEncoder_with_REPL_defined_class/history/

## How was this patch tested?
N/A

Author: Sameer Agarwal <sameerag@cs.berkeley.edu>

Closes #17823 from sameeragarwal/disable-test.
## What changes were proposed in this pull request?
Updating R Programming Guide

## How was this patch tested?
Manually

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #17816 from felixcheung/r22relnote.
## What changes were proposed in this pull request?
Adds R wrappers for:
- `o.a.s.sql.functions.grouping` as `o.a.s.sql.functions.is_grouping` (to avoid shadowing `base::grouping`)
- `o.a.s.sql.functions.grouping_id`

## How was this patch tested?
Existing unit tests, additional unit tests, `check-cran.sh`.

Author: zero323 <zero323@users.noreply.github.com>

Closes #17807 from zero323/SPARK-20532.
## What changes were proposed in this pull request?
As #17773 revealed, `OnHeapColumnVector` may copy only part of the original storage. `OffHeapColumnVector` reallocation likewise copies data to the new storage only up to `elementsAppended`. This variable is only updated by the `ColumnVector.appendX` API, while `ColumnVector.putX` is more commonly used. This PR copies the new storage data up to the previously-allocated size in `OffHeapColumnVector`.

## How was this patch tested?
Existing test suites

Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>

Closes #17811 from kiszk/SPARK-20537.
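A minimal model of the bug described above (a hypothetical Python sketch, not the Java source): a growable vector whose reallocation copies only up to `elements_appended` silently drops values that were written through `put` at higher indices, because `put` never bumps that counter.

```python
# Sketch of the SPARK-20537 issue. `reserve(..., fixed=False)` models the
# buggy copy (up to elements_appended); `fixed=True` models the fix
# (copy up to the previously allocated capacity).

class ColumnVector:
    def __init__(self, capacity):
        self.data = [0] * capacity
        self.capacity = capacity
        self.elements_appended = 0  # only append() updates this

    def put(self, index, value):
        self.data[index] = value    # does NOT bump elements_appended

    def append(self, value):
        self.put(self.elements_appended, value)
        self.elements_appended += 1

    def reserve(self, new_capacity, fixed=True):
        new_data = [0] * new_capacity
        # Buggy behavior copied only elements_appended entries; the fix
        # copies everything that was previously allocated.
        n = self.capacity if fixed else self.elements_appended
        new_data[:n] = self.data[:n]
        self.data, self.capacity = new_data, new_capacity
```

With `put(3, 42)` followed by a buggy `reserve`, index 3 reads back as 0; with the fixed copy it survives reallocation.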
…nToStructs

## What changes were proposed in this pull request?
A fix for the same problem was made in #17693 but ignored `JsonToStructs`. This PR applies the same fix to `JsonToStructs`.

## How was this patch tested?
Regression test

Author: Burak Yavuz <brkyvz@gmail.com>

Closes #17826 from brkyvz/SPARK-20549.
…rs,Items

Add Python API for `ALSModel` methods `recommendForAllUsers`, `recommendForAllItems`.

## How was this patch tested?
New doc tests.

Author: Nick Pentreath <nickp@za.ibm.com>

Closes #17622 from MLnick/SPARK-20300-pyspark-recall.
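For intuition, `recommendForAllUsers` returns, for each user, the top-k items by predicted score. A plain-Python sketch of that shape (dot-product scores stand in for the ALS model; all names here are illustrative, not the PySpark API):

```python
# Hypothetical model of what recommendForAllUsers computes: for each user,
# the k items with the highest predicted score (user factors . item factors).

def recommend_for_all_users(user_factors, item_factors, k):
    recs = {}
    for user, uf in user_factors.items():
        scores = [
            (item, sum(u * i for u, i in zip(uf, itf)))
            for item, itf in item_factors.items()
        ]
        scores.sort(key=lambda t: t[1], reverse=True)
        recs[user] = scores[:k]  # top-k (item, score) pairs per user
    return recs
```

`recommendForAllItems` is the mirror image: top-k users per item.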
…h Hive Metastore

### What changes were proposed in this pull request?
This is a follow-up of enabling test cases in DDLSuite with Hive Metastore. It consists of the following remaining tasks:
- Run all the `alter table` and `drop table` DDL tests against data source tables when using Hive metastore.
- Do not run any `alter table` and `drop table` DDL test against Hive serde tables when using InMemoryCatalog.
- Re-enable `alter table: set serde partition` and `alter table: set serde` tests for Hive serde tables.

### How was this patch tested?
N/A

Author: Xiao Li <gatorsmile@gmail.com>

Closes #17524 from gatorsmile/cleanupDDLSuite.
## What changes were proposed in this pull request?
Doc only

## How was this patch tested?
Manual

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #17828 from felixcheung/rnotfamily.
In the previous patch I deprecated StorageStatus, but not the method in SparkContext that exposes that class publicly. So deprecate the method too.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #17824 from vanzin/SPARK-20421.
GulajavaMinistudio pushed a commit that referenced this pull request on Apr 22, 2020
### What changes were proposed in this pull request?
To support formatted explain for AQE.
### Why are the changes needed?
AQE does not support formatted explain yet. It's good to support it for better user experience, debugging, etc.
Before:
```
== Physical Plan ==
AdaptiveSparkPlan (1)
+- * HashAggregate (unknown)
   +- CustomShuffleReader (unknown)
      +- ShuffleQueryStage (unknown)
         +- Exchange (unknown)
            +- * HashAggregate (unknown)
               +- * Project (unknown)
                  +- * BroadcastHashJoin Inner BuildRight (unknown)
                     :- * LocalTableScan (unknown)
                     +- BroadcastQueryStage (unknown)
                        +- BroadcastExchange (unknown)
                           +- LocalTableScan (unknown)

(1) AdaptiveSparkPlan
Output [4]: [k#7, count(v1)#32L, sum(v1)#33L, avg(v2)#34]
Arguments: HashAggregate(keys=[k#7], functions=[count(1), sum(cast(v1#8 as bigint)), avg(cast(v2#19 as bigint))]), AdaptiveExecutionContext(org.apache.spark.sql.SparkSession@104ab57b), [PlanAdaptiveSubqueries(Map())], false
```
After:
```
== Physical Plan ==
AdaptiveSparkPlan (14)
+- * HashAggregate (13)
   +- CustomShuffleReader (12)
      +- ShuffleQueryStage (11)
         +- Exchange (10)
            +- * HashAggregate (9)
               +- * Project (8)
                  +- * BroadcastHashJoin Inner BuildRight (7)
                     :- * Project (2)
                     :  +- * LocalTableScan (1)
                     +- BroadcastQueryStage (6)
                        +- BroadcastExchange (5)
                           +- * Project (4)
                              +- * LocalTableScan (3)

(1) LocalTableScan [codegen id : 2]
Output [2]: [_1#x, _2#x]
Arguments: [_1#x, _2#x]

(2) Project [codegen id : 2]
Output [2]: [_1#x AS k#x, _2#x AS v1#x]
Input [2]: [_1#x, _2#x]

(3) LocalTableScan [codegen id : 1]
Output [2]: [_1#x, _2#x]
Arguments: [_1#x, _2#x]

(4) Project [codegen id : 1]
Output [2]: [_1#x AS k#x, _2#x AS v2#x]
Input [2]: [_1#x, _2#x]

(5) BroadcastExchange
Input [2]: [k#x, v2#x]
Arguments: HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint))), [id=#x]

(6) BroadcastQueryStage
Output [2]: [k#x, v2#x]
Arguments: 0

(7) BroadcastHashJoin [codegen id : 2]
Left keys [1]: [k#x]
Right keys [1]: [k#x]
Join condition: None

(8) Project [codegen id : 2]
Output [3]: [k#x, v1#x, v2#x]
Input [4]: [k#x, v1#x, k#x, v2#x]

(9) HashAggregate [codegen id : 2]
Input [3]: [k#x, v1#x, v2#x]
Keys [1]: [k#x]
Functions [3]: [partial_count(1), partial_sum(cast(v1#x as bigint)), partial_avg(cast(v2#x as bigint))]
Aggregate Attributes [4]: [count#xL, sum#xL, sum#x, count#xL]
Results [5]: [k#x, count#xL, sum#xL, sum#x, count#xL]

(10) Exchange
Input [5]: [k#x, count#xL, sum#xL, sum#x, count#xL]
Arguments: hashpartitioning(k#x, 5), true, [id=#x]

(11) ShuffleQueryStage
Output [5]: [sum#xL, k#x, sum#x, count#xL, count#xL]
Arguments: 1

(12) CustomShuffleReader
Input [5]: [k#x, count#xL, sum#xL, sum#x, count#xL]
Arguments: coalesced

(13) HashAggregate [codegen id : 3]
Input [5]: [k#x, count#xL, sum#xL, sum#x, count#xL]
Keys [1]: [k#x]
Functions [3]: [count(1), sum(cast(v1#x as bigint)), avg(cast(v2#x as bigint))]
Aggregate Attributes [3]: [count(1)#xL, sum(cast(v1#x as bigint))#xL, avg(cast(v2#x as bigint))#x]
Results [4]: [k#x, count(1)#xL AS count(v1)#xL, sum(cast(v1#x as bigint))#xL AS sum(v1)#xL, avg(cast(v2#x as bigint))#x AS avg(v2)#x]

(14) AdaptiveSparkPlan
Output [4]: [k#x, count(v1)#xL, sum(v1)#xL, avg(v2)#x]
Arguments: isFinalPlan=true
```
### Does this PR introduce any user-facing change?
No; this is a new feature shipped along with AQE in Spark 3.0.
### How was this patch tested?
Added a query file: `explain-aqe.sql` and a unit test.
Closes apache#28271 from Ngone51/support_formatted_explain_for_aqe.
Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
GulajavaMinistudio pushed a commit that referenced this pull request on Jul 20, 2020
…or its output partitioning
### What changes were proposed in this pull request?
Currently, the `BroadcastHashJoinExec`'s `outputPartitioning` only uses the streamed side's `outputPartitioning`. However, if the join type of `BroadcastHashJoinExec` is an inner-like join, the build side's info (the join keys) can be added to `BroadcastHashJoinExec`'s `outputPartitioning`.
For example,
```Scala
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "500")
val t1 = (0 until 100).map(i => (i % 5, i % 13)).toDF("i1", "j1")
val t2 = (0 until 100).map(i => (i % 5, i % 13)).toDF("i2", "j2")
val t3 = (0 until 20).map(i => (i % 7, i % 11)).toDF("i3", "j3")
val t4 = (0 until 100).map(i => (i % 5, i % 13)).toDF("i4", "j4")
// join1 is a sort merge join.
val join1 = t1.join(t2, t1("i1") === t2("i2"))
// join2 is a broadcast join where t3 is broadcasted.
val join2 = join1.join(t3, join1("i1") === t3("i3"))
// Join on the column from the broadcasted side (i3).
val join3 = join2.join(t4, join2("i3") === t4("i4"))
join3.explain
```
You can see that `Exchange hashpartitioning(i2#103, 200)` is introduced because there is no output partitioning info from the build side.
```
== Physical Plan ==
*(6) SortMergeJoin [i3#29], [i4#40], Inner
:- *(4) Sort [i3#29 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(i3#29, 200), true, [id=#55]
: +- *(3) BroadcastHashJoin [i1#7], [i3#29], Inner, BuildRight
: :- *(3) SortMergeJoin [i1#7], [i2#18], Inner
: : :- *(1) Sort [i1#7 ASC NULLS FIRST], false, 0
: : : +- Exchange hashpartitioning(i1#7, 200), true, [id=#28]
: : : +- LocalTableScan [i1#7, j1#8]
: : +- *(2) Sort [i2#18 ASC NULLS FIRST], false, 0
: : +- Exchange hashpartitioning(i2#18, 200), true, [id=#29]
: : +- LocalTableScan [i2#18, j2#19]
: +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint))), [id=#34]
: +- LocalTableScan [i3#29, j3#30]
+- *(5) Sort [i4#40 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(i4#40, 200), true, [id=#39]
+- LocalTableScan [i4#40, j4#41]
```
This PR proposes to introduce output partitioning for the build side for `BroadcastHashJoinExec` if the streamed side has a `HashPartitioning` or a collection of `HashPartitioning`s.
There is a new internal config, `spark.sql.execution.broadcastHashJoin.outputPartitioningExpandLimit`, which limits the number of partitionings a `HashPartitioning` can expand to. It can be set to "0" to disable this feature.
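The expansion works because, for an inner equi-join, each streamed-side key in the `HashPartitioning` can be replaced by its equivalent build-side join key, so a partitioning over n such keys expands to up to 2^n equivalent partitionings; the limit caps that blow-up. An illustrative sketch (not Spark's implementation):

```python
# Sketch of the partitioning expansion described above. key_map maps a
# streamed-side key to its equivalent build-side join key. Names are
# illustrative only.
from itertools import product

def expand_partitioning(part_keys, key_map, limit):
    # For each key, the set of interchangeable choices: itself, plus the
    # build-side equivalent if the key appears in the join condition.
    choices = [(k, key_map[k]) if k in key_map else (k,) for k in part_keys]
    expanded = []
    for combo in product(*choices):  # up to 2^n combinations
        if len(expanded) >= limit:   # the config caps the expansion
            break
        expanded.append(list(combo))
    return expanded
```

In the example above, a streamed-side `HashPartitioning(i1)` with join condition `i1 = i3` expands to partitionings over both `i1` and `i3`, which is what lets the later join on `i4 = i3` reuse the existing distribution instead of shuffling.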
### Why are the changes needed?
To remove unnecessary shuffle.
### Does this PR introduce _any_ user-facing change?
Yes, now the shuffle in the above example can be eliminated:
```
== Physical Plan ==
*(5) SortMergeJoin [i3#108], [i4#119], Inner
:- *(3) Sort [i3#108 ASC NULLS FIRST], false, 0
: +- *(3) BroadcastHashJoin [i1#86], [i3#108], Inner, BuildRight
: :- *(3) SortMergeJoin [i1#86], [i2#97], Inner
: : :- *(1) Sort [i1#86 ASC NULLS FIRST], false, 0
: : : +- Exchange hashpartitioning(i1#86, 200), true, [id=#120]
: : : +- LocalTableScan [i1#86, j1#87]
: : +- *(2) Sort [i2#97 ASC NULLS FIRST], false, 0
: : +- Exchange hashpartitioning(i2#97, 200), true, [id=#121]
: : +- LocalTableScan [i2#97, j2#98]
: +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint))), [id=#126]
: +- LocalTableScan [i3#108, j3#109]
+- *(4) Sort [i4#119 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(i4#119, 200), true, [id=#130]
+- LocalTableScan [i4#119, j4#120]
```
### How was this patch tested?
Added new tests.
Closes apache#28676 from imback82/broadcast_join_output.
Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
GulajavaMinistudio pushed a commit that referenced this pull request on Oct 13, 2021
…on to make Java 17 compatible with Java 8
### What changes were proposed in this pull request?
The `date_format` function with the `B` format behaves differently on Java 8 and Java 17; `select date_format('2018-11-17 13:33:33.333', 'B')` in `datetime-formatting-invalid.sql` demonstrates this.
The case result with Java 8 is
```
-- !query
select date_format('2018-11-17 13:33:33.333', 'B')
-- !query schema
struct<>
-- !query output
java.lang.IllegalArgumentException
Unknown pattern letter: B
```
and the case result with Java 17 is
```
- datetime-formatting-invalid.sql *** FAILED ***
datetime-formatting-invalid.sql
Expected "struct<[]>", but got "struct<[date_format(2018-11-17 13:33:33.333, B):string]>" Schema did not match for query #34
select date_format('2018-11-17 13:33:33.333', 'B'): -- !query
select date_format('2018-11-17 13:33:33.333', 'B')
-- !query schema
struct<date_format(2018-11-17 13:33:33.333, B):string>
-- !query output
in the afternoon (SQLQueryTestSuite.scala:469)
```
We found that this is due to Java 17's new support for the `B` pattern letter:
```
'B' is used to represent Pattern letters to output a day period in Java 17
* Pattern Count Equivalent builder methods
* ------- ----- --------------------------
* B 1 appendDayPeriodText(TextStyle.SHORT)
* BBBB 4 appendDayPeriodText(TextStyle.FULL)
* BBBBB 5 appendDayPeriodText(TextStyle.NARROW)
```
And through [http://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html](http://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html), we can confirm that format `B` is currently not documented/supported for the `date_format` function.
So the main change of this PR is to manually disable format `B` for the `date_format` function in `DateTimeFormatterHelper`, making Java 17 behave the same as Java 8.
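A hedged sketch of the general approach (illustrative Python, not the Scala source): validate the pattern up front and reject letters outside an explicit whitelist, so behavior no longer depends on which letters the runtime's formatter happens to support. The whitelist below is hypothetical, not Spark's actual supported set.

```python
# Reject unsupported datetime pattern letters before delegating to the
# underlying formatter. SUPPORTED is a hypothetical whitelist for
# illustration; 'B' is deliberately absent.

SUPPORTED = set("GyMLdEuQqFDhHmsSaVzOXxZ")

def check_pattern(pattern):
    in_literal = False
    for ch in pattern:
        if ch == "'":
            in_literal = not in_literal  # text inside quotes is literal
            continue
        if in_literal or not ch.isalpha():
            continue  # literals and separators need no validation
        if ch not in SUPPORTED:
            raise ValueError(f"Unknown pattern letter: {ch}")
    return pattern
```

With such a check, `'B'` fails fast with the same "Unknown pattern letter" error on every JDK, instead of succeeding only where the runtime understands day periods.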
### Why are the changes needed?
Ensure that Java 17 and Java 8 have the same behavior.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
- Pass the Jenkins or GitHub Action
- Manual test `SQLQueryTestSuite` with JDK 17
**Before**
```
- datetime-formatting-invalid.sql *** FAILED ***
datetime-formatting-invalid.sql
Expected "struct<[]>", but got "struct<[date_format(2018-11-17 13:33:33.333, B):string]>" Schema did not match for query #34
select date_format('2018-11-17 13:33:33.333', 'B'): -- !query
select date_format('2018-11-17 13:33:33.333', 'B')
-- !query schema
struct<date_format(2018-11-17 13:33:33.333, B):string>
-- !query output
in the afternoon (SQLQueryTestSuite.scala:469)
```
**After**
The test `select date_format('2018-11-17 13:33:33.333', 'B')` in `datetime-formatting-invalid.sql` passed.
Closes apache#34237 from LuciferYang/SPARK-36970.
Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
## What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)

(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Please review http://spark.apache.org/contributing.html before opening a pull request.