forked from apache/spark
Merge #3
Merged
Conversation
### What changes were proposed in this pull request? This PR aims to upgrade to Apache ORC 1.5.7. ### Why are the changes needed? This will bring the latest bug fixes. The following is the full release note. - https://issues.apache.org/jira/projects/ORC/versions/12345702 ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Pass the Jenkins with the existing tests. Closes #26276 from dongjoon-hyun/SPARK-29617. Authored-by: Dongjoon Hyun <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>
…ethods
### What changes were proposed in this pull request?
This PR fixes incorrect code that checks the parameter lengths of split methods in `subexpressionEliminationForWholeStageCodegen`.
### Why are the changes needed?
Bug fix.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes #26267 from maropu/SPARK-29008-FOLLOWUP.
Authored-by: Takeshi Yamamuro <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
…f node to unary node
### What changes were proposed in this pull request?
Make `LocalShuffleReaderExec` a unary node instead of a leaf node.
### Why are the changes needed?
When `LocalShuffleReaderExec` is a leaf node, there is a potential issue: the leaf node hides the running query stage and makes an unfinished query stage appear finished when its parent query stage is created. This PR makes the node a unary node.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Existing tests
Closes #26250 from JkSelf/updateLeafNodeofLocalReaderToUnaryExecNode.
Authored-by: jiake <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
…o 4.2.4 for Scala 2.13.1" This reverts commit 5fc363b.
## What changes were proposed in this pull request?
After this PR, we can create and register Hive UDFs that accept UDT types, like `VectorUDT` and `MatrixUDT`. These UDTs are widely used in Spark machine learning.
## How was this patch tested?
Added a new unit test.
Closes #24961 from uncleGen/SPARK-28158.
Authored-by: uncleGen <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
…C Tab
### What changes were proposed in this pull request?
In this PR, extend pagination support to the session table in the `JDBC/ODBC` tab.
### Why are the changes needed?
Sometimes many clients are connected and a lot of session info is shown in the session tab; paging it gives a better view.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Manually verified. After the PR:
<img width="1440" alt="Screen Shot 2019-10-25 at 4 19 27 PM" src="https://user-images.githubusercontent.com/46485123/67555133-50ae9900-f743-11e9-8724-9624a691f232.png">
<img width="1434" alt="Screen Shot 2019-10-25 at 4 19 38 PM" src="https://user-images.githubusercontent.com/46485123/67555165-5906d400-f743-11e9-819e-73f86a333dd3.png">
Closes #26253 from AngersZhuuuu/SPARK-29599.
Lead-authored-by: angerszhu <[email protected]>
Co-authored-by: AngersZhuuuu <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
…like v2 commands
### What changes were proposed in this pull request?
Add LoadDataStatement and make LOAD DATA INTO TABLE go through the same catalog/table resolution framework as v2 commands.
### Why are the changes needed?
It's important to make all the commands have the same table resolution behavior, to avoid confusing end-users. e.g.
```
USE my_catalog
DESC t // success and describe the table t from my_catalog
LOAD DATA INPATH 'filepath' INTO TABLE t // report table not found as there is no table t in the session catalog
```
### Does this PR introduce any user-facing change?
Yes. When running LOAD DATA INTO TABLE, Spark fails the command if the current catalog is set to a v2 catalog, or the table name specifies a v2 catalog.
### How was this patch tested?
Unit tests.
Closes #26178 from viirya/SPARK-29521.
Lead-authored-by: Liang-Chi Hsieh <[email protected]>
Co-authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
### What changes were proposed in this pull request? This patch deduplicates code blocks in Kafka data source which are being repeated multiple times in a method. ### Why are the changes needed? This change would simplify the code and open possibility to simplify future code whenever fields are added to Kafka writer schema. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Existing UTs. Closes #26158 from HeartSaVioR/MINOR-deduplicate-kafka-source. Authored-by: Jungtaek Lim (HeartSaVioR) <[email protected]> Signed-off-by: Marcelo Vanzin <[email protected]>
### What changes were proposed in this pull request?
This PR adds `DROP NAMESPACE` support for V2 catalogs.
### Why are the changes needed?
Currently, you cannot drop namespaces for v2 catalogs.
### Does this PR introduce any user-facing change?
The user can now perform the following to drop a namespace `ns` inside the `mycatalog` V2 catalog:
```SQL
CREATE NAMESPACE mycatalog.ns
DROP NAMESPACE mycatalog.ns
SHOW NAMESPACES IN mycatalog # Will show no namespaces
```
### How was this patch tested?
Added unit tests.
Closes #26262 from imback82/drop_namespace.
Authored-by: Terry Kim <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
### What changes were proposed in this pull request?
Add single-column input/output support in Imputer.
### Why are the changes needed?
Currently, Imputer only has multi-column support. This PR adds single-column input/output support.
### Does this PR introduce any user-facing change?
Yes, it adds single-column input/output support in Imputer via `Imputer.setInputCol` and `Imputer.setOutputCol`.
### How was this patch tested?
Added unit tests.
Closes #26247 from huaxingao/spark-29566.
Authored-by: Huaxin Gao <[email protected]>
Signed-off-by: zhengruifeng <[email protected]>
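For illustration, a minimal sketch of the single-column API described above; the column names are hypothetical and `df` is assumed to contain a numeric `value` column:
```scala
import org.apache.spark.ml.feature.Imputer

// Single-column usage added by this change ("value" / "value_imputed" are illustrative names).
val imputer = new Imputer()
  .setStrategy("mean")
  .setInputCol("value")
  .setOutputCol("value_imputed")
// val model = imputer.fit(df)   // df assumed to exist with a numeric "value" column
```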
…omDataGenerator to create row for UDT backed by ArrayType
### What changes were proposed in this pull request?
There are some issues observed in `HiveUserDefinedTypeSuite."Support UDT in Hive UDF"`:
1) Neither the function (TestUDF) nor the test takes the nullable "point" column into account.
2) `ExamplePointUDT.sqlType` is ArrayType, which doesn't say how many elements are expected, so RandomDataGenerator may provide fewer elements than needed.
This patch fixes `HiveUserDefinedTypeSuite."Support UDT in Hive UDF"` to make the "point" column non-nullable, and to not use RandomDataGenerator to create rows for a UDT backed by ArrayType.
### Why are the changes needed?
CI builds are failing with high frequency.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Manually tested by running the tests locally multiple times.
Closes #26287 from HeartSaVioR/SPARK-28158-FOLLOWUP.
Authored-by: Jungtaek Lim (HeartSaVioR) <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
### What changes were proposed in this pull request?
To push the built jars to the maven release repository, we need to remove the 'SNAPSHOT' tag from the version name. Made the following changes in this PR:
* Update all the `3.0.0-SNAPSHOT` version names to `3.0.0-preview`
* Update the PySpark version from `3.0.0.dev0` to `3.0.0`
**Please note those changes were generated by the release script in the past, but this time since we manually add tags on master branch, we need to manually apply those changes too.** We shall revert the changes after the 3.0.0-preview release has passed.
### Why are the changes needed?
To make the maven release repository accept the built jars.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
N/A
Closes #26243 from jiangxb1987/3.0.0-preview-prepare.
Lead-authored-by: Xingbo Jiang <[email protected]>
Co-authored-by: HyukjinKwon <[email protected]>
Signed-off-by: Xingbo Jiang <[email protected]>
This reverts commit 5eddbb5.
…talog/table like v2 commands
### What changes were proposed in this pull request?
Add AlterTableRecoverPartitionsStatement and make ALTER TABLE ... RECOVER PARTITIONS go through the same catalog/table resolution framework as v2 commands.
### Why are the changes needed?
It's important to make all the commands have the same table resolution behavior, to avoid confusing end-users. e.g.
```
USE my_catalog
DESC t // success and describe the table t from my_catalog
ALTER TABLE t RECOVER PARTITIONS // report table not found as there is no table t in the session catalog
```
### Does this PR introduce any user-facing change?
Yes. When running ALTER TABLE ... RECOVER PARTITIONS, Spark fails the command if the current catalog is set to a v2 catalog, or the table name specifies a v2 catalog.
### How was this patch tested?
Unit tests.
Closes #26269 from huaxingao/spark-29612.
Authored-by: Huaxin Gao <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
…input/output
### What changes were proposed in this pull request?
Add single-column input/output support in OneHotEncoder.
### Why are the changes needed?
Currently, OneHotEncoder only has multi-column support. It makes sense to support a single column as well.
### Does this PR introduce any user-facing change?
Yes, via `OneHotEncoder.setInputCol` and `OneHotEncoder.setOutputCol`.
### How was this patch tested?
Unit test
Closes #26265 from huaxingao/spark-29565.
Authored-by: Huaxin Gao <[email protected]>
Signed-off-by: Liang-Chi Hsieh <[email protected]>
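As with Imputer above, a minimal hedged sketch of the single-column API; the column names are illustrative and `df` is assumed to contain a numeric `category` index column:
```scala
import org.apache.spark.ml.feature.OneHotEncoder

// Single-column usage added by this change ("category" / "categoryVec" are illustrative names).
val encoder = new OneHotEncoder()
  .setInputCol("category")
  .setOutputCol("categoryVec")
// val model = encoder.fit(df)   // df assumed to exist with a numeric "category" column
```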
…valUtils ### What changes were proposed in this pull request? In the PR, I propose to move all static methods from the `CalendarInterval` class to the `IntervalUtils` object. All those methods are rewritten from Java to Scala. ### Why are the changes needed? - For consistency with other helper methods. Such methods were placed to the helper object `IntervalUtils`, see #26190 - Taking into account that `CalendarInterval` will be fully exposed to users in the future (see #25022), it would be nice to clean it up by moving service methods to an internal object. ### Does this PR introduce any user-facing change? No ### How was this patch tested? - By moved tests from `CalendarIntervalSuite` to `IntervalUtilsSuite` - By existing test suites Closes #26261 from MaxGekk/refactoring-calendar-interval. Authored-by: Maxim Gekk <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
### Why are the changes needed?
Starting from Spark 2.3, the SHS REST API endpoint `/applications/<app_id>/jobs/` does not include `description` in the returned JobData. This was not the case up to Spark 2.2.
In this PR I've added the mentioned field.
### Does this PR introduce any user-facing change?
Yes.
Old API response:
```
[ {
"jobId" : 0,
"name" : "foreach at <console>:26",
"submissionTime" : "2019-10-28T12:41:54.301GMT",
"completionTime" : "2019-10-28T12:41:54.731GMT",
"stageIds" : [ 0 ],
"jobGroup" : "test",
"status" : "SUCCEEDED",
"numTasks" : 1,
"numActiveTasks" : 0,
"numCompletedTasks" : 1,
"numSkippedTasks" : 0,
"numFailedTasks" : 0,
"numKilledTasks" : 0,
"numCompletedIndices" : 1,
"numActiveStages" : 0,
"numCompletedStages" : 1,
"numSkippedStages" : 0,
"numFailedStages" : 0,
"killedTasksSummary" : { }
} ]
```
New API response:
```
[ {
"jobId" : 0,
"name" : "foreach at <console>:26",
"description" : "job", <= This is the addition here
"submissionTime" : "2019-10-28T13:37:24.107GMT",
"completionTime" : "2019-10-28T13:37:24.613GMT",
"stageIds" : [ 0 ],
"jobGroup" : "test",
"status" : "SUCCEEDED",
"numTasks" : 1,
"numActiveTasks" : 0,
"numCompletedTasks" : 1,
"numSkippedTasks" : 0,
"numFailedTasks" : 0,
"numKilledTasks" : 0,
"numCompletedIndices" : 1,
"numActiveStages" : 0,
"numCompletedStages" : 1,
"numSkippedStages" : 0,
"numFailedStages" : 0,
"killedTasksSummary" : { }
} ]
```
### How was this patch tested?
Extended + existing unit tests.
Manually:
* Open spark-shell
```
scala> sc.setJobGroup("test", "job", false);
scala> val foo = sc.textFile("/user/foo.txt");
foo: org.apache.spark.rdd.RDD[String] = /user/foo.txt MapPartitionsRDD[1] at textFile at <console>:24
scala> foo.foreach(println);
```
* Access REST API `http://SHS-host:port/api/v1/applications/<app-id>/jobs/`
Closes #26295 from gaborgsomogyi/SPARK-29637.
Authored-by: Gabor Somogyi <[email protected]>
Signed-off-by: Marcelo Vanzin <[email protected]>
## What changes were proposed in this pull request?
Now, `RepartitionByExpression` is available via the Dataset method `Dataset.repartition()`, but Spark SQL has no equivalent functionality. In Hive, we can use `distribute by`, so it's worth adding a hint to support such a function. Similar jira: [SPARK-24940](https://issues.apache.org/jira/browse/SPARK-24940)
## Why are the changes needed?
Make repartition hints consistent with the repartition API.
## Does this PR introduce any user-facing change?
This PR intends to support the queries below:
```
// SQL cases
 - sql("SELECT /*+ REPARTITION(c) */ * FROM t")
 - sql("SELECT /*+ REPARTITION(1, c) */ * FROM t")
 - sql("SELECT /*+ REPARTITION_BY_RANGE(c) */ * FROM t")
 - sql("SELECT /*+ REPARTITION_BY_RANGE(1, c) */ * FROM t")
```
## How was this patch tested?
UT
Closes #25464 from ulysses-you/SPARK-28746.
Lead-authored-by: ulysses <[email protected]>
Co-authored-by: ulysses <[email protected]>
Signed-off-by: Takeshi Yamamuro <[email protected]>
### What changes were proposed in this pull request?
```
postgres=# select date '2001-09-28' + integer '7';
?column?
------------
2001-10-05
(1 row)

postgres=# select integer '7';
int4
------
7
(1 row)
```
Add support for the typed integer literal expression from PostgreSQL.
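As an illustration, the same expression could be run through Spark SQL once this support is in place; this is a hedged sketch, not output copied from a build:
```scala
// Typed integer literal combined with date arithmetic, mirroring the PostgreSQL example above.
spark.sql("SELECT date '2001-09-28' + integer '7'").show()
// Expected to yield 2001-10-05, matching PostgreSQL.
```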
### Why are the changes needed?
SPARK-27764 Feature Parity between PostgreSQL and Spark
### Does this PR introduce any user-facing change?
Supports typed integer literals in SQL.
### How was this patch tested?
Added unit tests.
Closes #26291 from yaooqinn/SPARK-29629.
Authored-by: Kent Yao <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
### What changes were proposed in this pull request?
This PR proposes to allow `array_contains` to take column instances.
### Why are the changes needed?
For consistent support between the Scala and Python APIs. Scala allows column instances in `array_contains`:
Scala:
```scala
import org.apache.spark.sql.functions._
val df = Seq(Array("a", "b", "c"), Array.empty[String]).toDF("data")
df.select(array_contains($"data", lit("a"))).show()
```
Python:
```python
from pyspark.sql.functions import array_contains, lit
df = spark.createDataFrame([(["a", "b", "c"],), ([],)], ['data'])
df.select(array_contains(df.data, lit("a"))).show()
```
However, the PySpark side does not allow this.
### Does this PR introduce any user-facing change?
Yes.
```python
from pyspark.sql.functions import array_contains, lit
df = spark.createDataFrame([(["a", "b", "c"],), ([],)], ['data'])
df.select(array_contains(df.data, lit("a"))).show()
```
**Before:**
```
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/.../spark/python/pyspark/sql/functions.py", line 1950, in array_contains
return Column(sc._jvm.functions.array_contains(_to_java_column(col), value))
File "/.../spark/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py", line 1277, in __call__
File "/.../spark/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py", line 1241, in _build_args
File "/.../spark/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py", line 1228, in _get_args
File "/.../spark/python/lib/py4j-0.10.8.1-src.zip/py4j/java_collections.py", line 500, in convert
File "/.../spark/python/pyspark/sql/column.py", line 344, in __iter__
raise TypeError("Column is not iterable")
TypeError: Column is not iterable
```
**After:**
```
+-----------------------+
|array_contains(data, a)|
+-----------------------+
| true|
| false|
+-----------------------+
```
### How was this patch tested?
Manually tested and added a doctest.
Closes #26288 from HyukjinKwon/SPARK-29627.
Authored-by: HyukjinKwon <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
### What changes were proposed in this pull request? This PR ports window.sql from PostgreSQL regression tests https://github.com/postgres/postgres/blob/REL_12_STABLE/src/test/regress/sql/window.sql#L913-L1278 The expected results can be found in the link: https://github.com/postgres/postgres/blob/REL_12_STABLE/src/test/regress/expected/window.out ### Why are the changes needed? To ensure compatibility with PostgreSQL. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Pass the Jenkins. And, Comparison with PgSQL results. Closes #26238 from DylanGuedes/spark-29110. Authored-by: DylanGuedes <[email protected]> Signed-off-by: HyukjinKwon <[email protected]>
### What changes were proposed in this pull request?
This PR fixes and uses proper HTML tags in the docs.
### Why are the changes needed?
Fix documentation format errors.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
N/A
Closes #26302 from uncleGen/minor-doc.
Authored-by: uncleGen <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
…HiveClient in SparkSQLEnv
### What changes were proposed in this pull request?
This patch fixes the issue that external listeners are not initialized properly when `spark.sql.hive.metastore.jars` is set to either "maven" or custom list of jar.
("builtin" is not a case here - all jars in Spark classloader are also available in separate classloader)
The culprit is lazy initialization (lazy val or passing a builder function) combined with the thread context classloader. HiveClient leverages IsolatedClientLoader to load Hive and the relevant libraries properly - to avoid messing up the Spark classpath, it uses a separate classloader that leverages the thread context classloader.
But there's a problematic case - SessionState is initialized while HiveClient has changed the thread context classloader from the Spark classloader to the Hive isolated one, so streaming query listeners are loaded from the changed classloader during SessionState initialization.
This patch forces initializing SessionState in SparkSQLEnv to avoid such case.
### Why are the changes needed?
ClassNotFoundException could occur in spark-sql with specific configuration, as explained above.
### Does this PR introduce any user-facing change?
No, as I don't think end users assume the classloader of external listeners is only containing jars for Hive client.
### How was this patch tested?
New UT added which fails on master branch and passes with the patch.
The error message with master branch when running UT:
```
java.lang.IllegalArgumentException: Error while instantiating 'org.apache.spark.sql.hive.HiveSessionStateBuilder':;
org.apache.spark.sql.AnalysisException: java.lang.IllegalArgumentException: Error while instantiating 'org.apache.spark.sql.hive.HiveSessionStateBuilder':;
at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:109)
at org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:221)
at org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:147)
at org.apache.spark.sql.internal.SharedState.externalCatalog(SharedState.scala:137)
at org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:59)
at org.apache.spark.sql.hive.thriftserver.SparkSQLEnvSuite.$anonfun$new$2(SparkSQLEnvSuite.scala:44)
at org.apache.spark.sql.hive.thriftserver.SparkSQLEnvSuite.withSystemProperties(SparkSQLEnvSuite.scala:61)
at org.apache.spark.sql.hive.thriftserver.SparkSQLEnvSuite.$anonfun$new$1(SparkSQLEnvSuite.scala:43)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
at org.scalatest.Transformer.apply(Transformer.scala:22)
at org.scalatest.Transformer.apply(Transformer.scala:20)
at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:149)
at org.scalatest.FunSuiteLike.invokeWithFixture$1(FunSuiteLike.scala:184)
at org.scalatest.FunSuiteLike.$anonfun$runTest$1(FunSuiteLike.scala:196)
at org.scalatest.SuperEngine.runTestImpl(Engine.scala:286)
at org.scalatest.FunSuiteLike.runTest(FunSuiteLike.scala:196)
at org.scalatest.FunSuiteLike.runTest$(FunSuiteLike.scala:178)
at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:56)
at org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:221)
at org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:214)
at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:56)
at org.scalatest.FunSuiteLike.$anonfun$runTests$1(FunSuiteLike.scala:229)
at org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:393)
at scala.collection.immutable.List.foreach(List.scala:392)
at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:381)
at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:376)
at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:458)
at org.scalatest.FunSuiteLike.runTests(FunSuiteLike.scala:229)
at org.scalatest.FunSuiteLike.runTests$(FunSuiteLike.scala:228)
at org.scalatest.FunSuite.runTests(FunSuite.scala:1560)
at org.scalatest.Suite.run(Suite.scala:1124)
at org.scalatest.Suite.run$(Suite.scala:1106)
at org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1560)
at org.scalatest.FunSuiteLike.$anonfun$run$1(FunSuiteLike.scala:233)
at org.scalatest.SuperEngine.runImpl(Engine.scala:518)
at org.scalatest.FunSuiteLike.run(FunSuiteLike.scala:233)
at org.scalatest.FunSuiteLike.run$(FunSuiteLike.scala:232)
at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:56)
at org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:213)
at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210)
at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208)
at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:56)
at org.scalatest.tools.SuiteRunner.run(SuiteRunner.scala:45)
at org.scalatest.tools.Runner$.$anonfun$doRunRunRunDaDoRunRun$13(Runner.scala:1349)
at org.scalatest.tools.Runner$.$anonfun$doRunRunRunDaDoRunRun$13$adapted(Runner.scala:1343)
at scala.collection.immutable.List.foreach(List.scala:392)
at org.scalatest.tools.Runner$.doRunRunRunDaDoRunRun(Runner.scala:1343)
at org.scalatest.tools.Runner$.$anonfun$runOptionallyWithPassFailReporter$24(Runner.scala:1033)
at org.scalatest.tools.Runner$.$anonfun$runOptionallyWithPassFailReporter$24$adapted(Runner.scala:1011)
at org.scalatest.tools.Runner$.withClassLoaderAndDispatchReporter(Runner.scala:1509)
at org.scalatest.tools.Runner$.runOptionallyWithPassFailReporter(Runner.scala:1011)
at org.scalatest.tools.Runner$.run(Runner.scala:850)
at org.scalatest.tools.Runner.run(Runner.scala)
at org.jetbrains.plugins.scala.testingSupport.scalaTest.ScalaTestRunner.runScalaTest2(ScalaTestRunner.java:133)
at org.jetbrains.plugins.scala.testingSupport.scalaTest.ScalaTestRunner.main(ScalaTestRunner.java:27)
Caused by: java.lang.IllegalArgumentException: Error while instantiating 'org.apache.spark.sql.hive.HiveSessionStateBuilder':
at org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$instantiateSessionState(SparkSession.scala:1054)
at org.apache.spark.sql.SparkSession.$anonfun$sessionState$2(SparkSession.scala:156)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:154)
at org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:151)
at org.apache.spark.sql.SparkSession.$anonfun$new$3(SparkSession.scala:105)
at scala.Option.map(Option.scala:230)
at org.apache.spark.sql.SparkSession.$anonfun$new$1(SparkSession.scala:105)
at org.apache.spark.sql.internal.SQLConf$.get(SQLConf.scala:164)
at org.apache.spark.sql.hive.client.HiveClientImpl.newState(HiveClientImpl.scala:183)
at org.apache.spark.sql.hive.client.HiveClientImpl.<init>(HiveClientImpl.scala:127)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:300)
at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:421)
at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:314)
at org.apache.spark.sql.hive.HiveExternalCatalog.client$lzycompute(HiveExternalCatalog.scala:68)
at org.apache.spark.sql.hive.HiveExternalCatalog.client(HiveExternalCatalog.scala:67)
at org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$databaseExists$1(HiveExternalCatalog.scala:221)
at scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23)
at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:99)
... 58 more
Caused by: java.lang.ClassNotFoundException: test.custom.listener.DummyQueryExecutionListener
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.spark.util.Utils$.classForName(Utils.scala:206)
at org.apache.spark.util.Utils$.$anonfun$loadExtensions$1(Utils.scala:2746)
at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:245)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at scala.collection.TraversableLike.flatMap(TraversableLike.scala:245)
at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:242)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:108)
at org.apache.spark.util.Utils$.loadExtensions(Utils.scala:2744)
at org.apache.spark.sql.util.ExecutionListenerManager.$anonfun$new$1(QueryExecutionListener.scala:83)
at org.apache.spark.sql.util.ExecutionListenerManager.$anonfun$new$1$adapted(QueryExecutionListener.scala:82)
at scala.Option.foreach(Option.scala:407)
at org.apache.spark.sql.util.ExecutionListenerManager.<init>(QueryExecutionListener.scala:82)
at org.apache.spark.sql.internal.BaseSessionStateBuilder.$anonfun$listenerManager$2(BaseSessionStateBuilder.scala:293)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.internal.BaseSessionStateBuilder.listenerManager(BaseSessionStateBuilder.scala:293)
at org.apache.spark.sql.internal.BaseSessionStateBuilder.build(BaseSessionStateBuilder.scala:320)
at org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$instantiateSessionState(SparkSession.scala:1051)
... 80 more
```
Closes #26258 from HeartSaVioR/SPARK-29604.
Authored-by: Jungtaek Lim (HeartSaVioR) <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
### What changes were proposed in this pull request?
Fix the constant so that MICROS_PER_MONTH = DAYS_PER_MONTH * MICROS_PER_DAY.
### Why are the changes needed?
Bug fix.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Added unit tests.
Closes #26321 from yaooqinn/SPARK-29653.
Authored-by: Kent Yao <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
### What changes were proposed in this pull request? This PR ports create_view.sql from PostgreSQL regression tests https://github.com/postgres/postgres/blob/REL_12_STABLE/src/test/regress/sql/create_view.sql The expected results can be found in the link: https://github.com/postgres/postgres/blob/REL_12_STABLE/src/test/regress/expected/create_view.out ### Why are the changes needed? To check behaviour differences between Spark and PostgreSQL ### Does this PR introduce any user-facing change? No ### How was this patch tested? Pass the Jenkins. And, Comparison with PgSQL results Closes #26290 from maropu/SPARK-29120. Authored-by: Takeshi Yamamuro <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>
### What changes were proposed in this pull request?
This PR aims to additionally deprecate `Python 3.4 ~ 3.5`, i.e. the versions prior to 3.6.
### Why are the changes needed?
Since `Python 3.8` is already out, we will focus on supporting Python 3.6/3.7/3.8.
### Does this PR introduce any user-facing change?
Yes. It's highly recommended to use Python 3.6/3.7. We will verify Python 3.8 before the Apache Spark 3.0.0 release.
### How was this patch tested?
N/A (this is a doc-only change).
Closes #26326 from dongjoon-hyun/SPARK-29668.
Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
…er}-preview` in release script
### What changes were proposed in this pull request?
Update `release-build.sh`, to allow pyspark version name format `${versionNumber}-preview`, otherwise the release script won't generate pyspark release tarballs.
### How was this patch tested?
Tested locally.
Closes #26306 from jiangxb1987/buildPython.
Authored-by: Xingbo Jiang <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
### What changes were proposed in this pull request?
`release-build.sh` fails to publish a release under dry run mode with the following error message:
```
/opt/spark-rm/release-build.sh: line 429: pushd: spark-repo-g4MBm/org/apache/spark: No such file or directory
```
We need to at least run the `mvn clean install` command once to create the `$tmp_repo` path, but now those steps are all skipped under dry-run mode. This PR fixes the issue.
### How was this patch tested?
Tested locally.
Closes #26329 from jiangxb1987/dryrun.
Authored-by: Xingbo Jiang <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
### What changes were proposed in this pull request?
To push the built jars to the maven release repository, we need to remove the 'SNAPSHOT' tag from the version name. Made the following changes in this PR:
* Update all the `3.0.0-SNAPSHOT` version names to `3.0.0-preview`
* Update the sparkR version number check logic to allow jvm versions like `3.0.0-preview`
**Please note those changes were generated by the release script in the past, but this time since we manually add tags on master branch, we need to manually apply those changes too.** We shall revert the changes after the 3.0.0-preview release has passed.
### Why are the changes needed?
To make the maven release repository accept the built jars.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
N/A
This reverts commit 007c873.
### What changes were proposed in this pull request?
This PR aims to upgrade ASM to 7.2.
- https://issues.apache.org/jira/browse/XBEAN-322 (Upgrade to ASM 7.2)
- https://asm.ow2.io/versions.html
### Why are the changes needed?
This will bring the following patches.
- 317875: Infinite loop when parsing invalid method descriptor
- 317873: Add support for RET instruction in AdviceAdapter
- 317872: Throw an exception if visitFrame used incorrectly
- add support for Java 14
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Pass the Jenkins with the existing UTs.
Closes #26373 from dongjoon-hyun/SPARK-29729.
Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
### What changes were proposed in this pull request?
This PR replaces `jira_client` API calls for read-only access with public Apache JIRA REST API invocations.
### Why are the changes needed?
This will reduce the number of authenticated API invocations. I hope this will reduce the chance of CAPTCHAs from the Apache JIRA site.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Manual.
```
$ echo 26375 > .github-jira-max
$ dev/github_jira_sync.py
Read largest PR number previously seen: 26375
Retrieved 100 JIRA PR's from Github
1 PR's remain after excluding visted ones
Checking issue SPARK-29731
Writing largest PR number seen: 26376
Build PR dictionary
SPARK-29731 26376
Set 26376 with labels "PROJECT INFRA"
```
Closes #26376 from dongjoon-hyun/SPARK-29731.
Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
…quals`
### What changes were proposed in this pull request?
The `assertEquals` method of JUnit Assert requires the first parameter to be the expected value. In this PR, I propose to change the order of parameters where the expected value is passed as the second parameter.
### Why are the changes needed?
The wrong order of assert parameters is confusing when the assert fails and the parameters have a special string representation. For example:
```java
assertEquals(input1.add(input2), new CalendarInterval(5, 5, 367200000000L));
```
```
java.lang.AssertionError:
Expected :interval 5 months 5 days 101 hours
Actual :interval 5 months 5 days 102 hours
```
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
By existing tests.
Closes #26377 from MaxGekk/fix-order-in-assert-equals.
Authored-by: Maxim Gekk <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
### What changes were proposed in this pull request? Update the version of dropwizard metrics that Spark uses for metrics to 4.1.x, from 3.2.x. ### Why are the changes needed? This helps JDK 9+ support, per for example dropwizard/metrics#1236 ### Does this PR introduce any user-facing change? No, although downstream users with custom metrics may be affected. ### How was this patch tested? Existing tests. Closes #26332 from srowen/SPARK-29674. Authored-by: Sean Owen <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>
### What changes were proposed in this pull request? This reverts commit 91d9901. ### Why are the changes needed? CRAN check is pretty important for R package, we should enable it. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Unit tests. Closes #26381 from viirya/revert-SPARK-24152. Authored-by: Liang-Chi Hsieh <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>
… high order functions
### What changes were proposed in this pull request?
Support non-reserved keywords to be used in higher-order functions.
### Why are the changes needed?
These keywords are non-reserved.
### Does this PR introduce any user-facing change?
Yes, all non-reserved keywords can be used in higher-order functions correctly.
### How was this patch tested?
Added unit tests.
Closes #26366 from yaooqinn/SPARK-29722.
Authored-by: Kent Yao <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
…oin/Aggregation
### What changes were proposed in this pull request?
This is somewhat a complement of #21853. A `Sort` without a `Limit` operator in a `Join` subquery is useless; it's the same case in `GroupBy` when the aggregation function is order irrelevant, such as `count` or `sum`. This PR tries to remove this kind of `Sort` operator in the SQL optimizer.
### Why are the changes needed?
For example, `select count(1) from (select a from test1 order by a)` is equal to `select count(1) from (select a from test1)`, and `select * from (select a from test1 order by a) t1 join (select b from test2) t2 on t1.a = t2.b` is equal to `select * from (select a from test1) t1 join (select b from test2) t2 on t1.a = t2.b`. Removing the useless `Sort` operator can improve performance.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Added a new UT, `RemoveSortInSubquerySuite.scala`.
Closes #26011 from WangGuangxin/remove_sorts.
Authored-by: wangguangxin.cn <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
### What changes were proposed in this pull request?
In the PR, I propose to change `CalendarInterval.toString`:
- to skip the `week` unit
- to convert `milliseconds` and `microseconds` into the fractional part of the `seconds` unit.
### Why are the changes needed?
To improve readability.
### Does this PR introduce any user-facing change?
Yes
### How was this patch tested?
- By `CalendarIntervalSuite` and `IntervalUtilsSuite`
- `literals.sql`, `datetime.sql` and `interval.sql`
Closes #26367 from MaxGekk/interval-to-string-format.
Authored-by: Maxim Gekk <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
…rTableChangeColumnCommand
### What changes were proposed in this pull request?
If the resolved table is a v1 table, AlterTableAlterColumnStatement falls back to the v1 AlterTableChangeColumnCommand.
### Why are the changes needed?
To make the catalog/table lookup logic consistent.
### Does this PR introduce any user-facing change?
Yes, an ALTER TABLE ALTER COLUMN command previously failed on v1 tables. After this, it falls back to the v1 AlterTableChangeColumnCommand.
### How was this patch tested?
Unit test.
Closes #26354 from viirya/SPARK-29353.
Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
…_functions ### What changes were proposed in this pull request? This PR aims to replace `cc` with `java` in `test_pipe_functions` of `test_rdd.py`. ### Why are the changes needed? Currently, `test_rdd.py` assumes `cc` installation during `rdd.pipe` tests. This requires us to install `gcc` for python testing. If we use `java`, we can have the same test coverage and we don't need to install it because it's already installed in `PySpark` test environment. This will be helpful when we build a dockerized parallel testing environment. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Pass the existing PySpark tests. Closes #26383 from dongjoon-hyun/SPARK-29739. Authored-by: Dongjoon Hyun <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>
…values
### What changes were proposed in this pull request?
- Retry the tests for special date-time values on failure. The tests can potentially fail when reference values are taken before midnight and the test code resolves special values after midnight. The retry guarantees that the tests run during the same day.
- Simplify getting the current timestamp via `Instant.now()`. This should avoid any issues of converting the current local datetime to an instant. For example, the same local time can be mapped to 2 instants when clocks are turned back 1 hour on a daylight saving date.
- Extract common code to SQLHelper.
- Set the tested zoneId to the session time zone in `DateTimeUtilsSuite`.
### Why are the changes needed?
To make the tests more stable.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
By existing test suites `Date`/`TimestampFormatterSuite` and `DateTimeUtilsSuite`.
Closes #26380 from MaxGekk/retry-on-fail.
Authored-by: Maxim Gekk <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
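A brief illustration of why `Instant.now()` is the safer choice for "current timestamp"; this is a hedged sketch, and the zone and times are examples only:
```scala
import java.time._

// An Instant is an unambiguous point on the UTC timeline.
val now: Instant = Instant.now()

// A local datetime can be ambiguous: when clocks are turned back, the same wall-clock
// time maps to two instants, and ZonedDateTime has to pick one of the two offsets.
val ambiguous = LocalDateTime.of(2019, 10, 27, 2, 30)        // during the EU fall-back hour
val resolved  = ambiguous.atZone(ZoneId.of("Europe/Berlin")) // resolves to one of the two instants
println(s"$now / $resolved")
```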
…le disk failures
### What changes were proposed in this pull request?
The `getFile` method in `DiskBlockManager` may return a file within an existing subdirectory. But when a disk failure occurs on that subdirectory, the file is inaccessible, and a FileNotFoundException like the following usually tears down the entire task, which is a bit heavy:
```
java.io.FileNotFoundException: /mnt/dfs/4/yarn/local/usercache/da_haitao/appcache/application_1568691584183_1953115/blockmgr-cc4689f5-eddd-4b99-8af4-4166a86ec30b/10/temp_shuffle_79be5049-d1d5-4a81-8e67-4ef236d3834f (No such file or directory)
  at java.io.FileOutputStream.open0(Native Method)
  at java.io.FileOutputStream.open(FileOutputStream.java:270)
  at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
  at org.apache.spark.storage.DiskBlockObjectWriter.initialize(DiskBlockObjectWriter.scala:103)
  at org.apache.spark.storage.DiskBlockObjectWriter.open(DiskBlockObjectWriter.scala:116)
  at org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:249)
  at org.apache.spark.shuffle.sort.ShuffleExternalSorter.writeSortedFile(ShuffleExternalSorter.java:209)
  at org.apache.spark.shuffle.sort.ShuffleExternalSorter.closeAndGetSpills(ShuffleExternalSorter.java:416)
  at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.closeAndWriteOutput(UnsafeShuffleWriter.java:230)
  at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:190)
  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
  at org.apache.spark.scheduler.Task.run(Task.scala:109)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  at java.lang.Thread.run(Thread.java:745)
```
This change pre-touches the temporary file to check whether the parent directory is available. If it is NOT, we try another possibly healthy disk until we reach the max attempts.
### Why are the changes needed?
Re-running the whole task is much heavier than picking another healthy disk to output the temporary results.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Added UT.
Closes #25962 from yaooqinn/SPARK-29285.
Authored-by: Kent Yao <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
…s, to make it consistent with other tabs
### What changes were proposed in this pull request?
Currently, the JDBC/ODBC tab in the Web UI doesn't support hiding tables. Other tabs in the Web UI, like Jobs, Stages, SQL etc., support hiding tables (refer #22592). In this PR, added the support for hiding tables in the JDBC/ODBC tab as well.
### Why are the changes needed?
Tables in the Spark UI need hide and show features when they have very many records. Sometimes you do not care about the records of one table and just want to see the contents of the next table, but you have to scroll for a long time to reach it.
### Does this PR introduce any user-facing change?
No, except the support for hiding tables.
### How was this patch tested?
Manually tested.
Closes #26353 from shahidki31/hideTable.
Authored-by: shahid <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
…/table like v2 commands
### What changes were proposed in this pull request?
Add AlterTableAddPartitionStatement and make ALTER TABLE ... ADD PARTITION go through the same catalog/table resolution framework as v2 commands.
### Why are the changes needed?
It's important to make all the commands have the same table resolution behavior, to avoid confusing end-users. e.g.
```
USE my_catalog
DESC t // success and describe the table t from my_catalog
ALTER TABLE t ADD PARTITION (id=1) // report table not found as there is no table t in the session catalog
```
### Does this PR introduce any user-facing change?
Yes. When running ALTER TABLE ... ADD PARTITION, Spark fails the command if the current catalog is set to a v2 catalog, or the table name specifies a v2 catalog.
### How was this patch tested?
Unit tests
Closes #26369 from imback82/spark-29678.
Authored-by: Terry Kim <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
### What changes were proposed in this pull request?
Support interval values in `sum`.
### Why are the changes needed?
Part of SPARK-27764, Feature Parity between PostgreSQL and Spark.
### Does this PR introduce any user-facing change?
Yes, `sum` can evaluate intervals.
### How was this patch tested?
Added unit tests.
Closes #26325 from yaooqinn/SPARK-29663.
Authored-by: Kent Yao <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
### What changes were proposed in this pull request?
The current checkstyle configuration doesn't cover all source folders. Since we split the Hive code into separate folders to support multiple Hive versions, those folders should be checked too.
### Why are the changes needed?
Fix a build issue.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
No
Closes #26385 from AngersZhuuuu/SPARK-29742.
Authored-by: angerszhu <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
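To illustrate the new behavior, a hedged example; the inline-table syntax and interval literals are assumed to be available in the build, and the exact output is not shown:
```scala
// Summing interval values across rows; previously sum() rejected the interval type.
spark.sql(
  "SELECT sum(i) FROM VALUES (interval 1 day), (interval 2 days), (interval 3 hours) AS t(i)"
).show(truncate = false)
```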
…erence ### What changes were proposed in this pull request? Added doc for DROP TABLE and DROP VIEW sql command ### Why are the changes needed? For reference DROP TABLE or DROP VIEW in spark-sql ### Does this PR introduce any user-facing change? It updates DROP TABLE or DROP VIEW reference doc ### How was this patch tested? <details> <summary> Attached the Snap</summary> DROP TABLE   DROP VIEW   </details> Closes #25533 from shivusondur/jiraUSEDB. Authored-by: shivusondur <[email protected]> Signed-off-by: Sean Owen <[email protected]>
### What changes were proposed in this pull request? Added the document reference for SHOW TABLE EXTENDED sql command ### Why are the changes needed? For User reference ### Does this PR introduce any user-facing change? yes, it provides document reference for SHOW TABLE EXTENDED sql command ### How was this patch tested? verified in snap <details> <summary> Attached the Snap</summary>     </details> Closes #25632 from shivusondur/jiraSHOWTABLE. Authored-by: shivusondur <[email protected]> Signed-off-by: Sean Owen <[email protected]>
…ild is ### What changes were proposed in this pull request? `SampleExec` has a bug that it sets `needCopyResult` to false as long as the `withReplacement` parameter is false. This causes problems if its child needs to copy the result, e.g. a join. ### Why are the changes needed? to fix a correctness issue ### Does this PR introduce any user-facing change? Yes, the result will be corrected. ### How was this patch tested? a new test Closes #26387 from cloud-fan/sample-bug. Authored-by: Wenchen Fan <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>
…sses
### What changes were proposed in this pull request?
I propose 2 new methods for `CalendarInterval`:
- `extractAsPeriod()` returns the date part of an interval as an instance of `java.time.Period`
- `extractAsDuration()` returns the time part of an interval as an instance of `java.time.Duration`
For example:
```scala
scala> import org.apache.spark.unsafe.types.CalendarInterval
scala> import java.time._
scala> val i = spark.sql("select interval 1 year 3 months 4 days 10 hours 30 seconds").collect()(0).getAs[CalendarInterval](0)
scala> LocalDate.of(2019, 11, 1).plus(i.extractAsPeriod())
res8: java.time.LocalDate = 2021-02-05
scala> ZonedDateTime.parse("2019-11-01T12:13:14Z").plus(i.extractAsPeriod()).plus(i.extractAsDuration())
res9: java.time.ZonedDateTime = 2021-02-05T22:13:44Z
```
### Why are the changes needed?
Taking into account that `CalendarInterval` has been already partially exposed to users via the collect operation, and probably it will be fully exposed in the future, it could be convenient for users to get the date and time parts of intervals as java classes:
- to avoid unnecessary dependency from Spark's classes in user code
- to easily use external libraries that accept standard Java classes.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
By new test in `CalendarIntervalSuite`.
Closes #26368 from MaxGekk/interval-java-period-duration.
Authored-by: Maxim Gekk <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
Spark 2.4 added the ability for executor plugins to be loaded into Spark (see SPARK-24918). That feature intentionally skipped the driver to keep changes small, and also because it is possible to load code into the Spark driver using listeners + configuration. But that is a bit awkward, because the listener interface does not provide hooks into a lot of Spark functionality. This change reworks the executor plugin interface to also extend to the driver.
- there's a "SparkPlugin" main interface that provides APIs to load driver and executor components.
- custom metric support (added in SPARK-28091) can be used by plugins to register metrics both in the driver process and in executors.
- a communication channel now exists that allows the plugin's executor components to send messages to the plugin's driver component easily, using the existing Spark RPC system.
The latter was a feature intentionally left out of the original plugin design (also because it didn't include a driver component). To avoid polluting the "org.apache.spark" namespace, I added the new interfaces to the "org.apache.spark.api" package, which seems like a better place in any case. The actual implementation is kept in an internal package.
The change includes unit tests for the new interface and features, but I've also been running a custom plugin that extends the new API in real applications.
Closes #26170 from vanzin/SPARK-29397.
Authored-by: Marcelo Vanzin <[email protected]>
Signed-off-by: Marcelo Vanzin <[email protected]>
## What changes were proposed in this pull request?
This PR proposes to be case insensitive when matching dialects via the jdbc url prefix.
When I use a jdbc url such as `jdbc:MySQL://localhost/db` to query data through Spark SQL, the result is wrong, but MySQL supports such url writing. Because Spark SQL matches MySQLDialect by the prefix `jdbc:mysql`, `jdbc:MySQL` is not matched with the correct dialect. Therefore, it should be case insensitive when identifying the corresponding dialect through the jdbc url.
https://issues.apache.org/jira/browse/SPARK-28552
## How was this patch tested?
UT.
Closes #25287 from teeyog/sql_dialect.
Lead-authored-by: yong.tian1 <[email protected]>
Co-authored-by: Xingbo Jiang <[email protected]>
Co-authored-by: Chris Martin <[email protected]>
Co-authored-by: Takeshi Yamamuro <[email protected]>
Co-authored-by: Dongjoon Hyun <[email protected]>
Co-authored-by: Kent Yao <[email protected]>
Co-authored-by: teeyog <[email protected]>
Co-authored-by: Maxim Gekk <[email protected]>
Co-authored-by: Ryan Blue <[email protected]>
Signed-off-by: Takeshi Yamamuro <[email protected]>
…ming query
## What changes were proposed in this pull request?
This patch adds the option to clean up files which were completed in a previous batch:
`cleanSource` -> "archive" / "delete" / "off"
The default value is "off", with which Spark does nothing. If "delete" is specified, Spark will simply delete input files. If "archive" is specified, Spark requires the additional config `sourceArchiveDir`, which is used to move input files there. When archiving (via move), the path of input files is retained in the archived paths as a sub-path.
Note that it is only applied to "micro-batch", since for batch all input files must be kept to get the same result across multiple query executions.
## How was this patch tested?
Added UT. Manual test against local disk as well as HDFS.
Closes #22952 from HeartSaVioR/SPARK-20568.
Lead-authored-by: Jungtaek Lim (HeartSaVioR) <[email protected]>
Co-authored-by: Jungtaek Lim (HeartSaVioR) <[email protected]>
Co-authored-by: Jungtaek Lim <[email protected]>
Signed-off-by: Marcelo Vanzin <[email protected]>
### What changes were proposed in this pull request?
This upgrades joda-time from 2.9 to 2.10.5.
### Why are the changes needed?
Joda-Time 2.9 is almost 4 years old; newer releases include bug fixes and time zone database updates.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Existing tests.
Closes #26389 from viirya/upgrade-joda.
Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
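For illustration, a minimal sketch of a plugin providing both components; the package and interface names follow the description above, but treat the exact shape as an assumption rather than the definitive API:
```scala
import org.apache.spark.api.plugin.{DriverPlugin, ExecutorPlugin, SparkPlugin}

// A no-op plugin exposing both a driver and an executor component;
// the interfaces provide default implementations, so empty bodies suffice here.
class MyNoOpPlugin extends SparkPlugin {
  override def driverPlugin(): DriverPlugin = new DriverPlugin {}
  override def executorPlugin(): ExecutorPlugin = new ExecutorPlugin {}
}
// Presumably enabled via a configuration such as spark.plugins=com.example.MyNoOpPlugin.
```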
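A minimal sketch of the idea, illustrative only and not the actual `JdbcDialects` code:
```scala
import java.util.Locale

// Match the dialect by a case-insensitive URL prefix, so "jdbc:MySQL://..."
// is treated the same as "jdbc:mysql://...".
def canHandle(url: String): Boolean =
  url.toLowerCase(Locale.ROOT).startsWith("jdbc:mysql")
```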
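A hedged sketch of how the new options might be used on a file stream source; the paths are illustrative:
```scala
// Archive fully processed input files instead of leaving them in place.
val input = spark.readStream
  .format("text")
  .option("cleanSource", "archive")             // "archive", "delete", or "off" (default)
  .option("sourceArchiveDir", "/data/archive")  // required when cleanSource = "archive"
  .load("/data/input")
```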
### What changes were proposed in this pull request?
This PR introduces a new SQL command: `SHOW CURRENT NAMESPACE`.
### Why are the changes needed?
Datasource V2 supports multiple catalogs/namespaces and having `SHOW CURRENT NAMESPACE` to retrieve the current catalog/namespace info would be useful.
### Does this PR introduce any user-facing change?
Yes, the user can perform the following:
```
scala> spark.sql("SHOW CURRENT NAMESPACE").show
+-------------+---------+
| catalog|namespace|
+-------------+---------+
|spark_catalog| default|
+-------------+---------+
scala> spark.sql("USE testcat.ns1.ns2").show
scala> spark.sql("SHOW CURRENT NAMESPACE").show
+-------+---------+
|catalog|namespace|
+-------+---------+
|testcat| ns1.ns2|
+-------+---------+
```
### How was this patch tested?
Added unit tests.
Closes #26379 from imback82/show_current_catalog.
Authored-by: Terry Kim <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
### What changes were proposed in this pull request? This PR ports groupingsets.sql from PostgreSQL regression tests https://github.com/postgres/postgres/blob/REL_12_STABLE/src/test/regress/sql/groupingsets.sql The expected results can be found in the link: https://github.com/postgres/postgres/blob/REL_12_STABLE/src/test/regress/expected/groupingsets.out ### Why are the changes needed? To check behaviour differences between Spark and PostgreSQL ### Does this PR introduce any user-facing change? No ### How was this patch tested? Pass the Jenkins. And, Comparison with PgSQL results Closes #26352 from maropu/GgroupingSets. Authored-by: Takeshi Yamamuro <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>
… a temporary view in an expression
### What changes were proposed in this pull request?
Disallow creating a permanent view that references a temporary view in **expressions**.
### Why are the changes needed?
Creating a permanent view that references a temporary view is currently disallowed. For example,
```SQL
# The following throws org.apache.spark.sql.AnalysisException
# Not allowed to create a permanent view `per_view` by referencing a temporary view `tmp`;
CREATE VIEW per_view AS SELECT t1.a, t2.b FROM base_table t1, (SELECT * FROM tmp) t2
```
However, the following is allowed.
```SQL
CREATE VIEW per_view AS SELECT * FROM base_table WHERE EXISTS (SELECT * FROM tmp);
```
This PR fixes the bug where temporary views used inside expressions are not checked.
### Does this PR introduce any user-facing change?
Yes. Now the following SQL query throws an exception as expected:
```SQL
# The following throws org.apache.spark.sql.AnalysisException
# Not allowed to create a permanent view `per_view` by referencing a temporary view `tmp`;
CREATE VIEW per_view AS SELECT * FROM base_table WHERE EXISTS (SELECT * FROM tmp);
```
### How was this patch tested?
Added new unit tests.
Closes #26361 from imback82/spark-29630.
Authored-by: Terry Kim <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
…log/table like v2 commands
### What changes were proposed in this pull request?
Add AlterTableSerDePropertiesStatement and make ALTER TABLE ... SET SERDE/SERDEPROPERTIES go through the same catalog/table resolution framework as v2 commands.
### Why are the changes needed?
It's important to make all the commands have the same table resolution behavior, to avoid confusing end-users. e.g.
```
USE my_catalog
DESC t // success and describe the table t from my_catalog
ALTER TABLE t SET SERDE 'org.apache.class' // report table not found as there is no table t in the session catalog
```
### Does this PR introduce any user-facing change?
Yes. When running ALTER TABLE ... SET SERDE/SERDEPROPERTIES, Spark fails the command if the current catalog is set to a v2 catalog, or the table name specifies a v2 catalog.
### How was this patch tested?
Unit tests.
Closes #26374 from huaxingao/spark_29695.
Authored-by: Huaxin Gao <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
### What changes were proposed in this pull request? This PR ports limit.sql from PostgreSQL regression tests https://github.com/postgres/postgres/blob/REL_12_STABLE/src/test/regress/sql/limit.sql The expected results can be found in the link: https://github.com/postgres/postgres/blob/REL_12_STABLE/src/test/regress/expected/limit.out ### Why are the changes needed? To check behaviour differences between Spark and PostgreSQL ### Does this PR introduce any user-facing change? No ### How was this patch tested? Pass the Jenkins. And, Comparison with PgSQL results Closes #26311 from maropu/SPARK-29648. Authored-by: Takeshi Yamamuro <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>