Sync with master #1
Merged
…ering on write ### What changes were proposed in this pull request? This PR adds connector interfaces proposed in the [design doc](https://docs.google.com/document/d/1X0NsQSryvNmXBY9kcvfINeYyKC-AahZarUqg3nS1GQs/edit#) for SPARK-23889. **Note**: This PR contains a subset of changes discussed in PR #29066. ### Why are the changes needed? Data sources should be able to request a specific distribution and ordering of data on write. In particular, these scenarios are considered useful: - global sort - cluster data and sort within partitions - local sort within partitions - no sort Please see the design doc above for a more detailed explanation of requirements. ### Does this PR introduce _any_ user-facing change? This PR introduces public changes to the DS V2 by adding a logical write abstraction as we have on the read path as well as additional interfaces to represent distribution and ordering of data (please see the doc for more info). The existing `Distribution` interface in `read` package is read-specific and not flexible enough like discussed in the design doc. The current proposal is to evolve these interfaces separately until they converge. ### How was this patch tested? This patch adds only interfaces. Closes #30706 from aokolnychyi/spark-23889-interfaces. Authored-by: Anton Okolnychyi <[email protected]> Signed-off-by: Ryan Blue <[email protected]>
### What changes were proposed in this pull request? This PR removes unused imports. ### Why are the changes needed? These changes are required to fix the build. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Via `dev/lint-java`. Closes #30767 from aokolnychyi/fix-linter. Authored-by: Anton Okolnychyi <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>
### What changes were proposed in this pull request? Add a developer API for custom driver & executor feature steps. ### Why are the changes needed? While we allow templates for the basis of pod creation, some deployments need more flexibility in how the pods are configured. This adds a developer API for custom deployments. ### Does this PR introduce _any_ user-facing change? New developer API. ### How was this patch tested? Extended tests to verify custom step is applied when configured. Closes #30206 from holdenk/SPARK-33261-allow-people-to-extend-pod-feature-steps. Authored-by: Holden Karau <[email protected]> Signed-off-by: Holden Karau <[email protected]>
…ng on JDK 14 ### What changes were proposed in this pull request? This PR fixes an invalid value for HourOfAmPm when testing on JDK 14. ### Why are the changes needed? To run tests on JDK 14. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? N/A Closes #30754 from wangyum/SPARK-33771. Authored-by: Yuming Wang <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>
### What changes were proposed in this pull request? List partitions returned by the V2 `SHOW PARTITIONS` command in alphabetical order. ### Why are the changes needed? To have the same behavior as: 1. V1 in-memory catalog, see https://github.com/apache/spark/blob/a28ed86a387b286745b30cd4d90b3d558205a5a7/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/InMemoryCatalog.scala#L546 2. V1 Hive catalogs, see https://github.com/apache/spark/blob/fab2995972761503563fa2aa547c67047c51bd33/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L715 ### Does this PR introduce _any_ user-facing change? Yes, after the changes, V2 SHOW PARTITIONS sorts its output. ### How was this patch tested? Added new UT to the base trait `ShowPartitionsSuiteBase` which contains tests for V1 and V2. Closes #30764 from MaxGekk/sort-show-partitions. Authored-by: Max Gekk <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>
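A small usage sketch of the new behavior; the `testcat` catalog, table name, and `USING foo` provider are illustrative assumptions, not part of this PR:
```scala
// Hedged sketch: after this change, V2 SHOW PARTITIONS lists partitions in
// alphabetical order, regardless of the order in which they were added.
spark.sql("CREATE TABLE testcat.ns.tbl (id INT, part STRING) USING foo PARTITIONED BY (part)")
spark.sql("ALTER TABLE testcat.ns.tbl ADD PARTITION (part = 'b')")
spark.sql("ALTER TABLE testcat.ns.tbl ADD PARTITION (part = 'a')")
spark.sql("SHOW PARTITIONS testcat.ns.tbl").show()  // part=a is listed before part=b
```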
### What changes were proposed in this pull request? This changes DSv2 refresh table semantics to also recache the target table itself. ### Why are the changes needed? Currently "REFRESH TABLE" in DSv2 only invalidates all caches referencing the table. With #30403 merged, which adds support for caching a DSv2 table, we should also recache the target table itself to make the behavior consistent with DSv1. ### Does this PR introduce _any_ user-facing change? Yes, now refreshing a table in DSv2 also recaches the target table itself. ### How was this patch tested? Added coverage of this new behavior in the existing UT for the v2 refresh table command. Closes #30742 from sunchao/SPARK-33653. Authored-by: Chao Sun <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>
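A minimal sketch of the new semantics; the catalog and table names are illustrative:
```scala
// Cache a DSv2 table, then refresh it. Before this PR, REFRESH TABLE only
// invalidated caches referencing the table; now the table itself is re-cached.
spark.sql("CACHE TABLE testcat.ns.tbl")
spark.sql("REFRESH TABLE testcat.ns.tbl")
// The cache entry should still exist (and hold refreshed data).
val cached = spark.sharedState.cacheManager.lookupCachedData(spark.table("testcat.ns.tbl"))
assert(cached.isDefined)
```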
…or Python executables
### What changes were proposed in this pull request?
This PR proposes:
- Respect `PYSPARK_PYTHON` and `PYSPARK_DRIVER_PYTHON` environment variables, or `spark.pyspark.python` and `spark.pyspark.driver.python` configurations, in Kubernetes just like other cluster types in Spark.
- Deprecate `spark.kubernetes.pyspark.pythonVersion` and guide users to set the environment variables and configurations for Python executables.
NOTE that `spark.kubernetes.pyspark.pythonVersion` is already a no-op configuration without this PR. Default is `3` and other values are disallowed.
- In order for Python executable settings to be consistently used, fix the `spark.archives` option to unpack into the current working directory in the driver of Kubernetes cluster mode. This behaviour is identical to YARN's cluster mode. By doing this, users can leverage Conda or virtualenv in cluster mode as below:
```python
conda create -y -n pyspark_conda_env -c conda-forge pyarrow pandas conda-pack
conda activate pyspark_conda_env
conda pack -f -o pyspark_conda_env.tar.gz
PYSPARK_PYTHON=./environment/bin/python spark-submit --archives pyspark_conda_env.tar.gz#environment app.py
```
- Removed several pieces of unused or useless code such as `extractS3Key` and `renameResourcesToLocalFS`
### Why are the changes needed?
- To provide a consistent support of PySpark by using `PYSPARK_PYTHON` and `PYSPARK_DRIVER_PYTHON` environment variables, or `spark.pyspark.python` and `spark.pyspark.driver.python` configurations.
- To provide Conda and virtualenv support via `spark.archives` options.
### Does this PR introduce _any_ user-facing change?
Yes:
- `spark.kubernetes.pyspark.pythonVersion` is deprecated.
- `PYSPARK_PYTHON` and `PYSPARK_DRIVER_PYTHON` environment variables, and `spark.pyspark.python` and `spark.pyspark.driver.python` configurations are respected.
### How was this patch tested?
Manually tested via:
```bash
minikube delete
minikube start --cpus 12 --memory 16384
kubectl create namespace spark-integration-test
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: ServiceAccount
metadata:
name: spark
namespace: spark-integration-test
EOF
kubectl create clusterrolebinding spark-role --clusterrole=edit --serviceaccount=spark-integration-test:spark --namespace=spark-integration-test
dev/make-distribution.sh --pip --tgz -Pkubernetes
resource-managers/kubernetes/integration-tests/dev/dev-run-integration-tests.sh --spark-tgz `pwd`/spark-3.2.0-SNAPSHOT-bin-3.2.0.tgz --service-account spark --namespace spark-integration-test
```
Unittests were also added.
Closes #30735 from HyukjinKwon/SPARK-33748.
Authored-by: HyukjinKwon <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
…UnresolvedTable to resolve the identifier ### What changes were proposed in this pull request? This PR proposes to migrate `ALTER TABLE ... RECOVER PARTITIONS` to use `UnresolvedTable` to resolve the table identifier. This allows consistent resolution rules (temp view first, etc.) to be applied for both v1/v2 commands. More info about the consistent resolution rule proposal can be found in [JIRA](https://issues.apache.org/jira/browse/SPARK-29900) or [proposal doc](https://docs.google.com/document/d/1hvLjGA8y_W_hhilpngXVub1Ebv8RsMap986nENCFnrg/edit?usp=sharing). Note that `ALTER TABLE ... RECOVER PARTITIONS` is not supported for v2 tables. ### Why are the changes needed? The PR makes the resolution behavior consistent. For example, ```scala sql("CREATE DATABASE test") sql("CREATE TABLE spark_catalog.test.t (id bigint, val string) USING csv PARTITIONED BY (id)") sql("CREATE TEMPORARY VIEW t AS SELECT 2") sql("USE spark_catalog.test") sql("ALTER TABLE t RECOVER PARTITIONS") // works fine ``` , but after this PR: ``` sql("ALTER TABLE t RECOVER PARTITIONS") org.apache.spark.sql.AnalysisException: t is a temp view. 'ALTER TABLE ... RECOVER PARTITIONS' expects a table; line 1 pos 0 ``` , which is consistent with other commands. ### Does this PR introduce _any_ user-facing change? After this PR, `ALTER TABLE t RECOVER PARTITIONS` in the above example is resolved to a temp view `t` first instead of `spark_catalog.test.t`. ### How was this patch tested? Updated existing tests. Closes #30773 from imback82/alter_table_recover_part_v2. Authored-by: Terry Kim <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
…ON tests ### What changes were proposed in this pull request? 1. Move the `ALTER TABLE .. DROP PARTITION` parsing tests to `AlterTableDropPartitionParserSuite` 2. Place v1 tests for `ALTER TABLE .. DROP PARTITION` from `DDLSuite` and v2 tests from `AlterTablePartitionV2SQLSuite` to the common trait `AlterTableDropPartitionSuiteBase`, so the tests will run for V1, Hive V1 and V2 DS. ### Why are the changes needed? - The unification will allow running common `ALTER TABLE .. DROP PARTITION` tests for DSv1, Hive DSv1, and DSv2 - We can detect missing features and differences between DSv1 and DSv2 implementations. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running new test suites: ``` $ build/sbt -Phive -Phive-thriftserver "test:testOnly *AlterTableDropPartitionParserSuite" $ build/sbt -Phive -Phive-thriftserver "test:testOnly *AlterTableDropPartitionSuite" ``` Closes #30747 from MaxGekk/unify-alter-table-drop-partition-tests. Authored-by: Max Gekk <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
### What changes were proposed in this pull request? If we call `SubqueryExec.executeTake`, it will call `SubqueryExec.execute` which will trigger the codegen of the query plan and create an RDD. However, `SubqueryExec` already has a thread (`SubqueryExec.relationFuture`) to execute the query plan, which means we have 2 threads triggering codegen of the same query plan at the same time. Spark codegen is not thread-safe, as we have places like `HashAggregateExec.bufferVars` that is a shared variable. The bug in `SubqueryExec` may lead to correctness bugs. Since https://issues.apache.org/jira/browse/SPARK-33119, `ScalarSubquery` will call `SubqueryExec.executeTake`, so flaky tests start to appear. This PR fixes the bug by reimplementing #30016 . We should pass the number of rows we want to collect to `SubqueryExec` at planning time, so that we can use `executeTake` inside `SubqueryExec.relationFuture`, and the caller side should always call `SubqueryExec.executeCollect`. This PR also adds checks so that we can make sure only `SubqueryExec.executeCollect` is called. ### Why are the changes needed? fix correctness bug. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? run `build/sbt "sql/testOnly *SQLQueryTestSuite -- -z scalar-subquery-select"` more than 10 times. Previously it fails, now it passes. Closes #30765 from cloud-fan/bug. Authored-by: Wenchen Fan <[email protected]> Signed-off-by: HyukjinKwon <[email protected]>
… to deal with Column type ### What changes were proposed in this pull request? The proposition of this pull request is described in this JIRA ticket: https://issues.apache.org/jira/browse/SPARK-33769 It proposes to improve the next_day function of the sql component to deal with Column type for the parameter dayOfWeek. ### Why are the changes needed? It makes this functionality easier to use. Currently the signature of this function is: > def next_day(date: Column, dayOfWeek: String): Column. It accepts the dayOfWeek parameter as a String. However in some cases, the dayOfWeek is in a Column, i.e. a different value for each row of the dataframe. A current workaround is to use the NextDay function like this: > NextDay(dateCol.expr, dayOfWeekCol.expr). The proposition is to add another signature for this function: > def next_day(date: Column, dayOfWeek: Column): Column In fact it is already the case for some other functions in this scala object, for example: > def date_sub(start: Column, days: Int): Column = date_sub(start, lit(days)) > def date_sub(start: Column, days: Column): Column = withExpr \{ DateSub(start.expr, days.expr) } or > def add_months(startDate: Column, numMonths: Int): Column = add_months(startDate, lit(numMonths)) > def add_months(startDate: Column, numMonths: Column): Column = withExpr { > AddMonths(startDate.expr, numMonths.expr) > } This pull request applies the same idea to the function next_day. ### Does this PR introduce _any_ user-facing change? Yes. With this pull request, users of spark will have a new signature of the function: > def next_day(date: Column, dayOfWeek: Column): Column But the existing function signature should still work: > def next_day(date: Column, dayOfWeek: String): Column So this change should be backward compatible. ### How was this patch tested? The unit tests of the next_day function have been enhanced. They test the dayOfWeek parameter both as String and Column. I also added a test case for the existing signature where the dayOfWeek is an invalid String. This should return null. Closes #30761 from chongguang/SPARK-33769. Authored-by: Chongguang LIU <[email protected]> Signed-off-by: HyukjinKwon <[email protected]>
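A usage sketch of the two signatures, assuming a DataFrame `df` with `date` and `dayOfWeek` columns:
```scala
import org.apache.spark.sql.functions.{col, next_day}

// Existing signature: dayOfWeek as a literal String, the same value for every row.
val byLiteral = df.select(next_day(col("date"), "Mon"))

// Signature added by this PR: dayOfWeek as a Column, so it can differ per row.
val byColumn = df.select(next_day(col("date"), col("dayOfWeek")))
```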
…ds semicolon repeatedly ### What changes were proposed in this pull request? The current `getSimpleMessage` of `AnalysisException` may add a semicolon repeatedly. Here is an example: `select decode()` The output will be: ``` org.apache.spark.sql.AnalysisException Invalid number of arguments for function decode. Expected: 2; Found: 0;; line 1 pos 7 ``` ### Why are the changes needed? Fixes a bug where a semicolon is added repeatedly. ### Does this PR introduce _any_ user-facing change? Yes, the message of AnalysisException will be correct. ### How was this patch tested? Jenkins test. Closes #30724 from beliefer/SPARK-33752. Lead-authored-by: gengjiaan <[email protected]> Co-authored-by: beliefer <[email protected]> Signed-off-by: HyukjinKwon <[email protected]>
…utPartitionings when some columns are dropped from projection
### What changes were proposed in this pull request?
This PR tries to prune unneeded output partitionings in cases where columns are dropped from Project/Aggregate nodes, etc.
### Why are the changes needed?
Consider this query:
select t1.id from t1 JOIN t2 on t1.id = t2.id
This query will have a top-level Project node which projects only t1.id, but the outputPartitioning of this Project node will be: PartitioningCollection(HashPartitioning(t1.id), HashPartitioning(t2.id)).
Since we are not propagating the t2.id column, we can drop HashPartitioning(t2.id) from the output partitioning of the Project node, as sketched below.
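A rough sketch of the scenario; building the tables and inspecting the plan are illustrative, not part of the PR:
```scala
import spark.implicits._

val t1 = spark.range(100).toDF("id")
val t2 = spark.range(100).toDF("id")

// Project only t1.id after the join; HashPartitioning(t2.id) can then be
// pruned from the Project node's outputPartitioning.
val projected = t1.as("t1")
  .join(t2.as("t2"), $"t1.id" === $"t2.id")
  .select($"t1.id")

println(projected.queryExecution.executedPlan.outputPartitioning)
```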
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Added UTs.
Closes #30762 from prakharjain09/SPARK-33758-prune-partitioning.
Authored-by: Prakhar Jain <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
### What changes were proposed in this pull request? This is a followup of #30559 . The default parallelism config in Spark core is not good, as it's unclear where it applies. To not inherit this problem in Spark SQL, this PR refines the default parallelism SQL config, to make it clear that it only applies to leaf nodes. ### Why are the changes needed? Make the config clearer. ### Does this PR introduce _any_ user-facing change? It changes an unreleased config. ### How was this patch tested? existing tests Closes #30736 from cloud-fan/follow. Authored-by: Wenchen Fan <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
### What changes were proposed in this pull request? This PR adds `UpdateTable` to supported plans in `ReplaceNullWithFalseInPredicate`. ### Why are the changes needed? This change allows Spark to optimize update conditions like we optimize filters. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? This PR extends the existing test cases to also cover `UpdateTable`. Closes #30787 from aokolnychyi/spark-33735. Authored-by: Anton Okolnychyi <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>
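A hedged sketch of the kind of statement this now covers, assuming a v2 catalog/table that supports UPDATE (names are illustrative):
```scala
// With UpdateTable supported by ReplaceNullWithFalseInPredicate, a null literal
// used as the update condition can be replaced with FALSE during optimization,
// just as it already is for filters, making the update a no-op.
spark.sql("UPDATE testcat.ns.tbl SET status = 'inactive' WHERE null")
```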
### What changes were proposed in this pull request? This is a simple change to support allocating a specified amount of overhead memory for the driver's mesos container. This is already supported for executors. ### Why are the changes needed? This is needed to keep the driver process from exceeding memory limits and being killed off when running on mesos. ### Does this PR introduce _any_ user-facing change? Yes, it adds a `spark.mesos.driver.memoryOverhead` configuration option. Documentation changes for this option are included in the PR. ### How was this patch tested? Test cases covering allocation of driver memory overhead are included in the changes. ### Other notes This is a second attempt to get this change reviewed, accepted and merged. The original pull request was closed as stale back in January: #21006. For this pull request, I took the original change by pmackles, rebased it onto the current master branch, and added a test case that was requested in the original code review. I'm happy to make any further edits or do anything needed so that this can be included in a future spark release. I keep having to build custom spark distributions so that we can use spark within our mesos clusters. Closes #30739 from dmcwhorter/dmcwhorter-SPARK-22256. Lead-authored-by: David McWhorter <[email protected]> Co-authored-by: Paul Mackles <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>
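A small configuration sketch using the new option; the overhead value is illustrative:
```scala
import org.apache.spark.SparkConf

// Reserve 512 MiB of overhead memory for the driver's Mesos container,
// on top of spark.driver.memory.
val conf = new SparkConf()
  .set("spark.mesos.driver.memoryOverhead", "512")
```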
…atalog.dropPartitions() ### What changes were proposed in this pull request? Throw `NoSuchPartitionsException` from `ALTER TABLE .. DROP PARTITION` for non-existing partitions of a table in the V1 Hive external catalog. ### Why are the changes needed? The behaviour of the Hive external catalog deviates from the V1/V2 in-memory catalogs, which throw `NoSuchPartitionsException`. To improve user experience with Spark SQL, it would be better to throw the same exception. ### Does this PR introduce _any_ user-facing change? Yes, the command throws `NoSuchPartitionsException` instead of the general exception `AnalysisException`. ### How was this patch tested? By running tests for `ALTER TABLE .. DROP PARTITION`: ``` $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableDropPartitionSuite" ``` Closes #30778 from MaxGekk/hive-drop-partition-exception. Authored-by: Max Gekk <[email protected]> Signed-off-by: HyukjinKwon <[email protected]>
### What changes were proposed in this pull request? If the text in the left menu of Spark is too long, it will be hidden.  This PR is to fix the style issue. ### Why are the changes needed? Improve the UI of Spark documentation. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Manual test After changes:  Closes #30786 from gengliangwang/fixDocStyle. Authored-by: Gengliang Wang <[email protected]> Signed-off-by: HyukjinKwon <[email protected]>
…in tests ### What changes were proposed in this pull request? This PR introduces a helper method `withExecutor` that handles the creation of an Executor object and ensures that it is always stopped in a finally block. The tests in ExecutorSuite have been refactored to use this method. ### Why are the changes needed? Recently an issue was discovered that leaked Executors (which are not explicitly stopped after a test) can cause other tests to fail due to the JVM being killed after 10 min. It is therefore crucial that tests always stop the Executor. By introducing this helper method, a simple pattern is established that can be easily adopted in new tests, which reduces the risk of regressions. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Run the ExecutorSuite locally. Closes #30783 from sander-goos/SPARK-33793-close-executors. Authored-by: Sander Goos <[email protected]> Signed-off-by: HyukjinKwon <[email protected]>
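A minimal sketch of the loan pattern described above; the actual helper in `ExecutorSuite` may have a different signature:
```scala
import org.apache.spark.executor.Executor

// Create an Executor for the test body and guarantee it is stopped even when
// the body throws, so a leaked Executor cannot kill the JVM later.
def withExecutor[T](createExecutor: => Executor)(body: Executor => T): T = {
  val executor = createExecutor
  try {
    body(executor)
  } finally {
    executor.stop()
  }
}
```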
… resolve identifier ### What changes were proposed in this pull request? This PR proposes to migrate `UNCACHE TABLE` to use `UnresolvedRelation` to resolve the table/view identifier in Analyzer as discussed https://github.com/apache/spark/pull/30403/files#r532360022. ### Why are the changes needed? To resolve the table/view in the analyzer. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Updated existing tests Closes #30743 from imback82/uncache_v2. Authored-by: Terry Kim <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
…when a table name is altered
### What changes were proposed in this pull request?
This PR proposes to retain the cache's storage level when a table name is altered by `ALTER TABLE ... RENAME TO ...`.
### Why are the changes needed?
Currently, when a table name is altered, the table's cache is refreshed (if it exists), but the storage level is not retained. For example:
```scala
def getStorageLevel(tableName: String): StorageLevel = {
  val table = spark.table(tableName)
  val cachedData = spark.sharedState.cacheManager.lookupCachedData(table).get
  cachedData.cachedRepresentation.cacheBuilder.storageLevel
}
Seq(1 -> "a").toDF("i", "j").write.parquet(path.getCanonicalPath)
sql(s"CREATE TABLE old USING parquet LOCATION '${path.toURI}'")
sql("CACHE TABLE old OPTIONS('storageLevel' 'MEMORY_ONLY')")
val oldStorageLevel = getStorageLevel("old")
sql("ALTER TABLE old RENAME TO new")
val newStorageLevel = getStorageLevel("new")
```
`oldStorageLevel` will be `StorageLevel(memory, deserialized, 1 replicas)` whereas `newStorageLevel` will be `StorageLevel(disk, memory, deserialized, 1 replicas)`, which is the default storage level.
### Does this PR introduce _any_ user-facing change?
Yes, now the storage level for the cache will be retained.
### How was this patch tested?
Added a unit test.
Closes #30774 from imback82/alter_table_rename_cache_fix.
Authored-by: Terry Kim <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
### What changes were proposed in this pull request? This PR proposes to fix the Jenkins job badge: Before:  After:  ### Why are the changes needed? To make people can easily check the status of builds. ### Does this PR introduce _any_ user-facing change? No, dev-only. ### How was this patch tested? Manually tested via using GitHub. Closes #30797 from HyukjinKwon/minor-readme. Authored-by: HyukjinKwon <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>
…updating PySpark coverage

### What changes were proposed in this pull request?
The current Jenkins job fails as below (https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-3.2/1726/console):
```
Generating HTML files for PySpark coverage under /home/jenkins/workspace/spark-master-test-sbt-hadoop-3.2/python/test_coverage/htmlcov
/home/jenkins/workspace/spark-master-test-sbt-hadoop-3.2
Cloning into 'pyspark-coverage-site'...

*** Please tell me who you are.

Run

  git config --global user.email "you@example.com"
  git config --global user.name "Your Name"

to set your account's default identity.
Omit --global to set the identity only in this repository.
```
This PR proposes to set both when committing to the coverage site.

### Why are the changes needed?
To keep the coverage site working.

### Does this PR introduce _any_ user-facing change?
No, dev-only.

### How was this patch tested?
Manually tested in the console, but it has to be merged to test in the Jenkins environment.

Closes #30796 from HyukjinKwon/SPARK-33802.
Authored-by: HyukjinKwon <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
…mand

### What changes were proposed in this pull request?
This PR proposes to sort table properties in the DESCRIBE TABLE command. This is consistent with the DSv2 command as well: https://github.com/apache/spark/blob/e3058ba17cb4512537953eb4ded884e24ee93ba2/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DescribeTableExec.scala#L63

This PR fixes the test case in the Scala 2.13 build as well, where the table properties have a different order in the map.

### Why are the changes needed?
To keep the output deterministic and pretty, and to fix the tests in the Scala 2.13 build. See https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-3.2-scala-2.13/49/testReport/junit/org.apache.spark.sql/SQLQueryTestSuite/describe_sql/
```
describe.sql
Expected "...spark_catalog, view.[query.out.col.2=c, view.referredTempFunctionsNames=[], view.catalogAndNamespace.part.1=default]]", but got "...spark_catalog, view.[catalogAndNamespace.part.1=default, view.query.out.col.2=c, view.referredTempFunctionsNames=[]]]" Result did not match for query #29
DESC FORMATTED v
```

### Does this PR introduce _any_ user-facing change?
Yes, it will change the text output from `DESCRIBE [EXTENDED|FORMATTED] table_name`. Now the table properties are sorted by key.

### How was this patch tested?
Related unit tests were fixed accordingly.

Closes #30799 from HyukjinKwon/SPARK-33803.
Authored-by: HyukjinKwon <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
### What changes were proposed in this pull request? 1. Move common utility functions such as `test()`, `withNsTable()` and `checkPartitions()` to `DDLCommandTestUtils`. 2. Place common settings such as `version`, `catalog`, `defaultUsing`, `sparkConf` to `CommandSuiteBase`. ### Why are the changes needed? To improve code maintenance of the unified tests. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running the affected test suites: ``` $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *ShowPartitionsSuite" $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *ShowTablesSuite" $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableAddPartitionSuite" $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableDropPartitionSuite" ``` Closes #30779 from MaxGekk/refactor-unified-tests. Lead-authored-by: Max Gekk <[email protected]> Co-authored-by: Maxim Gekk <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
…s first ### What changes were proposed in this pull request? As a follow-up of #30045, we modify the RESET command here to respect the session initial configs per session first and then fall back to the `SharedState` conf, which lets each session maintain a different copy of initial configs for resetting. ### Why are the changes needed? To make the RESET command saner. ### Does this PR introduce _any_ user-facing change? Yes, RESET will respect the session initial configs first instead of always going to the system defaults. ### How was this patch tested? Added new tests. Closes #30642 from yaooqinn/SPARK-32991-F. Authored-by: Kent Yao <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
…tion in git command for PySpark coverage
### What changes were proposed in this pull request?
This PR proposes to separate the arguments properly for `-c` options. Otherwise, the space is considered part of the argument:
```
Cloning into 'pyspark-coverage-site'...
unknown option: -c user.name='Apache Spark Test Account'
usage: git [--version] [--help] [-C <path>] [-c <name>=<value>]
[--exec-path[=<path>]] [--html-path] [--man-path] [--info-path]
[-p | --paginate | -P | --no-pager] [--no-replace-objects] [--bare]
[--git-dir=<path>] [--work-tree=<path>] [--namespace=<name>]
<command> [<args>]
[error] running git -c user.name='Apache Spark Test Account' -c user.email='sparktestaccgmail.com' commit -am Coverage report at latest commit in Apache Spark ; received return code 129
```
### Why are the changes needed?
To make the build pass (https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-3.2/1728/console).
### Does this PR introduce _any_ user-facing change?
No, dev-only.
### How was this patch tested?
```python
>>> from sparktestsupport.shellutils import run_cmd
>>> run_cmd([
... "git",
... "-c",
... "user.name='Apache Spark Test Account'",
... "-c",
... "user.email='sparktestaccgmail.com'",
... "commit",
... "-am",
... "Coverage report at latest commit in Apache Spark"])
[SPARK-33802-followup 80d2565a511] Coverage report at latest commit in Apache Spark
1 file changed, 1 insertion(+), 1 deletion(-)
CompletedProcess(args=['git', '-c', "user.name='Apache Spark Test Account'", '-c', "user.email='sparktestaccgmail.com'", 'commit', '-am', 'Coverage report at latest commit in Apache Spark'], returncode=0)
```
I cannot run the e2e test because it requires the environment to have the Jenkins secret.
Closes #30804 from HyukjinKwon/SPARK-33802-followup.
Authored-by: HyukjinKwon <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
…hen a relation is not resolved ### What changes were proposed in this pull request? Based on the discussion #30743 (comment), this PR proposes to remove the command name in AnalysisException message when a relation is not resolved. For some of the commands that use `UnresolvedTable`, `UnresolvedView`, and `UnresolvedTableOrView` to resolve an identifier, when the identifier cannot be resolved, the exception will be something like `Table or view not found for 'SHOW TBLPROPERTIES': badtable`. The command name (`SHOW TBLPROPERTIES` in this case) should be dropped to be consistent with other existing commands. ### Why are the changes needed? To make the exception message consistent. ### Does this PR introduce _any_ user-facing change? Yes, the exception message will be changed from ``` Table or view not found for 'SHOW TBLPROPERTIES': badtable ``` to ``` Table or view not found: badtable ``` for commands that use `UnresolvedTable`, `UnresolvedView`, and `UnresolvedTableOrView` to resolve an identifier. ### How was this patch tested? Updated existing tests. Closes #30794 from imback82/remove_cmd_from_exception_msg. Authored-by: Terry Kim <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
### What changes were proposed in this pull request? The test failures were due to the machines being slow in Jenkins. We switched to Ubuntu 20, if I am not wrong. It looks like all machines are functioning properly now, unlike in the past, and the tests pass without a problem. This PR proposes to enable them back. ### Why are the changes needed? To restore test coverage. ### Does this PR introduce _any_ user-facing change? No, dev-only. ### How was this patch tested? The Jenkins jobs in this PR show whether the tests are still flaky. Closes #30798 from HyukjinKwon/do-not-merge-test. Authored-by: HyukjinKwon <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>
…able expressions

### What changes were proposed in this pull request?
Using the DISTRIBUTE BY clause with a literal to coalesce partitions seems to be a popular pattern in pure SQL data processing. For example:
```
insert into table src select * from values (1), (2), (3) t(a) distribute by 1
```
Users may want the final output to be a single data file, but the reality is not always true. Spark always creates a file for partition 0 whether it contains data or not, so when the data all goes to a partition (IDX > 0), there will always be 2 files there and part-00000 is empty. On the other hand, a lot of empty tasks are launched unnecessarily. When users repeat the insert statement daily, hourly, or minutely, it causes small-file issues.
```
spark-sql> set spark.sql.shuffle.partitions=3;drop table if exists test2;create table test2 using parquet as select * from values (1), (2), (3) t(a) distribute by 1;

kentyaohulk  ~/spark  SPARK-33806  tree /Users/kentyao/Downloads/spark/spark-3.1.0-SNAPSHOT-bin-20201202/spark-warehouse/test2/ -s
/Users/kentyao/Downloads/spark/spark-3.1.0-SNAPSHOT-bin-20201202/spark-warehouse/test2/
├── [   0]  _SUCCESS
├── [ 298]  part-00000-5dc19733-9405-414b-9681-d25c4d3e9ee6-c000.snappy.parquet
└── [ 426]  part-00001-5dc19733-9405-414b-9681-d25c4d3e9ee6-c000.snappy.parquet
```
To avoid this, there are some options you can take:
1. use `distribute by null` to let the data go to partition 0
2. set spark.sql.adaptive.enabled to true for Spark to coalesce automatically
3. use hints instead of `distribute by`
4. set spark.sql.shuffle.partitions to 1

In this PR, we set the partition number to 1 in this particular case.

### Why are the changes needed?
1. avoid small-file issues
2. avoid unnecessary empty tasks when there is no adaptive execution

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
New test.

Closes #30800 from yaooqinn/SPARK-33806.
Authored-by: Kent Yao <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
### What changes were proposed in this pull request? This PR adds support for the latest mkdocs, and makes the sidebar show properly. It works in lower versions too. ### Why are the changes needed? This is a regression in the documentation. ### Does this PR introduce _any_ user-facing change? Technically no, as it's not released yet. It fixes the sidebar list so that it appears properly. ### How was this patch tested? Manually built the docs via `./sql/create-docs.sh` and `open ./sql/site/index.html` Closes #31061 from HyukjinKwon/SPARK-34022. Authored-by: HyukjinKwon <[email protected]> Signed-off-by: HyukjinKwon <[email protected]>
…tors" ### What changes were proposed in this pull request? Add doc for 'like any' and 'like all' operators in sql-ref-syntx-qry-select-like.cmd ### Why are the changes needed? make the usage of 'like any' and 'like all' known to more users ### Does this PR introduce _any_ user-facing change? Yes. <img width="687" alt="Screen Shot 2021-01-06 at 21 10 38" src="https://user-images.githubusercontent.com/692303/103767385-dc1ffb80-5063-11eb-9529-89479531425f.png"> <img width="495" alt="Screen Shot 2021-01-06 at 21 11 06" src="https://user-images.githubusercontent.com/692303/103767391-dde9bf00-5063-11eb-82ce-63bdd11593a1.png"> <img width="406" alt="Screen Shot 2021-01-06 at 21 11 20" src="https://user-images.githubusercontent.com/692303/103767396-df1aec00-5063-11eb-8e81-a192e6c72431.png"> ### How was this patch tested? No tests Closes #31008 from beliefer/SPARK-33977. Lead-authored-by: gengjiaan <[email protected]> Co-authored-by: beliefer <[email protected]> Signed-off-by: Takeshi Yamamuro <[email protected]>
…ault filed.delim to '\t' when user specifies serde ### What changes were proposed in this pull request? Update the migration guide according to #30942 (comment). ### Why are the changes needed? To update the migration guide. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Not needed. Closes #31051 from AngersZhuuuu/SPARK-32685-FOLLOW-UP. Authored-by: angerszhu <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
### What changes were proposed in this pull request? This PR is a follow-up of #31061. It fixes a typo in a document: `Finctions` -> `Functions` ### Why are the changes needed? Make the change better documented. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? N/A Closes #31069 from kiszk/SPARK-34022-followup. Authored-by: Kazuaki Ishizaki <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>
### What changes were proposed in this pull request? This PR aims to add a basis for a columnar encryption test framework by adding `OrcEncryptionSuite` and `FakeKeyProvider`. Please note that we will improve this further in both Apache Spark and Apache ORC in the Apache Spark 3.2.0 timeframe. ### Why are the changes needed? Apache ORC 1.6 supports columnar encryption. ### Does this PR introduce _any_ user-facing change? No. This is for a test case. ### How was this patch tested? Pass the newly added test suite. Closes #31065 from dongjoon-hyun/SPARK-34029. Authored-by: Dongjoon Hyun <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>
… should check if partition expression is empty
### What changes were proposed in this pull request?
Add a check for whether the partition expressions are empty.
### Why are the changes needed?
`spark.range(1).hint("REPARTITION_BY_RANGE")` should keep the default shuffle partition number instead of 1.
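A quick sketch of the expectation; inspecting the plan is only illustrative:
```scala
// A range-repartition hint without partition expressions should keep the
// default spark.sql.shuffle.partitions instead of collapsing to 1 partition.
val df = spark.range(1).hint("REPARTITION_BY_RANGE")
df.explain()  // the exchange should use the default shuffle partition number
```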
### Does this PR introduce _any_ user-facing change?
Yes.
### How was this patch tested?
Add test.
Closes #31074 from ulysses-you/SPARK-33806-FOLLOWUP.
Authored-by: ulysses-you <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
### What changes were proposed in this pull request?
This PR aims to update the SQL documentation about ORC data sources. The new structure looks like the following:
- ORC Implementation
- Vectorized Reader
- Schema Merging
- Zstandard
- Bloom Filters
- Columnar Encryption
- Hive metastore ORC table conversion
- Configuration

### Why are the changes needed?
This document is not up-to-date. Apache Spark 3.2.0 can utilize new improvements from Apache ORC 1.6.6.

### Does this PR introduce _any_ user-facing change?
No, this is documentation.

### How was this patch tested?
Manual.
```
SKIP_API=1 jekyll build
```
(Before/after screenshots omitted.)

Closes #31075 from dongjoon-hyun/SPARK-34036.
Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
…mission
### What changes were proposed in this pull request?
The 3.0.1 CRAN submission failed for the reason below:
```
Found the following (possibly) invalid URLs:
URL: http://jsonlines.org/ (moved to https://jsonlines.org/)
From: man/read.json.Rd
man/write.json.Rd
Status: 200
Message: OK
URL: https://dl.acm.org/citation.cfm?id=1608614 (moved to
https://dl.acm.org/doi/10.1109/MC.2009.263)
From: inst/doc/sparkr-vignettes.html
Status: 200
Message: OK
```
The links are now being redirected. This PR checked all hyperlinks in the docs such as `href{...}` and `url{...}`, and fixed them all in SparkR:
- Fix two problems above.
- Fix http to https
- Fix `https://www.apache.org/ https://spark.apache.org/` -> `https://www.apache.org https://spark.apache.org`.
### Why are the changes needed?
For CRAN submission.
### Does this PR introduce _any_ user-facing change?
Virtually no because it's just cleanup that CRAN requires.
### How was this patch tested?
Manually tested by clicking the links
Closes #31058 from HyukjinKwon/SPARK-34021.
Authored-by: HyukjinKwon <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
### What changes were proposed in this pull request? There is one compilation warning as follows: ``` [WARNING] [Warn] /spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala:1555: [other-match-analysis org.apache.spark.sql.catalyst.catalog.SessionCatalog.lookupFunction.catalogFunction] unreachable code ``` This compilation warning occurs because `NoSuchPermanentFunctionException` is a subclass of `AnalysisException`, so if a `NoSuchPermanentFunctionException` is thrown, it will be caught by `case _: AnalysisException => failFunctionLookup(name)`, which makes `case _: NoSuchPermanentFunctionException => failFunctionLookup(name)` unreachable code. This PR removes `case _: NoSuchPermanentFunctionException => failFunctionLookup(name)` directly because both branches handle the exception in the same way: `failFunctionLookup(name)`. ### Why are the changes needed? Clean up "unreachable code" compilation warnings. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass the Jenkins or GitHub Action Closes #31064 from LuciferYang/SPARK-34028. Authored-by: yangjie01 <[email protected]> Signed-off-by: HyukjinKwon <[email protected]>
… consider deterministic ### What changes were proposed in this pull request? This PR addresses #30865 (review): simplifying conditionals in predicates should take determinism into account. ### Why are the changes needed? Fix a bug. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Unit test. Closes #31067 from wangyum/SPARK-33861-2. Authored-by: Yuming Wang <[email protected]> Signed-off-by: HyukjinKwon <[email protected]>
### What changes were proposed in this pull request?
This PR adds the row count to the `Union` operator when CBO is enabled.
```scala
spark.sql("CREATE TABLE t1 USING parquet AS SELECT id FROM RANGE(10)")
spark.sql("CREATE TABLE t2 USING parquet AS SELECT id FROM RANGE(10)")
spark.sql("ANALYZE TABLE t1 COMPUTE STATISTICS FOR ALL COLUMNS")
spark.sql("ANALYZE TABLE t2 COMPUTE STATISTICS FOR ALL COLUMNS")
spark.sql("set spark.sql.cbo.enabled=true")
spark.sql("SELECT * FROM t1 UNION ALL SELECT * FROM t2").explain("cost")
```
Before this pr:
```
== Optimized Logical Plan ==
Union false, false, Statistics(sizeInBytes=320.0 B)
:- Relation[id#5880L] parquet, Statistics(sizeInBytes=160.0 B, rowCount=10)
+- Relation[id#5881L] parquet, Statistics(sizeInBytes=160.0 B, rowCount=10)
```
After this pr:
```
== Optimized Logical Plan ==
Union false, false, Statistics(sizeInBytes=320.0 B, rowCount=20)
:- Relation[id#2138L] parquet, Statistics(sizeInBytes=160.0 B, rowCount=10)
+- Relation[id#2139L] parquet, Statistics(sizeInBytes=160.0 B, rowCount=10)
```
### Why are the changes needed?
Improve query performance; [`JoinEstimation.estimateInnerOuterJoin`](https://github.com/apache/spark/blob/d6a68e0b67ff7de58073c176dd097070e88ac831/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/JoinEstimation.scala#L55-L156) needs the row count.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Unit test.
Closes #31068 from wangyum/SPARK-34031.
Lead-authored-by: Yuming Wang <[email protected]>
Co-authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
…rovider" This reverts commit 8bb70bf.
…adcast timeout in AQE ### What changes were proposed in this pull request? In AdaptiveSparkPlanExec.getFinalPhysicalPlan, when newStages are generated, sort the new stages by class type to make sure BroadcastQueryStage precedes the others. This makes sure the broadcast jobs are submitted before the map jobs, to avoid waiting for job scheduling and causing a broadcast timeout. ### Why are the changes needed? When AQE is enabled, in getFinalPhysicalPlan, Spark traverses the physical plan bottom-up, creates query stages for the materialized parts via createQueryStages, and materializes those newly created query stages to submit map stages or broadcasts. When a ShuffleQueryStage is materialized before a BroadcastQueryStage, the map job and the broadcast job are submitted almost at the same time, but the map job will hold all the computing resources. If the map job runs slowly (when lots of data needs to be processed and resources are limited), the broadcast job cannot be started (and finished) before spark.sql.broadcastTimeout, thus causing the whole job to fail (introduced in SPARK-31475). The workaround of increasing spark.sql.broadcastTimeout is neither reasonable nor graceful, because the data to broadcast is very small. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? 1. Add UT 2. Test the code using a dev environment in https://issues.apache.org/jira/browse/SPARK-33933 Closes #30998 from zhongyu09/aqe-broadcast. Authored-by: Yu Zhong <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
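A conceptual sketch of the ordering idea, not the actual `AdaptiveSparkPlanExec` code:
```scala
import org.apache.spark.sql.execution.adaptive.{BroadcastQueryStageExec, QueryStageExec}

// Materialize broadcast stages before shuffle map stages so the broadcast job
// is not starved of resources and does not hit spark.sql.broadcastTimeout.
def orderNewStages(newStages: Seq[QueryStageExec]): Seq[QueryStageExec] =
  newStages.sortBy {
    case _: BroadcastQueryStageExec => 0
    case _ => 1
  }
```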
…in spark-sql

### What changes were proposed in this pull request?
This PR helps find the correct bound of a bracketed comment in spark-sql. Here is the log for the UT of SPARK-33100 in CliSuite before:
```
2021-01-05 13:22:34.768 - stdout> spark-sql> /* SELECT 'test';*/ SELECT 'test';
2021-01-05 13:22:41.523 - stderr> Time taken: 6.716 seconds, Fetched 1 row(s)
2021-01-05 13:22:41.599 - stdout> test
2021-01-05 13:22:41.6 - stdout> spark-sql> ;;/* SELECT 'test';*/ SELECT 'test';
2021-01-05 13:22:41.709 - stdout> test
2021-01-05 13:22:41.709 - stdout> spark-sql> /* SELECT 'test';*/;; SELECT 'test';
2021-01-05 13:22:41.902 - stdout> spark-sql> SELECT 'test'; -- SELECT 'test';
2021-01-05 13:22:41.902 - stderr> Time taken: 0.129 seconds, Fetched 1 row(s)
2021-01-05 13:22:41.902 - stderr> Error in query:
2021-01-05 13:22:41.902 - stderr> mismatched input '<EOF>' expecting {'(', 'ADD', 'ALTER', 'ANALYZE', 'CACHE', 'CLEAR', 'COMMENT', 'COMMIT', 'CREATE', 'DELETE', 'DESC', 'DESCRIBE', 'DFS', 'DROP', 'EXPLAIN', 'EXPORT', 'FROM', 'GRANT', 'IMPORT', 'INSERT', 'LIST', 'LOAD', 'LOCK', 'MAP', 'MERGE', 'MSCK', 'REDUCE', 'REFRESH', 'REPLACE', 'RESET', 'REVOKE', 'ROLLBACK', 'SELECT', 'SET', 'SHOW', 'START', 'TABLE', 'TRUNCATE', 'UNCACHE', 'UNLOCK', 'UPDATE', 'USE', 'VALUES', 'WITH'}(line 1, pos 19)
2021-01-05 13:22:42.006 - stderr>
2021-01-05 13:22:42.006 - stderr> == SQL ==
2021-01-05 13:22:42.006 - stderr> /* SELECT 'test';*/
2021-01-05 13:22:42.006 - stderr> -------------------^^^
2021-01-05 13:22:42.006 - stderr>
2021-01-05 13:22:42.006 - stderr> Time taken: 0.226 seconds, Fetched 1 row(s)
2021-01-05 13:22:42.006 - stdout> test
```
The root cause is that `insideBracketedComment` is not accurate. For `/* comment */`, the last character `/` is not inside the bracketed comment and would be treated as the beginning of a statement. This PR fixes that issue.

### Why are the changes needed?
To fix the issue described above.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing UT

Closes #31054 from turboFei/SPARK-33100-followup.
Authored-by: fwang12 <[email protected]>
Signed-off-by: Takeshi Yamamuro <[email protected]>
…umentation ### What changes were proposed in this pull request? This PR proposes to: - Add a link to the PySpark quick start under "Programming Guides" in the main Spark docs - `ML` / `MLlib` -> `MLlib (DataFrame-based)` / `MLlib (RDD-based)` in the API reference page - Mention other user guides as well, because there are guides such as [ML](http://spark.apache.org/docs/latest/ml-guide.html) and [SQL](http://spark.apache.org/docs/latest/sql-programming-guide.html). - Mention other migration guides as well, because PySpark can be affected by them. ### Why are the changes needed? For better documentation. ### Does this PR introduce _any_ user-facing change? It fixes user-facing docs. However, it's not released yet. ### How was this patch tested? Manually tested by running: ```bash cd docs SKIP_SCALADOC=1 SKIP_RDOC=1 SKIP_SQLDOC=1 jekyll serve --watch ``` Closes #31082 from HyukjinKwon/SPARK-34041. Authored-by: HyukjinKwon <[email protected]> Signed-off-by: HyukjinKwon <[email protected]>
…a-sources-hive-tables.md ### What changes were proposed in this pull request? This PR adds new configuration to `sql-data-sources-hive-tables`. ### Why are the changes needed? SPARK-32852 added a new configuration, `spark.sql.hive.metastore.jars.path`. ### Does this PR introduce _any_ user-facing change? Yes, but a document only. ### How was this patch tested? **BEFORE**  **AFTER**  Closes #31085 from dongjoon-hyun/SPARK-34044. Authored-by: Dongjoon Hyun <[email protected]> Signed-off-by: HyukjinKwon <[email protected]>
### What changes were proposed in this pull request? Label both the statuses and ensure the ExecutorPodSnapshot starts with the default config to match. ### Why are the changes needed? The current test depends on the order rather than testing the desired property. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Labeled the containers statuses, observed failures, added the default label as the initialization point, tests passed again. Built Spark, ran on K8s cluster verified no NPE in driver log. Closes #31071 from holdenk/SPARK-34018-finishedExecutorWithRunningSidecar-doesnt-correctly-constructt-the-test-case. Authored-by: Holden Karau <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>
…edRegexColumnNames` in the SQL documents ### What changes were proposed in this pull request? According to #30805 (comment), document `spark.sql.parser.quotedRegexColumnNames`, since users need to know about this from the docs and it's useful. ### Why are the changes needed? To complete the docs. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Not needed. Closes #30816 from AngersZhuuuu/SPARK-33818. Authored-by: angerszhu <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>
### What changes were proposed in this pull request? This changes `ReplaceTableExec`/`AtomicReplaceTableExec`, and uncaches the target table before it is dropped. In addition, this includes some refactoring by moving the `uncacheTable` method to `DataSourceV2Strategy` so that we don't need to pass a Spark session to the v2 exec. ### Why are the changes needed? Similar to SPARK-33492 (#30429). When a table is refreshed, the associated cache should be invalidated to avoid potential incorrect results. ### Does this PR introduce _any_ user-facing change? Yes. Now When a data source v2 is cached (either directly or indirectly), all the relevant caches will be refreshed or invalidated if the table is replaced. ### How was this patch tested? Added a new unit test. Closes #31081 from sunchao/SPARK-34039. Authored-by: Chao Sun <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>
…ask end ### What changes were proposed in this pull request? This PR makes `AppStatusListener` update the peak memory metrics for each Executor on task end like other peak memory metrics (e.g, stage, executors in a stage). ### Why are the changes needed? When `AppStatusListener#onExecutorMetricsUpdate` is called, peak memory metrics for Executors, stages and executors in a stage are updated but currently, the metrics only for Executors are not updated on task end. ### Does this PR introduce _any_ user-facing change? Yes. Executor peak memory metrics is updated more accurately. ### How was this patch tested? After I run a job with `local-cluster[1,1,1024]` and visited `/api/v1/<appid>/executors`, I confirmed `peakExecutorMemory` metrics is shown for an Executor even though the life time of each job is very short . I also modify the json files for `HistoryServerSuite`. Closes #31029 from sarutak/update-executor-metrics-on-taskend. Authored-by: Kousuke Saruta <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>
…nSuite and WholeStageCodegenSuite ### What changes were proposed in this pull request? There are some existing test cases that construct various joins by tuning the SQL configurations AUTO_BROADCASTJOIN_THRESHOLD, PREFER_SORTMERGEJOIN, SHUFFLE_PARTITIONS, etc. This can be tricky and not straightforward. In future development we might have to tweak the configurations again. This PR constructs specific joins by using join hints in test cases, as sketched below. ### Why are the changes needed? Make test cases for joins simpler and more robust. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit test Closes #31087 from gengliangwang/joinhintInTest. Authored-by: Gengliang Wang <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
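A small sketch of the approach, assuming two DataFrames `df1` and `df2` that share an `id` column:
```scala
// Construct a specific join strategy with a hint instead of tuning
// AUTO_BROADCASTJOIN_THRESHOLD / PREFER_SORTMERGEJOIN / SHUFFLE_PARTITIONS.
val broadcastJoin   = df1.join(df2.hint("BROADCAST"), "id")
val shuffleHashJoin = df1.join(df2.hint("SHUFFLE_HASH"), "id")
val sortMergeJoin   = df1.join(df2.hint("SHUFFLE_MERGE"), "id")
```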
…rCharVarchar and ResolveAggregateFunctions ### What changes were proposed in this pull request? ResolveAggregateFunctions is a hacky rule and it calls `executeSameContext` to generate a `resolved agg` to determine which unresolved sort attribute should be pushed into the agg. However, after we add the PaddingAndLengthCheckForCharVarchar rule which will rewrite the query output, thus, the `resolved agg` cannot match original attributes anymore. It causes some dissociative sort attribute to be pushed in and fails the query ``` logtalk [info] Failed to analyze query: org.apache.spark.sql.AnalysisException: expression 'testcat.t1.`v`' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.; [info] Project [v#14, sum(i)#11L] [info] +- Sort [aggOrder#12 ASC NULLS FIRST], true [info] +- !Aggregate [v#14], [v#14, sum(cast(i#7 as bigint)) AS sum(i)#11L, v#13 AS aggOrder#12] [info] +- SubqueryAlias testcat.t1 [info] +- Project [if ((length(v#6) <= 3)) v#6 else if ((length(rtrim(v#6, None)) > 3)) cast(raise_error(concat(input string of length , cast(length(v#6) as string), exceeds varchar type length limitation: 3)) as string) else rpad(rtrim(v#6, None), 3, ) AS v#14, i#7] [info] +- RelationV2[v#6, i#7, index#15, _partition#16] testcat.t1 [info] [info] Project [v#14, sum(i)#11L] [info] +- Sort [aggOrder#12 ASC NULLS FIRST], true [info] +- !Aggregate [v#14], [v#14, sum(cast(i#7 as bigint)) AS sum(i)#11L, v#13 AS aggOrder#12] [info] +- SubqueryAlias testcat.t1 [info] +- Project [if ((length(v#6) <= 3)) v#6 else if ((length(rtrim(v#6, None)) > 3)) cast(raise_error(concat(input string of length , cast(length(v#6) as string), exceeds varchar type length limitation: 3)) as string) else rpad(rtrim(v#6, None), 3, ) AS v#14, i#7] [info] +- RelationV2[v#6, i#7, index#15, _partition#16] testcat.t1 ``` ### Why are the changes needed? bugfix ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? new tests Closes #31027 from yaooqinn/SPARK-34003. Authored-by: Kent Yao <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
… for Kafka delegation token ### What changes were proposed in this pull request? Kafka delegation token is obtained with `AdminClient`, where security settings can be set. Keystore and truststore types, however, can't be set. In this PR I've added these new configurations. This can be useful when the type is different. A good example is making Spark FIPS compliant, where the default JKS is not accepted. ### Why are the changes needed? Missing configurations. ### Does this PR introduce _any_ user-facing change? Yes, adding 2 additional config parameters. ### How was this patch tested? Existing + modified unit tests + a simple Kafka-to-Kafka app on a cluster. Closes #31070 from gaborgsomogyi/SPARK-34032. Authored-by: Gabor Somogyi <[email protected]> Signed-off-by: Jungtaek Lim (HeartSaVioR) <[email protected]>
### What changes were proposed in this pull request?
1. Recognize `null` while parsing partition specs, and put `null` instead of `"null"` as partition values.
2. For the V1 catalog: replace `null` with `__HIVE_DEFAULT_PARTITION__`.
3. For V2 catalogs: pass `null` as is, and let catalog implementations decide how to handle `null`s as partition values in the spec.

### Why are the changes needed?
Currently, `null` in partition specs is recognized as the `"null"` string, which could lead to incorrect results, for example:
```sql
spark-sql> CREATE TABLE tbl5 (col1 INT, p1 STRING) USING PARQUET PARTITIONED BY (p1);
spark-sql> INSERT INTO TABLE tbl5 PARTITION (p1 = null) SELECT 0;
spark-sql> SELECT isnull(p1) FROM tbl5;
false
```
Even though we inserted a row into the partition with the `null` value, **the resulting table doesn't contain `null`**.

### Does this PR introduce _any_ user-facing change?
Yes. After the changes, the example above works as expected:
```sql
spark-sql> SELECT isnull(p1) FROM tbl5;
true
```

### How was this patch tested?
1. By running the affected test suites `SQLQuerySuite`, `AlterTablePartitionV2SQLSuite` and `v1/ShowPartitionsSuite`.
2. Compiling with Scala 2.13:
```
$ ./dev/change-scala-version.sh 2.13
$ ./build/sbt -Pscala-2.13 compile
```

Closes #30538 from MaxGekk/partition-spec-value-null.
Authored-by: Max Gekk <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
…QL Guide ### What changes were proposed in this pull request? This PR tweaks the width of the left menu of the Spark SQL Guide. When I view the Spark SQL Guide with browsers on macOS, the title `Spark SQL Guide` looks pretty. But I often use Pop!_OS, an Ubuntu variant, and the title is overlapped with browsers on it. After this change, the title is no longer overlapped. ### Why are the changes needed? For a pretty layout. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Built the document with `cd docs && SKIP_API=1 jekyll build` and confirmed the layout. Closes #31091 from sarutak/modify-layout-sparksql-guide. Authored-by: Kousuke Saruta <[email protected]> Signed-off-by: Sean Owen <[email protected]>
…ilsSuite ### What changes were proposed in this pull request? This PR fixes an incorrect unicode literal test in `ParserUtilsSuite`. In that suite, string literals in queries have unicode escape characters like `\u7328`, but the backslash should be escaped because the query strings are given as Java strings. ### Why are the changes needed? Correct the test. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Ran `ParserUtilsSuite` and it passed. Closes #31088 from sarutak/fix-incorrect-unicode-test. Authored-by: Kousuke Saruta <[email protected]> Signed-off-by: Sean Owen <[email protected]>
…le blocks after successful map task completion ### What changes were proposed in this pull request? This is the shuffle writer side change where executors can push data to remote shuffle services. This is needed for push-based shuffle - SPIP [SPARK-30602](https://issues.apache.org/jira/browse/SPARK-30602). Summary of changes: - This adds support for executors to push shuffle blocks after map tasks complete writing shuffle data. - This also introduces a timeout specifically for creating connection to remote shuffle services. ### Why are the changes needed? - These changes are needed for push-based shuffle. Refer to the SPIP in [SPARK-30602](https://issues.apache.org/jira/browse/SPARK-30602). - The main reason to create a separate connection creation timeout is because the existing `connectionTimeoutMs` is overloaded and is used for connection creation timeouts as well as connection idle timeout. The connection creation timeout should be much lower than the idle timeouts. The default for `connectionTimeoutMs` is 120s. This is quite high for just establishing the connections. If a shuffle server node is bad then the connection creation will fail within few seconds. However, an overloaded shuffle server may take much longer to respond to a request and the channel can stay idle for a much longer time which is expected. Another reason is that with push-based shuffle, an executor may be fetching shuffle data and pushing shuffle data (next stage) simultaneously. Both these tasks will share the same connections with the shuffle service. If there is a bad shuffle server node and the connection creation timeout is very high then both these tasks end up waiting a long time time eventually impacting the performance. ### Does this PR introduce _any_ user-facing change? Yes. This PR introduces client-side configs for push-based shuffle. If push-based shuffle is turned-off then the users will not see any change. ### How was this patch tested? Added unit tests. The reference PR with the consolidated changes covering the complete implementation is also provided in [SPARK-30602](https://issues.apache.org/jira/browse/SPARK-30602). We have already verified the functionality and the improved performance as documented in the SPIP doc. Lead-authored-by: Min Shen mshenlinkedin.com Co-authored-by: Chandni Singh chsinghlinkedin.com Co-authored-by: Ye Zhou yezhoulinkedin.com Closes #30312 from otterc/SPARK-32917. Lead-authored-by: Chandni Singh <[email protected]> Co-authored-by: Chandni Singh <[email protected]> Co-authored-by: Min Shen <[email protected]> Co-authored-by: Ye Zhou <[email protected]> Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
…tion ### What changes were proposed in this pull request? This PR makes `StreamExecution` use the `Write` abstraction introduced in SPARK-33779. Note: we will need separate plans for streaming writes in order to support the required distribution and ordering in SS. This change only migrates to the `Write` abstraction. ### Why are the changes needed? These changes prevent exceptions from data sources that implement only the `build` method in `WriteBuilder`. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing tests. Closes #31093 from aokolnychyi/spark-34049. Authored-by: Anton Okolnychyi <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>
linar-jether pushed a commit that referenced this pull request on Jun 7, 2021:
… correctly ### What changes were proposed in this pull request? This PR proposes: 1. `CREATE OR REPLACE TEMP VIEW USING` should use `TemporaryViewRelation` to store temp views. 2. By doing #1, it fixes the issue where the temp view being replaced is not uncached. ### Why are the changes needed? This is a part of an ongoing work to wrap all the temporary views with `TemporaryViewRelation`: [SPARK-34698](https://issues.apache.org/jira/browse/SPARK-34698). This also fixes a bug where the temp view being replaced is not uncached. ### Does this PR introduce _any_ user-facing change? Yes, the temp view being replaced with `CREATE OR REPLACE TEMP VIEW USING` is correctly uncached if the temp view is cached. ### How was this patch tested? Added new tests. Closes apache#31825 from imback82/create_temp_view_using. Authored-by: Terry Kim <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>