
What changes were proposed in this pull request?

  1. Rebased pmackles/paul-SPARK-22256 onto the latest apache/spark master branch
  2. Added a test case verifying that the default value of spark.mesos.driver.memoryOverhead is 10% of driver memory, as requested in [SPARK-22256][MESOS] Introduce spark.mesos.driver.memoryOverhead (apache/spark#21006); a sketch of this default is shown below
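
A minimal, hedged sketch of that default, assuming the overhead is simply 10% of driver memory when `spark.mesos.driver.memoryOverhead` is unset (the function name is illustrative, and any minimum floor Spark applies on top of the 10% is not modeled here):

```python
# Hedged sketch only: not Spark's actual implementation.
def default_driver_memory_overhead_mb(driver_memory_mb: int) -> int:
    return int(driver_memory_mb * 0.10)

# e.g. a 4096 MB driver would get a 409 MB overhead under this sketch
```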

How was this patch tested?

Ran unit tests in resource-managers/mesos locally.

@dmcwhorter changed the title from "Dmcwhorter spark 22256" to "SPARK-22256 Updates" on Oct 8, 2019
@dmcwhorter force-pushed the dmcwhorter-SPARK-22256 branch from 60aab26 to a2b5feb on October 8, 2019 21:01
HyukjinKwon and others added 27 commits November 17, 2020 14:15
… (disabled by default)

### What changes were proposed in this pull request?

This PR proposes to simplify the exception messages from Python UDFs.

Currently, the exception message from Python UDFs is as below:

```python
from pyspark.sql.functions import udf; spark.range(10).select(udf(lambda x: x/0)("id")).collect()
```

```python
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../python/pyspark/sql/dataframe.py", line 427, in show
    print(self._jdf.showString(n, 20, vertical))
  File "/.../python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
  File "/.../python/pyspark/sql/utils.py", line 127, in deco
    raise_from(converted)
  File "<string>", line 3, in raise_from
pyspark.sql.utils.PythonException:
  An exception was thrown from Python worker in the executor:
Traceback (most recent call last):
  File "/.../python/lib/pyspark.zip/pyspark/worker.py", line 605, in main
    process()
  File "/.../python/lib/pyspark.zip/pyspark/worker.py", line 597, in process
    serializer.dump_stream(out_iter, outfile)
  File "/.../python/lib/pyspark.zip/pyspark/serializers.py", line 223, in dump_stream
    self.serializer.dump_stream(self._batched(iterator), stream)
  File "/.../python/lib/pyspark.zip/pyspark/serializers.py", line 141, in dump_stream
    for obj in iterator:
  File "/.../python/lib/pyspark.zip/pyspark/serializers.py", line 212, in _batched
    for item in iterator:
  File "/.../python/lib/pyspark.zip/pyspark/worker.py", line 450, in mapper
    result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs)
  File "/.../python/lib/pyspark.zip/pyspark/worker.py", line 450, in <genexpr>
    result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs)
  File "/.../python/lib/pyspark.zip/pyspark/worker.py", line 90, in <lambda>
    return lambda *a: f(*a)
  File "/.../python/lib/pyspark.zip/pyspark/util.py", line 107, in wrapper
    return f(*args, **kwargs)
  File "<stdin>", line 1, in <lambda>
ZeroDivisionError: division by zero
```

In almost all cases, users only care about `ZeroDivisionError: division by zero`; we don't really need to show the internal frames in the vast majority of cases.

This PR adds a configuration `spark.sql.execution.pyspark.udf.simplifiedException.enabled` (disabled by default) that hides the internal tracebacks related to the Python worker, (de)serialization, etc. With it enabled, the same error becomes:

```python
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../python/pyspark/sql/dataframe.py", line 427, in show
    print(self._jdf.showString(n, 20, vertical))
  File "/.../python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
  File "/.../python/pyspark/sql/utils.py", line 127, in deco
    raise_from(converted)
  File "<string>", line 3, in raise_from
pyspark.sql.utils.PythonException:
  An exception was thrown from Python worker in the executor:
Traceback (most recent call last):
  File "<stdin>", line 1, in <lambda>
ZeroDivisionError: division by zero
```

The traceback is shown only from the point where the first non-PySpark file appears in it.

### Why are the changes needed?

Without this configuration, such internal tracebacks are exposed to users directly, especially to shell or notebook users of PySpark. In the vast majority of cases people don't care about the internal Python worker, (de)serialization and related frames; they just make the exception harder to read. For example, the single statement `x/0` above produces a very long traceback, most of which is unnecessary.

This configuration makes it possible to show only the simplified traceback that users are most likely interested in.

### Does this PR introduce _any_ user-facing change?

By default, no. It adds one configuration that simplifies the exception message. See the example above.
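
For reference, the same flag can also be flipped from PySpark in an existing session (the config name is taken verbatim from this PR; treating it as runtime-settable is an assumption):

```python
# Enable the simplified UDF tracebacks described above (disabled by default).
spark.conf.set("spark.sql.execution.pyspark.udf.simplifiedException.enabled", "true")
```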

### How was this patch tested?

Manually tested:

```bash
$ pyspark --conf spark.sql.execution.pyspark.udf.simplifiedException.enabled=true
```
```python
from pyspark.sql.functions import udf; spark.sparkContext.setLogLevel("FATAL"); spark.range(10).select(udf(lambda x: x/0)("id")).collect()
```

Unit tests were also added.

Closes apache#30309 from HyukjinKwon/SPARK-33407.

Authored-by: HyukjinKwon <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
…dicate have many values

### What changes were proposed in this pull request?

We [rewrite](https://github.com/apache/spark/blob/5197c5d2e7648d75def3e159e0d2aa3e20117105/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveShim.scala#L722-L724) `In`/`InSet` predicates to `or` expressions when pruning Hive partitions. This can cause a Hive metastore stack overflow when there are many values.

This PR instead rewrites an `InSet` predicate into `GreaterThanOrEqual` on the min value and `LessThanOrEqual` on the max value when pruning Hive partitions, which avoids the Hive metastore stack overflow.

From our experience, `spark.sql.hive.metastorePartitionPruningInSetThreshold` should be less than 10000.
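
A self-contained sketch of the rewrite idea (the function name and threshold are illustrative; the real change lives in `HiveShim`):

```python
def rewrite_inset_for_metastore(col, values, threshold):
    """Above the threshold, push a single range predicate to the metastore
    instead of a huge OR chain that can overflow the metastore's stack."""
    if len(values) > threshold:
        return f"{col} >= {min(values)} AND {col} <= {max(values)}"
    return " OR ".join(f"{col} = {v}" for v in values)

# 10,000 values collapse into one range predicate:
# rewrite_inset_for_metastore("ds", list(range(1, 10001)), threshold=1000)
#   -> "ds >= 1 AND ds <= 10000"
```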

### Why are the changes needed?

Avoids a Hive metastore stack overflow when an `InSet` predicate has many values.
This is especially relevant for DPP, which may generate many values.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual test.

Closes apache#30325 from wangyum/SPARK-33416.

Authored-by: Yuming Wang <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
…ession evaluation

### What changes were proposed in this pull request?

This patch proposes to add subexpression elimination for interpreted expression evaluation. Interpreted expression evaluation is used when codegen cannot be used, for example with a complex schema.

### Why are the changes needed?

Currently we only do subexpression elimination for codegen. In some cases we need to run interpreted expression evaluation instead, for example when codegen fails to compile and falls back to interpreted mode, or when expressions have complex input/output schemas. Such complex schemas are commonly produced by the query optimizer itself, e.g. SPARK-32945.

We should also support subexpression elimination for interpreted evaluation. That reduces the performance gap when Spark falls back from codegen to interpreted expression evaluation and improves Spark's usability.
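
An illustrative query shape that benefits, runnable in a local PySpark session: the shared `from_json` below is the kind of subexpression that can now be evaluated once per row instead of once per field reference when execution falls back to interpreted mode.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col

spark = SparkSession.builder.master("local[*]").getOrCreate()
df = spark.createDataFrame([('{"a": 1, "b": 2}',)], ["json"])

parsed = from_json(col("json"), "a INT, b INT")  # shared (common) subexpression
df.select(parsed["a"].alias("a"), parsed["b"].alias("b")).show()
```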

#### Benchmark

Update `SubExprEliminationBenchmark`:

Before:

```
OpenJDK 64-Bit Server VM 1.8.0_265-b01 on Mac OS X 10.15.6
 Intel(R) Core(TM) i7-9750H CPU  2.60GHz
 from_json as subExpr:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 -------------------------------------------------------------------------------------------------------------------------
subexpressionElimination on, codegen off           24707          25688         903          0.0   247068775.9       1.0X
```

After:
```
OpenJDK 64-Bit Server VM 1.8.0_265-b01 on Mac OS X 10.15.6
 Intel(R) Core(TM) i7-9750H CPU  2.60GHz
 from_json as subExpr:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 -------------------------------------------------------------------------------------------------------------------------
subexpressionElimination on, codegen off            2360           2435          87          0.0    23604320.7      11.2X
```

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit test. Benchmark manually.

Closes apache#30341 from viirya/SPARK-33427.

Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
### What changes were proposed in this pull request?

Added an integration test that configures a log4j.properties file and checks that it is the one picked up by the driver.

### Why are the changes needed?

Improved test coverage.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

By running integration tests.

Closes apache#30388 from ScrapCodes/SPARK-32222/k8s-it-spark-conf-propagate.

Authored-by: Prashant Sharma <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
### What changes were proposed in this pull request?

This PR aims to upgrade kubernetes-client from 4.11.1 to 4.12.0.

### Why are the changes needed?

This upgrades the dependency for Apache Spark 3.1.0.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Pass the CIs.

Closes apache#30401 from ramesh-muthusamy/SPARK-33471-k8s-clientupgrade.

Authored-by: Rameshkrishnan Muthusamy <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
### What changes were proposed in this pull request?

This minor PR updates the docs of `schema_of_csv` and `schema_of_json`. They now accept a foldable string column instead of only a string literal.
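
For example (in a PySpark shell), a foldable column such as `lit(...)` is accepted, not just a plain string:

```python
from pyspark.sql.functions import schema_of_json, lit

# A foldable string column instead of a Python string literal:
spark.range(1).select(schema_of_json(lit('{"a": 1, "b": "x"}'))).show(truncate=False)
# prints something like: STRUCT<`a`: BIGINT, `b`: STRING> (exact formatting varies by version)
```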

### Why are the changes needed?

The function docs of `schema_of_csv` and `schema_of_json` were not updated accordingly in the previous PRs.

### Does this PR introduce _any_ user-facing change?

Yes, it updates user-facing docs.

### How was this patch tested?

Unit test.

Closes apache#30396 from viirya/minor-json-csv.

Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
…se high CPU cost in external shuffle service when `maxChunksBeingTransferred` uses the default value

### What changes were proposed in this pull request?
Follow-up of apache#27831; original author: chrysan.

Each request checks `chunksBeingTransferred`:
```
public long chunksBeingTransferred() {
  long sum = 0L;
  for (StreamState streamState : streams.values()) {
    sum += streamState.chunksBeingTransferred.get();
  }
  return sum;
}
```
which is called like this:
```
long chunksBeingTransferred = streamManager.chunksBeingTransferred();
if (chunksBeingTransferred >= maxChunksBeingTransferred) {
  logger.warn("The number of chunks being transferred {} is above {}, close the connection.",
    chunksBeingTransferred, maxChunksBeingTransferred);
  channel.close();
  return;
}
```
This traverses `streams` repeatedly, and fetching a data chunk accesses `streams` too, which causes two problems:

1. `streams` is traversed repeatedly; the longer it is, the longer each check takes.
2. Lock contention on the `ConcurrentHashMap` `streams`.

In this PR, when `maxChunksBeingTransferred` is left at its default value, we avoid computing `chunksBeingTransferred` at all, since the result is not needed in that case. Users who set this configuration and hit the performance problem can also backport PR apache#27831.
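
A conceptual, self-contained sketch of that short-circuit (names and types are illustrative; the actual change is in the Java transport code):

```python
UNLIMITED = 2 ** 63 - 1  # Long.MaxValue, the default of maxChunksBeingTransferred

def should_close_connection(stream_states, max_chunks=UNLIMITED):
    # Default case: skip the O(#streams) sum (and the map traversal) entirely.
    if max_chunks == UNLIMITED:
        return False
    total = sum(s["chunks_being_transferred"] for s in stream_states)
    return total >= max_chunks
```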

### Why are the changes needed?
Speeds up the `chunksBeingTransferred` check and avoids lock contention on the `streams` object.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing UTs.

Closes apache#30139 from AngersZhuuuu/SPARK-31069.

Lead-authored-by: angerszhu <[email protected]>
Co-authored-by: chrysan <[email protected]>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
### What changes were proposed in this pull request?

This change adds MapType support for PySpark with Arrow, if using pyarrow >= 2.0.0.

### Why are the changes needed?

MapType was previously unsupported with Arrow.

### Does this PR introduce _any_ user-facing change?

Users can now use MapType with `createDataFrame()` and `toPandas()` under Arrow optimization, and with Pandas UDFs.
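
A usage sketch (in a PySpark shell with pyarrow >= 2.0.0; the exact Python-side representation of map values after conversion is version dependent):

```python
from pyspark.sql.types import MapType, StringType, IntegerType, StructType, StructField

spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

schema = StructType([StructField("m", MapType(StringType(), IntegerType()))])
df = spark.createDataFrame([({"a": 1, "b": 2},)], schema=schema)
pdf = df.toPandas()  # now goes through Arrow instead of falling back to the non-Arrow path
```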

### How was this patch tested?

Added new PySpark tests for createDataFrame(), toPandas() and Scalar Pandas UDFs.

Closes apache#30393 from BryanCutler/arrow-add-MapType-SPARK-24554.

Authored-by: Bryan Cutler <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
### What changes were proposed in this pull request?

This PR intends to upgrade ANTLR runtime from 4.7.1 to 4.8-1.

### Why are the changes needed?

Release notes of v4.8 and v4.7.2 (the v4.7.2 release has a few minor bug fixes for Java targets):
 - v4.8: https://github.com/antlr/antlr4/releases/tag/4.8
 - v4.7.2: https://github.com/antlr/antlr4/releases/tag/4.7.2

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

GA tests.

Closes apache#30404 from maropu/UpgradeAntlr.

Authored-by: Takeshi Yamamuro <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
…ql.hive.metastore.jars

### What changes were proposed in this pull request?

This is a follow-up for apache#29881.
It revises the documentation of the configuration `spark.sql.hive.metastore.jars`.

### Why are the changes needed?

Fixes a grammatical error in the doc.
Also makes it clearer that the configuration is effective only when `spark.sql.hive.metastore.jars` is set to `path`.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Just doc changes.

Closes apache#30407 from gengliangwang/reviseJarPathDoc.

Authored-by: Gengliang Wang <[email protected]>
Signed-off-by: Gengliang Wang <[email protected]>
### What changes were proposed in this pull request?
Use `maxBlockSizeInMB` instead of `blockSize` (number of rows) to control the stacking of vectors.

### Why are the changes needed?
The performance gain is mainly related to the number of non-zero values (nnz) of a block.

### Does this PR introduce _any_ user-facing change?
Yes, the param `blockSize` is replaced by `blockSizeInMB` in master.

### How was this patch tested?
Updated test suites.

Closes apache#30355 from zhengruifeng/adaptively_blockify_aft_lir_lor.

Lead-authored-by: zhengruifeng <[email protected]>
Co-authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Weichen Xu <[email protected]>
…le system schemes

### What changes were proposed in this pull request?

This PR aims to generalize executor metrics to support user-given file system schemes instead of the fixed `file,hdfs` scheme.

### Why are the changes needed?

For the users using only cloud storages like `S3A`, we need to be able to expose `S3A` metrics. Also, we can skip unused `hdfs` metrics.

### Does this PR introduce _any_ user-facing change?

Yes, but it is compatible for existing users who only use the `hdfs` and `file` filesystem schemes.

### How was this patch tested?

Manually do the following.

```
$ build/sbt -Phadoop-cloud package
$ sbin/start-master.sh; sbin/start-slave.sh spark://$(hostname):7077
$ bin/spark-shell --master spark://$(hostname):7077 -c spark.executor.metrics.fileSystemSchemes=file,s3a -c spark.metrics.conf.executor.sink.jmx.class=org.apache.spark.metrics.sink.JmxSink
scala> spark.read.textFile("s3a://dongjoon/README.md").collect()
```

Separately, launch `jconsole` and check `*.executor.filesystem.s3a.*`. Also, confirm that there is no `*.executor.filesystem.hdfs.*`

```
$ jconsole
```
![Screen Shot 2020-11-17 at 9 26 03 PM](https://user-images.githubusercontent.com/9700541/99487609-94121180-291b-11eb-9ed2-964546146981.png)

Closes apache#30405 from dongjoon-hyun/SPARK-33476.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
Supports Python client deps from the launcher fs.
This feature was previously added for Java deps; this PR adds support for Python as well.

yes

Manually ran different scenarios and examined the driver & executor logs. An integration test was also added.
I verified that the Python resources are added to the Spark file server and that they are named properly so they don't fail the executors. Note that, as before, the following will not work:
a primary resource `A.py` that uses a closure defined in a submitted pyfile `B.py`; context.py only adds files with certain extensions (e.g. zip, egg, jar) to the PYTHONPATH.

Closes apache#25870 from skonto/python-deps.

Lead-authored-by: Stavros Kontopoulos <[email protected]>
Co-authored-by: Stavros Kontopoulos <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
### What changes were proposed in this pull request?

This PR is a follow-up of apache#29471 and does the following improvements for `HadoopFSUtils`:
1. Removes the extra `filterFun` from the listing API and combines it with the `filter`.
2. Removes `SerializableBlockLocation` and `SerializableFileStatus` given that `BlockLocation` and `FileStatus` are already serializable.
3. Hides the `isRootLevel` flag from the top-level API.

### Why are the changes needed?

Main purpose is to simplify the logic within `HadoopFSUtils` as well as cleanup the API.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing unit tests (e.g., `FileIndexSuite`)

Closes apache#29959 from sunchao/hadoop-fs-utils-followup.

Authored-by: Chao Sun <[email protected]>
Signed-off-by: Holden Karau <[email protected]>
### What changes were proposed in this pull request?

This adds support for metadata columns to DataSourceV2. If a source implements `SupportsMetadataColumns` it must also implement `SupportsPushDownRequiredColumns` to support projecting those columns.

The analyzer is updated to resolve metadata columns from `LogicalPlan.metadataOutput`, and this adds a rule that will add metadata columns to the output of `DataSourceV2Relation` if one is used.
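
A hypothetical usage sketch (the catalog, table and metadata column names such as `_partition` are illustrative and depend on the source that implements `SupportsMetadataColumns`):

```python
# Metadata columns are hidden from SELECT * but can be projected explicitly:
spark.sql("SELECT id, data, _partition FROM testcat.ns.tbl").show()
```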

### Why are the changes needed?

This is the solution discussed for exposing additional data in the Kafka source. It is also needed for a generic `MERGE INTO` plan.

### Does this PR introduce any user-facing change?

Yes. Users can project additional columns from sources that implement the new API. This also updates `DescribeTableExec` to show metadata columns.

### How was this patch tested?

Will include new unit tests.

Closes apache#28027 from rdblue/add-dsv2-metadata-columns.

Authored-by: Ryan Blue <[email protected]>
Signed-off-by: Burak Yavuz <[email protected]>
…itHub Actions yaml

### What changes were proposed in this pull request?

This PR proposes:
- Add `~/.sbt` directory into the build cache, see also sbt/sbt#3681
- Move `hadoop-2` below to put up together with `java-11` and `scala-213`, see apache#30391 (comment)
- Remove unnecessary `.m2` cache if you run SBT tests only.
- Remove `rm ~/.m2/repository/org/apache/spark`. If you don't `sbt publishLocal` or `mvn install`, we don't need to care about it.
- Use Java 8 in Scala 2.13 build. We can switch the Java version to 11 used for release later.
- Add caches into linters. The linter scripts use `sbt` (for example in `./dev/lint-scala`) and `mvn` (for example in `./dev/lint-java`). The Jekyll build also requires `sbt package`, see: https://github.com/apache/spark/blob/master/docs/_plugins/copy_api_dirs.rb#L160-L161. We need full caches here for SBT, Maven and the build tools.
- Use the same syntax of Java version, 1.8 -> 8.

### Why are the changes needed?

- Remove unnecessary stuff
- Cache what we can in the build

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

It will be tested by the GitHub Actions build on this PR.

Closes apache#30391 from HyukjinKwon/SPARK-33464.

Authored-by: HyukjinKwon <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
### What changes were proposed in this pull request?

In section 6.13 of the ANSI SQL standard, there are syntax rules for valid combinations of the source and target data types.
![image](https://user-images.githubusercontent.com/1097932/98212874-17356f80-1ef9-11eb-8f2b-385f32db404a.png)

Comparing the ANSI CAST syntax rules with the current default behavior of Spark:
![image](https://user-images.githubusercontent.com/1097932/98789831-b7870a80-23b7-11eb-9b5f-469a42e0ee4a.png)

To make Spark's ANSI mode more ANSI SQL compatible, I propose to disallow the following casts in ANSI mode:
```
TimeStamp <=> Boolean
Date <=> Boolean
Numeric <=> Timestamp
Numeric <=> Date
Numeric <=> Binary
String <=> Array
String <=> Map
String <=> Struct
```
The following casts are considered invalid in the ANSI SQL standard, but they are quite straightforward, so let's allow them for now:
```
Numeric <=> Boolean
String <=> Binary
```
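
A hedged illustration of the effect in ANSI mode, runnable in a local PySpark session (the exact error type and message depend on the Spark version):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
spark.conf.set("spark.sql.ansi.enabled", "true")

# Numeric <=> Timestamp is in the disallowed list above, so under ANSI mode this
# cast is expected to be rejected at analysis time:
# spark.sql("SELECT CAST(1 AS TIMESTAMP)")

# Numeric <=> Boolean stays allowed:
spark.sql("SELECT CAST(1 AS BOOLEAN)").show()
```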
### Why are the changes needed?

Better ANSI SQL compliance

### Does this PR introduce _any_ user-facing change?

Yes, the following casts will no longer be allowed in ANSI mode:
```
TimeStamp <=> Boolean
Date <=> Boolean
Numeric <=> Timestamp
Numeric <=> Date
Numeric <=> Binary
String <=> Array
String <=> Map
String <=> Struct
```

### How was this patch tested?

Unit test

The ANSI Compliance doc preview:
![image](https://user-images.githubusercontent.com/1097932/98946017-2cd20880-24a8-11eb-8161-65749bfdd03a.png)

Closes apache#30260 from gengliangwang/ansiCanCast.

Authored-by: Gengliang Wang <[email protected]>
Signed-off-by: Takeshi Yamamuro <[email protected]>
### What changes were proposed in this pull request?

Adds `from_avro` and `to_avro` functions to SparkR.

### Why are the changes needed?

Feature parity.

### Does this PR introduce _any_ user-facing change?

New functions exposed in SparkR API.

### How was this patch tested?

New unit tests.

Closes apache#30216 from zero323/SPARK-33304.

Authored-by: zero323 <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
### What changes were proposed in this pull request?

Make the API key of DocSearch configurable and avoid hardcoding in the HTML template

### Why are the changes needed?

After apache#30292, our Spark documentation site supports searching.
However, the default API key always points to the latest release doc. We have to set different API keys for different releases. Otherwise, the search results are always based on the latest documentation (https://spark.apache.org/docs/latest/) even when visiting the documentation of previous releases.

As per discussion in apache#30292 (comment), we should make the API key configurable and set different values for different releases.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Manual test

Closes apache#30409 from gengliangwang/apiKey.

Authored-by: Gengliang Wang <[email protected]>
Signed-off-by: Takeshi Yamamuro <[email protected]>
…ionRuntimeSuite

### What changes were proposed in this pull request?

This followup is to prevent possible test flakiness of `SubExprEvaluationRuntimeSuite`.

### Why are the changes needed?

Because a HashMap doesn't guarantee ordering, the proxy expression ids in `proxyExpressions` are not deterministic, so `SubExprEvaluationRuntimeSuite` should not assert on them.

### Does this PR introduce _any_ user-facing change?

No, dev only.

### How was this patch tested?

Unit test.

Closes apache#30414 from viirya/SPARK-33427-followup.

Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: Liang-Chi Hsieh <[email protected]>
…her interpreted projections

### What changes were proposed in this pull request?

Similar to `InterpretedUnsafeProjection`, this patch proposes to extend interpreted subexpression elimination to `InterpretedMutableProjection` and `InterpretedSafeProjection`.

### Why are the changes needed?

Enabling subexpression elimination can improve the performance of interpreted projections, as shown in `InterpretedUnsafeProjection`.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit test.

Closes apache#30406 from viirya/SPARK-33473.

Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
### What changes were proposed in this pull request?

Add missing license headers for new files added in apache#28027.

### Why are the changes needed?

To fix licenses.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

This is a purely non-functional change.

Closes apache#30415 from rdblue/license-headers.

Authored-by: Ryan Blue <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
…all unused-imports

### What changes were proposed in this pull request?
This PR adds a new Scala compile arg to `pom.xml` to defend against new unused imports:

- `-Ywarn-unused-import` for Scala 2.12
- `-Wconf:cat=unused-imports:e` for Scala 2.13

The other file changes remove all unused imports in the Spark code.

### Why are the changes needed?
Cleans up code and adds a guarantee against new unused imports being introduced.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass Jenkins or GitHub Actions.

Closes apache#30351 from LuciferYang/remove-imports-core-module.

Authored-by: yangjie01 <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
…g.String when pruning partition column

### What changes were proposed in this pull request?

This PR fixes the filter for an int column compared with a value of class java.lang.String when pruning partition columns.

How to reproduce this issue:
```scala
spark.sql("CREATE table test (name STRING) partitioned by (id int) STORED AS PARQUET")
spark.sql("CREATE VIEW test_view as select cast(id as string) as id, name from test")
spark.sql("SELECT * FROM test_view WHERE id = '0'").explain
```
```
20/11/15 06:19:01 INFO audit: ugi=root ip=unknown-ip-addr cmd=get_partitions_by_filter : db=default tbl=test
20/11/15 06:19:01 INFO MetaStoreDirectSql: Unable to push down SQL filter: Cannot push down filter for int column and value class java.lang.String
20/11/15 06:19:01 ERROR SparkSQLDriver: Failed in [SELECT * FROM test_view WHERE id = '0']
java.lang.RuntimeException: Caught Hive MetaException attempting to get partition metadata by filter from Hive. You can set the Spark configuration setting spark.sql.hive.manageFilesourcePartitions to false to work around this problem, however this will result in degraded performance. Please report a bug: https://issues.apache.org/jira/browse/SPARK
 at org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:828)
 at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$getPartitionsByFilter$1(HiveClientImpl.scala:745)
 at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:294)
 at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:227)
 at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:226)
 at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:276)
 at org.apache.spark.sql.hive.client.HiveClientImpl.getPartitionsByFilter(HiveClientImpl.scala:743)
```

### Why are the changes needed?

Fix bug.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test.

Closes apache#30380 from wangyum/SPARK-27421.

Authored-by: Yuming Wang <[email protected]>
Signed-off-by: Yuming Wang <[email protected]>
…id unnecessary sort operations

### What changes were proposed in this pull request?
This pull request tries to normalize the SortOrder properly to prevent unnecessary sort operators. Currently the sameOrderExpressions are not normalized as part of AliasAwareOutputOrdering.

Example: consider this join of three tables:

      """
        |SELECT t2id, t3.id as t3id
        |FROM (
        |    SELECT t1.id as t1id, t2.id as t2id
        |    FROM t1, t2
        |    WHERE t1.id = t2.id
        |) t12, t3
        |WHERE t1id = t3.id
      """.

The plan for this looks like:

      *(8) Project [t2id#1059L, id#1004L AS t3id#1060L]
      +- *(8) SortMergeJoin [t2id#1059L], [id#1004L], Inner
         :- *(5) Sort [t2id#1059L ASC NULLS FIRST ], false, 0         <-----------------------------
         :  +- *(5) Project [id#1000L AS t2id#1059L]
         :     +- *(5) SortMergeJoin [id#996L], [id#1000L], Inner
         :        :- *(2) Sort [id#996L ASC NULLS FIRST ], false, 0
         :        :  +- Exchange hashpartitioning(id#996L, 5), true, [id=apache#1426]
         :        :     +- *(1) Range (0, 10, step=1, splits=2)
         :        +- *(4) Sort [id#1000L ASC NULLS FIRST ], false, 0
         :           +- Exchange hashpartitioning(id#1000L, 5), true, [id=apache#1432]
         :              +- *(3) Range (0, 20, step=1, splits=2)
         +- *(7) Sort [id#1004L ASC NULLS FIRST ], false, 0
            +- Exchange hashpartitioning(id#1004L, 5), true, [id=apache#1443]
               +- *(6) Range (0, 30, step=1, splits=2)

In this plan, the marked sort node could have been avoided as the data is already sorted on "t2.id" by the lower SortMergeJoin.

### Why are the changes needed?
To remove unneeded Sort operators.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
New UT added.

Closes apache#30302 from prakharjain09/SPARK-33400-sortorder.

Authored-by: Prakhar Jain <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
### What changes were proposed in this pull request?

This PR fixes the RAT exclusion rule that originated in SPARK-1144 (Apache Spark 1.0).

### Why are the changes needed?

This prevents the situation like apache#30415.

Currently, it misses the `catalog` directory due to the `.log` rule.
```
$ dev/check-license
Could not find Apache license headers in the following files:
 !????? /Users/dongjoon/APACHE/spark-merge/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/MetadataColumn.java
 !????? /Users/dongjoon/APACHE/spark-merge/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/SupportsMetadataColumns.java
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CI with the new rule.

Closes apache#30418 from dongjoon-hyun/SPARK-RAT.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
… version

### What changes were proposed in this pull request?
This PR is a follow-up for apache#30093 that updates the version of the config `spark.sql.execution.removeRedundantSorts` to 2.4.8.

### Why are the changes needed?
To reflect that the rule has been backported to 2.4 (apache#30194).

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
N/A

Closes apache#30420 from allisonwang-db/spark-33183-follow-up.

Authored-by: allisonwang-db <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
HyukjinKwon and others added 27 commits December 9, 2020 20:26
…to nonInheritableMetadataKeys in Alias

### What changes were proposed in this pull request?

This PR is a followup of apache#30488. This PR proposes to rename `Alias.deniedMetadataKeys` to `Alias.nonInheritableMetadataKeys` to make it less confusing.

### Why are the changes needed?

To make it easier to maintain and read.

### Does this PR introduce _any_ user-facing change?

No. This is rather a code cleanup.

### How was this patch tested?

Ran the unit tests written in the previous PR manually. Jenkins and GitHub Actions in this PR should also test them.

Closes apache#30682 from HyukjinKwon/SPARK-33071-SPARK-33536.

Authored-by: HyukjinKwon <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
### What changes were proposed in this pull request?

This PR adds `DeleteFromTable` to supported plans in `ReplaceNullWithFalseInPredicate`.

### Why are the changes needed?

This change allows Spark to optimize delete conditions like we optimize filters.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

This PR extends the existing test cases to also cover `DeleteFromTable`.

Closes apache#30688 from aokolnychyi/spark-33722.

Authored-by: Anton Okolnychyi <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
### What changes were proposed in this pull request?

This upgrades snappy-java to 1.1.8.2.

### Why are the changes needed?

Minor version upgrade that includes:

- [Fixed](xerial/snappy-java#265) an initialization issue when using a recent Mac OS X version
- Support Apple Silicon (M1, Mac-aarch64)
- Fixed the pure-java Snappy fallback logic when no native library for your platform is found.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit test.

Closes apache#30690 from viirya/upgrade-snappy.

Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
### What changes were proposed in this pull request?

While building the R docker image, if we can't fetch the key from gnupg.net, fall back to openpgp.org.

### Why are the changes needed?

gnupg.net key servers are flaky and sometimes fail to resolve or return keys.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Tried to add the key on my desktop and it failed; then tried adding the key from openpgp.org and it succeeded.

Closes apache#30696 from holdenk/SPARK-33727-gnupg-server-is-flaky.

Authored-by: Holden Karau <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
### What changes were proposed in this pull request?

Makes the location of the decommission script used in Kubernetes for graceful shutdown configurable.

### Why are the changes needed?

Some environments don't use the Spark image builder and instead mount the decompressed Spark distro. In those envs configuring the location of the decommissioning script is required.

### Does this PR introduce _any_ user-facing change?

New configuration parameter.

### How was this patch tested?

Existing decommissioning integration test.

Closes apache#30694 from holdenk/SPARK-33724-allow-decommissioning-script-location-to-be-configured.

Authored-by: Holden Karau <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
…N tests

### What changes were proposed in this pull request?
1. Move the `ALTER TABLE .. ADD PARTITION` parsing tests to `AlterTableAddPartitionParserSuite`
2. Move the v1 tests for `ALTER TABLE .. ADD PARTITION` from `DDLSuite` and the v2 tests from `AlterTablePartitionV2SQLSuite` into the common trait `AlterTableAddPartitionSuiteBase`, so that the tests run for the V1, Hive V1 and V2 data sources.

### Why are the changes needed?
- The unification allows running common `ALTER TABLE .. ADD PARTITION` tests for DSv1, Hive DSv1 and DSv2.
- We can detect missing features and differences between DSv1 and DSv2 implementations.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running new test suites:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableAddPartitionSuite"
```

Closes apache#30685 from MaxGekk/unify-alter-table-add-partition-tests.

Authored-by: Max Gekk <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
…mands to use UnresolvedView to resolve the identifier

### What changes were proposed in this pull request?

This PR adds `allowTemp` flag to `UnresolvedView` so that `Analyzer` can check whether to resolve temp views or not.

This PR also migrates `ALTER VIEW ... SET/UNSET TBLPROPERTIES` to use `UnresolvedView` to resolve the table/view identifier. This allows consistent resolution rules (temp view first, etc.) to be applied for both v1/v2 commands. More info about the consistent resolution rule proposal can be found in [JIRA](https://issues.apache.org/jira/browse/SPARK-29900) or [proposal doc](https://docs.google.com/document/d/1hvLjGA8y_W_hhilpngXVub1Ebv8RsMap986nENCFnrg/edit?usp=sharing).

### Why are the changes needed?

To use `UnresolvedView` for view resolution.

One benefit is that the exception message is better for `ALTER VIEW ... SET/UNSET TBLPROPERTIES`. Before, if a temp view was passed, you would just get a `NoSuchTableException` with `Table or view 'tmpView' not found in database 'default'`. With this PR, you get a more descriptive exception message: `tmpView is a temp view. ALTER VIEW ... SET TBLPROPERTIES expects a permanent view`.

### Does this PR introduce _any_ user-facing change?

The exception message changes as described above.

### How was this patch tested?

Updated existing tests.

Closes apache#30676 from imback82/alter_view_set_unset_properties.

Authored-by: Terry Kim <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
…ith Minikube 1.9+

### What changes were proposed in this pull request?

This PR changes `Minikube.scala` for Kubernetes integration tests to work with Minikube 1.9+.
`Minikube.scala` assumes that `apiserver.key` and `apiserver.crt` are in `~/.minikube/`.
But as of Minikube 1.9, they are in `~/.minikube/profiles/<profile>`.

### Why are the changes needed?

Currently, the Kubernetes integration tests don't work with Minikube 1.9+.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

I confirmed the following test passes.
```
$ build/sbt -Pkubernetes -Pkubernetes-integration-tests package 'kubernetes-integration-tests/testOnly -- -z "SparkPi with no"'
```

Closes apache#30700 from sarutak/minikube-1.9.

Authored-by: Kousuke Saruta <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
…lyzer in one file

### What changes were proposed in this pull request?
This PR follows up apache#29497, which only gave an example of grouping the `AnalysisException`s thrown in the Analyzer into QueryCompilationErrors.
This PR groups the remaining `AnalysisException`s into QueryCompilationErrors.

### Why are the changes needed?
It greatly helps with standardizing error messages and maintaining them.

### Does this PR introduce _any_ user-facing change?
No. Error messages remain unchanged.

### How was this patch tested?
No new tests - pass all original tests to make sure it doesn't break any existing behavior.

Closes apache#30564 from beliefer/SPARK-32670-followup.

Lead-authored-by: gengjiaan <[email protected]>
Co-authored-by: Jiaan Geng <[email protected]>
Co-authored-by: beliefer <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
…lookup function

### What changes were proposed in this pull request?
Use the catalog and namespace captured by the view to look up functions, so the functions
the view refers to won't be overridden by a newly created function with the same name
but a different database or function type (i.e. a temporary function).

### Why are the changes needed?
Bug fix: without this PR, changing the database or creating a temporary function with
the same name may cause a failure when querying a view.

### Does this PR introduce _any_ user-facing change?
Yes, bug fix.

### How was this patch tested?
newly added and existing test cases.

Closes apache#30662 from linhongliu-db/SPARK-33692.

Lead-authored-by: Linhong Liu <[email protected]>
Co-authored-by: Linhong Liu <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
…existing hadoop ones

### What changes were proposed in this pull request?
org.apache.hadoop.conf.Configuration#setIfUnset will ignore those with defaults too

### Why are the changes needed?

fix a regression

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

new tests

Closes apache#30709 from yaooqinn/SPARK-33740.

Authored-by: Kent Yao <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
…ernalCatalog.createPartitions()

### What changes were proposed in this pull request?
Throw `PartitionsAlreadyExistException` from `createPartitions()` in Hive external catalog when a partition exists. Currently, `HiveExternalCatalog.createPartitions()` throws `AlreadyExistsException` wrapped by `AnalysisException`.

In the PR, I propose to catch `AlreadyExistsException` in `HiveClientImpl` and replace it by `PartitionsAlreadyExistException`.

### Why are the changes needed?
The behaviour of Hive external catalog deviates from V1/V2 in-memory catalogs that throw `PartitionsAlreadyExistException`. To improve user experience with Spark SQL, it would be better to throw the same exception.

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
By running existing test suites:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableAddPartitionSuite"
```

Closes apache#30711 from MaxGekk/hive-partition-exception.

Authored-by: Max Gekk <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
### What changes were proposed in this pull request?

This PR aims to upgrade Apache ORC to 1.6.6 for Apache Spark 3.2.0.

### Why are the changes needed?

This brings the latest bug fixes and features.
Apache Iceberg is already using Apache ORC 1.6.6.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

Closes apache#30715 from dongjoon-hyun/SPARK-33295.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
…and flake8

### What changes were proposed in this pull request?

Once you build and run the K8S tests, Python lint fails as below:

```bash
$ ./dev/lint-python
```

Before this PR:

```
starting python compilation test...
python compilation succeeded.

downloading pycodestyle from https://raw.githubusercontent.com/PyCQA/pycodestyle/2.6.0/pycodestyle.py...
starting pycodestyle test...
pycodestyle checks failed:
./resource-managers/kubernetes/integration-tests/target/spark-dist-unpacked/python/pyspark/cloudpickle/cloudpickle.py:15:101: E501 line too long (105 > 100 characters)
./resource-managers/kubernetes/integration-tests/target/spark-dist-unpacked/python/docs/source/conf.py:60:101: E501 line too long (124 > 100 characters)
...
```

After this PR:

```
starting python compilation test...
python compilation succeeded.

downloading pycodestyle from https://raw.githubusercontent.com/PyCQA/pycodestyle/2.6.0/pycodestyle.py...
starting pycodestyle test...
pycodestyle checks passed.

starting flake8 test...
flake8 checks passed.

starting mypy test...
mypy checks passed.

starting sphinx-build tests...
sphinx-build checks passed.
```

This PR excludes the target directory to avoid such cases in the future.

### Why are the changes needed?

To make it easier to run linters

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

Manually tested via running `./dev/lint-python`.

Closes apache#30718 from HyukjinKwon/SPARK-33749.

Authored-by: HyukjinKwon <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
…ReaderAdmin

### What changes were proposed in this pull request?
The Kafka offset reader which uses `AdminClient` still uses `UninterruptibleThread` to call it. Since there is no evidence that `AdminClient` suffers from issues similar to [KAFKA-1894](https://issues.apache.org/jira/browse/KAFKA-1894), I'm removing the `UninterruptibleThread` usage. In order to put the `AdminClient` under stress and make sure it works, I've created the following standalone application: https://github.com/gaborgsomogyi/kafka-admin-interruption

What this PR contains:
* Removed `UninterruptibleThread` from `KafkaOffsetReaderAdmin`
* Removed/modified comments which are not true
* Adapted `KafkaRelationSuite`
* Renamed `partitionsAssignedToConsumer` to `partitionsAssignedToAdmin`

### Why are the changes needed?
`KafkaOffsetReaderAdmin` doesn't need `UninterruptibleThread` usage.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing unit tests + manually with simple Kafka to Kafka query.

Closes apache#30668 from gaborgsomogyi/SPARK-32910.

Authored-by: Gabor Somogyi <[email protected]>
Signed-off-by: Jungtaek Lim (HeartSaVioR) <[email protected]>
…h mainstream databases

### What changes were proposed in this pull request?
In Spark, `decode(bin, charset)` decodes the first argument using the character set given as the second argument.

Unfortunately this is NOT what any other SQL vendor understands `DECODE` to do.
`DECODE` generally is shorthand for a simple case expression:

```
SELECT DECODE(c1, 1, 'Hello', 2, 'World', '!') FROM (VALUES (1), (2), (3)) AS T(c1)
=>
(Hello),
(World)
(!)
```
Several mainstream databases support this syntax:
**Oracle**
https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/DECODE.html#GUID-39341D91-3442-4730-BD34-D3CF5D4701CE
**Vertica**
https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/Functions/String/DECODE.htm?tocpath=SQL%20Reference%20Manual%7CSQL%20Functions%7CString%20Functions%7C_____10
**DB2**
https://www.ibm.com/support/knowledgecenter/SSGU8G_14.1.0/com.ibm.sqls.doc/ids_sqs_1447.htm
**Redshift**
https://docs.aws.amazon.com/redshift/latest/dg/r_DECODE_expression.html
**Pig**
https://pig.apache.org/docs/latest/api/org/apache/pig/piggybank/evaluation/decode/Decode.html
**Teradata**
https://docs.teradata.com/reader/756LNiPSFdY~4JcCCcR5Cw/jtCpCycpEaXESG4d63kMjg
**Snowflake**
https://docs.snowflake.com/en/sql-reference/functions/decode.html
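
A runnable sketch of the new behaviour from PySpark SQL (assuming a Spark build that includes this change):

```python
spark.sql("""
    SELECT decode(c1, 1, 'Hello', 2, 'World', '!') AS decoded
    FROM VALUES (1), (2), (3) AS T(c1)
""").show()
# Expected rows: Hello, World, !
```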

### Why are the changes needed?
It is very useful.

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
Jenkins test.

Closes apache#30479 from beliefer/SPARK-33527.

Lead-authored-by: gengjiaan <[email protected]>
Co-authored-by: beliefer <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
…alCatalogVersionsSuite

### What changes were proposed in this pull request?

This PR aims to use `hadoop-3.2` distribution in HiveExternalCatalogVersionsSuite if available.

### Why are the changes needed?

Apache Spark 3.1 is using Hadoop 3 by default. We need to focus on Hadoop 3 more to prepare for the future.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

Closes apache#30722 from dongjoon-hyun/SPARK-33750.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
### What changes were proposed in this pull request?

Replace `legacy_setops_precedence_enbled` with `legacy_setops_precedence_enabled`

Alternatively, `legacy_setops_precedence_enabled` could be added, and `legacy_setops_precedence_enbled` retained, and if set the code could honor it and warn about the deprecated spelling.

### Why are the changes needed?

`enabled` is misspelled in `legacy_setops_precedence_enbled`

### Does this PR introduce _any_ user-facing change?

Yes.

It would break current consumers.
Examples include:
* https://www.programmersought.com/article/87752082924/
* https://github.com/fugue-project/fugue/blob/125d873c38e18b5f09b032bd01ac47a0c6739ddc/fugue_sql/_antlr/fugue_sqlLexer.py
* https://github.com/search?q=legacy_setops_precedence_enbled&type=code

### How was this patch tested?

It's been included in apache#30323 for a while (and is now split out here)

Closes apache#30677 from jsoref/spelling-enabled.

Authored-by: Josh Soref <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
…d to follow the default Hadoop profile updated

### What changes were proposed in this pull request?

This PR updates `kubernetes/integration-tests/README.md`.

### Why are the changes needed?

To follow the current Hadoop profile (hadoop-3.2).

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

I have confirmed that the integration tests pass with the following command for both Hadoop 3.2 and 2.7.
```
build/mvn integration-test -am -pl :spark-kubernetes-integration-tests_2.12 \
  -Pkubernetes \
  -Pkubernetes-integration-tests \
  -Dspark.kubernetes.test.imageTag=${IMAGE_TAG} \
  -Dspark.kubernetes.test.imageRepo=docker.io/kubespark \
  -Dspark.kubernetes.test.namespace=default \
  -Dspark.kubernetes.test.deployMode=minikube \
  -Dtest.include.tags=k8s
```

Closes apache#30726 from sarutak/update-kube-integ-readme.

Authored-by: Kousuke Saruta <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
### What changes were proposed in this pull request?

This PR fixes the Scala 2.13 build failure introduced by apache#30479.

### Why are the changes needed?

To pass Scala 2.13 build.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Should be done by GitHub Actions.

Closes apache#30727 from sarutak/fix-scala213-build-failure.

Authored-by: Kousuke Saruta <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
…esolve identifier

### What changes were proposed in this pull request?

This PR proposes to migrate `CACHE TABLE` to use `UnresolvedRelation` to resolve the table/view identifier in the Analyzer, as discussed in https://github.com/apache/spark/pull/30403/files#r532360022.

### Why are the changes needed?

To resolve the table in the analyzer.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing tests

Closes apache#30598 from imback82/cache_v2.

Authored-by: Terry Kim <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
…rtitionExists()

### What changes were proposed in this pull request?
1. Check that the partition identifier passed to `SupportsPartitionManagement.partitionExists()` is fully specified (specifies all values of partition fields).
2. Remove the custom implementation of `partitionExists()` from `InMemoryPartitionTable`, and re-use the default implementation from `SupportsPartitionManagement`.

### Why are the changes needed?
The method is supposed to check the existence of one partition, but currently it can return `true` for a partially specified partition. This can lead to incorrect command behavior; for instance, commands could modify or place data in the middle of a partition path.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running existing test suites:
```
$ build/sbt "test:testOnly *AlterTablePartitionV2SQLSuite"
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *SupportsPartitionManagementSuite"
```

Closes apache#30667 from MaxGekk/check-len-partitionExists.

Authored-by: Max Gekk <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
…ask on thriftserver

### What changes were proposed in this pull request?

This PR adds a new config `spark.sql.thriftServer.forceCancel` to give users a way to interrupt tasks when a statement is cancelled.

### Why are the changes needed?

After [apache#29933](apache#29933), we support cancelling a query on timeout, but the default behavior of `SparkContext.cancelJobGroups` doesn't interrupt tasks and just lets them finish by themselves. In some cases that is dangerous, e.g., with data skew or a heavy shuffle: a task can keep running for a long time after cancellation and its resources are not released.

### Does this PR introduce _any_ user-facing change?

Yes, a new config.

### How was this patch tested?

Add test.

Closes apache#30481 from ulysses-you/SPARK-33526.

Lead-authored-by: ulysses-you <[email protected]>
Co-authored-by: ulysses-you <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
… Scala 2.13 build

### What changes were proposed in this pull request?

This PR adds `kubernetes-integration-tests` to GitHub Actions for Scala 2.13 build.

### Why are the changes needed?

Now that the build passes with `kubernetes-integration-tests` and Scala 2.13, it's better to keep it buildable.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Should be done by GitHub Actions.
I also confirmed that the build passes with the following command.
```
$ build/sbt -Pscala-2.13 -Pkubernetes -Pkubernetes-integration-tests compile test:compile
```

Closes apache#30731 from sarutak/github-actions-k8s.

Authored-by: Kousuke Saruta <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
…Actions and AppVeyor

### What changes were proposed in this pull request?

This PR fixes the R dependencies build error on GitHub Actions and AppVeyor.
The reason seems to be that the `usethis` package was updated on 2020/12/10.
https://cran.r-project.org/web/packages/usethis/index.html

### Why are the changes needed?

To keep the build clean.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Should be done by GitHub Actions.

Closes apache#30737 from sarutak/fix-r-dependencies-build-error.

Authored-by: Kousuke Saruta <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
@dmcwhorter force-pushed the dmcwhorter-SPARK-22256 branch from a2b5feb to 8a5661c on December 11, 2020 18:33
… describe behavior, add ticket number to test cases
@dmcwhorter force-pushed the dmcwhorter-SPARK-22256 branch from 2a73b8c to 8149c34 on December 15, 2020 17:19