@GuoPhilipse
Owner

sync

yaooqinn and others added 30 commits July 16, 2020 13:01
### What changes were proposed in this pull request?

This PR adds the SQL standard command `SET TIME ZONE`, which sets the default time zone displacement for the current SQL session. It is equivalent to the existing `set spark.sql.session.timeZone=xxx`.

In summary, this PR adds the following syntax:

```
SET TIME ZONE LOCAL;
SET TIME ZONE 'valid time zone';  -- zone offset or region
SET TIME ZONE INTERVAL XXXX; -- the interval must be within [-18, +18] hours; this range is wider than the ANSI range of [-14, +14]
```
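For illustration only, here is how the new syntax might be exercised from a Scala session (zone names and offsets are arbitrary examples; `spark` is the active `SparkSession`):

```scala
// Region-based zone.
spark.sql("SET TIME ZONE 'America/Los_Angeles'")

// Zone offset expressed as an interval; must stay within [-18, +18] hours.
spark.sql("SET TIME ZONE INTERVAL 10 HOURS")

// Revert to the JVM's local time zone.
spark.sql("SET TIME ZONE LOCAL")

// Equivalent configuration-based form that already existed.
spark.sql("SET spark.sql.session.timeZone=America/Los_Angeles")
```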

### Why are the changes needed?

ANSI compliance, and to give pure SQL users a way to set the session time zone.

### Does this PR introduce _any_ user-facing change?

Yes, new syntax is added.

### How was this patch tested?

Added unit tests.

Also locally verified the reference doc:

![image](https://user-images.githubusercontent.com/8326978/87510244-c8dc3680-c6a5-11ea-954c-b098be84afee.png)

Closes #29064 from yaooqinn/SPARK-32272.

Authored-by: Kent Yao <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
…c tables

### What changes were proposed in this pull request?
Spark SQL commands fail when selecting from ORC tables.
Steps to reproduce:
Example 1 -
Prerequisite - `/Users/test/tpcds_scale5data/date_dim` is the location of the ORC data, which was generated by Hive.
```
val table = """CREATE TABLE `date_dim` (
  `d_date_sk` INT,
  `d_date_id` STRING,
  `d_date` TIMESTAMP,
  `d_month_seq` INT,
  `d_week_seq` INT,
  `d_quarter_seq` INT,
  `d_year` INT,
  `d_dow` INT,
  `d_moy` INT,
  `d_dom` INT,
  `d_qoy` INT,
  `d_fy_year` INT,
  `d_fy_quarter_seq` INT,
  `d_fy_week_seq` INT,
  `d_day_name` STRING,
  `d_quarter_name` STRING,
  `d_holiday` STRING,
  `d_weekend` STRING,
  `d_following_holiday` STRING,
  `d_first_dom` INT,
  `d_last_dom` INT,
  `d_same_day_ly` INT,
  `d_same_day_lq` INT,
  `d_current_day` STRING,
  `d_current_week` STRING,
  `d_current_month` STRING,
  `d_current_quarter` STRING,
  `d_current_year` STRING)
USING orc
LOCATION '/Users/test/tpcds_scale5data/date_dim'"""

spark.sql(table).collect

val u = """select date_dim.d_date_id from date_dim limit 5"""

spark.sql(u).collect
```
Example 2

```
  val table = """CREATE TABLE `test_orc_data` (
  `_col1` INT,
  `_col2` STRING,
  `_col3` INT)
  USING orc"""

spark.sql(table).collect

spark.sql("insert into test_orc_data values(13, '155', 2020)").collect

val df = """select _col2 from test_orc_data limit 5"""
spark.sql(df).collect

```

It fails with the error below:
```
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2.0 (TID 2, 192.168.0.103, executor driver): java.lang.ArrayIndexOutOfBoundsException: 1
    at org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.initBatch(OrcColumnarBatchReader.java:156)
    at org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.$anonfun$buildReaderWithPartitionValues$7(OrcFileFormat.scala:258)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:141)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:203)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:116)
    at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:620)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:729)
    at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:343)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:895)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:895)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:372)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:336)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:133)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:445)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1489)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:448)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
```

The reason is that `initBatch` does not get the schema that is needed to find the column values in `OrcFileFormat.scala`:
```
batchReader.initBatch(
 TypeDescription.fromString(resultSchemaString)
```

### Why are the changes needed?
Spark SQL queries on ORC tables are failing.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
A unit test is added for this. Also tested the failing queries through spark-shell and spark-submit.

Closes #29045 from SaurabhChawla100/SPARK-32234.

Lead-authored-by: SaurabhChawla <[email protected]>
Co-authored-by: SaurabhChawla <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
### What changes were proposed in this pull request?
In this PR, I propose to support pushed-down filters in the JSON datasource. The reason for pushing a filter down to `JacksonParser` is to apply the filter as soon as all of its attributes become available, i.e. are converted from JSON field values to the desired values according to the schema. This allows skipping the parsing of the rest of the JSON record, and the conversions of other values, if the filter returns `false`. This can improve performance when pushed filters are highly selective and conversion of JSON string fields to desired values is comparably expensive (for example, conversion to `TIMESTAMP` values).

The main idea behind `JsonFilters` is to group pushdown filters by their references, convert the grouped filters to expressions, and then compile them to predicates. The predicates are indexed by schema field positions. Each predicate keeps a state with a reference counter of the row fields that have not yet been set. As soon as the counter reaches `0`, the predicate can be applied to the row because all of its dependencies have been set. Before a new row is processed, each predicate's reference counter is reset to the total number of its references (dependencies in a row).

The common code shared between `CSVFilters` and `JsonFilters` is moved to the `StructFilters` class and its companion object.
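The reference-counting idea can be sketched in a few lines of plain Scala (this is an illustration of the mechanism, not the actual `JsonFilters` code; all names are hypothetical):

```scala
// A predicate that fires as soon as all of its referenced fields have been set.
final class CountedPredicate(refs: Set[String], eval: Map[String, Any] => Boolean) {
  private var remaining = refs.size
  private val seen = scala.collection.mutable.Map.empty[String, Any]

  // Called before each new row: restore the counter to the number of references.
  def reset(): Unit = { remaining = refs.size; seen.clear() }

  // Called as soon as a field is converted from its JSON value.
  // Returns Some(result) once every referenced field is available.
  def set(field: String, value: Any): Option[Boolean] = {
    if (refs.contains(field) && !seen.contains(field)) {
      seen(field) = value
      remaining -= 1
    }
    if (remaining == 0) Some(eval(seen.toMap)) else None
  }
}

val p = new CountedPredicate(Set("a"), row => row("a").asInstanceOf[Int] > 0)
p.reset()
// The filter can be answered after the first referenced field is set, so the
// rest of the record does not need to be parsed.
assert(p.set("a", -1) == Some(false))
```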

### Why are the changes needed?
The changes improve performance on synthetic benchmarks by up to **27 times** on JDK 8 and **25 times** on JDK 11:
```
OpenJDK 64-Bit Server VM 1.8.0_242-8u242-b08-0ubuntu3~18.04-b08 on Linux 4.15.0-1044-aws
Intel(R) Xeon(R) CPU E5-2670 v2  2.50GHz
Filters pushdown:                         Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
w/o filters                                       25230          25255          22          0.0      252299.6       1.0X
pushdown disabled                                 25248          25282          33          0.0      252475.6       1.0X
w/ filters                                          905            911           8          0.1        9047.9      27.9X
```

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
- Added new test suites `JsonFiltersSuite` and `JacksonParserSuite`.
- By new end-to-end and case sensitivity tests in `JsonSuite`.
- By `CSVFiltersSuite`, `UnivocityParserSuite` and `CSVSuite`.
- Re-running `CSVBenchmark` and `JsonBenchmark` using Amazon EC2:

| Item | Description |
| ---- | ----|
| Region | us-west-2 (Oregon) |
| Instance | r3.xlarge (spot instance) |
| AMI | ami-06f2f779464715dc5 (ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1) |
| Java | OpenJDK8/11 installed by `sudo add-apt-repository ppa:openjdk-r/ppa` & `sudo apt install openjdk-11-jdk`|

and `./dev/run-benchmarks`:
```python
#!/usr/bin/env python3

import os
from sparktestsupport.shellutils import run_cmd

benchmarks = [
    ['sql/test', 'org.apache.spark.sql.execution.datasources.csv.CSVBenchmark'],
    ['sql/test', 'org.apache.spark.sql.execution.datasources.json.JsonBenchmark']
]

print('Set SPARK_GENERATE_BENCHMARK_FILES=1')
os.environ['SPARK_GENERATE_BENCHMARK_FILES'] = '1'

for b in benchmarks:
    print("Run benchmark: %s" % b[1])
    run_cmd(['build/sbt', '%s:runMain %s' % (b[0], b[1])])
```

Closes #27366 from MaxGekk/json-filters-pushdown.

Lead-authored-by: Maxim Gekk <[email protected]>
Co-authored-by: Max Gekk <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
…equire

### What changes were proposed in this pull request?
Small improvement in the error message shown to the user at https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala#L537-L538

### Why are the changes needed?
Currently, an exception is thrown without any specific details on the cause:
```
Caused by: java.lang.IllegalArgumentException: requirement failed
    at scala.Predef$.require(Predef.scala:212)
    at org.apache.spark.mllib.util.MLUtils$.fastSquaredDistance(MLUtils.scala:508)
    at org.apache.spark.mllib.clustering.EuclideanDistanceMeasure$.fastSquaredDistance(DistanceMeasure.scala:232)
    at org.apache.spark.mllib.clustering.EuclideanDistanceMeasure.isCenterConverged(DistanceMeasure.scala:190)
    at org.apache.spark.mllib.clustering.KMeans$$anonfun$runAlgorithm$4.apply(KMeans.scala:336)
    at org.apache.spark.mllib.clustering.KMeans$$anonfun$runAlgorithm$4.apply(KMeans.scala:334)
    at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
    at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
    at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
    at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:130)
    at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:130)
    at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:236)
    at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
    at scala.collection.mutable.HashMap.foreach(HashMap.scala:130)
    at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
    at scala.collection.MapLike$MappedValues.foreach(MapLike.scala:245)
    at org.apache.spark.mllib.clustering.KMeans.runAlgorithm(KMeans.scala:334)
    at org.apache.spark.mllib.clustering.KMeans.run(KMeans.scala:251)
    at org.apache.spark.mllib.clustering.KMeans.run(KMeans.scala:233)
```

### Does this PR introduce _any_ user-facing change?
Yes, this PR adds an explanatory message to be shown to the user when the requirement check is not met.

### How was this patch tested?
manually

Closes #29115 from dzlab/patch/SPARK-32315.

Authored-by: dzlab <[email protected]>
Signed-off-by: Huaxin Gao <[email protected]>
…cation, regression, clustering and fpm

### What changes were proposed in this pull request?
Set default param values in the `...Params` traits in both Scala and Python.
I will do this in two PRs. This PR changes classification, regression, clustering and fpm; the rest will be changed in another PR.

### Why are the changes needed?
Make ML have the same default param values between an estimator and its corresponding transformer, and also between Scala and Python.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing tests

Closes #29112 from huaxingao/set_default.

Authored-by: Huaxin Gao <[email protected]>
Signed-off-by: Huaxin Gao <[email protected]>
### What changes were proposed in this pull request?

This PR aims to remove Python 2 test case from K8s IT.

### Why are the changes needed?

Since Apache Spark 3.1.0 dropped Python 2.7, 3.4 and 3.5 support officially via SPARK-32138, K8s IT fails.

```
KubernetesSuite:
- Run SparkPi with no resources
- Run SparkPi with a very long application name.
- Use SparkLauncher.NO_RESOURCE
- Run SparkPi with a master URL without a scheme.
- Run SparkPi with an argument.
- Run SparkPi with custom labels, annotations, and environment variables.
- All pods have the same service account by default
- Run extraJVMOptions check on driver
- Run SparkRemoteFileTest using a remote data file
- Run SparkPi with env and mount secrets.
- Run PySpark on simple pi.py example
- Run PySpark with Python2 to test a pyfiles example *** FAILED ***
  The code passed to eventually never returned normally. Attempted 113 times over 2.0014854648999996 minutes. Last failure message: false was not true. (KubernetesSuite.scala:370)
- Run PySpark with Python3 to test a pyfiles example
- Run PySpark with memory customization
- Run in client mode.
- Start pod creation from template
- PVs with local storage
- Launcher client dependencies
- Test basic decommissioning
- Run SparkR on simple dataframe.R example
Run completed in 11 minutes, 15 seconds.
Total number of tests run: 20
Suites: completed 2, aborted 0
Tests: succeeded 19, failed 1, canceled 0, ignored 0, pending 0
*** 1 TEST FAILED ***
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass Jenkins K8s IT.

Closes #29136 from dongjoon-hyun/SPARK-32335.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
…e in hive version related subdirectories

### What changes were proposed in this pull request?

This patch fixes the build issue on the Hive 1.2 profile brought by #29069, by putting the mocks for HiveSessionImplSuite in Hive-version-related subdirectories, so that the Maven build will pick up the proper source code according to the profile.

### Why are the changes needed?

#29069 fixed the flakiness of HiveSessionImplSuite, but given the patch relied on the default profile (Hive 2.3) it broke the build with Hive 1.2 profile. This patch addresses both Hive versions.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Manually confirmed the test suite via below command:

> Hive 1.2
```
build/mvn -Dtest=none -DwildcardSuites=org.apache.spark.sql.hive.thriftserver.HiveSessionImplSuite test -Phive-1.2 -Phadoop-2.7 -Phive-thriftserver
```

> Hive 2.3

```
build/mvn -Dtest=none -DwildcardSuites=org.apache.spark.sql.hive.thriftserver.HiveSessionImplSuite test -Phive-2.3 -Phadoop-3.2 -Phive-thriftserver
```

Closes #29129 from frankyin-factual/hive-tests.

Authored-by: Frank Yin <[email protected]>
Signed-off-by: Jungtaek Lim (HeartSaVioR) <[email protected]>
### What changes were proposed in this pull request?

This PR aims to upgrade PySpark's embedded cloudpickle to the latest cloudpickle v1.5.0 (See https://github.com/cloudpipe/cloudpickle/blob/v1.5.0/cloudpickle/cloudpickle.py)

### Why are the changes needed?

There are many bug fixes. For example, the bug described in the JIRA:

dill unpickling fails because cloudpickle defines `types.ClassType`, which is undefined in dill. This results in the following error:

```
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/apache_beam/internal/pickler.py", line 279, in loads
    return dill.loads(s)
  File "/usr/local/lib/python3.6/site-packages/dill/_dill.py", line 317, in loads
    return load(file, ignore)
  File "/usr/local/lib/python3.6/site-packages/dill/_dill.py", line 305, in load
    obj = pik.load()
  File "/usr/local/lib/python3.6/site-packages/dill/_dill.py", line 577, in _load_type
    return _reverse_typemap[name]
KeyError: 'ClassType'
```

See also cloudpipe/cloudpickle#82. This was fixed for cloudpickle 1.3.0+ (cloudpipe/cloudpickle#337), but PySpark's cloudpickle.py doesn't have this change yet.

More notably, it now supports the C pickle implementation with Python 3.8, which hugely improves performance. This has already been adopted in other projects such as Ray.

### Does this PR introduce _any_ user-facing change?

Yes, the bug fixes described above. Internally, users can also leverage the fast cloudpickle backed by the C pickle implementation.

### How was this patch tested?

Jenkins will test it out.

Closes #29114 from HyukjinKwon/SPARK-32094.

Authored-by: HyukjinKwon <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
### What changes were proposed in this pull request?

Fix a typo in the error log of the `SparkOperation` trait, reported in #28963 (comment).

### Why are the changes needed?

Fix the error message in the Thrift server driver log.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Passing GitHub actions

Closes #29140 from yaooqinn/SPARK-32145-F.

Authored-by: Kent Yao <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
…erWebUI

### What changes were proposed in this pull request?

This PR allows an external agent to inform the Master that certain hosts
are being decommissioned.

### Why are the changes needed?

The current decommissioning is triggered by the Worker getting a SIGPWR
(out of band, possibly by some cleanup hook), which then informs the Master
about it. This approach may not be feasible in some environments that cannot
trigger a cleanup hook on the Worker. In addition, when a large number of
worker nodes are being decommissioned, the master will get a flood of
messages.

So we add a new post endpoint `/workers/kill` on the MasterWebUI that allows an
external agent to inform the master about all the nodes being decommissioned in
bulk. The list of nodes is specified by providing a list of hostnames. All workers on those
hosts will be decommissioned.

This API is merely a new entry point into the existing decommissioning
logic. It does not change how the decommissioning request is handled in
its core.

### Does this PR introduce _any_ user-facing change?

Yes, a new endpoint `/workers/kill` is added to the MasterWebUI. By default only
requests originating from an IP address local to the MasterWebUI are allowed.

### How was this patch tested?

Added unit tests

Closes #29015 from agrawaldevesh/master_decom_endpoint.

Authored-by: Devesh Agrawal <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
### What changes were proposed in this pull request?
New `spark.sql.metadataCacheTTLSeconds` option that adds time-to-live cache behaviour to the existing caches in `FileStatusCache` and `SessionCatalog`.

### Why are the changes needed?
Currently Spark [caches file listing for tables](https://spark.apache.org/docs/2.4.4/sql-data-sources-parquet.html#metadata-refreshing) and requires issuing `REFRESH TABLE` any time the file listing has changed outside of Spark. Unfortunately, simply submitting `REFRESH TABLE` commands could be very cumbersome. Assuming frequently added files, hundreds of tables and dozens of users querying the data (and expecting up-to-date results), manually refreshing metadata for each table is not a solution.

This is a pretty common use-case for streaming ingestion of data, which can be done outside of Spark (with tools like Kafka Connect, etc.).

A similar feature exists in Presto: `hive.file-status-cache-expire-time` can be found [here](https://prestosql.io/docs/current/connector/hive.html#hive-configuration-properties).

### Does this PR introduce _any_ user-facing change?
Yes, it's controlled with the new `spark.sql.metadataCacheTTLSeconds` option.

When it's set to `-1` (by default), the behaviour of caches doesn't change, so it stays _backwards-compatible_.

Otherwise, you can specify a value in seconds, for example `spark.sql.metadataCacheTTLSeconds: 60` means 1-minute cache TTL.
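As a hedged illustration (the exact point at which the option must be supplied, e.g. at session-build time versus at runtime, may differ), enabling a one-minute TTL could look like:

```scala
import org.apache.spark.sql.SparkSession

// Assumes the option can be passed as a regular SQL config when the session is built.
val spark = SparkSession.builder()
  .appName("metadata-cache-ttl-example")
  .config("spark.sql.metadataCacheTTLSeconds", "60") // expire cached file listings after 1 minute
  .getOrCreate()
```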

### How was this patch tested?

Added new tests in:

- FileIndexSuite
- SessionCatalogSuite

Closes #28852 from sap1ens/SPARK-30616-metadata-cache-ttl.

Authored-by: Yaroslav Tkachenko <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
…PROFILES

### What changes were proposed in this pull request?

This PR aims to rename `HADOOP2_MODULE_PROFILES` to `HADOOP_MODULE_PROFILES` because Hadoop 3 is now the default.

### Why are the changes needed?

Hadoop 3 is now the default.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass GitHub Action dependency test.

Closes #29128 from williamhyun/williamhyun-patch-3.

Authored-by: williamhyun <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
### What changes were proposed in this pull request?
Use a while loop instead of recursion.
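As a generic illustration of this kind of rewrite (not the actual tree-prediction code; the node types below are hypothetical), a recursive descent over a binary decision tree can be replaced by an iterative loop:

```scala
sealed trait Node
case class Leaf(prediction: Double) extends Node
case class Internal(feature: Int, threshold: Double, left: Node, right: Node) extends Node

// Recursive version: one stack frame per tree level.
def predictRecursive(node: Node, features: Array[Double]): Double = node match {
  case Leaf(p) => p
  case Internal(f, t, l, r) =>
    if (features(f) <= t) predictRecursive(l, features) else predictRecursive(r, features)
}

// Iterative version: a single loop, no call overhead.
def predictIterative(root: Node, features: Array[Double]): Double = {
  var node = root
  var result = Option.empty[Double]
  while (result.isEmpty) {
    node match {
      case Leaf(p) => result = Some(p)
      case Internal(f, t, l, r) => node = if (features(f) <= t) l else r
    }
  }
  result.get
}
```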

### Why are the changes needed?
3% ~ 10% faster

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing test suites.

Closes #29095 from zhengruifeng/tree_pred_opt.

Authored-by: zhengruifeng <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
### What changes were proposed in this pull request?
This PR aims to update the docker/spark-test and clean up unused stuff.

### Why are the changes needed?
Since Spark 3.0.0, Java 11 is supported. We should use the latest Java and OS.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?
Manually do the following as described in https://github.com/apache/spark/blob/master/external/docker/spark-test/README.md .

```
docker run -v $SPARK_HOME:/opt/spark spark-test-master
docker run -v $SPARK_HOME:/opt/spark spark-test-worker spark://<master_ip>:7077
```

Closes #29150 from williamhyun/docker.

Authored-by: William Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
… executors

### What changes were proposed in this pull request?
This PR adds functionality to consider the running tasks on decommissioning executors, based on a config.
In Spark-on-cloud deployments, we sometimes already know that an executor won't be alive for more than a fixed amount of time. For example, for AWS Spot nodes, once we get the notification, we know that the node will be gone in 120 seconds.
So if the running tasks on the decommissioning executors may run beyond currentTime+120 seconds, they are candidates for speculation.

### Why are the changes needed?
Currently, when an executor is decommissioned, we stop scheduling new tasks on it, but the already running tasks keep running on it. Depending on the cloud, we might know beforehand that an executor won't be alive for more than a preconfigured time. Different cloud providers give different timeouts before they take away the nodes. For example, in the case of AWS spot nodes, an executor won't be alive for more than 120 seconds. We can utilize this information in cloud environments and make better decisions about speculating the already running tasks on decommissioning executors.

### Does this PR introduce _any_ user-facing change?
Yes. This PR adds a new config, `spark.executor.decommission.killInterval`, which users can explicitly set based on the cloud environment where they are running.
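A hedged example of setting it (the exact value format, e.g. whether a unit suffix is accepted, is an assumption here):

```scala
import org.apache.spark.SparkConf

// Tell the scheduler that decommissioned executors are expected to be killed
// after roughly 120 seconds, so tasks likely to outlive that window can be speculated.
val conf = new SparkConf()
  .set("spark.executor.decommission.killInterval", "120s")
```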

### How was this patch tested?
Added UT.

Closes #28619 from prakharjain09/SPARK-21040-speculate-decommission-exec-tasks.

Authored-by: Prakhar Jain <[email protected]>
Signed-off-by: Holden Karau <[email protected]>
### What changes were proposed in this pull request?

Replaced `floorDiv` with a plain `/` in `localRebaseGregorianToJulianDays()` in `spark/sql/catalyst/util/RebaseDateTime.scala`.

### Why are the changes needed?

Makes the logic/code easier to understand and slightly more efficient.
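A quick illustration of why the substitution is safe here: for non-negative dividends, `Math.floorDiv` and integer `/` agree; they only diverge for negative values, which is why this change relies on the value being non-negative.

```scala
val MILLIS_PER_DAY = 86400000L

// Non-negative dividend: both give 1.
assert(Math.floorDiv(86400123L, MILLIS_PER_DAY) == 86400123L / MILLIS_PER_DAY)

// Negative dividend: floorDiv rounds toward negative infinity, `/` truncates toward zero.
assert(Math.floorDiv(-1L, MILLIS_PER_DAY) == -1L)
assert(-1L / MILLIS_PER_DAY == 0L)
```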

### Does this PR introduce _any_ user-facing change?

NO

### How was this patch tested?

Proof of concept [here](https://github.com/apache/spark/pull/28573/files). The operation `utcCal.getTimeInMillis / MILLIS_PER_DAY` already results in an integer value.

Closes #29008 from Sudhar287/SPARK-31579.

Authored-by: Sudharshann D <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
…ing modules

### What changes were proposed in this pull request?

See again the related PRs like #28971
This completes fixing compilation for 2.13 for all but `repl`, which is a separate task.

### Why are the changes needed?

Eventually, we need to support a Scala 2.13 build, perhaps in Spark 3.1.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing tests. (2.13 was not tested; this is about getting it to compile without breaking 2.12)

Closes #29147 from srowen/SPARK-29292.4.

Authored-by: Sean Owen <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
### What changes were proposed in this pull request?

Use `/usr/bin/env python3` consistently instead of `/usr/bin/env python` in build scripts, to reliably select Python 3.

### Why are the changes needed?

Scripts no longer work with Python 2.

### Does this PR introduce _any_ user-facing change?

No, should be all build system changes.

### How was this patch tested?

Existing tests / NA

Closes #29151 from srowen/SPARK-29909.2.

Authored-by: Sean Owen <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
…xpr in distinct aggregates

### What changes were proposed in this pull request?

This PR intends to fix a bug in distinct FIRST/LAST aggregates in v2.4.6/v3.0.0/master;
```
scala> sql("SELECT FIRST(DISTINCT v) FROM VALUES 1, 2, 3 t(v)").show()
...
Caused by: java.lang.UnsupportedOperationException: Cannot evaluate expression: false#37
  at org.apache.spark.sql.catalyst.expressions.Unevaluable$class.eval(Expression.scala:258)
  at org.apache.spark.sql.catalyst.expressions.AttributeReference.eval(namedExpressions.scala:226)
  at org.apache.spark.sql.catalyst.expressions.aggregate.First.ignoreNulls(First.scala:68)
  at org.apache.spark.sql.catalyst.expressions.aggregate.First.updateExpressions$lzycompute(First.scala:82)
  at org.apache.spark.sql.catalyst.expressions.aggregate.First.updateExpressions(First.scala:81)
  at org.apache.spark.sql.execution.aggregate.HashAggregateExec$$anonfun$15.apply(HashAggregateExec.scala:268)
```
A root cause of this bug is that the `Aggregation` strategy replaces the foldable boolean `ignoreNullsExpr` expression with an `Unevaluable` expression (`AttributeReference`) for distinct FIRST/LAST aggregate functions. But this operation cannot be allowed, because the `Analyzer` has checked that it must be foldable;
https://github.com/apache/spark/blob/ffdbbae1d465fe2c710d020de62ca1a6b0b924d9/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/First.scala#L74-L76
So, this PR proposes to change the variable for `IGNORE NULLS` from `Expression` to `Boolean` to avoid this case.

### Why are the changes needed?

Bugfix.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added a test in `DataFrameAggregateSuite`.

Closes #29143 from maropu/SPARK-32344.

Authored-by: Takeshi Yamamuro <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
…ions

### What changes were proposed in this pull request?

Make the test result log of GitHub Actions more readable by showing errors from SBT only.
1. Add the "--error" flag to sbt in GitHub Actions to set the log level to "ERROR".
2. Show only failed test cases in the stderr output of GitHub Actions. According to https://www.scalatest.org/user_guide/using_the_runner, with the SBT option `-eNCXEHLOPQMDF` we can drop all of the following events:
```
N - drop TestStarting events
C - drop TestSucceeded events
X - drop TestIgnored events
E - drop TestPending events
H - drop SuiteStarting events
L - drop SuiteCompleted events
O - drop InfoProvided events
P - drop ScopeOpened events
Q - drop ScopeClosed events
R - drop ScopePending events
M - drop MarkupProvided events
```
and enable the following two modes:
```
D - show all durations
F - show full stack traces
```
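For reference, a hedged example of how such reporter flags are typically passed to ScalaTest through sbt (an illustrative build-definition snippet, not the actual change in Spark's build):

```scala
// In a build.sbt: -e selects the standard-error reporter and the trailing
// letters are the configuration characters listed above.
Test / testOptions += Tests.Argument(TestFrameworks.ScalaTest, "-eNCXEHLOPQMDF")
```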

### Why are the changes needed?

Currently, the output of GitHub Actions is very long and we have to scroll down to find the failed test cases. What's more, the log may be truncated. In such a case, we will have to wait until all the jobs are completed and then download all the raw logs.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Before the changes, all compilation warnings are shown:
![image](https://user-images.githubusercontent.com/1097932/87846810-98ec8900-c887-11ea-913b-164b84df62cd.png)

as well as all the passed and ignored test cases:
![image](https://user-images.githubusercontent.com/1097932/87846834-ca655480-c887-11ea-9c29-977f802e4c82.png)

After the changes, the sbt test output only shows the summary for a successful job:
![image](https://user-images.githubusercontent.com/1097932/87846961-e74e5780-c888-11ea-82d5-cf1da1740181.png)

![image](https://user-images.githubusercontent.com/1097932/87745273-5735e280-c7a2-11ea-8ac9-b4b0e3cb458d.png)

If there is a test failure, a full stack trace is shown as well as a test failure summary at the end of the test log:

![image](https://user-images.githubusercontent.com/1097932/87751143-3aa1a680-c7b2-11ea-9d09-52637a322270.png)

![image](https://user-images.githubusercontent.com/1097932/87752404-1f846600-c7b5-11ea-8106-8ddaf3cc3f7e.png)

Closes #29133 from gengliangwang/shortLog.

Authored-by: Gengliang Wang <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
### What changes were proposed in this pull request?

This PR proposes to remove redundant sorts before repartition nodes whenever the data is ordered after the repartitioning.

### Why are the changes needed?

It looks like our `EliminateSorts` rule can be extended further to remove sorts before repartition nodes that don't affect the final output ordering. It seems safe to perform the following rewrites:

- `Sort -> Repartition -> Sort -> Scan` as `Sort -> Repartition -> Scan`
- `Sort -> Repartition -> Project -> Sort -> Scan` as `Sort -> Repartition -> Project -> Scan`
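In DataFrame terms, a hypothetical query illustrating the first rewrite above (not taken from the PR) would be:

```scala
import spark.implicits._

val df = spark.range(100).toDF("id")

val query = df
  .sort($"id".desc)   // redundant: the ordering is re-established after the repartition
  .repartition(10)
  .sort($"id".asc)    // this sort determines the final output ordering

// With the extended rule, the optimized plan should keep only the outer Sort.
query.explain(true)
```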

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

More test cases.

Closes #29089 from aokolnychyi/spark-32276.

Authored-by: Anton Okolnychyi <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
Closes #29155 from medb/patch-1.

Authored-by: Igor Dvorzhak <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
### What changes were proposed in this pull request?

- Adds the `DataFrameWriterV2` class.
- Adds `writeTo` method to `pyspark.sql.DataFrame`.
- Adds related SQL partitioning functions (`years`, `months`, ..., `bucket`).

### Why are the changes needed?

Feature parity.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Added new unit tests.

TODO: Should we test against `org.apache.spark.sql.connector.InMemoryTableCatalog`? If so, how to expose it in Python tests?

Closes #27331 from zero323/SPARK-29157.

Authored-by: zero323 <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
### What is changed?

This pull request adds the ability to migrate shuffle files during Spark's decommissioning. The design document associated with this change is at https://docs.google.com/document/d/1xVO1b6KAwdUhjEJBolVPl9C6sLj7oOveErwDSYdT-pE .

To allow this change the `MapOutputTracker` has been extended to allow the location of shuffle files to be updated with `updateMapOutput`. When a shuffle block is put, a block update message will be sent which triggers the `updateMapOutput`.

Instead of rejecting remote puts of shuffle blocks, `BlockManager` delegates the storage of shuffle blocks to its shuffle manager's resolver (if supported). A new, experimental trait is added for shuffle resolvers to indicate that they handle remote puts of blocks.

The existing block migration code is moved out into a separate file, and a producer/consumer model is introduced for migrating shuffle files from the host as quickly as possible while not overwhelming other executors.

### Why are the changes needed?

Recomputing shuffle blocks can be expensive; we should take advantage of our decommissioning time to migrate these blocks.

### Does this PR introduce any user-facing change?

This PR introduces two new configs parameters, `spark.storage.decommission.shuffleBlocks.enabled` & `spark.storage.decommission.rddBlocks.enabled` that control which blocks should be migrated during storage decommissioning.
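A hedged example of enabling both (key names as listed above; this assumes storage decommissioning itself is enabled in the deployment):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.storage.decommission.shuffleBlocks.enabled", "true")
  .set("spark.storage.decommission.rddBlocks.enabled", "true")
```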

### How was this patch tested?

New unit tests, plus an expansion of the Spark on K8s decommissioning test to assert that decommissioning with shuffle block migration means that the results are not recomputed even when the original executor is terminated.

This PR is a cleaned-up version of the previous WIP PR I made #28331 (thanks to attilapiros for his very helpful reviewing on it :)).

Closes #28708 from holdenk/SPARK-20629-copy-shuffle-data-when-nodes-are-being-shutdown-cleaned-up.

Lead-authored-by: Holden Karau <[email protected]>
Co-authored-by: Holden Karau <[email protected]>
Co-authored-by: “attilapiros” <[email protected]>
Co-authored-by: Attila Zsolt Piros <[email protected]>
Signed-off-by: Holden Karau <[email protected]>
…guide

### What changes were proposed in this pull request?

In http://spark.apache.org/docs/latest/sql-migration-guide.html#query-engine, there is an invalid reference to the datetime pattern page, "sql-ref-datetime-pattern.md". We should fix the link to http://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html.

![image](https://user-images.githubusercontent.com/1097932/87916920-fff57380-ca28-11ea-9028-99b9f9ebdfa4.png)

Also, it is nice to add a URL for [DateTimeFormatter](https://docs.oracle.com/javase/8/docs/api/java/time/format/DateTimeFormatter.html).

### Why are the changes needed?

Fix migration guide doc

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Build the doc in local env and check it:
![image](https://user-images.githubusercontent.com/1097932/87919723-13a2d900-ca2d-11ea-9923-a29b4cefaf3c.png)

Closes #29162 from gengliangwang/fixDoc.

Authored-by: Gengliang Wang <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
… Join/Partitions

### What changes were proposed in this pull request?

In #28733 and #28805, CNF conversion is used to push down disjunctive predicates through join and partitions pruning.

It's a good improvement. However, converting all the predicates into CNF can lead to a very long result, even with grouping functions over expressions. For example, the following predicate
```
(p0 = '1' AND p1 = '1') OR (p0 = '2' AND p1 = '2') OR (p0 = '3' AND p1 = '3') OR (p0 = '4' AND p1 = '4') OR (p0 = '5' AND p1 = '5') OR (p0 = '6' AND p1 = '6') OR (p0 = '7' AND p1 = '7') OR (p0 = '8' AND p1 = '8') OR (p0 = '9' AND p1 = '9') OR (p0 = '10' AND p1 = '10') OR (p0 = '11' AND p1 = '11') OR (p0 = '12' AND p1 = '12') OR (p0 = '13' AND p1 = '13') OR (p0 = '14' AND p1 = '14') OR (p0 = '15' AND p1 = '15') OR (p0 = '16' AND p1 = '16') OR (p0 = '17' AND p1 = '17') OR (p0 = '18' AND p1 = '18') OR (p0 = '19' AND p1 = '19') OR (p0 = '20' AND p1 = '20')
```
will be converted into a very long query (130K characters) against the Hive metastore, and there will be an error:
```
javax.jdo.JDOException: Exception thrown when executing query : SELECT DISTINCT 'org.apache.hadoop.hive.metastore.model.MPartition' AS NUCLEUS_TYPE,A0.CREATE_TIME,A0.LAST_ACCESS_TIME,A0.PART_NAME,A0.PART_ID,A0.PART_NAME AS NUCORDER0 FROM PARTITIONS A0 LEFT OUTER JOIN TBLS B0 ON A0.TBL_ID = B0.TBL_ID LEFT OUTER JOIN DBS C0 ON B0.DB_ID = C0.DB_ID WHERE B0.TBL_NAME = ? AND C0."NAME" = ? AND ((((((A0.PART_NAME LIKE '%/p1=1' ESCAPE '\' ) OR (A0.PART_NAME LIKE '%/p1=2' ESCAPE '\' )) OR (A0.PART_NAME LIKE '%/p1=3' ESCAPE '\' )) OR ((A0.PART_NAME LIKE '%/p1=4' ESCAPE '\' ) O ...
```

Essentially, we just need to traverse the predicate and extract the convertible sub-predicates, as we did in #24598. There is no need to maintain the CNF result set.

### Why are the changes needed?

A better implementation for pushing down disjunctive and complex predicates. The pushed-down predicates are always equal to or shorter than the CNF result.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit tests

Closes #29101 from gengliangwang/pushJoin.

Authored-by: Gengliang Wang <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
…or its output partitioning

### What changes were proposed in this pull request?

Currently, the `BroadcastHashJoinExec`'s `outputPartitioning` only uses the streamed side's `outputPartitioning`. However, if the join type of `BroadcastHashJoinExec` is an inner-like join, the build side's info (the join keys) can be added to `BroadcastHashJoinExec`'s `outputPartitioning`.

 For example,
```Scala
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "500")
val t1 = (0 until 100).map(i => (i % 5, i % 13)).toDF("i1", "j1")
val t2 = (0 until 100).map(i => (i % 5, i % 13)).toDF("i2", "j2")
val t3 = (0 until 20).map(i => (i % 7, i % 11)).toDF("i3", "j3")
val t4 = (0 until 100).map(i => (i % 5, i % 13)).toDF("i4", "j4")

// join1 is a sort merge join.
val join1 = t1.join(t2, t1("i1") === t2("i2"))

// join2 is a broadcast join where t3 is broadcasted.
val join2 = join1.join(t3, join1("i1") === t3("i3"))

// Join on the column from the broadcasted side (i3).
val join3 = join2.join(t4, join2("i3") === t4("i4"))

join3.explain
```
You see that `Exchange hashpartitioning(i2#103, 200)` is introduced because there is no output partitioning info from the build side.
```
== Physical Plan ==
*(6) SortMergeJoin [i3#29], [i4#40], Inner
:- *(4) Sort [i3#29 ASC NULLS FIRST], false, 0
:  +- Exchange hashpartitioning(i3#29, 200), true, [id=#55]
:     +- *(3) BroadcastHashJoin [i1#7], [i3#29], Inner, BuildRight
:        :- *(3) SortMergeJoin [i1#7], [i2#18], Inner
:        :  :- *(1) Sort [i1#7 ASC NULLS FIRST], false, 0
:        :  :  +- Exchange hashpartitioning(i1#7, 200), true, [id=#28]
:        :  :     +- LocalTableScan [i1#7, j1#8]
:        :  +- *(2) Sort [i2#18 ASC NULLS FIRST], false, 0
:        :     +- Exchange hashpartitioning(i2#18, 200), true, [id=#29]
:        :        +- LocalTableScan [i2#18, j2#19]
:        +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint))), [id=#34]
:           +- LocalTableScan [i3#29, j3#30]
+- *(5) Sort [i4#40 ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(i4#40, 200), true, [id=#39]
      +- LocalTableScan [i4#40, j4#41]
```
This PR proposes to introduce output partitioning for the build side for `BroadcastHashJoinExec` if the streamed side has a `HashPartitioning` or a collection of `HashPartitioning`s.

There is a new internal config, `spark.sql.execution.broadcastHashJoin.outputPartitioningExpandLimit`, which limits the number of partitionings a `HashPartitioning` can expand to. It can be set to "0" to disable this feature.

### Why are the changes needed?

To remove unnecessary shuffle.

### Does this PR introduce _any_ user-facing change?

Yes, now the shuffle in the above example can be eliminated:
```
== Physical Plan ==
*(5) SortMergeJoin [i3#108], [i4#119], Inner
:- *(3) Sort [i3#108 ASC NULLS FIRST], false, 0
:  +- *(3) BroadcastHashJoin [i1#86], [i3#108], Inner, BuildRight
:     :- *(3) SortMergeJoin [i1#86], [i2#97], Inner
:     :  :- *(1) Sort [i1#86 ASC NULLS FIRST], false, 0
:     :  :  +- Exchange hashpartitioning(i1#86, 200), true, [id=#120]
:     :  :     +- LocalTableScan [i1#86, j1#87]
:     :  +- *(2) Sort [i2#97 ASC NULLS FIRST], false, 0
:     :     +- Exchange hashpartitioning(i2#97, 200), true, [id=#121]
:     :        +- LocalTableScan [i2#97, j2#98]
:     +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint))), [id=#126]
:        +- LocalTableScan [i3#108, j3#109]
+- *(4) Sort [i4#119 ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(i4#119, 200), true, [id=#130]
      +- LocalTableScan [i4#119, j4#120]
```

### How was this patch tested?

Added new tests.

Closes #28676 from imback82/broadcast_join_output.

Authored-by: Terry Kim <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
### What changes were proposed in this pull request?

Currently `ShuffledHashJoin.outputPartitioning` inherits from `HashJoin.outputPartitioning`, which only preserves stream side partitioning (`HashJoin.scala`):

```
override def outputPartitioning: Partitioning = streamedPlan.outputPartitioning
```

This loses the build side's partitioning information, and causes an extra shuffle if there is another join / group-by after this join.

Example:

```
withSQLConf(
    SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "50",
    SQLConf.SHUFFLE_PARTITIONS.key -> "2",
    SQLConf.PREFER_SORTMERGEJOIN.key -> "false") {
  val df1 = spark.range(10).select($"id".as("k1"))
  val df2 = spark.range(30).select($"id".as("k2"))
  Seq("inner", "cross").foreach(joinType => {
    val plan = df1.join(df2, $"k1" === $"k2", joinType).groupBy($"k1").count()
      .queryExecution.executedPlan
    assert(plan.collect { case _: ShuffledHashJoinExec => true }.size === 1)
    // No extra shuffle before aggregate
    assert(plan.collect { case _: ShuffleExchangeExec => true }.size === 2)
  })
}
```

Current physical plan (having an extra shuffle on `k1` before aggregate)

```
*(4) HashAggregate(keys=[k1#220L], functions=[count(1)], output=[k1#220L, count#235L])
+- Exchange hashpartitioning(k1#220L, 2), true, [id=#117]
   +- *(3) HashAggregate(keys=[k1#220L], functions=[partial_count(1)], output=[k1#220L, count#239L])
      +- *(3) Project [k1#220L]
         +- ShuffledHashJoin [k1#220L], [k2#224L], Inner, BuildLeft
            :- Exchange hashpartitioning(k1#220L, 2), true, [id=#109]
            :  +- *(1) Project [id#218L AS k1#220L]
            :     +- *(1) Range (0, 10, step=1, splits=2)
            +- Exchange hashpartitioning(k2#224L, 2), true, [id=#111]
               +- *(2) Project [id#222L AS k2#224L]
                  +- *(2) Range (0, 30, step=1, splits=2)
```

Ideal physical plan (no shuffle on `k1` before aggregate)

```
*(3) HashAggregate(keys=[k1#220L], functions=[count(1)], output=[k1#220L, count#235L])
+- *(3) HashAggregate(keys=[k1#220L], functions=[partial_count(1)], output=[k1#220L, count#239L])
   +- *(3) Project [k1#220L]
      +- ShuffledHashJoin [k1#220L], [k2#224L], Inner, BuildLeft
         :- Exchange hashpartitioning(k1#220L, 2), true, [id=#107]
         :  +- *(1) Project [id#218L AS k1#220L]
         :     +- *(1) Range (0, 10, step=1, splits=2)
         +- Exchange hashpartitioning(k2#224L, 2), true, [id=#109]
            +- *(2) Project [id#222L AS k2#224L]
               +- *(2) Range (0, 30, step=1, splits=2)
```

This can be fixed by overriding the `outputPartitioning` method in `ShuffledHashJoinExec`, similar to `SortMergeJoinExec`.
In addition, this also fixes one typo in `HashJoin`, as that code path is shared between broadcast hash join and shuffled hash join.

### Why are the changes needed?

To avoid shuffle (for queries having multiple joins or group-by), for saving CPU and IO.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added unit test in `JoinSuite`.

Closes #29130 from c21/shj.

Authored-by: Cheng Su <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
…etesTestComponents

### What changes were proposed in this pull request?

Correct the spelling of parameter 'spark.executor.instances' in KubernetesTestComponents

### Why are the changes needed?

Parameter spelling error

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Test is not needed.

Closes #29164 from merrily01/SPARK-32367.

Authored-by: maruilei <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
…hould respect case insensitivity

### What changes were proposed in this pull request?

This PR proposes to make the datasource options at `PartitioningAwareFileIndex` respect case insensitivity consistently:
- `pathGlobFilter`
- `recursiveFileLookup`
- `basePath`

### Why are the changes needed?

To support consistent case insensitivity in datasource options.

### Does this PR introduce _any_ user-facing change?

Yes, now users can also use case insensitive options such as `PathglobFilter`.
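For instance, both of the following reads (paths are hypothetical) should now resolve to the same option:

```scala
val dfLower = spark.read.option("pathGlobFilter", "*.json").json("/tmp/data")
val dfMixed = spark.read.option("PathglobFilter", "*.json").json("/tmp/data")
```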

### How was this patch tested?

Unit tests were added. They reuse existing tests and add extra clues to make it easier to track down when a test is broken.

Closes #29165 from HyukjinKwon/SPARK-32368.

Authored-by: HyukjinKwon <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
LuciferYang and others added 29 commits August 2, 2020 16:46
### What changes were proposed in this pull request?
This PR aims to bring the bug fixes from the latest netty version.

### Why are the changes needed?
- 4.1.48.Final: [https://github.com/netty/netty/milestone/223?closed=1](https://github.com/netty/netty/milestone/223?closed=1) (14 patches or issues)
- 4.1.49.Final: [https://github.com/netty/netty/milestone/224?closed=1](https://github.com/netty/netty/milestone/224?closed=1) (48 patches or issues)
- 4.1.50.Final: [https://github.com/netty/netty/milestone/225?closed=1](https://github.com/netty/netty/milestone/225?closed=1) (38 patches or issues)
- 4.1.51.Final: [https://github.com/netty/netty/milestone/226?closed=1](https://github.com/netty/netty/milestone/226?closed=1) (53 patches or issues)

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass the Jenkins with the existing tests.

Closes #29299 from LuciferYang/upgrade-netty-version.

Authored-by: yangjie01 <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
### What changes were proposed in this pull request?

Add a config to let users change how SQL/DataFrame data is compressed when cached.

This adds a few new classes/APIs for use with this config.

1. `CachedBatch` is a trait used to tag data that is intended to be cached. It has a few APIs that let us keep the compression/serialization of the data separate from the metrics about it.
2. `CachedBatchSerializer` provides the APIs that must be implemented to cache data.
   * `convertForCache` is an API that runs a cached Spark plan and turns its result into an `RDD[CachedBatch]`. The actual caching is done outside of this API.
   * `buildFilter` is an API that takes a set of predicates and builds a filter function that can be used to filter the `RDD[CachedBatch]` returned by `convertForCache`.
   * `decompressColumnar` decompresses an `RDD[CachedBatch]` into an `RDD[ColumnarBatch]`. This is only used for a limited set of data types. These data types may expand in the future; if they do, we can add a new API with a default value that says which data types this serializer supports.
   * `decompressToRows` decompresses an `RDD[CachedBatch]` into an `RDD[InternalRow]`. This API, like `decompressColumnar`, decompresses the data in `CachedBatch` but turns it into `InternalRow`s, typically using code generation for performance reasons.

There is also an API that lets you reuse the current filtering based on min/max values: `SimpleMetricsCachedBatch` and `SimpleMetricsCachedBatchSerializer`.
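Roughly, the shape of the serializer interface described above could be sketched as follows; the method names come from the description, but the signatures here are hypothetical, so refer to the PR for the actual developer API:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.execution.SparkPlan
import org.apache.spark.sql.sources.Filter
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.vectorized.ColumnarBatch

// Tags a unit of cached data; keeps metrics separate from the actual encoding.
trait CachedBatch {
  def numRows: Int
  def sizeInBytes: Long
}

// The pluggable serializer (hypothetical signatures).
trait CachedBatchSerializer extends Serializable {
  // Run the plan to be cached and produce the cached representation.
  def convertForCache(plan: SparkPlan): RDD[CachedBatch]
  // Build a filter over cached batches from the pushed-down predicates.
  def buildFilter(predicates: Seq[Filter], schema: StructType): (Int, Iterator[CachedBatch]) => Iterator[CachedBatch]
  // Decompress back to columnar batches (limited set of data types).
  def decompressColumnar(input: RDD[CachedBatch], schema: StructType): RDD[ColumnarBatch]
  // Decompress back to rows, typically via code generation.
  def decompressToRows(input: RDD[CachedBatch], schema: StructType): RDD[InternalRow]
}
```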

### Why are the changes needed?

This lets users explore different types of compression and compression ratios.

### Does this PR introduce _any_ user-facing change?

This adds in a single config, and exposes some developer API classes described above.

### How was this patch tested?

I ran the unit tests around this and I also did some manual performance tests. I could not find any performance difference between the old and new code, and if there is any, it is within error.

Closes #29067 from revans2/pluggable_cache_serializer.

Authored-by: Robert (Bobby) Evans <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
…atasource

### What changes were proposed in this pull request?
Check that there are no duplicate column names at the same level (top level or nested levels) when reading from the JDBC datasource. If such duplicate columns exist, throw the exception:
```
org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the customSchema option value:
```
The check takes into account the SQL config `spark.sql.caseSensitive` (`false` by default).

### Why are the changes needed?
To make handling of duplicate nested columns similar to handling of duplicate top-level columns, i.e. output the same error:
```Scala
org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the customSchema option value: `camelcase`
```

Checking of top-level duplicates was introduced by #17758, and duplicates in nested structures by #29234.
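For example, a read like the following (connection details and schema are hypothetical) should now fail with the error above when the nested field names collide under the active case-sensitivity setting:

```scala
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:h2:mem:testdb")
  .option("dbtable", "people")
  // `camelcase` and `CamelCase` collide when spark.sql.caseSensitive=false.
  .option("customSchema", "st STRUCT<camelcase: STRING, CamelCase: INT>")
  .load()
```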

### Does this PR introduce _any_ user-facing change?
Yes.

### How was this patch tested?
Added new test suite `JdbcNestedDataSourceSuite`.

Closes #29317 from MaxGekk/jdbc-dup-nested-columns.

Authored-by: Max Gekk <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
### What changes were proposed in this pull request?
This PR fixes issues related to the canonicalization of `FileSourceScanExec` when it contains an unused DPP filter.

### Why are the changes needed?

As part of the PlanDynamicPruningFilter rule, the unused DPP filters are simply replaced by `DynamicPruningExpression(TrueLiteral)` so that they can be avoided. But these unnecessary `DynamicPruningExpression(TrueLiteral)` partition filters inside the `FileSourceScanExec` affect the canonicalization of the node, and so in many cases this can prevent ReuseExchange from happening.

This PR fixes this issue by ignoring the unused DPP filter in the `def doCanonicalize` method.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added UT.

Closes #29318 from prakharjain09/SPARK-32509_df_reuse.

Authored-by: Prakhar Jain <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
### What changes were proposed in this pull request?
`regexp_extract_all` is a very useful function that expands the capabilities of `regexp_extract`.
Here are some examples of this function:
```
SELECT regexp_extract('1a 2b 14m', '\d+', 0); -- 1
SELECT regexp_extract_all('1a 2b 14m', '\d+', 0); -- [1, 2, 14]
SELECT regexp_extract('1a 2b 14m', '(\d+)([a-z]+)', 2); -- 'a'
SELECT regexp_extract_all('1a 2b 14m', '(\d+)([a-z]+)', 2); -- ['a', 'b', 'm']
```
Some mainstream databases support this syntax.
**Presto:**
https://prestodb.io/docs/current/functions/regexp.html

**Pig:**
https://pig.apache.org/docs/latest/api/org/apache/pig/builtin/REGEX_EXTRACT_ALL.html

Note: This PR picks up the work of #21985.

### Why are the changes needed?
`regexp_extract_all` is a very useful function and makes work easier.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
New UT

Closes #27507 from beliefer/support-regexp_extract_all.

Lead-authored-by: beliefer <[email protected]>
Co-authored-by: gengjiaan <[email protected]>
Co-authored-by: Jiaan Geng <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
…it is a relative path

### What changes were proposed in this pull request?

Currently, the user home directory is used as the base path for database and table locations when their locations are specified with relative paths, e.g.
```sql
> set spark.sql.warehouse.dir;
spark.sql.warehouse.dir	file:/Users/kentyao/Downloads/spark/spark-3.1.0-SNAPSHOT-bin-20200512/spark-warehouse/
spark-sql> create database loctest location 'loctestdbdir';

spark-sql> desc database loctest;
Database Name	loctest
Comment
Location	file:/Users/kentyao/Downloads/spark/spark-3.1.0-SNAPSHOT-bin-20200512/loctestdbdir
Owner	kentyao

spark-sql> create table loctest(id int) location 'loctestdbdir';
spark-sql> desc formatted loctest;
id	int	NULL

# Detailed Table Information
Database	default
Table	loctest
Owner	kentyao
Created Time	Thu May 14 16:29:05 CST 2020
Last Access	UNKNOWN
Created By	Spark 3.1.0-SNAPSHOT
Type	EXTERNAL
Provider	parquet
Location	file:/Users/kentyao/Downloads/spark/spark-3.1.0-SNAPSHOT-bin-20200512/loctestdbdir
Serde Library	org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
InputFormat	org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
OutputFormat	org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
```
The user home directory is not always warehouse-related, cannot be changed at runtime, and is shared by both databases and tables as the parent directory. Meanwhile, we use the table path as the parent directory for relative partition locations.

The config `spark.sql.warehouse.dir` represents `the default location for managed databases and tables`.
For databases, the case above seems not to follow its semantics, because it should use `spark.sql.warehouse.dir` as the base path instead.

For tables, it seems to be right, but here I suggest enriching the meaning so that it also applies to external tables with relative paths for their locations.

With changes in this PR,

The location of a database will be `warehouseDir/dbpath` when `dbpath` is relative.
The location of a table will be `dbpath/tblpath` when `tblpath` is relative.

### Why are the changes needed?

bugfix and improvement

Firstly, the databases with relative locations should be created under the default location specified by `spark.sql.warehouse.dir`.

Secondly, the external tables with relative paths may also follow this behavior for consistency.

Finally, the behavior for databases, tables, and partitions with relative paths when choosing base paths should be the same.

### Does this PR introduce _any_ user-facing change?

Yes, this PR changes the `createDatabase`, `alterDatabase`, `createTable` and `alterTable` APIs and related DDLs. If the LOCATION clause is followed by a relative path, the root path will be `spark.sql.warehouse.dir` for databases, and `spark.sql.warehouse.dir` / `dbPath` for tables.

e.g.

#### after
```sql
spark-sql> desc database loctest;
Database Name	loctest
Comment
Location	file:/Users/kentyao/Downloads/spark/spark-3.1.0-SNAPSHOT-bin-SPARK-31709/spark-warehouse/loctest
Owner	kentyao
spark-sql> use loctest;
spark-sql> create table loctest(id int) location 'loctest';
20/05/14 18:18:02 WARN InMemoryFileIndex: The directory file:/Users/kentyao/Downloads/spark/spark-3.1.0-SNAPSHOT-bin-SPARK-31709/loctest was not found. Was it deleted very recently?
20/05/14 18:18:02 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.
20/05/14 18:18:03 WARN HiveConf: HiveConf of name hive.internal.ss.authz.settings.applied.marker does not exist
20/05/14 18:18:03 WARN HiveConf: HiveConf of name hive.stats.jdbc.timeout does not exist
20/05/14 18:18:03 WARN HiveConf: HiveConf of name hive.stats.retries.wait does not exist
spark-sql> desc formatted loctest;
id	int	NULL

# Detailed Table Information
Database	loctest
Table	loctest
Owner	kentyao
Created Time	Thu May 14 18:18:03 CST 2020
Last Access	UNKNOWN
Created By	Spark 3.1.0-SNAPSHOT
Type	EXTERNAL
Provider	parquet
Location	file:/Users/kentyao/Downloads/spark/spark-3.1.0-SNAPSHOT-bin-SPARK-31709/spark-warehouse/loctest/loctest
Serde Library	org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
InputFormat	org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
OutputFormat	org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
spark-sql> alter table loctest set location 'loctest2'
         > ;
spark-sql> desc formatted loctest;
id	int	NULL

# Detailed Table Information
Database	loctest
Table	loctest
Owner	kentyao
Created Time	Thu May 14 18:18:03 CST 2020
Last Access	UNKNOWN
Created By	Spark 3.1.0-SNAPSHOT
Type	EXTERNAL
Provider	parquet
Location	file:/Users/kentyao/Downloads/spark/spark-3.1.0-SNAPSHOT-bin-SPARK-31709/spark-warehouse/loctest/loctest2
Serde Library	org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
InputFormat	org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
OutputFormat	org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
```
### How was this patch tested?

Add unit tests.

Closes #28527 from yaooqinn/SPARK-31709.

Authored-by: Kent Yao <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
…E /DECIMAL_DIGITS/NUM_PREC_RADIX/ORDINAL_POSITION for thriftserver client tools

### What changes were proposed in this pull request?

This PR fills in some missing fields for SparkGetColumnsOperation, including COLUMN_SIZE, DECIMAL_DIGITS, NUM_PREC_RADIX, and ORDINAL_POSITION,

and improves the test coverage.

### Why are the changes needed?

Make JDBC client tools happier.
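
As a hedged illustration (the JDBC URL, schema, and table name below are made up), these are the standard `DatabaseMetaData.getColumns` fields that client tools read:

```scala
import java.sql.DriverManager

// Connect to a thrift server and list a table's columns, printing the metadata
// fields this PR fills in.
val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default")
val cols = conn.getMetaData.getColumns(null, "default", "my_table", "%")
while (cols.next()) {
  println(Seq(
    cols.getString("COLUMN_NAME"),
    cols.getInt("COLUMN_SIZE"),
    cols.getInt("DECIMAL_DIGITS"),
    cols.getInt("NUM_PREC_RADIX"),
    cols.getInt("ORDINAL_POSITION")).mkString("\t"))
}
conn.close()
```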

### Does this PR introduce _any_ user-facing change?

yes,

#### before
![image](https://user-images.githubusercontent.com/8326978/88911764-e78b2180-d290-11ea-8abb-96f137f9c3c4.png)

#### after

![image](https://user-images.githubusercontent.com/8326978/88911709-d04c3400-d290-11ea-90ab-02bda3e628e9.png)

![image](https://user-images.githubusercontent.com/8326978/88912007-39cc4280-d291-11ea-96d6-1ef3abbbddec.png)

### How was this patch tested?

add unit tests

Closes #29303 from yaooqinn/SPARK-32492.

Authored-by: Kent Yao <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
…ESET command

### What changes were proposed in this pull request?

This PR modified the parser code to handle invalid usages of a SET/RESET command.
For example;
```
SET spark.sql.ansi.enabled true
```
The above SQL command does not change the configuration value; it just tries to display the value of a configuration named
`spark.sql.ansi.enabled true`. This PR disallows using special characters, including spaces, in the configuration name and reports a user-friendly error instead. The error message tells users a workaround: use quotes or a string literal if they still need to specify a configuration key containing such characters.

Before this PR:
```
scala> sql("SET spark.sql.ansi.enabled true").show(1, -1)
+---------------------------+-----------+
|key                        |value      |
+---------------------------+-----------+
|spark.sql.ansi.enabled true|<undefined>|
+---------------------------+-----------+
```

After this PR:
```
scala> sql("SET spark.sql.ansi.enabled true")
org.apache.spark.sql.catalyst.parser.ParseException:
Expected format is 'SET', 'SET key', or 'SET key=value'. If you want to include special characters in key, please use quotes, e.g., SET `ke y`=value.(line 1, pos 0)

== SQL ==
SET spark.sql.ansi.enabled true
^^^
```
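
For reference, a hedged sketch of the forms that remain valid (assuming a spark-shell session; the key with spaces is a hypothetical example of the quoting workaround from the error message):

```scala
sql("SET spark.sql.ansi.enabled=true")  // set a config value
sql("SET spark.sql.ansi.enabled")       // display a single config value
sql("SET `a key with spaces`=v")        // special characters must be back-quoted
```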

### Why are the changes needed?

For better user-friendly errors.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added tests in `SparkSqlParserSuite`.

Closes #29146 from maropu/SPARK-32257.

Lead-authored-by: Takeshi Yamamuro <[email protected]>
Co-authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
…and tuning

### What changes were proposed in this pull request?
Set default param values in the shared Params traits for the feature and tuning packages, in both Scala and Python.

### Why are the changes needed?
Make ML use the same default param values between an estimator and its corresponding transformer, and also between Scala and Python.
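
A minimal sketch of the pattern, assuming a hypothetical shared param trait (not one of the actual traits touched by this PR):

```scala
import org.apache.spark.ml.param.{Param, Params}

// Defining the default once in the shared trait keeps the estimator and its
// corresponding model/transformer (and the Python wrapper mirroring it) in sync.
trait HasHandleInvalid extends Params {
  val handleInvalid: Param[String] =
    new Param[String](this, "handleInvalid", "how to handle invalid entries")
  setDefault(handleInvalid -> "error")
  def getHandleInvalid: String = $(handleInvalid)
}
```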

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing and modified tests

Closes #29153 from huaxingao/default2.

Authored-by: Huaxin Gao <[email protected]>
Signed-off-by: Huaxin Gao <[email protected]>
…l.optimizeNullAwareAntiJoin`

### What changes were proposed in this pull request?
Add the version `3.1.0` for the SQL config `spark.sql.optimizeNullAwareAntiJoin`.

### Why are the changes needed?
To inform users when the config was added, for example on the page http://spark.apache.org/docs/latest/configuration.html.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By compiling and running `./dev/scalastyle`.

Closes #29335 from MaxGekk/leanken-SPARK-32290-followup.

Authored-by: Max Gekk <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
### What changes were proposed in this pull request?
Change _To restore the behavior before Spark **3.0**_ to _To restore the behavior before Spark **3.1**_ in the SQL migration guide entries that describe behavior changes introduced in the new version 3.1.

### Why are the changes needed?
To have correct info in the SQL migration guide.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
N/A

Closes #29336 from MaxGekk/fix-version-in-sql-migration.

Authored-by: Max Gekk <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
…ch allow/disallow SparkContext in executors

### What changes were proposed in this pull request?

This is a follow-up of #29278.
This PR changes the name of the config that switches whether `SparkContext` is allowed in executors, as per the comment on #29278 (review).

### Why are the changes needed?

The config name `spark.executor.allowSparkContext` is more reasonable.
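
A hedged usage sketch of the renamed flag (the builder settings here are illustrative):

```scala
import org.apache.spark.sql.SparkSession

// Explicitly opt in to allowing SparkContext creation in executors;
// by default this is disallowed.
val spark = SparkSession.builder()
  .master("local[2]")
  .config("spark.executor.allowSparkContext", "true")
  .getOrCreate()
```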

### Does this PR introduce _any_ user-facing change?

Yes, the config name is changed.

### How was this patch tested?

Updated tests.

Closes #29340 from ueshin/issues/SPARK-32160/change_config_name.

Authored-by: Takuya UESHIN <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
…of DISTINCT

### What changes were proposed in this pull request?
This PR is related to #26656.
#26656 only supports the FILTER clause on aggregate expressions without DISTINCT.
This PR enhances that feature so that one or more DISTINCT aggregate expressions can also use the FILTER clause.
Such as:
```
select sum(distinct id) filter (where sex = 'man') from student;
select class_id, sum(distinct id) filter (where sex = 'man') from student group by class_id;
select count(id) filter (where class_id = 1), sum(distinct id) filter (where sex = 'man') from student;
select class_id, count(id) filter (where class_id = 1), sum(distinct id) filter (where sex = 'man') from student group by class_id;
select sum(distinct id), sum(distinct id) filter (where sex = 'man') from student;
select class_id, sum(distinct id), sum(distinct id) filter (where sex = 'man') from student group by class_id;
select class_id, count(id), count(id) filter (where class_id = 1), sum(distinct id), sum(distinct id) filter (where sex = 'man') from student group by class_id;
```

### Why are the changes needed?
Spark SQL only supports the FILTER clause on aggregate expressions without DISTINCT.
This PR allows the FILTER clause to be used together with DISTINCT.

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
Existing and new unit tests.

Closes #29291 from beliefer/support-distinct-with-filter.

Lead-authored-by: gengjiaan <[email protected]>
Co-authored-by: beliefer <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
…s well when specifying "request.timeout.ms" on replacing "default.api.timeout.ms"

### What changes were proposed in this pull request?

This patch is a follow-up that fills a gap in #29272, which missed providing `default.api.timeout.ms` as well. #29272 unintentionally changed the behavior of the Kafka-side timeout, which is incompatible with the test timeout (`default.api.timeout.ms` gets its default value of 60 seconds, longer than the test timeout).
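
A hedged sketch of the idea on the test side (the timeout values are illustrative): both options are shortened together so the Kafka-side timeout stays below the test timeout.

```scala
// Passing only request.timeout.ms is not enough, because default.api.timeout.ms
// (60 seconds by default) still governs the relevant Kafka API calls.
val testKafkaParams: Map[String, String] = Map(
  "request.timeout.ms"     -> "3000",
  "default.api.timeout.ms" -> "3000"
)
```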

### Why are the changes needed?

We realized the PR for SPARK-32468 (#29272) doesn't work as we expect. See #29272 (comment) for more details.

### Does this PR introduce _any_ user-facing change?

No, as it only touches the tests.

### How was this patch tested?

Will trigger builds from Jenkins or GitHub Actions multiple times and confirm.

Closes #29343 from HeartSaVioR/SPARK-32468-FOLLOWUP.

Authored-by: Jungtaek Lim (HeartSaVioR) <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
…InMemoryRelation.ser

### What changes were proposed in this pull request?

This PR aims to clean up `InMemoryRelation.ser` in `CachedBatchSerializerSuite`.

### Why are the changes needed?

SPARK-32274 makes SQL cache serialization pluggable.
```
[SPARK-32274][SQL] Make SQL cache serialization pluggable
```

This causes UT failures.
```
$ build/sbt "sql/testOnly *.CachedBatchSerializerSuite *.CachedTableSuite"
...
[info]   Cause: java.lang.IllegalStateException: This does not work. This is only for testing
[info]   at org.apache.spark.sql.execution.columnar.TestSingleIntColumnarCachedBatchSerializer.convertInternalRowToCachedBatch(CachedBatchSerializerSuite.scala:49)
...
[info] *** 30 TESTS FAILED ***
[error] Failed: Total 51, Failed 30, Errors 0, Passed 21
[error] Failed tests:
[error] 	org.apache.spark.sql.CachedTableSuite
[error] (sql/test:testOnly) sbt.TestsFailedException: Tests unsuccessful
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manually.
```
$ build/sbt "sql/testOnly *.CachedBatchSerializerSuite *.CachedTableSuite"
[info] Tests: succeeded 51, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
[info] Passed: Total 51, Failed 0, Errors 0, Passed 51
```

Closes #29346 from dongjoon-hyun/SPARK-32524-3.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
### What changes were proposed in this pull request?

Make WithFields Expression not foldable.

### Why are the changes needed?

The following query currently fails on the master branch:
```
sql("SELECT named_struct('a', 1, 'b', 2) a")
.select($"a".withField("c", lit(3)).as("a"))
.show(false)
// java.lang.UnsupportedOperationException: Cannot evaluate expression: with_fields(named_struct(a, 1, b, 2), c, 3)
```
This happens because the Catalyst optimizer sees that the WithFields Expression is foldable and tries to statically evaluate it (via the ConstantFolding rule); however, it cannot do so because the WithFields Expression is Unevaluable.
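
A self-contained toy sketch of that mechanism (not Spark's actual Catalyst classes): a folding pass eagerly evaluates anything marked foldable, so a node that cannot be evaluated must not claim foldability.

```scala
sealed trait Expr { def foldable: Boolean; def eval(): Any }

case class Literal(value: Any) extends Expr {
  val foldable = true
  def eval(): Any = value
}

case class WithFieldsLike(children: Seq[Expr]) extends Expr {
  // Before the fix, the node derived foldability from its children (all literals
  // => foldable), yet eval() is unsupported, so folding threw at optimization time.
  val foldable = false
  def eval(): Any =
    throw new UnsupportedOperationException("Cannot evaluate expression: with_fields(...)")
}

// The folding pass only touches nodes that report foldable = true.
def constantFold(e: Expr): Expr = if (e.foldable) Literal(e.eval()) else e
```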

### Does this PR introduce _any_ user-facing change?

Yes, queries like the one shared above will now succeed.
That said, this bug was introduced in Spark 3.1.0 which has yet to be released.

### How was this patch tested?

A new unit test was added.

Closes #29338 from fqaiser94/SPARK-32521.

Lead-authored-by: [email protected] <[email protected]>
Co-authored-by: fqaiser94 <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
… API

### What changes were proposed in this pull request?

Note that this PR is forked from #23340 originally written by edwinalu.

This PR proposes to expose the peak executor metrics at the stage level via the REST APIs:
* `/applications/<application_id>/stages/`: peak values of executor metrics for **each stage**
* `/applications/<application_id>/stages/<stage_id>/<stage_attempt_id>`: peak values of executor metrics for **each executor** for the stage, followed by peak values of executor metrics for the stage

### Why are the changes needed?

The stage level peak executor metrics can help better understand your application's resource utilization.

### Does this PR introduce _any_ user-facing change?

1. For the `/applications/<application_id>/stages/` API, you will see the following new info for **each stage**:
```JSON
  "peakExecutorMetrics" : {
    "JVMHeapMemory" : 213367864,
    "JVMOffHeapMemory" : 189011656,
    "OnHeapExecutionMemory" : 0,
    "OffHeapExecutionMemory" : 0,
    "OnHeapStorageMemory" : 2133349,
    "OffHeapStorageMemory" : 0,
    "OnHeapUnifiedMemory" : 2133349,
    "OffHeapUnifiedMemory" : 0,
    "DirectPoolMemory" : 282024,
    "MappedPoolMemory" : 0,
    "ProcessTreeJVMVMemory" : 0,
    "ProcessTreeJVMRSSMemory" : 0,
    "ProcessTreePythonVMemory" : 0,
    "ProcessTreePythonRSSMemory" : 0,
    "ProcessTreeOtherVMemory" : 0,
    "ProcessTreeOtherRSSMemory" : 0,
    "MinorGCCount" : 13,
    "MinorGCTime" : 115,
    "MajorGCCount" : 4,
    "MajorGCTime" : 339
  }
```

2. For the `/applications/<application_id>/stages/<stage_id>/<stage_attempt_id>` API, you will see the following new info for **each executor** under `executorSummary`:
```JSON
  "peakMemoryMetrics" : {
    "JVMHeapMemory" : 0,
    "JVMOffHeapMemory" : 0,
    "OnHeapExecutionMemory" : 0,
    "OffHeapExecutionMemory" : 0,
    "OnHeapStorageMemory" : 0,
    "OffHeapStorageMemory" : 0,
    "OnHeapUnifiedMemory" : 0,
    "OffHeapUnifiedMemory" : 0,
    "DirectPoolMemory" : 0,
    "MappedPoolMemory" : 0,
    "ProcessTreeJVMVMemory" : 0,
    "ProcessTreeJVMRSSMemory" : 0,
    "ProcessTreePythonVMemory" : 0,
    "ProcessTreePythonRSSMemory" : 0,
    "ProcessTreeOtherVMemory" : 0,
    "ProcessTreeOtherRSSMemory" : 0,
    "MinorGCCount" : 0,
    "MinorGCTime" : 0,
    "MajorGCCount" : 0,
    "MajorGCTime" : 0
  }
```
, and the following at the stage level:
```JSON
"peakExecutorMetrics" : {
    "JVMHeapMemory" : 213367864,
    "JVMOffHeapMemory" : 189011656,
    "OnHeapExecutionMemory" : 0,
    "OffHeapExecutionMemory" : 0,
    "OnHeapStorageMemory" : 2133349,
    "OffHeapStorageMemory" : 0,
    "OnHeapUnifiedMemory" : 2133349,
    "OffHeapUnifiedMemory" : 0,
    "DirectPoolMemory" : 282024,
    "MappedPoolMemory" : 0,
    "ProcessTreeJVMVMemory" : 0,
    "ProcessTreeJVMRSSMemory" : 0,
    "ProcessTreePythonVMemory" : 0,
    "ProcessTreePythonRSSMemory" : 0,
    "ProcessTreeOtherVMemory" : 0,
    "ProcessTreeOtherRSSMemory" : 0,
    "MinorGCCount" : 13,
    "MinorGCTime" : 115,
    "MajorGCCount" : 4,
    "MajorGCTime" : 339
  }
```

### How was this patch tested?

Added tests.

Closes #29020 from imback82/metrics.

Lead-authored-by: Terry Kim <[email protected]>
Co-authored-by: edwinalu <[email protected]>
Signed-off-by: Gengliang Wang <[email protected]>
### What changes were proposed in this pull request?
Change casting of map and struct values to strings by using `{}` brackets instead of `[]`. The behavior is controlled by the SQL config `spark.sql.legacy.castComplexTypesToString.enabled`. When it is `true`, `CAST` wraps maps and structs with `[]` when casting to strings; when it is `false` (the default), maps and structs are wrapped with `{}`.
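
A hedged sketch of the visible difference with the default setting, assuming a spark-shell session (output layout approximate):

```scala
sql("SELECT named_struct('a', 1, 'b', 2) AS s, map('k', 1) AS m").show()
// +------+--------+
// |     s|       m|
// +------+--------+
// |{1, 2}|{k -> 1}|
// +------+--------+
// With spark.sql.legacy.castComplexTypesToString.enabled=true, the old
// bracket form ([1, 2], [k -> 1]) is kept.
```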

### Why are the changes needed?
- To distinguish structs/maps from arrays.
- To make `show`'s output consistent with Hive and conversions to Hive strings.
- To display dataframe content in the same form by `spark-sql` and `show`
- To be consistent with the `*.sql` tests

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
By existing test suite `CastSuite`.

Closes #29308 from MaxGekk/show-struct-map.

Authored-by: Max Gekk <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
### What changes were proposed in this pull request?

This PR fixes the layout of monitoring.html broken after SPARK-31566(#28354).
The cause is that two `<td>` tags are not closed in `monitoring.md`.

### Why are the changes needed?

This is a bug.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Build docs and the following screenshots are before/after.

* Before fixed
![broken-doc](https://user-images.githubusercontent.com/4736016/89257873-fba09b80-d661-11ea-90da-06cbc0783011.png)

* After fixed.
![fixed-doc2](https://user-images.githubusercontent.com/4736016/89257910-0fe49880-d662-11ea-9a85-7a1ecb1d38d6.png)

Of course, the table is still rendered correctly.
![fixed-doc1](https://user-images.githubusercontent.com/4736016/89257948-225ed200-d662-11ea-80fd-d9254b44d4a0.png)

Closes #29345 from sarutak/fix-monitoring.md.

Authored-by: Kousuke Saruta <[email protected]>
Signed-off-by: Gengliang Wang <[email protected]>
### What changes were proposed in this pull request?

This PR proposes to write the main page of PySpark documentation. The base work is finished at #29188.

### Why are the changes needed?

For better usability and readability in PySpark documentation.

### Does this PR introduce _any_ user-facing change?

Yes, it creates a new main page as below:

![Screen Shot 2020-07-31 at 10 02 44 PM](https://user-images.githubusercontent.com/6477701/89037618-d2d68880-d379-11ea-9a44-562f2aa0e3fd.png)

### How was this patch tested?

Manually built the PySpark documentation.

```bash
cd python
make clean html
```

Closes #29320 from HyukjinKwon/SPARK-32507.

Authored-by: HyukjinKwon <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
### What changes were proposed in this pull request?
Implement ALTER TABLE in JDBC Table Catalog
The following ALTER TABLE are implemented:
```
ALTER TABLE table_name ADD COLUMNS ( column_name datatype [ , ... ] );
ALTER TABLE table_name RENAME COLUMN old_column_name TO new_column_name;
ALTER TABLE table_name DROP COLUMN column_name;
ALTER TABLE table_name ALTER COLUMN column_name TYPE new_type;
ALTER TABLE table_name ALTER COLUMN column_name SET NOT NULL;
```
I haven't checked the ALTER TABLE syntax for all the databases yet; I will. If the syntax differs, I will have a follow-up to override it in the dialect.

It seems most databases don't support updating comments or column positions, so I didn't implement UpdateColumnComment or UpdateColumnPosition.
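
A hedged sketch of how a v2 catalog such as this one receives ALTER TABLE operations as `TableChange` objects (the namespace, table, and column names are illustrative):

```scala
import org.apache.spark.sql.connector.catalog.{Identifier, TableCatalog, TableChange}
import org.apache.spark.sql.types.StringType

// Add a column and then rename it through the catalog API, which the JDBC
// catalog translates into the corresponding ALTER TABLE statements.
def addAndRename(catalog: TableCatalog): Unit = {
  val ident = Identifier.of(Array("test_db"), "people")
  catalog.alterTable(ident, TableChange.addColumn(Array("nickname"), StringType))
  catalog.alterTable(ident, TableChange.renameColumn(Array("nickname"), "alias"))
}
```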

### Why are the changes needed?
Complete the JDBCTableCatalog implementation

### Does this PR introduce _any_ user-facing change?
Yes
`JDBCTableCatalog.alterTable`

### How was this patch tested?
add new tests

Closes #29324 from huaxingao/alter_table.

Authored-by: Huaxin Gao <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
… while casting to strings

### What changes were proposed in this pull request?
Convert `NULL` elements of maps, structs and arrays to the `"null"` string while converting map/struct/array values to strings. The SQL config `spark.sql.legacy.omitNestedNullInCast.enabled` controls the behaviour: when it is `true`, `NULL` elements of structs/maps/arrays are omitted; when it is `false`, `NULL` elements are converted to `"null"`.

### Why are the changes needed?
1. It is impossible to distinguish an empty string from null, for instance:
```scala
scala> Seq(Seq(""), Seq(null)).toDF().show
+-----+
|value|
+-----+
|   []|
|   []|
+-----+
```
2. Inconsistent NULL conversions for top-level values and nested columns, for instance:
```scala
scala> sql("select named_struct('c', null), null").show
+---------------------+----+
|named_struct(c, NULL)|NULL|
+---------------------+----+
|                   []|null|
+---------------------+----+
```
3. `.show()` is different from conversions to Hive strings, and as a consequence its output is different from `spark-sql` (sql tests):
```sql
spark-sql> select named_struct('c', null) as struct;
{"c":null}
```
```scala
scala> sql("select named_struct('c', null) as struct").show
+------+
|struct|
+------+
|    []|
+------+
```

4. It is impossible to distinguish empty struct/array from struct/array with null in the current implementation:
```scala
scala> Seq[Seq[String]](Seq(), Seq(null)).toDF.show()
+-----+
|value|
+-----+
|   []|
|   []|
+-----+
```

### Does this PR introduce _any_ user-facing change?
Yes, before:
```scala
scala> Seq(Seq(""), Seq(null)).toDF().show
+-----+
|value|
+-----+
|   []|
|   []|
+-----+
```

After:
```scala
scala> Seq(Seq(""), Seq(null)).toDF().show
+------+
| value|
+------+
|    []|
|[null]|
+------+
```

### How was this patch tested?
By existing test suite `CastSuite`.

Closes #29311 from MaxGekk/nested-null-to-string.

Authored-by: Max Gekk <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
### What changes were proposed in this pull request?

The newly added test fails Jenkins maven jobs, see #29303 (comment)

We move the test from `ThriftServerWithSparkContextSuite` to `SparkMetadataOperationSuite`. The former uses an embedded thrift server, where the server and the client live in the same JVM process; the latter forks a new process to start the server, so the server and client are isolated.
The sbt runner seems to be fine with the test in `ThriftServerWithSparkContextSuite`, but the Maven runner with the `scalatest` plugin faces a classloader issue: we switch the classloader to the one in the `sharedState`, which is not the one Hive uses to load some classes. This looks more like an issue belonging to the Maven runner or `scalatest`.
So in this PR, we simply move the test to bypass the issue.

BTW, we should still exercise the embedded thrift server path to verify whether this is just a Maven issue, as there could be some use cases for this API.

### Why are the changes needed?

Jenkins recovery

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

Modified unit tests.

Closes #29347 from yaooqinn/SPARK-32492-F.

Authored-by: Kent Yao <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
…ryComparatorSuite

### What changes were proposed in this pull request?
PR #26548 means that RecordBinaryComparator now uses big endian
byte order for long comparisons. However, this means that some of
the constants in the regression tests no longer map to the same
values in the comparison that they used to.

For example, one of the tests does a comparison between
Long.MIN_VALUE and 1 in order to trigger an overflow condition that
existed in the past (i.e. Long.MIN_VALUE - 1). These constants
correspond to the values 0x80..00 and 0x00..01. However on a
little-endian machine the bytes in these values are now swapped
before they are compared. This means that we will now be comparing
0x00..80 with 0x01..00. 0x00..80 - 0x01..00 does not overflow
therefore missing the original purpose of the test.

To fix this the constants are now explicitly written out in big
endian byte order to match the byte order used in the comparison.
This also fixes the tests on big endian machines (which would
otherwise get a different comparison result to the little-endian
machines).
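
A small sketch of the effect described above (standard JDK calls only; the values match the example in the text):

```scala
import java.lang.Long.reverseBytes

val a = Long.MinValue          // 0x8000000000000000
val b = 1L                     // 0x0000000000000001
// What a little-endian byte-order read effectively compares instead:
val aSwapped = reverseBytes(a) // 0x0000000000000080
val bSwapped = reverseBytes(b) // 0x0100000000000000
// aSwapped - bSwapped no longer overflows, so the original test intent is lost.
```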

### Why are the changes needed?
The regression tests no longer serve their initial purposes and also fail on big-endian systems.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Tests run on big-endian system (s390x).

Closes #29259 from mundaym/fix-endian.

Authored-by: Michael Munday <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
### What changes were proposed in this pull request?

SparkR increased the minimum Arrow R version to 1.0.0 in SPARK-32452, and Arrow R 0.14 dropped `as_tibble`. We can remove the usage in SparkR.

### Why are the changes needed?

To remove code that is no longer used.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

GitHub Actions will test them out.

Closes #29361 from HyukjinKwon/SPARK-32543.

Authored-by: HyukjinKwon <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
… status change

### What changes were proposed in this pull request?
This PR adds a try-catch block for `FileNotFoundException` when adding a new entry to the history server application listing, so that non-existing paths are skipped.

### Why are the changes needed?
If there are a large number (>100k) of applications in the log dir, listing the log dir will take a few seconds. After getting the path list, some applications might have finished already, and the filename will have changed from `foo.inprogress` to `foo`.

This leads to a problem when adding an entry to the listing: querying the file status, e.g. `fileSizeForLastIndex`, will throw a `FileNotFoundException` if the application has finished. The exception aborts the current loop; in a busy cluster, it can prevent the history server from listing and loading any application logs.

```
20/08/03 15:17:23 ERROR FsHistoryProvider: Exception in checking for event log updates
 java.io.FileNotFoundException: File does not exist: hdfs://xx/logs/spark/application_11111111111111.lz4.inprogress
 at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1527)
 at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1520)
 at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
 at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1520)
 at org.apache.spark.deploy.history.SingleFileEventLogFileReader.status$lzycompute(EventLogFileReaders.scala:170)
```
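
A self-contained sketch of the pattern this PR applies (the names are illustrative, not Spark's): a vanished log file is skipped instead of aborting the whole scan loop.

```scala
import java.io.FileNotFoundException

// Returns None when the file disappeared between the directory listing and the
// status lookup (e.g. foo.inprogress was renamed to foo), so the caller can skip it.
def safeFileSize(path: String, fileSize: String => Long): Option[Long] =
  try Some(fileSize(path))
  catch { case _: FileNotFoundException => None }
```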

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
1. Set up a script that keeps changing the filenames of applications under the history log dir
2. Launch the history server
3. Check that the `File does not exist` error log is gone.

Closes #29350 from yanxiaole/SPARK-32529.

Authored-by: Yan Xiaole <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
### What changes were proposed in this pull request?

Exit the executor when it has been asked to decommission and there is nothing left for it to do.

This is a rebase of #28817

### Why are the changes needed?

If we want to use decommissioning in Spark's own scale-down, we should terminate the executor once it is finished.
Furthermore, during graceful shutdown it makes sense to release resources we no longer need if we've been asked to shut down by the cluster manager, instead of always holding resources as long as possible.

### Does this PR introduce _any_ user-facing change?

The decommissioned executors will exit at the end of decommissioning. This is sort of a user-facing change; however, decommissioning hasn't been in any releases yet.

### How was this patch tested?

I changed the unit test to not send the executor exit message and still wait on the executor exited message.

Closes #29211 from holdenk/SPARK-31197-exit-execs-redone.

Authored-by: Holden Karau <[email protected]>
Signed-off-by: Holden Karau <[email protected]>
…ks should consider all kinds of resources

### What changes were proposed in this pull request?

1. Make `CoarseGrainedSchedulerBackend.maxNumConcurrentTasks()` consider all kinds of resources when calculating the max concurrent tasks.

2. Refactor `calculateAvailableSlots()` so that it can be used by both `CoarseGrainedSchedulerBackend` and `TaskSchedulerImpl`.

### Why are the changes needed?

Currently, `CoarseGrainedSchedulerBackend.maxNumConcurrentTasks()` only considers CPUs for the max concurrent tasks. This can cause the application to hang when a barrier stage requires extra custom resources but the cluster doesn't have enough of them: without checking other custom resources in `maxNumConcurrentTasks`, the barrier stage can be submitted to `TaskSchedulerImpl`, but `TaskSchedulerImpl` won't launch tasks for it due to the insufficient task slots calculated by `TaskSchedulerImpl.calculateAvailableSlots` (which does check all kinds of resources).

The application hang issue can be reproduced by the added unit test.
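
A hedged sketch of the core idea (the method shape is illustrative, not the actual signature): the number of available slots is the minimum over all resources, not just CPUs.

```scala
// Example: 8 cores with 1 cpu/task gives 8 CPU slots, but 2 GPUs with 1 gpu/task
// gives only 2 slots, so only 2 concurrent tasks fit.
def availableSlots(
    coresPerExecutor: Int,
    cpusPerTask: Int,
    resourcesPerExecutor: Map[String, Int],
    resourcesPerTask: Map[String, Int]): Int = {
  val cpuSlots = coresPerExecutor / cpusPerTask
  val resourceSlots = resourcesPerTask.map { case (name, perTask) =>
    resourcesPerExecutor.getOrElse(name, 0) / perTask
  }
  (cpuSlots +: resourceSlots.toSeq).min
}
```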

### Does this PR introduce _any_ user-facing change?

Yes. If a barrier stage requires more custom resources than the cluster has, the application could hang before this PR; after this PR it fails due to insufficient resources instead.

### How was this patch tested?

Added a unit test.

Closes #29332 from Ngone51/fix-slots.

Authored-by: yi.wu <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
… style check by default

### What changes were proposed in this pull request?

Disallow `FileSystem.get(Configuration conf)` in Scala style check by default and suggest developers use `FileSystem.get(URI uri, Configuration conf)` or `Path.getFileSystem()` instead.

### Why are the changes needed?

The method `FileSystem.get(Configuration conf)` returns a default FileSystem instance if the conf `fs.file.impl` is not set. This can cause a file-not-found exception when reading a target path on a non-default file system, e.g. S3, and such a mistake is hard to discover via unit tests.
If we disallow it in the Scala style check by default and suggest that developers use `FileSystem.get(URI uri, Configuration conf)` or `Path.getFileSystem(Configuration conf)`, we can reduce potential regressions and PR review effort.
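
A hedged sketch contrasting the disallowed call with the suggested alternatives (the S3 path is illustrative):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val conf = new Configuration()
val path = new Path("s3a://some-bucket/data")

// Preferred: resolve the FileSystem from the target path or its URI.
val fsFromPath: FileSystem = path.getFileSystem(conf)
val fsFromUri: FileSystem  = FileSystem.get(path.toUri, conf)

// Now flagged by the style check: returns the *default* FileSystem, which may
// not match the path's scheme and can lead to file-not-found errors.
// val fsDefault: FileSystem = FileSystem.get(conf)
```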

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Manually ran the Scala style check and tests.

Closes #29357 from gengliangwang/newStyleRule.

Authored-by: Gengliang Wang <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
@GuoPhilipse GuoPhilipse merged commit 5dfb1ac into GuoPhilipse:master Aug 6, 2020