forked from apache/spark
Merged
sync #4
Conversation
### What changes were proposed in this pull request?
Fix a few issues in SQL Reference.

### Why are the changes needed?
To make SQL Reference look better.

### Does this PR introduce _any_ user-facing change?
Yes (before/after screenshots attached in the PR).

### How was this patch tested?
Manually build and check.

Closes #28608 from huaxingao/doc_fix.
Authored-by: Huaxin Gao <[email protected]>
Signed-off-by: Takeshi Yamamuro <[email protected]>
…a log file when finding the latest batch ID

### What changes were proposed in this pull request?
This patch adds the new method `getLatestBatchId()` in CompactibleFileStreamLog, in complement of getLatest(), which doesn't read the content of the latest batch metadata log file, and applies it to both FileStreamSource and FileStreamSink to avoid unnecessary latency when reading the log file.

### Why are the changes needed?
Once a compacted metadata log file becomes huge, writing outputs for the compact + 1 batch is also affected due to unnecessarily reading the compacted metadata log file. This unnecessary latency can be simply avoided.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
New UT. Also manually tested under a query which has a huge metadata log on the file stream sink (before/after charts attached in the PR). Peaks are compact batches - please compare the next batch after compact batches, especially the area of "light brown".

Closes #27664 from HeartSaVioR/SPARK-30915.
Authored-by: Jungtaek Lim (HeartSaVioR) <[email protected]>
Signed-off-by: Shixiong Zhu <[email protected]>
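For illustration of the idea above, here is a minimal sketch of finding the latest batch ID from the metadata log directory without reading file contents; the method body, helper names and file-naming assumptions are illustrative only, not the actual CompactibleFileStreamLog code.

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

// Sketch: determine the latest batch ID from the file names in the metadata
// log directory, without deserializing the (possibly huge) compacted log file.
// File-naming convention assumed here: batch files named "<batchId>" or
// "<batchId>.compact".
def getLatestBatchId(fs: FileSystem, metadataPath: Path): Option[Long] = {
  val batchIds = fs.listStatus(metadataPath).toSeq
    .map(_.getPath.getName)
    .flatMap(name => scala.util.Try(name.stripSuffix(".compact").toLong).toOption)
  if (batchIds.isEmpty) None else Some(batchIds.max)
}
```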
…tion for invalid length input

### What changes were proposed in this pull request?
This PR intends to add trivial tests to check #27024 has already been fixed in the master.

Closes #27024

### Why are the changes needed?
For test coverage.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Added tests.

Closes #28604 from maropu/SPARK-29854.
Authored-by: Takeshi Yamamuro <[email protected]>
Signed-off-by: Takeshi Yamamuro <[email protected]>
### What changes were proposed in this pull request?
Increase the timeout and register the listener earlier to avoid any race condition of the job starting before the listener is registered.

### Why are the changes needed?
The test is currently semi-flaky.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
I'm currently running the following bash script on my dev machine to verify the flakiness decreases. It has gotten to 356 iterations without any test failures, so I believe the issue is fixed.

```
set -ex
./build/sbt clean compile package
((failures=0))
for (( i=0;i<1000;++i )); do
  echo "Run $i"
  ((failed=0))
  ./build/sbt "core/testOnly org.apache.spark.scheduler.WorkerDecommissionSuite" || ((failed=1))
  echo "Resulted in $failed"
  ((failures=failures+failed))
  echo "Current status is failures: $failures out of $i runs"
done
```

Closes #28614 from holdenk/SPARK-31791-improve-cache-block-migration-test-reliability.
Authored-by: Holden Karau <[email protected]>
Signed-off-by: Holden Karau <[email protected]>
### What changes were proposed in this pull request?
Add pagination support for the Structured Streaming page. Now both tables, `Active Queries` and `Completed Queries`, will have pagination. To implement pagination, the pagination framework from #7399 is used.
* Also, tables will only be shown if there is at least one entry in the table.

### Why are the changes needed?
* This will help users analyse their structured streaming queries in a much better way.
* Other Web UI pages support pagination in their tables, so this will make the web UI more consistent across pages.
* This can prevent potential OOM errors.

### Does this PR introduce _any_ user-facing change?
Yes. Both tables will support pagination.

### How was this patch tested?
Manually. I will add snapshots soon.

Closes #28485 from iRakson/SPARK-31642.
Authored-by: iRakson <[email protected]>
Signed-off-by: Kousuke Saruta <[email protected]>
### What changes were proposed in this pull request?
This PR aims to upgrade the `kubernetes-client` library to bring in the JDK8-related fixes. Please note that JDK11 works fine without any problem.
- https://github.com/fabric8io/kubernetes-client/releases/tag/v4.9.2
  - JDK8 always uses the http/1.1 protocol (prevent OkHttp from wrongly enabling http/2)

### Why are the changes needed?
OkHttp "wrongly" detects the Platform as Jdk9Platform on JDK 8u251.
- fabric8io/kubernetes-client#2212
- https://stackoverflow.com/questions/61565751/why-am-i-not-able-to-run-sparkpi-example-on-a-kubernetes-k8s-cluster

Although there are workarounds (`export HTTP2_DISABLE=true` and downgrading JDK or K8s), we had better avoid this problematic situation.

### Does this PR introduce _any_ user-facing change?
No. This will recover the failures on JDK 8u252.

### How was this patch tested?
- [x] Pass the Jenkins UT (#28601 (comment))
- [x] Pass the Jenkins K8S IT with the K8s 1.13 (#28601 (comment))
- [x] Manual testing with K8s 1.17.3. (Below)

**v1.17.6 result (on Minikube)**
```
KubernetesSuite:
- Run SparkPi with no resources
- Run SparkPi with a very long application name.
- Use SparkLauncher.NO_RESOURCE
- Run SparkPi with a master URL without a scheme.
- Run SparkPi with an argument.
- Run SparkPi with custom labels, annotations, and environment variables.
- All pods have the same service account by default
- Run extraJVMOptions check on driver
- Run SparkRemoteFileTest using a remote data file
- Run SparkPi with env and mount secrets.
- Run PySpark on simple pi.py example
- Run PySpark with Python2 to test a pyfiles example
- Run PySpark with Python3 to test a pyfiles example
- Run PySpark with memory customization
- Run in client mode.
- Start pod creation from template
- PVs with local storage
- Launcher client dependencies
- Test basic decommissioning
Run completed in 8 minutes, 27 seconds.
Total number of tests run: 19
Suites: completed 2, aborted 0
Tests: succeeded 19, failed 0, canceled 0, ignored 0, pending 0
All tests passed.
```

Closes #28601 from dongjoon-hyun/SPARK-K8S-CLIENT.
Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
…data

### What changes were proposed in this pull request?
Currently, the data source scan node stores all the paths in its metadata. The metadata is kept when a SparkPlan is converted into SparkPlanInfo. SparkPlanInfo can be used to construct the Spark plan graph in the UI. However, the paths can be very large (e.g. there can be many partitions after partition pruning), while UI pages only require up to 100 bytes for the location metadata. We can reduce the paths stored in metadata to reduce memory usage.

### Why are the changes needed?
Reduce unnecessary memory cost. In the heap dump of a driver, the SparkPlanInfo instances are quite large and this should be avoided (heap dump screenshot attached in the PR).

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Unit tests

Closes #28610 from gengliangwang/improveLocationMetadata.
Authored-by: Gengliang Wang <[email protected]>
Signed-off-by: Gengliang Wang <[email protected]>
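Roughly, the idea is to store only a bounded summary of the paths instead of all of them. A generic sketch under that assumption (the helper name and truncation logic are made up for illustration; they are not the actual DataSourceScanExec code):

```scala
// Sketch: keep only as many paths as fit in a small, bounded location string,
// so SparkPlanInfo (and the UI graph built from it) does not retain every path.
def locationMetadata(paths: Seq[String], maxMetadataLength: Int = 100): String = {
  val builder = new StringBuilder
  var kept = 0
  while (kept < paths.length && builder.length < maxMetadataLength) {
    if (kept > 0) builder.append(", ")
    builder.append(paths(kept))
    kept += 1
  }
  val omitted = paths.length - kept
  if (omitted > 0) builder.append(s" [... $omitted more paths]")
  builder.toString
}
```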
… test

This reverts commit d955708.

Closes #28624 from sarutak/revert-real-headless-browser-support.
Authored-by: Kousuke Saruta <[email protected]>
Signed-off-by: Kousuke Saruta <[email protected]>
…IntegralDivide operator

### What changes were proposed in this pull request?
The `IntegralDivide` operator returns the Long DataType, so the integer overflow case should be handled. If the operands are of type Int, they will be cast to Long.

### Why are the changes needed?
As `IntegralDivide` returns the Long datatype, integer overflow should not happen.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added UT and also tested in the local cluster (before/after screenshots attached in the PR).

Closes #28600 from sandeep-katta/integerOverFlow.
Authored-by: sandeep katta <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
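For context, the overflow in question is the usual two's-complement corner case: with Int operands, `Int.MinValue` divided by `-1` wraps around, while the Long-typed result of `IntegralDivide` can represent the correct value. A small illustrative snippet (not the operator's implementation):

```scala
// Illustration only: why Int operands need to be cast to Long before dividing.
val a: Int = Int.MinValue        // -2147483648
val b: Int = -1

// Plain Int division wraps around: the true result 2147483648 does not fit
// in an Int, so this evaluates to -2147483648 again.
val overflowed: Int = a / b

// Casting to Long first (as the fix does for Int operands) yields the result
// that the Long-typed IntegralDivide operator is expected to return.
val correct: Long = a.toLong / b.toLong  // 2147483648
```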
### What changes were proposed in this pull request?
add getMetrics in Evaluators to get the corresponding Metrics instance, so users can use it to get any of the metrics scores. For example:
```
val trainer = new LinearRegression
val model = trainer.fit(dataset)
val predictions = model.transform(dataset)
val evaluator = new RegressionEvaluator()
val metrics = evaluator.getMetrics(predictions)
val rmse = metrics.rootMeanSquaredError
val r2 = metrics.r2
val mae = metrics.meanAbsoluteError
val variance = metrics.explainedVariance
```
### Why are the changes needed?
Currently, Evaluator.evaluate only provides access to one metric, but most users may need to get multiple metrics. This PR adds getMetrics in all the Evaluators, so users can use it to get an instance of the corresponding Metrics and retrieve any of the metrics they want.
### Does this PR introduce _any_ user-facing change?
Yes. Add getMetrics in Evaluators.
For example:
```
/**
* Get a RegressionMetrics, which can be used to get any of the regression
* metrics such as rootMeanSquaredError, meanSquaredError, etc.
*
* @param dataset a dataset that contains labels/observations and predictions.
* @return RegressionMetrics
*/
@Since("3.1.0")
def getMetrics(dataset: Dataset[_]): RegressionMetrics
```
### How was this patch tested?
Add new unit tests
Closes #28590 from huaxingao/getMetrics.
Authored-by: Huaxin Gao <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
### What changes were proposed in this pull request?
This PR aims to use the style that is compatible with both python 2 and 3.

### Why are the changes needed?
This will help python 3 migration.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Manual.

Closes #28632 from williamhyun/use_python3_style.
Authored-by: William Hyun <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
### What changes were proposed in this pull request?
UnionRDD of PairRDDs was causing a bug. The fix is to check for the instance type before proceeding.

### Why are the changes needed?
Changes are needed to avoid users running into issues with the union RDD operation with any type other than JavaRDD.

### Does this PR introduce _any_ user-facing change?
Yes.

Before:
SparkSession available as 'spark'.
>>> rdd1 = sc.parallelize([1,2,3,4,5])
>>> rdd2 = sc.parallelize([6,7,8,9,10])
>>> pairRDD1 = rdd1.zip(rdd2)
>>> unionRDD1 = sc.union([pairRDD1, pairRDD1])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/gs/spark/latest/python/pyspark/context.py", line 870, in union
    jrdds[i] = rdds[i]._jrdd
  File "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py", line 238, in __setitem__
  File "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py", line 221, in __set_item
  File "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 332, in get_return_value
py4j.protocol.Py4JError: An error occurred while calling None.None. Trace:
py4j.Py4JException: Cannot convert org.apache.spark.api.java.JavaPairRDD to org.apache.spark.api.java.JavaRDD
  at py4j.commands.ArrayCommand.convertArgument(ArrayCommand.java:166)
  at py4j.commands.ArrayCommand.setArray(ArrayCommand.java:144)
  at py4j.commands.ArrayCommand.execute(ArrayCommand.java:97)
  at py4j.GatewayConnection.run(GatewayConnection.java:238)
  at java.lang.Thread.run(Thread.java:748)

After:
>>> rdd2 = sc.parallelize([6,7,8,9,10])
>>> pairRDD1 = rdd1.zip(rdd2)
>>> unionRDD1 = sc.union([pairRDD1, pairRDD1])
>>> unionRDD1.collect()
[(1, 6), (2, 7), (3, 8), (4, 9), (5, 10), (1, 6), (2, 7), (3, 8), (4, 9), (5, 10)]

### How was this patch tested?
Tested manually with the reproduction above.

Closes #28603 from redsanket/SPARK-31788.
Authored-by: schintap <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
…etric' for some joins in SQLMetricSuite

### What changes were proposed in this pull request?
Add unit tests for the 'number of output rows' metric for some join types in the SQLMetricSuite. The unit tests added are as follows.
- ShuffledHashJoin: LeftOuter, RightOuter, LeftAnti, LeftSemi
- BroadcastNestedLoopJoin: RightOuter
- BroadcastHashJoin: LeftAnti

### Why are the changes needed?
For some combinations of JoinType and join algorithm there is no test coverage for the 'number of output rows' metric.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
I added debug statements in the code to ensure the correct combination of JoinType and join algorithm is triggered. I further used the IntelliJ debugger to test the same.

Closes #28330 from sririshindra/SPARK-31377.
Authored-by: rishi <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
…ectly

### What changes were proposed in this pull request?
Use the `format()` methods for Java date-time types in `Row.jsonValue`. The PR #28582 added the methods to avoid conversions to days and microseconds.

### Why are the changes needed?
To avoid the unnecessary overhead of converting Java date-time types to micros/days before formatting. Also, the formatters would have to convert the input micros/days back to Java types to pass instances to the standard library API.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By existing tests in `RowJsonSuite`.

Closes #28620 from MaxGekk/toJson-format-Java-datetime-types.
Authored-by: Max Gekk <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
### What changes were proposed in this pull request?
Add weight support in ClusteringEvaluator.

### Why are the changes needed?
Currently, BinaryClassificationEvaluator, RegressionEvaluator, and MulticlassClassificationEvaluator support instance weight, but ClusteringEvaluator doesn't, so we will add instance weight support in ClusteringEvaluator.

### Does this PR introduce _any_ user-facing change?
Yes. ClusteringEvaluator.setWeightCol

### How was this patch tested?
Add new unit test.

Closes #28553 from huaxingao/weight_evaluator.
Authored-by: Huaxin Gao <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
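A minimal usage sketch of the new setter; the DataFrame `predictions` and its column names are hypothetical:

```scala
import org.apache.spark.ml.evaluation.ClusteringEvaluator

// `predictions` is a hypothetical DataFrame with clustering output columns
// "features" and "prediction", plus a per-row instance weight column "weight".
val evaluator = new ClusteringEvaluator()
  .setFeaturesCol("features")
  .setPredictionCol("prediction")
  .setWeightCol("weight")  // the new setter added by this change

// Silhouette score computed with instance weights taken into account.
val silhouette = evaluator.evaluate(predictions)
```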
### What changes were proposed in this pull request?
As we support multiple catalogs with DataSourceV2, we may need the `CURRENT_CATALOG` value expression from the SQL standard. `CURRENT_CATALOG` is a general value specification in the SQL Standard, described as:
> The value specified by CURRENT_CATALOG is the character string that represents the current default catalog name.

### Why are the changes needed?
Improve catalog v2 with the ANSI SQL standard.

### Does this PR introduce any user-facing change?
Yes, it adds a new function `current_catalog()` that points to the current active catalog.

### How was this patch tested?
Added UT.

Closes #27006 from yaooqinn/SPARK-30352.
Authored-by: Kent Yao <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
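A quick usage sketch of the new function; the shown output assumes a default session with no extra catalogs configured, so the exact value may differ:

```scala
// Query the new SQL function; with no extra catalogs configured this should
// return the name of the built-in session catalog.
spark.sql("SELECT current_catalog()").show()
// +-----------------+
// |current_catalog()|
// +-----------------+
// |    spark_catalog|
// +-----------------+
```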
… results

### What changes were proposed in this pull request?
Re-generate results of:
- DateTimeBenchmark
- CSVBenchmark
- JsonBenchmark

in the environment:

| Item | Description |
| ---- | ---- |
| Region | us-west-2 (Oregon) |
| Instance | r3.xlarge |
| AMI | ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 (ami-06f2f779464715dc5) |
| Java | OpenJDK 64-Bit Server VM 1.8.0_242 and OpenJDK 64-Bit Server VM 11.0.6+10 |

### Why are the changes needed?
1. The PR #28576 changed the date-time parser. The `DateTimeBenchmark` should confirm that the PR didn't slow down date/timestamp parsing.
2. CSV/JSON datasources are affected by the above PR too. This PR updates the benchmark results in the same environment as other benchmarks to have a base line for future optimizations.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running benchmarks via the script:

```python
#!/usr/bin/env python3
import os
from sparktestsupport.shellutils import run_cmd

benchmarks = [
    ['sql/test', 'org.apache.spark.sql.execution.benchmark.DateTimeBenchmark'],
    ['sql/test', 'org.apache.spark.sql.execution.datasources.csv.CSVBenchmark'],
    ['sql/test', 'org.apache.spark.sql.execution.datasources.json.JsonBenchmark']
]

print('Set SPARK_GENERATE_BENCHMARK_FILES=1')
os.environ['SPARK_GENERATE_BENCHMARK_FILES'] = '1'
for b in benchmarks:
    print("Run benchmark: %s" % b[1])
    run_cmd(['build/sbt', '%s:runMain %s' % (b[0], b[1])])
```

Closes #28613 from MaxGekk/missing-hour-year-benchmarks.
Authored-by: Max Gekk <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
…/L/E/u/Q/q'

### What changes were proposed in this pull request?
Five continuous pattern characters of 'G/M/L/E/u/Q/q' mean Narrow-Text Style since we turned to `java.time.DateTimeFormatterBuilder` in 3.0.0, which outputs the leading single letter of the value, e.g. `December` would be `D`. In Spark 2.4 they mean Full-Text Style. In this PR, we explicitly disable Narrow-Text Style for these pattern characters.

### Why are the changes needed?
Without this change, there will be a silent data change.

### Does this PR introduce _any_ user-facing change?
Yes. Queries with datetime operations using datetime patterns, e.g. `G/M/L/E/u`, will fail if the pattern length is 5; other patterns, e.g. 'k', 'm', also can accept only a certain number of letters.

1. Datetime patterns that are not supported by the new parser but are supported by the legacy one, e.g. "GGGGG", "MMMMM", "LLLLL", "EEEEE", "uuuuu", "aa", "aaa", will get a SparkUpgradeException. Two options are given to end-users: one is to use legacy mode, and the other is to follow the new online doc for correct datetime patterns.
2. Datetime patterns that are not supported by both the new parser and the legacy one, e.g. "QQQQQ", "qqqqq", will get an IllegalArgumentException, which is captured by Spark internally and results in NULL to end-users.

### How was this patch tested?
Added unit tests.

Closes #28592 from yaooqinn/SPARK-31771.
Authored-by: Kent Yao <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
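To illustrate the behavior change, a small sketch using `date_format` (the literal values are arbitrary, and exact error messages may differ):

```scala
// Four pattern letters mean Full-Text Style and keep working as before.
spark.sql("SELECT date_format(TIMESTAMP '2020-05-22 00:00:00', 'MMMM')").show()
// -> May

// Five letters meant Full-Text Style in Spark 2.4 but Narrow-Text Style in the
// new formatter; with this change the query fails with SparkUpgradeException
// instead of silently returning the single letter "M".
spark.sql("SELECT date_format(TIMESTAMP '2020-05-22 00:00:00', 'MMMMM')").show()

// One of the suggested escapes: restore the 2.4 behavior via legacy mode.
spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")
```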