forked from apache/spark

Sync with apache:master #9

Merged
… predicates
## What changes were proposed in this pull request?
This PR aims to fix an ORC performance regression in the Spark 2.4.0 RCs relative to Spark 2.3.2. Currently, pushed predicates on column names containing `.` are ignored.
**Test Data**
```scala
scala> val df = spark.range(Int.MaxValue).sample(0.2).toDF("col.with.dot")
scala> df.write.mode("overwrite").orc("/tmp/orc")
```
**Spark 2.3.2**
```scala
scala> spark.sql("set spark.sql.orc.impl=native")
scala> spark.sql("set spark.sql.orc.filterPushdown=true")
scala> spark.time(spark.read.orc("/tmp/orc").where("`col.with.dot` < 10").show)
+------------+
|col.with.dot|
+------------+
| 5|
| 7|
| 8|
+------------+
Time taken: 1542 ms
scala> spark.time(spark.read.orc("/tmp/orc").where("`col.with.dot` < 10").show)
+------------+
|col.with.dot|
+------------+
| 5|
| 7|
| 8|
+------------+
Time taken: 152 ms
```
**Spark 2.4.0 RC3**
```scala
scala> spark.time(spark.read.orc("/tmp/orc").where("`col.with.dot` < 10").show)
+------------+
|col.with.dot|
+------------+
| 5|
| 7|
| 8|
+------------+
Time taken: 4074 ms
scala> spark.time(spark.read.orc("/tmp/orc").where("`col.with.dot` < 10").show)
+------------+
|col.with.dot|
+------------+
| 5|
| 7|
| 8|
+------------+
Time taken: 1771 ms
```
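The core of the fix, sketched in plain Scala rather than the actual patch: attribute names that contain `.` must be quoted before being embedded in a pushed ORC predicate, otherwise the dot is interpreted as nested-field access and the predicate is dropped. `quoteIfNeeded` is a hypothetical helper name used only for illustration:

```scala
// Hypothetical helper illustrating the quoting idea: wrap attribute
// names containing a dot in backticks so they are not parsed as
// nested field references when the predicate is pushed down.
def quoteIfNeeded(name: String): String =
  if (name.contains(".") && !name.startsWith("`")) s"`$name`" else name

println(quoteIfNeeded("id"))           // id
println(quoteIfNeeded("col.with.dot")) // `col.with.dot`
```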
## How was this patch tested?
Pass the Jenkins with a newly added test case.
Closes #22597 from dongjoon-hyun/SPARK-25579.
Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: hyukjinkwon <[email protected]>
## What changes were proposed in this pull request?
Remove Kafka 0.8 integration.

## How was this patch tested?
Existing tests, build scripts.

Closes #22703 from srowen/SPARK-25705.
Authored-by: Sean Owen <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
- Exposes several metrics regarding application status as a source, useful for scraping them via JMX instead of mining the metrics REST API. Example use case: Prometheus + JMX exporter.
- Metrics are gathered when a job ends on the AppStatusListener side. This could be more fine-grained, but most metrics, like tasks completed, are also counted by executors. More metrics could be exposed in the future to avoid scraping executors in some scenarios.
- A config option `spark.app.status.metrics.enabled` is added to enable/disable these metrics; by default they are disabled.

This was manually tested with the JMX source enabled and a Prometheus server on k8s; screenshots attached to the PR (not reproduced here) show the job delay for a repeated pi calculation (Spark action).

Closes #22381 from skonto/add_app_status_metrics.
Authored-by: Stavros Kontopoulos <[email protected]>
Signed-off-by: Marcelo Vanzin <[email protected]>
…kaRDDSuite

## What changes were proposed in this pull request?
Set a reasonable poll timeout that is used while consuming topics/partitions from Kafka. In its absence, a default of 2 minutes is used as the timeout value, and all the negative tests take a minimum of 2 minutes to execute. After this change, we save about 4 minutes in this suite.

## How was this patch tested?
Test fix.

Closes #22670 from dilipbiswal/SPARK-25631.
Authored-by: Dilip Biswal <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
## What changes were proposed in this pull request?
The PR adds a new function `from_csv()`, similar to `from_json()`, to parse columns containing CSV strings. I added the following method:
```scala
def from_csv(e: Column, schema: StructType, options: Map[String, String]): Column
```
and this signature to call it from Python, R and Java:
```scala
def from_csv(e: Column, schema: String, options: java.util.Map[String, String]): Column
```

## How was this patch tested?
Added new test suites `CsvExpressionsSuite`, `CsvFunctionsSuite` and SQL tests.

Closes #22379 from MaxGekk/from_csv.
Lead-authored-by: Maxim Gekk <[email protected]>
Co-authored-by: Maxim Gekk <[email protected]>
Co-authored-by: Hyukjin Kwon <[email protected]>
Co-authored-by: hyukjinkwon <[email protected]>
Signed-off-by: hyukjinkwon <[email protected]>
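As a rough conceptual illustration of what `from_csv()` does (this is plain Scala, not the Spark implementation; all names below are invented): a declared schema drives how each delimited field of the input string is converted into a typed value.

```scala
// Toy sketch of the from_csv idea: a schema (name -> type) drives the
// conversion of each CSV field. Types and helper names are hypothetical.
sealed trait FieldType
case object IntType extends FieldType
case object StringType extends FieldType

def parseCsvLine(line: String,
                 schema: Seq[(String, FieldType)],
                 delimiter: String = ","): Map[String, Any] = {
  // -1 keeps trailing empty fields instead of dropping them.
  val values = line.split(delimiter, -1)
  schema.zip(values).map {
    case ((name, IntType), v)    => name -> v.trim.toInt
    case ((name, StringType), v) => name -> v
  }.toMap
}

val row = parseCsvLine("1,Hyukjin", Seq("id" -> IntType, "name" -> StringType))
// row: Map("id" -> 1, "name" -> "Hyukjin")
```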
## What changes were proposed in this pull request?
`Literal.value` should have a value corresponding to `dataType`. This PR adds code to verify this and fixes the existing tests accordingly.

## How was this patch tested?
Modified the existing tests.

Closes #22724 from maropu/SPARK-25734.
Authored-by: Takeshi Yamamuro <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
## What changes were proposed in this pull request?
This PR corrects some comment errors:
1. Change "as low a possible" to "as low as possible" in RewriteDistinctAggregates.scala.
2. Delete the redundant word "with" in HiveTableScanExec's doExecute() method.

## How was this patch tested?
Existing unit tests.

Closes #22694 from CarolinePeng/update_comment.
Authored-by: 彭灿00244106 <[email protected]>
Signed-off-by: hyukjinkwon <[email protected]>
…on thread

## What changes were proposed in this pull request?
The SQL execution listener framework was created from scratch (see #9078). It didn't leverage what we already have in the Spark listener framework, and one major problem is that the listener runs on the Spark execution thread, which means a bad listener can block Spark's query processing.

This PR re-implements the SQL execution listener framework. Now `ExecutionListenerManager` is just a normal Spark listener, which watches for `SparkListenerSQLExecutionEnd` events and posts them to the user-provided SQL execution listeners.

## How was this patch tested?
Existing tests.

Closes #22674 from cloud-fan/listener.
Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
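The gist of the re-implementation, as a simplified sketch with invented names (not Spark's actual classes): instead of invoking user listeners synchronously on the execution thread, the manager queues end-of-execution events and forwards them from a separate dispatch step, so a slow or misbehaving user listener no longer blocks query processing.

```scala
// Simplified sketch: events are queued by the producing thread and
// user listeners are invoked from a separate dispatch step. All names
// here are illustrative, not Spark's.
case class ExecutionEnd(executionId: Long, durationNs: Long)

class ListenerManager {
  private val listeners =
    scala.collection.mutable.ArrayBuffer.empty[ExecutionEnd => Unit]
  private val queue =
    new java.util.concurrent.LinkedBlockingQueue[ExecutionEnd]()

  def register(l: ExecutionEnd => Unit): Unit = listeners += l

  // Called from the execution thread: cheap, never runs listeners.
  def post(event: ExecutionEnd): Unit = queue.put(event)

  // Called from a dispatcher thread: a bad listener only affects
  // dispatch, not query execution.
  def dispatchOne(): Unit = {
    val event = queue.take()
    listeners.foreach { l =>
      try l(event) catch { case _: Throwable => /* log and continue */ }
    }
  }
}

val m = new ListenerManager
var seen = 0L
m.register(e => seen = e.executionId)
m.post(ExecutionEnd(42L, 100L)) // returns immediately
m.dispatchOne()                 // listener runs here, not above
```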
## What changes were proposed in this pull request?
Currently when we open our doc site, https://spark.apache.org/docs/latest/index.html, there is a warning (a screenshot was attached to the PR). This PR changes the CDN as per the migration tips: https://www.mathjax.org/cdn-shutting-down/

This is very trivial, but it would be good to follow the suggestion from the MathJax team and remove the warning, in case one day the original CDN is no longer available.

## How was this patch tested?
Manual check.

Closes #22753 from gengliangwang/migrateMathJax.
Authored-by: Gengliang Wang <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
When deserializing values of ArrayType with struct elements into Java beans, the fields of the structs get mixed up. I suggest using struct data types retrieved from the resolved input data instead of inferring them from the Java beans.

## What changes were proposed in this pull request?
The MapObjects expression is used to map array elements to Java beans. The struct type of the elements is inferred from the Java bean structure and ends up with a mixed-up field order. I used UnresolvedMapObjects instead of MapObjects, which allows providing the element type for MapObjects during analysis based on the resolved input data, not on the Java bean.

## How was this patch tested?
Added a test case. Built the complete project on Travis.

michalsenkyr cloud-fan marmbrus liancheng

Closes #22708 from vofque/SPARK-21402.
Lead-authored-by: Vladimir Kuriatkov <[email protected]>
Co-authored-by: Vladimir Kuriatkov <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
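The bug above boils down to binding struct fields to bean properties by position when the two orders differ; resolving against the actual input schema binds by name instead. A toy Scala sketch of the difference (all names and data invented):

```scala
// Input row fields in the order the data source provides them.
val inputSchema = Seq("name", "age")
val row: Seq[Any] = Seq("Alice", 30)

// Bean property order as inferred via reflection (alphabetical here),
// which need not match the input order.
val beanOrder = Seq("age", "name")

// Positional binding mixes the fields up:
val byPosition = beanOrder.zip(row).toMap
// byPosition: Map("age" -> "Alice", "name" -> 30)  -- wrong!

// Binding by name against the resolved input schema is correct:
val byName = beanOrder.map(f => f -> row(inputSchema.indexOf(f))).toMap
// byName: Map("age" -> 30, "name" -> "Alice")
```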