Skip to content

Conversation

@GuoPhilipse
Copy link
Owner

sync

xy_xin and others added 26 commits June 29, 2020 13:13
…in MERGE INTO

### What changes were proposed in this pull request?
This PR add unlimited MATCHED and NOT MATCHED clauses in MERGE INTO statement.

### Why are the changes needed?
Now the MERGE INTO syntax is,
```
MERGE INTO [db_name.]target_table [AS target_alias]
 USING [db_name.]source_table [<time_travel_version>] [AS source_alias]
 ON <merge_condition>
 [ WHEN MATCHED [ AND <condition> ] THEN <matched_action> ]
 [ WHEN MATCHED [ AND <condition> ] THEN <matched_action> ]
 [ WHEN NOT MATCHED [ AND <condition> ] THEN <not_matched_action> ]
```
It would be nice if we support unlimited MATCHED and NOT MATCHED clauses in MERGE INTO statement, because users may want to deal with different "AND <condition>"s, the result of which just like a series of "CASE WHEN"s. The expected syntax looks like
```
MERGE INTO [db_name.]target_table [AS target_alias]
 USING [db_name.]source_table [<time_travel_version>] [AS source_alias]
 ON <merge_condition>
 [when_matched_clause [, ...]]
 [when_not_matched_clause [, ...]]
```
where when_matched_clause is
```
WHEN MATCHED [ AND <condition> ] THEN <matched_action>
```
and when_not_matched_clause is
```
WHEN NOT MATCHED [ AND <condition> ] THEN <not_matched_action>
 ```
matched_action can be one of
```
DELETE
UPDATE SET * or
UPDATE SET col1 = value1 [, col2 = value2, ...]
```
and not_matched_action can be one of
```
INSERT *
INSERT (col1 [, col2, ...]) VALUES (value1 [, value2, ...])
```
### Does this PR introduce _any_ user-facing change?
Yes. The SQL command changes, but it is backward compatible.

### How was this patch tested?
New tests added.

Closes #28875 from xianyinxin/SPARK-32030.

Authored-by: xy_xin <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
… directory

### What changes were proposed in this pull request?

This patch fixes the missed spot - the test initializes FileStreamSinkLog with its "output" directory instead of "metadata" directory, hence the verification against sink log was no-op.

### Why are the changes needed?

Without the fix, the verification against sink log was no-op.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Checked with debugger in test, and verified `allFiles()` returns non-zero entries. (It returned zero entry, as there's no metadata.)

Closes #28930 from HeartSaVioR/SPARK-29999-FOLLOWUP-fix-test.

Authored-by: Jungtaek Lim (HeartSaVioR) <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
…e in stage tab

### What changes were proposed in this pull request?
`formatDate` in utils.js `org/apache/spark/ui/static/utils.js` is partly refactored.

### Why are the changes needed?
In branch-2.4,task launch time is returned as html string from driver,
while in branch-3.x,this is returned in JSON Object as`Date`type  from `org.apache.spark.status.api.v1.TaskData`
Due to:
LaunchTime from jersey server in spark driver is correct, which will be converted to date string like `2020-06-28T02:57:42.605GMT` in json object, then the formatDate in utils.js treat it as date.split(".")[0].replace("T", " ").
So `2020-06-28T02:57:42.605GMT` will be converted to `2020-06-28 02:57:42`, but correct is `2020-06-28 10:57:42` in GMT+8 timezone.
![选区_071](https://user-images.githubusercontent.com/7149304/85937186-b6d36780-b933-11ea-8382-80a3891f1c2a.png)
![选区_070](https://user-images.githubusercontent.com/7149304/85937190-bcc94880-b933-11ea-8860-2083c97ea269.png)

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Manual test.

Closes #28918 from TJX2014/master-SPARK-32068-ui-task-lauch-time-tz.

Authored-by: TJX2014 <[email protected]>
Signed-off-by: Thomas Graves <[email protected]>
…location

### What changes were proposed in this pull request?

This PR fixes a typo for a configuration property in the `spark-standalone.md`.
`spark.driver.resourcesfile` should be `spark.driver.resourcesFile`.
I look for similar typo but this is the only typo.

### Why are the changes needed?

The property name is wrong.

### Does this PR introduce _any_ user-facing change?

Yes. The property name is corrected.

### How was this patch tested?

I confirmed the spell of the property name is the correct from the property name defined in o.a.s.internal.config.package.scala.

Closes #28958 from sarutak/fix-resource-typo.

Authored-by: Kousuke Saruta <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
### What changes were proposed in this pull request?
When loading DataFrames from JDBC datasource with Kerberos authentication, remote executors (yarn-client/cluster etc. modes) fail to establish a connection due to lack of Kerberos ticket or ability to generate it.

This is a real issue when trying to ingest data from kerberized data sources (SQL Server, Oracle) in enterprise environment where exposing simple authentication access is not an option due to IT policy issues.

In this PR I've added Oracle support.

What this PR contains:
* Added `OracleConnectionProvider`
* Added `OracleConnectionProviderSuite`

### Why are the changes needed?
Missing JDBC kerberos support.

### Does this PR introduce _any_ user-facing change?
Yes, now user is able to connect to Oracle using kerberos.

### How was this patch tested?
* Additional + existing unit tests
* Test on cluster manually

Closes #28863 from gaborgsomogyi/SPARK-31336.

Authored-by: Gabor Somogyi <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
### What changes were proposed in this pull request?
JDBC connection providers implementation formatted in a wrong way. In this PR I've fixed the formatting.

### Why are the changes needed?
Wrong spacing in JDBC connection providers.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing unit tests.

Closes #28945 from gaborgsomogyi/provider_spacing.

Authored-by: Gabor Somogyi <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
…e example for `timestamp_seconds`

### What changes were proposed in this pull request?
Modify the example for `timestamp_seconds` and replace `collect()` by `show()`.

### Why are the changes needed?
The SQL config `spark.sql.session.timeZone` doesn't influence on the `collect` in the example. The code below demonstrates that:
```
$ export TZ="UTC"
```
```python
>>> from pyspark.sql.functions import timestamp_seconds
>>> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
>>> time_df = spark.createDataFrame([(1230219000,)], ['unix_time'])
>>> time_df.select(timestamp_seconds(time_df.unix_time).alias('ts')).collect()
[Row(ts=datetime.datetime(2008, 12, 25, 15, 30))]
```
The expected time is **07:30 but we get 15:30**.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running the modified example via:
```
$ ./python/run-tests --modules=pyspark-sql
```

Closes #28959 from MaxGekk/SPARK-32088-fix-timezone-issue-followup.

Authored-by: Max Gekk <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
…potential conflicts in dev

### What changes were proposed in this pull request?

This PR proposes to partially reverts back in the tests and some codes at #27728 without touching any behaivours.

Most of changes in tests are back before #27728 by combining `withNestedDataFrame` and `withParquetDataFrame`.

Basically, it addresses the comments #27728 (comment), and my own comment in another PR at #28761 (comment)

### Why are the changes needed?

For maintenance purpose and to avoid a potential conflicts during backports. And also in case when other codes are matched with this.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

Manually tested.

Closes #28955 from HyukjinKwon/SPARK-25556-followup.

Authored-by: HyukjinKwon <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
…US operations

### What changes were proposed in this pull request?
fix error exception messages during exceptions on Union and set operations

### Why are the changes needed?
Union and set operations can only be performed on tables with the compatible column types,while when we have more than two column, the exception messages will have wrong column index.

Steps to reproduce:

```
drop table if exists test1;
drop table if exists test2;
drop table if exists test3;
create table if not exists test1(id int, age int, name timestamp);
create table if not exists test2(id int, age timestamp, name timestamp);
create table if not exists test3(id int, age int, name int);
insert into test1 select 1,2,'2020-01-01 01:01:01';
insert into test2 select 1,'2020-01-01 01:01:01','2020-01-01 01:01:01';
insert into test3 select 1,3,4;
```

Query1:
```sql
select * from test1 except select * from test2;
```
Result1:
```
Error: org.apache.spark.sql.AnalysisException: Except can only be performed on tables with the compatible column types. timestamp <> int at the second column of the second table;; 'Except false :- Project [id#620, age#621, name#622] : +- SubqueryAlias `default`.`test1` : +- HiveTableRelation `default`.`test1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [id#620, age#621, name#622] +- Project [id#623, age#624, name#625] +- SubqueryAlias `default`.`test2` +- HiveTableRelation `default`.`test2`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [id#623, age#624, name#625] (state=,code=0)
```

Query2:

```sql
select * from test1 except select * from test3;
```

Result2:

```
Error: org.apache.spark.sql.AnalysisException: Except can only be performed on tables with the compatible column types
 int <> timestamp at the 2th column of the second table;
```

the above query1 has the right exception message
the above query2 have the wrong errors information, it may need to change to the following

```
Error: org.apache.spark.sql.AnalysisException: Except can only be performed on tables with the compatible column types.
int <> timestamp at the  third column of the second table
```

### Does this PR introduce _any_ user-facing change?
NO

### How was this patch tested?
unit test

Closes #28951 from GuoPhilipse/32131-correct-error-messages.

Lead-authored-by: GuoPhilipse <[email protected]>
Co-authored-by: GuoPhilipse <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
…tatistics

### What changes were proposed in this pull request?
Update documentation to reflect changes in faf220a

I've changed the documentation to reflect updated statistics may be used to improve query plan.

### Why are the changes needed?
I believe the documentation is stale and misleading.

### Does this PR introduce _any_ user-facing change?
Yes, this is a javadoc documentation fix.

### How was this patch tested?
Doc fix.

Closes #28925 from emkornfield/spark-32095.

Authored-by: Micah Kornfield <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
### What changes were proposed in this pull request?
Spark can't push down scan predicate condition of **Or**:
Such as if I have a table `default.test`, it's partition col is `dt`,
If we use query :
```
select * from default.test
where dt=20190625 or (dt = 20190626 and id in (1,2,3) )
```

In this case, Spark will resolve **Or** condition as one expression, and since this expr has reference of "id", then it can't been push down.

Base on pr #28733, In my PR ,  for SQL like
`select * from default.test`
 `where  dt = 20190626  or  (dt = 20190627  and xxx="a")   `

For this  condition `dt = 20190626  or  (dt = 20190627  and xxx="a"   )`, it will  been converted  to CNF
```
(dt = 20190626 or dt = 20190627) and (dt = 20190626 or xxx = "a" )
```
then condition `dt = 20190626 or dt = 20190627` will be push down when partition pruning

### Why are the changes needed?
Optimize partition pruning

### Does this PR introduce _any_ user-facing change?
NO

### How was this patch tested?
Added UT

Closes #28805 from AngersZhuuuu/cnf-for-partition-pruning.

Lead-authored-by: angerszhu <[email protected]>
Co-authored-by: AngersZhuuuu <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
…odel

### What changes were proposed in this pull request?
Add summary to RandomForestClassificationModel...

### Why are the changes needed?
so user can get a summary of this classification model, and retrieve common metrics such as accuracy, weightedTruePositiveRate, roc (for binary), pr curves (for binary), etc.

### Does this PR introduce _any_ user-facing change?
Yes
```
RandomForestClassificationModel.summary
RandomForestClassificationModel.evaluate
```

### How was this patch tested?
Add new tests

Closes #28913 from huaxingao/rf_summary.

Authored-by: Huaxin Gao <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
### What changes were proposed in this pull request?

This is a followup of #28534 , to make `TIMESTAMP_SECONDS` function support fractional input as well.

### Why are the changes needed?

Previously the cast function can cast fractional values to timestamp. Now we suggest users to ues these new functions, and we need to cover all the cast use cases.

### Does this PR introduce _any_ user-facing change?

Yes, now `TIMESTAMP_SECONDS` function accepts fractional input.

### How was this patch tested?

new tests

Closes #28956 from cloud-fan/follow.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
### What changes were proposed in this pull request?

This PR aims to add `PrometheusServletSuite`.

### Why are the changes needed?

This improves the test coverage.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the newly added test suite.

Closes #28865 from erenavsarogullari/spark_driver_prometheus_metrics_improvement.

Authored-by: Eren Avsarogullari <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
…ctive in data source options

### What changes were proposed in this pull request?

This is a followup of #28760 to fix the remaining issues:
1. should consider data source options when refreshing cache by path at the end of `InsertIntoHadoopFsRelationCommand`
2. should consider data source options when inferring schema for file source
3. should consider data source options when getting the qualified path in file source v2.

### Why are the changes needed?

We didn't catch these issues in #28760, because the test case is to check error when initializing the file system. If we initialize the file system multiple times during a simple read/write action, the test case actually only test the first time.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

rewrite the test to make sure the entire data source read/write action can succeed.

Closes #28948 from cloud-fan/fix.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Gengliang Wang <[email protected]>
### What changes were proposed in this pull request?
Set the JSON option `inferTimestamp` to `false` if an user don't pass it as datasource option.

### Why are the changes needed?
To prevent perf regression while inferring schemas from JSON with potential timestamps fields.

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
- Modified existing tests in `JsonSuite` and `JsonInferSchemaSuite`.
- Regenerated results of `JsonBenchmark` in the environment:

| Item | Description |
| ---- | ----|
| Region | us-west-2 (Oregon) |
| Instance | r3.xlarge |
| AMI | ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 (ami-06f2f779464715dc5) |
| Java | OpenJDK 64-Bit Server VM 1.8.0_252 and OpenJDK 64-Bit Server VM 11.0.7+10 |

Closes #28966 from MaxGekk/json-inferTimestamps-disable-by-default.

Authored-by: Max Gekk <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
…nt and sql when AQE is enabled

### What changes were proposed in this pull request?

As the followup of #28900, this patch extends coalescing partitions to repartitioning using hints and SQL syntax without specifying number of partitions, when AQE is enabled.

### Why are the changes needed?

When repartitionning using hints and SQL syntax, we should follow the shuffling behavior of repartition by expression/range to coalesce partitions when AQE is enabled.

### Does this PR introduce _any_ user-facing change?

Yes. After this change, if users don't specify the number of partitions when repartitioning using `REPARTITION`/`REPARTITION_BY_RANGE` hint or `DISTRIBUTE BY`/`CLUSTER BY`, AQE will coalesce partitions.

### How was this patch tested?

Unit tests.

Closes #28952 from viirya/SPARK-32056-sql.

Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
…ui.port) property if mentioned explicitly

## What changes were proposed in this pull request?
When a Spark Job launched in Cluster mode with Yarn, Application Master sets spark.ui.port port to 0 which means Driver's web UI gets any random port even if we want to explicitly set the Port range for Driver's Web UI

## Why are the changes needed?
We access Spark Web UI via Knox Proxy, and there are firewall restrictions due to which we can not access Spark Web UI since Web UI port range gets random port even if we set explicitly.

This Change will check if there is a specified port range explicitly mentioned so that it does not assign a random port.

## Does this PR introduce any user-facing change?
No

## How was this patch tested?
Local Tested.

Closes #28880 from rajatahujaatinmobi/ahujarajat261/SPARK-32039-change-yarn-webui-port-range-with-property-latest-spark.

Authored-by: Rajat Ahuja <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
### What changes were proposed in this pull request?

This patch fixes wrong groupBy result if the grouping key is a null-value struct.

### Why are the changes needed?

`NormalizeFloatingNumbers` reconstructs a struct if input expression is StructType. If the input struct is null, it will reconstruct a struct with null-value fields, instead of null.

### Does this PR introduce _any_ user-facing change?

Yes, fixing incorrect groupBy result.

### How was this patch tested?

Unit test.

Closes #28962 from viirya/SPARK-32136.

Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
### What changes were proposed in this pull request?
Correct file seprate use in `ExecutorDiskUtils.createNormalizedInternedPathname` on Windows

### Why are the changes needed?
`ExternalShuffleBlockResolverSuite` failed on Windows, see detail at:
https://issues.apache.org/jira/browse/SPARK-32121

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
The existed test suite.

Closes #28940 from pan3793/SPARK-32121.

Lead-authored-by: pancheng <[email protected]>
Co-authored-by: chengpan <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
### What changes were proposed in this pull request?

Minor fix so that the documentation of `getActiveSession` is fixed.
The sample code snippet doesn't come up formatted rightly, added spacing for this to be fixed.
Also added return to docs.

### Why are the changes needed?

The sample code is getting mixed up as description in the docs.

[Current Doc Link](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=getactivesession#pyspark.sql.SparkSession.getActiveSession)

![image](https://user-images.githubusercontent.com/6907950/86331522-d7b6f800-bc66-11ea-998c-42085f5e5b04.png)

### Does this PR introduce _any_ user-facing change?

Yes, documentation of getActiveSession is fixed.
And added description about return.

### How was this patch tested?

Adding a spacing between description and code seems to fix the issue.

Closes #28978 from animenon/docs_minor.

Authored-by: animenon <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
### What changes were proposed in this pull request?

For queries like `t1d in (SELECT t2d FROM  t2 ORDER  BY t2c LIMIT 2)`, the result can be non-deterministic as the result of the subquery may output different results (it's not sorted by `t2d` and it has shuffle).

This PR makes the test more robust by sorting the output column.

### Why are the changes needed?

avoid flaky test

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

N/A

Closes #28976 from cloud-fan/small.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
## What changes were proposed in this pull request?
Check the namespace existence while calling "use namespace", and throw NoSuchNamespaceException if namespace not exists.

### Why are the changes needed?
Users need to know that the namespace does not exist when they try to set a wrong namespace.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Run all suites and add a test for this

Closes #27900 from stczwd/SPARK-31100.

Authored-by: stczwd <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
…s from in HiveExternalCatalogSuite

### What changes were proposed in this pull request?
1.Merge two similar tests for SPARK-31061 and make the code clean.
2.Fix table alter issue due to lose path.

### Why are the changes needed?
Because this two tests for SPARK-31061 is very similar and could be merged.
And the first test case should use `rawTable` instead of `parquetTable` to alter.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Unit test.

Closes #28980 from TJX2014/master-follow-merge-spark-31061-test-case.

Authored-by: TJX2014 <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
…mark

### What changes were proposed in this pull request?
Set the JSON option `inferTimestamp` to `true` for the cases that measure perf of timestamp inference.

### Why are the changes needed?
The PR #28966 disabled timestamp inference by default. As a consequence, some benchmarks don't measure perf of timestamp inference from JSON fields. This PR explicitly enable such inference.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By re-generating results of `JsonBenchmark`.

Closes #28981 from MaxGekk/json-inferTimestamps-disable-by-default-followup.

Authored-by: Max Gekk <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
… table

### What changes were proposed in this pull request?

docs/sql-ref-syntax-qry-select-usedb.md -> docs/sql-ref-syntax-ddl-usedb.md
docs/sql-ref-syntax-aux-refresh-table.md -> docs/sql-ref-syntax-aux-cache-refresh-table.md

### Why are the changes needed?
usedb belongs to DDL. Its location should be consistent with other DDL commands file locations
similar reason for refresh table

### Does this PR introduce _any_ user-facing change?
before change, when clicking USE DATABASE, the side bar menu shows select commands
<img width="1200" alt="Screen Shot 2020-07-04 at 9 05 35 AM" src="https://user-images.githubusercontent.com/13592258/86516696-b45f8a80-bdd7-11ea-8dba-3a5cca22aad3.png">

after change, when clicking USE DATABASE, the side bar menu shows DDL commands
<img width="1120" alt="Screen Shot 2020-07-04 at 9 06 06 AM" src="https://user-images.githubusercontent.com/13592258/86516703-bf1a1f80-bdd7-11ea-8a90-ae7eaaafd44c.png">

before change, when clicking refresh table, the side bar menu shows Auxiliary statements
<img width="1200" alt="Screen Shot 2020-07-04 at 9 30 40 AM" src="https://user-images.githubusercontent.com/13592258/86516877-3d2af600-bdd9-11ea-9568-0a6f156f57da.png">

after change, when clicking refresh table, the side bar menu shows Cache statements
<img width="1199" alt="Screen Shot 2020-07-04 at 9 35 21 AM" src="https://user-images.githubusercontent.com/13592258/86516937-b4f92080-bdd9-11ea-8ad1-5f5a7f58d76b.png">

### How was this patch tested?
Manually build and check

Closes #28995 from huaxingao/docs_fix.

Authored-by: Huaxin Gao <[email protected]>
Signed-off-by: Huaxin Gao <[email protected]>
@GuoPhilipse GuoPhilipse merged commit dc24541 into GuoPhilipse:master Jul 5, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.