6e0235a
[SPARK-33976][SQL] Spark script TRANSFORM related change doc
AngersZhuuuu Jan 4, 2021
2 changes: 2 additions & 0 deletions docs/_data/menu-sql.yaml
@@ -188,6 +188,8 @@
url: sql-ref-syntax-qry-select-lateral-view.html
- text: PIVOT Clause
url: sql-ref-syntax-qry-select-pivot.html
- text: TRANSFORM Clause
url: sql-ref-syntax-qry-select-transform.html
- text: EXPLAIN
url: sql-ref-syntax-qry-explain.html
- text: Auxiliary Statements
297 changes: 297 additions & 0 deletions docs/sql-ref-syntax-qry-select-transform.md
@@ -0,0 +1,297 @@
---
layout: global
title: TRANSFORM
displayTitle: TRANSFORM
license: |
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
---

### Description

The `TRANSFORM` clause specifies a Hive-style transform (`SELECT TRANSFORM`/`MAP`/`REDUCE`)
query specification that transforms the input by forking and running the specified script. Users can
plug their own custom mappers and reducers into the data stream by using features natively supported
in the Spark/Hive language. For example, to run a custom mapper script, `map_script`, and a custom
reducer script, `reduce_script`, the user can issue a query that uses the `TRANSFORM`
clause to embed the mapper and the reducer scripts.

Member: Maybe tick-quote map_script, reduce_script

Contributor Author: Done

Currently, Spark's script transform supports two modes:

1. Without Hive: Spark SQL runs without Hive support. In this mode, the default format
treats data as STRING and uses Spark's own SerDe.
2. With Hive: Spark SQL runs with Hive support. In this mode, the default format
is treated as the Hive default format, and Hive-supported SerDes can be used to process data.

In both modes, with the default format, columns are converted to STRING and delimited by TAB before being fed
to the user script. Similarly, all NULL values are converted to the literal string `\N` in order to
differentiate NULL values from empty strings. The standard output of the user script is treated as
TAB-separated STRING columns, any cell containing only `\N` is re-interpreted as NULL, and then the
resulting STRING column is cast to the data type specified in the table declaration in the usual way.
User scripts can write debug information to standard error, which is shown on the task detail page on Hadoop.
These defaults can be overridden with `ROW FORMAT DELIMITED`.
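
To make the default format concrete, here is a minimal Python sketch of the kind of user script that could be passed to `USING`. It assumes the default TAB-delimited input with `\N` standing for NULL, as described above. The names `decode_row`, `encode_row` and `main`, and the upper-casing step, are hypothetical and chosen only for illustration; they are not part of Spark.

```python
import sys

NULL_MARKER = r"\N"  # literal backslash-N, the default NULL encoding described above


def decode_row(line):
    """Split one TAB-delimited input line into fields, mapping \\N to None."""
    fields = line.rstrip("\n").split("\t")
    return [None if f == NULL_MARKER else f for f in fields]


def encode_row(fields):
    """Join fields with TABs, writing None back as \\N."""
    return "\t".join(NULL_MARKER if f is None else str(f) for f in fields)


def main(stdin=sys.stdin, stdout=sys.stdout):
    """Read rows from stdin, transform them, and write rows to stdout."""
    for line in stdin:
        row = decode_row(line)
        # Hypothetical transformation: upper-case every non-NULL field.
        row = [f.upper() if f is not None else None for f in row]
        stdout.write(encode_row(row) + "\n")
```

A real script file would end with a call to `main()` so that it processes standard input when Spark forks it. Keeping NULL handling in two small helpers means the script never confuses an empty string with a NULL.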

### Syntax

```sql
rowFormat
: ROW FORMAT SERDE serde_class [ WITH SERDEPROPERTIES serde_props ]
| ROW FORMAT DELIMITED
[ FIELDS TERMINATED BY fields_terminated_char [ ESCAPED BY escapedBy ] ]
[ COLLECTION ITEMS TERMINATED BY collectionItemsTerminatedBy ]
[ MAP KEYS TERMINATED BY keysTerminatedBy ]
[ LINES TERMINATED BY linesSeparatedBy ]
[ NULL DEFINED AS nullDefinedAs ]

inRowFormat=rowFormat
outRowFormat=rowFormat
namedExpressionSeq = named_expression [ , ... ]

transformClause:
SELECT [ TRANSFORM ( namedExpressionSeq ) | MAP namedExpressionSeq | REDUCE namedExpressionSeq ]
[ inRowFormat ]
[ RECORDWRITER recordWriter_class ]
USING script
[ AS ( [ col_name [ col_type ]] [ , ... ] ) ]
[ outRowFormat ]
[ RECORDREADER recordReader_class ]
[ WHERE boolean_expression ]
[ GROUP BY expression [ , ... ] ]
[ HAVING boolean_expression ]
```

### Parameters

* **named_expression**

An expression with an assigned name. In general, it denotes a column expression.

Contributor: update it please

Contributor Author: same reason as #31010 (comment)

Contributor Author: updated

**Syntax:** `expression [AS] [alias]`

* **row_format**

Use the `SERDE` clause to specify a custom SerDe for one table. Otherwise, use the `DELIMITED` clause to use the native SerDe and specify the delimiter, escape character, null character and so on.

* **SERDE**

Specifies a custom SerDe for one table.

* **serde_class**

Specifies a fully-qualified class name of a custom SerDe.

* **SERDEPROPERTIES**

Contributor: shall we document serde_props instead and explain its syntax?

Contributor Author: Done

A list of key-value pairs that is used to tag the SerDe definition.

* **DELIMITED**

The `DELIMITED` clause can be used to specify the native SerDe and state the delimiter, escape character, null character and so on.

* **FIELDS TERMINATED BY**

Used to define a column separator.

* **COLLECTION ITEMS TERMINATED BY**

Used to define a collection item separator.

* **MAP KEYS TERMINATED BY**

Used to define a map key separator.

* **LINES TERMINATED BY**

Used to define a row separator.

* **NULL DEFINED AS**

Used to define the specific value for NULL.

* **ESCAPED BY**

Used to define the escape character.

* **RECORDREADER**

Specifies a custom RecordReader for one table.

Contributor: shall we mention that users should specify a full class name?

Contributor Author: removed

Member: What does "removed" mean? I couldn't find the fix corresponding to the @cloud-fan comment above.

Contributor Author: Sorry, the push failed. I mean this entry duplicates record_reader_class, so I think removing this part is OK.

* **RECORDWRITER**

Specifies a custom RecordWriter for one table.

* **recordReader_class**

Specifies a fully-qualified class name of a custom RecordReader.
Default value is `org.apache.hadoop.hive.ql.exec.TextRecordReader`.

* **recordWriter_class**

Specifies a fully-qualified class name of a custom RecordWriter.
Default value is `org.apache.hadoop.hive.ql.exec.TextRecordWriter`.

* **script**

Specifies a command to process data.

* **boolean_expression**

Specifies any expression that evaluates to a result type `boolean`. Two or
more expressions may be combined together using the logical
operators ( `AND`, `OR` ).

* **expression**

Specifies a combination of one or more values, operators and SQL functions that results in a value.

### Without Hive Support Mode

Spark script transform can run without `-Phive` or `SparkSession.builder.enableHiveSupport()`.
In this case, script transform can only be used with `ROW FORMAT DELIMITED`, and all values passed
to the script are treated as strings.

### With Hive Support Mode

When Spark is built with `-Phive` and Spark SQL is started with `enableHiveSupport()`, script
transform can be used with both Hive SerDe and `ROW FORMAT DELIMITED`.

### Schema-less Script Transforms

Contributor: I'd just document this as the default schema when introducing USING script [ AS ( [ col_name [ col_type ] ] [ , ... ] ) ]

Contributor Author: Hmm, like current change ?

If there is no `AS` clause after `USING my_script`, Spark assumes that the output of the script contains 2 parts:

1. key: the part before the first tab,
2. value: the rest after the first tab.

Note that this is different from specifying `AS key, value`, because in that case `value` contains only the portion
between the first tab and the second tab if there are multiple tabs.
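
The difference between the two behaviours can be sketched in Python. This is only an illustration of the splitting rule described above, not Spark's implementation; the function names are hypothetical, and returning `None` when a line has no tab is an assumption of this sketch.

```python
def split_schemaless(line):
    """Schema-less mode: key is the text before the first TAB, value is the rest."""
    key, sep, value = line.partition("\t")
    # Assumption for this sketch: a line with no TAB yields a None value.
    return key, (value if sep else None)


def split_as_key_value(line):
    """With `AS key, value`: only the first two TAB-separated fields are kept."""
    fields = line.split("\t")
    key = fields[0]
    value = fields[1] if len(fields) > 1 else None
    return key, value


line = "94588\tAnil K\t27"
print(split_schemaless(line))    # ('94588', 'Anil K\t27')
print(split_as_key_value(line))  # ('94588', 'Anil K')
```

With three output fields, the schema-less split keeps the second tab inside `value`, while the explicit two-column form drops everything after the second field.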

### Examples

```sql
CREATE TABLE person (zip_code INT, name STRING, age INT);
INSERT INTO person VALUES
    (94588, 'Zen Hui', 50),
    (94588, 'Dan Li', 18),
    (94588, 'Anil K', 27),
    (94588, 'John V', NULL),
    (94511, 'David K', 42),
    (94511, 'Aryan B.', 18),
    (94511, 'Lalit B.', NULL);

-- With specified output without data type
SELECT TRANSFORM(zip_code, name, age)
    USING 'cat' AS (a, b, c)
FROM person
WHERE zip_code > 94511;
+-------+---------+-----+
|      a|        b|    c|
+-------+---------+-----+
|  94588|   Anil K|   27|
|  94588|   John V| NULL|
|  94588|  Zen Hui|   50|
|  94588|   Dan Li|   18|
+-------+---------+-----+

-- With specified output with data type
SELECT TRANSFORM(zip_code, name, age)
    USING 'cat' AS (a STRING, b STRING, c STRING)
FROM person
WHERE zip_code > 94511;
+-------+---------+-----+
|      a|        b|    c|
+-------+---------+-----+
|  94588|   Anil K|   27|
|  94588|   John V| NULL|
|  94588|  Zen Hui|   50|
|  94588|   Dan Li|   18|
+-------+---------+-----+

-- Using ROW FORMAT DELIMITED
SELECT TRANSFORM(name, age)
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','
    LINES TERMINATED BY '\n'
    NULL DEFINED AS 'NULL'
    USING 'cat' AS (name_age string)
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '@'
    LINES TERMINATED BY '\n'
    NULL DEFINED AS 'NULL'
FROM person;
+---------------+
|       name_age|
+---------------+
|      Anil K,27|
|    John V,null|
|    Aryan B.,18|
|     David K,42|
|     Zen Hui,50|
|      Dan Li,18|
|  Lalit B.,null|
+---------------+

-- Using Hive SerDe
SELECT TRANSFORM(zip_code, name, age)
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
    WITH SERDEPROPERTIES (
      'field.delim' = '\t'
    )
    USING 'cat' AS (a STRING, b STRING, c STRING)
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
    WITH SERDEPROPERTIES (
      'field.delim' = '\t'
    )
FROM person
WHERE zip_code > 94511;
+-------+---------+-----+
|      a|        b|    c|
+-------+---------+-----+
|  94588|   Anil K|   27|
|  94588|   John V| NULL|
|  94588|  Zen Hui|   50|
|  94588|   Dan Li|   18|
+-------+---------+-----+

-- Schema-less mode
SELECT TRANSFORM(zip_code, name, age)
    USING 'cat'
FROM person
WHERE zip_code > 94500;
+-------+-----------------+
|    key|            value|
+-------+-----------------+
|  94588|        Anil K 27|
|  94588|        John V \N|
|  94511|      Aryan B. 18|
|  94511|       David K 42|
|  94588|       Zen Hui 50|
|  94588|        Dan Li 18|
|  94511|      Lalit B. \N|
+-------+-----------------+
```

### Related Statements

* [SELECT Main](sql-ref-syntax-qry-select.html)
* [WHERE Clause](sql-ref-syntax-qry-select-where.html)
* [GROUP BY Clause](sql-ref-syntax-qry-select-groupby.html)
* [HAVING Clause](sql-ref-syntax-qry-select-having.html)
* [ORDER BY Clause](sql-ref-syntax-qry-select-orderby.html)
* [SORT BY Clause](sql-ref-syntax-qry-select-sortby.html)
* [DISTRIBUTE BY Clause](sql-ref-syntax-qry-select-distribute-by.html)
* [LIMIT Clause](sql-ref-syntax-qry-select-limit.html)
* [CASE Clause](sql-ref-syntax-qry-select-case.html)
* [PIVOT Clause](sql-ref-syntax-qry-select-pivot.html)
* [LATERAL VIEW Clause](sql-ref-syntax-qry-select-lateral-view.html)
8 changes: 7 additions & 1 deletion docs/sql-ref-syntax-qry-select.md
@@ -41,7 +41,7 @@ select_statement [ { UNION | INTERSECT | EXCEPT } [ ALL | DISTINCT ] select_stat

While `select_statement` is defined as
```sql
- SELECT [ hints , ... ] [ ALL | DISTINCT ] { [ named_expression | regex_column_names ] [ , ... ] }
+ SELECT [ hints , ... ] [ ALL | DISTINCT ] { [[ named_expression | regex_column_names ] [ , ... ] | TRANSFORM Clause ] }
FROM { from_item [ , ... ] }
[ PIVOT clause ]
[ LATERAL VIEW clause ] [ ... ]
@@ -164,6 +164,11 @@ SELECT [ hints , ... ] [ ALL | DISTINCT ] { [ named_expression | regex_column_na
)
```

* **TRANSFORM**

Specifies a Hive-style transform (`SELECT TRANSFORM`/`MAP`/`REDUCE`) query specification to transform
the input by forking and running the specified script.

### Related Statements

* [WHERE Clause](sql-ref-syntax-qry-select-where.html)
@@ -187,3 +192,4 @@ SELECT [ hints , ... ] [ ALL | DISTINCT ] { [ named_expression | regex_column_na
* [CASE Clause](sql-ref-syntax-qry-select-case.html)
* [PIVOT Clause](sql-ref-syntax-qry-select-pivot.html)
* [LATERAL VIEW Clause](sql-ref-syntax-qry-select-lateral-view.html)
* [TRANSFORM Clause](sql-ref-syntax-qry-select-transform.html)
1 change: 1 addition & 0 deletions docs/sql-ref-syntax-qry.md
@@ -49,4 +49,5 @@ ability to generate logical and physical plan for a given query using
* [CASE Clause](sql-ref-syntax-qry-select-case.html)
* [PIVOT Clause](sql-ref-syntax-qry-select-pivot.html)
* [LATERAL VIEW Clause](sql-ref-syntax-qry-select-lateral-view.html)
* [TRANSFORM Clause](sql-ref-syntax-qry-select-transform.html)
* [EXPLAIN Statement](sql-ref-syntax-qry-explain.html)
1 change: 1 addition & 0 deletions docs/sql-ref-syntax.md
@@ -70,6 +70,7 @@ Spark SQL is Apache Spark's module for working with structured data. The SQL Syn
* [CASE Clause](sql-ref-syntax-qry-select-case.html)
* [PIVOT Clause](sql-ref-syntax-qry-select-pivot.html)
* [LATERAL VIEW Clause](sql-ref-syntax-qry-select-lateral-view.html)
* [TRANSFORM Clause](sql-ref-syntax-qry-select-transform.html)
* [EXPLAIN](sql-ref-syntax-qry-explain.html)

### Auxiliary Statements