apache · AngersZhuuuu · Jan 4, 2021 · Apr 1, 2021 · Apr 1, 2021 · Apr 1, 2021
diff --git a/docs/_data/menu-sql.yaml b/docs/_data/menu-sql.yaml
@@ -188,6 +188,8 @@
                   url: sql-ref-syntax-qry-select-lateral-view.html
                 - text: PIVOT Clause
                   url: sql-ref-syntax-qry-select-pivot.html
+                - text: TRANSFORM Clause
+                  url: sql-ref-syntax-qry-select-transform.html
             - text: EXPLAIN
               url: sql-ref-syntax-qry-explain.html
         - text: Auxiliary Statements

diff --git a/docs/sql-ref-syntax-qry-select-transform.md b/docs/sql-ref-syntax-qry-select-transform.md
@@ -0,0 +1,297 @@
+---
+layout: global
+title: TRANSFORM
+displayTitle: TRANSFORM
+license: |
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+ 
+     http://www.apache.org/licenses/LICENSE-2.0
+ 
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+---
+
+### Description
+
+The `TRANSFORM` clause is used to specifies a hive-style transform (SELECT TRANSFORM/MAP/REDUCE)
+query specification to transform the input by forking and running the specified script. Users can
+plug in their own custom mappers and reducers in the data stream by using features natively supported
+in the Spark/Hive language. e.g. in order to run a custom mapper script - map_script - and a custom
+reducer script - reduce_script - the user can issue the following command which uses the TRANSFORM
+clause to embed the mapper and the reducer scripts.
+
+Currently, Spark's script transform support two mode:
+
+    1. Without Hive: It means we run Spark SQL without hive support, in this mode, we can use default format 
+       by treating data as STRING and use Spark's own SerDe.
+    2. WIth Hive: It means we run Spark SQL with Hive support, in this mode, when we use default format, 
+       it will be treated as Hive default fomat. And we can use Hive supported SerDe to process data.
+
+In both mode with default format, columns will be transformed to STRING and delimited by TAB before feeding
+to the user script, Similarly, all NULL values will be converted to the literal string \N in order to
+differentiate NULL values from empty strings. The standard output of the user script will be treated as
+TAB-separated STRING columns, any cell containing only \N will be re-interpreted as a NULL, and then the
+resulting STRING column will be cast to the data type specified in the table declaration in the usual way.
+User scripts can output debug information to standard error which will be shown on the task detail page on hadoop.
+These defaults can be overridden with `ROW FORMAT DELIMITED`.
+
+### Syntax
+
+```sql
+rowFormat
+    : ROW FORMAT SERDE serde_class [ WITH SERDEPROPERTIES serde_props ]
+    | ROW FORMAT DELIMITED
+       [ FIELDS TERMINATED BY fields_terminated_char [ ESCAPED BY escapedBy ] ]
+       [ COLLECTION ITEMS TERMINATED BY collectionItemsTerminatedBy ]
+       [ MAP KEYS TERMINATED BY keysTerminatedBy ]
+       [ LINES TERMINATED BY linesSeparatedBy ]
+       [ NULL DEFINED AS nullDefinedAs ]  
+
+inRowFormat=rowFormat
+outRowFormat=rowFormat
+namedExpressionSeq = named_expression [ , ... ]
+
+transformClause:
+  SELECT [ TRANSFORM ( namedExpressionSeq ) | MAP namedExpressionSeq | REDUCE namedExpressionSeq ]
+    [ inRowFormat ]
+    [ RECORDWRITER recordWriter_class ]
+    USING script
+    [ AS ( [ col_name [ col_type ]] [ , ... ] ) ]
+    [ outRowFormat ]
+    [ RECORDREADER recordReader_class ]
+  [ WHERE boolean_expression  ]
+  [ GROUP BY expression [ , ... ] ]
+  [ HAVING boolean_expression ]
+```
+
+### Parameters
+
+* **named_expression**
+
+    An expression with an assigned name. In general, it denotes a column expression.
+
+    **Syntax:** `expression [AS] [alias]`
+
+* **row_format**    
+
+    Use the `SERDE` clause to specify a custom SerDe for one table. Otherwise, use the `DELIMITED` clause to use the native SerDe and specify the delimiter, escape character, null character and so on.
+
+* **SERDE**
+
+    Specifies a custom SerDe for one table.
+
+* **serde_class**
+
+    Specifies a fully-qualified class name of a custom SerDe.
+
+* **SERDEPROPERTIES**
+
+    A list of key-value pairs that is used to tag the SerDe definition.
+
+* **DELIMITED**
+
+    The `DELIMITED` clause can be used to specify the native SerDe and state the delimiter, escape character, null character and so on.
+
+* **FIELDS TERMINATED BY**
+
+    Used to define a column separator.
+
+* **COLLECTION ITEMS TERMINATED BY**
+
+    Used to define a collection item separator.
+
+* **MAP KEYS TERMINATED BY**
+
+    Used to define a map key separator.
+
+* **LINES TERMINATED BY**
+
+    Used to define a row separator.
+
+* **NULL DEFINED AS**
+
+    Used to define the specific value for NULL.
+
+* **ESCAPED BY**
+
+    Used for escape mechanism.
+
+* **RECORDREADER**
+
+    Specifies a custom RecordReader for one table.
+
+* **RECORDWRITER**
+
+    Specifies a custom RecordWriter for one table.
+
+* **recordReader_class**
+
+    Specifies a fully-qualified class name of a custom RecordReader. 
+    Default value is `org.apache.hadoop.hive.ql.exec.TextRecordReader`
+
+* **recordWriter_class**
+
+    Specifies a fully-qualified class name of a custom RecordWriter. 
+    Default value is `org.apache.hadoop.hive.ql.exec.TextRecordWriter`.
+
+* **script**
+
+    Specify a command to process data.
+
+* **boolean_expression**
+
+    Specifies any expression that evaluates to a result type `boolean`. Two or
+    more expressions may be combined together using the logical
+    operators ( `AND`, `OR` ).
+
+* **expression**
+
+    Specifies combination of one or more values, operators and SQL functions that results in a value.
+
+### Without Hive support Mode
+
+Now Spark Script transform can run without `-Phive` or `SparkSession.builder.enableHiveSupport()`.
+In this case, now we only use script transform with `ROW FORMAT DELIMIT` and treat all value passed
+to script as string. 
+
+### With Hive Support Mode
+
+When build Spark with `-Phive` and start Spark SQL with `enableHiveSupport()`, we can use script 
+transform with Hive SerDe and both `ROW FORMAT DELIMIT`.
+
+### Schema-less Script Transforms
+
+If there don't have AS clause after USING my_script, Spark assumes that the output of the script contains 2 parts:
+
+   1. key: which is before the first tab, 
+   2. value: which is the rest after the first tab.
+
+Note that this is different from specifying AS key, value because in that case, value will only contain the portion
+between the first tab and the second tab if there are multiple tabs. 
+
+### Examples
+
+```sql
+CREATE TABLE person (zip_code INT, name STRING, age INT);
+INSERT INTO person VALUES
+    (94588, 'Zen Hui', 50),
+    (94588, 'Dan Li', 18),
+    (94588, 'Anil K', 27),
+    (94588, 'John V', NULL),
+    (94511, 'David K', 42),
+    (94511, 'Aryan B.', 18),
+    (94511, 'Lalit B.', NULL);
+
+-- With specified out put without data type
+SELECT TRANSFORM(zip_code, name, age)
+ USING 'cat' AS (a, b, c)
+FROM person
+WHERE zip_code > 94511;
++-------+---------+-----+
+|    a  |        b|    c|
++-------+---------+-----+
+|  94588|   Anil K|	  27|
+|  94588|   John V|	NULL|
+|  94588|  Zen Hui|	  50|
+|  94588|   Dan Li|   18|
++-------+---------+-----+
+
+-- With specified out put without data type
+SELECT TRANSFORM(zip_code, name, age)
+ USING 'cat' AS (a STRING, b STRING, c STRING)
+FROM person
+WHERE zip_code > 94511;
++-------+---------+-----+
+|    a  |        b|    c|
++-------+---------+-----+
+|  94588|   Anil K|	  27|
+|  94588|   John V|	NULL|
+|  94588|  Zen Hui|	  50|
+|  94588|   Dan Li|   18|
++-------+---------+-----+
+
+-- ROW FORMAT DELIMIT 
+SELECT TRANSFORM(name, age)
+  ROW FORMAT DELIMITED
+  FIELDS TERMINATED BY ','
+  LINES TERMINATED BY '\n'
+  NULL DEFINED AS 'NULL'
+  USING 'cat' AS (name_age string)
+  ROW FORMAT DELIMITED
+  FIELDS TERMINATED BY '@'
+  LINES TERMINATED BY '\n'
+  NULL DEFINED AS 'NULL'
+FROM person;
++---------------+
+|       name_age|
++---------------+
+|      Anil K,27|
+|    John V,null|
+|     ryan B.,18|
+|     David K,42|
+|     Zen Hui,50|
+|      Dan Li,18|
+|  Lalit B.,null|
++---------------+
+
+-- Hive Serde
+SELECT TRANSFORM(zip_code, name, age)
+  ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
+  WITH SERDEPROPERTIES (
+    'field.delim' = '\t'
+  )
+  USING 'cat' AS (a STRING, b STRING, c STRING)
+  ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
+  WITH SERDEPROPERTIES (
+    'field.delim' = '\t'
+  )
+FROM person
+WHERE zip_code > 94511;
++-------+---------+-----+
+|    a  |        b|    c|
++-------+---------+-----+
+|  94588|   Anil K|	  27|
+|  94588|   John V|	NULL|
+|  94588|  Zen Hui|	  50|
+|  94588|   Dan Li|   18|
++-------+---------+-----+
+
+-- Schema-less mode
+SELECT TRANSFORM(zip_code, name, age)
+  USING 'cat'
+FROM person
+WHERE zip_code > 94500;
++-------+-----------------+
+|    key|            value|
++-------+-----------------+
+|  94588|	  Anil K	27|
+|  94588|	  John V	\N|
+|  94511|	Aryan B.	18|
+|  94511|	 David K	42|
+|  94588|	 Zen Hui	50|
+|  94588|	  Dan Li	18|
+|  94511|	Lalit B.	\N|
++-------+-----------------+
+```
+
+### Related Statements
+
+* [SELECT Main](sql-ref-syntax-qry-select.html)
+* [WHERE Clause](sql-ref-syntax-qry-select-where.html)
+* [GROUP BY Clause](sql-ref-syntax-qry-select-groupby.html)
+* [HAVING Clause](sql-ref-syntax-qry-select-having.html)
+* [ORDER BY Clause](sql-ref-syntax-qry-select-orderby.html)
+* [SORT BY Clause](sql-ref-syntax-qry-select-sortby.html)
+* [DISTRIBUTE BY Clause](sql-ref-syntax-qry-select-distribute-by.html)
+* [LIMIT Clause](sql-ref-syntax-qry-select-limit.html)
+* [CASE Clause](sql-ref-syntax-qry-select-case.html)
+* [PIVOT Clause](sql-ref-syntax-qry-select-pivot.html)
+* [LATERAL VIEW Clause](sql-ref-syntax-qry-select-lateral-view.html)
diff --git a/docs/sql-ref-syntax-qry-select.md b/docs/sql-ref-syntax-qry-select.md
@@ -41,7 +41,7 @@ select_statement [ { UNION | INTERSECT | EXCEPT } [ ALL | DISTINCT ] select_stat
 
 While `select_statement` is defined as
 ```sql
-SELECT [ hints , ... ] [ ALL | DISTINCT ] { [ named_expression | regex_column_names ] [ , ... ] }
+SELECT [ hints , ... ] [ ALL | DISTINCT ] { [[ named_expression | regex_column_names ] [ , ... ] | TRANSFORM Clause ] }
     FROM { from_item [ , ... ] }
     [ PIVOT clause ]
     [ LATERAL VIEW clause ] [ ... ] 
@@ -164,6 +164,11 @@ SELECT [ hints , ... ] [ ALL | DISTINCT ] { [ named_expression | regex_column_na
      )
      ```
 
+* **TRANSFORM**
+
+     Specifies a hive-style transform (SELECT TRANSFORM/MAP/REDUCE) query specification to transform
+     the input by forking and running the specified script.
+
 ### Related Statements
 
 * [WHERE Clause](sql-ref-syntax-qry-select-where.html)
@@ -187,3 +192,4 @@ SELECT [ hints , ... ] [ ALL | DISTINCT ] { [ named_expression | regex_column_na
 * [CASE Clause](sql-ref-syntax-qry-select-case.html)
 * [PIVOT Clause](sql-ref-syntax-qry-select-pivot.html)
 * [LATERAL VIEW Clause](sql-ref-syntax-qry-select-lateral-view.html)
+* [TRANSFORM Clause](sql-ref-syntax-qry-select-transform.html)
diff --git a/docs/sql-ref-syntax-qry.md b/docs/sql-ref-syntax-qry.md
@@ -49,4 +49,5 @@ ability to generate logical and physical plan for a given query using
   * [CASE Clause](sql-ref-syntax-qry-select-case.html)
   * [PIVOT Clause](sql-ref-syntax-qry-select-pivot.html)
   * [LATERAL VIEW Clause](sql-ref-syntax-qry-select-lateral-view.html)
+  * [TRANSFORM Clause](sql-ref-syntax-qry-select-transform.html)
 * [EXPLAIN Statement](sql-ref-syntax-qry-explain.html)
diff --git a/docs/sql-ref-syntax.md b/docs/sql-ref-syntax.md
@@ -70,6 +70,7 @@ Spark SQL is Apache Spark's module for working with structured data. The SQL Syn
    * [CASE Clause](sql-ref-syntax-qry-select-case.html)
    * [PIVOT Clause](sql-ref-syntax-qry-select-pivot.html)
    * [LATERAL VIEW Clause](sql-ref-syntax-qry-select-lateral-view.html)
+   * [TRANSFORM Clause](sql-ref-syntax-qry-select-transform.html)
  * [EXPLAIN](sql-ref-syntax-qry-explain.html)
 
 ### Auxiliary Statements