Closed
Changes from 6 commits
2 changes: 2 additions & 0 deletions docs/_data/menu-sql.yaml
@@ -168,6 +168,8 @@
url: sql-ref-syntax-qry-select-inline-table.html
- text: Common Table Expression
url: sql-ref-syntax-qry-select-cte.html
- text: Window Function
url: sql-ref-syntax-qry-window.html
- text: EXPLAIN
url: sql-ref-syntax-qry-explain.html
- text: Auxiliary Statements
194 changes: 191 additions & 3 deletions docs/sql-ref-syntax-qry-window.md
@@ -1,7 +1,7 @@
---
layout: global
title: Windowing Analytic Functions
displayTitle: Windowing Analytic Functions
title: Window Functions
displayTitle: Window Functions
license: |
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
@@ -19,4 +19,192 @@ license: |
limitations under the License.
---

**This page is under construction**
### Description

Window functions operate on a group of rows, referred to as a window, and calculate a return value for each row based on that group. They are useful for tasks such as calculating a moving average, computing a cumulative statistic, or accessing the values of rows at a given position relative to the current row.

Spark SQL supports three types of window functions:

Contributor Author

Does this look better now? @maropu @viirya

Contributor Author

Also cc @srowen
Please feel free to rephrase. Thanks!

Member

computing a cumulative -> computing a cumulative sum (or anything similar: average, statistic)

Member

Thanks, it looks better. How about putting the last statement on a new line?

...the current row. 

Spark SQL supports three types of window functions:

  * Ranking Functions
  * Analytic Functions
  * Aggregate Functions

Member

Do we need this list here? The Syntax section has the same list.

* Ranking Functions
* Analytic Functions
* Aggregate Functions
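
As a rough illustration of the idea in the description (not of how Spark executes it), the Python sketch below computes a per-group running total and returns one result for every input row, which is what distinguishes a window function from a `GROUP BY` aggregate. The function name `running_total`, the sample rows, and the column names are all made up for this example; the running total corresponds to a frame of `ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW`.

```python
from itertools import groupby
from operator import itemgetter


def running_total(rows, partition_key, order_key, value_key):
    """Return (row, total) pairs; the total restarts in each partition.

    Hypothetical helper for illustration only. It mimics
    SUM(value) OVER (PARTITION BY partition ORDER BY order
                     ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW).
    """
    # Sort by partition first, then by the ordering column inside it.
    ordered = sorted(rows, key=lambda r: (r[partition_key], r[order_key]))
    out = []
    for _, group in groupby(ordered, key=itemgetter(partition_key)):
        total = 0
        for row in group:
            total += row[value_key]
            out.append((row, total))  # one output per input row
    return out


# Made-up sample data: two groups, unordered on input.
sample = [
    {"dept": "A", "day": 2, "amount": 5},
    {"dept": "B", "day": 1, "amount": 7},
    {"dept": "A", "day": 1, "amount": 10},
]

for row, total in running_total(sample, "dept", "day", "amount"):
    print(row["dept"], row["day"], row["amount"], total)
```

Every input row appears in the result with its own total, and the total for group `B` starts again from zero.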

### Syntax

{% highlight sql %}
window_function OVER
( [ { PARTITION | DISTRIBUTE } BY expression [ , ... ] ]
{ ORDER | SORT } BY expression [ ASC | DESC ] [ NULLS { FIRST | LAST } ] [ , ... ]
[ window_frame ] )
{% endhighlight %}

### Parameters

<dl>
<dt><code><em>window_function</em></code></dt>
<dd>
<ul>
<li>Ranking Functions
<br>
<b>Syntax:</b>
<code>
RANK | DENSE_RANK | PERCENT_RANK | NTILE | ROW_NUMBER
</code>
</li>
</ul>
<ul>
<li>Analytic Functions
<br>
<b>Syntax:</b>
<code>
CUME_DIST | LAG | LEAD
</code>
</li>
</ul>
<ul>
<li>Aggregate Functions
<br>
<b>Syntax:</b>
<code>
MAX | MIN | COUNT | SUM | AVG | ...
</code>
<br>
Please refer to the <a href="api/sql/">Built-in Functions</a> document for a complete list of Spark aggregate functions.
</li>
</ul>
</dd>
</dl>
<dl>
<dt><code><em>window_frame</em></code></dt>
<dd>
Specifies which row to start the window on and where to end it.<br><br>
<b>Syntax:</b><br>
<code>
{ RANGE | ROWS } { frame_start | BETWEEN frame_start AND frame_end }
</code> <br><br>
<code>frame_start</code> and <code>frame_end</code> each take one of the following forms:<br>
<code>
UNBOUNDED PRECEDING | offset PRECEDING | CURRENT ROW | offset FOLLOWING | UNBOUNDED FOLLOWING
</code> <br><br>
Member

I think we need to describe what these clauses (RANGE, ROWS, BETWEEN, ...) are.

<code>offset</code><br>
Specifies a numeric expression giving the offset of the frame boundary from the current row.
</dd>
</dl>
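
The difference between `ROWS` and `RANGE` frames is easiest to see on data with ties. The Python sketch below is only an illustration under simplifying assumptions: the helper `frame_sums` is hypothetical, it handles a single partition whose values are already sorted ascending, the same column serves as both the ordering key and the summed value, and the frame is fixed to the form `n PRECEDING AND CURRENT ROW` with a numeric bound `n`.

```python
def frame_sums(values, mode, preceding):
    """Sum each row's frame `preceding PRECEDING AND CURRENT ROW`.

    Hypothetical helper: `values` must be sorted ascending and acts as
    both the ORDER BY key and the column being summed.
    """
    sums = []
    for i, current in enumerate(values):
        if mode == "ROWS":
            # Physical offset: count back `preceding` rows by position.
            frame = values[max(0, i - preceding): i + 1]
        elif mode == "RANGE":
            # Logical offset: every row whose key lies within `preceding`
            # of the current key, which includes ties of the current row.
            frame = [v for v in values if current - preceding <= v <= current]
        else:
            raise ValueError(f"unknown frame mode: {mode}")
        sums.append(sum(frame))
    return sums


keys = [1, 2, 2, 5]
print(frame_sums(keys, "ROWS", 1))   # [1, 3, 4, 7]
print(frame_sums(keys, "RANGE", 1))  # [1, 5, 5, 5]
```

With `ROWS`, the two tied rows get different results because they sit at different positions; with `RANGE`, they share the same frame and therefore the same result.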

### Examples

{% highlight sql %}

Member

nit: remove this blank.

CREATE TABLE employees (name STRING, dept STRING, salary INT, age INT);

INSERT INTO employees VALUES ("Lisa", "Sales", 10000, 35);
INSERT INTO employees VALUES ("Evan", "Sales", 32000, 38);
INSERT INTO employees VALUES ("Fred", "Engineering", 21000, 28);
INSERT INTO employees VALUES ("Alex", "Sales", 30000, 33);
INSERT INTO employees VALUES ("Tom", "Engineering", 23000, 33);
INSERT INTO employees VALUES ("Jane", "Marketing", 29000, 28);
INSERT INTO employees VALUES ("Jeff", "Marketing", 35000, 38);
INSERT INTO employees VALUES ("Paul", "Engineering", 29000, 23);
INSERT INTO employees VALUES ("Chloe", "Engineering", 23000, 25);
INSERT INTO employees VALUES ("Helen", "Marketing", 29000, 40);

SELECT * FROM employees;
+-----+-----------+------+-----+
| name|       dept|salary|  age|
+-----+-----------+------+-----+
|Chloe|Engineering| 23000|   25|
| Fred|Engineering| 21000|   28|
| Paul|Engineering| 29000|   23|
|Helen|  Marketing| 29000|   40|
|  Tom|Engineering| 23000|   33|
| Jane|  Marketing| 29000|   28|
| Jeff|  Marketing| 35000|   38|
| Evan|      Sales| 32000|   38|
| Lisa|      Sales| 10000|   35|
| Alex|      Sales| 30000|   33|
+-----+-----------+------+-----+

SELECT name, dept, salary, RANK() OVER (PARTITION BY dept ORDER BY salary) AS rank FROM employees;
+-----+-----------+------+----+
| name|       dept|salary|rank|
+-----+-----------+------+----+
| Lisa|      Sales| 10000|   1|
| Alex|      Sales| 30000|   2|
| Evan|      Sales| 32000|   3|
| Fred|Engineering| 21000|   1|
|  Tom|Engineering| 23000|   2|
|Chloe|Engineering| 23000|   2|
| Paul|Engineering| 29000|   4|
|Helen|  Marketing| 29000|   1|
| Jane|  Marketing| 29000|   1|
| Jeff|  Marketing| 35000|   3|
+-----+-----------+------+----+

SELECT name, dept, salary, DENSE_RANK() OVER (PARTITION BY dept ORDER BY salary ROWS BETWEEN
UNBOUNDED PRECEDING AND CURRENT ROW) AS dense_rank FROM employees;
+-----+-----------+------+----------+
| name|       dept|salary|dense_rank|
+-----+-----------+------+----------+
| Lisa|      Sales| 10000|         1|
| Alex|      Sales| 30000|         2|
| Evan|      Sales| 32000|         3|
| Fred|Engineering| 21000|         1|
|  Tom|Engineering| 23000|         2|
|Chloe|Engineering| 23000|         2|
| Paul|Engineering| 29000|         3|
|Helen|  Marketing| 29000|         1|
| Jane|  Marketing| 29000|         1|
| Jeff|  Marketing| 35000|         2|
+-----+-----------+------+----------+

SELECT name, dept, age, CUME_DIST() OVER (PARTITION BY dept ORDER BY age
RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS cume_dist FROM employees;
Member

nit: as -> AS here, too.

Contributor Author

I will fix this

+-----+-----------+------+------------------+
| name|       dept|   age|         cume_dist|
+-----+-----------+------+------------------+
| Alex|      Sales|    33|0.3333333333333333|
| Lisa|      Sales|    35|0.6666666666666666|
| Evan|      Sales|    38|               1.0|
| Paul|Engineering|    23|              0.25|
|Chloe|Engineering|    25|               0.5|
| Fred|Engineering|    28|              0.75|
|  Tom|Engineering|    33|               1.0|
| Jane|  Marketing|    28|0.3333333333333333|
| Jeff|  Marketing|    38|0.6666666666666666|
|Helen|  Marketing|    40|               1.0|
+-----+-----------+------+------------------+

SELECT name, dept, salary, MIN(salary) OVER (PARTITION BY dept ORDER BY salary) AS min
FROM employees;
+-----+-----------+------+-----+
| name|       dept|salary|  min|
+-----+-----------+------+-----+
| Lisa|      Sales| 10000|10000|
| Alex|      Sales| 30000|10000|
| Evan|      Sales| 32000|10000|
|Helen|  Marketing| 29000|29000|
| Jane|  Marketing| 29000|29000|
| Jeff|  Marketing| 35000|29000|
| Fred|Engineering| 21000|21000|
|  Tom|Engineering| 23000|21000|
|Chloe|Engineering| 23000|21000|
| Paul|Engineering| 29000|21000|
+-----+-----------+------+-----+

SELECT name, dept, salary,
LAG(salary) OVER (PARTITION BY dept ORDER BY salary) AS lag,
Member

as -> AS
but definitely don't change it just for that. Looks fine. I'll merge shortly

LEAD(salary, 1, 0) OVER (PARTITION BY dept ORDER BY salary) AS lead
FROM employees;
+-----+-----------+------+-----+-----+
| name|       dept|salary|  lag| lead|
+-----+-----------+------+-----+-----+
| Lisa|      Sales| 10000| NULL|30000|
| Alex|      Sales| 30000|10000|32000|
| Evan|      Sales| 32000|30000|    0|
| Fred|Engineering| 21000| NULL|23000|
|Chloe|Engineering| 23000|21000|23000|
|  Tom|Engineering| 23000|23000|29000|
| Paul|Engineering| 29000|23000|    0|
|Helen|  Marketing| 29000| NULL|29000|
| Jane|  Marketing| 29000|29000|35000|
| Jeff|  Marketing| 35000|29000|    0|
+-----+-----------+------+-----+-----+
{% endhighlight %}
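
To make the semantics of the functions used in the examples concrete, here is a minimal Python sketch that recomputes them for a single partition, using the Engineering salaries and ages from the `employees` data. The helper functions are hypothetical and written only for illustration; they assume the input list is already sorted by the `ORDER BY` column and say nothing about how Spark evaluates these functions internally.

```python
def rank(values):
    # 1 + number of rows that sort strictly before this row; ties share
    # a rank and leave a gap after them.
    return [1 + sum(v < x for v in values) for x in values]


def dense_rank(values):
    # Position among the distinct values; ties share a rank, no gaps.
    distinct = sorted(set(values))
    return [distinct.index(x) + 1 for x in values]


def cume_dist(values):
    # Fraction of rows in the partition with a value <= this row's value.
    n = len(values)
    return [sum(v <= x for v in values) / n for x in values]


def lag(values, offset=1, default=None):
    # Value `offset` rows before the current row, else `default`.
    return [values[i - offset] if i - offset >= 0 else default
            for i in range(len(values))]


def lead(values, offset=1, default=None):
    # Value `offset` rows after the current row, else `default`.
    return [values[i + offset] if i + offset < len(values) else default
            for i in range(len(values))]


# Engineering partition, sorted by salary and by age respectively.
salaries = [21000, 23000, 23000, 29000]
ages = [23, 25, 28, 33]

print(rank(salaries))        # [1, 2, 2, 4]
print(dense_rank(salaries))  # [1, 2, 2, 3]
print(cume_dist(ages))       # [0.25, 0.5, 0.75, 1.0]
print(lag(salaries))         # [None, 21000, 23000, 23000]
print(lead(salaries, 1, 0))  # [23000, 23000, 29000, 0]
```

The tied salaries of 23000 show the difference between `RANK` and `DENSE_RANK`: the next row gets rank 4 in the first case and rank 3 in the second.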

### Related Statements

* [SELECT](sql-ref-syntax-qry-select.html)