Conversation

@DylanGuedes (Contributor) commented Jun 16, 2019

What changes were proposed in this pull request?

This PR ports window.sql from the PostgreSQL regression tests: https://github.com/postgres/postgres/blob/REL_12_BETA2/src/test/regress/sql/window.sql

The expected results can be found at: https://github.com/postgres/postgres/blob/REL_12_BETA2/src/test/regress/expected/window.out

How was this patch tested?

Passes the Jenkins tests.

@DylanGuedes changed the title from "[SPARK-23160][SQL][WIP] Port window.sql" to "[SPARK-23160][SQL] Port window.sql" on Jun 16, 2019
@DylanGuedes (Contributor Author)

Question: should I check window.sql against src/test/resources/sql-tests/inputs and remove overlapping tests here, or is it OK to have duplicates?

@dongjoon-hyun changed the title from "[SPARK-23160][SQL] Port window.sql" to "[SPARK-23160][SQL][TEST] Port window.sql" on Jun 16, 2019
@dongjoon-hyun (Member)

ok to test

@dongjoon-hyun (Member) left a comment

Thank you for making a PR, @DylanGuedes.

  • You don't need to avoid test coverage duplication. The purpose of this porting is to verify Apache Spark's capabilities.
  • For errors and differing results, please file Apache Spark JIRA issues after checking for duplicates.

cc @gatorsmile


SELECT depname, empno, salary, sum(salary) OVER w FROM empsalary WINDOW w AS (PARTITION BY depname);

-- I get an error when trying `order by rank() over w`; however, it works with `order by r` if the rank column is renamed to r
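A minimal sketch of the alias-based workaround described in the comment above (the alias name r and the exact query shape are assumptions for illustration, not part of the original test):

-- Aliasing the window expression and ordering by the alias avoids the error described above.
SELECT depname, empno, salary, rank() OVER w AS r
FROM empsalary
WINDOW w AS (PARTITION BY depname ORDER BY salary)
ORDER BY r;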
Member:

Please keep the following original query as a comment. And file a JIRA.

SELECT depname, empno, salary, rank() OVER w FROM empsalary WINDOW w AS (PARTITION BY depname ORDER BY salary) ORDER BY rank() OVER w;

Contributor Author:

Done.


SELECT ntile(3) OVER (ORDER BY ten, four), ten, four FROM tenk1 WHERE unique2 < 10;

-- Spark does not accept null as input for `ntile`
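For illustration, a hedged sketch of the kind of call involved (the exact form used in the PostgreSQL suite is assumed):

-- PostgreSQL evaluates ntile(NULL) and returns NULL buckets; Spark raises an analysis error instead.
SELECT ntile(NULL) OVER (ORDER BY ten, four), ten, four FROM tenk1;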
Member:

Please file an Apache Spark JIRA issue if it doesn't exist.

Contributor Author:

Done.

@SparkQA commented Jun 16, 2019

Test build #106545 has finished for PR 24881 at commit 7ec6b76.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SELECT i,SUM(v) OVER (ORDER BY i ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING)
FROM (VALUES(1,1),(2,2),(3,3),(4,4)) t(i,v);

-- bool_and?
Member:

Add SPARK-27880 here.
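For context, a hedged sketch of a PostgreSQL-style bool_and/bool_or window query of the kind SPARK-27880 tracks (the exact values and frame are assumptions):

-- bool_and/bool_or as window aggregates; not supported in Spark before SPARK-27880.
SELECT i, b,
       bool_and(b) OVER w,
       bool_or(b) OVER w
FROM (VALUES (1, true), (2, true), (3, false), (4, false), (5, true)) v(i, b)
WINDOW w AS (ORDER BY i ROWS BETWEEN CURRENT ROW AND 1 FOLLOWING);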

@DylanGuedes (Contributor Author)

Hey guys, I added new references to JIRAs. I removed the WIP tag so that CI could run, because I wanted to be sure I wasn't breaking other things - and I was, lmao: I had forgotten to drop the tables at the end of the file. However, I still need to pay attention to a few things to finish this. For instance, PostgreSQL has tons of tests combining UDFs with window functions, but I don't know if we should port those as well, because I don't think we can create UDFs in SQL.

Btw, a question:
I checked and the result order differs from PostgreSQL. I.e., if I don't explicitly order by some column, the results come back in different orders in PostgreSQL and Spark. To me this makes a lot of sense: Spark is a distributed processing engine, so the order may differ unless we explicitly enforce one. Should this be documented, or is it too obvious?
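A minimal sketch of the kind of adjustment that keeps the two engines comparable, reusing the empsalary query from above (the extra sort keys are assumptions):

-- Without a top-level ORDER BY, row order is not guaranteed in a distributed engine:
SELECT depname, empno, salary, sum(salary) OVER w
FROM empsalary WINDOW w AS (PARTITION BY depname);

-- Adding an explicit ORDER BY makes the output deterministic and directly comparable to PostgreSQL:
SELECT depname, empno, salary, sum(salary) OVER w
FROM empsalary WINDOW w AS (PARTITION BY depname)
ORDER BY depname, empno;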

@SparkQA commented Jun 16, 2019

Test build #106556 has finished for PR 24881 at commit 4b87073.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jun 16, 2019

Test build #106561 has finished for PR 24881 at commit fe280f3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@DylanGuedes (Contributor Author) commented Jun 16, 2019

Btw, another question: the CI isn't passing and I can't identify which query causes the failure: https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/106561/testReport/junit/org.apache.spark.sql/SQLQueryTestSuite/sql/
I think it is related to the new temp view that I created (tenk2), but I'm not sure. Suggestions?

@SparkQA commented Jun 17, 2019

Test build #106563 has finished for PR 24881 at commit b679e58.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangyum (Member) commented Jun 17, 2019

@DylanGuedes Please re-generate the golden file with:

SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- -z window.sql"

You can verify it by:

build/sbt "sql/test-only *SQLQueryTestSuite -- -z window.sql"

@DylanGuedes (Contributor Author)

Hmm, I was always generating them; I think that at the end I made a minor change and forgot to rerun the generation because I had only changed a comment. Anyway, thank you!

@DylanGuedes (Contributor Author)

I updated with a few changes. I just noticed (with @wangyum's help) that I was having problems in udf-inner-join.sql, but I haven't figured out why, since I didn't make any changes to it. Do you guys have any idea of the reason?

@SparkQA commented Jun 17, 2019

Test build #106596 has finished for PR 24881 at commit 8273331.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member)

Retest this please.

Member:

Ur, please regenerate this test file.
This output is wrong.

@dongjoon-hyun (Member)

The failure is due to incorrectly generated output. Please configure your Python environment and regenerate it.

@SparkQA commented Jun 19, 2019

Test build #106687 has finished for PR 24881 at commit 8273331.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@DylanGuedes (Contributor Author)

You are right: I had an environment variable set that was pointing to ipython instead. Anyway, I regenerated the golden files, and they look fine now.

@SparkQA commented Jun 20, 2019

Test build #106722 has finished for PR 24881 at commit dc8915c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@DylanGuedes (Contributor Author)

By the way, the CI looks fine now.

@dongjoon-hyun (Member)

Hi, @DylanGuedes.
The output file (21,485 lines) looks ridiculously bigger than the original one (3,823 lines).

Could you compare the result with PostgreSQL? Is the output correct?

@DylanGuedes (Contributor Author) commented Jul 5, 2019

@dongjoon-hyun You are absolutely right - for some reason, the results from two queries were not being truncated. For now I just commented them out, and I will probably create a JIRA for them.

@SparkQA commented Jul 5, 2019

Test build #107284 has finished for PR 24881 at commit 48bd010.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 14, 2019

Test build #109122 has finished for PR 24881 at commit 826708d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile (Member)

cc @maropu @dongjoon-hyun @wangyum

@maropu (Member) commented Aug 26, 2019

Yea, I'll check now.

@maropu (Member) left a comment

I personally think that this PR is too big for a single PR. So, how about splitting it into smaller PRs, as was done with the aggregate tests? WDYT? @dongjoon-hyun @wangyum

-- !query 145 schema
struct<>
-- !query 145 output
org.apache.spark.sql.AnalysisException
Contributor Author:

Because Spark does not handle 'NaN' for inline tables.

SUM(b) OVER(ORDER BY A ROWS BETWEEN 1 PRECEDING AND CURRENT ROW)
FROM (VALUES(1,1),(2,2),(3,'NaN'),(4,3),(5,4)) t(a,b);
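A hedged sketch of a Spark-friendly rewrite of the query above (the explicit cast and double literals are assumptions, not part of the ported test):

-- Casting the string literal keeps column b as DOUBLE, so the inline table type-checks in Spark.
SELECT a, SUM(b) OVER (ORDER BY a ROWS BETWEEN 1 PRECEDING AND CURRENT ROW)
FROM (VALUES (1, 1.0D), (2, 2.0D), (3, CAST('NaN' AS DOUBLE)), (4, 3.0D), (5, 4.0D)) t(a, b);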

select f_float4, sum(f_float4) over (order by f_float8 rows between 1
Contributor Author:

Lol, I don't know what happened there. Anyway, it is fixed now.

FROM (VALUES(1,1.5),(2,2.5),(3,NULL),(4,NULL)) t(i,v);

-- [SPARK-28602] Spark does not recognize 'interval' type as 'numeric'
-- SELECT i,AVG(cast(v as interval)) OVER (ORDER BY i ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)

-- WINDOW wnd AS (ORDER BY i ROWS BETWEEN 1 PRECEDING AND CURRENT ROW)
-- ORDER BY i;

-- SELECT
Member:

Can you describe the reason for commenting out each query, where possible?

Contributor Author:

Done

struct<>
-- !query 104 output
org.apache.spark.sql.AnalysisException
Undefined function: 'range'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.; line 1 pos 7
Contributor Author:

Although the message is misleading, I think they represent the same thing: there is no `range` aggregate function. Should I file a JIRA for that?

@gatorsmile (Member)

@DylanGuedes any update?

@DylanGuedes (Contributor Author)

@gatorsmile @maropu Sorry guys, although I noticed some of your comments, I didn't get any notification about the new suggestions on the PR, so I thought you guys were still discussing whether or not to split it. I'll try to address all the suggestions by next week. Also, if you guys prefer a split PR, I can work on that, for sure.

@maropu (Member) commented Sep 4, 2019

ping @dongjoon-hyun @wangyum

-- Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
--
-- Window Functions Testing
-- https://github.com/postgres/postgres/blob/REL_12_BETA2/src/test/regress/sql/window.sql
Member:

REL_12_BETA2 -> REL_12_BETA3. Let's use REL_12_BETA3.

@SparkQA commented Sep 7, 2019

Test build #110280 has finished for PR 24881 at commit 80f2915.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member)

ok to test

@HyukjinKwon (Member)

Looks like we're blocked. If the current status doesn't look good enough to merge, shall we split? @wangyum, @maropu WDYT? I am willing to help as well.

@HyukjinKwon (Member)

This seems to be the last ticket in the PostgreSQL umbrella. Let's get this done.

@maropu (Member) commented Sep 17, 2019

+1 for the split....

@maropu (Member) commented Sep 17, 2019

@DylanGuedes Are you still here? Can you split this PR (~1400 lines) into 4 parts (each file with 300-400 lines) by referring to pgSQL/aggregates_partX.sql?
https://github.com/apache/spark/tree/master/sql/core/src/test/resources/sql-tests/inputs/pgSQL

@DylanGuedes (Contributor Author)

@maropu Yes, I'm following everything. Yes, I can split the PR.

@maropu (Member) commented Sep 17, 2019

Thanks!

@SparkQA commented Sep 17, 2019

Test build #110683 has finished for PR 24881 at commit 80f2915.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu (Member) commented Sep 19, 2019

Closing this PR since I saw your split PRs.

@maropu closed this on Sep 19, 2019