Skip to content

Conversation

@dbatomic
Copy link
Contributor

@dbatomic dbatomic commented Dec 12, 2023

What changes were proposed in this pull request?

With this PR proposal is to do inline table resolution in two phases:

  1. If there are no expressions that depend on current context (e.g. expressions that depend on CURRENT_DATABASE, CURRENT_USER, CURRENT_TIME etc.) they will be evaluated as part of ResolveInlineTable rule.
  2. Expressions that do depend on CURRENT_* evaluation will be kept as expressions and they evaluation will be delayed to post analysis phase.

Why are the changes needed?

This PR aims to solve two problems with inline tables.

Example1:

SELECT COUNT(DISTINCT ct) FROM VALUES
(CURRENT_TIMESTAMP()),
(CURRENT_TIMESTAMP()),
(CURRENT_TIMESTAMP()) as data(ct)

Prior to this change this example would return 3 (i.e. all CURRENT_TIMESTAMP expressions would return different value since they would be evaluated individually as part of inline table evaluation). After this change result is 1.

Example 2:

CREATE VIEW V as (SELECT * FROM VALUES(CURRENT_TIMESTAMP())

In this example VIEW would be saved with literal evaluated during VIEW creation. After this change CURRENT_TIMESTAMP() will eval during VIEW execution.

Does this PR introduce any user-facing change?

See section above.

How was this patch tested?

New test that validates this behaviour is introduced.

Was this patch authored or co-authored using generative AI tooling?

No.

Copy link
Contributor

@beliefer beliefer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need a discussion.

SELECT COUNT(DISTINCT ct) FROM VALUES
(CURRENT_TIMESTAMP()),
(CURRENT_TIMESTAMP()),
(CURRENT_TIMESTAMP()) as data(ct)

The three call of CURRENT_TIMESTAMP() should have the same value?

@dbatomic
Copy link
Contributor Author

I think we need a discussion.

SELECT COUNT(DISTINCT ct) FROM VALUES
(CURRENT_TIMESTAMP()),
(CURRENT_TIMESTAMP()),
(CURRENT_TIMESTAMP()) as data(ct)

The three call of CURRENT_TIMESTAMP() should have the same value?

That's right. All the invocations of CURRENT_TIMESTAMP/CURRENT_DATE/NOW() should be replaced with single value which represents time of query arrival. We already do this for majority of scenarios (e.g. if time function is pretty much anywhere else in the query). The bug this PR tries to solve is that we don't do this for inline tables.

@beliefer
Copy link
Contributor

I checked in Postgres.
select CURRENT_TIMESTAMP, CURRENT_TIMESTAMP, CURRENT_TIMESTAMP;
The output:

current_timestamp            |current_timestamp            |current_timestamp            |
-----------------------------+-----------------------------+-----------------------------+
2023-12-14 16:27:19.663 +0800|2023-12-14 16:27:19.663 +0800|2023-12-14 16:27:19.663 +0800|

@dbatomic
Copy link
Contributor Author

dbatomic commented Dec 14, 2023

I checked in Postgres. select CURRENT_TIMESTAMP, CURRENT_TIMESTAMP, CURRENT_TIMESTAMP; The output:

current_timestamp            |current_timestamp            |current_timestamp            |
-----------------------------+-----------------------------+-----------------------------+
2023-12-14 16:27:19.663 +0800|2023-12-14 16:27:19.663 +0800|2023-12-14 16:27:19.663 +0800|

Yep, just to illustrate that replacement happens for inline tables:

postgres=# SELECT * FROM (VALUES(EXTRACT(epoch FROM current_timestamp),EXTRACT(epoch FROM current_timestamp),EXTRACT(epoch FROM current_timestamp)));
  column1      |      column2      |      column3

-------------------+-------------------+-------------------
1702546463.398428 | 1702546463.398428 | 1702546463.398428

1) ResolveInlineTables that will check the shape and add all the needed casts.
2) EvalInlineTables that will call the actual evaluation of the rows into LocalRelation at the end of finish analysis phase.
1) ResolveInlineTables that will check the shape and add all the needed casts.
2) EvalInlineTables that will call the actual evaluation of the rows into LocalRelation at the end of finish analysis phase.
@srielau
Copy link
Contributor

srielau commented Dec 15, 2023

I agree with the semantic changes. General comment, though:
VALUES in its current implementation is much too restrictive.
Will this PR move us closer to allowing arbitrary expressions (including non determinism and correlation)?
E.g. the following cannot be done in VALUES today:
scala> spark.sql("VALUES (rand())").show()
org.apache.spark.sql.AnalysisException: [INVALID_INLINE_TABLE.CANNOT_EVALUATE_EXPRESSION_IN_INLINE_TABLE] Invalid inline table. Cannot evaluate the expression "rand()" in inline table definition. SQLSTATE: 42000; line 1 pos 8

or:
SELECT pk, c FROM T, LATERAL(VALUES(T.c1), (T.c2)) AS unpivot(c)

@dbatomic
Copy link
Contributor Author

I agree with the semantic changes. General comment, though: VALUES in its current implementation is much too restrictive. Will this PR move us closer to allowing arbitrary expressions (including non determinism and correlation)? E.g. the following cannot be done in VALUES today: scala> spark.sql("VALUES (rand())").show() org.apache.spark.sql.AnalysisException: [INVALID_INLINE_TABLE.CANNOT_EVALUATE_EXPRESSION_IN_INLINE_TABLE] Invalid inline table. Cannot evaluate the expression "rand()" in inline table definition. SQLSTATE: 42000; line 1 pos 8

or: SELECT pk, c FROM T, LATERAL(VALUES(T.c1), (T.c2)) AS unpivot(c)

Yeah, I maybe can use this PR to deal with rand() problem as well (i.e. allow non-deterministic expressions in VALUES), because it is rather similar to CURRENT_* issue.

"""SELECT COUNT(DISTINCT ct) FROM VALUES
| CURRENT_TIMESTAMP(),
| CURRENT_TIMESTAMP(),
| CURRENT_TIMESTAMP() as data(ct)""".stripMargin), Row(1))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems not needed as we have the same test in the golden file



-- !query
select count(distinct ct) from values now(), now(), now() as data(ct)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines 133 to 135
def earlyEvalPossible =
table.rows.flatten.forall(!_.containsPattern(CURRENT_LIKE))
if (earlyEvalPossible) EvalInlineTables(table) else table
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def earlyEvalPossible =
table.rows.flatten.forall(!_.containsPattern(CURRENT_LIKE))
if (earlyEvalPossible) EvalInlineTables(table) else table
val earlyEvalPossible = table.rows.flatten.forall(!_.containsPattern(CURRENT_LIKE))
if (earlyEvalPossible) EvalInlineTables(table) else table

*/
object EvalInlineTables extends Rule[LogicalPlan] with CastSupport {
override def apply(plan: LogicalPlan): LogicalPlan = plan.transformDownWithSubqueriesAndPruning(
AlwaysProcess.fn, ruleId) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's add pruning for ResolvedInlineTable

Copy link
Contributor

@beliefer beliefer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

select count(distinct ct) from values now(), now(), now() as data(ct);

-- current_timestamp() should be kept as tempResolved inline expression.
select count(distinct ct) from values current_timestamp(), current_timestamp() as data(ct);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we add tests mixed current_timestamp and other deterministic function?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's testing the correct value using count distinct.

InternalRow.fromSeq(row.zipWithIndex.map { case (e, ci) =>
val targetType = fields(ci).dataType
try {
val castedRows: Seq[Seq[Expression]] = table.rows.map { row =>
Copy link
Contributor

@beliefer beliefer Dec 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems we only need the Seq[Expression] here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's a table (rows X columns)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know that. You means the X columns for each row is different?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I got it now. Thank you!

* @param output list of column attributes
* @param rows expressions for the data rows
*/
case class ResolvedInlineTable(rows: Seq[Seq[Expression]], output: Seq[Attribute])
Copy link
Contributor

@beliefer beliefer Dec 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we simplify rows: Seq[Seq[Expression]] to exprs: Seq[Expression]?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dbatomic After review this PR again. I'm sorry for the above comment.

@cloud-fan
Copy link
Contributor

The failed test is unrelated, thanks, merging to master/3.5!

@cloud-fan cloud-fan closed this in 5fe963f Dec 21, 2023
cloud-fan pushed a commit that referenced this pull request Dec 21, 2023
…ne table expressions

With this PR proposal is to do inline table resolution in two phases:
1) If there are no expressions that depend on current context (e.g. expressions that depend on CURRENT_DATABASE, CURRENT_USER, CURRENT_TIME etc.) they will be evaluated as part of ResolveInlineTable rule.
2) Expressions that do depend on CURRENT_* evaluation will be kept as expressions and they evaluation will be delayed to post analysis phase.

This PR aims to solve two problems with inline tables.

Example1:
```sql
SELECT COUNT(DISTINCT ct) FROM VALUES
(CURRENT_TIMESTAMP()),
(CURRENT_TIMESTAMP()),
(CURRENT_TIMESTAMP()) as data(ct)
```
Prior to this change this example would return 3 (i.e. all CURRENT_TIMESTAMP expressions would return different value since they would be evaluated individually as part of inline table evaluation). After this change result is 1.

Example 2:
```sql
CREATE VIEW V as (SELECT * FROM VALUES(CURRENT_TIMESTAMP())
```
In this example VIEW would be saved with literal evaluated during VIEW creation. After this change CURRENT_TIMESTAMP() will eval during VIEW execution.

See section above.

New test that validates this behaviour is introduced.

No.

Closes #44316 from dbatomic/inline_tables_curr_time_fix.

Lead-authored-by: Aleksandar Tomic <[email protected]>
Co-authored-by: Aleksandar Tomic <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 5fe963f)
Signed-off-by: Wenchen Fan <[email protected]>
cloud-fan pushed a commit that referenced this pull request Dec 22, 2023
…s and ResolveInlineTablesSuite

### What changes were proposed in this pull request?
#44316 replace current time/date prior to evaluating inline table expressions.
This PR propose to simplify the code for `ResolveInlineTables` and let `ResolveInlineTablesSuite` apply the rule `ResolveInlineTables`.

### Why are the changes needed?
Simplify the code for `ResolveInlineTables` and `ResolveInlineTablesSuite`.

### Does this PR introduce _any_ user-facing change?
'No'.

### How was this patch tested?
Test cases updated.
GA tests.

### Was this patch authored or co-authored using generative AI tooling?
'No'.

Closes #44447 from beliefer/SPARK-46380_followup.

Authored-by: Jiaan Geng <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants