-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-46380][SQL]Replace current time/date prior to evaluating inline table expressions. #44316
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveInlineTables.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveInlineTables.scala
Outdated
Show resolved
Hide resolved
beliefer
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need a discussion.
SELECT COUNT(DISTINCT ct) FROM VALUES
(CURRENT_TIMESTAMP()),
(CURRENT_TIMESTAMP()),
(CURRENT_TIMESTAMP()) as data(ct)
The three call of CURRENT_TIMESTAMP() should have the same value?
That's right. All the invocations of CURRENT_TIMESTAMP/CURRENT_DATE/NOW() should be replaced with single value which represents time of query arrival. We already do this for majority of scenarios (e.g. if time function is pretty much anywhere else in the query). The bug this PR tries to solve is that we don't do this for inline tables. |
|
I checked in Postgres. |
Yep, just to illustrate that replacement happens for inline tables: postgres=# SELECT * FROM (VALUES(EXTRACT(epoch FROM current_timestamp),EXTRACT(epoch FROM current_timestamp),EXTRACT(epoch FROM current_timestamp)));-------------------+-------------------+------------------- |
1) ResolveInlineTables that will check the shape and add all the needed casts. 2) EvalInlineTables that will call the actual evaluation of the rows into LocalRelation at the end of finish analysis phase.
1) ResolveInlineTables that will check the shape and add all the needed casts. 2) EvalInlineTables that will call the actual evaluation of the rows into LocalRelation at the end of finish analysis phase.
|
I agree with the semantic changes. General comment, though: or: |
Yeah, I maybe can use this PR to deal with rand() problem as well (i.e. allow non-deterministic expressions in VALUES), because it is rather similar to CURRENT_* issue. |
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/unresolved.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveInlineTables.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/unresolved.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/finishAnalysis.scala
Outdated
Show resolved
Hide resolved
| """SELECT COUNT(DISTINCT ct) FROM VALUES | ||
| | CURRENT_TIMESTAMP(), | ||
| | CURRENT_TIMESTAMP(), | ||
| | CURRENT_TIMESTAMP() as data(ct)""".stripMargin), Row(1)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems not needed as we have the same test in the golden file
|
|
||
|
|
||
| -- !query | ||
| select count(distinct ct) from values now(), now(), now() as data(ct) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can add a test for CURRENT_TIMESTAMP and remove https://github.com/apache/spark/pull/44316/files#r1430980065
| def earlyEvalPossible = | ||
| table.rows.flatten.forall(!_.containsPattern(CURRENT_LIKE)) | ||
| if (earlyEvalPossible) EvalInlineTables(table) else table |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| def earlyEvalPossible = | |
| table.rows.flatten.forall(!_.containsPattern(CURRENT_LIKE)) | |
| if (earlyEvalPossible) EvalInlineTables(table) else table | |
| val earlyEvalPossible = table.rows.flatten.forall(!_.containsPattern(CURRENT_LIKE)) | |
| if (earlyEvalPossible) EvalInlineTables(table) else table |
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/finishAnalysis.scala
Outdated
Show resolved
Hide resolved
| */ | ||
| object EvalInlineTables extends Rule[LogicalPlan] with CastSupport { | ||
| override def apply(plan: LogicalPlan): LogicalPlan = plan.transformDownWithSubqueriesAndPruning( | ||
| AlwaysProcess.fn, ruleId) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's add pruning for ResolvedInlineTable
beliefer
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
| select count(distinct ct) from values now(), now(), now() as data(ct); | ||
|
|
||
| -- current_timestamp() should be kept as tempResolved inline expression. | ||
| select count(distinct ct) from values current_timestamp(), current_timestamp() as data(ct); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shall we add tests mixed current_timestamp and other deterministic function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's testing the correct value using count distinct.
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/finishAnalysis.scala
Outdated
Show resolved
Hide resolved
…mizer/finishAnalysis.scala Co-authored-by: Jiaan Geng <[email protected]>
| InternalRow.fromSeq(row.zipWithIndex.map { case (e, ci) => | ||
| val targetType = fields(ci).dataType | ||
| try { | ||
| val castedRows: Seq[Seq[Expression]] = table.rows.map { row => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems we only need the Seq[Expression] here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's a table (rows X columns)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know that. You means the X columns for each row is different?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I got it now. Thank you!
| * @param output list of column attributes | ||
| * @param rows expressions for the data rows | ||
| */ | ||
| case class ResolvedInlineTable(rows: Seq[Seq[Expression]], output: Seq[Attribute]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shall we simplify rows: Seq[Seq[Expression]] to exprs: Seq[Expression]?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dbatomic After review this PR again. I'm sorry for the above comment.
|
The failed test is unrelated, thanks, merging to master/3.5! |
…ne table expressions With this PR proposal is to do inline table resolution in two phases: 1) If there are no expressions that depend on current context (e.g. expressions that depend on CURRENT_DATABASE, CURRENT_USER, CURRENT_TIME etc.) they will be evaluated as part of ResolveInlineTable rule. 2) Expressions that do depend on CURRENT_* evaluation will be kept as expressions and they evaluation will be delayed to post analysis phase. This PR aims to solve two problems with inline tables. Example1: ```sql SELECT COUNT(DISTINCT ct) FROM VALUES (CURRENT_TIMESTAMP()), (CURRENT_TIMESTAMP()), (CURRENT_TIMESTAMP()) as data(ct) ``` Prior to this change this example would return 3 (i.e. all CURRENT_TIMESTAMP expressions would return different value since they would be evaluated individually as part of inline table evaluation). After this change result is 1. Example 2: ```sql CREATE VIEW V as (SELECT * FROM VALUES(CURRENT_TIMESTAMP()) ``` In this example VIEW would be saved with literal evaluated during VIEW creation. After this change CURRENT_TIMESTAMP() will eval during VIEW execution. See section above. New test that validates this behaviour is introduced. No. Closes #44316 from dbatomic/inline_tables_curr_time_fix. Lead-authored-by: Aleksandar Tomic <[email protected]> Co-authored-by: Aleksandar Tomic <[email protected]> Signed-off-by: Wenchen Fan <[email protected]> (cherry picked from commit 5fe963f) Signed-off-by: Wenchen Fan <[email protected]>
…s and ResolveInlineTablesSuite ### What changes were proposed in this pull request? #44316 replace current time/date prior to evaluating inline table expressions. This PR propose to simplify the code for `ResolveInlineTables` and let `ResolveInlineTablesSuite` apply the rule `ResolveInlineTables`. ### Why are the changes needed? Simplify the code for `ResolveInlineTables` and `ResolveInlineTablesSuite`. ### Does this PR introduce _any_ user-facing change? 'No'. ### How was this patch tested? Test cases updated. GA tests. ### Was this patch authored or co-authored using generative AI tooling? 'No'. Closes #44447 from beliefer/SPARK-46380_followup. Authored-by: Jiaan Geng <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
What changes were proposed in this pull request?
With this PR proposal is to do inline table resolution in two phases:
Why are the changes needed?
This PR aims to solve two problems with inline tables.
Example1:
Prior to this change this example would return 3 (i.e. all CURRENT_TIMESTAMP expressions would return different value since they would be evaluated individually as part of inline table evaluation). After this change result is 1.
Example 2:
In this example VIEW would be saved with literal evaluated during VIEW creation. After this change CURRENT_TIMESTAMP() will eval during VIEW execution.
Does this PR introduce any user-facing change?
See section above.
How was this patch tested?
New test that validates this behaviour is introduced.
Was this patch authored or co-authored using generative AI tooling?
No.