@aokolnychyi
Contributor

### What changes were proposed in this pull request?

This PR fixes uncaching a table by name when cascading is disabled.

### Why are the changes needed?

These changes are needed to invalidate the data cache correctly.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

This PR comes with a test that previously failed.

### Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions bot added the SQL label on Oct 23, 2025

@aokolnychyi
Contributor Author

This was discovered and discussed in another PR.

@dongjoon-hyun changed the title from "[SPARK-54004] Fix uncaching table by name without cascading" to "[SPARK-54004][SQL] Fix uncaching table by name without cascading" on Oct 23, 2025
@dongjoon-hyun
Member

Ack. Thank you, @aokolnychyi.

}

- plan match {
+ EliminateSubqueryAliases(plan) match {
Member

I wonder why we keep SubqueryAlias when putting the logical plan into cache data.

Contributor Author

An alternative solution could be to call EliminateSubqueryAliases BEFORE putting the plan into the cache. This, however, would remove ALL subquery aliases. I was not sure about the consequences, but I would be open to considering this option if everyone thinks it is safe.

Thoughts, @viirya @dongjoon-hyun @szehon-ho @cloud-fan?

Contributor

Yeah, we don't need SubqueryAlias in the cache keys: during lookup we call LogicalPlan#sameResult, which strips the SubqueryAlias.

However, the cache key logical plans are exposed to custom normalization rules (see SparkSessionExtensions#injectPlanNormalizationRule), so it seems safer to keep it.
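
To make the distinction in this thread concrete, here is a minimal, self-contained sketch. It is a toy model, not Spark code: `Plan`, `Relation`, `Alias`, `Filter`, and every helper in `CacheSketch` are hypothetical stand-ins invented for illustration. It assumes only what the comments above state, namely that cache keys keep their alias node, that lookup compares plans in an alias-insensitive way, and that the reviewed change strips aliases before pattern matching. Whether the code before the fix failed in exactly the way `matchesNameRaw` does is an assumption.

```scala
// Toy model only: hypothetical stand-ins, not Spark's LogicalPlan classes.
sealed trait Plan
final case class Relation(name: String) extends Plan
final case class Alias(alias: String, child: Plan) extends Plan // stand-in for SubqueryAlias
final case class Filter(condition: String, child: Plan) extends Plan

object CacheSketch {
  // Recursively drop alias nodes, in the spirit of EliminateSubqueryAliases.
  def stripAliases(plan: Plan): Plan = plan match {
    case Alias(_, child)     => stripAliases(child)
    case Filter(cond, child) => Filter(cond, stripAliases(child))
    case r: Relation         => r
  }

  // Alias-insensitive comparison, standing in for the lookup described above.
  def sameResult(a: Plan, b: Plan): Boolean =
    stripAliases(a) == stripAliases(b)

  // Name match on the plan exactly as stored: an alias on top hides the relation.
  def matchesNameRaw(plan: Plan, name: String): Boolean = plan match {
    case Relation(n) => n == name
    case _           => false
  }

  // Name match that strips aliases first, mirroring the shape of the reviewed line.
  def matchesName(plan: Plan, name: String): Boolean = stripAliases(plan) match {
    case Relation(n) => n == name
    case _           => false
  }

  def main(args: Array[String]): Unit = {
    // The cache key is stored with its alias intact.
    val cachedKey: Plan = Alias("t", Relation("db.t"))

    println(sameResult(cachedKey, Relation("db.t")))         // true
    println(matchesNameRaw(cachedKey, "db.t"))               // false
    println(matchesName(cachedKey, "db.t"))                  // true
    // A plan that merely depends on the table is not matched (no cascading).
    println(matchesName(Filter("x > 1", cachedKey), "db.t")) // false
  }
}
```

In this sketch the stored key is never rewritten, so anything that inspects stored keys still sees the alias; only the name-matching step looks through it.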

Contributor Author

Let's keep it then; this is a fragile part of the code.

Member

@dongjoon-hyun left a comment

+1, LGTM.

@dongjoon-hyun
Member

Although the failure looks unrelated to this change, please re-trigger the failed PySpark CI, @aokolnychyi.

@aokolnychyi
Contributor Author

Retrying PySpark CI...

@dongjoon-hyun
Member

Merged to master for Apache Spark 4.1.0-preview3.
Thank you, @aokolnychyi and all.


huangxiaopingRD pushed a commit to huangxiaopingRD/spark that referenced this pull request Nov 25, 2025
Closes apache#52712 from aokolnychyi/spark-54004.

Authored-by: Anton Okolnychyi
Signed-off-by: Dongjoon Hyun

6 participants