@cloud-fan
Contributor

What changes were proposed in this pull request?

Today, Spark is conservative and uses the analyzed plan, not the optimized plan, as the cache key. As a result, semantically equivalent queries whose analyzed plans differ miss the cache.

This PR updates SparkSessionExtensions to allow people to inject custom plan normalization rules. Users can pick some safe optimizer rules, or implement new rules based on their business needs, to do plan normalization and increase the cache hit rate.
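
As a rough illustration of the extension point described above, the sketch below registers the built-in `PushDownPredicates` optimizer rule as a plan normalization rule and then checks whether a reordered query is planned against the cache. It uses the method name `injectPlanNormalizationRule` from the follow-up rename referenced later in this conversation. It assumes a Spark build that contains this PR and that follow-up, running in local mode. The object name, method name and app name are made up for the example.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.optimizer.PushDownPredicates
import org.apache.spark.sql.execution.columnar.InMemoryTableScanExec

// Hypothetical wrapper object, not part of the PR.
object PlanNormalizationSketch {

  /** Returns true if the reordered query is planned as a scan of the cached data. */
  def reorderedQueryHitsCache(): Boolean = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("plan-normalization-sketch")
      // Register a built-in optimizer rule as a plan normalization rule.
      .withExtensions { extensions =>
        extensions.injectPlanNormalizationRule { _ => PushDownPredicates }
      }
      .getOrCreate()
    try {
      import spark.implicits._
      val df = Seq((1, "a"), (2, "b")).toDF("i", "s")
      // Cache "project, then filter" ...
      df.select("i").filter($"i" > 1).cache()
      // ... then look for a cache scan in "filter, then project".
      df.filter($"i" > 1).select("i").queryExecution.executedPlan.find {
        case _: InMemoryTableScanExec => true
        case _ => false
      }.isDefined
    } finally {
      spark.stop()
    }
  }
}
```

The two queries differ only in operator order, so pushing the filter below the projection should bring both to the same normalized plan. Whether a given optimizer rule is safe to use for normalization is left to the user, as the description says.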

Why are the changes needed?

To let advanced users get more cache hits.

Does this PR introduce any user-facing change?

no

How was this patch tested?

new test

@github-actions bot added the SQL label on Nov 17, 2022
@cloud-fan
Contributor Author

* <ul>
* <li>Analyzer Rules.</li>
* <li>Check Analysis Rules.</li>
* <li>Plan Normalization Rules.</li>
Member

Should this be "Cache Plan Normalization Rules", since it is used only for caching?

case other => other
}

lazy val normalized: LogicalPlan = {
Member

Perhaps add a comment explaining this.

Comment on lines +204 to +207
df.select("i").filter($"i" > 1).cache()
assert(df.filter($"i" > 1).select("i").queryExecution.executedPlan.find {
case _: org.apache.spark.sql.execution.columnar.InMemoryTableScanExec => true
case _ => false
Member

So without the added rule, caching is unable to apply here, right?

Contributor Author

yup

Contributor

Should we add a negative test that verifies this? Might be overkill...
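
A negative test along these lines might look like the sketch below. It is not taken from the PR. It repeats the quoted test body on a session with no plan normalization rule and expects no cache scan in the reordered query, which matches the author's "yup" above. The object name, method name and app name are made up for the example.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.columnar.InMemoryTableScanExec

// Hypothetical wrapper object, not part of the PR.
object NoNormalizationRuleSketch {

  /** Returns true if the reordered query is planned as a scan of the cached data. */
  def reorderedQueryHitsCache(): Boolean = {
    // No extensions are registered, so cache keys are plain analyzed plans.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("no-normalization-rule-sketch")
      .getOrCreate()
    try {
      import spark.implicits._
      val df = Seq((1, "a"), (2, "b")).toDF("i", "s")
      // Cached plan: Filter on top of Project.
      df.select("i").filter($"i" > 1).cache()
      // Queried plan: Project on top of Filter, a different analyzed plan.
      df.filter($"i" > 1).select("i").queryExecution.executedPlan.find {
        case _: InMemoryTableScanExec => true
        case _ => false
      }.isDefined
    } finally {
      spark.stop()
    }
  }
}
```

A test would then assert that `NoNormalizationRuleSketch.reorderedQueryHitsCache()` is false.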

Member

@viirya left a comment

Seems like a good idea to me.

planChangeLogger.logRule(rule.ruleName, p, result)
result
})
if (normalizationRules.nonEmpty) {
Member

I think entering this else block already means normalizationRules is non-empty, no?

private[this] val planNormalizationRules = mutable.Buffer.empty[RuleBuilder]

def buildPlanNormalizationRules(session: SparkSession): Seq[Rule[LogicalPlan]] = {
planNormalizationRules.map(_.apply(session)).toSeq
Contributor

nit: isn't .apply(...) redundant with just (...) ?

Contributor Author

I'm following the existing code style in this file. I assume the reason is that people who are not familiar with Scala may be confused by code like .map(_(session))
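
For readers less familiar with Scala, the two spellings discussed in this thread can be compared in a small standalone sketch. `Session` and `RuleBuilder` here are simplified stand-ins (plain strings and functions) chosen for the example, not Spark's types.

```scala
// Toy stand-ins: a "session" and a built "rule" are both plain strings here.
object ApplySugarSketch {
  type Session = String
  type RuleBuilder = Session => String

  val builders: Seq[RuleBuilder] = Seq(
    (s: Session) => s"ruleA@$s",
    (s: Session) => s"ruleB@$s"
  )

  // Explicit form, as used in the file under review.
  def buildExplicit(session: Session): Seq[String] =
    builders.map(_.apply(session))

  // Shorthand form: calling a value with (...) is sugar for calling its apply method.
  def buildSugared(session: Session): Seq[String] =
    builders.map(_(session))
}
```

Both methods return the same sequence. The explicit `.apply` only makes the function call visible to readers who do not know the sugar.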

commandExecuted
} else {
val planChangeLogger = new PlanChangeLogger[LogicalPlan]()
val normalized = normalizationRules.foldLeft(commandExecuted)((p, rule) => {
Contributor

nit: (... => { ... }) is equivalent to just { ... => ... } in this context
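
The fold under review can be sketched in isolation, written in the brace style suggested here. This is a toy model. `Plan`, `Rule` and `PushFilterBelowProject` are invented for the example and are not Catalyst classes, and `println` stands in for the plan change logger.

```scala
object NormalizeSketch {
  // Toy logical plan, not Catalyst's LogicalPlan.
  sealed trait Plan
  case class Scan(table: String) extends Plan
  case class Filter(cond: String, child: Plan) extends Plan
  case class Project(cols: Seq[String], child: Plan) extends Plan

  trait Rule {
    def name: String
    def apply(plan: Plan): Plan
  }

  // Hypothetical rule: always place a Filter below a Project.
  object PushFilterBelowProject extends Rule {
    val name = "PushFilterBelowProject"
    def apply(plan: Plan): Plan = plan match {
      case Filter(c, Project(cols, child)) => Project(cols, apply(Filter(c, child)))
      case Filter(c, child) => Filter(c, apply(child))
      case Project(cols, child) => Project(cols, apply(child))
      case other => other
    }
  }

  // Apply each rule in order. With no rules, foldLeft returns the input plan,
  // so no separate emptiness check is needed around the fold itself.
  def normalize(plan: Plan, rules: Seq[Rule]): Plan =
    rules.foldLeft(plan) { (p, rule) =>
      val result = rule(p)
      if (result != p) println(s"${rule.name} changed the plan")
      result
    }
}
```

In this model, `Filter("i > 1", Project(Seq("i"), Scan("t")))` and `Project(Seq("i"), Filter("i > 1", Scan("t")))` are unequal as written but normalize to the same plan. That is the property a cache key needs.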

@cloud-fan
Contributor Author

thanks for the review, merging to master!

@cloud-fan closed this in 31d90d0 on Nov 21, 2022
Member

@dongjoon-hyun left a comment

+1, late LGTM. Thank you, @cloud-fan and all.

HyukjinKwon pushed a commit that referenced this pull request Nov 24, 2022
…tionRules to injectPlanNormalizationRule

### What changes were proposed in this pull request?

Followup of #38692. To follow other APIs in `SparkSessionExtensions`, the name should be `inject...Rule` and `build...Rules`.

### Why are the changes needed?

typo fix

### Does this PR introduce _any_ user-facing change?

not a released API

### How was this patch tested?

n/a

Closes #38767 from cloud-fan/small.

Authored-by: Wenchen Fan
Signed-off-by: Hyukjin Kwon
SandishKumarHN pushed a commit to SandishKumarHN/spark that referenced this pull request Dec 12, 2022
…caching

### What changes were proposed in this pull request?

Today, Spark is very conservative and uses the analyzed plan instead of the optimized plan as the cache key. Many cache opportunities are missed.

This PR updates `SparkSessionExtensions` to allow people to inject custom plan normalization rules. Users can pick some safe optimizer rules, or implement new rules based on their business needs, to do plan normalization and increase the cache hit rate.

### Why are the changes needed?

allow advanced users to do caching better.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

new test

Closes apache#38692 from cloud-fan/cache.

Authored-by: Wenchen Fan
Signed-off-by: Wenchen Fan