[SPARK-22673][SQL] InMemoryRelation should utilize existing stats whenever possible #19864
Changes from 5 commits
```diff
@@ -94,14 +94,16 @@ class CacheManager extends Logging {
       logWarning("Asked to cache already cached data.")
     } else {
       val sparkSession = query.sparkSession
-      cachedData.add(CachedData(
-        planToCache,
-        InMemoryRelation(
-          sparkSession.sessionState.conf.useCompression,
-          sparkSession.sessionState.conf.columnBatchSize,
-          storageLevel,
-          sparkSession.sessionState.executePlan(planToCache).executedPlan,
-          tableName)))
+      val inMemoryRelation = InMemoryRelation(
+        sparkSession.sessionState.conf.useCompression,
+        sparkSession.sessionState.conf.columnBatchSize,
+        storageLevel,
+        sparkSession.sessionState.executePlan(planToCache).executedPlan,
+        tableName)
+      if (planToCache.conf.cboEnabled && planToCache.stats.rowCount.isDefined) {
+        inMemoryRelation.setStatsFromCachedPlan(planToCache)
+      }
+      cachedData.add(CachedData(planToCache, inMemoryRelation))
     }
   }
```
```diff
@@ -25,13 +25,15 @@ import org.apache.spark.sql.catalyst.InternalRow
 import org.apache.spark.sql.catalyst.analysis.MultiInstanceRelation
 import org.apache.spark.sql.catalyst.expressions._
 import org.apache.spark.sql.catalyst.plans.logical
-import org.apache.spark.sql.catalyst.plans.logical.Statistics
+import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Statistics}
 import org.apache.spark.sql.execution.SparkPlan
 import org.apache.spark.sql.execution.datasources.LogicalRelation
 import org.apache.spark.storage.StorageLevel
 import org.apache.spark.util.LongAccumulator

 object InMemoryRelation {

   def apply(
       useCompression: Boolean,
       batchSize: Int,
@@ -71,14 +73,20 @@ case class InMemoryRelation(

   override def computeStats(): Statistics = {
     if (batchStats.value == 0L) {
       // Underlying columnar RDD hasn't been materialized, no useful statistics information
       // available, return the default statistics.
-      Statistics(sizeInBytes = child.sqlContext.conf.defaultSizeInBytes)
+      inheritedStats.getOrElse(Statistics(sizeInBytes = child.sqlContext.conf.defaultSizeInBytes))
     } else {
       Statistics(sizeInBytes = batchStats.value.longValue)
     }
   }

+  private var inheritedStats: Option[Statistics] = _
+
+  private[execution] def setStatsFromCachedPlan(planToCache: LogicalPlan): Unit = {
+    require(planToCache.conf.cboEnabled, "you cannot use the stats of cached plan in" +
+      " InMemoryRelation without cbo enabled")
+    inheritedStats = Some(planToCache.stats)
+  }
+
   // If the cached column buffers were not passed in, we calculate them in the constructor.
   // As in Spark, the actual work of caching is lazy.
   if (_cachedColumnBuffers == null) {
```
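Taken together, the two hunks follow a "build first, conditionally attach stats" pattern: the relation is constructed, trustworthy plan statistics are attached only when available, and `computeStats` prefers them over the blunt default until the columnar buffers materialize. A minimal standalone sketch of that behavior, with simplified hypothetical types rather than Spark's actual classes:

```scala
// Standalone sketch of the pattern in this PR; Statistics and RelationSketch
// are simplified stand-ins, not Spark's real types.
case class Statistics(sizeInBytes: BigInt, rowCount: Option[BigInt] = None)

class RelationSketch(defaultSizeInBytes: BigInt) {
  // Stats inherited from the plan being cached, if CBO made them trustworthy.
  private var inheritedStats: Option[Statistics] = None
  // Stands in for the batchStats accumulator; 0 means "not yet materialized".
  var materializedBytes: Long = 0L

  def setStatsFromCachedPlan(stats: Statistics): Unit = {
    require(stats.rowCount.isDefined, "only inherit CBO-derived stats with a row count")
    inheritedStats = Some(stats)
  }

  def computeStats(): Statistics =
    if (materializedBytes == 0L)
      // Before materialization, prefer inherited stats over the default size.
      inheritedStats.getOrElse(Statistics(defaultSizeInBytes))
    else
      Statistics(BigInt(materializedBytes))
}

object RelationSketchDemo extends App {
  val rel = new RelationSketch(defaultSizeInBytes = BigInt(Long.MaxValue))
  rel.setStatsFromCachedPlan(Statistics(BigInt(1024), rowCount = Some(BigInt(100))))
  println(rel.computeStats().sizeInBytes)  // inherited 1024, not the default
  rel.materializedBytes = 4096L
  println(rel.computeStats().sizeInBytes)  // actual materialized size: 4096
}
```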
Do we need to limit to those conditions? I think we can pass the stats into the created `InMemoryRelation` even when the two conditions don't match.
The reason I put it here is that when CBO is not enabled, the stats of the underlying plan might be much smaller than the actual size in memory, leading to a potential risk of an OOM error. The underlying cause is that without CBO enabled, the size of the plan is calculated with `BaseRelation`'s `sizeInBytes`, but with CBO we can have a more accurate estimation:
spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/LogicalRelation.scala
Lines 42 to 46 in 03fdc92
spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala
Lines 370 to 381 in 03fdc92
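The concern can be illustrated with made-up numbers (the compression ratio and threshold below are hypothetical assumptions, not measured values): on-disk size, which file-based relation stats reflect, can be far smaller than the materialized in-memory size, so a planner trusting it could pick a broadcast join that does not actually fit in memory.

```scala
// Hypothetical numbers illustrating why file-size-based stats can
// understate in-memory size; nothing here is Spark's real accounting.
object SizeGapDemo extends App {
  val onDiskBytes = BigInt(8) * 1024 * 1024        // e.g. a compressed file: 8 MiB
  val assumedCompressionRatio = 10                 // workload-dependent assumption
  val inMemoryBytes = onDiskBytes * assumedCompressionRatio  // ~80 MiB once decoded

  val broadcastThreshold = BigInt(10) * 1024 * 1024  // assumed 10 MiB threshold

  // Planning on the on-disk size would allow a broadcast (8 MiB < 10 MiB)
  // even though the materialized data (~80 MiB) is what gets held in memory.
  println(onDiskBytes < broadcastThreshold)    // true: looks broadcastable
  println(inMemoryBytes < broadcastThreshold)  // false: actually is not
}
```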
When CBO is disabled, don't we just set the `sizeInBytes` to `defaultSizeInBytes`? Is it different from the current statistics of the first run?
No, if CBO is disabled, the relation's `sizeInBytes` is the file size:
spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/HadoopFsRelation.scala
Line 85 in 5c3a1f3
`LogicalRelation` uses the statistics from the `relation` only when there is no given `catalogTable`. In this case, it doesn't consider whether CBO is enabled or not. Only `catalogTable` considers CBO when computing its statistics in `toPlanStats`. It doesn't refer to the relation's statistics, IIUC.
If a catalog table doesn't have statistics in its metadata, we will fill it with `defaultSizeInBytes`:
spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala
Lines 121 to 134 in 326f1d6
@viirya you're right! Thanks for clearing up the confusion.
However, to prevent using the relation's stats, which can be much smaller than the in-memory size and lead to a potential OOM error, we should still keep this condition here (we can remove `cboEnabled`, though), right?
The statistics from the relation are based on file size; will that easily cause an OOM issue? I think in cases other than a cached query we still use the relation's statistics, so if this is an issue, doesn't it also affect the other cases?
That's true, I believe it affects those cases as well. There is a similar discussion in #19743.