Spark: Better SparkBatchScan statistics estimation #3038
Conversation
When estimating statistics, we should take the read schema into account.
Take a TPC-DS query (q18) as an example. In the optimized logical plan on V1 tables, the customer table is 25.5 MiB, but the Project over it (4 of the customer table's 18 columns) is small enough to be broadcast. On the Iceberg customer table, a read schema of 5 columns has already been pushed down, yet its statistics are still the full 25.5 MiB of the table. Consequently, the Project plan, whose size is estimated essentially as the 25.5 MiB times the ratio of the size of the 4 projected columns to the size of the 5 read columns, is too big to be broadcast. With this change, the statistics for the Iceberg customer table are adjusted according to the read schema and come to 2.4 MiB; the Project estimate applies the same column-size ratio, making it 2.0 MiB, so it can be broadcast.
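To make the numbers concrete, here is a rough back-of-the-envelope sketch with assumed values: the 10 MB limit is Spark's default spark.sql.autoBroadcastJoinThreshold (assuming the query ran with the default), and the column-size ratio is inferred from the 2.0 MiB / 2.4 MiB estimates above, so treat this as illustrative rather than the actual plan arithmetic.

```java
// Illustrative arithmetic only; not code from the PR.
public class BroadcastEstimateExample {
  public static void main(String[] args) {
    long broadcastLimit = 10L * 1024 * 1024;   // Spark's default spark.sql.autoBroadcastJoinThreshold (10 MB)
    double projectRatio = 2.0 / 2.4;           // size of the 4 projected columns / size of the 5 read columns

    long before = (long) (25.5 * 1024 * 1024 * projectRatio); // ~21 MiB: above the limit
    long after = (long) (2.4 * 1024 * 1024 * projectRatio);   // ~2.0 MiB: below the limit

    System.out.println(before < broadcastLimit); // false -> falls back to a sort-merge join
    System.out.println(after < broadcastLimit);  // true  -> eligible for a broadcast join
  }
}
```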
@rdblue, @aokolnychyi, can you please review this? Can you also please enlighten me as to why we use record count * row size as the estimate for partitioned tables without filter expressions, but file sizes for unpartitioned tables or partitioned tables with filter expressions? Should we use record count * row size everywhere?

I'm surprised to see that we report the total data file size back to Spark. At Netflix, we switched over to reporting row size estimate * number of rows a long time ago to fix broadcast join memory problems. Spark should ignore the size on disk and use row size * num rows, but if Spark doesn't do that then I think Iceberg should report that as the size instead.

Thanks @wypoon! I'm okay with this, but I'd prefer to return a better estimate based on number of rows and not compressed size at all. Interested to hear what @aokolnychyi thinks.
long approximateSize = 0;
for (StructField sparkField : tableSchema.fields()) {
  approximateSize += sparkField.dataType().defaultSize();
}
long result;
try {
  result = LongMath.checkedMultiply(tableSchema.defaultSize(), totalRecords);
} catch (ArithmeticException e) {
  result = Long.MAX_VALUE;
}
StructType has a defaultSize method; there is no need to re-implement it. The main utility of the estimateSize static method is checking for overflow; we should just use Guava's LongMath.checkedMultiply.
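A minimal sketch of what that suggestion amounts to, assuming a helper that takes a Spark StructType and a record count; the class name is illustrative and plain Guava is used rather than Iceberg's relocated copy, so this is not the exact committed code.

```java
import com.google.common.math.LongMath;
import org.apache.spark.sql.types.StructType;

public class EstimateSizeSketch {
  // StructType#defaultSize() already sums the default sizes of all fields,
  // so a per-field loop is unnecessary; checkedMultiply guards against overflow.
  public static long estimateSize(StructType tableSchema, long totalRecords) {
    try {
      return LongMath.checkedMultiply(tableSchema.defaultSize(), totalRecords);
    } catch (ArithmeticException e) {
      return Long.MAX_VALUE; // saturate rather than wrap around on overflow
    }
  }
}
```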
LOG.debug("using table metadata to estimate table statistics");
long totalRecords = PropertyUtil.propertyAsLong(table.currentSnapshot().summary(),
    SnapshotSummary.TOTAL_RECORDS_PROP, Long.MAX_VALUE);
Schema projectedSchema = expectedSchema != null ? expectedSchema : table.schema();
I don't think expectedSchema is ever null. Thus, I think we can just call readSchema() to get the StructType.
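For context, here is a hedged sketch of how the scan-side statistics could be assembled from the pruned read schema and a record count. The class and the inlined size estimate are illustrative stand-ins (the real code would use SparkSchemaUtil.estimateSize and the snapshot summary's total record count); this is not the actual SparkBatchScan implementation.

```java
import java.util.OptionalLong;
import org.apache.spark.sql.connector.read.Statistics;
import org.apache.spark.sql.types.StructType;

public class ScanStatisticsSketch {
  // Build DSv2 Statistics from the pruned read schema rather than the full table schema.
  public static Statistics statisticsFor(StructType readSchema, long totalRecords) {
    long sizeInBytes = estimateSize(readSchema, totalRecords);
    return new Statistics() {
      @Override
      public OptionalLong sizeInBytes() {
        return OptionalLong.of(sizeInBytes);
      }

      @Override
      public OptionalLong numRows() {
        return OptionalLong.of(totalRecords);
      }
    };
  }

  // Simplified stand-in for SparkSchemaUtil.estimateSize.
  private static long estimateSize(StructType schema, long totalRecords) {
    try {
      return Math.multiplyExact((long) schema.defaultSize(), totalRecords);
    } catch (ArithmeticException e) {
      return Long.MAX_VALUE;
    }
  }
}
```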
@rdblue thanks for reviewing this. I changed the statistics estimation to use (row size * number of rows) for all cases. Of course, the row size estimation can be very far from accurate. Is this version preferable? @aokolnychyi or @RussellSpitzer, do you prefer this change?
Fix SparkSchemaUtil.estimateSize to use LongMath.checkedMultiply.
Thanks, @wypoon! Looks great.

Sorry for the delay. I think this change makes sense too. Thanks, @wypoon! Shall we update Spark 2 as well?

I created #3108 so that we don't forget.

@aokolnychyi thanks for seconding the change.

Much appreciated, @wypoon!
Follow-up to #3038. Use (estimated) row size * number of rows to estimate the size instead of adding up file sizes. The row size is estimated from the pruned schema if we prune columns.
Well, rows * schema size overestimates table sizes under some circumstances, for example the demographics dim table in TPC-DS sf1000 Q7, causing a broadcast join to degrade to a sort-merge join. I guess it still works in 3.2 because of AQE. totalSize * read columns size / total columns size is what Hive adopted, but that certainly underestimates in some circumstances. cc @rdblue
When estimating statistics, we should take the read schema into account.
As explained in the PR for SPARK-36568: `V2ScanRelationPushDown` can column prune `DataSourceV2ScanRelation`s and change the read schema of `Scan` operations.
In `SparkBatchScan.estimateStatistics()`, before this change, for unpartitioned tables or partitioned tables with no filter expressions, we sum the file sizes of the files to be scanned and use that as the estimate for the size in bytes. With this change, we adjust that by the ratio of the size of the columns to be read to the size of all the columns; we also adjust it by the compression factor if one is set (the default is 1.0). With this change, some joins that are broadcast joins on V1 tables but SortMergeJoins on Iceberg tables can be broadcast joins as well.
Basically, this change does for Iceberg what the above-mentioned Spark PR does for the built-in V2 `FileScan`s.
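As a minimal sketch of the adjustment described here (and of the Hive-style totalSize * read-columns / total-columns heuristic mentioned in the comment above), assuming the summed data file size and the two schemas are at hand; the names are illustrative, not the exact SparkBatchScan code.

```java
import org.apache.spark.sql.types.StructType;

public class FileSizeAdjustmentSketch {
  // Scale the summed data file sizes by the fraction of the table's column width actually read,
  // then by a configurable compression factor (default 1.0).
  public static long adjustedSizeInBytes(long totalFileSizeInBytes,
                                         StructType readSchema,
                                         StructType tableSchema,
                                         double compressionFactor) {
    double columnSizeRatio = (double) readSchema.defaultSize() / tableSchema.defaultSize();
    return (long) (totalFileSizeInBytes * columnSizeRatio * compressionFactor);
  }
}
```

With the compression factor left at 1.0 this reduces to exactly the column-size ratio adjustment, which the later comment notes can underestimate in some circumstances.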