[SPARK-20151][SQL] Account for partition pruning in scan metadataTime metrics #17476
rxin wants to merge 1 commit into apache:master
Test build #75379 has finished for PR 17476 at commit
      fileStatusCache: FileStatusCache,
    - override val partitionSpec: PartitionSpec)
    + override val partitionSpec: PartitionSpec,
    + override val metadataOpsTimeNs: Option[Long])
Add param doc, as it's not immediately obvious what a user is supposed to supply here.
I'd say something like "time it took to obtain the partitionSpec from the Hive metastore", but maybe that's too specific.
It actually includes more than that. We do file listing as part of that ...
     * file listing time in some implementations and physical execution calls it in this method
     * to update the metrics.
     */
    def metadataOpsTimeNs: Option[Long] = None
I think it's hard to define the semantics of this method for a general FileIndex, since the value it returns would have to be updated every time we call listFiles.
How about we only put this method in PrunedInMemoryFileIndex?
I thought about that, but there is no API-level guarantee that we'd get a PrunedInMemoryFileIndex after partition pruning. It is just a current implementation detail. I'd rather have something that is specified in the API.
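To make the trade-off in this thread concrete, here is a minimal sketch of putting the accessor on the trait with a `None` default, so a caller never has to match on the concrete index type. `SimpleFileIndex`, `PrunedIndex`, `PlainIndex`, and `reportedTimeMs` are hypothetical stand-ins written for illustration; they are not Spark's actual classes or signatures.

```scala
object FileIndexSketch {
  // Hypothetical, simplified stand-in for a file index (not Spark's FileIndex).
  trait SimpleFileIndex {
    def listFiles(keep: String => Boolean): Seq[String]

    // Declared on the trait so callers can read it without knowing the
    // concrete implementation. None means "no timing available".
    def metadataOpsTimeNs: Option[Long] = None
  }

  // An index produced by pruning; it carries the time the pruning took.
  class PrunedIndex(files: Seq[String], pruningTimeNs: Long) extends SimpleFileIndex {
    override def listFiles(keep: String => Boolean): Seq[String] = files.filter(keep)
    override def metadataOpsTimeNs: Option[Long] = Some(pruningTimeNs)
  }

  // An index that does not track metadata time and keeps the default.
  class PlainIndex(files: Seq[String]) extends SimpleFileIndex {
    override def listFiles(keep: String => Boolean): Seq[String] = files.filter(keep)
  }

  // The caller works against the trait only; a missing value counts as zero.
  def reportedTimeMs(index: SimpleFileIndex): Long =
    index.metadataOpsTimeNs.getOrElse(0L) / 1000000L
}
```

With this shape, a caller such as `reportedTimeMs` keeps working if pruning later returns a different implementation, which is the guarantee argued for above.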
Merging in master.
What changes were proposed in this pull request?
After SPARK-20136, we report metadata timing metrics in the scan operator. However, that timing metric doesn't include one of the most important parts of metadata handling, which is partition pruning. This patch adds that time measurement to the scan metrics.
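As a rough illustration of the idea (not the code in this patch), the sketch below times a pruning step with `System.nanoTime()` and folds the result into a metadata time reported in milliseconds. `Partition`, `timed`, and `listingNs` are hypothetical names, and the partition paths are made up for the example.

```scala
object PruningTimingSketch {
  // Hypothetical partition record, for illustration only.
  final case class Partition(path: String, date: String)

  // Evaluates `body` once and returns its result with the elapsed nanoseconds.
  def timed[A](body: => A): (A, Long) = {
    val start = System.nanoTime()
    val result = body
    (result, System.nanoTime() - start)
  }

  def main(args: Array[String]): Unit = {
    val all = Seq(
      Partition("/t/date=2017-01-01", "2017-01-01"),
      Partition("/t/date=2017-01-02", "2017-01-02"))

    // Measure the pruning step itself.
    val (pruned, pruningNs) = timed(all.filter(_.date == "2017-01-02"))

    // Assumption: file listing time is measured elsewhere; zero stands in for it here.
    val listingNs = 0L
    val metadataTimeMs = (listingNs + pruningNs) / 1000000L

    println(s"kept ${pruned.size} partition(s), metadataTime = $metadataTimeMs ms")
  }
}
```

Carrying the measurement as nanoseconds and converting only when reporting avoids rounding short pruning steps down to zero before they are summed.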
How was this patch tested?
N/A - I tried adding a test in SQLMetricsSuite but it was extremely convoluted to the point that I'm not sure if this is worth it.