Skip to content

[SPARK-20151][SQL] Account for partition pruning in scan metadataTime metrics#17476

Closed
rxin wants to merge 1 commit intoapache:masterfrom
rxin:SPARK-20151
Closed

[SPARK-20151][SQL] Account for partition pruning in scan metadataTime metrics#17476
rxin wants to merge 1 commit intoapache:masterfrom
rxin:SPARK-20151

Conversation

@rxin
Copy link
Copy Markdown
Contributor

@rxin rxin commented Mar 30, 2017

What changes were proposed in this pull request?

After SPARK-20136, we report metadata timing metrics in scan operator. However, that timing metric doesn't include one of the most important part of metadata, which is partition pruning. This patch adds that time measurement to the scan metrics.

How was this patch tested?

N/A - I tried adding a test in SQLMetricsSuite but it was extremely convoluted to the point that I'm not sure if this is worth it.

@rxin
Copy link
Copy Markdown
Contributor Author

rxin commented Mar 30, 2017

@SparkQA
Copy link
Copy Markdown

SparkQA commented Mar 30, 2017

Test build #75379 has finished for PR 17476 at commit 8789cf0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Copy Markdown
Contributor

@adrian-ionescu adrian-ionescu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

fileStatusCache: FileStatusCache,
override val partitionSpec: PartitionSpec)
override val partitionSpec: PartitionSpec,
override val metadataOpsTimeNs: Option[Long])
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add param doc, as it's not immediately obvious what a user is supposed to supply here.
I'd say something like "time it took to obtain the partitionSpec from the Hive metastore", but maybe that's too specific..

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It actually includes more than that. We do file listing as part of that ...

* file listing time in some implementations and physical execution calls it in this method
* to update the metrics.
*/
def metadataOpsTimeNs: Option[Long] = None
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's hard to define the semantic of this method for general FileIndex, as everytime we call listFiles, the value of this method should be updated.

how about we only put this method in PrunedInMemoryFileIndex?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about that but there is no API level guarantee that we'd get PrunedInMemoryFileIndex after partition pruning. It is more just a current implementation detail. I'd rather have something that's more specified in the API.

@rxin
Copy link
Copy Markdown
Contributor Author

rxin commented Mar 31, 2017

Merging in master.

@asfgit asfgit closed this in a8a765b Mar 31, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants