
[SPARK-21669] Internal API for collecting metrics/stats during FileFormatWriter jobs #18884

Closed
adrian-ionescu wants to merge 10 commits into apache:master from adrian-ionescu:write-stats-tracker-api

Conversation

@adrian-ionescu
Contributor

What changes were proposed in this pull request?

This patch introduces an internal interface for tracking metrics and/or statistics on data on the fly, as it is being written to disk during a FileFormatWriter job, and partially reimplements SPARK-20703 in terms of it.

The interface basically consists of 3 traits:

  • WriteTaskStats: just a tag for classes that represent statistics collected during a WriteTask.
    The only constraint it adds is that the class should be Serializable, as instances of it will be collected on the driver from all executors at the end of the WriteJob.
  • WriteTaskStatsTracker: a trait for classes that can actually compute statistics based on tuples that are processed by a given WriteTask and eventually produce a WriteTaskStats instance.
  • WriteJobStatsTracker: a trait for classes that act as containers of Serializable state that's necessary for instantiating WriteTaskStatsTracker on executors and finally process the resulting collection of WriteTaskStats, once they're gathered back on the driver.
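A minimal sketch of how these three traits might fit together, based on the description above (the method names and signatures here are illustrative, not the exact Spark API):

```scala
// Illustrative sketch only; the real Spark traits carry richer callbacks
// (e.g. per-file and per-partition hooks) than shown here.
trait WriteTaskStats extends Serializable

trait WriteTaskStatsTracker {
  // Called by the write task for each processed row.
  def newRow(row: Any): Unit
  // Produce the per-task stats once the task has finished writing.
  def getFinalStats(): WriteTaskStats
}

trait WriteJobStatsTracker extends Serializable {
  // Instantiated on executors, once per write task.
  def newTaskInstance(): WriteTaskStatsTracker
  // Called on the driver with one WriteTaskStats per successful task.
  def processStats(stats: Seq[WriteTaskStats]): Unit
}
```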

Potential future use of this interface is e.g. CBO stats maintenance during INSERT INTO table ... operations.

How was this patch tested?

Existing tests for SPARK-20703 exercise the new code: hive/SQLMetricsSuite, sql/JavaDataFrameReaderWriterSuite, etc.

@rxin
Contributor

rxin commented Aug 8, 2017

Jenkins, add to white list.

class BasicWriteTaskStatsTracker(hadoopConf: Configuration)
extends WriteTaskStatsTracker {

var numPartitions: Int = 0
Contributor

private[this] ?
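The suggestion here is to narrow the mutable field's visibility. A quick illustration of the difference (the class names are made up for this example):

```scala
class LooseTracker {
  // A plain `var` is public: any caller can read or reassign it.
  var numPartitions: Int = 0
}

class TightTracker {
  // `private[this]` limits access to this very instance (not even other
  // instances of the same class can touch it), and lets the compiler skip
  // generating accessor methods for the field.
  private[this] var numPartitions: Int = 0
  def newPartition(): Unit = { numPartitions += 1 }
  def partitionCount: Int = numPartitions
}
```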

* It is therefore important that such an object is [[Serializable]], as it will be sent
* from the driver to all executors.
*/
trait WriteJobStatsTracker
Contributor

so i think the general approach is that the final implementation should add serializable, and the trait shouldn't ...

Contributor Author

No strong preference, just curious.. why is that preferable?
The way I see it, this way you're sure to get it; otherwise you might forget to mix it in and then you'll only realize it at runtime when faced with a "task not serializable" exception.
Is there some disadvantage to mixing it into the trait?

@SparkQA

SparkQA commented Aug 8, 2017

Test build #3885 has finished for PR 18884 at commit 3665f2f.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

hadoopConf = hadoopConf,
partitionColumns = partitionColumns,
bucketSpec = None,
statsTrackers = Nil,
Contributor

don't we want that as well?

Contributor Author

We might, but it wasn't originally handled in #18159 and so FileStreamSink doesn't extend DataWritingCommand.
@viirya, do you remember if there was any particular reason for this, or was it just overlooked / deemed out of scope?
Anyway, I could try to add handling for it in this PR, but I'd say it's rather orthogonal.

Member

@viirya viirya Aug 9, 2017

I think FileStreamSink is not a RunnableCommand? #18159 adds data writing metrics for certain RunnableCommand.

Do we have a common physical node representing data writing operation in a FileStreamSink so we can bind its SQLMetric and update the metrics after insertion? Looks like the addBatch accepts arbitrary DataFrame and writes the data from it directly.

I'd think it might be another issue.

assert(statsTrackers.length == statsPerTracker.length,
s"""Every WriteTask should have produced one `WriteTaskStats` object for every tracker.
|statsTrackers = ${statsTrackers}
|statsPerTracker = ${statsPerTracker}
Member

In case of many stats, this might result in a big error message. Would just printing the lengths be enough?

* the corresponding [[WriteTaskStats]] from all executors.
*/
private def processStats(
statsTrackers: Seq[WriteJobStatsTracker],
Member

The current framework looks like the trackers can't share a collection of stats or some common metrics. Isn't that a likely use case? When two trackers need the same metrics, we will need to collect them in two copies of stats.

Contributor Author

Not sure if that's a common use case..
But if you do need to share some stats in two trackers, I can think of two solutions within the current framework:

  1. in the processStats of the first tracker, store the stats somewhere (e.g. the catalog) and then retrieve them during processStats of the second tracker
  2. replace the two trackers with a single, combined one (inheritance / composition)
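The second option, a single combined tracker, could be sketched like this (all names here are hypothetical): the shared metric is measured once and the result is handed to every interested consumer.

```scala
// Hypothetical sketch: one combined tracker computes a potentially costly
// metric (here, a simple row count) once and fans the result out to all
// registered consumers, instead of two trackers each recomputing it.
class CombinedTracker(consumers: Seq[Long => Unit]) {
  private[this] var rowCount = 0L
  def newRow(row: Any): Unit = { rowCount += 1 }
  // At the end of the job, deliver the single measurement to each consumer.
  def finish(): Unit = consumers.foreach(consume => consume(rowCount))
}
```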

Member

Because some metrics might be costly to compute, I was just thinking of a case where more than one tracker needs a few overlapping metrics. Instead of measuring the metrics into two different collections of stats, measuring once and keeping just one copy of the metrics seems more reasonable. For now, this may lead to overdesign; I'm just curious how we could deal with it easily. We can consider that once we hit it.

Contributor

If two trackers have overlapping metrics, I think we probably need to combine them into one (via inheritance or composition).

@rxin
Contributor

rxin commented Aug 9, 2017

This looks good to me, but I didn't review super carefully. It's a great clean-up and abstraction.

cc @cloud-fan

partitionColumns: Seq[Attribute],
bucketSpec: Option[BucketSpec],
statsTrackers: Seq[WriteJobStatsTracker],
refreshFunction: (Seq[ExecutedWriteSummary]) => Unit,
Contributor

instead of having the refreshFunction, can we just let this write method return Seq[ExecutedWriteSummary]?

Contributor

or return Set[String] as updated partitions
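Schematically, the two shapes being contrasted look like this (the signatures are simplified stand-ins for illustration, not the actual method):

```scala
// Callback style: the caller supplies a function that receives the result.
def writeWithCallback(data: Seq[String], refresh: Set[String] => Unit): Unit = {
  val updatedPartitions = data.toSet // placeholder for the real write logic
  refresh(updatedPartitions)
}

// Return-value style, as suggested: the result flows back to the caller,
// who decides what to do with it.
def writeReturning(data: Seq[String]): Set[String] =
  data.toSet // placeholder for the real write logic
```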


val numStatsTrackers = statsTrackers.length
assert(statsPerTask.forall(_.length == numStatsTrackers),
s"""Every WriteTask should have produced one `WriteTaskStats` object for every tracker.
Contributor

nit: $numStatsTrackers

* Process the given collection of stats computed during this job.
* E.g. aggregate them, write them to memory / disk, issue warnings, whatever.
* @param stats One [[WriteTaskStats]] object from each successful write task.
* @note The type here is too generic. These classes should probably be parametrized:
Contributor

@cloud-fan cloud-fan Aug 10, 2017

nit: should be only one space before @note

@cloud-fan
Contributor

LGTM except some minor comments, great clean up!

@cloud-fan
Contributor

LGTM, pending jenkins

@cloud-fan
Contributor

retest this please

@SparkQA

SparkQA commented Aug 10, 2017

Test build #80493 has finished for PR 18884 at commit 7ec545b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • trait WriteJobStatsTracker extends Serializable

@rxin
Contributor

rxin commented Aug 10, 2017

Merging in master.

@asfgit asfgit closed this in 95ad960 Aug 10, 2017