
@jose-torres
Contributor

@jose-torres jose-torres commented Mar 2, 2018

What changes were proposed in this pull request?

Add an epoch ID argument to DataWriterFactory for use in streaming. As a side effect of passing in this value, DataWriter will now have a consistent lifecycle; commit() or abort() ends the lifecycle of a DataWriter instance in any execution mode.

I considered making a separate streaming interface and adding the epoch ID only to that one, but I think it would require a lot of extra work for no real gain. I think it makes sense to define epoch 0 as the one and only epoch of a non-streaming query.
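For illustration (not part of the patch itself), a sink implementing the updated factory might look like the sketch below. The class names and the buffering behavior are made up; only the createDataWriter(partitionId, attemptNumber, epochId) shape comes from this change.

```scala
import scala.collection.mutable.ArrayBuffer

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.sources.v2.writer.{DataWriter, DataWriterFactory, WriterCommitMessage}

// Hypothetical commit message identifying which epoch/partition a task wrote.
case class EpochPartitionCommit(epochId: Long, partitionId: Int) extends WriterCommitMessage

// Sketch: the factory threads the new epochId through to each writer, so the
// sink can tell which slice of a streaming query a commit message belongs to.
class BufferingWriterFactory extends DataWriterFactory[InternalRow] {
  override def createDataWriter(
      partitionId: Int,
      attemptNumber: Int,
      epochId: Long): DataWriter[InternalRow] =
    new BufferingDataWriter(partitionId, epochId)
}

class BufferingDataWriter(partitionId: Int, epochId: Long) extends DataWriter[InternalRow] {
  private val buffer = new ArrayBuffer[InternalRow]()

  override def write(record: InternalRow): Unit = buffer += record.copy()

  // commit() ends this writer's lifecycle; its message is sent back to the driver.
  override def commit(): WriterCommitMessage = EpochPartitionCommit(epochId, partitionId)

  // abort() also ends the lifecycle; discard anything buffered so far.
  override def abort(): Unit = buffer.clear()
}
```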

How was this patch tested?

existing unit tests

@jose-torres
Contributor Author

@tdas @rdblue @cloud-fan

I haven't forgotten that we need a design doc before finalization; SPARK-23556 tracks that.

@SparkQA

SparkQA commented Mar 2, 2018

Test build #87862 has finished for PR 20710 at commit 5bbd497.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 2, 2018

Test build #87909 has finished for PR 20710 at commit cb6b2cf.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.

* id will always be zero.
*/
DataWriter<T> createDataWriter(int partitionId, int attemptNumber);
DataWriter<T> createDataWriter(int partitionId, int attemptNumber, long epochId);
Contributor

Add clear lifecycle semantics.

logError(s"Writer for partition ${context.partitionId()} is aborting.")
dataWriter.abort()
if (dataWriter != null) dataWriter.abort()
logError(s"Writer for partition ${context.partitionId()} aborted.")
Contributor

nit: add comment that the exception will be rethrown.

try {
dataWriter = writeTask.createDataWriter(
context.partitionId(), context.attemptNumber(), currentEpoch)
iter.foreach(dataWriter.write)
Contributor

Fix this! Don't use foreach.

* succeeds), a {@link WriterCommitMessage} will be sent to the driver side and pass to
* {@link DataSourceWriter#commit(WriterCommitMessage[])} with commit messages from other data
* writers. If this data writer fails(one record fails to write or {@link #commit()} fails), an
* exception will be sent to the driver side, and Spark will retry this writing task for some times,
Contributor

Spark may retry... (in continuous mode we don't retry the task)

Contributor

for some times --> for a few times

Contributor

Break this sentence; it's very long.

* tasks with the same task id running at the same time. Implementations can
* use this attempt number to distinguish writers of different task attempts.
* @param epochId A monotonically increasing id for streaming queries that are split in to
* discrete periods of execution. For queries that execute as a single batch, this
Contributor

For non-streaming queries, this...

Contributor

Also, make it clear that this is the batchId for MicroBatch processing and the epochId for Continuous processing.

* To support exactly-once processing, writer implementations should ensure that this method is
* idempotent. The execution engine may call commit() multiple times for the same epoch
* in some circumstances.
* The execution engine may call commit() multiple times for the same epoch in some circumstances.
Contributor

Somewhere in this file, add docs about what epochId means for MicroBatch and Continuous execution.

Contributor

+1
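To make the idempotence requirement above concrete: a sink can keep per-epoch bookkeeping along these lines (a minimal sketch; the tracker class is hypothetical, not a Spark API).

```scala
import scala.collection.mutable

// Hypothetical sink-side bookkeeping: remember which epochs have already been
// committed so a repeated commit() for the same epoch becomes a no-op.
class EpochCommitTracker {
  private val committed = mutable.Set[Long]()

  def commitOnce(epochId: Long)(publish: => Unit): Unit = synchronized {
    if (committed.add(epochId)) {
      publish() // first commit for this epoch: actually publish the data
    }           // otherwise: a duplicate commit, safely ignored
  }
}
```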

@SparkQA

SparkQA commented Mar 3, 2018

Test build #87912 has finished for PR 20710 at commit 544eb1b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 3, 2018

Test build #87915 has finished for PR 20710 at commit 79495b1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 3, 2018

Test build #87918 has finished for PR 20710 at commit 215c225.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 3, 2018

Test build #87917 has finished for PR 20710 at commit 9fb74e2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

* and finally call {@link DataSourceWriter#abort(WriterCommitMessage[])} if all retry fail.
* exception will be sent to the driver side, and Spark may retry this writing task a few times.
* In each retry, {@link DataWriterFactory#createDataWriter(int, int, long)} will receive a
* different `attemptNumber`. Spark will call {@link DataSourceWriter#abort(WriterCommitMessage[])}
Contributor

This is not clear to me. Isn't it the case that abort will be called every time a task attempt ends in an error?
This seems to give the impression that abort is called only after N failed attempts have been made.

Contributor Author

The local abort will be called every time a task attempt fails. The global abort referenced here is called only when the job fails.
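To make the two levels concrete, here is a sketch of the control flow being described (hypothetical helper, not the actual Spark source):

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.sources.v2.writer.{DataWriter, WriterCommitMessage}

// Executor side: a failing task attempt aborts only its own DataWriter (the
// "local" abort), then rethrows so Spark can retry the attempt.
def runTaskAttempt(
    writer: DataWriter[InternalRow],
    records: Iterator[InternalRow]): WriterCommitMessage = {
  try {
    while (records.hasNext) {
      writer.write(records.next())
    }
    writer.commit()
  } catch {
    case t: Throwable =>
      writer.abort()
      throw t
  }
}
// Driver side (not shown): DataSourceWriter.abort(messages) -- the "global"
// abort -- runs only if the job as a whole fails, e.g. once task retries are exhausted.
```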

iter.foreach(dataWriter.write)
dataWriter = writeTask.createDataWriter(
context.partitionId(), context.attemptNumber(), currentEpoch)
while (iter.hasNext) {
Contributor

Is there a reason to change foreach to a while loop?

Contributor Author

IIUC (/cc @tdas) foreach can be problematic in tight loops, because it introduces a lambda that isn't always optimized away.

@cloud-fan
Contributor

LGTM

* this ID will always be 0.
*/
DataWriter<T> createDataWriter(int partitionId, int attemptNumber);
DataWriter<T> createDataWriter(int partitionId, int attemptNumber, long epochId);
Contributor

Why are we using the same interface for streaming and batch here? Is there a compelling reason to do so instead of adding StreamingWriterFactory? Are the guarantees for an epoch identical to those of a single batch job?

Contributor Author

The guarantees are identical, and in the current execution model, each epoch is in fact processed by a single batch job.

@asfgit asfgit closed this in b0f422c Mar 5, 2018
@tdas
Contributor

tdas commented Mar 5, 2018

@rdblue @jose-torres Arrgh... I didn't notice that you guys were still commenting before I merged it.
Feel free to continue the discussion, and if any change is needed we will deal with it accordingly. Sorry about that!

@rdblue
Contributor

rdblue commented Mar 5, 2018

@tdas, thanks for letting us know. I'm really wondering if we should be using the same interfaces between batch and streaming. The epoch id strikes me as strange for data sources that won't support streaming. What do you think?

@jose-torres
Contributor Author

My primary concern with splitting the interfaces is that it makes it easy for Spark changes to accidentally do the wrong thing. Callers of DataWriterFactory.createDataWriter() won't necessarily notice that there's a StreamingDataWriterFactory which needs to be supported; they'd likely just end up writing code which will break with an opaque internal error in a streaming query.

@rdblue
Contributor

rdblue commented Mar 5, 2018

@jose-torres, can you explain that more for me? Why would callers only use one interface but not the other? Wouldn't streaming use one and batch the other? Why would batch need to know about streaming and vice versa? The simplification is for implementers. It seems odd for implementations to deal with parameters that are for something else (e.g., don't worry about this for batch).

@jose-torres
Contributor Author

jose-torres commented Mar 5, 2018

There isn't currently a distinction between streaming and batch in the places where this interface is called (except in the experimental continuous processing streaming mode). The streaming engine executes a sequence of WriteToDataSourceV2Exec plans, in the same way that a sequence of unrelated batch queries would be executed. The only thing distinguishing streaming queries is that they have a custom DataSourceWriter implementation, which forwards each individual epoch to the StreamWriter.
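Roughly, the forwarding described here looks like the sketch below (a simplified stand-in with a made-up class name, not the exact adapter in the codebase):

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.sources.v2.writer.{DataSourceWriter, DataWriterFactory, WriterCommitMessage}
import org.apache.spark.sql.sources.v2.writer.streaming.StreamWriter

// Sketch: a per-epoch DataSourceWriter that delegates to the long-lived
// StreamWriter, tagging commit/abort calls with its epoch.
class EpochForwardingWriter(epochId: Long, streamWriter: StreamWriter) extends DataSourceWriter {
  override def createWriterFactory(): DataWriterFactory[Row] = streamWriter.createWriterFactory()

  override def commit(messages: Array[WriterCommitMessage]): Unit =
    streamWriter.commit(epochId, messages)

  override def abort(messages: Array[WriterCommitMessage]): Unit =
    streamWriter.abort(epochId, messages)
}
```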

@rdblue
Contributor

rdblue commented Mar 6, 2018

Could the non-continuous streaming mode just use the batch interface, since each write is basically separate?

@jose-torres
Contributor Author

jose-torres commented Mar 6, 2018

I'm not certain I understand the question.

From the perspective of query plan execution, the non-continuous streaming mode does just use the batch interface. The motivation of adding epoch ID to DataWriterFactory is to allow it to continue using the batch interface, rather than adding a StreamingDataWriterFactory which it must use instead of the batch interface.

From the perspective of the writer, the batch interface isn't sufficient. Epoch ID is relevant for the same reason that partition ID is; Spark may need to distinguish between different segments of the data when talking to the remote sink.

@rdblue
Contributor

rdblue commented Mar 6, 2018

My question is: why can't we use a batch interface for batch and micro-batch (which behaves like batch) and add a separate streaming interface for continuous streaming? I see no reason to have epoch ID for batch, and it seems janky to add options that implementers should know to ignore.

Spark may need to distinguish between different segments of the data when talking to the remote sink.

For which case, continuous or micro-batch?

@jose-torres
Contributor Author

jose-torres commented Mar 6, 2018

For either case. Any streaming writer has to know that epoch 1 and epoch 2 are part of the same query, for the same reasons it has to know that task attempt 0 and task attempt 1 are iterations of the same task.

@jose-torres
Contributor Author

Partitions are a better example than task attempts, but it's still roughly the same idea. Data source writers need to be able to reason about what progress they've made, which is impossible in the streaming case if each epoch is its own disconnected query.

@rdblue
Contributor

rdblue commented Mar 6, 2018

Data source writers need to be able to reason about what progress they've made, which is impossible in the streaming case if each epoch is its own disconnected query.

I don't think the writers necessarily need to reason about progress. Are you saying that there are guarantees the writers need to make, like ordering how data appears?

I'm thinking of an implementation that creates a file for each task commit and the driver's commit operation makes those available. That doesn't require any progress tracking on tasks.

As far as a writer knowing that different epochs are part of the same query: why? Is there something the writer needs to do? If so, then I think that is more of an argument for a separate streaming interface, or else batch implementations that ignore the epoch might do the wrong thing.

@jose-torres
Contributor Author

As you say, there's no strict semantic need to have createDataWriter() take arguments. We could simply have each DataWriter identify itself by a random UUID, and require upstream components to keep track of which UUIDs map to which of the writers they care about. But the current API design is to enable each data writer to identify its logical place in the query, and epoch ID is an important part of that. (I expect it would be infeasible to migrate existing sources to an API which didn't provide things like partition ID or attempt number.)

StreamWriter is the separate streaming interface, and DataWriterFactory implementations in streaming queries will always come from a StreamWriter.
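As a small illustration of "logical place": the arguments a writer receives are already enough to key its output deterministically (hypothetical helper, not part of the API).

```scala
// Hypothetical key a sink could derive from the createDataWriter arguments,
// identifying a writer's logical place in the query without random UUIDs.
case class WriterSlot(epochId: Long, partitionId: Int, attemptNumber: Int) {
  def stagingPath(baseDir: String): String =
    s"$baseDir/epoch=$epochId/part=$partitionId/attempt=$attemptNumber"
}
```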

@rdblue
Contributor

rdblue commented Mar 6, 2018

Epoch ID is not a valid part of the logical place in a query for batch. I think we should separate batch and streaming, as they are already coming from different interfaces. There's no need to pass useless information to a batch writer or committer.

Implementations can choose to use the same logic if they want, but we should keep the API focused on what is needed, to keep it reasonable for implementers.

@jose-torres
Contributor Author

I still maintain that it's sensible to say a batch query is a query that has only one epoch, and that the ship has sailed on passing useless information. But I'm bikeshedding here. Created #20752 to split the interfaces.
