[SPARK-23559][SS] Create StreamingDataWriterFactory for epoch ID. #20752
Conversation
Test build #88013 has finished for PR 20752 at commit
Test build #88014 has finished for PR 20752 at commit
Test build #88015 has finished for PR 20752 at commit
```java
DataWriter<T> createDataWriter(int partitionId, int attemptNumber, long epochId);
```

```java
@Override default DataWriter<T> createDataWriter(int partitionId, int attemptNumber) {
  throw new IllegalStateException(
      "Streaming data writer factory cannot create data writers without epoch.");
}
```
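Pieced together, the quoted diff lines describe roughly the shape below. This is only a sketch using simplified Scala stand-ins (the actual interfaces are Java, and member names other than createDataWriter are assumptions):

```scala
// Sketch only: simplified stand-ins for the DataSourceV2 writer interfaces.
trait DataWriter[T] {
  def write(record: T): Unit
  def commit(): AnyRef
  def abort(): Unit
}

trait DataWriterFactory[T] extends Serializable {
  def createDataWriter(partitionId: Int, attemptNumber: Int): DataWriter[T]
}

// The epoch-aware factory this PR adds: it still satisfies the batch factory
// interface, but refuses to create a writer without an epoch ID.
trait StreamingDataWriterFactory[T] extends DataWriterFactory[T] {
  def createDataWriter(partitionId: Int, attemptNumber: Int, epochId: Long): DataWriter[T]

  override def createDataWriter(partitionId: Int, attemptNumber: Int): DataWriter[T] =
    throw new IllegalStateException(
      "Streaming data writer factory cannot create data writers without epoch.")
}
```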
Why extend DataWriterFactory if this method is going to throw an exception? Why not make them independent interfaces?
If there's no common interface, DataSourceRDD would need to take a java.util.List[Any] instead of java.util.List[DataWriterFactory[T]]. This kind of pattern is present in a lot of DataSourceV2 interfaces, and I think it's endemic to the general design.
I suppose we could have it take a (partition, attempt number, epoch) => DataWriter lambda instead of Any if we really don't want to extend DataWriterFactory.
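For illustration, a rough sketch of that lambda-based alternative. Everything here is a hypothetical stand-in (not the real DataWritingSparkTask or DataWriter), and commit coordination is left out:

```scala
// Sketch only: a lambda-based alternative to sharing a factory interface.
trait DataWriter[T] {
  def write(record: T): Unit
  def commit(): AnyRef
  def abort(): Unit
}

object DataWritingSparkTaskSketch {
  // Instead of taking a DataWriterFactory, the task could take a
  // (partitionId, attemptNumber, epochId) => DataWriter function.
  // A batch caller would just ignore the epoch argument (e.g. pass 0).
  def run[T](
      partitionId: Int,
      attemptNumber: Int,
      epochId: Long,
      createWriter: (Int, Int, Long) => DataWriter[T],
      rows: Iterator[T]): Unit = {
    val writer = createWriter(partitionId, attemptNumber, epochId)
    try {
      rows.foreach(writer.write)
      writer.commit()
    } catch {
      case t: Throwable =>
        writer.abort()
        throw t
    }
  }
}
```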
Can you point me to the code where this would need to change? I don't see it here: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceRDD.scala
Sorry, wrong side of the query. I meant DataWritingSparkTask.run().
```java
 * increasing numeric ID. This writer handles commits and aborts for each successive epoch.
 *
 * Note that StreamWriter implementations should provide instances of
 * {@link StreamingDataWriterFactory}.
```
What about adding createStreamWriterFactory that returns the streaming interface? That would make it easier for implementations and prevent throwing cast exceptions because a StreamingDataWriterFactory is expected.
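A rough sketch of that idea with simplified stand-in traits (createStreamWriterFactory is the name being proposed here, not an existing method; the real interfaces are Java, non-generic on the writer, and have more members):

```scala
// Sketch only: stand-ins illustrating a streaming-typed factory method.
trait DataWriterFactory[T] extends Serializable
trait StreamingDataWriterFactory[T] extends DataWriterFactory[T]

trait DataSourceWriter[T] {
  def createWriterFactory(): DataWriterFactory[T]
}

trait StreamWriter[T] extends DataSourceWriter[T] {
  // Streaming-typed factory method, so the streaming execution path never has
  // to cast the result of createWriterFactory.
  def createStreamWriterFactory(): StreamingDataWriterFactory[T]

  // The batch-typed method can simply delegate to the streaming one.
  override def createWriterFactory(): DataWriterFactory[T] = createStreamWriterFactory()
}
```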
That wouldn't be compatible with SupportsWriteInternalRow. We could add a StreamingSupportsWriteInternalRow, but that seems much more confusing both for Spark developers and for data source implementers.
What do you think about removing the SupportsWriteInternalRow and always using InternalRow? For the read side, I think using Row and UnsafeRow is a problem: https://issues.apache.org/jira/browse/SPARK-23325
I don't see the value of using Row instead of InternalRow for readers, so maybe we should just simplify on both the read and write paths.
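A rough sketch of what that simplification might look like, again with stand-in traits rather than the real interfaces (commit/abort and the read-side changes are omitted):

```scala
import org.apache.spark.sql.catalyst.InternalRow

// Sketch only: simplified stand-ins.
trait DataWriter[T]
trait DataWriterFactory[T] extends Serializable

// If InternalRow were the only row representation on the write path, the
// SupportsWriteInternalRow mix-in would disappear and the base writer
// interface would hand out InternalRow writers directly.
trait DataSourceWriter {
  def createWriterFactory(): DataWriterFactory[InternalRow]
}
```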
I'm broadly supportive. I'll detail my thoughts in the jira.
```scala
    new StreamingInternalRowDataWriterFactory(w.createWriterFactory(), query.schema)
  case w: StreamWriter =>
    new StreamingInternalRowDataWriterFactory(
      w.createWriterFactory().asInstanceOf[StreamingDataWriterFactory[Row]],
```
This will cause a cast exception, right? I think it is better to use a separate create method.
I'm not very familiar with the streaming side, but here are my 2 cents: I agree with @rdblue that it's unnecessary to introduce the epoch ID to data sources that don't care about streaming. However, I think it's natural to say that a batch data source is a special case of a streaming data source that only needs to deal with one epoch. So it's a tradeoff: do we want to make it easier to implement a batch data source, or to make it easier to implement a data source that supports both batch and streaming? To be clear, I'm not talking about code complexity, as the extra
Thanks for the clear summary, @cloud-fan. I think we want to make it easy to support batch, and then easy to reuse those internals to support streaming by adding new mix-in interfaces. Streaming is more complicated for implementers, and I'd like to help people ramp up conceptually instead of requiring a lot of understanding to get the simple cases working. I think we may also want to put a design for the streaming side on the dev list; if the batch side warranted a design discussion, then the streaming side does as well. Changing the batch side to accommodate streaming changes as they become necessary doesn't seem like a good way to arrive at a solid design.
I agree that we should put a design for the streaming side on the dev list, and I intend to do so. The streaming interfaces will remain evolving until a design discussion about them has happened. Right now, we're still at the point where we aren't quite sure what a streaming API needs to look like. We're starting from basically step zero; the V1 streaming API just throws a DataFrame at the sink and tells it to catch. So we need to iterate towards something that works at all before a meaningful design discussion is possible.
Thanks for the context. This aligns with the impression I've gotten, and it makes sense. My push for separation between the batch and streaming sides comes from wanting to keep that evolution from making too many changes to the batch side, which is better understood. I also think that streaming is different enough that we might be heading in the wrong direction by trying to combine the interfaces too early on.
Sounds fair to me. I'll continue iterating on the read side, and send out a design proposal for the write side incorporating this discussion in the next few days.
Test build #88019 has finished for PR 20752 at commit
Doc sent to the dev list: https://docs.google.com/document/d/1PJYfb68s2AG7joRWbhrgpEWhrsPqbhyRwUVl9V1wPOE
There've been no comments on the doc. Should we move forward with this PR?
Can one of the admins verify this patch?
What changes were proposed in this pull request?
Create StreamingDataWriterFactory for epoch ID.
How was this patch tested?
Existing unit tests.