[SPARK-23559][SS] Create StreamingDataWriterFactory for epoch ID. #20752
Changes from all commits: c6d4ff5, 3cf5479, d4d765b, 9c276f3, 0c68fd1
New file StreamingDataWriterFactory.java (51 added lines), in package org.apache.spark.sql.sources.v2.writer.streaming:

```java
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.spark.sql.sources.v2.writer.streaming;

import org.apache.spark.annotation.InterfaceStability;
import org.apache.spark.sql.sources.v2.writer.DataWriter;
import org.apache.spark.sql.sources.v2.writer.DataWriterFactory;

@InterfaceStability.Evolving
public interface StreamingDataWriterFactory<T> extends DataWriterFactory<T> {
  /**
   * Returns a data writer to do the actual writing work.
   *
   * If this method fails (by throwing an exception), the action will fail and no Spark job
   * will be submitted.
   *
   * @param partitionId A unique id of the RDD partition that the returned writer will process.
   *                    Usually Spark processes many RDD partitions at the same time;
   *                    implementations should use the partition id to distinguish writers for
   *                    different partitions.
   * @param attemptNumber Spark may launch multiple tasks with the same task id. For example, if
   *                      a task fails, Spark launches a new task with the same task id but a
   *                      different attempt number. Or if a task is too slow, Spark launches new
   *                      tasks with the same task id but different attempt numbers, which means
   *                      there are multiple tasks with the same task id running at the same
   *                      time. Implementations can use this attempt number to distinguish
   *                      writers of different task attempts.
   * @param epochId A monotonically increasing id for streaming queries that are split into
   *                discrete periods of execution. For non-streaming queries, this ID will
   *                always be 0.
   */
  DataWriter<T> createDataWriter(int partitionId, int attemptNumber, long epochId);

  @Override
  default DataWriter<T> createDataWriter(int partitionId, int attemptNumber) {
    throw new IllegalStateException(
      "Streaming data writer factory cannot create data writers without epoch.");
  }
}
```
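For orientation, a sink-side implementation of this interface could look roughly like the sketch below. The ConsoleStreamingDataWriterFactory name and its println behaviour are made up for illustration; they are not part of this PR.

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.sources.v2.writer.{DataWriter, WriterCommitMessage}
import org.apache.spark.sql.sources.v2.writer.streaming.StreamingDataWriterFactory

// Illustrative only: a factory whose writers print each record, tagged with the
// epoch so per-epoch commits can be traced.
class ConsoleStreamingDataWriterFactory extends StreamingDataWriterFactory[Row] {

  override def createDataWriter(
      partitionId: Int,
      attemptNumber: Int,
      epochId: Long): DataWriter[Row] = new DataWriter[Row] {

    override def write(record: Row): Unit =
      println(s"epoch=$epochId partition=$partitionId attempt=$attemptNumber: $record")

    // Nothing to flush here; return an empty commit message for the driver.
    override def commit(): WriterCommitMessage = new WriterCommitMessage {}

    override def abort(): Unit = ()
  }
  // The two-argument createDataWriter keeps the interface's default implementation,
  // which throws, because a streaming writer always needs an epoch.
}
```

Because the new interface supplies a throwing default for the two-argument method, a sink like this only has to implement the epoch-aware overload.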
Contributor: Why extend `DataWriterFactory`?

Author: If there's no common interface, DataSourceRDD would need to take a java.util.List[Any] instead of java.util.List[DataWriterFactory[T]]. This kind of pattern is present in a lot of DataSourceV2 interfaces, and I think it's endemic to the general design.

Author: I suppose we could have it take a (partition, attempt number, epoch) => DataWriter lambda instead of Any if we really don't want to extend DataWriterFactory.

Contributor: Can you point me to the code where this would need to change? I don't see it here: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceRDD.scala

Author: Sorry, wrong side of the query. I meant DataWritingSparkTask.run().
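As a rough sketch of the lambda-based alternative floated in the thread above (hypothetical, not code from this PR), the write task could carry a plain function type, and an existing factory could be adapted into it:

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.sources.v2.writer.DataWriter
import org.apache.spark.sql.sources.v2.writer.streaming.StreamingDataWriterFactory

object LambdaWriteTaskSketch {
  // Hypothetical: instead of Any (or a shared factory interface), the task receives
  // a (partitionId, attemptNumber, epochId) => DataWriter function.
  type StreamingWriteTask = (Int, Int, Long) => DataWriter[InternalRow]

  // Adapting a factory into that shape is a one-liner, so the calling code stays typed.
  def asLambda(factory: StreamingDataWriterFactory[InternalRow]): StreamingWriteTask =
    (partitionId, attemptNumber, epochId) =>
      factory.createDataWriter(partitionId, attemptNumber, epochId)
}
```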
Changes to the existing v2 write path (WriteToDataSourceV2Exec and DataWritingSparkTask):

```diff
@@ -31,8 +31,9 @@ import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
 import org.apache.spark.sql.execution.SparkPlan
 import org.apache.spark.sql.execution.streaming.{MicroBatchExecution, StreamExecution}
 import org.apache.spark.sql.execution.streaming.continuous.{CommitPartitionEpoch, ContinuousExecution, EpochCoordinatorRef, SetWriterPartitions}
+import org.apache.spark.sql.execution.streaming.sources.MicroBatchWriter
 import org.apache.spark.sql.sources.v2.writer._
-import org.apache.spark.sql.sources.v2.writer.streaming.StreamWriter
+import org.apache.spark.sql.sources.v2.writer.streaming.{StreamingDataWriterFactory, StreamWriter}
 import org.apache.spark.sql.types.StructType
 import org.apache.spark.util.Utils
```
```diff
@@ -54,7 +55,14 @@ case class WriteToDataSourceV2Exec(writer: DataSourceWriter, query: SparkPlan) e
   override protected def doExecute(): RDD[InternalRow] = {
     val writeTask = writer match {
       case w: SupportsWriteInternalRow => w.createInternalRowWriterFactory()
-      case _ => new InternalRowDataWriterFactory(writer.createWriterFactory(), query.schema)
+      case w: MicroBatchWriter =>
+        new StreamingInternalRowDataWriterFactory(w.createWriterFactory(), query.schema)
+      case w: StreamWriter =>
+        new StreamingInternalRowDataWriterFactory(
+          w.createWriterFactory().asInstanceOf[StreamingDataWriterFactory[Row]],
+          query.schema)
+      case _ =>
+        new InternalRowDataWriterFactory(writer.createWriterFactory(), query.schema)
     }

     val useCommitCoordinator = writer.useCommitCoordinator
```

Contributor (on the asInstanceOf cast): This will cause a cast exception, right? I think it is better to use a separate create method.
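To see why the reviewer worries about the cast: StreamWriter inherits createWriterFactory() from DataSourceWriter, whose declared return type is the base DataWriterFactory[Row], so nothing forces a writer to hand back the streaming type. A hypothetical sketch of that failure mode (not PR code):

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.sources.v2.writer.{DataWriter, DataWriterFactory}
import org.apache.spark.sql.sources.v2.writer.streaming.StreamingDataWriterFactory

object CastConcernSketch {

  // A factory implementing only the base (non-streaming) interface.
  class PlainFactory extends DataWriterFactory[Row] {
    override def createDataWriter(partitionId: Int, attemptNumber: Int): DataWriter[Row] =
      throw new UnsupportedOperationException("not needed for this sketch")
  }

  def main(args: Array[String]): Unit = {
    val factory: DataWriterFactory[Row] = new PlainFactory
    // Mirrors the pattern in the hunk above and throws ClassCastException,
    // because PlainFactory is not a StreamingDataWriterFactory.
    factory.asInstanceOf[StreamingDataWriterFactory[Row]]
  }
}
```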
```diff
@@ -75,7 +83,8 @@
           .askSync[Unit](SetWriterPartitions(rdd.getNumPartitions))

         (context: TaskContext, iter: Iterator[InternalRow]) =>
-          DataWritingSparkTask.runContinuous(writeTask, context, iter)
+          DataWritingSparkTask.runContinuous(
+            writeTask.asInstanceOf[StreamingDataWriterFactory[InternalRow]], context, iter)
       case _ =>
         (context: TaskContext, iter: Iterator[InternalRow]) =>
           DataWritingSparkTask.run(writeTask, context, iter, useCommitCoordinator)
```
```diff
@@ -132,8 +141,13 @@ object DataWritingSparkTask extends Logging {
     val stageId = context.stageId()
     val partId = context.partitionId()
     val attemptId = context.attemptNumber()
-    val epochId = Option(context.getLocalProperty(MicroBatchExecution.BATCH_ID_KEY)).getOrElse("0")
-    val dataWriter = writeTask.createDataWriter(partId, attemptId, epochId.toLong)
+    val dataWriter = writeTask match {
+      case w: StreamingDataWriterFactory[InternalRow] =>
+        val epochId = Option(context.getLocalProperty(MicroBatchExecution.BATCH_ID_KEY)).get
+        w.createDataWriter(partId, attemptId, epochId.toLong)
+
+      case w => w.createDataWriter(partId, attemptId)
+    }

     // write the data and commit this writer.
     Utils.tryWithSafeFinallyAndFailureCallbacks(block = {
```
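For context, the batch id read above travels from the driver to each write task as a Spark job-local property. A standalone sketch of that mechanism, using an illustrative key rather than MicroBatchExecution.BATCH_ID_KEY:

```scala
import org.apache.spark.{SparkConf, SparkContext, TaskContext}

// Standalone sketch of the local-property mechanism relied on in the hunk above.
// The key "example.batchId" is illustrative; Spark's real constant is not used here.
object EpochLocalPropertySketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("epoch-local-property-sketch"))
    try {
      // Driver side: tag jobs launched from this thread with the current epoch/batch id.
      sc.setLocalProperty("example.batchId", "42")

      // Executor side: tasks read the property back and parse it as the epoch id.
      val epochs = sc.parallelize(1 to 4, numSlices = 2).map { _ =>
        Option(TaskContext.get().getLocalProperty("example.batchId")).map(_.toLong)
      }.collect()

      epochs.foreach(println) // prints Some(42) for every element
    } finally {
      sc.stop()
    }
  }
}
```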
```diff
@@ -170,7 +184,7 @@
   }

   def runContinuous(
-      writeTask: DataWriterFactory[InternalRow],
+      writeTask: StreamingDataWriterFactory[InternalRow],
       context: TaskContext,
       iter: Iterator[InternalRow]): WriterCommitMessage = {
     val epochCoordinator = EpochCoordinatorRef.get(
```
@@ -217,6 +231,17 @@ (the existing InternalRowDataWriterFactory and the new StreamingInternalRowDataWriterFactory):

```scala
class InternalRowDataWriterFactory(
    rowWriterFactory: DataWriterFactory[Row],
    schema: StructType) extends DataWriterFactory[InternalRow] {

  override def createDataWriter(partitionId: Int, attemptNumber: Int): DataWriter[InternalRow] = {
    new InternalRowDataWriter(
      rowWriterFactory.createDataWriter(partitionId, attemptNumber),
      RowEncoder.apply(schema).resolveAndBind())
  }
}

class StreamingInternalRowDataWriterFactory(
    rowWriterFactory: StreamingDataWriterFactory[Row],
    schema: StructType) extends StreamingDataWriterFactory[InternalRow] {

  override def createDataWriter(
      partitionId: Int,
      attemptNumber: Int,
```
Further review discussion:

- What about adding `createStreamWriterFactory` that returns the streaming interface? That would make it easier for implementations and prevent throwing cast exceptions because a `StreamingDataWriterFactory` is expected.
- That wouldn't be compatible with SupportsWriteInternalRow. We could add a StreamingSupportsWriteInternalRow, but that seems much more confusing both for Spark developers and for data source implementers.
- What do you think about removing the `SupportsWriteInternalRow` and always using `InternalRow`? For the read side, I think using `Row` and `UnsafeRow` is a problem: https://issues.apache.org/jira/browse/SPARK-23325. I don't see the value of using `Row` instead of `InternalRow` for readers, so maybe we should just simplify on both the read and write paths.
- I'm broadly supportive. I'll detail my thoughts in the jira.
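For illustration, the createStreamWriterFactory idea suggested above might look roughly like the following; the StreamWriterV2 name and the delegation shown are hypothetical, not part of this PR or of Spark:

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.sources.v2.writer.DataWriterFactory
import org.apache.spark.sql.sources.v2.writer.streaming.{StreamingDataWriterFactory, StreamWriter}

// Hypothetical sketch: expose a streaming-typed factory method so callers never need
// writer.createWriterFactory().asInstanceOf[StreamingDataWriterFactory[Row]].
trait StreamWriterV2 extends StreamWriter {

  // Narrower, streaming-specific factory method for the physical plan to call directly.
  def createStreamWriterFactory(): StreamingDataWriterFactory[Row]

  // Keep the inherited batch-style method consistent by delegating to it.
  override def createWriterFactory(): DataWriterFactory[Row] = createStreamWriterFactory()
}
```

The trade-off debated in the thread is that such a method sits outside SupportsWriteInternalRow, which is why the discussion moves on to dropping Row in favour of InternalRow entirely.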