@@ -1087,6 +1087,14 @@ object SQLConf {
.checkValue(v => Set(1, 2).contains(v), "Valid versions are 1 and 2")
.createWithDefault(2)

val STOP_RUNNING_DUPLICATE_STREAM = buildConf("spark.sql.streaming.stopExistingDuplicateStream")
dongjoon-hyun (Member), Oct 24, 2019:
stopExistingDuplicateStream -> stopExistingDuplicatedStream?

.doc("Running two streams using the same checkpoint location concurrently is not supported. " +
"In the case where multiple streams are started on different SparkSessions, access to the " +
"older stream's SparkSession may not be possible, and the stream may have turned into a " +
"zombie stream. When this flag is true, we will stop the old stream to start the new one.")
.booleanConf
.createWithDefault(true)
dongjoon-hyun (Member), Oct 24, 2019:
Shall we have false by default to avoid the behavior change?
cc @gatorsmile

Contributor Author:

Great question. Here's my argument for why we should change it:

  1. This change is going into Spark 3.0, a release where we can actually break existing behavior (unless it is critical behavior that people depend on).
  2. The existing behavior was that any new start of a stream would fail, because an existing stream was already running. That is a programming error on the user's part.
  3. However, there are legitimate cases where a user would like to start a new instance of the stream (for instance, because they upgraded the code) but has no way of stopping the existing stream, because it has turned into a zombie.

I would argue that 3 is more common than 2, and given 1, this is a place where we can change the behavior and mention it in the release notes.
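For illustration, a hedged sketch of case 3 (the streaming DataFrame `df` and the paths are assumptions, not from the PR):

// Start a query; the handle is later lost (e.g. a notebook cell is re-run),
// so the still-running query becomes a "zombie".
val q1 = df.writeStream
  .option("checkpointLocation", "/tmp/chk")
  .start("/tmp/out")

// Previously this second start() threw IllegalStateException ("another query
// with same id is already active"). With stopExistingDuplicateStream=true,
// the zombie run is stopped and the new run takes over the same checkpoint.
val q2 = df.writeStream
  .option("checkpointLocation", "/tmp/chk")
  .start("/tmp/out")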

Member:
+1 for the release notes.

Contributor:

nit: I think the docs can be better. Here are the confusing parts:

  • it seems that this will work only when the stream is restarted in a different session. but is it s
  • the term "stream" is confusing here. Does it refer to a streaming query or a query run? We should try to be clear by saying starting a "streaming query" instead of a "stream" in the explanation, depending on what is consistent with other confs.

tdas (Contributor), Nov 13, 2019:

+1 on the name now. I like it.


val UNSUPPORTED_OPERATION_CHECK_ENABLED =
buildConf("spark.sql.streaming.unsupportedOperationCheck")
.internal()
@@ -355,11 +355,22 @@ class StreamingQueryManager private[sql] (sparkSession: SparkSession) extends Logging {
// Make sure no other query with same id is active across all sessions
val activeOption =
Option(sparkSession.sharedState.activeStreamingQueries.putIfAbsent(query.id, this))
if (activeOption.isDefined || activeQueries.values.exists(_.id == query.id)) {

val streamAlreadyActive =
activeOption.isDefined || activeQueries.values.exists(_.id == query.id)
val turnOffOldStream =
Contributor:
TurnOff?? Make it the same as the conf.

sparkSession.sessionState.conf.getConf(SQLConf.STOP_RUNNING_DUPLICATE_STREAM)
if (streamAlreadyActive && turnOffOldStream) {
val queryManager = activeOption.getOrElse(this)
logInfo(s"Stopping existing streaming query [id=${query.id}], as a new run is being " +
Contributor:
nit: would a warning be better here?

Contributor:
I agree, make this a warning, and add the previous runId and new runId to make it easier to debug.
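A hedged sketch of that suggestion, reusing the surrounding code's `queryManager` and `query` (illustrative, not the committed version):

logWarning(s"Stopping existing streaming query [id=${query.id}, " +
  s"runId=${queryManager.get(query.id).runId}], as a new run " +
  s"[runId=${query.runId}] is being started.")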

"started.")
queryManager.get(query.id).stop()
Reviewer:

If the existing stream is a "zombie", can it happen that it does not respond to stop() and then this will block forever?

Contributor Author:
Great question. I can add some safeguards against this, but in most cases when we say the stream is a "zombie", we mean that we lost all references to it, not that it is uninterruptible.
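One possible safeguard, as a minimal sketch (not what the PR implements; `stopWithTimeout` is a hypothetical helper) would be to bound the wait instead of blocking indefinitely:

import scala.concurrent.{Await, Future, TimeoutException}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._
import org.apache.spark.sql.streaming.StreamingQuery

// Hypothetical helper: fail fast rather than hang forever on stop().
def stopWithTimeout(query: StreamingQuery, timeout: FiniteDuration = 60.seconds): Unit = {
  try Await.result(Future(query.stop()), timeout)
  catch {
    case _: TimeoutException =>
      throw new IllegalStateException(
        s"Timed out stopping existing query ${query.id} after $timeout")
  }
}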

Member:
I think this should be fine. If stop returns, the query should already be stopped, because stop waits until the streaming thread dies.

zsxwing (Member), Nov 5, 2019:
Ah, this has a deadlock. We are waiting for the query to stop while holding a lock that the query itself needs in order to remove itself from the active queries.
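A minimal sketch of the deadlock shape being described (simplified stand-in types, not the actual StreamingQueryManager code):

object DeadlockSketch {
  private val lock = new Object
  private val active = scala.collection.mutable.Map.empty[Long, Thread]

  // Called by a query's own thread as it shuts down.
  def deregister(id: Long): Unit = lock.synchronized { active.remove(id) }

  // Called by the thread starting a new run of the same query.
  def startRun(id: Long): Unit = lock.synchronized {
    active.get(id).foreach(_.join()) // blocks while holding `lock`: the old
                                     // thread is stuck waiting in deregister()
    active(id) = Thread.currentThread()
  }
}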

} else if (streamAlreadyActive) {
throw new IllegalStateException(
s"Cannot start query with id ${query.id} as another query with same id is " +
s"already active. Perhaps you are attempting to restart a query from checkpoint " +
s"that is already active.")
"Cannot start query with id ${query.id} as another query with same id is " +
"already active. Perhaps you are attempting to restart a query from checkpoint " +
"that is already active. You may stop the old query by setting the SQL " +
s"""configuration: spark.conf.set("${SQLConf.STOP_RUNNING_DUPLICATE_STREAM}", true).""")
Member:
SQLConf.STOP_RUNNING_DUPLICATE_STREAM => SQLConf.STOP_RUNNING_DUPLICATE_STREAM.key
And

You may stop the old query by setting the SQL ... and retry.
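Applying both suggestions would give something like this (illustrative, not the committed version):

s"Cannot start query with id ${query.id} as another query with same id is " +
"already active. Perhaps you are attempting to restart a query from checkpoint " +
"that is already active. You may stop the old query by setting the SQL " +
s"""configuration: spark.conf.set("${SQLConf.STOP_RUNNING_DUPLICATE_STREAM.key}", true) """ +
"and retry.")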

}

activeQueries.put(query.id, query)
@@ -32,6 +32,7 @@ import org.apache.spark.SparkException
import org.apache.spark.sql.{Dataset, Encoders}
import org.apache.spark.sql.execution.datasources.v2.StreamingDataSourceV2Relation
import org.apache.spark.sql.execution.streaming._
import org.apache.spark.sql.internal.SQLConf
import org.apache.spark.sql.streaming.util.BlockingSource
import org.apache.spark.util.Utils

@@ -274,48 +275,102 @@ class StreamingQueryManagerSuite extends StreamTest {
}

testQuietly("can't start multiple instances of the same streaming query in the same session") {
withTempDir { dir =>
val (ms1, ds1) = makeDataset
val (ms2, ds2) = makeDataset
val chkLocation = new File(dir, "_checkpoint").getCanonicalPath
val dataLocation = new File(dir, "data").getCanonicalPath

val query1 = ds1.writeStream.format("parquet")
.option("checkpointLocation", chkLocation).start(dataLocation)
ms1.addData(1, 2, 3)
try {
val e = intercept[IllegalStateException] {
ds2.writeStream.format("parquet")
.option("checkpointLocation", chkLocation).start(dataLocation)
withSQLConf(SQLConf.STOP_RUNNING_DUPLICATE_STREAM.key -> "false") {
withTempDir { dir =>
val (ms1, ds1) = makeDataset
val (ms2, ds2) = makeDataset
val chkLocation = new File(dir, "_checkpoint").getCanonicalPath
val dataLocation = new File(dir, "data").getCanonicalPath

val query1 = ds1.writeStream.format("parquet")
.option("checkpointLocation", chkLocation).start(dataLocation)
ms1.addData(1, 2, 3)
try {
val e = intercept[IllegalStateException] {
ds2.writeStream.format("parquet")
.option("checkpointLocation", chkLocation).start(dataLocation)
}
assert(e.getMessage.contains("same id"))
} finally {
query1.stop()
}
assert(e.getMessage.contains("same id"))
} finally {
query1.stop()
}
}
}

testQuietly("new instance of the same streaming query stops old query in the same session") {
withSQLConf(SQLConf.STOP_RUNNING_DUPLICATE_STREAM.key -> "true") {
withTempDir { dir =>
val (ms1, ds1) = makeDataset
val (ms2, ds2) = makeDataset
val chkLocation = new File(dir, "_checkpoint").getCanonicalPath
val dataLocation = new File(dir, "data").getCanonicalPath

val query1 = ds1.writeStream.format("parquet")
.option("checkpointLocation", chkLocation).start(dataLocation)
ms1.addData(1, 2, 3)
val query2 = ds2.writeStream.format("parquet")
.option("checkpointLocation", chkLocation).start(dataLocation)
ms2.addData(1, 2, 3)
query2.processAllAvailable()

assert(!query1.isActive, "First query should have stopped before starting the second query")
}
}
}

testQuietly(
"can't start multiple instances of the same streaming query in the different sessions") {
withTempDir { dir =>
val session2 = spark.cloneSession()

val ms1 = MemoryStream(Encoders.INT, spark.sqlContext)
val ds2 = MemoryStream(Encoders.INT, session2.sqlContext).toDS()
val chkLocation = new File(dir, "_checkpoint").getCanonicalPath
val dataLocation = new File(dir, "data").getCanonicalPath
withSQLConf(SQLConf.STOP_RUNNING_DUPLICATE_STREAM.key -> "false") {
withTempDir { dir =>
val session2 = spark.cloneSession()

val ms1 = MemoryStream(Encoders.INT, spark.sqlContext)
val ds2 = MemoryStream(Encoders.INT, session2.sqlContext).toDS()
val chkLocation = new File(dir, "_checkpoint").getCanonicalPath
val dataLocation = new File(dir, "data").getCanonicalPath

val query1 = ms1.toDS().writeStream.format("parquet")
.option("checkpointLocation", chkLocation).start(dataLocation)
ms1.addData(1, 2, 3)
try {
val e = intercept[IllegalStateException] {
ds2.writeStream.format("parquet")
.option("checkpointLocation", chkLocation).start(dataLocation)
}
assert(e.getMessage.contains("same id"))
} finally {
query1.stop()
Contributor:
stop all active streams
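One way to honor that, as a sketch (assumes both sessions are in scope at this point in the test):

Seq(spark, session2).foreach { session =>
  session.streams.active.foreach(_.stop())
}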

}
}
}
}

val query1 = ms1.toDS().writeStream.format("parquet")
.option("checkpointLocation", chkLocation).start(dataLocation)
ms1.addData(1, 2, 3)
try {
val e = intercept[IllegalStateException] {
ds2.writeStream.format("parquet")
.option("checkpointLocation", chkLocation).start(dataLocation)
testQuietly(
"new instance of the same streaming query stops old query in a different session") {
withSQLConf(SQLConf.STOP_RUNNING_DUPLICATE_STREAM.key -> "true") {
withTempDir { dir =>
val session2 = spark.cloneSession()

val ms1 = MemoryStream(Encoders.INT, spark.sqlContext)
val ds2 = MemoryStream(Encoders.INT, session2.sqlContext).toDS()
val chkLocation = new File(dir, "_checkpoint").getCanonicalPath
val dataLocation = new File(dir, "data").getCanonicalPath

val query1 = ms1.toDS().writeStream.format("parquet")
.option("checkpointLocation", chkLocation).start(dataLocation)
ms1.addData(1, 2, 3)
val query2 = ds2.writeStream.format("parquet")
.option("checkpointLocation", chkLocation).start(dataLocation)
try {
ms1.addData(1, 2, 3)
query2.processAllAvailable()

assert(!query1.isActive,
"First query should have stopped before starting the second query")
} finally {
query2.stop()
}
assert(e.getMessage.contains("same id"))
} finally {
query1.stop()
}
}
}
Expand Down