
Conversation

@brkyvz
Contributor

@brkyvz brkyvz commented Jan 29, 2020

What changes were proposed in this pull request?

We propose to add two new interfaces, SupportsAdmissionControl and ReadLimit. A ReadLimit defines how much data should be read in the next micro-batch. SupportsAdmissionControl specifies that a source can rate limit its ingest into the system. The source can tell the system what the user specified as a read limit, and the system can enforce this limit within each micro-batch, or impose its own limit when, for example, the Trigger is Trigger.Once().

We then use this interface in FileStreamSource, KafkaSource, and KafkaMicroBatchStream.
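
For illustration, here is a minimal Scala sketch of a source implementing the new interface. The CountOffset class, the RateLimitedStream name, and the available() callback are hypothetical stand-ins; only the SupportsAdmissionControl and ReadLimit methods reflect the API proposed here:

import org.apache.spark.sql.connector.read.streaming.{
  Offset, ReadLimit, ReadMaxRows, SupportsAdmissionControl}

// Hypothetical offset that simply counts records; not part of this PR.
case class CountOffset(count: Long) extends Offset {
  override def json: String = count.toString
}

// Sketch of a source honoring a user-configured per-batch row cap, where
// available() returns the total number of records currently readable.
class RateLimitedStream(available: () => Long, maxRowsPerBatch: Option[Long])
  extends SupportsAdmissionControl {

  // Surface the user-specified limit; the engine may replace it,
  // e.g. with ReadLimit.allAvailable() under Trigger.Once().
  override def getDefaultReadLimit: ReadLimit =
    maxRowsPerBatch.map(ReadLimit.maxRows(_)).getOrElse(ReadLimit.allAvailable())

  // Advance the end offset of the next batch no further than the limit allows.
  override def latestOffset(start: Offset, limit: ReadLimit): Offset = {
    val from = start.asInstanceOf[CountOffset].count
    limit match {
      case rows: ReadMaxRows => CountOffset(math.min(available(), from + rows.maxRows()))
      case _ => CountOffset(available())
    }
  }

  override def initialOffset(): Offset = CountOffset(0L)
  override def deserializeOffset(json: String): Offset = CountOffset(json.toLong)
  override def commit(end: Offset): Unit = ()
  override def stop(): Unit = ()
}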

Why are the changes needed?

Sources currently have no information about execution semantics, such as whether the stream is being executed in Trigger.Once() mode. This interface passes that information to the sources as part of planning. With a trigger like Trigger.Once(), the semantics are to process all of the data available to the data source in a single micro-batch. Without this interface, however, those semantics can be broken when data source options such as maxOffsetsPerTrigger (in the Kafka source) rate limit the amount of data read in that micro-batch.
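
For example, a query like the following (a sketch; the broker address, topic, and paths are placeholders) combines maxOffsetsPerTrigger with Trigger.Once(); before this change, the single micro-batch would be capped at 10000 offsets, leaving the rest of the available data unprocessed:

import org.apache.spark.sql.streaming.Trigger

// maxOffsetsPerTrigger caps each micro-batch, but Trigger.Once() runs
// exactly one micro-batch, so the two settings conflict.
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092") // placeholder
  .option("subscribe", "events")                    // placeholder
  .option("maxOffsetsPerTrigger", "10000")
  .load()

df.writeStream
  .format("parquet")
  .option("path", "/data/out")               // placeholder
  .option("checkpointLocation", "/data/chk") // placeholder
  .trigger(Trigger.Once())
  .start()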

Does this PR introduce any user-facing change?

DataSource developers can extend this interface for their streaming sources to add admission control into their system and correctly support Trigger.Once().

How was this patch tested?

Existing tests, as this API is mostly internal.

@SparkQA

SparkQA commented Jan 29, 2020

Test build #117500 has finished for PR 27380 at commit 980719a.

  • This patch fails Scala style tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@brkyvz brkyvz requested a review from zsxwing January 29, 2020 02:45
@brkyvz
Contributor Author

brkyvz commented Jan 29, 2020

cc @tdas and @zsxwing

@SparkQA

SparkQA commented Jan 29, 2020

Test build #117503 has finished for PR 27380 at commit 4ed77e6.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

}
}.toMap
case (s, _) =>
// for some reason, the compiler is unhappy
Member


Do you mean Match is not exhaustive?

@SparkQA

SparkQA commented Jan 29, 2020

Test build #117502 has finished for PR 27380 at commit 9563c5b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 29, 2020

Test build #117505 has finished for PR 27380 at commit 06256bb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 29, 2020

Test build #117508 has finished for PR 27380 at commit a9c6897.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@zsxwing zsxwing left a comment


LGTM except one nit.

}
}

test("maxFilesPerTrigger: ignored when using Trigger.Once") {
Member


super nit: missing the JIRA id.

@HeartSaVioR
Contributor

I'm seeing that the indentation of the Java files here is done with 4 spaces, whereas we seem to use 2 spaces for Java files as well (I checked a sample of Java files in sql/core).

https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/connector/read/V1Scan.java
https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/connector/write/V1WriteBuilder.java

While it would be even better if we could catch this in checkstyle (the linter), could we make the change here for now?

@brkyvz
Contributor Author

brkyvz commented Jan 30, 2020

Thanks @HeartSaVioR. Addressed.

Contributor

@HeartSaVioR HeartSaVioR left a comment


Found a couple of broken indentations.

Btw, I haven't looked at the details of the changes yet; I'll catch up on them soon, but please go ahead once the indentation is fixed.


def latestOffset(start: Offset): Offset
static ReadLimit allAvailable() {
    return ReadAllAvailable.SINGLETON;
Contributor


nit: still have 4 spaces

 * the data source.
 */
default ReadLimit getDefaultReadLimit() {
    return ReadLimit.allAvailable();
Contributor


nit: still have 4 spaces

}

override def latestOffset(): Offset = {
  throw new IllegalStateException(
Member


Is UnsupportedOperationException better in this context?


/** Returns the maximum available offset for this source. */
override def getOffset: Option[Offset] = {
  throw new IllegalStateException(
Member


UnsupportedOperationException?

 */
@Evolving
public final class ReadAllAvailable implements ReadLimit {
  static final ReadAllAvailable SINGLETON = new ReadAllAvailable();
Member


If you don't mind, shall we use INSTANCE instead of SINGLETON?

$ git grep " SINGLETON =" | wc -l
       0
$ git grep " INSTANCE =" | wc -l
       5

Contributor Author


private static final First SINGLETON = new First();

Member


I believe that we can change that, too. I'll make a follow-up for that.

Member


On the master, that is the only one, right?

 */
@Evolving
public class ReadMaxFiles implements ReadLimit {
  private int files;
Member

@dongjoon-hyun dongjoon-hyun Jan 31, 2020


files -> maxFiles? Or, can we have a better name?

Contributor Author


I think it's consistent this way with rows and maxRows.

val batchFiles = limit match {
  case files: ReadMaxFiles =>
    newFiles.take(files.maxFiles())
  case all: ReadAllAvailable =>
Member


nit.

-      case all: ReadAllAvailable =>
+      case _: ReadAllAvailable =>


override def getOffset: Option[Offset] = Some(fetchMaxOffset()).filterNot(_.logOffset == -1)
override def getOffset: Option[Offset] = {
  throw new IllegalStateException(
Member


UnsupportedOperationException?

case source: SupportsAdmissionControl =>
  val limit = source.getDefaultReadLimit
  if (trigger == OneTimeTrigger && limit != ReadLimit.allAvailable()) {
    logWarning(s"The read limit $limit for $source is ignored when Trigger.Once() is used.")
Member


I'm wondering if we can do this in the Analyzer?

Contributor Author


Triggers are a property of the system, not the query, so I don't think this fits into analysis.

@SparkQA

SparkQA commented Jan 31, 2020

Test build #117592 has finished for PR 27380 at commit 0548ba9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 31, 2020

Test build #117597 has finished for PR 27380 at commit 895fb87.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@brkyvz
Contributor Author

brkyvz commented Jan 31, 2020

Thanks all for the review. Merging to master!

@brkyvz
Contributor Author

brkyvz commented Jan 31, 2020

Wait, it seems the latest tests haven't finished running. Holding off for now.

@dongjoon-hyun
Member

Thank you for updating, @brkyvz .

@SparkQA

SparkQA commented Jan 31, 2020

Test build #117606 has finished for PR 27380 at commit bdbfa11.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 31, 2020

Test build #117607 has finished for PR 27380 at commit de5f486.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@asfgit asfgit closed this in 1cd19ad Jan 31, 2020
@aminh73

aminh73 commented Nov 25, 2020

We need to use maxOffsetsPerTrigger in the Kafka source with Trigger.Once(), but it seems to read allAvailable in Spark 3. Is there a way to achieve rate limiting in this situation?

@GrigorievNick

Hi,
I know that these changes are already in Spark 3, but I have a question: how can I configure backpressure for my job when I want to use Trigger.Once()?
In Spark 2.4 I had a use case: backfill some data and then start the stream. So I used Trigger.Once(), but my backfill scenario can be very, very big, and it sometimes puts too big a load on my disks (because of shuffles) and on driver memory (because the FileIndex is cached there). So I used maxOffsetsPerTrigger and maxFilesPerTrigger to control how much data Spark processes; that's how I configured backpressure.

Now that this ability is removed, can you suggest a better way to go?

@lior-k

lior-k commented Jul 20, 2021

Joining the question above: how can we achieve rate limiting with Trigger.Once()?

