[SPARK-30669][SS] Introduce AdmissionControl APIs for StructuredStreaming #27380
Conversation
Test build #117500 has finished for PR 27380 at commit
Test build #117503 has finished for PR 27380 at commit
}
}.toMap
case (s, _) =>
  // for some reason, the compiler is unhappy
Do you mean Match is not exhaustive?
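For context, this is the standard Scala exhaustivity warning. A minimal, self-contained illustration (the Shape types here are made up for the example):

sealed trait Shape
case class Circle(r: Double) extends Shape
case class Square(side: Double) extends Shape

def area(shape: Shape): Double = shape match {
  case Circle(r) => math.Pi * r * r
  // No case for Square: the compiler warns
  // "match may not be exhaustive. It would fail on the following input: Square(_)"
}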
Test build #117502 has finished for PR 27380 at commit
Test build #117505 has finished for PR 27380 at commit
Test build #117508 has finished for PR 27380 at commit
zsxwing left a comment
LGTM except one nit.
}
}

test("maxFilesPerTrigger: ignored when using Trigger.Once") {
super nit: missing the JIRA id.
I'm seeing that the indentation of the Java files is done with 4 spaces, whereas we seem to use 2 spaces for Java files as well (I checked a sample of Java files, e.g. https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/connector/read/V1Scan.java). While it would be even better if we could catch this in checkstyle (the linter), could we make the change here for now?
Thanks @HeartSaVioR. Addressed.
HeartSaVioR left a comment
Found a couple of broken indentations.
Btw, I haven't looked at the details of the changes yet; I'll catch up on them soon, but please go ahead once the indentation is fixed.
def latestOffset(start: Offset): Offset

static ReadLimit allAvailable() {
    return ReadAllAvailable.SINGLETON;
nit: still have 4 spaces
 * the data source.
 */
default ReadLimit getDefaultReadLimit() {
    return ReadLimit.allAvailable();
nit: still have 4 spaces
}

override def latestOffset(): Offset = {
  throw new IllegalStateException(
Is UnsupportedOperationException better in this context?
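If so, the change would be a one-line swap against the hunk quoted above, e.g.:
- throw new IllegalStateException(
+ throw new UnsupportedOperationException(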
/** Returns the maximum available offset for this source. */
override def getOffset: Option[Offset] = {
  throw new IllegalStateException(
UnsupportedOperationException?
 */
@Evolving
public final class ReadAllAvailable implements ReadLimit {
  static final ReadAllAvailable SINGLETON = new ReadAllAvailable();
If you don't mind, shall we use INSTANCE instead of SINGLETON?
$ git grep " SINGLETON =" | wc -l
0
$ git grep " INSTANCE =" | wc -l
5
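That is, roughly:
- static final ReadAllAvailable SINGLETON = new ReadAllAvailable();
+ static final ReadAllAvailable INSTANCE = new ReadAllAvailable();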
spark/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/TableChange.java
Line 315 in 5916c7d
private static final First SINGLETON = new First();
I believe that we can change that, too. I'll make a follow-up for that.
On master, that is the only one, right?
 */
@Evolving
public class ReadMaxFiles implements ReadLimit {
  private int files;
files -> maxFiles? Or can we have a better name?
I think it's consistent this way with rows and maxRows.
val batchFiles = limit match {
  case files: ReadMaxFiles =>
    newFiles.take(files.maxFiles())
  case all: ReadAllAvailable =>
nit.
- case all: ReadAllAvailable =>
+ case _: ReadAllAvailable =>
override def getOffset: Option[Offset] = Some(fetchMaxOffset()).filterNot(_.logOffset == -1)
override def getOffset: Option[Offset] = {
  throw new IllegalStateException(
UnsupportedOperationException?
case source: SupportsAdmissionControl =>
  val limit = source.getDefaultReadLimit
  if (trigger == OneTimeTrigger && limit != ReadLimit.allAvailable()) {
    logWarning(s"The read limit $limit for $source is ignored when Trigger.Once() is used.")
I'm wondering if we can do this in the Analyzer?
Triggers are a property of the system, not the query, so I don't think this fits into analysis.
Test build #117592 has finished for PR 27380 at commit
Test build #117597 has finished for PR 27380 at commit
Thanks all for the review. Merging to master!
Wait, it seems like the latest tests haven't finished running. Holding off for now.
Thank you for updating, @brkyvz.
Test build #117606 has finished for PR 27380 at commit
Test build #117607 has finished for PR 27380 at commit
We need to use …
Hi, now that you have removed this ability, I assume you can suggest a better way to go?
Joining the above question: how can we achieve rate limiting with Trigger.Once?
What changes were proposed in this pull request?
We propose to add two new interfaces, SupportsAdmissionControl and ReadLimit. A ReadLimit defines how much data should be read in the next micro-batch. SupportsAdmissionControl specifies that a source can rate limit its ingest into the system. The source can tell the system what the user specified as a read limit, and the system can enforce this limit within each micro-batch, or impose its own limit if, for example, the trigger is Trigger.Once(). We then use this interface in FileStreamSource, KafkaSource, and KafkaMicroBatchStream.
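As a rough sketch of the shape of the proposed API (a minimal, self-contained Scala mirror of the snippets quoted in the review above; the real interfaces are Java interfaces under org.apache.spark.sql.connector.read.streaming, so the types below are illustrative stand-ins, not the exact definitions):

trait Offset

// How much data the next micro-batch is allowed to read.
sealed trait ReadLimit
case object ReadAllAvailable extends ReadLimit                 // cf. ReadLimit.allAvailable()
final case class ReadMaxFiles(maxFiles: Int) extends ReadLimit // cf. ReadLimit.maxFiles(n)

trait SupportsAdmissionControl {
  // The limit derived from user-specified options such as maxFilesPerTrigger;
  // defaults to reading everything that is available.
  def getDefaultReadLimit: ReadLimit = ReadAllAvailable

  // The latest offset the next micro-batch may read up to, honoring the limit
  // the engine passes in (which may differ from the default, e.g. for Trigger.Once()).
  def latestOffset(start: Offset, limit: ReadLimit): Offset
}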
Why are the changes needed?
Sources currently have no information about execution semantics, such as whether the stream is being executed in Trigger.Once() mode. This interface passes that information to the sources as part of planning. With a trigger like Trigger.Once(), the semantics are to process all of the data available to the data source in a single micro-batch. However, without this interface, that semantic can be broken when data source options such as maxOffsetsPerTrigger (in the Kafka source) rate limit the amount of data read for that micro-batch.
Does this PR introduce any user-facing change?
Data source developers can extend this interface for their streaming sources to add admission control to their systems and correctly support Trigger.Once().
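For instance, continuing the hypothetical Scala mirror from the sketch above (ExampleFileSource, newFiles, and the maxFilesPerTrigger parameter are all made up for illustration), a file-based source could honor the limit roughly the way the quoted FileStreamSource hunk does:

// A toy source: newFiles stands in for the files discovered since the
// last committed offset.
class ExampleFileSource(newFiles: Seq[String], maxFilesPerTrigger: Option[Int])
    extends SupportsAdmissionControl {

  // Surface the user-specified option as a ReadLimit.
  override def getDefaultReadLimit: ReadLimit =
    maxFilesPerTrigger.map(ReadMaxFiles(_)).getOrElse(ReadAllAvailable)

  override def latestOffset(start: Offset, limit: ReadLimit): Offset = {
    val batchFiles = limit match {
      case ReadMaxFiles(maxFiles) => newFiles.take(maxFiles) // respect the cap
      case _                      => newFiles                // ReadAllAvailable
    }
    // ... record batchFiles and derive this micro-batch's ending offset ...
    new Offset {}
  }
}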
How was this patch tested?
Existing tests, as this API is mostly internal.