[DOCS][MINOR] Fixed a few typos in the Structured Streaming documentation
Fixed a few typos.
There is one more I'm not sure of:
```
Append mode uses watermark to drop old aggregation state. But the output of a
windowed aggregation is delayed the late threshold specified in `withWatermark()` as by
the modes semantics, rows can be added to the Result Table only once after they are
```
Not sure how to change `is delayed the late threshold`.
Author: Seigneurin, Alexis (CONT) <[email protected]>
Closes #17443 from aseigneurin/typos.
File changed: docs/structured-streaming-programming-guide.md (9 additions, 9 deletions)
```diff
@@ -717,11 +717,11 @@ However, to run this query for days, it's necessary for the system to bound the
 intermediate in-memory state it accumulates. This means the system needs to know when an old
 aggregate can be dropped from the in-memory state because the application is not going to receive
 late data for that aggregate any more. To enable this, in Spark 2.1, we have introduced
-**watermarking**, which let's the engine automatically track the current event time in the data and
+**watermarking**, which lets the engine automatically track the current event time in the data
 and attempt to clean up old state accordingly. You can define the watermark of a query by
-specifying the event time column and the threshold on how late the data is expected be in terms of
+specifying the event time column and the threshold on how late the data is expected to be in terms of
 event time. For a specific window starting at time `T`, the engine will maintain state and allow late
-data to be update the state until `(max event time seen by the engine - late threshold > T)`.
+data to update the state until `(max event time seen by the engine - late threshold > T)`.
 In other words, late data within the threshold will be aggregated,
 but data later than the threshold will be dropped. Let's understand this with an example. We can
 easily define watermarking on the previous example using `withWatermark()` as shown below.
```
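As an aside, the retention condition in that hunk can be sketched outside Spark. The following is an illustrative Python sketch of the stated rule, not Spark's implementation; the function name, timestamps, and threshold are made up for the example:

```python
from datetime import datetime, timedelta

def can_update_state(max_event_time, late_threshold, window_start):
    # Per the condition quoted in the guide: a window starting at time T can
    # still accept late updates until
    # (max event time seen by the engine - late threshold > T).
    return not (max_event_time - late_threshold > window_start)

window_start = datetime(2017, 1, 1, 12, 0)  # a window starting at 12:00
threshold = timedelta(minutes=10)           # e.g. a "10 minutes" late threshold

# Max event time 12:05 -> effective watermark 11:55, not past 12:00 yet:
# late data for this window would still be aggregated.
print(can_update_state(datetime(2017, 1, 1, 12, 5), threshold, window_start))   # True

# Max event time 12:21 -> effective watermark 12:11, past 12:00:
# the old state can be dropped and later data for this window ignored.
print(can_update_state(datetime(2017, 1, 1, 12, 21), threshold, window_start))  # False
```

This only demonstrates the inequality from the prose; Spark's engine applies it per window across triggers.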
```diff
@@ -792,7 +792,7 @@ This watermark lets the engine maintain intermediate state for additional 10 min
 data to be counted. For example, the data `(12:09, cat)` is out of order and late, and it falls in
 windows `12:05 - 12:15` and `12:10 - 12:20`. Since, it is still ahead of the watermark `12:04` in
 the trigger, the engine still maintains the intermediate counts as state and correctly updates the
-counts of the related windows. However, when the watermark is updated to 12:11, the intermediate
+counts of the related windows. However, when the watermark is updated to `12:11`, the intermediate
 state for window `(12:00 - 12:10)` is cleared, and all subsequent data (e.g. `(12:04, donkey)`)
 is considered "too late" and therefore ignored. Note that after every trigger,
 the updated counts (i.e. purple rows) are written to sink as the trigger output, as dictated by
```
```diff
@@ -825,7 +825,7 @@ section for detailed explanation of the semantics of each output mode.
 same column as the timestamp column used in the aggregate. For example,
 `df.withWatermark("time", "1 min").groupBy("time2").count()` is invalid
 in Append output mode, as watermark is defined on a different column
-as the aggregation column.
+from the aggregation column.
 
 - `withWatermark` must be called before the aggregation for the watermark details to be used.
 For example, `df.groupBy("time").count().withWatermark("time", "1 min")` is invalid in Append
```
```diff
@@ -909,7 +909,7 @@ track of all the data received in the stream. This is therefore fundamentally ha
 efficiently.
 
 ## Starting Streaming Queries
-Once you have defined the final result DataFrame/Dataset, all that is left is for you start the streaming computation. To do that, you have to use the `DataStreamWriter`
+Once you have defined the final result DataFrame/Dataset, all that is left is for you to start the streaming computation. To do that, you have to use the `DataStreamWriter`
```