docs/streaming-kinesis-integration.md
A Kinesis stream can be set up at one of the valid Kinesis endpoints with 1 or more shards.
</div>
</div>

- `streamingContext`: StreamingContext containing an application name used by Kinesis to tie this Kinesis application to the Kinesis stream

- `[Kinesis stream name]`: The Kinesis stream that this streaming application receives from
    - The application name used in the streaming context becomes the Kinesis application name
    - The application name must be unique for a given account and region.
    - The Kinesis backend automatically associates the application name to the Kinesis stream using a DynamoDB table (always in the us-east-1 region) created during Kinesis Client Library initialization.
    - Changing the application name or stream name can lead to Kinesis errors in some cases. If you see errors, you may need to manually delete the DynamoDB table.

- `[endpoint URL]`: Valid Kinesis endpoint URLs can be found [here](http://docs.aws.amazon.com/general/latest/gr/rande.html#ak_region).

- `[checkpoint interval]`: The interval (e.g., Duration(2000) = 2 seconds) at which the Kinesis Client Library saves its position in the stream. For starters, set it to the same as the batch interval of the streaming application.

- `[initial position]`: Can be either `InitialPositionInStream.TRIM_HORIZON` or `InitialPositionInStream.LATEST` (see Kinesis Checkpointing section and Amazon Kinesis API documentation for more details).
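The parameter guidance above can be modeled as a small helper. This is a Python illustration only (the function and field names are hypothetical, not Spark's Kinesis API): it defaults the checkpoint interval to the batch interval, as recommended, and restricts the initial position to the two valid values.

```python
# Illustration only: models the parameter guidance above. All names here are
# hypothetical; this is not the Spark Streaming Kinesis API.
VALID_INITIAL_POSITIONS = {"TRIM_HORIZON", "LATEST"}

def kinesis_receiver_params(stream_name, endpoint_url, batch_interval_ms,
                            checkpoint_interval_ms=None,
                            initial_position="TRIM_HORIZON"):
    """Bundle the receiver parameters, applying the documented defaults."""
    if initial_position not in VALID_INITIAL_POSITIONS:
        raise ValueError("initial position must be TRIM_HORIZON or LATEST")
    if checkpoint_interval_ms is None:
        # "For starters, set it to the same as the batch interval."
        checkpoint_interval_ms = batch_interval_ms
    return {
        "stream_name": stream_name,
        "endpoint_url": endpoint_url,
        "checkpoint_interval_ms": checkpoint_interval_ms,
        "initial_position": initial_position,
    }

# Hypothetical stream name and a real Kinesis regional endpoint, for shape only.
params = kinesis_receiver_params("mySparkStream",
                                 "https://kinesis.us-east-1.amazonaws.com",
                                 batch_interval_ms=2000)
```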

3. **Deploying:** Package `spark-streaming-kinesis-asl_{{site.SCALA_BINARY_VERSION}}` and its dependencies (except `spark-core_{{site.SCALA_BINARY_VERSION}}` and `spark-streaming_{{site.SCALA_BINARY_VERSION}}` which are provided by `spark-submit`) into the application JAR. Then use `spark-submit` to launch your application (see [Deploying section](streaming-programming-guide.html#deploying-applications) in the main programming guide).
73
73
*Points to remember at runtime:*
74
+

- Kinesis data processing is ordered per partition and occurs at-least once per message.

- Multiple applications can read from the same Kinesis stream. Kinesis will maintain the application-specific shard and checkpoint info in DynamoDB.

- A single Kinesis stream shard is processed by one input DStream at a time.
<p style="text-align: center;">
  <img src="img/streaming-kinesis-arch.png"
       title="Spark Streaming Kinesis Architecture"
       alt="Spark Streaming Kinesis Architecture"
       width="60%"
  />
  <!-- Images are downsized intentionally to improve quality on retina displays -->
</p>

- A single Kinesis input DStream can read from multiple shards of a Kinesis stream by creating multiple KinesisRecordProcessor threads.

- Multiple input DStreams running in separate processes/instances can read from a Kinesis stream.

- You never need more Kinesis input DStreams than the number of Kinesis stream shards as each input DStream will create at least one KinesisRecordProcessor thread that handles a single shard.

- Horizontal scaling is achieved by adding/removing Kinesis input DStreams (within a single process or across multiple processes/instances) - up to the total number of Kinesis stream shards per the previous point.

- The Kinesis input DStream will balance the load between all DStreams - even across processes/instances.

- The Kinesis input DStream will balance the load during re-shard events (merging and splitting) due to changes in load.

- As a best practice, it's recommended that you avoid re-shard jitter by over-provisioning when possible.

- Each Kinesis input DStream maintains its own checkpoint info. See the Kinesis Checkpointing section for more details.

- There is no correlation between the number of Kinesis stream shards and the number of RDD partitions/shards created across the Spark cluster during input DStream processing. These are 2 independent partitioning schemes.

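The scaling rules above (shards spread across DStreams, and no benefit from more DStreams than shards) can be modeled with a simple round-robin assignment. This Python sketch is an illustration only; it is not how the Kinesis Client Library's internal lease logic actually works.

```python
# Illustration of the scaling rules above: shards are spread evenly across
# input DStreams, and DStreams beyond the shard count would sit idle.
def assign_shards(num_shards, num_dstreams):
    """Round-robin shards over DStreams; returns {dstream_index: [shards]}."""
    useful_dstreams = min(num_dstreams, num_shards)  # extras get no shard
    assignment = {i: [] for i in range(num_dstreams)}
    for shard in range(num_shards):
        assignment[shard % useful_dstreams].append(shard)
    return assignment

# 4 shards over 2 DStreams: each DStream handles 2 shards.
print(assign_shards(4, 2))   # {0: [0, 2], 1: [1, 3]}
# 2 shards over 4 DStreams: 2 DStreams stay idle, matching the rule that
# more DStreams than shards buys nothing.
print(assign_shards(2, 4))   # {0: [0], 1: [1], 2: [], 3: []}
```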
#### Running the Example
To run the example,
- Download Spark source and follow the [instructions](building-with-maven.html) to build Spark with profile *-Pkinesis-asl*.

        mvn -Pkinesis-asl -DskipTests clean package

- Set up Kinesis stream (see earlier section) within AWS. Note the name of the Kinesis stream and the endpoint URL corresponding to the region where the stream was created.
...
This will push 1000 lines per second of 10 random numbers per line to the Kinesis stream. This data should then be received and processed by the running example.
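As a rough model of that payload (a hedged Python sketch, not the actual producer code shipped with Spark; the digit range is chosen arbitrarily for illustration), each line carries 10 space-separated random numbers:

```python
import random

def make_line(numbers_per_line=10):
    """One payload line: space-separated random single-digit numbers."""
    return " ".join(str(random.randint(0, 9)) for _ in range(numbers_per_line))

def make_batch(lines_per_second=1000):
    """One second's worth of payload lines, as pushed to the stream."""
    return [make_line() for _ in range(lines_per_second)]

batch = make_batch()
print(len(batch), "lines; first line:", batch[0])
```

In the real example these lines are put onto the Kinesis stream; here the batch is only built locally to show its shape.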
#### Kinesis Checkpointing
- Each Kinesis input DStream periodically stores the current position of the stream in the backing DynamoDB table. This allows the system to recover from failures and continue processing where the DStream left off.

- Checkpointing too frequently will cause excess load on the AWS checkpoint storage layer and may lead to AWS throttling. The provided example handles this throttling with a random-backoff-retry strategy.
- If no Kinesis checkpoint info exists when the input DStream starts, it will start either from the oldest record available (InitialPositionInStream.TRIM_HORIZON) or from the latest tip (InitialPositionInStream.LATEST). This is configurable.

- InitialPositionInStream.LATEST could lead to missed records if data is added to the stream while no input DStreams are running (and no checkpoint info is being stored).

- InitialPositionInStream.TRIM_HORIZON may lead to duplicate processing of records where the impact is dependent on checkpoint frequency and processing idempotency.
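The random-backoff-retry strategy mentioned above for handling checkpoint throttling can be sketched as follows. This is a generic Python illustration, not the code from the Spark example; `ThrottlingError` stands in for an AWS throttling response.

```python
import random
import time

class ThrottlingError(Exception):
    """Stands in for an AWS throttling response in this sketch."""

def with_random_backoff(op, max_retries=4, base_ms=100, sleep=time.sleep):
    """Retry op() on ThrottlingError, sleeping a random, growing backoff."""
    for attempt in range(max_retries + 1):
        try:
            return op()
        except ThrottlingError:
            if attempt == max_retries:
                raise
            # Randomized exponential backoff: spreads retries out so many
            # throttled clients do not retry DynamoDB in lockstep.
            backoff_ms = random.uniform(0, base_ms * (2 ** attempt))
            sleep(backoff_ms / 1000.0)

# Usage: an operation that is throttled twice, then succeeds.
calls = {"n": 0}
def flaky_checkpoint():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ThrottlingError()
    return "checkpoint saved"

print(with_random_backoff(flaky_checkpoint, sleep=lambda s: None))
# -> checkpoint saved
```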