[SPARK-27420][DSTREAMS][Kinesis] KinesisInputDStream should expose a way to configure CloudWatch metrics #24651
…way to configure CloudWatch metrics

KinesisInputDStream currently does not provide a way to disable CloudWatch metrics push. Its default level is "DETAILED", which pushes tens of metrics every 10 seconds. When dealing with multiple streaming jobs this adds up pretty quickly, leading to thousands of dollars in cost. To address this problem, this PR adds interfaces for accessing KinesisClientLibConfiguration's `withMetrics` and `withMetricsEnabledDimensions` methods to KinesisInputDStream so that users can configure KCL's metrics levels and dimensions.
|
@brkyvz Would you take a look into this PR? I think you're an expert on Kinesis-Streaming integration and have done several contributions and reviews on it. |
|
ok to test. |
|
Test build #109327 has finished for PR 24651 at commit
|
|
Test build #109330 has finished for PR 24651 at commit
|
|
Hmm, I'm not sure why the CI failed on PySpark, because it succeeds with the same options in my local environment. |
|
Let's retry test on Jenkins and confirm whether the failure is caused by flaky tests. |
|
retest this please. |
...rnal/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/KinesisInputDStream.scala
|
Test build #109590 has finished for PR 24651 at commit
|
|
Test build #109640 has finished for PR 24651 at commit
|
|
retest this please. |
|
retest this please. |
|
Hello, I am new to this community. How can I start contributing? |
|
If it's still failing one can merge the latest master on top of this change. |
|
Hi @suraj95, could you refer to the contribution guide?
gaborgsomogyi
left a comment
Basically looks good.
Maybe this feature can be mentioned in streaming-kinesis-integration.md.
|
Test build #110004 has finished for PR 24651 at commit
|
|
The last unit test failure might be irrelevant to the |
|
Test build #110008 has finished for PR 24651 at commit
|
srowen
left a comment
Very minor comment about docs.
Seems OK, just plumbing through another optional config.
 * [[KinesisClientLibConfiguration.DEFAULT_METRICS_ENABLED_DIMENSIONS]]
 * if no custom value is specified.
 *
 * @param metricsEnabledDimensions [[Set[String]]] to specify
This is a small thing, but I know we have had problems generating scaladoc with references to library code. It might be worth building the doc HTML (only) as in https://github.com/apache/spark/blob/master/docs/README.md to make sure.
You don't really have to link a type like Set.
Also we usually do a continuation indent of two spaces.
 * @param metricsEnabledDimensions [[Set[String]]] to specify
 * the enabled CloudWatch metrics dimensions
 * @return Reference to this [[KinesisInputDStream.Builder]]
 * @see [[https://docs.aws.amazon.com/streams/latest/dev/monitoring-with-kcl.html#metric-levels]]
If it helps avoid turning off scalastyle, you could just write:
See
[[...]]]
in the main body of the doc above. That seems short enough.
* Kinesis integration documentation
  * Add explanation and usage examples for the new API
  * Fix the existing Java grammatical mistakes
* Scaladoc
  * Insert newlines between the @see directives and URLs to avoid disabling scalastyle
  * Remove unnecessary brackets around a standard class
  * Use two spaces for a continuation indent
|
@gaborgsomogyi @srowen Thank you for the review! I've just updated the PR following your advice. As @srowen pointed out, some references to other classes (e.g., KinesisClientLibConfiguration and MetricsLevel) in the scaladoc don't seem to work. However, generating the documents itself succeeds and the output is readable, so I left the brackets as they were. |
|
Test build #110278 has finished for PR 24651 at commit
|
|
Merged to master |
|
We usually don't back-port minor features to maintenance branches. I think we'd want to see that it's of broader interest before doing so. Can you continue to work around it in 2.4? |
…way to configure CloudWatch metrics

## What changes were proposed in this pull request?

KinesisInputDStream currently does not provide a way to disable CloudWatch metrics push. Its default level is "DETAILED", which pushes tens of metrics every 10 seconds. When dealing with multiple streaming jobs this adds up pretty quickly, leading to thousands of dollars in cost. To address this problem, this PR adds interfaces for accessing KinesisClientLibConfiguration's `withMetrics` and `withMetricsEnabledDimensions` methods to KinesisInputDStream so that users can configure KCL's metrics levels and dimensions.

## How was this patch tested?

By running updated unit tests in KinesisInputDStreamBuilderSuite. In addition, I ran a Streaming job with MetricsLevel.NONE and confirmed:

* there's no data point for the "Operation", "Operation, ShardId" and "WorkerIdentifier" dimensions on the AWS management console
* there's no DEBUG level message from Amazon KCL, such as "Successfully published xx datums."

Please review http://spark.apache.org/contributing.html before opening a pull request.

Closes apache#24651 from sekikn/SPARK-27420.

Authored-by: Kengo Seki <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
What changes were proposed in this pull request?
KinesisInputDStream currently does not provide a way to disable
CloudWatch metrics push. Its default level is "DETAILED", which pushes
tens of metrics every 10 seconds. When dealing with multiple streaming
jobs this adds up pretty quickly, leading to thousands of dollars in cost.
To address this problem, this PR adds interfaces for accessing
KinesisClientLibConfiguration's
withMetrics and withMetricsEnabledDimensions methods to KinesisInputDStream so that users can configure KCL's metrics levels and dimensions.
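For readers who want to see what this looks like from the caller's side, here is a minimal Scala sketch. It assumes the new builder methods are named `metricsLevel` and `metricsEnabledDimensions` (the second name appears in the scaladoc under review; the first is an assumption based on the merged API), and it needs the `spark-streaming-kinesis-asl` module on the classpath. The object name, function name, stream name, endpoint, region and application name are all hypothetical placeholders, not values from this PR.

```scala
import com.amazonaws.services.kinesis.metrics.interfaces.MetricsLevel
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.ReceiverInputDStream
import org.apache.spark.streaming.kinesis.{KinesisInitialPositions, KinesisInputDStream}

object KinesisMetricsSketch {
  // Builds a Kinesis DStream with an explicit CloudWatch metrics level.
  // The default argument MetricsLevel.NONE turns the metrics push off.
  def buildStream(
      ssc: StreamingContext,
      level: MetricsLevel = MetricsLevel.NONE): ReceiverInputDStream[Array[Byte]] =
    KinesisInputDStream.builder
      .streamingContext(ssc)
      .streamName("example-stream")                            // placeholder
      .endpointUrl("https://kinesis.us-east-1.amazonaws.com")  // placeholder
      .regionName("us-east-1")                                 // placeholder
      .initialPosition(new KinesisInitialPositions.Latest())
      .checkpointAppName("example-app")                        // placeholder
      .checkpointInterval(Seconds(10))
      .storageLevel(StorageLevel.MEMORY_AND_DISK_2)
      // Assumed builder methods wrapping KCL's withMetrics and
      // withMetricsEnabledDimensions.
      .metricsLevel(level)
      .metricsEnabledDimensions(Set("Operation"))
      .build()
}
```

Calling `KinesisMetricsSketch.buildStream(ssc)` corresponds to the MetricsLevel.NONE setup described in the test notes; passing `MetricsLevel.SUMMARY` or `MetricsLevel.DETAILED` instead would keep metrics on, limited to the listed dimensions.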
How was this patch tested?
By running updated unit tests in KinesisInputDStreamBuilderSuite.
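In addition, I ran a Streaming job with MetricsLevel.NONE and confirmed:
* there's no data point for the "Operation", "Operation, ShardId" and "WorkerIdentifier" dimensions on the AWS management console
* there's no DEBUG level message from Amazon KCL, such as "Successfully published xx datums."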
Please review http://spark.apache.org/contributing.html before opening a pull request.