
Conversation

@sekikn
Contributor

@sekikn sekikn commented May 20, 2019

What changes were proposed in this pull request?

KinesisInputDStream currently does not provide a way to disable
CloudWatch metrics push. Its default level is "DETAILED", which pushes
tens of metrics every 10 seconds. With multiple streaming jobs, this
adds up quickly, leading to thousands of dollars in cost.
To address this problem, this PR adds interfaces for accessing
KinesisClientLibConfiguration's `withMetrics` and
`withMetricsEnabledDimensions` methods to KinesisInputDStream
so that users can configure KCL's metrics levels and dimensions.
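
The plumbing described above can be sketched in plain Scala. Everything in this sketch is illustrative: `MetricsLevel`, `KclConfig`, and `DStreamBuilder` are stand-ins for Amazon KCL's MetricsLevel and KinesisClientLibConfiguration and for KinesisInputDStream.Builder, not the actual classes, and the default dimension set is a placeholder.

```scala
// Illustrative stand-in for Amazon KCL's MetricsLevel (not the real class).
object MetricsLevel extends Enumeration {
  val NONE, SUMMARY, DETAILED = Value
}

// KCL-style immutable config whose `with*` methods return an updated copy.
// DETAILED is the default level, matching the behavior described above;
// the default dimension set here is only a placeholder.
case class KclConfig(
    metricsLevel: MetricsLevel.Value = MetricsLevel.DETAILED,
    metricsEnabledDimensions: Set[String] = Set("Operation", "ShardId")) {
  def withMetrics(level: MetricsLevel.Value): KclConfig =
    copy(metricsLevel = level)
  def withMetricsEnabledDimensions(dims: Set[String]): KclConfig =
    copy(metricsEnabledDimensions = dims)
}

// A builder that records the metrics settings only if the caller supplied
// them, and applies them when constructing the KCL config -- so KCL's own
// defaults still apply when the user sets nothing.
class DStreamBuilder {
  private var maybeLevel: Option[MetricsLevel.Value] = None
  private var maybeDims: Option[Set[String]] = None

  def metricsLevel(level: MetricsLevel.Value): DStreamBuilder = {
    maybeLevel = Some(level); this
  }
  def metricsEnabledDimensions(dims: Set[String]): DStreamBuilder = {
    maybeDims = Some(dims); this
  }
  def buildKclConfig(): KclConfig = {
    var conf = KclConfig()
    maybeLevel.foreach(l => conf = conf.withMetrics(l))
    maybeDims.foreach(d => conf = conf.withMetricsEnabledDimensions(d))
    conf
  }
}

// Disabling metrics push entirely, as tested in this PR with MetricsLevel.NONE.
val conf = new DStreamBuilder().metricsLevel(MetricsLevel.NONE).buildKclConfig()
println(conf.metricsLevel)
```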

How was this patch tested?

By running updated unit tests in KinesisInputDStreamBuilderSuite.
In addition, I ran a Streaming job with MetricsLevel.NONE and confirmed:

  • there's no data point for the "Operation", "Operation, ShardId" and "WorkerIdentifier" dimensions on the AWS management console
  • there's no DEBUG level message from Amazon KCL, such as "Successfully published xx datums."

Please review http://spark.apache.org/contributing.html before opening a pull request.

@sekikn
Contributor Author

sekikn commented Aug 19, 2019

@brkyvz Would you take a look at this PR? I think you're an expert on the Kinesis-Streaming integration and have done several contributions and reviews on it.

@sarutak
Member

sarutak commented Aug 19, 2019

ok to test.

@SparkQA

SparkQA commented Aug 19, 2019

Test build #109327 has finished for PR 24651 at commit c4ada94.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 19, 2019

Test build #109330 has finished for PR 24651 at commit a7800e6.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sekikn
Contributor Author

sekikn commented Aug 22, 2019

Hmm, I'm not sure why the CI failed on PySpark; it succeeds with the same options in my local environment.

$ git checkout master
$ curl -sLO https://github.com/apache/spark/pull/24651.patch
$ git apply 24651.patch
$ build/mvn clean install -Pkinesis-asl -DskipTests

(snip)

[INFO] Spark Project Kinesis Assembly ..................... SUCCESS [  5.044 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  20:39 min
[INFO] Finished at: 2019-08-22T15:20:42+09:00
[INFO] ------------------------------------------------------------------------
$ ENABLE_KINESIS_TESTS=1 python/run-tests --modules=pyspark-streaming,pyspark-mllib,pyspark-ml
Running PySpark tests. Output is in /home/sekikn/repos/spark/python/unit-tests.log
Will test against the following Python executables: ['python2.7', 'python3.6', 'pypy']
Will test the following Python modules: ['pyspark-streaming', 'pyspark-mllib', 'pyspark-ml']
Starting test(pypy): pyspark.streaming.tests.test_dstream
Starting test(pypy): pyspark.streaming.tests.test_listener
Starting test(pypy): pyspark.streaming.tests.test_context
Starting test(pypy): pyspark.streaming.tests.test_kinesis

(snip)

Finished test(pypy): pyspark.streaming.tests.test_kinesis (178s)

(snip)

Tests passed in 1140 seconds

Skipped tests in pyspark.ml.tests.test_image with python2.7:
    test_read_images_multiple_times (pyspark.ml.tests.test_image.ImageFileFormatOnHiveContextTest) ... skipped 'Hive is not available.'

Skipped tests in pyspark.ml.tests.test_image with python3.6:
    test_read_images_multiple_times (pyspark.ml.tests.test_image.ImageFileFormatOnHiveContextTest) ... skipped 'Hive is not available.'
$ echo $?
0

@sarutak
Member

sarutak commented Aug 22, 2019

Let's retry the test on Jenkins and confirm whether the failure is caused by flaky tests.

@sarutak
Member

sarutak commented Aug 22, 2019

retest this please.

@SparkQA

SparkQA commented Aug 22, 2019

Test build #109590 has finished for PR 24651 at commit a7800e6.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 23, 2019

Test build #109640 has finished for PR 24651 at commit fd2229c.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

sarutak previously approved these changes Aug 23, 2019
@sekikn
Contributor Author

sekikn commented Sep 2, 2019

retest this please.

1 similar comment
@sarutak
Member

sarutak commented Sep 2, 2019

retest this please.

@suraj95

suraj95 commented Sep 2, 2019

Hello,

I am new to this community. How can I start contributing?

@gaborgsomogyi
Contributor

If it's still failing, one can merge the latest master on top of this change.

@sarutak
Member

sarutak commented Sep 2, 2019

Hi @suraj95, could you refer to the contribution guide?
Also, please post general questions to the user mailing list next time.
You can subscribe to the mailing list by following these instructions.

Contributor

@gaborgsomogyi gaborgsomogyi left a comment


Basically looks good.
Maybe this feature can be mentioned in streaming-kinesis-integration.md.

@SparkQA

SparkQA commented Sep 2, 2019

Test build #110004 has finished for PR 24651 at commit fd2229c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sarutak
Member

sarutak commented Sep 2, 2019

The last unit test failure might be unrelated to the AWS SDK stuff, but it's better to merge master and push again. @sekikn, could you do that?
I also think it's good to document the new configuration.
@sekikn, you can include the documentation within this PR; otherwise we can open a follow-up PR.

@SparkQA

SparkQA commented Sep 2, 2019

Test build #110008 has finished for PR 24651 at commit d487679.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sarutak
Member

sarutak commented Sep 2, 2019

LGTM, but I'd like one more committer to review this too, because I'm not so familiar with Kinesis.
@srowen @brkyvz Can either of you take a look at this?

Member

@srowen srowen left a comment


Very minor comment about docs.
Seems OK, just plumbing through another optional config.

* [[KinesisClientLibConfiguration.DEFAULT_METRICS_ENABLED_DIMENSIONS]]
* if no custom value is specified.
*
* @param metricsEnabledDimensions [[Set[String]]] to specify
Member


This is a small thing, but I know we have had problems generating scaladoc with references to library code. It might be worth building the doc HTML (only) as in https://github.com/apache/spark/blob/master/docs/README.md to make sure.

You don't really have to link a type like Set.
Also we usually do a continuation indent of two spaces.

* @param metricsEnabledDimensions [[Set[String]]] to specify
* the enabled CloudWatch metrics dimensions
* @return Reference to this [[KinesisInputDStream.Builder]]
* @see [[https://docs.aws.amazon.com/streams/latest/dev/monitoring-with-kcl.html#metric-levels]]
Member


If it helps avoid turning off scalastyle, you could just write:

See
[[...]]

in the main body of the doc above. That seems short enough.

* Kinesis integration documentation

  * Add explanation and usage examples for the new API

  * Fix the existing Java grammatical mistakes

* Scaladoc

  * Insert newlines between the @see directives and URLs
    to avoid disabling scalastyle

  * Remove unnecessary brackets around a standard class

  * Use two spaces for a continuation indent
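
To illustrate the scalastyle fix above: moving the long URL onto its own continuation line keeps the line length legal without a `scalastyle:off` comment. The surrounding doc text below is a sketch in the shape described, not the exact diff from the PR:

```scala
/**
 * Sets the CloudWatch metrics level. Defaults to MetricsLevel.DETAILED
 * if no custom value is specified.
 *
 * @param metricsLevel [[MetricsLevel]] to specify the CloudWatch metrics level
 * @return Reference to this [[KinesisInputDStream.Builder]]
 * @see
 *   [[https://docs.aws.amazon.com/streams/latest/dev/monitoring-with-kcl.html#metric-levels]]
 */
```
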
@sekikn
Contributor Author

sekikn commented Sep 7, 2019

@gaborgsomogyi @srowen Thank you for the review! I've just updated the PR following your advice.

As @srowen pointed out, some references to other classes (e.g., KinesisClientLibConfiguration and MetricsLevel) in the scaladoc don't seem to resolve. But generating the docs itself succeeds and they read fine, so I left the brackets as they were.

@SparkQA

SparkQA commented Sep 7, 2019

Test build #110278 has finished for PR 24651 at commit c9341a0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen srowen closed this in 1f056eb Sep 9, 2019
@srowen
Member

srowen commented Sep 9, 2019

Merged to master

@sekikn
Contributor Author

sekikn commented Sep 9, 2019

@srowen @sarutak Thank you for reviewing and merging!
BTW, my client, who uses v2.4.x, is facing this problem, so I'd like to backport this fix to the 2.4 branch. Would you review it if I submit a PR for branch-2.4?

@sekikn sekikn deleted the SPARK-27420 branch September 9, 2019 06:15
@srowen
Member

srowen commented Sep 9, 2019

We usually don't back-port minor features to maintenance branches. I think we'd want to see that it's of broader interest to do so. Can you continue to work around it in 2.4?

PavithraRamachandran pushed a commit to PavithraRamachandran/spark that referenced this pull request Sep 15, 2019
…way to configure CloudWatch metrics


Closes apache#24651 from sekikn/SPARK-27420.

Authored-by: Kengo Seki <[email protected]>
Signed-off-by: Sean Owen <[email protected]>