-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-45720] Upgrade AWS SDK to v2 for Spark Kinesis connector module #44211
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
b6e4c48 to
9970a6a
Compare
| <groupId>com.google.guava</groupId> | ||
| <artifactId>guava</artifactId> | ||
| <version>${aws.kinesis.client.guava.version}</version> | ||
| <scope>compile</scope> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ur, I'm not sure this is okay or not, @junyuc25 .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you elaborate a bit more on why this is not OK? I think this pattern is also seen in other modules like https://github.com/apache/spark/blob/master/connector/connect/server/pom.xml#L159-L164
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you share more about your concern here @dongjoon-hyun ? Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for being late.
- Is this only for Kinesis or for all AWS SKD v2?
- Instead of the following, can we use the latest33.0.0-jre like [SPARK-46768][BUILD] Upgrade Guava used by the connect module to 33.0-jre #44795 ?
<aws.kinesis.client.guava.version>32.1.1-jre</aws.kinesis.client.guava.version>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- This change is only for the Kinesis connector modules.
- Yes I changed it to use the same Guava version as the one used in other module.
|
Anyway, thank you so much for working on this area, @junyuc25 . |
connector/kinesis-asl/pom.xml
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please clarify the relationship between
software.amazon.kinesis:amazon-kinesis-client and software.amazon.awssdk:kinesis?
Seems they don't share the version, and the latter is hygienic (no third-party dependencies other than software.amazon.awssdk:*)
Handling Jackson/Guava/Protobuf dependencies conflict is always painful, is it possible to provide a similar hygienic artifact for the former to make the downstream project easier to consume?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @pan3793, KCL (software.amazon.kinesis:amazon-kinesis-client) is built on top of Kinesis Data Stream API (software.amazon.awssdk:kinesis), and KCL provides additional functionalities such as load balancing, error recovery etc. According to this doc, generally it is recommended to use KCL over Kinesis Data Stream API:https://docs.aws.amazon.com/streams/latest/dev/developing-consumers-with-sdk.html .
Perhaps we could try to raise an issue to the KCL repo and ask if they can release a "clean" version of the library with all the third-party dependencies shaded. But currently I'm not aware there is such a hygienic version of KCL. I guess we probably have to live with this issue if we use KCL.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps we could try to raise an issue to the KCL repo and ask if they can release a "clean" version of the library with all the third-party dependencies shaded.
Would you like to drive this? It will benefit all downstream projects.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @pan3793, I submitted an ticket to the KCL repo: awslabs/amazon-kinesis-client#1245. Let's see what response we would get. On the other hand, it seems to me that this would be a follow-up task, rather than a blocker for this PR? Please correct me if I am wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm glad to see that Kinesis can use the same protobuf-java as other modules, but is this feasible?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @LuciferYang, I saw similar patterns in other modules as well: https://github.com/apache/spark/blob/v3.5.0/connector/connect/server/pom.xml#L173-L175. I believe this should work as long as we shade it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the record, I removed this shaded protobuf-java completely in Apache Spark 4.0.0 independently.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @dongjoon-hyun for letting me know. Updated the PR to use the same protobuf-java version as specified in <protobuf.version>. I had to change the scope to "compile" to fix ClassNotFoundExceptions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
trivial: omit the empty line #L19
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about r => kinesisServiceMetadata.endpointFor(r).equals(uri)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like the URI objects would be different I we make this change. Comparing these two different URI causes test failure.
- The URI created from
kinesisServiceMetadata.endpointFor(Regioin.US_WEST_2): kinesis.us-west-2.amazon.com - The URI created from given endpoint string contains HTTP protocol: https://kinesis.us-west-2.amazon.com/
Stacktrace:
Cause: java.lang.IllegalArgumentException: Could not resolve region for endpoint: https://kinesis.us-west-2.amazonaws.com
at org.apache.spark.streaming.kinesis.KinesisTestUtils$.$anonfun$getRegionNameByEndpoint$3(KinesisTestUtils.scala:237)
at scala.Option.getOrElse(Option.scala:201)
at org.apache.spark.streaming.kinesis.KinesisTestUtils$.getRegionNameByEndpoint(KinesisTestUtils.scala:237)
at org.apache.spark.streaming.kinesis.KinesisStreamTests.<init>(KinesisStreamSuite.scala:51)
at org.apache.spark.streaming.kinesis.WithoutAggregationKinesisStreamSuite.<init>(KinesisStreamSuite.scala:434)
So to avoid this failure, I would prefer to keep this line. kinesisServiceMetadata.endpointFor(r).toString.equals(uri.getHost)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you correct this comment section? At least no more IRecordProcessor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for catching this. Updated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why the comments before methods (such as initialize, processRecords ) are removed in this class? And for the new adding public methods, please add method comments as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment should not be omitted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added it back
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above response.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This modification will change the API. Please revert it and keep () in method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AWSCredentialsProviderChain was retired, check all terms in all comments in this file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated.
c74e184 to
b07bdea
Compare
|
@junyuc25 why you close this PR? And you should remove the |
Looks like I deleted the branch by accident. Updated the title and reopened the PR. |
d0f3157 to
615c370
Compare
615c370 to
eb93c37
Compare
Upgrade codes related to IAM, DDB and Kinesis clients Shade SDKv2 dependencies Use Spark's default protobuf version in the kinesis-asl-assembly module Import mbknor-jackson-jsonschema_2.13 for glue-schema-registry Jira: https://issues.apache.org/jira/browse/SPARK-45720
Fix jaxb version
2767d5d to
28c25ed
Compare
|
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. |
…e AWS SDK for Java 1.x dependency ### What changes were proposed in this pull request? This PR proposes to upgrade KCL to 2.7.2 based on junyuc25 's [PR](#44211) with some updates. By upgrading KCL, we can remove AWS SDK for Java 1.x dependency. * Basic migration guide. * https://docs.aws.amazon.com/streams/latest/dev/kcl-migration.html ### Why are the changes needed? * KCL 1.x will reach end-of-life on January 30, 2026. * https://docs.aws.amazon.com/streams/latest/dev/kcl-version-lifecycle-policy.html * Currently, Spark depends on both AWS SDK for Java 1.x and 2.x. 1.x dependency can be removed by this PR. ### Does this PR introduce _any_ user-facing change? Expect the behavior is not changed. ### How was this patch tested? Confirmed all kinesis tests passed with the following commands. * SBT ``` $ ENABLE_KINESIS_TESTS=1 nohup ./build/sbt -Pkinesis-asl 'streaming-kinesis-asl/test' ``` * Maven ``` $ ENABLE_KINESIS_TESTS=1 build/mvn -Pkinesis-asl -Dtest=org.apache.spark.streaming.kinesis.JavaKinesisInputDStreamBuilderSuite -DwildcardSuit\ es=org.apache.spark.streaming.kinesis test ``` Also confirmed existing examples work. ``` # Need to do `build/sbt -Pkinesis-asl package` beforehand # Producer $ bin/run-example streaming.KinesisWordProducerASL kinesis-example-stream https://kinesis.us-west-2.amazonaws.com 10 5 # Consumer $ bin/run-example streaming.KinesisWordCountASL my-stream-app kinesis-example-stream https://kinesis.us-west-2.amazonaws.com ``` ### Was this patch authored or co-authored using generative AI tooling? No. Closes #53256 from sarutak/upgrade-aws-sdk2. Lead-authored-by: Junyu Chen <[email protected]> Co-authored-by: Kousuke Saruta <[email protected]> Signed-off-by: Kousuke Saruta <[email protected]>
What changes were proposed in this pull request?
As Spark is moving to 4.0, one of the major improvement is to upgrade AWS SDK to v2.
Currently other than directly using AWS SDKv1 codes, the Spark Kinesis connector is also
using on these libraries that depends on SDKv1:
The main purpose of this PR is to upgrading AWS SDK to v2 for the Spark Kinesis
conector. While the changes includes upgrading AWS SDK and KCL to v2, we will not
upgrade KPL because it has not yet been migrated to SDKv2.
streaming-kinesis-asltests in Github Action CI #43736Why are the changes needed?
As the GA of AWS SDK v2, the SDKv1 has entered maintenance mode where its future
release are only limited to address critical bug and security issues. More details about the SDK maintenance policy can be found here. To keep Spark’s dependent softwares up to date, we should consider upgrading the SDK to v2.
These changes could keep Spark Kinesis connector up to date, and enable users to
receive continuous support from the above libraries.
Does this PR introduce any user-facing change?
Yes. With this change, the Spark Kinesis connector will no longer work with SDKv1.
Any applications that are running with previous version of Spark Kinesis connector
would require update before migrating to Spark 4.0.
AWS SDKv2 and KCLv2 contain several major changes
that are not backward compatible with their previous versions. And some public classes
in the module (i.e. KinesisInputDStream) are using one of these breaking changes. Thus
these user-facing classes require updates as well.
How was this patch tested?
Was this patch authored or co-authored using generative AI tooling?
No