
Conversation

@mansipp
Contributor

@mansipp mansipp commented Aug 2, 2023

Change Logs

Upgrade AWS Java SDK to v2.

Impact

None

Risk level (write none, low, medium or high below)

None

Documentation Update

None

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@yihua yihua added the release-0.14.0 and priority:blocker (Production down; release blocker) labels Aug 3, 2023
@yihua yihua mentioned this pull request Aug 3, 2023
@mansipp mansipp changed the title Upgrade aws java sdk to v2 [HUDI-6638] Upgrade aws java sdk to v2 Aug 3, 2023
@mansipp mansipp changed the title [HUDI-6638] Upgrade aws java sdk to v2 [HUDI-6638] Upgrade AWS Java SDK to v2 Aug 3, 2023
@mansipp
Contributor Author

mansipp commented Aug 3, 2023

Manually tested the s3a path using an EMR cluster.

spark-shell \
--conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" \
--conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog" \
--conf "spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension" \
--conf "spark.sql.hive.convertMetastoreParquet=false" \
--jars /usr/lib/hudi/hudi-aws-bundle-0.13.1-amzn-1-SNAPSHOT.jar,/usr/lib/hudi/hudi-spark3-bundle_2.12-0.13.1-amzn-1-SNAPSHOT.jar
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions._
import org.apache.hudi.DataSourceWriteOptions
import org.apache.hudi.DataSourceReadOptions
import org.apache.hudi.config.HoodieWriteConfig
import org.apache.hudi.hive.MultiPartKeysValueExtractor
import org.apache.hudi.hive.HiveSyncConfig
import org.apache.hudi.sync.common.HoodieSyncConfig

// Create a DataFrame
var tableName = "mansi_s3a_hudi_test"
var tablePath = "s3a://<myBucket>/tables/" + tableName
val inputDF = Seq(
 ("100", "2015-01-01", "2015-01-01T13:51:39.340396Z"),
 ("101", "2015-01-01", "2015-01-01T12:14:58.597216Z"),
 ("102", "2015-01-01", "2015-01-01T13:51:40.417052Z"),
 ("103", "2015-01-01", "2015-01-01T13:51:40.519832Z"),
 ("104", "2015-01-02", "2015-01-01T12:15:00.512679Z"),
 ("105", "2015-01-02", "2015-01-01T13:51:42.248818Z")
 ).toDF("id", "creation_date", "last_update_time")

// Specify common DataSourceWriteOptions in the single hudiOptions variable
val hudiOptions = Map[String,String](
  HoodieWriteConfig.TABLE_NAME -> tableName,
  DataSourceWriteOptions.TABLE_TYPE_OPT_KEY -> "COPY_ON_WRITE", 
  DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY -> "id",
  DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY -> "creation_date",
  DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY -> "last_update_time",
  DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY -> "true",
  DataSourceWriteOptions.HIVE_TABLE_OPT_KEY -> tableName,
  DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY -> "creation_date",
  DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY -> classOf[MultiPartKeysValueExtractor].getName
)

// Write the DataFrame as a Hudi dataset
(inputDF.write
    .format("org.apache.hudi")
    .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL)
    .options(hudiOptions)
    .mode(SaveMode.Overwrite)
    .save(tablePath))


@mansipp mansipp changed the title [HUDI-6638] Upgrade AWS Java SDK to v2 [HUDI-6638] Upgrade AWS Java SDK to V2 Aug 3, 2023
@yihua yihua self-assigned this Aug 4, 2023
Contributor

@yihua yihua left a comment


LGTM with a couple of minor comments. @mansipp @rahil-c Thanks for making the AWS Java SDK V2 upgrade happen!

Comment on lines +133 to +134
<aws.sdk.httpclient.version>4.5.13</aws.sdk.httpclient.version>
<aws.sdk.httpcore.version>4.4.13</aws.sdk.httpcore.version>
Contributor


Any reason we pick different versions here?

if (CollectionUtils.nonEmpty(response.errors())) {
  if (response.errors().stream()
      .allMatch(
          (error) -> "AlreadyExistsException".equals(error.errorDetail().errorCode()))) {
Contributor


I assume the error name is the same here and this should still work. Is the error case tested?

Contributor Author


We haven't changed anything here other than making it compatible with SDK v2. We haven't run any test that specifically exercises this error case.
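For reference, the idempotency check quoted above can be exercised in isolation. This is a minimal sketch using plain strings in place of the SDK v2 PartitionError/ErrorDetail types; the class and method names below are illustrative, not Hudi's actual code:

```java
import java.util.List;

// Simplified stand-in for the "all failures are AlreadyExistsException"
// check quoted above; the real code inspects
// error.errorDetail().errorCode() on SDK v2 PartitionError objects.
public class GlueErrorCheck {
    // Returns true only when every reported error code is
    // AlreadyExistsException, i.e. the batch call failed solely because
    // the partitions already exist and can be treated as a no-op.
    public static boolean allAlreadyExists(List<String> errorCodes) {
        return !errorCodes.isEmpty()
            && errorCodes.stream().allMatch("AlreadyExistsException"::equals);
    }
}
```

A mixed error list (say, one AlreadyExistsException and one InternalServiceException) makes allMatch return false, so a genuine failure is still surfaced rather than swallowed.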

? this.dynamoDBLockConfiguration.getString(DynamoDbBasedLockConfig.DYNAMODB_ENDPOINT_URL)
: DynamoDbClient.serviceMetadata().endpointFor(Region.of(region)).toString();

if (!endpointURL.startsWith("https://") && !endpointURL.startsWith("http://")) {
Contributor


could the endpoint URL start without the HTTP prefix?

Contributor Author


We get this error if we don't specify http:// or https:// in SDK v2. Per this doc, DynamoDB supports both HTTP and HTTPS, so we added a check for both: https://docs.aws.amazon.com/general/latest/gr/ddb.html

Caused by: java.lang.NullPointerException: The URI scheme of endpointOverride must not be null.
    at org.apache.hudi.software.amazon.awssdk.utils.Validate.paramNotNull(Validate.java:156)
    at org.apache.hudi.software.amazon.awssdk.core.client.builder.SdkDefaultClientBuilder.endpointOverride(SdkDefaultClientBuilder.java:445)
    at org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider.getDynamoDBClient(DynamoDBBasedLockProvider.java:163)
    at org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider.<init>(DynamoDBBasedLockProvider.java:87)
    at org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider.<init>(DynamoDBBasedLockProvider.java:77)
    ... 65 more
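The scheme requirement discussed above can be captured in a small helper. This is a hedged sketch, not the actual Hudi code; EndpointUtil and ensureScheme are illustrative names:

```java
// Sketch of the endpoint normalization discussed above: SDK v2's
// endpointOverride(URI) throws an NPE when the URI has no scheme, so a
// scheme-less endpoint gets "https://" prepended. Illustrative only.
public class EndpointUtil {
    public static String ensureScheme(String endpointUrl) {
        // DynamoDB endpoints may be reached over either http or https,
        // so an existing scheme of either kind is left untouched.
        if (endpointUrl.startsWith("http://") || endpointUrl.startsWith("https://")) {
            return endpointUrl;
        }
        return "https://" + endpointUrl;
    }
}
```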

Comment on lines +453 to +471
<!-- AWS Services -->
<!-- https://mvnrepository.com/artifact/software.amazon.awssdk/aws-java-sdk-sqs -->
<dependency>
  <groupId>software.amazon.awssdk</groupId>
  <artifactId>sqs</artifactId>
  <version>${aws.sdk.version}</version>
</dependency>
Contributor


Same question as before: https://github.com/apache/hudi/pull/8441/files#r1164644285
Should this be moved to the hudi-aws module (the hudi-utilities module already relies on the hudi-aws module)?

Contributor Author


We have added these 2 dependencies back to the hudi-utilities pom because of the TestS3EventsSource failure java.lang.NoClassDefFoundError: org/apache/http/impl/client/DefaultClientConnectionReuseStrategy. We still need to identify why these dependencies must be specified in hudi-utilities even though hudi-aws already pulls them into hudi-utilities.

<dependency>
  <groupId>org.apache.httpcomponents</groupId>
  <artifactId>httpclient</artifactId>
  <version>${aws.sdk.httpclient.version}</version>
</dependency>
<dependency>
  <groupId>org.apache.httpcomponents</groupId>
  <artifactId>httpcore</artifactId>
  <version>${aws.sdk.httpcore.version}</version>
</dependency>
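When chasing a NoClassDefFoundError like the one above, one quick way to confirm whether a class is actually visible on the runtime classpath is a reflective lookup. A minimal sketch; ClasspathCheck is an illustrative name, not part of Hudi:

```java
// Probe the runtime classpath for a fully qualified class name.
// Class.forName throws ClassNotFoundException when the class is absent,
// which is the usual root cause behind a later NoClassDefFoundError.
public class ClasspathCheck {
    public static boolean isOnClasspath(String fqcn) {
        try {
            Class.forName(fqcn);
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Prints true/false depending on whether httpclient is on the classpath.
        System.out.println(
            isOnClasspath("org.apache.http.impl.client.DefaultClientConnectionReuseStrategy"));
    }
}
```

Running this inside the failing test's JVM (or checking `mvn dependency:tree` for org.apache.httpcomponents) helps pin down which module drops the transitive httpclient/httpcore dependencies.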

@yihua
Contributor

yihua commented Aug 4, 2023

@mansipp could you also check the Azure CI failure?

@mansipp mansipp force-pushed the mansipp/aws-sdk-v2-upgrade-oss branch from 5c13915 to c248258 Compare August 4, 2023 04:07
@mansipp
Contributor Author

mansipp commented Aug 4, 2023

@hudi-bot run azure

@hudi-bot
Collaborator

hudi-bot commented Aug 4, 2023

CI report:

Bot commands: @hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

@yihua
Contributor

yihua commented Aug 4, 2023

The Azure CI failure is due to a flaky test in hudi-flink that is unrelated to this PR. Merging this now.

@yihua yihua merged commit e9dd73f into apache:master Aug 4, 2023
@parisni parisni mentioned this pull request Nov 7, 2025