[HUDI-6638] Upgrade AWS Java SDK to V2 #9347
Conversation
Manually tested the s3a path using an EMR cluster:

spark-shell \
  --conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" \
  --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog" \
  --conf "spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension" \
  --conf "spark.sql.hive.convertMetastoreParquet=false" \
  --jars /usr/lib/hudi/hudi-aws-bundle-0.13.1-amzn-1-SNAPSHOT.jar,/usr/lib/hudi/hudi-spark3-bundle_2.12-0.13.1-amzn-1-SNAPSHOT.jar

import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions._
import org.apache.hudi.DataSourceWriteOptions
import org.apache.hudi.DataSourceReadOptions
import org.apache.hudi.config.HoodieWriteConfig
import org.apache.hudi.hive.MultiPartKeysValueExtractor
import org.apache.hudi.hive.HiveSyncConfig
import org.apache.hudi.sync.common.HoodieSyncConfig
// Create a DataFrame
var tableName = "mansi_s3a_hudi_test"
var tablePath = "s3a://<myBucket>/tables/" + tableName
val inputDF = Seq(
("100", "2015-01-01", "2015-01-01T13:51:39.340396Z"),
("101", "2015-01-01", "2015-01-01T12:14:58.597216Z"),
("102", "2015-01-01", "2015-01-01T13:51:40.417052Z"),
("103", "2015-01-01", "2015-01-01T13:51:40.519832Z"),
("104", "2015-01-02", "2015-01-01T12:15:00.512679Z"),
("105", "2015-01-02", "2015-01-01T13:51:42.248818Z")
).toDF("id", "creation_date", "last_update_time")
//Specify common DataSourceWriteOptions in the single hudiOptions variable
val hudiOptions = Map[String,String](
HoodieWriteConfig.TABLE_NAME -> tableName,
DataSourceWriteOptions.TABLE_TYPE_OPT_KEY -> "COPY_ON_WRITE",
DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY -> "id",
DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY -> "creation_date",
DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY -> "last_update_time",
DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY -> "true",
DataSourceWriteOptions.HIVE_TABLE_OPT_KEY -> tableName,
DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY -> "creation_date",
DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY -> classOf[MultiPartKeysValueExtractor].getName
)
// Write the DataFrame as a Hudi dataset
(inputDF.write
.format("org.apache.hudi")
.option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL)
.options(hudiOptions)
.mode(SaveMode.Overwrite)
.save(tablePath))
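To confirm the s3a write succeeded, the table can be read back in the same spark-shell session. This is a sketch continuing the session above; it assumes the `spark` session and `tablePath` defined earlier:

// Read the Hudi table back from s3a and confirm the rows landed
val readDF = spark.read
  .format("org.apache.hudi")
  .load(tablePath)
readDF.select("id", "creation_date", "last_update_time").show(false)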
yihua left a comment:
<aws.sdk.httpclient.version>4.5.13</aws.sdk.httpclient.version>
<aws.sdk.httpcore.version>4.4.13</aws.sdk.httpcore.version>
Any reason we pick different versions here?
if (CollectionUtils.nonEmpty(response.errors())) {
  if (response.errors().stream()
      .allMatch(
          (error) -> "AlreadyExistsException".equals(error.errorDetail().errorCode()))) {
I assume the error name is the same here and this should still work. Is the error case tested?
We haven't changed anything here other than making it compatible with SDK v2. We haven't run a test that specifically exercises this error case.
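The error-code predicate can be unit-tested in isolation without the Glue SDK. The sketch below uses a hypothetical `PartitionError` record standing in for the SDK v2 error type (the real one lives in `software.amazon.awssdk.services.glue.model`); it mirrors the `allMatch` logic in the diff above, swallowing the batch failure only when every error is `AlreadyExistsException`:

```java
import java.util.List;

public class GlueErrorCheck {
    // Hypothetical stand-in for the SDK v2 partition error; only the
    // error code matters for the check under test.
    record PartitionError(String errorCode) {}

    // True only when there is at least one error and every error is
    // AlreadyExistsException -- the case the lock/partition sync treats as benign.
    static boolean allAlreadyExist(List<PartitionError> errors) {
        return !errors.isEmpty()
            && errors.stream()
                .allMatch(e -> "AlreadyExistsException".equals(e.errorCode()));
    }

    public static void main(String[] args) {
        System.out.println(allAlreadyExist(List.of(
            new PartitionError("AlreadyExistsException"),
            new PartitionError("AlreadyExistsException")))); // true
        System.out.println(allAlreadyExist(List.of(
            new PartitionError("AlreadyExistsException"),
            new PartitionError("InternalServiceException")))); // false
    }
}
```

A test of this shape would cover the mixed-error case the reviewer asked about without needing a live Glue endpoint.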
  ? this.dynamoDBLockConfiguration.getString(DynamoDbBasedLockConfig.DYNAMODB_ENDPOINT_URL)
  : DynamoDbClient.serviceMetadata().endpointFor(Region.of(region)).toString();

if (!endpointURL.startsWith("https://") || !endpointURL.startsWith("http://")) {
could the endpoint URL start without the HTTP prefix?
We get this error if we don't specify http:// or https:// in SDK v2. According to this doc, DynamoDB supports both HTTP and HTTPS, so we added a check for both: https://docs.aws.amazon.com/general/latest/gr/ddb.html
Caused by: java.lang.NullPointerException: The URI scheme of endpointOverride must not be null.
at org.apache.hudi.software.amazon.awssdk.utils.Validate.paramNotNull(Validate.java:156)
at org.apache.hudi.software.amazon.awssdk.core.client.builder.SdkDefaultClientBuilder.endpointOverride(SdkDefaultClientBuilder.java:445)
at org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider.getDynamoDBClient(DynamoDBBasedLockProvider.java:163)
at org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider.<init>(DynamoDBBasedLockProvider.java:87)
at org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider.<init>(DynamoDBBasedLockProvider.java:77)
... 65 more
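Note that the quoted condition `!startsWith("https://") || !startsWith("http://")` is true for every string, since no string starts with both prefixes; the intended guard would use `&&`, or equivalently check the parsed URI scheme. A JDK-only sketch of the scheme check (not the PR's code, just an illustration of the guard SDK v2's `endpointOverride` needs to avoid the NPE above):

```java
import java.net.URI;

public class EndpointCheck {
    // True when the endpoint carries an explicit http/https scheme, which
    // SDK v2's endpointOverride requires ("URI scheme ... must not be null").
    static boolean hasHttpScheme(String endpointURL) {
        String scheme = URI.create(endpointURL).getScheme();
        return "http".equals(scheme) || "https".equals(scheme);
    }

    public static void main(String[] args) {
        System.out.println(hasHttpScheme("https://dynamodb.us-east-1.amazonaws.com")); // true
        System.out.println(hasHttpScheme("dynamodb.us-east-1.amazonaws.com"));         // false
    }
}
```

Checking the scheme directly sidesteps the double-negation pitfall of prefix tests.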
<!-- AWS Services -->
<!-- https://mvnrepository.com/artifact/software.amazon.awssdk/aws-java-sdk-sqs -->
<dependency>
  <groupId>software.amazon.awssdk</groupId>
  <artifactId>sqs</artifactId>
  <version>${aws.sdk.version}</version>
</dependency>
Same question as before: https://github.com/apache/hudi/pull/8441/files#r1164644285
Should this be moved to hudi-aws module (hudi-utilities module has already relied on hudi-aws module)?
We added these two dependencies back to the hudi-utilities pom because of the TestS3EventsSource failure: java.lang.NoClassDefFoundError: org/apache/http/impl/client/DefaultClientConnectionReuseStrategy. We still need to identify why these dependencies must be declared in hudi-utilities even though hudi-aws already pulls them into hudi-utilities.
<dependency>
  <groupId>org.apache.httpcomponents</groupId>
  <artifactId>httpclient</artifactId>
  <version>${aws.sdk.httpclient.version}</version>
</dependency>
<dependency>
  <groupId>org.apache.httpcomponents</groupId>
  <artifactId>httpcore</artifactId>
  <version>${aws.sdk.httpcore.version}</version>
</dependency>
@mansipp could you also check the Azure CI failure?
Force-pushed from 5c13915 to c248258.
@hudi-bot run azure
The Azure CI failure is due to a flaky test.
Change Logs
Upgrade AWS Java SDK to v2.
Impact
None
Risk level (write none, low medium or high below)
None
Documentation Update
None
Contributor's checklist