
Conversation

@codope
Member

@codope codope commented Aug 9, 2021

  • HUDI-1896 initial source for Cloud Dfs

  • update with changes, added for fileMap support HUDI-1896

  • update with changes, added for fileMap support HUDI-1896

  • s3 meta source HUDI-1896

  • adding hoodie cloud object source class

  • adding hoodie cloud object source class

  • [HUDI-1896] adding selector test cases

  • [HUDI-1896] Initial source for Cloud Dfs and test cases

  • [HUDI-1896] Initial source for Cloud Dfs and test cases

  • [HUDI-1896] Initial source for Cloud Dfs and test cases

Resolve conflicts and rename opt keys

Minor refactoring in CloudObjectsDfsSelector

Tips

What is the purpose of the pull request

(For example: This pull request adds quick-start document.)

Brief change log

(for example:)

  • Modify AnnotationLocation checkstyle rule in checkstyle.xml

Verify this pull request

(Please pick either of the following options)

This pull request is a trivial rework / code cleanup without any test coverage.

(or)

This pull request is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(example:)

  • Added integration tests for end-to-end.
  • Added HoodieClientWriteTest to verify the change.
  • Manually verified the change by running a job locally.

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@hudi-bot
Collaborator

hudi-bot commented Aug 9, 2021

CI report:

Bot commands: @hudi-bot supports the following commands:
  • @hudi-bot run travis — re-run the last Travis build
  • @hudi-bot run azure — re-run the last Azure build

@codope
Member Author

codope commented Aug 10, 2021

@hudi-bot run azure

@vinothchandar vinothchandar self-assigned this Aug 10, 2021
@codope codope force-pushed the deltastreamer-s3-source branch from 0761233 to 33f7d78 on August 10, 2021 08:35
@vinothchandar vinothchandar added the priority:blocker Production down; release blocker label Aug 10, 2021
Contributor

@nsivabalan nsivabalan left a comment


Reviewed the 2 stage pipeline

Contributor

If we set numMessagesToProcess = min(approxMessagesAvailable, maxMessageEachBatch), we can avoid lines 159 to 161.

Member Author

Done. But we will still need to check messages.isEmpty() and break out of the loop, because the value of ApproximateNumberOfMessages returned by SQS is only eventually consistent. So if it reports a positive count when there are actually no messages, we don't want to run the loop again.
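The polling logic under discussion can be sketched roughly as follows. This is a minimal illustration with made-up names, not the actual CloudObjectsSelector code (which uses the AWS Java SDK); it just shows the min() bound and the empty-batch break discussed above:

```python
def poll_messages(fetch_batch, approx_messages_available, max_messages_per_batch):
    """Drain up to numMessagesToProcess messages, breaking early if a batch
    comes back empty (ApproximateNumberOfMessages is only eventually
    consistent, so it may report messages that are no longer there)."""
    num_to_process = min(approx_messages_available, max_messages_per_batch)
    collected = []
    while len(collected) < num_to_process:
        messages = fetch_batch()  # stands in for one SQS ReceiveMessage call
        if not messages:          # queue is actually empty; stop early
            break
        collected.extend(messages)
    return collected
```

Here `fetch_batch` stands in for an SQS receive call; the `if not messages: break` guard is the isEmpty() check the reply above refers to.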

Contributor

@nsivabalan nsivabalan left a comment

A few high-level questions/clarifications:

  1. Does this 2-stage pipeline handle (S3 object) deletes as well, or only immutable data?
  2. Some of the info can be out of sync at different points in time. Can you walk through how we handle each of these cases?
    a. Events in SQS are out of sync with the actual S3 state. For example, you could find a PUT entry in SQS, but the actual object in S3 has been deleted.
    b. Similarly, suppose an object was updated twice, at t0 and t10, but SQS has info only about t0. I assume that when we process the t0 event, the 2nd stage fetches the entire file from S3 and writes it to Hudi, so it will likely pick up the latest state of the file of interest. Do we then ignore the SQS event when we process the t10 event?
    c. The object of interest was active in S3 during the 1st stage, but got deleted in S3 during the 2nd stage.
    d. Again, similar to (b): during the first stage the event referred to t0, and the Hudi cloud meta table also has the t0 event info. During the 2nd stage, say the file got updated and is now version1 (while t0 referred to version0). What happens here?
  3. All cloud-provider-specific code needs to be abstracted out. I guess we don't have much time, but once we land this patch, let's ensure we abstract it out so that we can support other cloud providers (GCS, etc.) as well.
  4. Can we abstract out the source format, so that we can read any file format from S3? Basically, any file that can be read using spark.read.format("abc") should be doable. Or are there any tight couplings?

Contributor

Can you help me understand how exactly deletes in S3 are handled by these 2 sources?

Member Author

We are not handling deletes right now; it will need some work. I was thinking of capturing the delete events and adding a column, such as is_deleted, to the event meta table. I or Satish can take it up as a follow-up task.
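A rough illustration of that follow-up idea (the column and row names here are hypothetical, not part of this patch; the event structure follows the standard S3 event notification schema, where each record carries an eventName like "ObjectCreated:Put" or "ObjectRemoved:Delete"):

```python
def to_meta_rows(s3_event):
    """Map an S3 event notification payload to rows for the event meta
    table, flagging deletes via an is_deleted column."""
    rows = []
    for record in s3_event.get("Records", []):
        rows.append({
            "bucket": record["s3"]["bucket"]["name"],
            "key": record["s3"]["object"]["key"],
            "event_name": record["eventName"],
            # ObjectRemoved:* events mark the object as deleted
            "is_deleted": record["eventName"].startswith("ObjectRemoved"),
        })
    return rows
```

The 2nd stage could then skip (or tombstone) rows where is_deleted is true instead of trying to fetch an object that no longer exists.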

Member

@vinothchandar vinothchandar left a comment

Bunch of naming comments.

data-storyteller and others added 2 commits August 13, 2021 18:47

Add region config for cloud sources

Fix test failures
@codope codope force-pushed the deltastreamer-s3-source branch from 33f7d78 to bd9b7dd on August 13, 2021 13:18
<dependency>
  <groupId>com.amazonaws</groupId>
  <artifactId>aws-java-sdk-sqs</artifactId>
  <version>${aws.sdk.version}</version>
</dependency>
Member

We are not bundling this, so we should document that users need to add it via --jars for this to work at runtime. cc @codope
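For example, something along these lines (a sketch only; the jar paths, versions, and bundle name are placeholders, not values confirmed in this PR):

```shell
# The SQS SDK jar is not bundled with Hudi, so pass it explicitly at runtime
spark-submit \
  --jars /path/to/aws-java-sdk-sqs-<version>.jar \
  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
  /path/to/hudi-utilities-bundle.jar \
  ...
```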

Member Author

Ack.

