From 637bd9ea3d0753da69eccadbec03dd635ad7b454 Mon Sep 17 00:00:00 2001
From: Pramod Biligiri
Date: Mon, 13 Feb 2023 14:55:08 +0530
Subject: [PATCH 1/2] docs for GCS Ingestion

---
 .../version-0.12.2/hoodie_deltastreamer.md | 21 +++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/website/versioned_docs/version-0.12.2/hoodie_deltastreamer.md b/website/versioned_docs/version-0.12.2/hoodie_deltastreamer.md
index 42b98a2b90e05..8b6ecde4e387b 100644
--- a/website/versioned_docs/version-0.12.2/hoodie_deltastreamer.md
+++ b/website/versioned_docs/version-0.12.2/hoodie_deltastreamer.md
@@ -339,6 +339,27 @@ to trigger/processing of new or changed data as soon as it is available on S3.
 
 Insert code sample from this blog: https://hudi.apache.org/blog/2021/08/23/s3-events-source/#configuration-and-setup
 
+### GCS Events
+Google Cloud Storage (GCS) provides an event notification mechanism that posts notifications when certain
+events happen in your GCS bucket. You can read more at [Pubsub Notifications](https://cloud.google.com/storage/docs/pubsub-notifications/).
+GCS will put these events in a Cloud Pubsub topic. Apache Hudi provides a GcsEventsSource that can read from Cloud Pubsub
+to trigger processing of new or changed data as soon as it is available on GCS.
+
+#### Setup
+A detailed guide on [How to use the system](https://docs.google.com/document/d/1VfvtdvhXw6oEHPgZ_4Be2rkPxIzE0kBCNUiVDsXnSAA/edit#heading=h.tpmqk5oj0crt) is available.
+A high-level overview of the same is provided below.
+
+1. Configure Cloud Storage Pubsub Notifications for the bucket. Follow Google’s documentation here: [reporting changes](https://cloud.google.com/storage/docs/reporting-changes)
+2. Create a Pubsub subscription corresponding to the topic.
+3. Note the GCP Project Id and the Pubsub Subscription Id, and use them for the following Hoodie configurations:
+   1. `hoodie.deltastreamer.source.gcs.project.id=GCP_PROJECT_ID`
+   2. `hoodie.deltastreamer.source.gcs.subscription.id=SUBSCRIPTION_ID`
+   3. Start the GcsEventsSource using the `HoodieDeltaStreamer` utility with the `--source-class` parameter set to
+`org.apache.hudi.utilities.sources.GcsEventsSource` and `hoodie.deltastreamer.source.cloud.meta.ack=true`, and path related
+ configs as described in the detailed guide mentioned above.
+5. Start the GcsEventsHoodieIncrSource using the `HoodieDeltaStreamer` utility with the `--source-class` parameter set to
+`org.apache.hudi.utilities.sources.GcsEventsHoodieIncrSource` and other parameters as mentioned in the detailed guide above.
+
 ### JDBC Source
 Hudi can read from a JDBC source with a full fetch of a table, or Hudi can even read incrementally with checkpointing from a JDBC source.

From 8167f49747040e0e643cce6f0f63d3932e6f9e85 Mon Sep 17 00:00:00 2001
From: Pramod Biligiri
Date: Mon, 13 Feb 2023 15:05:13 +0530
Subject: [PATCH 2/2] numbering

---
 website/versioned_docs/version-0.12.2/hoodie_deltastreamer.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/website/versioned_docs/version-0.12.2/hoodie_deltastreamer.md b/website/versioned_docs/version-0.12.2/hoodie_deltastreamer.md
index 8b6ecde4e387b..0ee08d16ddee9 100644
--- a/website/versioned_docs/version-0.12.2/hoodie_deltastreamer.md
+++ b/website/versioned_docs/version-0.12.2/hoodie_deltastreamer.md
@@ -357,7 +357,7 @@ A high-level overview of the same is provided below.
    3. Start the GcsEventsSource using the `HoodieDeltaStreamer` utility with the `--source-class` parameter set to
 `org.apache.hudi.utilities.sources.GcsEventsSource` and `hoodie.deltastreamer.source.cloud.meta.ack=true`, and path related
  configs as described in the detailed guide mentioned above.
-5. Start the GcsEventsHoodieIncrSource using the `HoodieDeltaStreamer` utility with the `--source-class` parameter set to
+4. Start the GcsEventsHoodieIncrSource using the `HoodieDeltaStreamer` utility with the `--source-class` parameter set to
 `org.apache.hudi.utilities.sources.GcsEventsHoodieIncrSource` and other parameters as mentioned in the detailed guide above.
 
 ### JDBC Source
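
As a rough illustration of the setup steps documented in this patch, the following Python sketch assembles the source-related `HoodieDeltaStreamer` arguments for the two jobs. It is only a sketch under stated assumptions: the helper names and the example project and subscription values are hypothetical, each property is assumed to be passed through a repeated `--hoodie-conf key=value` argument, and the remaining arguments (target path, table name, path related configs from the detailed guide) are deliberately left out.

```python
from typing import Dict, List

# Source classes named in the documentation added by this patch.
GCS_EVENTS_SOURCE = "org.apache.hudi.utilities.sources.GcsEventsSource"
GCS_INCR_SOURCE = "org.apache.hudi.utilities.sources.GcsEventsHoodieIncrSource"


def hoodie_conf_args(props: Dict[str, str]) -> List[str]:
    """Turn properties into repeated `--hoodie-conf key=value` arguments (assumed style)."""
    args: List[str] = []
    for key, value in props.items():
        args += ["--hoodie-conf", f"{key}={value}"]
    return args


def events_source_args(project_id: str, subscription_id: str) -> List[str]:
    """Arguments for the job that reads GCS notifications from Cloud Pubsub."""
    props = {
        "hoodie.deltastreamer.source.gcs.project.id": project_id,
        "hoodie.deltastreamer.source.gcs.subscription.id": subscription_id,
        "hoodie.deltastreamer.source.cloud.meta.ack": "true",
    }
    return ["--source-class", GCS_EVENTS_SOURCE] + hoodie_conf_args(props)


def incr_source_args() -> List[str]:
    """Arguments for the second job; its other parameters come from the detailed guide."""
    return ["--source-class", GCS_INCR_SOURCE]


if __name__ == "__main__":
    # Hypothetical placeholder values, not real identifiers.
    print(" ".join(events_source_args("my-gcp-project", "my-subscription")))
    print(" ".join(incr_source_args()))
```

The printed fragments are meant to be appended to whatever `spark-submit` invocation of `HoodieDeltaStreamer` is already in use; they are not a complete command on their own.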