Added documentation on getting started with GCS #8171

Conversation
nastra left a comment
@bryanck since you're very familiar with this, could you please also review whether the described configuration steps make sense?
docs/gcs.md (outdated)

```diff
  ## Configuring Apache Iceberg to Use GCS

- Apache Iceberg uses the GCSFileIO to read and write data from/to GCS. To configure this:
+ Apache Iceberg uses the `GCSFileIO` to read and write data from/to GCS. To configure this:
```
docs/gcs.md (outdated)

```diff
  To load data into Iceberg tables using Apache Spark, you must first add Iceberg to your Spark environment. It can be done using the `--packages` option when starting the Spark shell or Spark SQL:

- spark-shell --packages org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:1.3.1
+ spark-shell --packages org.apache.iceberg:iceberg-spark-runtime-3.4_2.12:1.3.1
```
I don't think we want a tutorial for starting Spark here. It is good to know what additional libraries are needed for running GCSFileIO if you are using the runtime bundle though.
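Following this note, the Spark runtime bundle alone does not include `iceberg-gcp` or the Google Java client libraries. A minimal sketch of pulling them in via `--packages` (the artifact versions here are illustrative assumptions; check Maven Central for coordinates matching your Iceberg and Spark versions):

```shell
# Sketch: add the GCS-related libraries alongside the Spark runtime bundle.
# Versions are placeholder assumptions, not a tested combination.
spark-shell --packages \
org.apache.iceberg:iceberg-spark-runtime-3.4_2.12:1.3.1,\
org.apache.iceberg:iceberg-gcp:1.3.1,\
com.google.cloud:google-cloud-storage:2.22.2
```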
docs/gcs.md (outdated)

```diff
  spark-shell --packages org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:1.3.1
- spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:1.3.1
+ spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.4_2.12:1.3.1
```
docs/gcs.md
Outdated
| Catalogs in Iceberg are used to track tables. They can be configured using properties under `spark.sql.catalog.(catalog_name)`. Here is an example of how to configure a catalog: | ||
|
|
||
| ```bash | ||
| spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:1.3.1\ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:1.3.1\ | |
| spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.4_2.12:1.3.1\ |
| --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \ | ||
| --conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog \ | ||
| --conf spark.sql.catalog.spark_catalog.type=hive \ | ||
| --conf spark.sql.catalog.local=org.apache.iceberg.spark.SparkCatalog \ |
I think it would be easier to understand when showing an example where only a single catalog is being initialized (so either local or spark_catalog)
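A single-catalog variant along these lines might look like the following sketch (the catalog name and `gs://` bucket path are placeholder assumptions; the `hadoop` type is kept from the original example only for brevity):

```shell
# Sketch: initialize only one catalog, `local`, rather than two.
# Bucket path and version are illustrative assumptions.
spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.4_2.12:1.3.1 \
  --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
  --conf spark.sql.catalog.local=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.local.type=hadoop \
  --conf spark.sql.catalog.local.warehouse=gs://my_bucket/warehouse \
  --conf spark.sql.defaultCatalog=local
```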
```bash
  --conf spark.sql.catalog.local=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.local.type=hadoop \
  --conf spark.sql.catalog.local.warehouse=$PWD/warehouse \
  --conf spark.sql.defaultCatalog=local
```
Shouldn't this have a `--conf spark.sql.catalog.local.io-impl=org.apache.iceberg.gcp.gcs.GCSFileIO` to properly pick up `GCSFileIO`?
ResolvingFileIO should pick it based on the gs scheme.
+1 ResolvingFileIO will automatically select GCSFileIO based on the scheme in the next release. It might be better to use the REST catalog as an example as it is easier to set up than a HMS, and no io-impl needs to be set.
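A REST-catalog example in the spirit of this suggestion might look like the following sketch, assuming a REST catalog service at `localhost:8181` and a placeholder `gs://` warehouse path:

```shell
# Sketch: configure a REST catalog; no io-impl needs to be set, since
# ResolvingFileIO selects GCSFileIO from the gs:// scheme.
# Endpoint, catalog name, and bucket are illustrative assumptions.
spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.4_2.12:1.3.1 \
  --conf spark.sql.catalog.rest=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.rest.type=rest \
  --conf spark.sql.catalog.rest.uri=http://localhost:8181 \
  --conf spark.sql.catalog.rest.warehouse=gs://my_bucket/warehouse \
  --conf spark.sql.defaultCatalog=rest
```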
docs/gcs.md (outdated)

```bash
  --conf spark.sql.catalog.spark_catalog.type=hive \
  --conf spark.sql.catalog.local=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.local.type=hadoop \
  --conf spark.sql.catalog.local.warehouse=$PWD/warehouse \
```
I think this would point to the respective gs:... bucket?
Yes, agreed.
docs/gcs.md (outdated)

```java
Map<String, String> properties = new HashMap<>();
properties.put("inputDataLocation", "gs://my_bucket/data/");
```
where is inputDataLocation actually being used?
docs/gcs.md (outdated)

> - **Create a Cloud Storage bucket**: Navigate to the Cloud Storage Buckets page in the Google Cloud Console. Click "Create bucket", enter your details, and click "Create".
>
> ## Configuring Apache Iceberg to Use GCS
should this maybe show how this can be used via the Java API vs Spark? Most users would likely want to know how to use GCS with Spark
I agree. We don't expect anyone to instantiate these directly. They are provided by Table instances.
docs/gcs.md (outdated)

> ## Setting Up Google Cloud Storage (GCS)
>
> ### Setting Up a Bucket in GCS
I don't think this is needed in Iceberg docs. This would be more appropriate for a "Using Iceberg with GCS" tutorial, but not the doc on GCS FileIO.
docs/gcs.md (outdated)

> Within these property key-value pairs, inputDataLocation and metadataLocation are the locations in your GCS bucket where your data and metadata are stored. Update `"gs://my_bucket/data/"` and `"gs://my_bucket/metadata/"` to reflect the corresponding paths of your GCS bucket.
>
> ### Example Use of GCSFileIO
I think this is more useful for a general FileIO doc, not one specific to GCS.
```java
InputFile inputFile = gcsFileIO.newInputFile("gs://my_bucket/data/my_data.parquet");
```
Nit: unnecessary newline.
It's needed for formatting purposes, otherwise it violates this rule: https://github.com/DavidAnson/markdownlint/blob/v0.29.0/doc/md031.md
docs/gcs.md (outdated)

> These steps will allow you to set up GCS as your storage layer for Apache Iceberg and interact with the data stored in GCS using the `GCSFileIO` class.
>
> ## Loading Data into Iceberg Tables
I think the main thing here is to set up a warehouse with a gs:// URI for its base warehouse location.
docs/gcs.md (outdated)

```bash
  --conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog \
  --conf spark.sql.catalog.spark_catalog.type=hive \
  --conf spark.sql.catalog.local=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.local.type=hadoop \
```
Examples should not use the hadoop catalog type. hive is probably better.
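A `hive`-type variant of the example might look like the following sketch (the metastore URI, catalog name, and bucket path are placeholder assumptions):

```shell
# Sketch: use the hive catalog type instead of hadoop, as suggested.
# thrift://localhost:9083 is a placeholder Hive Metastore URI.
spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.4_2.12:1.3.1 \
  --conf spark.sql.catalog.hive_cat=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.hive_cat.type=hive \
  --conf spark.sql.catalog.hive_cat.uri=thrift://localhost:9083 \
  --conf spark.sql.catalog.hive_cat.warehouse=gs://my_bucket/warehouse
```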
docs/gcs.md

>   limitations under the License.
> -->
>
> # Iceberg GSC Integration
typo: I think you meant GCS
docs/gcs.md (outdated)

> - **Add Iceberg to Spark**: If you already have a Spark environment, you can add Iceberg by specifying the `--packages` option when starting Spark. This will download the required Iceberg package and make it available in your Spark session. Here is an example:

```bash
spark-shell --packages org.apache.iceberg:iceberg-spark-runtime-3.4_2.12:1.3.1
```
This currently isn't enough. The runtime doesn't currently contain iceberg-gcp nor does it contain the Google java client libraries. See #8231 for a proposal around this. You might also want to comment on configuring authentication, e.g. setting the GOOGLE_APPLICATION_CREDENTIALS environment variable or setting the GCSFileIO properties for that.
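Per this comment, authentication also needs to be configured. One way is setting the `GOOGLE_APPLICATION_CREDENTIALS` environment variable before launching Spark (the key-file path is a placeholder):

```shell
# Sketch: point the Google client libraries at a service-account key file
# before starting Spark. The path is a placeholder assumption.
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account-key.json
spark-shell --packages org.apache.iceberg:iceberg-spark-runtime-3.4_2.12:1.3.1
```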
Setting up Iceberg with GCS recently came up on the Slack channel; there are some shorthand notes on getting up and running with the REST catalog in case it is useful as a reference.
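In the spirit of those notes, one quick way to stand up a local REST catalog for experimentation (the image name and port are assumptions; `tabulario/iceberg-rest` was a community-maintained image at the time):

```shell
# Sketch: run a standalone Iceberg REST catalog locally on port 8181,
# which Spark can then target via spark.sql.catalog.<name>.uri.
docker run -d -p 8181:8181 tabulario/iceberg-rest
```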
This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that's incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions.

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.
Added documentation on getting started with GCS with the following content:
Refers to - #7948