
Conversation

@mhmohona

Added documentation on getting started with GCS, with the following content:

  • Setting Up Google Cloud Storage
  • Configuring Apache Iceberg to Use GCS
  • Loading Data into Iceberg Tables
  • Querying Data from Iceberg Tables

Refers to - #7948

Contributor

@nastra nastra left a comment

@bryanck since you're very familiar with this, could you please also review whether the described configuration steps make sense?

docs/gcs.md Outdated

## Configuring Apache Iceberg to Use GCS

Apache Iceberg uses the GCSFileIO to read and write data from/to GCS. To configure this:
Contributor

Suggested change
Apache Iceberg uses the GCSFileIO to read and write data from/to GCS. To configure this:
Apache Iceberg uses the `GCSFileIO` to read and write data from/to GCS. To configure this:

docs/gcs.md Outdated
To load data into Iceberg tables using Apache Spark, you must first add Iceberg to your Spark environment. This can be done with the `--packages` option when starting the Spark shell or Spark SQL:

```bash
spark-shell --packages org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:1.3.1
```
Contributor

Suggested change
spark-shell --packages org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:1.3.1
spark-shell --packages org.apache.iceberg:iceberg-spark-runtime-3.4_2.12:1.3.1

Contributor

I don't think we want a tutorial for starting Spark here. It is good to know what additional libraries are needed for running GCSFileIO if you are using the runtime bundle though.

docs/gcs.md Outdated

```bash
spark-shell --packages org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:1.3.1
spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:1.3.1
```
Contributor

Suggested change
spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:1.3.1
spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.4_2.12:1.3.1

docs/gcs.md Outdated
Catalogs in Iceberg are used to track tables. They can be configured using properties under `spark.sql.catalog.(catalog_name)`. Here is an example of how to configure a catalog:

```bash
spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:1.3.1\
```
Contributor

Suggested change
spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:1.3.1\
spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.4_2.12:1.3.1\

```bash
--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
--conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog \
--conf spark.sql.catalog.spark_catalog.type=hive \
--conf spark.sql.catalog.local=org.apache.iceberg.spark.SparkCatalog \
```
Contributor

I think it would be easier to understand when showing an example where only a single catalog is being initialized (so either local or spark_catalog)

```bash
--conf spark.sql.catalog.local=org.apache.iceberg.spark.SparkCatalog \
--conf spark.sql.catalog.local.type=hadoop \
--conf spark.sql.catalog.local.warehouse=$PWD/warehouse \
--conf spark.sql.defaultCatalog=local
```
Contributor

shouldn't this have a --conf spark.sql.catalog.local.io-impl=org.apache.iceberg.gcp.gcs.GCSFileIO to properly pick up GCSFileIO?
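For reference, a minimal sketch of how that property could be wired in; the catalog name `local` and the bucket `my_bucket` are placeholders carried over from the example under review, not values confirmed by the PR:

```shell
# Hedged sketch: explicitly selecting GCSFileIO for the "local" catalog.
# The warehouse bucket name is a placeholder.
spark-sql \
  --conf spark.sql.catalog.local=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.local.io-impl=org.apache.iceberg.gcp.gcs.GCSFileIO \
  --conf spark.sql.catalog.local.warehouse=gs://my_bucket/warehouse
```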

Contributor

ResolvingFileIO should pick it based on the gs scheme.

Contributor

+1 ResolvingFileIO will automatically select GCSFileIO based on the scheme in the next release. It might be better to use the REST catalog as an example as it is easier to set up than a HMS, and no io-impl needs to be set.
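A rough sketch of what a REST-catalog-based example might look like; the catalog URI and bucket below are assumptions for illustration, not values from the PR, and would depend on the REST catalog deployment:

```shell
# Hedged sketch: REST catalog with a gs:// warehouse. No io-impl is set here,
# on the assumption that ResolvingFileIO picks GCSFileIO from the gs scheme.
spark-sql \
  --conf spark.sql.catalog.rest=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.rest.type=rest \
  --conf spark.sql.catalog.rest.uri=http://localhost:8181 \
  --conf spark.sql.catalog.rest.warehouse=gs://my_bucket/warehouse
```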

docs/gcs.md Outdated
```bash
--conf spark.sql.catalog.spark_catalog.type=hive \
--conf spark.sql.catalog.local=org.apache.iceberg.spark.SparkCatalog \
--conf spark.sql.catalog.local.type=hadoop \
--conf spark.sql.catalog.local.warehouse=$PWD/warehouse \
```
Contributor

I think this would point to the respective gs:... bucket?

Contributor

Yes, agreed.

docs/gcs.md Outdated

```java
Map<String, String> properties = new HashMap<>();
properties.put("inputDataLocation", "gs://my_bucket/data/");
```
Contributor

where is inputDataLocation actually being used?

docs/gcs.md Outdated

- **Create a Cloud Storage bucket**: Navigate to the Cloud Storage Buckets page in the Google Cloud Console. Click "Create bucket", enter your details, and click "Create".

## Configuring Apache Iceberg to Use GCS
Contributor

should this maybe show how this can be used via the Java API vs Spark? Most users would likely want to know how to use GCS with Spark

Contributor

I agree. We don't expect anyone to instantiate these directly. They are provided by Table instances.

docs/gcs.md Outdated

## Setting Up Google Cloud Storage (GCS)

### Setting Up a Bucket in GCS
Contributor

I don't think this is needed in Iceberg docs. This would be more appropriate for a "Using Iceberg with GCS" tutorial, but not the doc on GCS FileIO.

docs/gcs.md Outdated

Within these property key-value pairs, inputDataLocation and metadataLocation are the locations in your GCS bucket where your data and metadata are stored. Update `"gs://my_bucket/data/" ` and `"gs://my_bucket/metadata/" ` to reflect the corresponding paths of your GCS bucket.

### Example Use of GCSFileIO
Contributor

I think this is more useful for a general FileIO doc, not one specific to GCS.


```java
InputFile inputFile = gcsFileIO.newInputFile("gs://my_bucket/data/my_data.parquet");
```
Contributor

Nit: unnecessary newline.

Author

@mhmohona mhmohona Aug 6, 2023

It's needed for formatting purposes, otherwise it violates this rule - https://github.com/DavidAnson/markdownlint/blob/v0.29.0/doc/md031.md


These steps will allow you to set up GCS as your storage layer for Apache Iceberg and interact with the data stored in GCS using the `GCSFileIO` class.

## Loading Data into Iceberg Tables
Contributor

I think the main thing here is to set up a warehouse with a gs:// URI for its base warehouse location.
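In other words, the key change would be along these lines; the bucket name is a placeholder, not a value from the PR:

```shell
# Hedged sketch: base the catalog's warehouse on a gs:// URI instead of a local path.
--conf spark.sql.catalog.local.warehouse=gs://my_bucket/warehouse
```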

docs/gcs.md Outdated
```bash
--conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog \
--conf spark.sql.catalog.spark_catalog.type=hive \
--conf spark.sql.catalog.local=org.apache.iceberg.spark.SparkCatalog \
--conf spark.sql.catalog.local.type=hadoop \
```
Contributor

Examples should not use the hadoop catalog type. hive is probably better.
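A sketch of the example rewritten with a Hive catalog, per that suggestion; the catalog name, metastore URI, and bucket are placeholder assumptions:

```shell
# Hedged sketch: hive catalog type instead of hadoop, with a gs:// warehouse.
spark-sql \
  --conf spark.sql.catalog.my_catalog=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.my_catalog.type=hive \
  --conf spark.sql.catalog.my_catalog.uri=thrift://localhost:9083 \
  --conf spark.sql.catalog.my_catalog.warehouse=gs://my_bucket/warehouse
```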

@mhmohona mhmohona requested review from nastra and rdblue August 7, 2023 01:51
- limitations under the License.
-->

# Iceberg GSC Integration
Contributor

@bryanck bryanck Aug 7, 2023

typo: I think you meant GCS

- **Add Iceberg to Spark**: If you already have a Spark environment, you can add Iceberg by specifying the `--packages` option when starting Spark. This will download the required Iceberg package and make it available in your Spark session. Here is an example:

```bash
spark-shell --packages org.apache.iceberg:iceberg-spark-runtime-3.4_2.12:1.3.1
```
Contributor

This currently isn't enough. The runtime doesn't currently contain iceberg-gcp nor does it contain the Google java client libraries. See #8231 for a proposal around this. You might also want to comment on configuring authentication, e.g. setting the GOOGLE_APPLICATION_CREDENTIALS environment variable or setting the GCSFileIO properties for that.
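One possible shape of such a command, sketched under the assumption that `iceberg-gcp` and the Google Cloud Storage client library are pulled in as separate packages; the artifact coordinates and client version are assumptions and should be checked against the release in use:

```shell
# Hedged sketch: add iceberg-gcp and the GCS client alongside the Spark runtime,
# and authenticate via a service-account key file (path is a placeholder).
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account.json
spark-shell --packages \
  org.apache.iceberg:iceberg-spark-runtime-3.4_2.12:1.3.1,org.apache.iceberg:iceberg-gcp:1.3.1,com.google.cloud:google-cloud-storage:2.24.0
```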

@bryanck
Contributor

bryanck commented Aug 7, 2023

Setting up Iceberg with GCS recently came up on the Slack channel, there are some shorthand notes on getting up and running with the REST catalog in case it is useful as a reference.

@github-actions

github-actions bot commented Sep 9, 2024

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions.

@github-actions github-actions bot added the stale label Sep 9, 2024
@github-actions

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

@github-actions github-actions bot closed this Sep 17, 2024