Added documentation on getting started with GCS #8171

Conversation
nastra left a comment
@bryanck since you're very familiar with this, could you please also review whether the described configuration steps make sense?
docs/gcs.md (outdated)

```diff
  ## Configuring Apache Iceberg to Use GCS

- Apache Iceberg uses the GCSFileIO to read and write data from/to GCS. To configure this:
+ Apache Iceberg uses the `GCSFileIO` to read and write data from/to GCS. To configure this:
```
docs/gcs.md (outdated)

```diff
  To load data into Iceberg tables using Apache Spark, you must first add Iceberg to your Spark environment. It can be done using the `--packages` option when starting the Spark shell or Spark SQL:

- spark-shell --packages org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:1.3.1
+ spark-shell --packages org.apache.iceberg:iceberg-spark-runtime-3.4_2.12:1.3.1
```
I don't think we want a tutorial for starting Spark here. It is good to know what additional libraries are needed for running GCSFileIO if you are using the runtime bundle though.
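Following this note, the Spark runtime bundle alone does not include `iceberg-gcp` or the Google Java client libraries. A minimal sketch of pulling them in via `--packages` (the artifact versions here are illustrative assumptions; check Maven Central for coordinates matching your Iceberg and Spark versions):

```shell
# Sketch: add the GCS-related libraries alongside the Spark runtime bundle.
# Versions are placeholder assumptions, not a tested combination.
spark-shell --packages \
org.apache.iceberg:iceberg-spark-runtime-3.4_2.12:1.3.1,\
org.apache.iceberg:iceberg-gcp:1.3.1,\
com.google.cloud:google-cloud-storage:2.22.2
```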
docs/gcs.md (outdated)

```diff
  spark-shell --packages org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:1.3.1
- spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:1.3.1
+ spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.4_2.12:1.3.1
```
docs/gcs.md
Outdated
| Catalogs in Iceberg are used to track tables. They can be configured using properties under `spark.sql.catalog.(catalog_name)`. Here is an example of how to configure a catalog: | ||
|
|
||
| ```bash | ||
| spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:1.3.1\ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:1.3.1\ | |
| spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.4_2.12:1.3.1\ |
| --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \ | ||
| --conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog \ | ||
| --conf spark.sql.catalog.spark_catalog.type=hive \ | ||
| --conf spark.sql.catalog.local=org.apache.iceberg.spark.SparkCatalog \ |
I think it would be easier to understand when showing an example where only a single catalog is being initialized (so either local or spark_catalog)
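A single-catalog variant along these lines might look like the following sketch (the catalog name and `gs://` bucket path are placeholder assumptions; the `hadoop` type is kept from the original example only for brevity):

```shell
# Sketch: initialize only one catalog, `local`, rather than two.
# Bucket path and version are illustrative assumptions.
spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.4_2.12:1.3.1 \
  --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
  --conf spark.sql.catalog.local=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.local.type=hadoop \
  --conf spark.sql.catalog.local.warehouse=gs://my_bucket/warehouse \
  --conf spark.sql.defaultCatalog=local
```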
```bash
  --conf spark.sql.catalog.local=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.local.type=hadoop \
  --conf spark.sql.catalog.local.warehouse=$PWD/warehouse \
  --conf spark.sql.defaultCatalog=local
```
Shouldn't this have a `--conf spark.sql.catalog.local.io-impl=org.apache.iceberg.gcp.gcs.GCSFileIO` to properly pick up `GCSFileIO`?
ResolvingFileIO should pick it based on the gs scheme.
+1 ResolvingFileIO will automatically select GCSFileIO based on the scheme in the next release. It might be better to use the REST catalog as an example as it is easier to set up than a HMS, and no io-impl needs to be set.
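A REST-catalog example in the spirit of this suggestion might look like the following sketch, assuming a REST catalog service at `localhost:8181` and a placeholder `gs://` warehouse path:

```shell
# Sketch: configure a REST catalog; no io-impl needs to be set, since
# ResolvingFileIO selects GCSFileIO from the gs:// scheme.
# Endpoint, catalog name, and bucket are illustrative assumptions.
spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.4_2.12:1.3.1 \
  --conf spark.sql.catalog.rest=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.rest.type=rest \
  --conf spark.sql.catalog.rest.uri=http://localhost:8181 \
  --conf spark.sql.catalog.rest.warehouse=gs://my_bucket/warehouse \
  --conf spark.sql.defaultCatalog=rest
```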
docs/gcs.md (outdated)

```bash
  --conf spark.sql.catalog.spark_catalog.type=hive \
  --conf spark.sql.catalog.local=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.local.type=hadoop \
  --conf spark.sql.catalog.local.warehouse=$PWD/warehouse \
```
I think this would point to the respective gs:... bucket?
Yes, agreed.
docs/gcs.md (outdated)

```java
Map<String, String> properties = new HashMap<>();
properties.put("inputDataLocation", "gs://my_bucket/data/");
```
where is inputDataLocation actually being used?
docs/gcs.md (outdated)

> - **Create a Cloud Storage bucket**: Navigate to the Cloud Storage Buckets page in the Google Cloud Console. Click "Create bucket", enter your details, and click "Create".
>
> ## Configuring Apache Iceberg to Use GCS
should this maybe show how this can be used via the Java API vs Spark? Most users would likely want to know how to use GCS with Spark
I agree. We don't expect anyone to instantiate these directly. They are provided by Table instances.
docs/gcs.md (outdated)

> ## Setting Up Google Cloud Storage (GCS)
>
> ### Setting Up a Bucket in GCS
I don't think this is needed in Iceberg docs. This would be more appropriate for a "Using Iceberg with GCS" tutorial, but not the doc on GCS FileIO.
docs/gcs.md (outdated)

> Within these property key-value pairs, inputDataLocation and metadataLocation are the locations in your GCS bucket where your data and metadata are stored. Update `"gs://my_bucket/data/"` and `"gs://my_bucket/metadata/"` to reflect the corresponding paths of your GCS bucket.
>
> ### Example Use of GCSFileIO
I think this is more useful for a general FileIO doc, not one specific to GCS.
```java
InputFile inputFile = gcsFileIO.newInputFile("gs://my_bucket/data/my_data.parquet");
```
Nit: unnecessary newline.
It's needed for formatting purposes, otherwise it violates this rule: https://github.com/DavidAnson/markdownlint/blob/v0.29.0/doc/md031.md
docs/gcs.md (outdated)

> These steps will allow you to set up GCS as your storage layer for Apache Iceberg and interact with the data stored in GCS using the `GCSFileIO` class.
>
> ## Loading Data into Iceberg Tables
I think the main thing here is to set up a warehouse with a gs:// URI for its base warehouse location.
docs/gcs.md (outdated)

```bash
  --conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog \
  --conf spark.sql.catalog.spark_catalog.type=hive \
  --conf spark.sql.catalog.local=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.local.type=hadoop \
```
Examples should not use the hadoop catalog type. hive is probably better.
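A `hive`-type variant of the example might look like the following sketch (the metastore URI, catalog name, and bucket path are placeholder assumptions):

```shell
# Sketch: use the hive catalog type instead of hadoop, as suggested.
# thrift://localhost:9083 is a placeholder Hive Metastore URI.
spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.4_2.12:1.3.1 \
  --conf spark.sql.catalog.hive_cat=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.hive_cat.type=hive \
  --conf spark.sql.catalog.hive_cat.uri=thrift://localhost:9083 \
  --conf spark.sql.catalog.hive_cat.warehouse=gs://my_bucket/warehouse
```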
docs/gcs.md

>   limitations under the License.
> -->
>
> # Iceberg GSC Integration
typo: I think you meant GCS
docs/gcs.md (outdated)

> - **Add Iceberg to Spark**: If you already have a Spark environment, you can add Iceberg by specifying the `--packages` option when starting Spark. This will download the required Iceberg package and make it available in your Spark session. Here is an example:

```bash
spark-shell --packages org.apache.iceberg:iceberg-spark-runtime-3.4_2.12:1.3.1
```
This currently isn't enough. The runtime doesn't currently contain iceberg-gcp nor does it contain the Google java client libraries. See #8231 for a proposal around this. You might also want to comment on configuring authentication, e.g. setting the GOOGLE_APPLICATION_CREDENTIALS environment variable or setting the GCSFileIO properties for that.
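Per this comment, authentication also needs to be configured. One way is setting the `GOOGLE_APPLICATION_CREDENTIALS` environment variable before launching Spark (the key-file path is a placeholder):

```shell
# Sketch: point the Google client libraries at a service-account key file
# before starting Spark. The path is a placeholder assumption.
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account-key.json
spark-shell --packages org.apache.iceberg:iceberg-spark-runtime-3.4_2.12:1.3.1
```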
Setting up Iceberg with GCS recently came up on the Slack channel; there are some shorthand notes on getting up and running with the REST catalog in case it is useful as a reference.
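In the spirit of those notes, one quick way to stand up a local REST catalog for experimentation (the image name and port are assumptions; `tabulario/iceberg-rest` was a community-maintained image at the time):

```shell
# Sketch: run a standalone Iceberg REST catalog locally on port 8181,
# which Spark can then target via spark.sql.catalog.<name>.uri.
docker run -d -p 8181:8181 tabulario/iceberg-rest
```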
This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that's incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions.

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.
Added documentation on getting started with GCS with the following content:
Refers to - #7948