This repository was archived by the owner on Feb 6, 2024. It is now read-only.

Conversation

@samredai
Contributor

@samredai samredai commented Apr 25, 2022

This adds a Spark Quickstart page that effectively replaces the Spark Getting-Started page in the docs site.

This quickstart uses the tabulario/spark-iceberg docker image, and all code snippets aim to be directly copy-pasteable and to run successfully in a fresh `docker-compose up`. Furthermore, all code snippets are provided with tabs showing the equivalent logic in SQL, Scala, or Python.

This is dependent on PR #73 and is part of the broader initiative outlined in this issue in the iceberg repo.

Note: I first added the current getting-started page as-is in one commit so that the diff is visible in the next commit, 7d643ba.

UPDATED

### Creating a table

To create your first Iceberg table in Spark, run a [`CREATE TABLE`](../spark-ddl#create-table) command. In the following example, we'll create a table named `prod.nyc.taxis`, where `prod` is the catalog name, `nyc` is the schema name, and `taxis` is the table name.
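A minimal sketch of what that command might look like in Spark SQL; the column list here is purely illustrative and not part of the page:

```sql
-- Hypothetical columns for illustration only.
CREATE TABLE prod.nyc.taxis (
  vendor_id bigint,
  trip_id bigint,
  trip_distance float,
  fare_amount double,
  store_and_fwd_flag string
) USING iceberg;
```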
Contributor

Should we use `test` or a different catalog name? This isn't a prod catalog, so it seems odd to call it `prod`.

Contributor Author

Thanks for noticing this. I've changed the catalog name everywhere to `demo`, which is what we use in the docker image and seems to be sticking as the catalog name we use in examples elsewhere.

@samredai
Contributor Author

Thanks @rdblue! Sorry for taking a while to address the comments here. I have some more structural theme changes that I'm getting in now, and then this should be fully ready for review and merge.

@samredai samredai marked this pull request as ready for review May 17, 2022 22:37
@samredai
Contributor Author

This is ready for review. I've added only the Quickstarts menu item in the top navbar. The Concepts section is hidden until we have some content to fill that out.

quickstart.mp4

@samredai
Contributor Author

Rebased and squashed all commits! (now that PR #73 has been merged)

{{% codetabs "AddIcebergToSpark" %}}
{{% addtab "SparkSQL" checked %}}
{{% addtab "SparkShell" %}}
{{% addtab "PySpark" %}}
Contributor

Would it be a good idea to use tabs for Spark version instead?

Contributor Author

Could this potentially be confusing for someone trying out Iceberg for the first time? The quickstart begins with a docker image, which we'd have to keep in sync with the examples here. If we also included examples for other Spark versions, someone trying this out would have to worry about which version of Spark is in the example image to make sure they're using the correct code snippets.

@samredai
Contributor Author

This is ready for another review. I've rebased this to use the new iceberg-theme and also included a "quickstart" shortcode that renders a "More Quickstarts" dropdown menu at the top of each quickstart guide. The shortcode also excludes the current quickstart page you're on from the dropdown, which means that until we have a second quickstart, the dropdown will be empty. Here's a video showing what this looks like; I also added a handful of entries to the quickstart menu to show what the dropdown looks like.

quickstart-cards.mp4

The fastest way to get started is to use a docker-compose file that uses the [tabulario/spark-iceberg](https://hub.docker.com/r/tabulario/spark-iceberg) image,
which contains a local Spark cluster with a configured Iceberg catalog. To use this, you'll need to install the [Docker CLI](https://docs.docker.com/get-docker/) as well as the [Docker Compose CLI](https://github.com/docker/compose-cli/blob/main/INSTALL.md).

Once you have those, save the yaml below into a file named `docker-compose.yml`:
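A minimal sketch of the shape such a file can take; the `spark-iceberg` container name matches the `docker exec` commands used later in the guide, while the port mapping is an assumption rather than the actual file shipped with the image:

```yaml
version: "3"

services:
  spark-iceberg:
    image: tabulario/spark-iceberg
    container_name: spark-iceberg
    ports:
      - 8888:8888   # assumed notebook port; the real compose file may define more services and ports
```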
Contributor

Later, we should consider adding a Quickstart folder in the Iceberg repo. We could have a tree like this:

  quickstart/
  |- README.md
  |- spark/
  |  |- README.md
  |  |- quickstart.ipynb
  |  `- docker-compose.yml
  `- flink/
     |- README.md
     |- quickstart.ipynb
     `- docker-compose.yml

- [Adding A Catalog](#adding-a-catalog)
- [Next Steps](#next-steps)

### Docker-Compose
Contributor

Later, we should consider keeping the Spark shell instructions, in case anyone already has Spark.
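A hedged sketch of what such Spark shell instructions could look like, along the lines of the existing getting-started page; the runtime version coordinates and the `local` catalog name are illustrative only:

```sh
# Illustrative coordinates: pick the iceberg-spark-runtime artifact matching your Spark/Scala version.
spark-shell --packages org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:0.13.2 \
  --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
  --conf spark.sql.catalog.local=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.local.type=hadoop \
  --conf spark.sql.catalog.local.warehouse=$PWD/warehouse
```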

- "quickstarts"
- "getting-started"
disableSidebar: true
disableToc: true
Contributor

Why disabled?

Contributor Author

Since we include a TOC at the top of the quickstart, it felt odd seeing both. Would you rather have the fixed TOC on the right and remove it from the intro section?

Contributor

I'd probably include it eventually, but that would happen when there's more content.

{{% /tabcontent %}}
{{% /codetabs %}}
{{< hint info >}}
You can also launch a notebook server by running `docker exec -it spark-iceberg notebook`.
Contributor

How about "If you prefer a PySpark notebook, you can start a notebook server by ..."? That way it is clear that the notebooks will be PySpark.

Contributor Author

The notebooks can actually be Python or Java (both kernels are available), although I know it's less likely that Java devs would want to use the Java kernel to run a Spark app. I've been meaning to add a Scala kernel as well.

Contributor Author

You can then run any of the following commands to start a Spark session.

{{% codetabs "LaunchSparkClient" %}}
{{% addtab "SparkSQL" checked %}}
Contributor

Is there a way to change all of the tabs at once, depending on what is checked for any of them? That would be really helpful.

Contributor Author

I was able to do this with a small amount of JavaScript and by introducing a concept of "groups" to the shortcodes behind this. It's described in the README that I add in PR #110, and the logic is added to iceberg-theme.js.

{{% /tabcontent %}}
{{% /codetabs %}}

### Adding Iceberg to Spark
Contributor

This doesn't fit with the flow of the quickstart page. It goes directly from an example of reading to a shell command to start Spark.

I think it would help to have more content explaining what is happening and giving context. This should also have an outline that is more clear, with higher-level sections like "Interacting with tables" where creating, reading, and writing sections will go.

I think this content should be in "Next steps" because it covers how to get Iceberg installed outside of the quickstart. Maybe that should be a Spark page and not a quickstart page, so Next steps just links to it?

Contributor Author

Agreed, the section definitely breaks the flow of the guide. I moved it to the end and now the order looks like:

  • Docker-Compose
  • Creating a table
  • Writing Data to a Table
  • Reading Data from a Table
  • Adding A Catalog
  • Next Steps

This pull request was closed.
