[Doc] [SQL] Adds Hive metastore Parquet table conversion section #5348

@@ -22,7 +22,7 @@ The DataFrame API is available in [Scala](api/scala/index.html#org.apache.spark.

All of the examples on this page use sample data included in the Spark distribution and can be run in the `spark-shell`, `pyspark` shell, or `sparkR` shell.

-## Starting Point: `SQLContext`
+## Starting Point: SQLContext

<div class="codetabs">
<div data-lang="scala" markdown="1">

@@ -1036,6 +1036,15 @@ for (teenName in collect(teenNames)) {

</div>

<div data-lang="python" markdown="1">

{% highlight python %}
# sqlContext is an existing HiveContext
sqlContext.sql("REFRESH TABLE my_table")
{% endhighlight %}

</div>

<div data-lang="sql" markdown="1">

{% highlight sql %}

@@ -1054,7 +1063,7 @@ SELECT * FROM parquetTable

</div>

-### Partition discovery
+### Partition Discovery

Table partitioning is a common optimization approach used in systems like Hive. In a partitioned
table, data are usually stored in different directories, with partitioning column values encoded in

@@ -1108,7 +1117,7 @@ can be configured by `spark.sql.sources.partitionColumnTypeInference.enabled`, w
`true`. When type inference is disabled, string type will be used for the partitioning columns.
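
The automatic type inference described here can be turned off before a partitioned table is loaded. Below is a minimal sketch, assuming Spark 1.4+ (for `SQLContext.read`) and an existing `SQLContext`; the table path is hypothetical:

{% highlight scala %}
// sqlContext is an existing SQLContext
// Treat partitioning column values as strings instead of inferring numeric types
sqlContext.setConf("spark.sql.sources.partitionColumnTypeInference.enabled", "false")

// Hypothetical partitioned table location; partition columns now resolve to StringType
val df = sqlContext.read.parquet("path/to/partitioned/table")
df.printSchema()
{% endhighlight %}
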
-### Schema merging
+### Schema Merging

Like ProtocolBuffer, Avro, and Thrift, Parquet also supports schema evolution. Users can start with
a simple schema, and gradually add more columns to the schema as needed. In this way, users may end

@@ -1208,6 +1217,79 @@ printSchema(df3)

</div>

### Hive metastore Parquet table conversion

When reading from and writing to Hive metastore Parquet tables, Spark SQL will try to use its own
Parquet support instead of Hive SerDe for better performance. This behavior is controlled by the
`spark.sql.hive.convertMetastoreParquet` configuration, and is turned on by default.
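
When the native conversion gets in the way (for example, while comparing results against the Hive SerDe path), it can be switched off with the same setting. A minimal sketch, assuming an existing `HiveContext`; the table name is hypothetical:

{% highlight scala %}
// sqlContext is an existing HiveContext
// Fall back to the Hive SerDe instead of Spark SQL's native Parquet support
sqlContext.setConf("spark.sql.hive.convertMetastoreParquet", "false")

// Subsequent reads of Hive metastore Parquet tables now go through the Hive SerDe
val df = sqlContext.table("my_parquet_table")  // hypothetical table name

// Restore the default behavior afterwards
sqlContext.setConf("spark.sql.hive.convertMetastoreParquet", "true")
{% endhighlight %}
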
#### Hive/Parquet Schema Reconciliation

There are two key differences between Hive and Parquet from the perspective of table schema
processing.

1. Hive is case insensitive, while Parquet is not
1. Hive considers all columns nullable, while nullability in Parquet is significant

Due to this reason, we must reconcile the Hive metastore schema with the Parquet schema when converting a
Hive metastore Parquet table to a Spark SQL Parquet table. The reconciliation rules are:

1. Fields that have the same name in both schemas must have the same data type regardless of
   nullability. The reconciled field should have the data type of the Parquet side, so that
   nullability is respected.

1. The reconciled schema contains exactly those fields defined in the Hive metastore schema.

   - Any fields that only appear in the Parquet schema are dropped in the reconciled schema.
   - Any fields that only appear in the Hive metastore schema are added as nullable fields in the
     reconciled schema.
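
As an illustration of the rules above, consider a hypothetical table whose Hive metastore schema and Parquet file schema differ in casing, nullability, and field sets. This is only a sketch of the reconciled result, expressed with Spark SQL's `StructType` API; all field names are made up:

{% highlight scala %}
import org.apache.spark.sql.types._

// Hypothetical Hive metastore schema: lower-cased names, every column nullable
val hiveSchema = StructType(Seq(
  StructField("key", IntegerType, nullable = true),
  StructField("value", StringType, nullable = true),
  StructField("extra", StringType, nullable = true)))

// Hypothetical Parquet file schema: original casing, nullability is significant
val parquetSchema = StructType(Seq(
  StructField("Key", IntegerType, nullable = false),
  StructField("Value", StringType, nullable = true),
  StructField("ignored", LongType, nullable = true)))

// Reconciled schema, following the rules above:
//  - "key"/"Key" and "value"/"Value" match case-insensitively and keep the
//    Parquet side's data type and nullability
//  - "extra" only exists in the Hive metastore schema, so it is added as nullable
//  - "ignored" only exists in the Parquet schema, so it is dropped
val reconciledSchema = StructType(Seq(
  StructField("key", IntegerType, nullable = false),
  StructField("value", StringType, nullable = true),
  StructField("extra", StringType, nullable = true)))
{% endhighlight %}
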
#### Metadata Refreshing

Spark SQL caches Parquet metadata for better performance. When Hive metastore Parquet table
conversion is enabled, metadata of those converted tables is also cached. If these tables are
updated by Hive or other external tools, you need to refresh them manually to ensure consistent
metadata.

> **Contributor:** Actually, I was wondering if we have a section that explains what a data source table is?
>
> **Contributor (Author):** Agree, missing such a section is part of the reason why I put the metadata refreshing section here...

<div class="codetabs">

<div data-lang="scala" markdown="1">

{% highlight scala %}
// sqlContext is an existing HiveContext
sqlContext.refreshTable("my_table")
{% endhighlight %}

</div>

<div data-lang="java" markdown="1">

{% highlight java %}
// sqlContext is an existing HiveContext
sqlContext.refreshTable("my_table");
{% endhighlight %}

</div>

<div data-lang="python" markdown="1">

{% highlight python %}
# sqlContext is an existing HiveContext
sqlContext.refreshTable("my_table")
{% endhighlight %}

</div>

<div data-lang="sql" markdown="1">

{% highlight sql %}
REFRESH TABLE my_table;
{% endhighlight %}

</div>

</div>

### Configuration

Configuration of Parquet can be done using the `setConf` method on `SQLContext` or by running

@@ -1445,8 +1527,8 @@ This command builds a new assembly jar that includes Hive. Note that this Hive a
on all of the worker nodes, as they will need access to the Hive serialization and deserialization libraries
(SerDes) in order to access data stored in Hive.

Configuration of Hive is done by placing your `hive-site.xml` file in `conf/`. Please note that when running
a query on a YARN cluster (`yarn-cluster` mode), the `datanucleus` jars under the `lib_managed/jars` directory
and `hive-site.xml` under the `conf/` directory need to be available on the driver and all executors launched by the
YARN cluster. A convenient way to do this is to add them through the `--jars` and `--files` options of the
`spark-submit` command.
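
A sketch of what such a submission could look like; the datanucleus jar versions, application class, and application jar below are placeholders rather than values taken from the guide:

{% highlight bash %}
# Hypothetical class, jar names, and versions; adjust to the actual build output
./bin/spark-submit \
  --class com.example.MyHiveApp \
  --master yarn-cluster \
  --jars lib_managed/jars/datanucleus-api-jdo-3.2.6.jar,lib_managed/jars/datanucleus-core-3.2.10.jar,lib_managed/jars/datanucleus-rdbms-3.2.9.jar \
  --files conf/hive-site.xml \
  my-hive-app.jar
{% endhighlight %}
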
@@ -1889,7 +1971,7 @@ options.
#### DataFrame data reader/writer interface

Based on user feedback, we created a new, more fluid API for reading data in (`SQLContext.read`)
and writing data out (`DataFrame.write`),
and deprecated the old APIs (e.g. `SQLContext.parquetFile`, `SQLContext.jsonFile`).
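
For example, a brief sketch contrasting the deprecated calls with the new interface; the paths are hypothetical and Spark 1.4+ is assumed:

{% highlight scala %}
// sqlContext is an existing SQLContext; paths are hypothetical

// Deprecated APIs
val oldStyle = sqlContext.parquetFile("path/to/people.parquet")
oldStyle.saveAsParquetFile("path/to/output-old")

// New reader/writer interface
val newStyle = sqlContext.read.parquet("path/to/people.parquet")
newStyle.write.parquet("path/to/output-new")
{% endhighlight %}
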
See the API docs for `SQLContext.read` (

> **Reviewer:** Seems it is under "Hive metastore Parquet table conversion". However, users may need to call `REFRESH TABLE` in other cases too, right? For example, when they manually copy data into the directory of a data source table.
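
To illustrate the reviewer's point, here is a hedged sketch of that other case: files copied into a data source table's directory by an external process are only visible after a refresh. The table name and path are hypothetical:

{% highlight scala %}
// sqlContext is an existing HiveContext
// "logs" is a hypothetical data source table backed by files under /data/logs

// ... an external tool copies new Parquet files into /data/logs ...

// Cached metadata may still reflect the old file listing
sqlContext.table("logs").count()

// Refresh to pick up the manually added files
sqlContext.refreshTable("logs")
sqlContext.table("logs").count()
{% endhighlight %}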