SPARK-22833 [Improvement] in SparkHive Scala Examples #20018
examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala

@@ -19,8 +19,7 @@ package org.apache.spark.examples.sql.hive
 // $example on:spark_hive$
 import java.io.File

-import org.apache.spark.sql.Row
-import org.apache.spark.sql.SparkSession
+import org.apache.spark.sql.{Row, SaveMode, SparkSession}
 // $example off:spark_hive$

 object SparkHiveExample {
@@ -104,6 +103,60 @@ object SparkHiveExample {
     // ...
     // $example off:spark_hive$

+    // Save the DataFrame to a Hive managed table in Parquet format.
+    /*
+     * 1. Create a Hive database/schema, with an explicit HDFS location if desired; otherwise the
+     *    default warehouse location is used to store the Hive table data.
+     *    Ex: CREATE DATABASE IF NOT EXISTS database_name LOCATION hdfs_path;
+     *    You don't have to give a location for each table; every table under the schema is stored
+     *    at the location given when the schema was created.
+     * 2. Create a Hive managed table stored as Parquet.
+     *    Ex: CREATE TABLE records(key int, value string) STORED AS PARQUET;
+     */
+    val hiveTableDF = sql("SELECT * FROM records").toDF()
+    hiveTableDF.write.mode(SaveMode.Overwrite).saveAsTable("database_name.records")
+
+    // Save the DataFrame to a Hive external table in Spark-compatible Parquet format.
+    /*
+     * 1. Create a Hive external table stored as Parquet.
+     *    Ex: CREATE EXTERNAL TABLE records(key int, value string) STORED AS PARQUET;
+     * Since we do not explicitly provide a database location, the default warehouse location set by
+     * 'spark.sql.warehouse.dir' when creating the SparkSession with enableHiveSupport() is used.
+     * For example, with '/user/hive/warehouse/' as the warehouse location, a schema directory such
+     * as '/user/hive/warehouse/database_name.db' is created under it.
+     */
+
+    // Make Hive's Parquet format compatible with Spark's Parquet format.
+    spark.sqlContext.setConf("spark.sql.parquet.writeLegacyFormat", "true")
+
+    // Multiple Parquet files may be created under the given directory, depending on the data volume.
+    val hiveExternalTableLocation = "/user/hive/warehouse/database_name.db/records"
+    hiveTableDF.write.mode(SaveMode.Overwrite).parquet(hiveExternalTableLocation)
+
+    // Turn on the flags for dynamic partitioning.
+    spark.sqlContext.setConf("hive.exec.dynamic.partition", "true")
+    spark.sqlContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
+
+    // Partitioning the Hive table lets downstream queries run much faster.
+    hiveTableDF.write.mode(SaveMode.Overwrite).partitionBy("key")
+      .parquet(hiveExternalTableLocation)
+
+    /*
+     * If the data volume is very large, every partition ends up with many small files, which can
+     * hurt downstream query performance through file, bandwidth, network, and disk I/O.
+     * To improve performance you can create a single Parquet file per partition directory by
+     * calling 'repartition' on the partition key of the Hive table.
+     */
+    hiveTableDF.repartition($"key").write.mode(SaveMode.Overwrite)
Contributor (@cloud-fan): This is not a standard usage; let's not put it in the example.

Author: @cloud-fan I removed all the comments. As discussed with @srowen, it makes more sense to have this in the docs, with the inconsistency removed.
+      .partitionBy("key").parquet(hiveExternalTableLocation)
+
+    /*
+     * You can also use 'coalesce' to control the number of files under each partition.
+     * 'repartition' does a full shuffle and distributes the data evenly across partitions, while
+     * 'coalesce' can reduce the number of files to the given 'Int' argument without a full shuffle.
+     */
+    // A coalesce of 10 can create up to 10 Parquet files under each partition,
+    // if the data is large enough that partitioning makes sense.
+    hiveTableDF.coalesce(10).write.mode(SaveMode.Overwrite)
Contributor: ditto
+      .partitionBy("key").parquet(hiveExternalTableLocation)
     spark.stop()
   }
 }
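For context (not part of the diff): the added code assumes the surrounding scaffolding of SparkHiveExample, sketched below. The warehouse path, the Record case class, and the size of the sample data are illustrative details drawn from the existing example, not from this PR.

    import java.io.File

    import org.apache.spark.sql.{Row, SaveMode, SparkSession}

    object SparkHiveExample {
      case class Record(key: Int, value: String)

      def main(args: Array[String]): Unit = {
        // warehouseLocation points to the default location for managed databases and tables
        val warehouseLocation = new File("spark-warehouse").getAbsolutePath

        val spark = SparkSession
          .builder()
          .appName("Spark Hive Example")
          .config("spark.sql.warehouse.dir", warehouseLocation)
          .enableHiveSupport()
          .getOrCreate()

        // Bring `sql` and the `$"..."` column syntax used in the diff into scope
        import spark.implicits._
        import spark.sql

        // Register a small `records` view so that `SELECT * FROM records` has data to read
        spark.createDataFrame((1 to 100).map(i => Record(i, s"val_$i")))
          .createOrReplaceTempView("records")

        // ... the write examples from the diff above go here ...

        spark.stop()
      }
    }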
Reviewer (@srowen): Do you not want the code below to render in the docs as part of the example? Maybe not; just checking whether that's intentional.

Author: @srowen Thank you for the valuable review feedback. I added it so it can help other developers.

Author: @srowen Can you please review this? cc @holdenk @sameeragarwal

Author: @srowen I have updated the DDL for storing data with partitioning in Hive.
cc @HyukjinKwon @mgaido91 @markgrover @markhamstra

Reviewer (@srowen): Why do you turn the example listing off and then on again? Just remove those two lines.

Author: @srowen I misunderstood your first comment. I have reverted it as suggested. Please check now.
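As a footnote to the DDL discussion above: the updated DDL itself is not shown in this revision of the diff. A partitioned external table matching the partitioned Parquet writes would look roughly like the sketch below; this is an assumption based on standard Hive syntax, and the database, table, column names, and location are the placeholders used throughout the example.

    // Hypothetical DDL for a Hive external table matching the partitioned writes above.
    // Note the partition column `key` is declared in PARTITIONED BY, not in the column list.
    spark.sql(
      """CREATE EXTERNAL TABLE IF NOT EXISTS database_name.records (value STRING)
        |PARTITIONED BY (key INT)
        |STORED AS PARQUET
        |LOCATION '/user/hive/warehouse/database_name.db/records'""".stripMargin)

    // After writing files directly to that location, register the partitions with Hive's metastore:
    spark.sql("MSCK REPAIR TABLE database_name.records")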