[SPARK-38215][SQL] InsertIntoHiveDir should use data source if it's convertible#35528

Closed
AngersZhuuuu wants to merge 4 commits into apache:master from AngersZhuuuu:SPARK-38215

Conversation

@AngersZhuuuu
Contributor

What changes were proposed in this pull request?

Currently, the Spark SQL statement

INSERT OVERWRITE DIRECTORY 'path'
STORED AS PARQUET
query

is not converted to use InsertIntoDataSourceCommand; it still uses the Hive SerDe to write data. As a result, we cannot use features provided by newer Parquet/ORC versions, such as zstd compression:

spark-sql> INSERT OVERWRITE DIRECTORY 'hdfs://nameservice/user/hive/warehouse/test_zstd_dir'
         > stored as parquet
         > select 1 as id;
[Stage 5:>                                                          (0 + 1) / 1]22/02/15 16:49:31 WARN TaskSetManager: Lost task 0.0 in stage 5.0 (TID 5, ip-xx-xx-xx-xx, executor 21): org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.IllegalArgumentException: No enum constant parquet.hadoop.metadata.CompressionCodecName.ZSTD
	at org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getHiveRecordWriter(HiveFileFormatUtils.java:249)
	at org.apache.spark.sql.hive.execution.HiveOutputWriter.<init>(HiveFileFormat.scala:123)
	at org.apache.spark.sql.hive.execution.HiveFileFormat$$anon$1.newInstance(HiveFileFormat.scala:103)
	at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:120)
	at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.<init>(FileFormatDataWriter.scala:108)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:269)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:203)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:202)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:123)
	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

Why are the changes needed?

Converting InsertIntoHiveDirCommand to InsertIntoDataSourceCommand supports more features of Parquet/ORC, such as newer compression codecs.
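For example, once the directory write goes through the Parquet data source writer, the standard data source compression config applies. A hedged spark-sql sketch (the path is illustrative; zstd requires a Parquet version that ships the codec):

```sql
-- Sketch: with the conversion from this PR, the write below goes through
-- the Parquet data source writer, so the zstd codec is accepted.
SET spark.sql.parquet.compression.codec=zstd;

INSERT OVERWRITE DIRECTORY '/tmp/test_zstd_dir'
STORED AS PARQUET
SELECT 1 AS id;
```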

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added UT

@github-actions github-actions bot added the SQL label Feb 15, 2022
@AngersZhuuuu
Contributor Author

Gentle ping @cloud-fan @viirya, could you take a look? It's a useful feature.

@AngersZhuuuu
Contributor Author

Also ping @dongjoon-hyun @HyukjinKwon


private def convertProvider(storage: CatalogStorageFormat): String = {
  val serde = storage.serde.getOrElse("").toLowerCase(Locale.ROOT)
  Some("parquet").filter(serde.contains).getOrElse("orc")
}
Contributor


nit:

if (serde.contains("parquet")) "parquet" else "orc"

is much simpler

Contributor Author


if (serde.contains("parquet")) "parquet" else "orc"

updated
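The one-liner in `convertProvider` is just a string check on the SerDe class name. A hypothetical Python sketch of the same selection logic (names are illustrative, not Spark API; only Parquet/ORC SerDes reach this code path in the rule):

```python
def convert_provider(serde):
    """Pick the data source provider for a convertible Hive SerDe.

    Mirrors the Scala snippet above: any SerDe class name containing
    "parquet" maps to the parquet source; everything else falls back
    to orc. A missing SerDe also falls back to orc.
    """
    normalized = (serde or "").lower()
    return "parquet" if "parquet" in normalized else "orc"
```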

* - When writing to partitioned Hive-serde Parquet/Orc tables when
* `spark.sql.hive.convertInsertingPartitionedTable` is true
* - When writing to directory with Hive-serde
* - When writing to non-partitioned Hive-serde Parquet/ORC tables using CTAS
Contributor Author


@cloud-fan Updated the comment of this rule, and also added a comment about CTAS.
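Each conversion path listed in that rule comment is gated by a Hive conversion config. A hedged spark-sql summary (the first three config names exist in HiveUtils; the last is the directory-write flag this PR is assumed to introduce):

```sql
SET spark.sql.hive.convertMetastoreParquet=true;          -- Parquet SerDe tables
SET spark.sql.hive.convertMetastoreOrc=true;              -- ORC SerDe tables
SET spark.sql.hive.convertInsertingPartitionedTable=true; -- partitioned-table inserts
SET spark.sql.hive.convertMetastoreInsertDir=true;        -- INSERT OVERWRITE DIRECTORY (assumed name, added by this PR)
```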

@AngersZhuuuu
Contributor Author

Gentle ping @cloud-fan, GA passed.

@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in a92f873 Feb 18, 2022
@PengleiShi

@AngersZhuuuu Hi, when the inserted directory has the same path as the selected table's location, this may cause an error: https://issues.apache.org/jira/browse/SPARK-38215
