[SPARK-27592][SQL] Set the bucketed data source table SerDe correctly #24486
wangyum wants to merge 8 commits into apache:master from wangyum:SPARK-27592
Conversation
Test build #104987 has finished for PR 24486 at commit
retest this please
Test build #104994 has finished for PR 24486 at commit
Test build #105125 has finished for PR 24486 at commit
retest this please
Test build #105126 has finished for PR 24486 at commit
// Our bucketing is incompatible with Hive's (different hash function),
// but downstream systems (Hive/Presto) can still read it as a non-bucketed table.
// We set the SerDe correctly and bucketing_version to spark.
// The downstream systems decide how to read it by themselves; a similar implementation:
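For context, a sketch of the scenario these review comments describe (the table name `t` and the columns are illustrative assumptions, not taken from the patch):

```sql
-- Spark side: a bucketed Parquet data source table. Spark's bucketing uses a
-- different hash function than Hive's, so the layout is not Hive-compatible.
CREATE TABLE t (c1 INT, c2 INT)
USING parquet
CLUSTERED BY (c1) SORTED BY (c1) INTO 2 BUCKETS;

-- Hive side: with the SerDe set correctly, Hive can still scan the files,
-- but only as a plain (non-bucketed) Parquet table.
SELECT * FROM t;
```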
What is the impact if the downstream makes a wrong decision?
Sorry, it's not a bucketed table on the Hive side. Related code:
Test build #105253 has finished for PR 24486 at commit
viirya left a comment
Have you manually tested it, i.e. read the Spark bucketed table on the Hive side as a non-bucketed table?
Yes, I have tested it. Note that we should set
(None, message)
"Hive metastore in Spark SQL specific format, which is NOT compatible with Hive. " +
"But Hive can read it as not bucketed table."
(Some(newHiveCompatibleMetastoreTable(serde)), message)
Should we set bucketSpec = None?
Not necessary:
spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala
Lines 1009 to 1023 in fee695d
// It's not a bucketed table at Hive side
val client =
  spark.sharedState.externalCatalog.unwrapped.asInstanceOf[HiveExternalCatalog].client
val hiveSide = client.runSqlHive("DESC FORMATTED t")
Also check the results of the read path.
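A sketch of what that read-path check could look like, extending the metadata check quoted in the diff (the table name `t` and columns are illustrative assumptions; this only runs inside a Hive-enabled Spark test suite):

```scala
// Write path: inspect the table metadata as Hive sees it.
val client =
  spark.sharedState.externalCatalog.unwrapped.asInstanceOf[HiveExternalCatalog].client
val hiveSide = client.runSqlHive("DESC FORMATTED t")

// Read path: rows fetched through Hive should match what Spark wrote.
// runSqlHive returns the result rows as strings.
val hiveRows = client.runSqlHive("SELECT c1, c2 FROM t")
assert(hiveRows.nonEmpty)
```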
Test build #105723 has finished for PR 24486 at commit
Test build #105725 has finished for PR 24486 at commit
ping @cloud-fan
}

test("Set the bucketed data source table SerDe correctly") {
Let's include the JIRA ID in the test name.
LGTM
Test build #109147 has finished for PR 24486 at commit
retest this please
Test build #109149 has finished for PR 24486 at commit
thanks, merging to master!
|CLUSTERED BY (c1)
|SORTED BY (c1)
|INTO 2 BUCKETS
|AS SELECT 1 AS c1, 2 AS c2
A single row is not enough to prove Hive can read it correctly. Could you improve the tests?
In addition, try to create a partitioned and bucketed table and see whether it is readable by Hive.
You can create a separate test suite for it.
What changes were proposed in this pull request?
Hive uses an incorrect InputFormat (org.apache.hadoop.mapred.SequenceFileInputFormat) to read Spark's Parquet bucketed data source tables.
Spark side:
Hive side:
So it's a non-bucketed table on the Hive side. This PR sets the SerDe correctly so Hive can read these tables. Related code:
spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala
Lines 976 to 990 in 33f3c48
spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala
Lines 444 to 459 in f9776e3
How was this patch tested?
Unit tests.
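Combining the CTAS fragment and the reviewer feedback above, the unit test could take roughly this shape (a sketch, assuming a Hive-enabled test suite where `sql`, `withTable`, and `spark` are available; the data is illustrative):

```scala
test("SPARK-27592: Set the bucketed data source table SerDe correctly") {
  withTable("t") {
    // Create a Spark bucketed Parquet data source table via CTAS.
    sql(
      """
        |CREATE TABLE t USING parquet
        |CLUSTERED BY (c1)
        |SORTED BY (c1)
        |INTO 2 BUCKETS
        |AS SELECT 1 AS c1, 2 AS c2
      """.stripMargin)

    // Hive should be able to read the table (as a non-bucketed table).
    val client =
      spark.sharedState.externalCatalog.unwrapped.asInstanceOf[HiveExternalCatalog].client
    assert(client.runSqlHive("SELECT c1, c2 FROM t").nonEmpty)
  }
}
```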