[SPARK-17642] [SQL] support DESC EXTENDED/FORMATTED table column commands #16422

wzhfy · 2016-12-28T14:01:34Z

What changes were proposed in this pull request?

Support DESC (EXTENDED | FORMATTED) ? TABLE COLUMN command.
Support DESC EXTENDED | FORMATTED TABLE COLUMN command to show column-level statistics.
Do NOT support describing nested columns.

How was this patch tested?

Added test cases.

wzhfy · 2016-12-28T14:02:30Z

cc @cloud-fan @gatorsmile

SparkQA · 2016-12-28T15:45:55Z

Test build #70670 has finished for PR 16422 at commit 3058ab1.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class DescribeColumnCommand(

SparkQA · 2016-12-29T03:06:20Z

Test build #70691 has finished for PR 16422 at commit d41a9cd.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2016-12-29T04:51:17Z

sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala

shall we throw exception here if partition spec is given?

yes we should

cloud-fan · 2016-12-29T04:51:59Z

sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala

please follow other commands and add more description.

cloud-fan · 2016-12-29T04:52:28Z

sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala

can you add a link to the hive spec about this?

I got these names by running Hive. I can't find any documentation for the names, but I'll add a link to the corresponding Hive JIRA.

cloud-fan · 2016-12-29T04:53:04Z

sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala

where do we use isExtended?

I will remove it since the result is the same with or without isExtended.

cloud-fan · 2016-12-29T04:54:07Z

sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala

Why do we create an ArrayBuffer? Doesn't it always return a single row?

yea I'll delete it.

cloud-fan · 2016-12-29T05:00:42Z

sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala

I don't get it. So you get the attribute just for the column comment, name, and data type? I think CatalogTable.schema already has this information.

SparkQA · 2016-12-29T14:14:18Z

Test build #70719 has finished for PR 16422 at commit 30cb1ae.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2016-12-30T03:07:52Z

sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala

the parser rule for the column name here:

describeColName : identifier ('.' (identifier | STRING))* ;

Can we just make it identifier? Should "a.b" refer to a column named "a.b", or to the inner field "b" of column "a"? Let's check with other databases.

It seems mysql doesn't support struct or nested types? @gatorsmile Can you give some advice on this?

I assume we are following Hive syntax here? What is the behavior of Hive?

cloud-fan · 2016-12-30T03:09:44Z

sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala

shall we call getTempViewOrPermanentTableMetadata?

ok, thanks!

gatorsmile · 2016-12-30T07:55:00Z

What is the behavior of DESC COLUMN for the complex/nested type (map, struct, array)? Could you add a test case? Also copy and paste the result here?

wzhfy · 2016-12-30T08:00:54Z

Hive running result:

hive> desc student_test;
OK
id                  	int                 	                    
info                	struct<name:string,age:int>	                    
Time taken: 1.643 seconds, Fetched: 2 row(s)
hive> desc student_test info;
OK
name                	string              	from deserializer   
age                 	int                 	from deserializer   
Time taken: 0.062 seconds, Fetched: 2 row(s)
hive> desc student_test info.name;
OK
name                	string              	from deserializer   
Time taken: 0.061 seconds, Fetched: 1 row(s)
hive> desc formatted student_test info.name;
OK
# col_name            	data_type           	min                 	max                 	num_nulls           	distinct_count      	avg_col_len         	max_col_len         	num_trues           	num_falses          	comment             
	 	 	 	 	 	 	 	 	 	 
name                	string              	                    	                    	                    	                    	                    	                    	                    	                    	from deserializer   
Time taken: 0.252 seconds, Fetched: 3 row(s)
hive> desc formatted student_test info;
OK
# col_name            	data_type           	min                 	max                 	num_nulls           	distinct_count      	avg_col_len         	max_col_len         	num_trues           	num_falses          	comment             
	 	 	 	 	 	 	 	 	 	 
name                	string              	                    	                    	                    	                    	                    	                    	                    	                    	from deserializer   
age                 	int                 	                    	                    	                    	                    	                    	                    	                    	                    	from deserializer   
Time taken: 0.086 seconds, Fetched: 4 row(s)

wzhfy · 2016-12-30T08:25:17Z

@cloud-fan and I discussed complex-typed columns, and he suggested that we not support the DESC TABLE COLUMN command for fields in a complex column, because describing info.name gives no more information about the field than describing info.
If we don't plan to support describing fields in complex-typed columns, we can change describeColName to identifier.
We just want to know how other databases behave when describing fields in complex columns.

cloud-fan · 2016-12-30T08:48:56Z

@wzhfy I think you misunderstand @gatorsmile 's question. We should support complex column, e.g. struct type column, array type column. But we should not support nested column, e.g. DESC tbl col1.field1

wzhfy · 2016-12-30T09:13:31Z

@cloud-fan Sorry it was kind of typo, I meant fields in complex column, I forgot to type "fields in". I've updated the comment and now it's consistent with your suggestion.

gatorsmile · 2016-12-31T02:45:25Z

sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala

This might generate a confusing error message.

sql("describe formatted default.tab1.s").show(false)
org.apache.spark.sql.catalyst.parser.ParseException: DESC TABLE COLUMN for an inner column of a nested type is not supported(line 1, pos 0)

In this case, formatted is parsed as the table identifier. Should I postpone the detection of nested columns to the run() method of DescribeColumnCommand? Then the existence of the table identifier will be checked first.

Sure, you can try it.

gatorsmile · 2016-12-31T03:09:10Z

To get the column names and types (of either basic or complex types), we do not need DESC COLUMN. DESC TABLE is enough.

For retrieving the statistics, each vendor has a different way. Normally, users can access the statistics from the catalog tables/views or data dictionary views. AFAIK, no system other than Hive-like ones offers DESC COLUMN. Hive 2.x also has a different syntax from Hive 1.x. In this PR, we follow Hive 2.x.

Complex types can be achieved in an RDBMS through UDTs. For example, in Oracle, the logical mapping of a structured type is an abstract data type. Also, DB2 documents how to use the structured type in the link. To access a nested field, it uses double dots (e.g., col1..field1). : )

gatorsmile · 2016-12-31T04:37:18Z

After rethinking it, DESC EXTENDED/FORMATTED COLUMN discloses data patterns and statistics info. This information is pretty sensitive. Not all users should be allowed to access it.

We might face security-related complaints about this feature. Also cc @rxin @yhuai @hvanhovell

SparkQA · 2017-01-03T10:43:29Z

Test build #70808 has finished for PR 16422 at commit 4a68ed6.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

wzhfy · 2017-01-04T04:19:57Z

@gatorsmile Why is statistics info sensitive? Users can run sql queries to get each of them (max, min, ndv, etc) anyway.

gatorsmile · 2017-01-05T05:42:43Z

Column-level security can block users from accessing specific columns, but this command DESC EXTENDED/FORMATTED COLUMN might not be part of the design/solution.

gatorsmile · 2017-05-23T17:31:40Z

Maybe we can close it at first and then revisit it later?

wzhfy · 2017-05-24T01:09:57Z

OK, I'll close it for now

SparkQA · 2017-06-11T21:07:06Z

Test build #77897 has finished for PR 16422 at commit 5b6c289.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class DescribeColumnCommand(

cloud-fan · 2017-07-10T06:44:21Z

sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala

+    val catalog = sparkSession.sessionState.catalog
+    val resolver = sparkSession.sessionState.conf.resolver
+    val relation = sparkSession.table(table).queryExecution.analyzed
+    val field = {


nit:

val field = relation.resolve(colNameParts, resolver).getOrElse { ... }

cloud-fan · 2017-07-10T06:46:02Z

sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala

+
+    val catalogTable = catalog.getTempViewOrPermanentTableMetadata(table)
+    val colStats = catalogTable.stats.map(_.colStats).getOrElse(Map.empty)
+    val cs = colStats.get(field.name)


nit: val colStats = catalogTable.stats.flatMap(_.colStats.get(field.name))
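
As a side note on the nit above, the two lookup shapes can be compared in isolation. This is a minimal sketch, assuming hypothetical stand-in types (`ColStat`, `TableStats`) rather than Spark's real catalog classes; it only illustrates that a single `flatMap` over the optional statistics gives the same result as unwrapping to an empty map first.

```scala
object ColStatsLookup {
  // Hypothetical stand-ins for the catalog statistics types (not Spark's classes).
  final case class ColStat(distinctCount: Long, nullCount: Long)
  final case class TableStats(colStats: Map[String, ColStat])

  // Shape in the diff: unwrap to a (possibly empty) map, then look the column up.
  def viaMap(stats: Option[TableStats], col: String): Option[ColStat] =
    stats.map(_.colStats).getOrElse(Map.empty[String, ColStat]).get(col)

  // Shape suggested in the nit: one flatMap over the optional statistics.
  def viaFlatMap(stats: Option[TableStats], col: String): Option[ColStat] =
    stats.flatMap(_.colStats.get(col))
}
```

Both functions return `None` when the table has no statistics or the column has none, so the shorter form loses nothing.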

cloud-fan · 2017-07-10T06:48:57Z

sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala

+        comment.getOrElse("NULL"))
+    }
+
+    Seq(Row(formatColumnInfo(fieldValues)))


This does not match the schema: you are returning a row with a single string column. We should do

val row = if (isExtended) {
  Row(field.name, ...)
} else {
  Row(...)
}
Seq(row)

I tried this before. We would have two rows: the first row for the column info names (col_name...) and the second row for the values (c1 ...).
For this two-row result, the alignment is not good, so I changed to the current way.

Why do we need a row of column info names that duplicates the schema?

The output must NOT mismatch the schema, or sql("desc column...").select("max") will pass analysis but fail at runtime.

The schema info is not aligned with the actual values; the two-line format is more readable and follows Hive's style.

Thanks for pointing that out, I'll fix it after we decide the output format.

cloud-fan · 2017-07-10T06:51:29Z

sql/core/src/test/resources/sql-tests/results/describe-table-column.sql.out

+-- !query 2 schema
+struct<col_name:string,data_type:string,min:string,max:string,num_nulls:string,distinct_count:string,avg_col_len:string,max_col_len:string,comment:string>
+-- !query 2 output
+col_name 	data_type 	min  	max  	num_nulls 	distinct_count 	avg_col_len 	max_col_len 	comment        	


can you check with hive? I feel this output is not friendly to users. I'd like to see something like:
schema: <info: string, value: string>
output:

col_name abc
data_type int
max 3
....

I already checked with hive previously in this discussion. The output here is the same as in hive.

OK, then we need to decide whether we want to diverge from Hive here, cc @gatorsmile

I think Hive's style would be more readable only if it supported describing multiple columns. So I ran some tests, which showed that Hive doesn't support that:

hive> desc formatted src key, value;
FAILED: ParseException line 1:22 missing EOF at ',' near 'key'
hive> desc formatted src key value;
FAILED: ParseException line 1:23 extraneous input 'value' expecting EOF near '<EOF>'

Therefore, I think @cloud-fan 's proposed style is more readable.

gatorsmile · 2017-08-25T01:00:23Z

ping @wzhfy

wzhfy · 2017-08-28T23:59:31Z

@gatorsmile Will update this week.

gatorsmile · 2017-09-08T16:38:42Z

Welcome back. Will review it today.

SparkQA · 2017-09-08T17:00:09Z

Test build #81560 has finished for PR 16422 at commit 53e4b38.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

wzhfy · 2017-09-09T03:23:38Z

@gatorsmile @cloud-fan Sorry for the late update. I changed the output format as @cloud-fan previously suggested:

col_name   abc
data_type  int
max        3
....
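
To illustrate why this format keeps the output consistent with the schema, here is a minimal sketch that does not use Spark. The `ColumnInfo` type and `toRows` helper are hypothetical; the point is only that every row is a pair of strings, so a fixed two-string-column schema always matches, and missing values are rendered as "NULL" as in the test output discussed in this thread.

```scala
object DescColumnOutput {
  // Hypothetical description of one column; statistics are optional because
  // they may not have been collected for the table.
  final case class ColumnInfo(
      name: String,
      dataType: String,
      comment: Option[String],
      min: Option[String],
      max: Option[String])

  // One (name, value) pair per output line. Every row has exactly two string
  // fields, so it matches a fixed two-column schema regardless of which
  // statistics are present.
  def toRows(c: ColumnInfo): Seq[(String, String)] = Seq(
    "col_name" -> c.name,
    "data_type" -> c.dataType,
    "comment" -> c.comment.getOrElse("NULL"),
    "min" -> c.min.getOrElse("NULL"),
    "max" -> c.max.getOrElse("NULL")
  )
}
```

With this shape, adding another statistic means adding a row rather than changing the schema.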

gatorsmile · 2017-09-09T05:08:07Z

sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala

+      // If the field is not an attribute after `resolve`, then it's a nested field.
+      throw new AnalysisException(s"DESC TABLE COLUMN command is not supported for nested column:" +
+        s" ${UnresolvedAttribute(colNameParts).name}")
+    }


val colName = UnresolvedAttribute(colNameParts).name
val field = relation.resolve(colNameParts, resolver).getOrElse {
  throw new AnalysisException(s"Column $colName does not exist")
}
if (!field.isInstanceOf[Attribute]) {
  // If the field is not an attribute after `resolve`, then it's a nested field.
  throw new AnalysisException(
    s"DESC TABLE COLUMN command does not support nested data types: $colName")
}
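
The control flow of this suggestion (missing column first, then nested field, then success) can be sketched without Spark. Everything below is a hypothetical stand-in: the toy `resolve` takes a map from top-level column names to their struct field names and is not how Catalyst resolves attributes.

```scala
object NestedColumnCheck {
  sealed trait Resolved
  final case class TopLevel(name: String) extends Resolved
  final case class NestedField(parent: String, field: String) extends Resolved

  // Toy resolver (an assumption for illustration): a one-part name resolves to
  // a top-level column, a two-part name to a field inside a struct column.
  def resolve(schema: Map[String, Set[String]], parts: Seq[String]): Option[Resolved] =
    parts match {
      case Seq(col) if schema.contains(col) => Some(TopLevel(col))
      case Seq(col, field) if schema.get(col).exists(_.contains(field)) =>
        Some(NestedField(col, field))
      case _ => None
    }

  // Mirrors the order of checks in the suggestion: report a missing column
  // first, then reject nested fields, and only then describe the column.
  def describe(schema: Map[String, Set[String]], parts: Seq[String]): Either[String, String] = {
    val colName = parts.mkString(".")
    resolve(schema, parts) match {
      case None => Left(s"Column $colName does not exist")
      case Some(_: NestedField) =>
        Left(s"DESC TABLE COLUMN command does not support nested data types: $colName")
      case Some(TopLevel(name)) => Right(name)
    }
  }
}
```

Keeping the two failure cases separate is what gives distinct error messages for a nonexistent column and for a nested field.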

SparkQA · 2017-09-10T05:01:31Z

Test build #81598 has finished for PR 16422 at commit 85cc045.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-09-10T07:04:46Z

Test build #81599 has finished for PR 16422 at commit 0d49ee9.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

wzhfy · 2017-09-10T10:23:38Z

retest this please

SparkQA · 2017-09-10T13:06:33Z

Test build #81602 has finished for PR 16422 at commit 0d49ee9.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

wzhfy · 2017-09-12T02:37:15Z

@gatorsmile @cloud-fan Comments fixed. Do you have time to take another look?

gatorsmile · 2017-09-12T04:57:37Z

retest this please

SparkQA · 2017-09-12T07:04:45Z

Test build #81655 has finished for PR 16422 at commit 0d49ee9.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2017-09-12T07:11:16Z

retest this please

SparkQA · 2017-09-12T09:56:17Z

Test build #81664 has finished for PR 16422 at commit 0d49ee9.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2017-09-12T15:59:22Z

LGTM

gatorsmile · 2017-09-12T16:00:12Z

Thanks! Merged to master.

cloud-fan · 2017-09-13T00:15:16Z

sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala


+/**
+ * A command to list the info for a column, including name, data type, column stats and comment.
+ * This function creates a [[DescribeColumnCommand]] logical plan.


This comment line seems not needed.

There are two other similar comments (ShowPartitionsCommand, ShowColumnsCommand) in this file; shall I remove them all?

A follow-up PR to improve the comments has been sent: #19213

cloud-fan · 2017-09-13T05:07:00Z

sql/core/src/test/resources/sql-tests/inputs/describe-table-column.sql

+DESC desc_col_temp_table key1;
+
+-- Test persistent table
+CREATE TABLE desc_col_table (key int COMMENT 'column_comment') USING PARQUET;


shall we drop these testing tables at the end?

yes we should. I'll drop them in the followup pr.

cloud-fan · 2017-09-13T05:08:25Z

sql/core/src/test/resources/sql-tests/results/describe-table-column.sql.out

+data_type	int
+comment	column_comment
+min	NULL
+max	NULL


Why are min and max NULL?

because the table is empty

cloud-fan reviewed Dec 29, 2016

wzhfy force-pushed the descColumn branch from b59f7c1 to 30cb1ae Compare December 29, 2016 12:04

cloud-fan reviewed Dec 30, 2016

gatorsmile reviewed Dec 31, 2016

wzhfy closed this May 24, 2017

wzhfy reopened this Jun 11, 2017

wzhfy force-pushed the descColumn branch from 4a68ed6 to 5b6c289 Compare June 11, 2017 19:26

cloud-fan reviewed Jul 10, 2017

new output format

53e4b38

wzhfy force-pushed the descColumn branch from 4a66490 to 53e4b38 Compare September 8, 2017 14:14

gatorsmile reviewed Sep 9, 2017

fix comment

85cc045

fix test print

0d49ee9

asfgit closed this in 515910e Sep 12, 2017

cloud-fan reviewed Sep 13, 2017
