Skip to content

Conversation

@wzhfy
Copy link
Contributor

@wzhfy wzhfy commented Dec 28, 2016

What changes were proposed in this pull request?

Support DESC (EXTENDED | FORMATTED) ? TABLE COLUMN command.
Support DESC EXTENDED | FORMATTED TABLE COLUMN command to show column-level statistics.
Do NOT support describe nested columns.

How was this patch tested?

Added test cases.

@wzhfy
Copy link
Contributor Author

wzhfy commented Dec 28, 2016

cc @cloud-fan @gatorsmile

@SparkQA
Copy link

SparkQA commented Dec 28, 2016

Test build #70670 has finished for PR 16422 at commit 3058ab1.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class DescribeColumnCommand(

@SparkQA
Copy link

SparkQA commented Dec 29, 2016

Test build #70691 has finished for PR 16422 at commit d41a9cd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shall we throw exception here if partition spec is given?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes we should

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please follow other commands and add more description.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a link to the hive spec about this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I got these names by running hive. I can't find any document about the names, but I'll add a link of the corresponding JIRA of Hive.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where do we use isExtended?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will remove it since the result is same with or without isExtended.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why we create an ArrayBuffer? Doesn't it always return a single row?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea I'll delete it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't get it, so you get the attribute just for column comment and name and data type? I think CatalogTable.schema already have this information.

@SparkQA
Copy link

SparkQA commented Dec 29, 2016

Test build #70719 has finished for PR 16422 at commit 30cb1ae.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the parser rule for the column name here:

describeColName
    : identifier ('.' (identifier | STRING))*
    ;

can we just make it identifier? "a.b" should refer to a column named "a.b", or the inner field "b" from column "a"? let's check with other databases.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems mysql doesn't support struct or nested types? @gatorsmile Can you give some advice on this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume we are following Hive syntax here? What is the behavior of Hive?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shall we call getTempViewOrPermanentTableMetadata?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, thanks!

@gatorsmile
Copy link
Member

gatorsmile commented Dec 30, 2016

What is the behavior of DESC COLUMN for the complex/nested type (map, struct, array)? Could you add a test case? Also copy and paste the result here?

@wzhfy
Copy link
Contributor Author

wzhfy commented Dec 30, 2016

Hive running result:

hive> desc student_test;
OK
id                  	int                 	                    
info                	struct<name:string,age:int>	                    
Time taken: 1.643 seconds, Fetched: 2 row(s)
hive> desc student_test info;
OK
name                	string              	from deserializer   
age                 	int                 	from deserializer   
Time taken: 0.062 seconds, Fetched: 2 row(s)
hive> desc student_test info.name;
OK
name                	string              	from deserializer   
Time taken: 0.061 seconds, Fetched: 1 row(s)
hive> desc formatted student_test info.name;
OK
# col_name            	data_type           	min                 	max                 	num_nulls           	distinct_count      	avg_col_len         	max_col_len         	num_trues           	num_falses          	comment             
	 	 	 	 	 	 	 	 	 	 
name                	string              	                    	                    	                    	                    	                    	                    	                    	                    	from deserializer   
Time taken: 0.252 seconds, Fetched: 3 row(s)
hive> desc formatted student_test info;
OK
# col_name            	data_type           	min                 	max                 	num_nulls           	distinct_count      	avg_col_len         	max_col_len         	num_trues           	num_falses          	comment             
	 	 	 	 	 	 	 	 	 	 
name                	string              	                    	                    	                    	                    	                    	                    	                    	                    	from deserializer   
age                 	int                 	                    	                    	                    	                    	                    	                    	                    	                    	from deserializer   
Time taken: 0.086 seconds, Fetched: 4 row(s)

@wzhfy
Copy link
Contributor Author

wzhfy commented Dec 30, 2016

@cloud-fan and I discussed about complex typed column, and he suggested we don't support desc table column command for fields in complex column, because there's no more information of field name when describing info.name than describing info.
And if we don't plan to support describing fields in complex typed column, we can change describeColName to identifier.
We just want to know what's the behavior of other databases on describing fields in complex columns.

@cloud-fan
Copy link
Contributor

@wzhfy I think you misunderstand @gatorsmile 's question. We should support complex column, e.g. struct type column, array type column. But we should not support nested column, e.g. DESC tbl col1.field1

@wzhfy
Copy link
Contributor Author

wzhfy commented Dec 30, 2016

@cloud-fan Sorry it was kind of typo, I meant fields in complex column, I forgot to type "fields in". I've updated the comment and now it's consistent with your suggestion.

Copy link
Member

@gatorsmile gatorsmile Dec 31, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might generate a confusing error message.

sql("describe formatted default.tab1.s").show(false)
org.apache.spark.sql.catalyst.parser.ParseException:
DESC TABLE COLUMN for an inner column of a nested type is not supported(line 1, pos 0)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case, formatted becomes table identifier. Should I postpone detection of nested column to run() method of DescColumnCommand? Then the existence of table idenfifier will be checked first.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, you can try it.

@gatorsmile
Copy link
Member

gatorsmile commented Dec 31, 2016

To get the column names and types (of either basic or complex types), we do not need DESC COLUMN. DESC TABLE is enough.

For retrieving the statistics, each vendor has different ways. Normally, users can access the statistics from the catalog tables/views or data dictionary views. AFAIK, I do not know any system offers DESC COLUMN, except the Hive-like system. Hive 2.x also has a different syntax from Hive 1.x. In this PR, we follow Hive 2.x.

The complex types can be achieved in RDBMS by UDT. For example, in Oracle, the logical mapping of structured type is abstract data types. Also, DB2 documents how to use the structured type in the link. To access the nested field, it is using double dots (e.g., col1..field1). : )

@gatorsmile
Copy link
Member

gatorsmile commented Dec 31, 2016

After rethinking about it, DESC EXTENDED/FORMATTED COLUMN discloses the data patterns/statistics info. These info are pretty sensitive. Not all the users should be allowed to access it.

We might face the security-related complaints about this feature. Also cc @rxin @yhuai @hvanhovell

@SparkQA
Copy link

SparkQA commented Jan 3, 2017

Test build #70808 has finished for PR 16422 at commit 4a68ed6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wzhfy
Copy link
Contributor Author

wzhfy commented Jan 4, 2017

@gatorsmile Why is statistics info sensitive? Users can run sql queries to get each of them (max, min, ndv, etc) anyway.

@gatorsmile
Copy link
Member

Column-level security can block users to access the specific columns, but this command DESC EXTENDED/FORMATTED COLUMN might not be part of the design/solution.

@gatorsmile
Copy link
Member

Maybe we can close it at first and then revisit it later?

@wzhfy
Copy link
Contributor Author

wzhfy commented May 24, 2017

OK, I'll close it for now

@wzhfy wzhfy closed this May 24, 2017
@wzhfy wzhfy reopened this Jun 11, 2017
@SparkQA
Copy link

SparkQA commented Jun 11, 2017

Test build #77897 has finished for PR 16422 at commit 5b6c289.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class DescribeColumnCommand(

val catalog = sparkSession.sessionState.catalog
val resolver = sparkSession.sessionState.conf.resolver
val relation = sparkSession.table(table).queryExecution.analyzed
val field = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

val field = relation.resolve(colNameParts, resolver).getOrElse {
  ...
}


val catalogTable = catalog.getTempViewOrPermanentTableMetadata(table)
val colStats = catalogTable.stats.map(_.colStats).getOrElse(Map.empty)
val cs = colStats.get(field.name)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: val colStats = catalogTable.stats.flatMap(_.colStats.get(field.name))

comment.getOrElse("NULL"))
}

Seq(Row(formatColumnInfo(fieldValues)))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does not match the schema, you are returning a row with single string column. We should do

val row = if (isExtended) {
  Row(
    field.name,
    ...
} else {
   Row(...)
}
Seq(Row)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried this before. We would have two rows, first row for column info names (col_name...) and second row for values (c1 ...).
For this two-row result, the alignment is not good. So I changed to the current way.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. why we need a row with column info name which is duplicated with schema?
  2. We can NOT make the output mismatch with schema, or sql("desc column...").select("max") will pass analysis but fail at runtime.

Copy link
Contributor Author

@wzhfy wzhfy Jul 10, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. The schema info is not aligned with the actual value, two lines format is more readable and following hive's style.
  2. Thanks for pointing that out, I'll fix it after we decide the output format.

-- !query 2 schema
struct<col_name:string,data_type:string,min:string,max:string,num_nulls:string,distinct_count:string,avg_col_len:string,max_col_len:string,comment:string>
-- !query 2 output
col_name data_type min max num_nulls distinct_count avg_col_len max_col_len comment
Copy link
Contributor

@cloud-fan cloud-fan Jul 10, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you check with hive? I feel this output is not friendly to users. I'd like to see something like:
schema: <info: string, value: string>
output:

col_name   abc
data_type  int
max        3
....

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I already checked with hive previously in this discussion. The output here is the same as in hive.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok then we need to decide if we wanna diverge with hive here, cc @gatorsmile

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think Hive's style would have better readability only if it supports describe multiple columns. So I did some tests, which showed hive doesn't support that:

hive> desc formatted src key, value;
FAILED: ParseException line 1:22 missing EOF at ',' near 'key'
hive> desc formatted src key value;
FAILED: ParseException line 1:23 extraneous input 'value' expecting EOF near '<EOF>'

Therefore, I think @cloud-fan 's proposed style is more readable.

@gatorsmile
Copy link
Member

ping @wzhfy

@wzhfy
Copy link
Contributor Author

wzhfy commented Aug 28, 2017

@gatorsmile Will update in this week.

@gatorsmile
Copy link
Member

Welcome back. Will review it today.

@SparkQA
Copy link

SparkQA commented Sep 8, 2017

Test build #81560 has finished for PR 16422 at commit 53e4b38.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wzhfy
Copy link
Contributor Author

wzhfy commented Sep 9, 2017

@gatorsmile @cloud-fan Sorry for the late update. I changed the output form as @cloud-fan previously suggested:

col_name   abc
data_type  int
max        3
....

// If the field is not an attribute after `resolve`, then it's a nested field.
throw new AnalysisException(s"DESC TABLE COLUMN command is not supported for nested column:" +
s" ${UnresolvedAttribute(colNameParts).name}")
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

    val colName = UnresolvedAttribute(colNameParts).name
    val field = relation.resolve(colNameParts, resolver).getOrElse {
      throw new AnalysisException(s"Column $colName does not exist")
    }
    if (!field.isInstanceOf[Attribute]) {
      // If the field is not an attribute after `resolve`, then it's a nested field.
      throw new AnalysisException(
        s"DESC TABLE COLUMN command does not supported nested data types: $colName")
    }

@SparkQA
Copy link

SparkQA commented Sep 10, 2017

Test build #81598 has finished for PR 16422 at commit 85cc045.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Sep 10, 2017

Test build #81599 has finished for PR 16422 at commit 0d49ee9.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wzhfy
Copy link
Contributor Author

wzhfy commented Sep 10, 2017

retest this please

@SparkQA
Copy link

SparkQA commented Sep 10, 2017

Test build #81602 has finished for PR 16422 at commit 0d49ee9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wzhfy
Copy link
Contributor Author

wzhfy commented Sep 12, 2017

@gatorsmile @cloud-fan Comments fixed. Do you have time to take another look?

@gatorsmile
Copy link
Member

retest this please

@SparkQA
Copy link

SparkQA commented Sep 12, 2017

Test build #81655 has finished for PR 16422 at commit 0d49ee9.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Copy link
Member

retest this please

@SparkQA
Copy link

SparkQA commented Sep 12, 2017

Test build #81664 has finished for PR 16422 at commit 0d49ee9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Copy link
Member

LGTM

@gatorsmile
Copy link
Member

Thanks! Merged to master.

@asfgit asfgit closed this in 515910e Sep 12, 2017

/**
* A command to list the info for a column, including name, data type, column stats and comment.
* This function creates a [[DescribeColumnCommand]] logical plan.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment line seems not needed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are other two similar comments (ShowPartitionsCommand, ShowColumnsCommand) in this file, shall I remove them all?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A followup PR to improve the comments is sent: #19213

DESC desc_col_temp_table key1;

-- Test persistent table
CREATE TABLE desc_col_table (key int COMMENT 'column_comment') USING PARQUET;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shall we drop these testing tables at the end?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes we should. I'll drop them in the followup pr.

data_type int
comment column_comment
min NULL
max NULL
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why min max is NULL?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

because the table is empty

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants