[SPARK-17642] [SQL] support DESC EXTENDED/FORMATTED table column commands #16422
|
Test build #70670 has finished for PR 16422 at commit
|
|
Test build #70691 has finished for PR 16422 at commit
|
shall we throw exception here if partition spec is given?
yes we should
please follow other commands and add more description.
can you add a link to the hive spec about this?
I got these names by running hive. I can't find any document about the names, but I'll add a link of the corresponding JIRA of Hive.
where do we use isExtended?
I will remove it since the result is same with or without isExtended.
why we create an ArrayBuffer? Doesn't it always return a single row?
yea I'll delete it.
I don't get it, so you get the attribute just for column comment and name and data type? I think CatalogTable.schema already have this information.
|
Test build #70719 has finished for PR 16422 at commit
|
the parser rule for the column name here:
describeColName
: identifier ('.' (identifier | STRING))*
;
can we just make it identifier? "a.b" should refer to a column named "a.b", or the inner field "b" from column "a"? let's check with other databases.
It seems mysql doesn't support struct or nested types? @gatorsmile Can you give some advice on this?
I assume we are following Hive syntax here? What is the behavior of Hive?
shall we call getTempViewOrPermanentTableMetadata?
ok, thanks!
|
What is the behavior of DESC COLUMN for the complex/nested type (map, struct, array)? Could you add a test case? Also copy and paste the result here? |
|
Hive running result: |
|
@cloud-fan and I discussed about complex typed column, and he suggested we don't support desc table column command for fields in complex column, because there's no more information of field |
|
@wzhfy I think you misunderstand @gatorsmile 's question. We should support complex column, e.g. struct type column, array type column. But we should not support nested column, e.g. |
|
@cloud-fan Sorry it was kind of typo, I meant fields in complex column, I forgot to type "fields in". I've updated the comment and now it's consistent with your suggestion. |
This might generate a confusing error message.
sql("describe formatted default.tab1.s").show(false)
org.apache.spark.sql.catalyst.parser.ParseException:
DESC TABLE COLUMN for an inner column of a nested type is not supported(line 1, pos 0)
In this case, formatted becomes the table identifier. Should I postpone detection of nested columns to the run() method of DescColumnCommand? Then the existence of the table identifier will be checked first.
Sure, you can try it.
|
To get the column names and types (of either basic or complex types), we do not need For retrieving the statistics, each vendor has different ways. Normally, users can access the statistics from the catalog tables/views or data dictionary views. AFAIK, I do not know any system offers The complex types can be achieved in RDBMS by UDT. For example, in Oracle, the logical mapping of structured type is abstract data types. Also, DB2 documents how to use the structured type in the link. To access the nested field, it is using double dots (e.g., |
|
After rethinking about it, We might face the security-related complaints about this feature. Also cc @rxin @yhuai @hvanhovell |
|
Test build #70808 has finished for PR 16422 at commit
|
|
@gatorsmile Why is statistics info sensitive? Users can run sql queries to get each of them (max, min, ndv, etc) anyway. |
|
Column-level security can block users to access the specific columns, but this command |
|
Maybe we can close it at first and then revisit it later? |
|
OK, I'll close it for now |
|
Test build #77897 has finished for PR 16422 at commit
|
| val catalog = sparkSession.sessionState.catalog | ||
| val resolver = sparkSession.sessionState.conf.resolver | ||
| val relation = sparkSession.table(table).queryExecution.analyzed | ||
| val field = { |
nit:
val field = relation.resolve(colNameParts, resolver).getOrElse {
...
}
|
|
||
| val catalogTable = catalog.getTempViewOrPermanentTableMetadata(table) | ||
| val colStats = catalogTable.stats.map(_.colStats).getOrElse(Map.empty) | ||
| val cs = colStats.get(field.name) |
nit: val colStats = catalogTable.stats.flatMap(_.colStats.get(field.name))
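The nit above can be sketched in plain Scala. The `ColumnStat` and `Statistics` case classes below are hypothetical stand-ins for Spark's catalog types (only the `Option`/`Map` shape is taken from the diff), so this is an illustration of the equivalence rather than the PR's actual code:

```scala
// Hypothetical stand-ins for Spark's catalog types; only the Option/Map shape matters here.
object ColStatsLookupSketch {
  final case class ColumnStat(distinctCount: Long, nullCount: Long)
  final case class Statistics(colStats: Map[String, ColumnStat])

  // Form used in the diff: unwrap to a (possibly empty) map, then look the column up.
  def lookupViaMap(stats: Option[Statistics], col: String): Option[ColumnStat] = {
    val colStats = stats.map(_.colStats).getOrElse(Map.empty[String, ColumnStat])
    colStats.get(col)
  }

  // Form suggested in the review: a single flatMap over the optional table statistics.
  def lookupViaFlatMap(stats: Option[Statistics], col: String): Option[ColumnStat] =
    stats.flatMap(_.colStats.get(col))
}
```

Both forms return `None` when the table has no statistics or when the column has none; the `flatMap` version just avoids building the intermediate empty map.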
| comment.getOrElse("NULL")) | ||
| } | ||
|
|
||
| Seq(Row(formatColumnInfo(fieldValues))) |
This does not match the schema: you are returning a row with a single string column. We should do
val row = if (isExtended) {
  Row(
    field.name,
    ...)
} else {
  Row(...)
}
Seq(row)
I tried this before. We would have two rows, first row for column info names (col_name...) and second row for values (c1 ...).
For this two-row result, the alignment is not good. So I changed to the current way.
- why do we need a row with the column info names, which duplicates the schema?
- We can NOT make the output mismatch the schema, or sql("desc column...").select("max") will pass analysis but fail at runtime.
- The schema info is not aligned with the actual value, two lines format is more readable and following hive's style.
- Thanks for pointing that out, I'll fix it after we decide the output format.
| -- !query 2 schema | ||
| struct<col_name:string,data_type:string,min:string,max:string,num_nulls:string,distinct_count:string,avg_col_len:string,max_col_len:string,comment:string> | ||
| -- !query 2 output | ||
| col_name data_type min max num_nulls distinct_count avg_col_len max_col_len comment |
can you check with hive? I feel this output is not friendly to users. I'd like to see something like:
schema: <info: string, value: string>
output:
col_name abc
data_type int
max 3
....
I already checked with hive previously in this discussion. The output here is the same as in hive.
ok then we need to decide if we wanna diverge with hive here, cc @gatorsmile
I think Hive's style would be more readable only if it supported describing multiple columns. So I ran some tests, which showed that Hive doesn't support that:
hive> desc formatted src key, value;
FAILED: ParseException line 1:22 missing EOF at ',' near 'key'
hive> desc formatted src key value;
FAILED: ParseException line 1:23 extraneous input 'value' expecting EOF near '<EOF>'
Therefore, I think @cloud-fan 's proposed style is more readable.
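A minimal sketch of the key/value layout discussed above, written in plain Scala without Spark. The `ColumnInfo` case class and `toRows` helper are hypothetical names introduced only for illustration; the `info`/`value` pairing and the `"NULL"` placeholder follow the thread, and the subset of fields shown is an assumption:

```scala
// Hypothetical model of the proposed output: one (info, value) pair per row,
// so every row matches a two-string-column schema <info: string, value: string>.
object DescColumnOutputSketch {
  final case class ColumnInfo(
      name: String,
      dataType: String,
      comment: Option[String],
      min: Option[String],
      max: Option[String])

  // Missing comment or statistics are rendered as the string "NULL".
  def toRows(c: ColumnInfo): Seq[(String, String)] = Seq(
    "col_name" -> c.name,
    "data_type" -> c.dataType,
    "comment" -> c.comment.getOrElse("NULL"),
    "min" -> c.min.getOrElse("NULL"),
    "max" -> c.max.getOrElse("NULL"))
}
```

Because each row has the same two string columns, selecting either column of the result cannot pass analysis and then fail at runtime, which was the concern with the single-string row.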
|
ping @wzhfy |
|
@gatorsmile Will update in this week. |
|
Welcome back. Will review it today. |
|
Test build #81560 has finished for PR 16422 at commit
|
|
@gatorsmile @cloud-fan Sorry for the late update. I changed the output form as @cloud-fan previously suggested: |
| // If the field is not an attribute after `resolve`, then it's a nested field. | ||
| throw new AnalysisException(s"DESC TABLE COLUMN command is not supported for nested column:" + | ||
| s" ${UnresolvedAttribute(colNameParts).name}") | ||
| } |
val colName = UnresolvedAttribute(colNameParts).name
val field = relation.resolve(colNameParts, resolver).getOrElse {
  throw new AnalysisException(s"Column $colName does not exist")
}
if (!field.isInstanceOf[Attribute]) {
  // If the field is not an attribute after `resolve`, then it's a nested field.
  throw new AnalysisException(
    s"DESC TABLE COLUMN command does not support nested data types: $colName")
}
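The check suggested above can be illustrated with a self-contained sketch. The schema model (`ColType`, `StructType`, `exampleSchema`) and the `describeTarget` function are hypothetical and much simpler than Spark's analyzer; they only mirror the three outcomes from the thread: a top-level column (including a struct-typed one) is accepted, a field inside a struct is rejected as nested, and anything else does not exist.

```scala
object NestedColumnCheckSketch {
  // Hypothetical, simplified schema model: a column is atomic or a struct of named fields.
  sealed trait ColType
  case object IntType extends ColType
  final case class StructType(fields: Map[String, ColType]) extends ColType

  val exampleSchema: Map[String, ColType] =
    Map("key" -> IntType, "s" -> StructType(Map("a" -> IntType)))

  // True when `path` names an existing (possibly deeply nested) field under type `t`.
  private def fieldExists(t: ColType, path: List[String]): Boolean = path match {
    case Nil => true
    case f :: rest => t match {
      case StructType(fields) if fields.contains(f) => fieldExists(fields(f), rest)
      case _ => false
    }
  }

  // Right(column) when the name is a top-level column; Left(message) otherwise.
  def describeTarget(schema: Map[String, ColType], nameParts: Seq[String]): Either[String, String] = {
    val colName = nameParts.mkString(".")
    nameParts.toList match {
      case top :: rest if schema.contains(top) =>
        if (rest.isEmpty) Right(top)
        else if (fieldExists(schema(top), rest))
          Left(s"DESC TABLE COLUMN command does not support nested data types: $colName")
        else Left(s"Column $colName does not exist")
      case _ => Left(s"Column $colName does not exist")
    }
  }
}
```

Returning `Either` instead of throwing keeps the sketch easy to test; the PR itself throws `AnalysisException` in both failure cases.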
Test build #81598 has finished for PR 16422 at commit
|
|
Test build #81599 has finished for PR 16422 at commit
|
|
retest this please |
|
Test build #81602 has finished for PR 16422 at commit
|
|
@gatorsmile @cloud-fan Comments fixed. Do you have time to take another look? |
|
retest this please |
|
Test build #81655 has finished for PR 16422 at commit
|
|
retest this please |
|
Test build #81664 has finished for PR 16422 at commit
|
|
LGTM |
|
Thanks! Merged to master. |
|
|
||
| /** | ||
| * A command to list the info for a column, including name, data type, column stats and comment. | ||
| * This function creates a [[DescribeColumnCommand]] logical plan. |
This comment line seems not needed.
There are other two similar comments (ShowPartitionsCommand, ShowColumnsCommand) in this file, shall I remove them all?
A followup PR to improve the comments is sent: #19213
| DESC desc_col_temp_table key1; | ||
|
|
||
| -- Test persistent table | ||
| CREATE TABLE desc_col_table (key int COMMENT 'column_comment') USING PARQUET; |
shall we drop these testing tables at the end?
yes we should. I'll drop them in the followup pr.
| data_type int | ||
| comment column_comment | ||
| min NULL | ||
| max NULL |
why min max is NULL?
because the table is empty
What changes were proposed in this pull request?
Support DESC (EXTENDED | FORMATTED) ? TABLE COLUMN command.
Support DESC EXTENDED | FORMATTED TABLE COLUMN command to show column-level statistics.
Do NOT support describe nested columns.
How was this patch tested?
Added test cases.