-
Notifications
You must be signed in to change notification settings - Fork 29k
[WIP][SPARK-37888][SQL][TESTS] Unify v1 and v2 DESCRIBE TABLE tests #35265
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| Row("data", "string", ""), | ||
| Row("id", "bigint", ""), | ||
| Row("", "", ""), | ||
| Row("# Partitioning", "", ""), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cloud-fan this is a WIP PR, but as you can see there are quite a bit of differences between v1 and v2 command. Do you want me to unify the behavior in this PR? If so, should we update v1 to match v2 or the other way around? TIA!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yea I know it's going to be hard.
The most difficult part is partitioning. V2 supports transform instead of partition column names. I'd suggest specially handling tables with only partition columns, to not surprise the majority of users.
a string
b int
# Partition Columns
b int
Once there exists transforms, we use the v2 format
a string
b int
c timestamp
# Partitioning
identity(b)
year(c)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For Detailed Table Information, we need to add more information in the v2 command to follow v1, such as Created Time. We can match V1Table directly first, and then add more built-in v2 table properties later.
@Peng-Lei is EXTERAL a reserved keyword now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeh. PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The most difficult part is partitioning. V2 supports
transforminstead of partition column names. I'd suggest specially handling tables with only partition columns, to not surprise the majority of users.
Can I assume that only the IdentityTransforms are the partition columns?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, only IdentityTransforms represents partition columns
| case DescribeRelation( | ||
| ResolvedV1TableOrViewIdentifier(ident), partitionSpec, isExtended, output) => | ||
| ResolvedV1TableOrViewIdentifier(ident), partitionSpec, isExtended, output) | ||
| if conf.useV1Command => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should still use v1 command for views.
| append(buffer, s"# ${output.head.name}", output(1).name, output(2).name) | ||
| } | ||
| schema.foreach { column => | ||
| schema.sortBy(_.name).foreach { column => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, this is weird. The column order matters and we should retain it.
| rows += toCatalystRow(s"# ${output(0).name}", output(1).name, output(2).name) | ||
| rows ++= table.partitioning.sortBy(_.describe).map { | ||
| case t => | ||
| val field = nameToField(t.describe) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should cast v2 partitioning to IdentityTransform and get the column names
| Row("# col_name", "data_type", "comment"), | ||
| Row("id", "bigint", null), | ||
| Row("", "", ""), | ||
| Row("# Metadata Columns", "", ""), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it the only difference?
| Row("Location", "file:/tmp/table_name", ""), | ||
| Row("Provider", "parquet", ""), | ||
| Row("Owner", "", ""), | ||
| Row("External", "true", ""), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems better to follow the v1 command and print Type: EXTERNAL/MANAGED
| Row("Last Access", "UNKNOWN", ""), | ||
| Row("Created By", s"Spark ${org.apache.spark.SPARK_VERSION}", ""), | ||
| Row("Type", "EXTERNAL", ""), | ||
| Row("Provider", "parquet", ""), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| Row("Provider", "parquet", ""), | |
| Row("Provider", defaultUsing.stripLeft("USING").trim, ""), |
| class DescribeTableSuite extends command.DescribeTableSuiteBase with CommandSuiteBase { | ||
| override def namespace: String = "ns1.ns2" | ||
|
|
||
| test("DESCRIBE TABLE with non-'partitioned-by' clause") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| test("DESCRIBE TABLE with non-'partitioned-by' clause") { | |
| test("DESCRIBE TABLE with bucketed table") { |
| } | ||
|
|
||
|
|
||
| test("DESCRIBE TABLE with v2 catalog when table does not exist.") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shall we move it to the base suite?
| } | ||
| } | ||
|
|
||
| test("DESCRIBE TABLE EXTENDED using v2 catalog") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| } | ||
| } | ||
|
|
||
| test("SPARK-34561: drop/add columns to a dataset of `DESCRIBE TABLE`") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we move it to the base suite?
| Row("id", "bigint", null), | ||
| Row("", "", ""), | ||
| Row("# Partitioning", "", ""), | ||
| Row("Part 0", "bucket(3, id)", ""))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In fact, partition transforms are implicitly named. e.g. people can do TRUNCATE TABLE t PARTITION (a=1). We can get the partition names if the table implements SupportsPartitionManagement.
Now I have a new proposal to display partitionings, which is more consistent between v1 and v2
# Partition Information
# col_name data_type comment
c string
d string
# Bucketing Information
# bucket_cols sort_cols num_buckets
a, b c, d 4
If there exists partition transforms (which means we can't be v1 compatible anymore)
# Partition Information
# col_name transform
x year(ts)
y bucket(3, id)
z identity(d)
If the partition col names are unknown (the table doesn't implement SupportsPartitionManagement), we can use unknown_name_0, unknown_name_1, etc.
### What changes were proposed in this pull request? In the PR, I propose to change output of v2 `DESCRIBE TABLE`, and make it the same as v1. In particular: 1. Return NULL instead of empty strings when any comment doesn't exist in the schema info. 2. When a v2 table has identity transformations (partition by) only, output the partitioning info in v1 style. For instance: ```sql > CREATE TABLE tbl (id bigint, data string) USING _; > DESCRIBE TABLE tbl; +-----------------------+---------+-------+ |col_name |data_type|comment| +-----------------------+---------+-------+ |id |bigint |null | |data |string |null | |# Partition Information| | | |# col_name |data_type|comment| |id |bigint |null | +-----------------------+---------+-------+ ``` Also the PR moves/adds some tests to the base traits: - "DESCRIBE TABLE of a non-partitioned table" - "DESCRIBE TABLE of a partitioned table" and addresses review comments in #35265. ### Why are the changes needed? The changes unify outputs of v1 and v2 implementations, and make the migration process from the version v1 to v2 easier for Spark SQL users. ### Does this PR introduce _any_ user-facing change? Yes, it changes outputs of v2 `DESCRIBE TABLE`. ### How was this patch tested? By running the `DESCRIBE TABLE` test suites: ``` $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *DescribeTableSuite" ``` and related test suites: ``` $ build/sbt "test:testOnly *DataSourceV2SQLSuite" ``` Closes #36946 from MaxGekk/unify-v1-v2-describe-table. Authored-by: Max Gekk <[email protected]> Signed-off-by: Max Gekk <[email protected]>
### What changes were proposed in this pull request? In the PR, I propose to change output of v2 `DESCRIBE TABLE`, and make it the same as v1. In particular: 1. Return NULL instead of empty strings when any comment doesn't exist in the schema info. 2. When a v2 table has identity transformations (partition by) only, output the partitioning info in v1 style. For instance: ```sql > CREATE TABLE tbl (id bigint, data string) USING _; > DESCRIBE TABLE tbl; +-----------------------+---------+-------+ |col_name |data_type|comment| +-----------------------+---------+-------+ |id |bigint |null | |data |string |null | |# Partition Information| | | |# col_name |data_type|comment| |id |bigint |null | +-----------------------+---------+-------+ ``` Also the PR moves/adds some tests to the base traits: - "DESCRIBE TABLE of a non-partitioned table" - "DESCRIBE TABLE of a partitioned table" and addresses review comments in apache/spark#35265. ### Why are the changes needed? The changes unify outputs of v1 and v2 implementations, and make the migration process from the version v1 to v2 easier for Spark SQL users. ### Does this PR introduce _any_ user-facing change? Yes, it changes outputs of v2 `DESCRIBE TABLE`. ### How was this patch tested? By running the `DESCRIBE TABLE` test suites: ``` $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *DescribeTableSuite" ``` and related test suites: ``` $ build/sbt "test:testOnly *DataSourceV2SQLSuite" ``` Closes #36946 from MaxGekk/unify-v1-v2-describe-table. Authored-by: Max Gekk <[email protected]> Signed-off-by: Max Gekk <[email protected]>
|
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. |
### What changes were proposed in this pull request? In the PR, I propose to change output of v2 `DESCRIBE TABLE`, and make it the same as v1. In particular: 1. Return NULL instead of empty strings when any comment doesn't exist in the schema info. 2. When a v2 table has identity transformations (partition by) only, output the partitioning info in v1 style. For instance: ```sql > CREATE TABLE tbl (id bigint, data string) USING _; > DESCRIBE TABLE tbl; +-----------------------+---------+-------+ |col_name |data_type|comment| +-----------------------+---------+-------+ |id |bigint |null | |data |string |null | |# Partition Information| | | |# col_name |data_type|comment| |id |bigint |null | +-----------------------+---------+-------+ ``` Also the PR moves/adds some tests to the base traits: - "DESCRIBE TABLE of a non-partitioned table" - "DESCRIBE TABLE of a partitioned table" and addresses review comments in apache/spark#35265. ### Why are the changes needed? The changes unify outputs of v1 and v2 implementations, and make the migration process from the version v1 to v2 easier for Spark SQL users. ### Does this PR introduce _any_ user-facing change? Yes, it changes outputs of v2 `DESCRIBE TABLE`. ### How was this patch tested? By running the `DESCRIBE TABLE` test suites: ``` $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *DescribeTableSuite" ``` and related test suites: ``` $ build/sbt "test:testOnly *DataSourceV2SQLSuite" ``` Closes #36946 from MaxGekk/unify-v1-v2-describe-table. Authored-by: Max Gekk <[email protected]> Signed-off-by: Max Gekk <[email protected]>
### What changes were proposed in this pull request? In the PR, I propose to change output of v2 `DESCRIBE TABLE`, and make it the same as v1. In particular: 1. Return NULL instead of empty strings when any comment doesn't exist in the schema info. 2. When a v2 table has identity transformations (partition by) only, output the partitioning info in v1 style. For instance: ```sql > CREATE TABLE tbl (id bigint, data string) USING _; > DESCRIBE TABLE tbl; +-----------------------+---------+-------+ |col_name |data_type|comment| +-----------------------+---------+-------+ |id |bigint |null | |data |string |null | |# Partition Information| | | |# col_name |data_type|comment| |id |bigint |null | +-----------------------+---------+-------+ ``` Also the PR moves/adds some tests to the base traits: - "DESCRIBE TABLE of a non-partitioned table" - "DESCRIBE TABLE of a partitioned table" and addresses review comments in apache/spark#35265. ### Why are the changes needed? The changes unify outputs of v1 and v2 implementations, and make the migration process from the version v1 to v2 easier for Spark SQL users. ### Does this PR introduce _any_ user-facing change? Yes, it changes outputs of v2 `DESCRIBE TABLE`. ### How was this patch tested? By running the `DESCRIBE TABLE` test suites: ``` $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *DescribeTableSuite" ``` and related test suites: ``` $ build/sbt "test:testOnly *DataSourceV2SQLSuite" ``` Closes #36946 from MaxGekk/unify-v1-v2-describe-table. Authored-by: Max Gekk <[email protected]> Signed-off-by: Max Gekk <[email protected]>
What changes were proposed in this pull request?
DESCRIBE TABLEparsing tests toDescribeRelationParserSuite.DESCRIBE TABLEtests into one traitorg.apache.spark.sql.execution.command.DescribeTableSuiteBase, and put datasource specific tests to thev1.DescribeTableSuiteandv2.DescribeTableSuite.The changes follow the approach of #30287.
Why are the changes needed?
DESCRIBE TABLEtests for both DSv1/Hive DSv1 and DSv2Does this PR introduce any user-facing change?
No
How was this patch tested?
Existing unit tests and new tests.