
Conversation

@imback82 (Contributor) commented Jan 21, 2022

What changes were proposed in this pull request?

  1. Move DESCRIBE TABLE parsing tests to DescribeRelationParserSuite.
  2. Put common DESCRIBE TABLE tests into one trait, org.apache.spark.sql.execution.command.DescribeTableSuiteBase, and put datasource-specific tests into v1.DescribeTableSuite and v2.DescribeTableSuite.

The changes follow the approach of #30287.

Why are the changes needed?

  1. The unification allows running common DESCRIBE TABLE tests against DSv1, Hive DSv1, and DSv2.
  2. We can detect missing features and differences between DSv1 and DSv2 implementations.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Existing unit tests and new tests.

@github-actions github-actions bot added the SQL label Jan 21, 2022
@imback82 imback82 changed the title [SPARK-37888][SQL][TESTS] Unify v1 and v2 CREATE NAMESPACE tests [SPARK-37888][SQL][TESTS] Unify v1 and v2 DESCRIBE TABLE tests Jan 21, 2022
@imback82 imback82 changed the title [SPARK-37888][SQL][TESTS] Unify v1 and v2 DESCRIBE TABLE tests [WIP][SPARK-37888][SQL][TESTS] Unify v1 and v2 DESCRIBE TABLE tests Jan 21, 2022
Row("data", "string", ""),
Row("id", "bigint", ""),
Row("", "", ""),
Row("# Partitioning", "", ""),
Contributor Author

@cloud-fan this is a WIP PR, but as you can see there are quite a few differences between the v1 and v2 commands. Do you want me to unify the behavior in this PR? If so, should we update v1 to match v2, or the other way around? TIA!

@cloud-fan (Contributor) commented Jan 21, 2022

Yeah, I know it's going to be hard.

The most difficult part is partitioning. V2 supports transforms instead of plain partition column names. I'd suggest handling tables partitioned only by plain columns specially, to not surprise the majority of users:

a                     string
b                     int
# Partition Columns
b                     int

Once there are (non-identity) transforms, we use the v2 format:

a                     string
b                     int
c                     timestamp
# Partitioning
identity(b)
year(c)
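The v1-vs-v2 format choice described above could be sketched as follows (a hypothetical helper, assuming Spark-internal access to IdentityTransform; not code from this PR):

```scala
import org.apache.spark.sql.connector.expressions.{IdentityTransform, Transform}

// Hypothetical helper: use the v1-style "# Partition Columns" layout only
// when the table is partitioned purely by plain columns, i.e. every
// transform is an identity transform.
def useV1PartitionFormat(partitioning: Seq[Transform]): Boolean =
  partitioning.nonEmpty && partitioning.forall(_.isInstanceOf[IdentityTransform])
```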

Contributor

For Detailed Table Information, we need to add more information in the v2 command to follow v1, such as Created Time. We can match V1Table directly first, and then add more built-in v2 table properties later.

@Peng-Lei is EXTERNAL a reserved keyword now?

Contributor

Yes, see the PR.

@imback82 (Contributor Author) commented Jan 25, 2022

> The most difficult part is partitioning. V2 supports transforms instead of plain partition column names. I'd suggest handling tables partitioned only by plain columns specially, to not surprise the majority of users.

Can I assume that only the IdentityTransforms are the partition columns?

Contributor

Yes, only IdentityTransforms represent partition columns.

@imback82 imback82 marked this pull request as draft January 21, 2022 04:18
  case DescribeRelation(
-     ResolvedV1TableOrViewIdentifier(ident), partitionSpec, isExtended, output) =>
+     ResolvedV1TableOrViewIdentifier(ident), partitionSpec, isExtended, output)
+       if conf.useV1Command =>
Contributor

I think we should still use v1 command for views.

      append(buffer, s"# ${output.head.name}", output(1).name, output(2).name)
    }
-   schema.foreach { column =>
+   schema.sortBy(_.name).foreach { column =>
Contributor

hmm, this is weird. The column order matters and we should retain it.

    rows += toCatalystRow(s"# ${output(0).name}", output(1).name, output(2).name)
    rows ++= table.partitioning.sortBy(_.describe).map {
      case t =>
        val field = nameToField(t.describe)
Contributor

We should cast the v2 partitioning to IdentityTransform and get the column names.
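That suggestion could be sketched like this (a hypothetical helper, assuming Spark-internal access to IdentityTransform; not code from this PR):

```scala
import org.apache.spark.sql.connector.expressions.{IdentityTransform, Transform}

// Hypothetical helper: keep only identity transforms, whose references
// correspond directly to partition column names; non-identity transforms
// (year, bucket, ...) are dropped.
def partitionColumnNames(partitioning: Seq[Transform]): Seq[String] =
  partitioning.collect {
    case IdentityTransform(ref) => ref.fieldNames.mkString(".")
  }
```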

Row("# col_name", "data_type", "comment"),
Row("id", "bigint", null),
Row("", "", ""),
Row("# Metadata Columns", "", ""),
Contributor

is it the only difference?

Row("Location", "file:/tmp/table_name", ""),
Row("Provider", "parquet", ""),
Row("Owner", "", ""),
Row("External", "true", ""),
Contributor

Seems better to follow the v1 command and print Type: EXTERNAL/MANAGED.
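As a sketch of what that could look like (hypothetical helper; the reliance on a reserved "external" table property is an assumption, not confirmed by this thread):

```scala
// Hypothetical mapping from a v2 table's properties to the v1-style "Type"
// value; assumes an "external" table property marks external tables.
def tableType(properties: java.util.Map[String, String]): String =
  if ("true".equalsIgnoreCase(properties.getOrDefault("external", "false"))) {
    "EXTERNAL"
  } else {
    "MANAGED"
  }
```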

Row("Last Access", "UNKNOWN", ""),
Row("Created By", s"Spark ${org.apache.spark.SPARK_VERSION}", ""),
Row("Type", "EXTERNAL", ""),
Row("Provider", "parquet", ""),
Contributor

Suggested change:
- Row("Provider", "parquet", ""),
+ Row("Provider", defaultUsing.stripPrefix("USING").trim, ""),

class DescribeTableSuite extends command.DescribeTableSuiteBase with CommandSuiteBase {
override def namespace: String = "ns1.ns2"

test("DESCRIBE TABLE with non-'partitioned-by' clause") {
@cloud-fan (Contributor) commented Feb 9, 2022

Suggested change:
- test("DESCRIBE TABLE with non-'partitioned-by' clause") {
+ test("DESCRIBE TABLE with bucketed table") {

}


test("DESCRIBE TABLE with v2 catalog when table does not exist.") {
Contributor

shall we move it to the base suite?

}
}

test("DESCRIBE TABLE EXTENDED using v2 catalog") {
Contributor

}
}

test("SPARK-34561: drop/add columns to a dataset of `DESCRIBE TABLE`") {
Contributor

can we move it to the base suite?

Row("id", "bigint", null),
Row("", "", ""),
Row("# Partitioning", "", ""),
Row("Part 0", "bucket(3, id)", "")))
@cloud-fan (Contributor) commented Feb 9, 2022

In fact, partition transforms are implicitly named. e.g. people can do TRUNCATE TABLE t PARTITION (a=1). We can get the partition names if the table implements SupportsPartitionManagement.

Now I have a new proposal for displaying partitioning, which is more consistent between v1 and v2:

# Partition Information
# col_name            data_type            comment
c                     string
d                     string
# Bucketing Information
# bucket_cols         sort_cols            num_buckets
a, b                  c, d                 4

If there exist partition transforms (which means we can't be v1-compatible anymore):

# Partition Information
# col_name            transform
x                     year(ts)
y                     bucket(3, id)
z                     identity(d)

If the partition col names are unknown (the table doesn't implement SupportsPartitionManagement), we can use unknown_name_0, unknown_name_1, etc.
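The naming fallback described above could be sketched as follows (hypothetical helper, not code from this PR):

```scala
import org.apache.spark.sql.connector.catalog.{SupportsPartitionManagement, Table}

// Hypothetical helper: take partition field names from the partition schema
// when the table implements SupportsPartitionManagement; otherwise
// synthesize unknown_name_0, unknown_name_1, ... per transform.
def partitionFieldNames(table: Table): Seq[String] = table match {
  case p: SupportsPartitionManagement =>
    p.partitionSchema().fields.map(_.name).toSeq
  case _ =>
    table.partitioning.indices.map(i => s"unknown_name_$i")
}
```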

MaxGekk added a commit that referenced this pull request Jun 23, 2022
### What changes were proposed in this pull request?
In the PR, I propose to change output of v2 `DESCRIBE TABLE`, and make it the same as v1. In particular:
1. Return NULL instead of empty strings when any comment doesn't exist in the schema info.
2. When a v2 table has identity transformations (partition by) only, output the partitioning info in v1 style. For instance:
```sql
> CREATE TABLE tbl (id bigint, data string) USING _;
> DESCRIBE TABLE tbl;
+-----------------------+---------+-------+
|col_name               |data_type|comment|
+-----------------------+---------+-------+
|id                     |bigint   |null   |
|data                   |string   |null   |
|# Partition Information|         |       |
|# col_name             |data_type|comment|
|id                     |bigint   |null   |
+-----------------------+---------+-------+
```

Also the PR moves/adds some tests to the base traits:
- "DESCRIBE TABLE of a non-partitioned table"
- "DESCRIBE TABLE of a partitioned table"

and addresses review comments in #35265.

### Why are the changes needed?
The changes unify outputs of v1 and v2 implementations, and make the migration process from the version v1 to v2 easier for Spark SQL users.

### Does this PR introduce _any_ user-facing change?
Yes, it changes outputs of v2 `DESCRIBE TABLE`.

### How was this patch tested?
By running the `DESCRIBE TABLE` test suites:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *DescribeTableSuite"
```
and related test suites:
```
$ build/sbt "test:testOnly *DataSourceV2SQLSuite"
```

Closes #36946 from MaxGekk/unify-v1-v2-describe-table.

Authored-by: Max Gekk <[email protected]>
Signed-off-by: Max Gekk <[email protected]>
a0x8o added a commit to a0x8o/spark that referenced this pull request Jun 23, 2022
@github-actions

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Jul 16, 2022
@github-actions github-actions bot closed this Jul 17, 2022