
Conversation

Member

@dongjoon-hyun (Member) commented Sep 20, 2016

What changes were proposed in this pull request?

This PR implements the DESCRIBE table PARTITION SQL syntax again. It was supported until Spark 1.6.2, but was dropped in 2.0.0.

Spark 1.6.2

scala> sql("CREATE TABLE partitioned_table (a STRING, b INT) PARTITIONED BY (c STRING, d STRING)")
res1: org.apache.spark.sql.DataFrame = [result: string]

scala> sql("ALTER TABLE partitioned_table ADD PARTITION (c='Us', d=1)")
res2: org.apache.spark.sql.DataFrame = [result: string]

scala> sql("DESC partitioned_table PARTITION (c='Us', d=1)").show(false)
+----------------------------------------------------------------+
|result                                                          |
+----------------------------------------------------------------+
|a                      string                                   |
|b                      int                                      |
|c                      string                                   |
|d                      string                                   |
|                                                                |
|# Partition Information                                         |
|# col_name             data_type               comment          |
|                                                                |
|c                      string                                   |
|d                      string                                   |
+----------------------------------------------------------------+

Spark 2.0

  • Before
scala> sql("CREATE TABLE partitioned_table (a STRING, b INT) PARTITIONED BY (c STRING, d STRING)")
res0: org.apache.spark.sql.DataFrame = []

scala> sql("ALTER TABLE partitioned_table ADD PARTITION (c='Us', d=1)")
res1: org.apache.spark.sql.DataFrame = []

scala> sql("DESC partitioned_table PARTITION (c='Us', d=1)").show(false)
org.apache.spark.sql.catalyst.parser.ParseException:
Unsupported SQL statement
  • After
scala> sql("CREATE TABLE partitioned_table (a STRING, b INT) PARTITIONED BY (c STRING, d STRING)")
res0: org.apache.spark.sql.DataFrame = []

scala> sql("ALTER TABLE partitioned_table ADD PARTITION (c='Us', d=1)")
res1: org.apache.spark.sql.DataFrame = []

scala> sql("DESC partitioned_table PARTITION (c='Us', d=1)").show(false)
+-----------------------+---------+-------+
|col_name               |data_type|comment|
+-----------------------+---------+-------+
|a                      |string   |null   |
|b                      |int      |null   |
|c                      |string   |null   |
|d                      |string   |null   |
|# Partition Information|         |       |
|# col_name             |data_type|comment|
|c                      |string   |null   |
|d                      |string   |null   |
+-----------------------+---------+-------+

scala> sql("DESC EXTENDED partitioned_table PARTITION (c='Us', d=1)").show(100,false)
+-----------------------+---------+-------+
|col_name               |data_type|comment|
+-----------------------+---------+-------+
|a                      |string   |null   |
|b                      |int      |null   |
|c                      |string   |null   |
|d                      |string   |null   |
|# Partition Information|         |       |
|# col_name             |data_type|comment|
|c                      |string   |null   |
|d                      |string   |null   |
|                       |         |       |
|Detailed Partition Information CatalogPartition(
        Partition Values: [Us, 1]
        Storage(Location: file:/Users/dhyun/SPARK-17612-DESC-PARTITION/spark-warehouse/partitioned_table/c=Us/d=1, InputFormat: org.apache.hadoop.mapred.TextInputFormat, OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, Serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Properties: [serialization.format=1])
        Partition Parameters:{transient_lastDdlTime=1475001066})|         |       |
+-----------------------+---------+-------+


scala> sql("DESC FORMATTED partitioned_table PARTITION (c='Us', d=1)").show(100,false)
+--------------------------------+---------------------------------------------------------------------------------------+-------+
|col_name                        |data_type                                                                              |comment|
+--------------------------------+---------------------------------------------------------------------------------------+-------+
|a                               |string                                                                                 |null   |
|b                               |int                                                                                    |null   |
|c                               |string                                                                                 |null   |
|d                               |string                                                                                 |null   |
|# Partition Information         |                                                                                       |       |
|# col_name                      |data_type                                                                              |comment|
|c                               |string                                                                                 |null   |
|d                               |string                                                                                 |null   |
|                                |                                                                                       |       |
|# Detailed Partition Information|                                                                                       |       |
|Partition Value:                |[Us, 1]                                                                                |       |
|Database:                       |default                                                                                |       |
|Table:                          |partitioned_table                                                                      |       |
|Location:                       |file:/Users/dhyun/SPARK-17612-DESC-PARTITION/spark-warehouse/partitioned_table/c=Us/d=1|       |
|Partition Parameters:           |                                                                                       |       |
|  transient_lastDdlTime         |1475001066                                                                             |       |
|                                |                                                                                       |       |
|# Storage Information           |                                                                                       |       |
|SerDe Library:                  |org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe                                     |       |
|InputFormat:                    |org.apache.hadoop.mapred.TextInputFormat                                               |       |
|OutputFormat:                   |org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat                             |       |
|Compressed:                     |No                                                                                     |       |
|Storage Desc Parameters:        |                                                                                       |       |
|  serialization.format          |1                                                                                      |       |
+--------------------------------+---------------------------------------------------------------------------------------+-------+

How was this patch tested?

Passes the Jenkins tests with a new test case.

Contributor

Can you add some validation for the output returned?

Member Author

Hi, @skambha.
Thank you for the review. BTW, what kind of validation do you want? If we had DESC command output validation test cases, the output would be the same.

Contributor

I was thinking of verifying that the output correctly identifies the partition columns.
Is there a test case that covers this already? If so, that is fine.

Member Author

> the output result points out the partition columns correctly.

Actually, as you can see in the PR description, the output with a PARTITION spec is the same as that of the plain DESC command. In other words, the output does not depend on the given spec; the command only raises an exception if the given partition does not exist.

Contributor

Hive supports describing a specified partition and lists the details of that particular partition when used with the FORMATTED or EXTENDED option:
DESCRIBE formatted part_table partition (d='abc')
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-Describe

@dongjoon-hyun, this might be beyond the scope of this PR, but it would be useful when there are a lot of partitions and we want to find the details of a given one. What do you think? Thanks.

Member Author

@skambha. Let's wait for some feedback from committers. Before putting more effort into this, I want to know whether this kind of PR about the DESC regression will be merged into master.

@SparkQA

SparkQA commented Sep 20, 2016

Test build #65680 has finished for PR 15168 at commit 1396eb1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class DescribeTableCommand(

@dongjoon-hyun
Member Author

Hi, @rxin .
Could you review this PR to implement DESCRIBE table PARTITION?

@dongjoon-hyun
Member Author

Hi, @hvanhovell .
Could you review this PR when you have some time?

Contributor

@hvanhovell left a comment

@dongjoon-hyun am I right to assume that this currently does not describe the partition the user requests? Why not implement that now while you are at it?

Contributor

Could you collect these here and add some names? It is a bit cryptic at the moment.

Contributor

Does this touch the result? How is the partition information added to the result?

Contributor

Is it possible to test the output with SQLQueryTestSuite? Or does that have portability issues?

@dongjoon-hyun
Member Author

dongjoon-hyun commented Sep 21, 2016

Thank you, @hvanhovell. Yes, as you mentioned, EXTENDED and FORMATTED are currently not included in this PR since I wasn't sure whether Spark 2.0 wanted to fix this regression.

I'll add them too, like Spark 1.6.

@dongjoon-hyun
Member Author

Hi, @hvanhovell.
Currently, DESC FORMATTED TABLE and DESC EXTENDED TABLE look a little different from Apache Spark 1.6; I think they're slightly improved. I'm revising the DETAIL/EXTENDED partition info in this PR to be consistent with Spark 2.0, not with Spark 1.6.

@dongjoon-hyun
Member Author

dongjoon-hyun commented Sep 23, 2016

Hi, @hvanhovell. I updated the following:

  • Add EXTENDED and FORMATTED descriptions for PARTITION.
  • Move the tests into SQLQueryTestSuite (describe.sql). Please note that the suite sorts the result, so it looks different; the PR description shows the output more accurately.
  • Update the PR description.

Could you review this again when you have some time?

@gatorsmile
Member

gatorsmile commented Sep 23, 2016

I also like this feature. The Hive-generated statistics for partitioned tables are stored in partition-level properties. See the PR: #15158

We need it before https://issues.apache.org/jira/browse/SPARK-17129 is done; otherwise, users are unable to read the statistics of partitioned tables.

BTW, will try to review this PR tonight.

@dongjoon-hyun
Member Author

Thank you, @gatorsmile .

@SparkQA

SparkQA commented Sep 23, 2016

Test build #65842 has finished for PR 15168 at commit c33af71.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 23, 2016

Test build #65841 has finished for PR 15168 at commit f8fb269.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member Author

Oh, my bad.
The only failure is describe.sql. We cannot use SQLQueryTestSuite since the last access time is always different.

describe.sql *** FAILED *** (272 milliseconds)
[info]   Expected "...reated: Fri Sep 23 1[2:30:40] PDT 2016
[info]      Last Acce...", but got "...reated: Fri Sep 23 1[4:47:49] PDT 2016
[info]      Last Acce..." Result did not match for query #3

I'll move the test case back into the original test suite.
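The root cause above is that golden-file comparison breaks on nondeterministic fields such as the creation and last-access times. A hedged sketch of one possible workaround in plain Scala (the `maskTimestamps` helper is hypothetical, not the actual SQLQueryTestSuite code, which had no such normalization at the time):

```scala
// Sketch only: normalize wall-clock fields so two runs of the same
// query produce comparable output. Helper name is an assumption.
def maskTimestamps(output: String): String =
  output.replaceAll("(Created|Last Access): .*", "$1: <masked>")

// Two runs of DESC differ only in these timestamps...
val run1 = "Created: Fri Sep 23 12:30:40 PDT 2016"
val run2 = "Created: Fri Sep 23 14:47:49 PDT 2016"
// ...so after masking they compare equal.
```

Without such normalization in the harness, moving the test back to a suite that does not compare raw timestamps is the simpler fix.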

@SparkQA

SparkQA commented Sep 24, 2016

Test build #65859 has finished for PR 15168 at commit ba22975.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member Author

The failure seems unrelated. Retest this please.

[info] - Naive Bayes Multinomial *** FAILED *** (137 milliseconds)
[info]   Expected 0.7 and 0.6494565217391305 to be within 0.05 using absolute tolerance.
[info]   validateModelFit:

@gatorsmile
Member

retest this please

@SparkQA

SparkQA commented Sep 24, 2016

Test build #65861 has finished for PR 15168 at commit ba22975.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

gatorsmile commented Sep 24, 2016

The implementation of this PR does not support data source tables. Please try the following example.

      spark
        .range(1).select('id as 'a, 'id as 'b, 'id as 'c, 'id as 'd).write
        .bucketBy(2, "b").sortBy("c").partitionBy("d")
        .saveAsTable("t1")
      sql("DESC FORMATTED t1 partition (d=1)").show(100, false)

The error you will get is:

org.apache.hadoop.hive.ql.metadata.Table.ValidationFailureSemanticException: table is not partitioned but partition spec exists: {d=1};

@dongjoon-hyun
Member Author

Thank you, @gatorsmile . I'll fix that soon.

@dongjoon-hyun
Member Author

dongjoon-hyun commented Sep 24, 2016

Hi, @gatorsmile.

It's the same behavior as Apache Spark 1.6.2; this PR doesn't introduce any regression for datasource tables. Actually, the error you mentioned is raised by the Hive metastore since the table is not Hive-compatible, as you can see in the WARNING below.

scala> sc.version
res5: String = 1.6.2

scala> sqlContext.range(1).select('id as 'a, 'id as 'b, 'id as 'c, 'id as 'd).write.partitionBy("d").saveAsTable("t1")
16/09/24 08:24:51 WARN HiveContext$$anon$2: Persisting partitioned data source relation `t1` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive. Input path(s): 
file:/user/hive/warehouse/t1

scala> sql("DESC FORMATTED t1 partition (d=1)").show(100, false)
16/09/24 08:25:06 ERROR DDLTask: java.lang.RuntimeException: cannot find field null from [0:col]
    at org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils.getStandardStructFieldRef(ObjectInspectorUtils.java:416)

@gatorsmile
Member

Yeah, I knew. : ) This PR is to support DESCRIBE PARTITION, right? The other DESCRIBE commands support all the table types, so I believe we need to support all the cases for DESCRIBE PARTITION as well.

@dongjoon-hyun
Member Author

Actually, I think there was a reason not to do that in Spark 1.6.2. In Spark 2.0, is there any new way to look up the partition info of a Parquet datasource through SessionCatalog? SessionCatalog still seems not to support that, just like Spark 1.6.2.

Anyway, if we need to support that here, this PR needs to become an IMPROVEMENT JIRA issue. So far, it has been a BUG issue to recover the regression. :)

@gatorsmile
Member

Yeah, please change the JIRA to improvement.

I did not check the code of Spark 1.6.2, but since Spark 2.0 we provide native support for DESC TABLE. See the PR: #13025

@dongjoon-hyun
Member Author

Sure. I'll start investigating from that. Thank you for the pointer, @gatorsmile.

@gatorsmile
Member

gatorsmile commented Sep 25, 2016

Partitioned views are another potential issue for this PR.

FYI, I just submitted a related PR: #15233

@dongjoon-hyun
Member Author

Hi, @gatorsmile. I investigated the related PRs.

Unfortunately, those extract the partition column names from table properties, but table properties do not contain partition column values. Currently, Spark does not hold the partition column values of a datasource table in CatalogTable.

In this case, I'd like to file a separate issue for that since it's really a new feature. Since it wasn't supported in Spark 1.6, I think it's a nice-to-have item, not a must-have for this PR. What do you think?

@gatorsmile
Member

Thank you for your investigation! How about generating a user-friendly error message for these unsupported cases? The error messages from Hive are very confusing to external users.
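Such an up-front check could look roughly like this in plain Scala (a minimal sketch; `TableMeta`, `checkDescPartition`, and the message text are hypothetical stand-ins, not the actual DescribeTableCommand code):

```scala
// Sketch: detect the unsupported datasource-table case before reaching
// the Hive metastore, and return a clear message instead of a cryptic
// metastore error. TableMeta is a hypothetical simplification.
case class TableMeta(name: String, isDatasourceTable: Boolean)

def checkDescPartition(table: TableMeta): Either[String, Unit] =
  if (table.isDatasourceTable)
    Left(s"DESC PARTITION is not allowed on a datasource table: ${table.name}")
  else
    Right(())

// The t1 example from the earlier comment would be rejected cleanly.
val result = checkDescPartition(TableMeta("t1", isDatasourceTable = true))
```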

@dongjoon-hyun
Member Author

Yep, that would be great. Thank you for the guidance, as always!

@SparkQA

SparkQA commented Sep 25, 2016

Test build #65877 has finished for PR 15168 at commit 21e8295.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member Author

Thank you for the review, @gatorsmile.

Hi, @hvanhovell .
Could you review this again?

@hvanhovell
Contributor

@dongjoon-hyun sorry about that. Is the PR description (including all the exceptions) up to date?

@dongjoon-hyun
Member Author

Never mind~. Thank you for coming back! I'll check the PR description again and make it up to date.

Contributor

@hvanhovell left a comment

Looks pretty good. Left one comment.

val (valid, nonValid) = visitPartitionSpec(ctx.partitionSpec).partition(_._2.isDefined)
if (nonValid.nonEmpty) {
// For non-valid specification for `DESC` command, this raises a ParseException.
return null
Contributor

Just throw a meaningful parse exception and simplify this logic.

visitPartitionSpec(ctx.partitionSpec).map {
  case (key, Some(value)) => key -> value
  case  _ => throw new ParseException("...", ctx)
}
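The suggested simplification can be sketched as self-contained Scala (with Spark's ParseException replaced by IllegalArgumentException, and `validateSpec` a hypothetical stand-in for the parser method):

```scala
// Sketch: turn a parsed spec of column -> optional value into a fully
// specified column -> value map, failing fast when any value is missing,
// instead of partitioning into valid/non-valid halves first.
def validateSpec(spec: Map[String, Option[String]]): Map[String, String] =
  spec.map {
    case (key, Some(value)) => key -> value
    case (key, None) =>
      throw new IllegalArgumentException(
        s"PARTITION specification is incomplete: missing value for column '$key'")
  }

// A complete spec passes through; an incomplete one throws.
val full = validateSpec(Map("c" -> Some("Us"), "d" -> Some("1")))
val rejectsPartial =
  try { validateSpec(Map("c" -> None)); false }
  catch { case _: IllegalArgumentException => true }
```

This keeps the happy path as a single map over the spec, with the error raised at the offending entry.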

Member Author

Thank you. I'll fix it soon.

Member Author

Done. Now, Jenkins is running.

@dongjoon-hyun
Member Author

Now, the PR description is also up-to-date.

@SparkQA

SparkQA commented Sep 27, 2016

Test build #65988 has finished for PR 15168 at commit ec574ff.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 27, 2016

Test build #65989 has finished for PR 15168 at commit 2ed1069.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -0,0 +1,24 @@
CREATE TABLE t (a STRING, b INT) PARTITIONED BY (c STRING, d STRING);
Member

Drop the table?

Member Author

Done. Thank you!

@SparkQA

SparkQA commented Sep 27, 2016

Test build #65993 has finished for PR 15168 at commit fc132e6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 27, 2016

Test build #66001 has finished for PR 15168 at commit 04c75e5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member Author

Finally, it passed, @hvanhovell!

@dongjoon-hyun
Member Author

Hi, @hvanhovell .
Could you review this again?

@dongjoon-hyun
Member Author

Hi, @hvanhovell. I thought it was a regression bug fix.
I'm wondering whether you think this is a new feature for Spark 2.1.

@hvanhovell
Contributor

@dongjoon-hyun we made the decision to regress here willingly, so I think it is a new feature; also, this is currently not backportable.

LGTM. Merging to master. Thanks.

@asfgit asfgit closed this in 4ecc648 Sep 29, 2016
@dongjoon-hyun
Member Author

Oh, I see. Does that mean it's not allowed to be backported due to policy, even though I made a patch for branch-2.0?

@dongjoon-hyun
Member Author

Thank you so much, @hvanhovell , @gatorsmile , and @skambha !

@gatorsmile
Member

When implementing the native DDL support, we blocked a few DDL statements due to resource limits. Thus, this PR is not a bug fix. In theory, we should not add new features to previous releases.

@hvanhovell
Contributor

@dongjoon-hyun the risk of this PR breaking something is quite low, so I am open to backporting as long as it does not create too many issues. BTW, did you open a backport? This one was against master?

@dongjoon-hyun
Member Author

@gatorsmile, @hvanhovell. Yep, I knew the policy about new-feature and bug-fix issues, so I wished this PR would be considered a bug fix from the beginning. But yeah, it's a new feature indeed.

@hvanhovell. I haven't opened one yet since this PR wasn't merged before. I'll make a backport PR so it gets a chance to be reviewed; it will be mostly the same code. If you think that's too much at that point, I'll close the backport PR.

Thank you again, All.
