Skip to content

Conversation

@wzhfy
Copy link
Contributor

@wzhfy wzhfy commented Oct 5, 2016

What changes were proposed in this pull request?

This pr adds some test cases for statistics: case sensitive column names, non ascii column names, refresh table, and also improves some documentation.

How was this patch tested?

add test cases

@wzhfy
Copy link
Contributor Author

wzhfy commented Oct 5, 2016

cc @cloud-fan @gatorsmile

@SparkQA
Copy link

SparkQA commented Oct 5, 2016

Test build #66378 has finished for PR 15360 at commit 0ad7c88.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Copy link
Member

Thank you! Will review it tonight or tomorrow morning.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you create a separate function for the following checking logics? Then, you can have two test cases without duplicate codes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a bug exposed by the newly added test case?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:) Improving the test case coverage is important.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

detected -> deduplicated

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please leave the comments to explain which DDL commands trigger the refresh; Otherwise, the reviewers might be confused about what this test case is doing.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the purpose of this DDL?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The above both DDL will call refreshTable with the same table name. Right? If the source codes remove any refreshTable, the test case still passes. Right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I'll split these two command into two separate test cases, for table stats and column stats respectively.

@SparkQA
Copy link

SparkQA commented Oct 7, 2016

Test build #66503 has finished for PR 15360 at commit e7979c6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Copy link
Member

We need a test case for Hive serde table. So far, I still have not found any test case to cover Hive serde tables.

@wzhfy
Copy link
Contributor Author

wzhfy commented Oct 8, 2016

@gatorsmile Oh, I thought by "hive serde tables" you mean tables stored in Hive metastore.
Let me create test cases for both data source table and hive serde table in sql/hive.

@SparkQA
Copy link

SparkQA commented Oct 8, 2016

Test build #66581 has finished for PR 15360 at commit 2ee4252.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wzhfy
Copy link
Contributor Author

wzhfy commented Oct 8, 2016

retest this please

@SparkQA
Copy link

SparkQA commented Oct 9, 2016

Test build #66586 has finished for PR 15360 at commit 2ee4252.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Copy link
Member

Will review this tonight. Thanks!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We hit a bug here... Not by your PRs, but this test case just exposes it. No need to worry about it. I will fix it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What bug? Please let me know when that bug fix pr is sent. :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should not attempt to create a Hive-compatible table in this case. It always fails because of column names.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does it cause any problems? The logic to create Hive-compatible table is quite conservative, we will try to save into hive metastore first, if fails, fallback to spark specific format.

Copy link
Contributor Author

@wzhfy wzhfy Oct 14, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It outputs a warning including an exception, and the test can complete successfully.

WARN org.apache.spark.sql.hive.HiveExternalCatalog: Could not persist `default`.`tbl` in a Hive compatible way. Persisting it into Hive metastore in Spark SQL specific format.
org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.hadoop.hive.ql.metadata.HiveException: Duplicate column name c1 in the table definition.
    at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:720)
...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a comment to briefly explain the test case scenario. Thanks!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you leave a code comment here to emphasize it? I am just afraid this might be modified without notice. Newly computed stats should override the existing stats.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you deduplicate the two test cases refreshing table stats and refreshing column stats by calling the same common function?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gatorsmile rebased and updated.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

statsBeforeAfterUpdate -> getStatsBeforeAfterAnalyzeCommand

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Analyze Table COMPUTE STATISTICS FOR COLUMNS is also Analyze Table. Thus, the input parm name is confusing. How about isAnalyzeTable -> isAnalyzeColumns?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gatorsmile
Copy link
Member

cc @cloud-fan I do not have any more comment. Could you check this please? Thanks!

@SparkQA
Copy link

SparkQA commented Oct 12, 2016

Test build #66799 has finished for PR 15360 at commit 1e64163.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Oct 12, 2016

Test build #66807 has finished for PR 15360 at commit d93d082.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wzhfy wzhfy force-pushed the colStats2 branch 2 times, most recently from d782c14 to 30ac539 Compare October 14, 2016 02:23
@wzhfy
Copy link
Contributor Author

wzhfy commented Oct 14, 2016

resolve conflicts

@SparkQA
Copy link

SparkQA commented Oct 14, 2016

Test build #66935 has finished for PR 15360 at commit d782c14.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds the following public classes (experimental):
    • case class AnalyzeColumnCommand(

@SparkQA
Copy link

SparkQA commented Oct 14, 2016

Test build #66937 has finished for PR 15360 at commit 30ac539.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val (statsBeforeUpdate, statsAfterUpdate) = getStatsBeforeAfterUpdate(isAnalyzeColumns = false)

assert(statsBeforeUpdate.sizeInBytes > 0)
assert(statsBeforeUpdate.rowCount.contains(1))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: we should not use Option as a collection, but use it more explicitly statsBeforeUpdate.rowCount == Some(1) . BTW Option.contains is not in scala 2.10

rsd = spark.sessionState.conf.ndvMaxError)
}

private def dataAndColStats(): (DataFrame, Seq[(StructField, ColumnStat)]) = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this method doesn't take any parameters so its result is static. Can we just create 2 fields for them? e.g.

private lazy val testDataFrame = ...
private lazy val expectedStats = ...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They share some common values e.g. intSeq, stringSeq... so I put them in a single method.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

then can we

private lazy val (testDataFrame, expectedStats) = {
  ...
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's good, thanks!

@SparkQA
Copy link

SparkQA commented Oct 14, 2016

Test build #66955 has finished for PR 15360 at commit 6cf23ae.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

LGTM, merging to master!

@asfgit asfgit closed this in 7486442 Oct 14, 2016
robert3005 pushed a commit to palantir/spark that referenced this pull request Nov 1, 2016
## What changes were proposed in this pull request?
This pr adds some test cases for statistics: case sensitive column names, non ascii column names, refresh table, and also improves some documentation.

## How was this patch tested?
add test cases

Author: wangzhenhua <[email protected]>

Closes apache#15360 from wzhfy/colStats2.
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
## What changes were proposed in this pull request?
This pr adds some test cases for statistics: case sensitive column names, non ascii column names, refresh table, and also improves some documentation.

## How was this patch tested?
add test cases

Author: wangzhenhua <[email protected]>

Closes apache#15360 from wzhfy/colStats2.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants