-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-17078] [SQL] Show stats when explain #16594
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
cc @rxin @cloud-fan |
|
Test build #71424 has started for PR 16594 at commit |
|
retest this please |
|
Test build #71430 has finished for PR 16594 at commit
|
|
Test build #71508 has finished for PR 16594 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A general style suggestion. Normally, the SQL keywords are using upper case in the test cases.
explain select * from src -> EXPLAIN SELECT * FROM src
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks, fixed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not doing this by default? Do we need an extra flag?
If needed, the name should be SHOW_TABLE_STATS_IN_EXPLAIN
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we do it by default, it can simplify this PR a lot.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SHOW_TABLE_STATS_IN_EXPLAIN could be misleading, because we are not only showing stats for table, but also for all logical plans.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's invalidated by default because stats info can be inaccurate (and in some cases very inaccurate), and can confuse regular users. At current stage it's better to be a feature for administrators and developers to see how cbo behaves in estimation. So I make the flag "internal".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then, when the stats are not accurate, will it be the cause of an inefficient plan? If so, why not showing them the number?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure. e.g., after joins of many tables, if sizeInBytes is computed by the simple way (non-cbo way), we just multiply all the sizes of these tables, then sizeInBytes becomes a ridiculously large value. I think this will harm user experience.
I agree removing the flag can simplify code a lot, but I'm hesitated to expose such information to all users.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the sizeInBytes affects the plan decision, I think it makes sense to let users see it.
When the plan is not expected and the number is super large, they might turn on/off CBO or trigger the command to re-analyze the tables. Hiding it looks not right to me, even if the number is ugly. : )
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. But since it changes user interface, let's double check with others. @rxin @hvanhovell @cloud-fan Shall we show stats of LogicalPlan directly in explain command ?
|
Test build #71588 has finished for PR 16594 at commit
|
|
@wzhfy could you add an example of this to the PR description? I am a bit worried that the explain plans will become (much) harder to read. I am also interested to see if this new explain output is understandable for an end user. |
|
@hvanhovell I've updated the description which shows a simple example. The explained plan will become hard to read when joining many tables and sizeInBytes is computed by the simple way (non-cbo way), i.e. we just multiply all the sizes of these tables, then sizeInBytes becomes a super large value (could be more than a hundred digits). |
|
sorry this explain plan makes no sense -- it is impossible to read. |
|
@rxin Can we add a flag to enable or disable it? Currently there's no other way to see size and row count except debugging. |
|
Let us do some research how the other RDBMSs are doing it? For example, Oracle |
|
PostgreSQL has a few different options in the EXPLAIN command: The same plan with cost estimates suppressed: |
|
MySQL also outputs the statistics in the command of EXPLAIN. See the link: https://dev.mysql.com/doc/refman/5.7/en/explain-extended.html
|
|
SQLServer has three ways to show the plan: graphical plans, text plans, and XML plans. Actually, it is pretty advanced. When using the text plans, users can set the output formats:
I found a 300-pages book |
|
:- ) No perfect solution, but we should use the metric prefix when the number is huge. |
|
Test build #71847 has finished for PR 16594 at commit
|
|
To show a very large Long number, there is no need to print out every digit in the number. We can use exponent. For example, a number 120,000,000,005,123 can be printed as 1.2*10E14, where 10E14 means 10 to the power 14. |
|
@ron8hu Yes, I've already updated this pr. I'll present an example. |
|
@rxin @gatorsmile @hvanhovell I've updated this pr and make stats much more readable: SizeInBytes is shown in units of B, KB, MB ... PB, e.g. Now the above example looks like this: |
|
Test build #71906 has finished for PR 16594 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are having bytesToString in Utils.scala
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That method only accepts Long parameter, and estimated stats can still be unreadable even when using TB as unit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll try to use that method in combination with current logic, thanks for reminding
|
I still do not think using an internal configuration is a user friendly way to show the plan costs. Basically, using this way, we are not wanting users to see it. When CBO is introduced, we should let users see it. The traditional RDBMS admins might not expect such a design |
|
@gatorsmile I just did a quick fix to show how the improved stats look like. If @rxin @hvanhovell accept the change proposed in this pr, I'll update to remove the flag :) |
|
Test build #71921 has started for PR 16594 at commit |
|
ok here is an idea how about as the way to add stats? we already have explain codegen. |
|
I like the idea proposed by rxin |
|
me 2. |
|
ok I'll modify it with this new command. |
| decimalValue.toString() + " B" | ||
| } | ||
| } else { | ||
| decimalValue.toString() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Always represent it using scientific notation? Or only do it when the number is too large?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://en.wikipedia.org/wiki/Metric_prefix
Even if we do not have a unit, we still can use K, M, G, T, P, E?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure, will that be more readable than scientific notation if no unit?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With or without units, the readability is the same, right? If we make them consistent, the impl of def format(number: BigInt) will look much cleaner.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can't make them consistent here, because unit string is added inside Utils.bytesToString.
How about move the logic for size into Utils.bytesToString and make it support BigInt?
Then we can remove def format:
def simpleString: String = {
Seq(s"sizeInBytes=${Utils.bytesToString(sizeInBytes)}",
if (rowCount.isDefined) {
// Show row count in scientific notation.
s"rowCount=${BigDecimal(rowCount.get, new MathContext(3, RoundingMode.HALF_UP)).toString()}"
} else {
""
},
s"isBroadcastable=$isBroadcastable"
).filter(_.nonEmpty).mkString(", ")
}
|
|
||
| def toStringWithStats: String = completeString(appendStats = true) | ||
|
|
||
| def completeString(appendStats: Boolean): String = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
private?
| } else if (isExplainableStatement(statement)) { | ||
| ExplainCommand(statement, extended = ctx.EXTENDED != null, codegen = ctx.CODEGEN != null) | ||
| ExplainCommand(statement, extended = ctx.EXTENDED != null, codegen = ctx.CODEGEN != null, | ||
| cost = ctx.COST != null) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to fix the style.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you give a clue on the style?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ExplainCommand(
statement,
extended = ctx.EXTENDED != null,
codegen = ctx.CODEGEN != null,
cost = ctx.COST != null)
| extended: Boolean = false, | ||
| codegen: Boolean = false) | ||
| codegen: Boolean = false, | ||
| cost: Boolean = false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add @parm like the other parameters
| def format(number: BigInt, isSize: Boolean): String = { | ||
| val decimalValue = BigDecimal(number, new MathContext(3, RoundingMode.HALF_UP)) | ||
| if (isSize) { | ||
| // The largest unit in Utils.bytesToString is TB |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about improving bytesToString and make it support PB or higher?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yea, I also think TB is a little small
|
Test build #73200 has finished for PR 16594 at commit
|
|
Test build #73295 has finished for PR 16594 at commit
|
| FORMAT: 'FORMAT'; | ||
| LOGICAL: 'LOGICAL'; | ||
| CODEGEN: 'CODEGEN'; | ||
| COST: 'COST'; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also put in it nonReserved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. Also please update the hiveNonReservedKeyword in TableIdentifierParserSuite
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! Updated.
|
LGTM except one comment |
|
Test build #73402 has started for PR 16594 at commit |
|
LGTM, pending test |
|
retest this please |
|
Test build #73419 has finished for PR 16594 at commit
|
|
thanks, merging to master! |
## What changes were proposed in this pull request? Currently we can only check the estimated stats in logical plans by debugging. We need to provide an easier and more efficient way for developers/users. In this pr, we add EXPLAIN COST command to show stats in the optimized logical plan. E.g. ``` spark-sql> EXPLAIN COST select count(1) from store_returns; ... == Optimized Logical Plan == Aggregate [count(1) AS count(1)#24L], Statistics(sizeInBytes=16.0 B, rowCount=1, isBroadcastable=false) +- Project, Statistics(sizeInBytes=4.3 GB, rowCount=5.76E+8, isBroadcastable=false) +- Relation[sr_returned_date_sk#3,sr_return_time_sk#4,sr_item_sk#5,sr_customer_sk#6,sr_cdemo_sk#7,sr_hdemo_sk#8,sr_addr_sk#9,sr_store_sk#10,sr_reason_sk#11,sr_ticket_number#12,sr_return_quantity#13,sr_return_amt#14,sr_return_tax#15,sr_return_amt_inc_tax#16,sr_fee#17,sr_return_ship_cost#18,sr_refunded_cash#19,sr_reversed_charge#20,sr_store_credit#21,sr_net_loss#22] parquet, Statistics(sizeInBytes=28.6 GB, rowCount=5.76E+8, isBroadcastable=false) ... ``` ## How was this patch tested? Add test cases. Author: wangzhenhua <[email protected]> Author: Zhenhua Wang <[email protected]> Closes apache#16594 from wzhfy/showStats.

What changes were proposed in this pull request?
Currently we can only check the estimated stats in logical plans by debugging. We need to provide an easier and more efficient way for developers/users.
In this pr, we add EXPLAIN COST command to show stats in the optimized logical plan.
E.g.
How was this patch tested?
Add test cases.