feat(bigquery): add support for query cost estimate #18694

kzosabe wants to merge 8 commits into apache:master from kzosabe:feat/bigquery-cost-estimate
Conversation
```python
@classmethod
def estimate_statement_cost(
    cls, statement: str, cursor: Any, engine: Engine
```
The only way to estimate the cost in advance in BigQuery is to run the query with dry_run, and since that isn't possible with only the cursor, I added engine as an argument.
Another option would be to work with bigquery.Client directly by configuring SQLAlchemy to pass the dry_run parameter when creating the connection, but that seems more complicated...
```python
@classmethod
def query_cost_formatter(
    cls, raw_cost: List[Dict[str, Any]]
) -> List[Dict[str, str]]:
    def format_bytes_str(raw_bytes: int) -> str:
        if not isinstance(raw_bytes, int):
            return str(raw_bytes)
        units = ["B", "KiB", "MiB", "GiB", "TiB", "PiB"]
        index = 0
        bytes = float(raw_bytes)
        while bytes >= 1024 and index < len(units) - 1:
            bytes /= 1024
            index += 1
        return "{:.1f}".format(bytes) + f" {units[index]}"

    return [
        {
            k: format_bytes_str(v) if k == "Total bytes processed" else str(v)
            for k, v in row.items()
        }
        for row in raw_cost
    ]
```
It seems this logic overlaps with the humanize functions in the query_cost_formatter methods in TrinoEngineSpec and PrestoEngineSpec. I wonder if we should move humanize to BaseEngineSpec so we could remove the duplication?
Thanks for the review!

> so we could remove the duplication?

It may well be better to go DRY, but there are a few things to consider.
The intent of this implementation was to be consistent with the official UI provided by BigQuery, both in using KiB notation and in rounding to one decimal place.

In particular, the current Presto and Trino implementations divide by 1000 instead of 1024, which is a problem: there would be a small difference between the number of predicted bytes shown in BigQuery and the number shown in Superset.
I would like to avoid using the current humanize implementation as is, because that would confuse users.
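To make the discrepancy concrete, here is a minimal sketch (not Superset code) comparing 1000-based and 1024-based formatting of the same raw byte count — the pair of values that also shows up in the test diff as "354 GB" vs "329 GiB":

```python
def humanize_1000(raw_bytes: int) -> str:
    """1000-based formatting, as Presto/Trino currently do."""
    units = ["B", "KB", "MB", "GB", "TB", "PB"]
    value, index = float(raw_bytes), 0
    while value >= 1000 and index < len(units) - 1:
        value /= 1000
        index += 1
    return f"{value:.1f} {units[index]}"


def humanize_1024(raw_bytes: int) -> str:
    """1024-based formatting, matching the BigQuery console."""
    units = ["B", "KiB", "MiB", "GiB", "TiB", "PiB"]
    value, index = float(raw_bytes), 0
    while value >= 1024 and index < len(units) - 1:
        value /= 1024
        index += 1
    return f"{value:.1f} {units[index]}"


# The same dry-run estimate renders differently under each convention:
print(humanize_1000(354_000_000_000))  # 354.0 GB
print(humanize_1024(354_000_000_000))  # 329.7 GiB
```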
I think there are several possible patterns:

a. Based on the humanize implementation, add methods that take prefixes and to_next_prefix as parameters. This allows a common implementation with identical results, but is somewhat complex to implement.
b. Provide two methods, humanize_number and humanize_bytes. The byte-count display in Trino and Presto would change slightly.
c. Keep a separate implementation (or share one only between Trino and Presto).

Which do you think is best? Any of them is OK with me, and I'll give it a try.
However, it might be better to do this in a separate PR.
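Option (a) could look something like the following sketch (the name and signature are hypothetical, not the actual Superset API):

```python
from typing import Sequence


def humanize(
    value: float,
    suffix: str = "",
    prefixes: Sequence[str] = ("", "K", "M", "G", "T", "P"),
    to_next_prefix: float = 1000.0,
) -> str:
    """Scale value down by to_next_prefix until it fits the current prefix."""
    index = 0
    value = float(value)
    while value >= to_next_prefix and index < len(prefixes) - 1:
        value /= to_next_prefix
        index += 1
    # Collapse the double space that appears when the prefix is empty.
    return f"{value:.1f} {prefixes[index]}{suffix}".replace("  ", " ")


# Presto/Trino-style decimal counts keep the 1000-based default:
print(humanize(904_000_000, " rows"))  # 904.0 M rows
# BigQuery-style byte counts pass binary prefixes and 1024:
print(humanize(354_000_000_000, "B", ("", "Ki", "Mi", "Gi", "Ti", "Pi"), 1024))  # 329.7 GiB
```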
Good point, being consistent with the BQ console definitely makes sense. I remember a discussion about this in the original PR where 1024 vs 1000 was debated: #8172 (comment). Even so, it would feel funny to have different units for BQ vs Presto/Trino. @betodealmeida thoughts?
Does anyone have any suggestions?
I think the ability to estimate bytes is important and the format is relatively unimportant, so I'm not strongly opinionated about how to handle it.
I think it would be better to refactor Trino and Presto to KiB notation, but if there are reasons not to, I'll switch the BigQuery implementation to KB notation.
I think we can change the humanize function in Presto (it's also duplicated in Trino) to return bytes in 1024 increments. The only reason I did 1000 is because it's also applied to row count and other parameters. This way it's consistent. Ideally we'd have a single function used by Presto, Trino, BigQuery and other engine specs.
(Note that it's also possible to overwrite the formatter function using QUERY_COST_FORMATTERS_BY_ENGINE in the config. We used that at Lyft to show the query cost in dollars, estimated run time, and carbon footprint.)
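For reference, a hypothetical superset_config.py override along those lines might look like the sketch below. The $5-per-TiB on-demand rate and the formatter name are assumptions for illustration — check current BigQuery pricing, and note that the exact callable signature expected by QUERY_COST_FORMATTERS_BY_ENGINE should be verified against the Superset version in use.

```python
from typing import Any, Dict, List

# Assumed on-demand price; BigQuery pricing changes, so verify before use.
USD_PER_TIB = 5.0


def bigquery_dollar_formatter(
    raw_cost: List[Dict[str, Any]]
) -> List[Dict[str, str]]:
    """Turn raw byte estimates into an approximate dollar figure."""
    formatted = []
    for row in raw_cost:
        raw_bytes = row.get("Total bytes processed", 0)
        dollars = raw_bytes / 1024 ** 4 * USD_PER_TIB
        formatted.append({"Estimated cost": f"${dollars:,.2f}"})
    return formatted


# Engine name -> formatter, as consumed from the Superset config.
QUERY_COST_FORMATTERS_BY_ENGINE = {"bigquery": bigquery_dollar_formatter}
```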
Codecov Report
```
@@            Coverage Diff             @@
##           master   #18694      +/-   ##
==========================================
+ Coverage   66.28%   66.39%   +0.11%
==========================================
  Files        1605     1619      +14
  Lines       62863    63012     +149
  Branches     6341     6341
==========================================
+ Hits        41666    41835     +169
+ Misses      19545    19525      -20
  Partials     1652     1652
```
superset/db_engine_specs/bigquery.py
```python
"Could not import libraries `google.cloud` or `google.oauth2`, "
"which are required to be installed in your environment in order "
"to estimate cost"
```
Curious, wouldn't these be necessarily installed if the user has a BigQuery database connected?
That's right, we can simply use import here. I'll fix this!
```python
return [str(s).strip(" ;") for s in sqlparse.parse(sql)]


@classmethod
def _humanize(cls, value: Any, suffix: str, category: Optional[str] = None) -> str:
```
If an input like (1000, "", "dollars") could produce an output like "$1,000", there would be room for more categories.
```diff
  "Output count": "904 M rows",
- "Output size": "354 GB",
+ "Output size": "329 GiB",
```
These two values represent the same byte count, just in different units. I took care not to change any of the other outputs.
@betodealmeida Hello, I've fixed the points you pointed out. Could you please review?
@betodealmeida @villebro It's been about 3 weeks since my last update and I'm still waiting for a review. I'd love some feedback on whether something is wrong with my PR, or whether the reviewers are just busy.
I'm assuming that since @kzosabe hasn't responded, we missed the boat on that.
SUMMARY
The ability to know in advance how many bytes a query will process is important when using BigQuery.
This PR adds query cost estimation to the BigQuery integration, matching what is already available for other databases such as Postgres and Presto.
BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF
before: [screenshot]
after: [screenshot]
TESTING INSTRUCTIONS
- Enable the ESTIMATE_QUERY_COST feature flag.
- In the edit dialog of the BigQuery database connection, check ADVANCED > SQL Lab > Enable query cost estimation.

ADDITIONAL INFORMATION