feat(bigquery): add support for query cost estimate #18694

kzosabe wants to merge 8 commits into apache:master from kzosabe:feat/bigquery-cost-estimate
Conversation
```python
@classmethod
def estimate_statement_cost(
    cls, statement: str, cursor: Any, engine: Engine
```
The only way to estimate the cost in advance in BigQuery is to run the query with dry_run, and since that isn't possible with only the cursor, I added engine as an argument.
Another option would be to work with bigquery.Client directly by configuring SQLAlchemy to pass the dry_run parameter when creating the connection, but that seems more complicated...
```python
@classmethod
def query_cost_formatter(
    cls, raw_cost: List[Dict[str, Any]]
) -> List[Dict[str, str]]:
    def format_bytes_str(raw_bytes: int) -> str:
        if not isinstance(raw_bytes, int):
            return str(raw_bytes)
        units = ["B", "KiB", "MiB", "GiB", "TiB", "PiB"]
        index = 0
        bytes = float(raw_bytes)
        while bytes >= 1024 and index < len(units) - 1:
            bytes /= 1024
            index += 1
        return "{:.1f}".format(bytes) + f" {units[index]}"

    return [
        {
            k: format_bytes_str(v) if k == "Total bytes processed" else str(v)
            for k, v in row.items()
        }
        for row in raw_cost
    ]
```
It seems this logic overlaps with the humanize functions in the query_cost_formatter methods in TrinoEngineSpec and PrestoEngineSpec. I wonder if we should move humanize to BaseEngineSpec so we could remove the duplication?
Thanks for the review!

> so we could remove the duplication?

It may well be better to go DRY, but there are a few things to consider.
The intent of this implementation was to be consistent with the official UI provided by BigQuery, both in using KiB notation and in rounding to one decimal place.

In particular, the current Presto and Trino implementations divide by 1000 instead of 1024, which is a problem: there would be a small difference between the number of predicted bytes shown in BigQuery and the number shown in Superset.
I would like to avoid using the current humanize implementation as is, because that would confuse users.
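To make the discrepancy concrete, here is a minimal sketch (not Superset code) comparing 1000-based and 1024-based formatting of the same raw byte count — the pair of values that also shows up in the test diff as "354 GB" vs "329 GiB":

```python
def humanize_1000(raw_bytes: int) -> str:
    """1000-based formatting, as Presto/Trino currently do."""
    units = ["B", "KB", "MB", "GB", "TB", "PB"]
    value, index = float(raw_bytes), 0
    while value >= 1000 and index < len(units) - 1:
        value /= 1000
        index += 1
    return f"{value:.1f} {units[index]}"


def humanize_1024(raw_bytes: int) -> str:
    """1024-based formatting, matching the BigQuery console."""
    units = ["B", "KiB", "MiB", "GiB", "TiB", "PiB"]
    value, index = float(raw_bytes), 0
    while value >= 1024 and index < len(units) - 1:
        value /= 1024
        index += 1
    return f"{value:.1f} {units[index]}"


# The same dry-run estimate renders differently under each convention:
print(humanize_1000(354_000_000_000))  # 354.0 GB
print(humanize_1024(354_000_000_000))  # 329.7 GiB
```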
I think there are several possible patterns:

a. Based on the humanize implementation, add methods that take prefixes and to_next_prefix as parameters. This allows a common implementation with identical results, but is somewhat complex to implement.
b. Provide two methods, humanize_number and humanize_bytes. The byte-count display in Trino and Presto would change slightly.
c. Keep a separate implementation (or share one only between Trino and Presto).

Which do you think is best? Any of them is OK with me, and I'll give it a try.
However, it might be better to do this in a separate PR.
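Option (a) could look something like the following sketch (the name and signature are hypothetical, not the actual Superset API):

```python
from typing import Sequence


def humanize(
    value: float,
    suffix: str = "",
    prefixes: Sequence[str] = ("", "K", "M", "G", "T", "P"),
    to_next_prefix: float = 1000.0,
) -> str:
    """Scale value down by to_next_prefix until it fits the current prefix."""
    index = 0
    value = float(value)
    while value >= to_next_prefix and index < len(prefixes) - 1:
        value /= to_next_prefix
        index += 1
    # Collapse the double space that appears when the prefix is empty.
    return f"{value:.1f} {prefixes[index]}{suffix}".replace("  ", " ")


# Presto/Trino-style decimal counts keep the 1000-based default:
print(humanize(904_000_000, " rows"))  # 904.0 M rows
# BigQuery-style byte counts pass binary prefixes and 1024:
print(humanize(354_000_000_000, "B", ("", "Ki", "Mi", "Gi", "Ti", "Pi"), 1024))  # 329.7 GiB
```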
Good point, being consistent with the BQ console definitely makes sense. I remember a discussion about this in the original PR where 1024 vs 1000 was debated: #8172 (comment). Even so, it would feel funny to have different units for BQ vs Presto/Trino. @betodealmeida thoughts?
Does anyone have any suggestions?
I think the ability to estimate bytes is important and the format is relatively unimportant, so I'm not strongly opinionated about how to handle it.
I think it would be better to refactor Trino and Presto to KiB notation, but if there are reasons not to, I'll switch the BigQuery implementation to KB notation.
I think we can change the humanize function in Presto (it's also duplicated in Trino) to return bytes in 1024 increments. The only reason I did 1000 is because it's also applied to row count and other parameters. This way it's consistent. Ideally we'd have a single function used by Presto, Trino, BigQuery and other engine specs.
(Note that it's also possible to overwrite the formatter function using QUERY_COST_FORMATTERS_BY_ENGINE in the config. We used that at Lyft to show the query cost in dollars, estimated run time, and carbon footprint.)
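For reference, a hypothetical superset_config.py override along those lines might look like the sketch below. The $5-per-TiB on-demand rate and the formatter name are assumptions for illustration — check current BigQuery pricing, and note that the exact callable signature expected by QUERY_COST_FORMATTERS_BY_ENGINE should be verified against the Superset version in use.

```python
from typing import Any, Dict, List

# Assumed on-demand price; BigQuery pricing changes, so verify before use.
USD_PER_TIB = 5.0


def bigquery_dollar_formatter(
    raw_cost: List[Dict[str, Any]]
) -> List[Dict[str, str]]:
    """Turn raw byte estimates into an approximate dollar figure."""
    formatted = []
    for row in raw_cost:
        raw_bytes = row.get("Total bytes processed", 0)
        dollars = raw_bytes / 1024 ** 4 * USD_PER_TIB
        formatted.append({"Estimated cost": f"${dollars:,.2f}"})
    return formatted


# Engine name -> formatter, as consumed from the Superset config.
QUERY_COST_FORMATTERS_BY_ENGINE = {"bigquery": bigquery_dollar_formatter}
```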
Codecov Report
```
@@            Coverage Diff             @@
##           master   #18694      +/-   ##
==========================================
+ Coverage   66.28%   66.39%   +0.11%
==========================================
  Files        1605     1619      +14
  Lines       62863    63012     +149
  Branches     6341     6341
==========================================
+ Hits        41666    41835     +169
+ Misses      19545    19525      -20
  Partials     1652     1652
```
superset/db_engine_specs/bigquery.py
```python
"Could not import libraries `google.cloud` or `google.oauth2`, "
"which are required to be installed in your environment in order "
"to estimate cost"
```
Curious, wouldn't these be necessarily installed if the user has a BigQuery database connected?
That's right, we can simply use import here. I'll fix this!
```python
return [str(s).strip(" ;") for s in sqlparse.parse(sql)]


@classmethod
def _humanize(cls, value: Any, suffix: str, category: Optional[str] = None) -> str:
```
If an input like (1000, "", "dollars") could produce an output like "$1,000", there would be room for more categories.
```diff
  "Output count": "904 M rows",
- "Output size": "354 GB",
+ "Output size": "329 GiB",
```
These two values represent the same byte count, just in different units. I took care not to change any of the other outputs.
@betodealmeida Hello, I've fixed the points you pointed out. Could you please review?
@betodealmeida @villebro It's been about 3 weeks since my last update and I'm still waiting for a review. I'd love some feedback on whether something is wrong with my PR, or whether the reviewers are just busy.
I'm assuming that since @kzosabe hasn't responded, we missed the boat on that.
SUMMARY
The ability to know in advance how many bytes a query will process is important when using BigQuery.
This PR adds query cost estimation to the BigQuery integration, matching what is already available for other databases such as Postgres and Presto.
BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF
before: [screenshot]
after: [screenshot]
TESTING INSTRUCTIONS
- Enable the ESTIMATE_QUERY_COST feature flag.
- In the edit dialog of the BigQuery database connection, check ADVANCED > SQL Lab > Enable query cost estimation.

ADDITIONAL INFORMATION