feat: enable ETag header for dashboard GET requests by graceguo-supercat · Pull Request #10963 · apache/superset

graceguo-supercat · 2020-09-18T21:39:08Z

SUMMARY

The ETag header is widely used as an efficient server-side caching mechanism. Superset already has ETag support for explore_json; this PR expands the coverage to dashboard GET requests.

When a dashboard request comes in, Superset needs to gather all the datasource metadata and all the slice query parameters used by the dashboard, and return them as one big blob of data. At Airbnb, we have seen large dashboards with 100+ datasources and 300+ slices. This server-side processing can take 4 seconds, which makes the whole dashboard load very slow.

ETag could be a good solution for large dashboards that change infrequently.

TEST PLAN

Open a regular dashboard page; the browser dev tools show a 200 response.
Reload the same dashboard in the browser; you will see a 304 response, and the response time is very fast (~50 ms).
Change the dashboard layout, or modify one of its charts, from another browser window.
Reload the same dashboard in the browser; you will see a 200 response again, since the cache is stale.

For example, a regular GET request takes about 4 seconds. With the ETag header enabled, the second request gets a 304 and responds faster, since there is no server-side processing.
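For readers unfamiliar with the 200/304 handshake described in the test plan, here is a minimal sketch of the header exchange. It is not Superset's implementation: `make_etag` and `respond` are hypothetical helpers, and the sketch still builds the body on every call, whereas the saving in this PR comes from serving the response from a server-side cache.

```python
import hashlib
from typing import Dict, Optional, Tuple


def make_etag(body: bytes) -> str:
    """Derive an ETag from the response body (quoted, per HTTP convention)."""
    return '"' + hashlib.sha256(body).hexdigest() + '"'


def respond(
    body: bytes, if_none_match: Optional[str]
) -> Tuple[int, Dict[str, str], bytes]:
    """Return (status, headers, payload) for a GET with an optional If-None-Match."""
    etag = make_etag(body)
    headers = {"ETag": etag}
    if if_none_match == etag:
        # The client's copy is still current: send headers only.
        return 304, headers, b""
    return 200, headers, body


payload = b'{"dashboard": "bootstrap data"}'

# First load: the client has no ETag yet.
status1, headers1, _ = respond(payload, None)
# Reload: the client sends back the ETag it received.
status2, _, body2 = respond(payload, headers1["ETag"])
# The dashboard changed in the meantime, so the old ETag no longer matches.
status3, _, _ = respond(payload + b" ", headers1["ETag"])
```

The three calls mirror the test plan: first load, reload, and reload after a change.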

cc @betodealmeida @etr2460

codecov-commenter · 2020-09-21T07:19:35Z

Codecov Report

Merging #10963 into master will decrease coverage by 0.70%.
The diff coverage is 71.42%.

@@            Coverage Diff             @@
##           master   #10963      +/-   ##
==========================================
- Coverage   61.46%   60.76%   -0.71%     
==========================================
  Files         382      382              
  Lines       24139    24154      +15     
==========================================
- Hits        14836    14676     -160     
- Misses       9303     9478     +175

Flag	Coverage Δ
#python	`60.76% <71.42%> (-0.71%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files	Coverage Δ
superset/utils/decorators.py	`53.52% <56.25%> (-1.48%)`	⬇️
superset/views/utils.py	`83.07% <76.47%> (-1.24%)`	⬇️
superset/views/core.py	`74.46% <83.33%> (-0.01%)`	⬇️
superset/db_engines/hive.py	`0.00% <0.00%> (-85.72%)`	⬇️
superset/db_engine_specs/hive.py	`53.90% <0.00%> (-30.08%)`	⬇️
superset/db_engine_specs/presto.py	`70.85% <0.00%> (-11.44%)`	⬇️
superset/db_engine_specs/sqlite.py	`65.62% <0.00%> (-9.38%)`	⬇️
superset/utils/celery.py	`82.14% <0.00%> (-3.58%)`	⬇️
superset/examples/world_bank.py	`97.10% <0.00%> (-2.90%)`	⬇️
superset/examples/birth_names.py	`97.36% <0.00%> (-2.64%)`	⬇️
... and 9 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update dc893fe...bcf90b9. Read the comment docs.

ktmud · 2020-09-23T18:10:57Z

superset/utils/decorators.py


+            # if it is dashboard request but feature is not eabled,
+            # do not use cache
+            is_dashboard = is_dashboard_request(kwargs)


Instead of adding a helper function, maybe we can just expand dashboard_id_or_slug in wrapper and make this code more transparent?

    def wrapper(*args: Any, dashboard_id_or_slug: str=None, **kwargs: Any) -> ETagResponseMixin:
        # ...
        is_dashboard = dashboard_id_or_slug is not None

Better yet, we should probably aim for keeping all dashboard specific logics out of etag_cache so it stays generic. Maybe add a skip= parameter that runs the feature flag check.

@etag_cache(skip=lambda: is_feature_enabled("ENABLE_DASHBOARD_ETAG_HEADER"))

    def etag_cache(
        max_age: int,
        check_perms: Callable[..., Any],
        skip: Optional[Callable[..., Any]] = None,
    ) -> Callable[..., Any]:
        def decorator(f: Callable[..., Any]) -> Callable[..., Any]:
            def wrapper(*args: Any, **kwargs: Any) -> ETagResponseMixin:
                check_perms(*args, **kwargs)
                if request.method == "POST" or (skip and skip(*args, **kwargs)):
                    return f(*args, **kwargs)

Thanks! After introducing this skip parameter, dashboard_id is no longer needed in the decorator function.
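The `skip=` idea from this thread can be sketched in isolation as follows. This is only an illustration under stated assumptions: `cached`, `FEATURE_FLAGS`, and `build_dashboard` are hypothetical stand-ins, the cache is a plain dict, and the `skip` callable accepts `*args, **kwargs` because the wrapper forwards the decorated function's arguments to it. The lambda returns True (skip the cache) when the feature flag is off.

```python
from functools import wraps
from typing import Any, Callable, Dict, List, Optional, Tuple

# Hypothetical stand-in for a feature flag registry.
FEATURE_FLAGS: Dict[str, bool] = {"ENABLE_DASHBOARD_ETAG_HEADER": False}


def is_feature_enabled(name: str) -> bool:
    return FEATURE_FLAGS.get(name, False)


def cached(skip: Optional[Callable[..., bool]] = None) -> Callable[..., Any]:
    """Generic caching decorator; `skip` decides per call whether to bypass it."""
    store: Dict[Tuple[Any, ...], Any] = {}

    def decorator(f: Callable[..., Any]) -> Callable[..., Any]:
        @wraps(f)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            if skip is not None and skip(*args, **kwargs):
                # Bypass the cache entirely; nothing is read or stored.
                return f(*args, **kwargs)
            key = (args, tuple(sorted(kwargs.items())))
            if key not in store:
                store[key] = f(*args, **kwargs)
            return store[key]

        return wrapper

    return decorator


calls: List[str] = []


@cached(skip=lambda *args, **kwargs: not is_feature_enabled("ENABLE_DASHBOARD_ETAG_HEADER"))
def build_dashboard(dashboard_id_or_slug: str) -> str:
    calls.append(dashboard_id_or_slug)
    return f"payload for {dashboard_id_or_slug}"


build_dashboard("sales")  # flag off: function runs
build_dashboard("sales")  # flag off: function runs again
FEATURE_FLAGS["ENABLE_DASHBOARD_ETAG_HEADER"] = True
build_dashboard("sales")  # flag on, cache miss: function runs and result is stored
build_dashboard("sales")  # flag on, cache hit: function does not run
```

Keeping the flag check in the caller-supplied `skip` callable leaves the decorator free of dashboard-specific logic, which is the point the reviewer raises.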

ktmud · 2020-09-23T18:58:21Z

superset/utils/decorators.py

+                        tzinfo=timezone.utc
+                    ).astimezone(tz=None)
+                    if latest_changed_on.timestamp() > latest_record.timestamp():
+                        response = None


Can probably rename check_latest_changed_on to get_latest_changed_on and do

    if get_latest_changed_on:
        latest_changed_on = get_latest_changed_on(*args, **kwargs)
        if response and response.last_modified and response.last_modified < latest_changed_on:
            response = None
    else:
        latest_changed_on = datetime.utcnow()

bkyryliuk · 2020-09-23T21:17:38Z

@graceguo-supercat can you please describe what effect this change will have on:

permission checks, e.g., if a user gains access, will they still see "permission denied" if that response is cached?
how it will interact with the chart cache
how it will interact with datasource / dashboard changes

graceguo-supercat · 2020-09-24T00:38:11Z

The etag_cache decorator works this way:

explore_json flow:

An HTTP request comes in.
check_slice_perms runs. Whether or not a cached response exists, if the user has no permission, respond with an error.
If the method is POST, run the query (do not use the cache). Otherwise, check whether this request has a cached response.
If there is no cached response, run the query and create a cache key for the response.
Send the response to the client with the ETag header, the last_modified header, and the expiration time header. The last_modified time is now (the response time).

dashboard flow:

An HTTP request comes in.
check_dashboard_perms runs. Whether or not a cached response exists, if the user has no permission, respond with an error.
If the method is POST, or the feature is not enabled, run the dashboard function to build the response (do not use the cache). Otherwise, check whether this request has a cached response.
If there is no cached response, run the dashboard function.
If there is a cached response, compare the cached time with the dashboard's last modified time: either the dashboard's metadata was changed (the dashboard's changed_on), or one of its slices was changed (the slice's changed_on). If the cache is stale, run the dashboard function.
Send the response to the client with the ETag header, the last_modified header, and the expiration time header. The last_modified time is the max of the dashboard's changed_on and its slices' changed_on.
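The dashboard flow above can be sketched roughly as below. This is a simplified model, not the decorator from the PR: `handle_get` and `CachedResponse` are hypothetical names, the cache is a dict, and the POST and feature-flag branches are left out. When no `get_last_modified` callback is given, the sketch falls back to "now", as in the explore_json flow.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Callable, Dict, List, Optional


@dataclass
class CachedResponse:
    body: str
    last_modified: datetime


def handle_get(
    key: str,
    cache: Dict[str, CachedResponse],
    check_perms: Callable[[str], None],
    build: Callable[[str], str],
    get_last_modified: Optional[Callable[[str], datetime]] = None,
) -> CachedResponse:
    # Permissions are checked first, whether or not a cached response exists.
    check_perms(key)
    response = cache.get(key)
    if get_last_modified is not None:
        # Dashboard case: a cached entry older than the latest change is stale.
        content_changed = get_last_modified(key)
        if response is not None and response.last_modified < content_changed:
            response = None
    else:
        # explore_json case: no entity timestamp, so use the response time.
        content_changed = datetime.now(timezone.utc)
    if response is None:
        response = CachedResponse(build(key), content_changed)
        cache[key] = response
    return response


changed_on = {"sales": datetime(2020, 9, 1, tzinfo=timezone.utc)}
builds: List[str] = []


def build(key: str) -> str:
    builds.append(key)
    return f"bootstrap data for {key}"


def allow_all(key: str) -> None:
    pass


cache: Dict[str, CachedResponse] = {}
first = handle_get("sales", cache, allow_all, build, changed_on.__getitem__)
second = handle_get("sales", cache, allow_all, build, changed_on.__getitem__)
changed_on["sales"] += timedelta(days=1)  # simulate an edit to the dashboard
third = handle_get("sales", cache, allow_all, build, changed_on.__getitem__)
```

The second call is served from the cache; the third rebuilds because the simulated edit moved `changed_on` past the cached `last_modified`.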

bkyryliuk · 2020-09-24T01:21:04Z

response to client-side, with eTag header, last_modified header and expiration time header. last_modified time is max of (dashboard' changed_on and its slices change

great thanks, it looks like it addresses question #1, what about 2 and 3?

e.g. for #2:

if a chart was cached for ~10 hours and the ETag cached the dashboard, does it mean that the chart on this dashboard will be cached for 10 hours + the ETag expiration time?
for #3:
how will it affect dashboard / datasource changes like annotations, changes to the default filters, CSS rules, etc.? Would those changes be cached for the user as well?

Sorry if those questions do not make sense or are obvious. I am just trying to understand the flow here

graceguo-supercat · 2020-09-24T01:42:55Z

Note:
For explore_json requests, we want to cache query results.

For dashboard requests, we want to cache the dashboard bootstrap data, which includes datasource metadata, slice parameters, etc. The dashboard front-end JS uses datasource metadata, slice parameters, and dashboard filters to build queries and fetch query results; the results themselves are not in the dashboard bootstrap data.

This PR focuses only on the dashboard's cache staleness logic. I do not want to change the current slice cache behavior.

#2:
If a chart was not modified for ~10 hours, I assume this means the slice entity (its query parameters, the datasource used) was not modified; it doesn't mean the query results were not modified. Since query results are not part of the dashboard bootstrap data, they are not included in the dashboard cache either.

#3:
annotations: a dashboard has no annotations. If a slice's annotation is changed, the slice entity's params and changed_on attributes change.
changes to the default filters, CSS rules, etc.: these changes are stored in the dashboard metadata and update the dashboard's changed_on attribute.
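Since both dashboard-level and slice-level edits bump a `changed_on` attribute, the dashboard's effective last-modified time is the newest of those values. A minimal sketch, using hypothetical `Dashboard` and `Slice` dataclasses in place of the real models:

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List


@dataclass
class Slice:
    changed_on: datetime


@dataclass
class Dashboard:
    changed_on: datetime
    slices: List[Slice] = field(default_factory=list)


def latest_changed_on(dashboard: Dashboard) -> datetime:
    """Newest changed_on across the dashboard and all of its slices."""
    return max([dashboard.changed_on] + [s.changed_on for s in dashboard.slices])


dash = Dashboard(
    changed_on=datetime(2020, 9, 20, 12, 0),
    slices=[
        Slice(changed_on=datetime(2020, 9, 18, 9, 0)),
        Slice(changed_on=datetime(2020, 9, 24, 8, 30)),  # e.g. an annotation edit
    ],
)
newest = latest_changed_on(dash)
```

Here the second slice is the most recently edited entity, so its timestamp becomes the value compared against the cached response.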

bkyryliuk · 2020-09-24T16:17:52Z

superset/views/utils.py

+
+def get_dashboard_latest_changed_on(_self: Any, dashboard_id_or_slug: str) -> datetime:
+    """
+    Get latest changed datetime for a dashboard. The change could be dashboard


s/Get latest changed datetime for a dashboard.
/Get latest changed datetime for a dashboard and its charts

Yes, I renamed it to get_dashboard_latest_changedon_dt.

can we get more specific with _self type ?

I don't know what type to use here. Do you have a suggestion? (Sorry, I am not an expert in Python.)

bkyryliuk · 2020-09-24T16:21:36Z

superset/utils/decorators.py



+def is_dashboard_request(kwargs: Any) -> bool:
+    return kwargs.get("dashboard_id_or_slug") is not None


doesn't seem robust, is it possible to validate via the URI path or just pass a param?

After introducing skip, dashboard_id_or_slug is no longer needed in the decorator function, so this check has been removed.

bkyryliuk · 2020-09-24T16:24:29Z

superset/utils/decorators.py

+                latest_changed_on = check_latest_changed_on(*args, **kwargs)
+                if response and response.last_modified:
+                    latest_record = response.last_modified.replace(
+                        tzinfo=timezone.utc


This assumes that the Superset server runs in the UTC time zone; it may be safer to make it a Superset config variable.

This conversion, .replace(tzinfo), is not necessary; removed.
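The concern in this thread comes down to comparing naive and timezone-aware datetimes. The sketch below shows one way to normalize before comparing; `as_utc` and `is_stale` are hypothetical helpers, and treating naive values as UTC is an assumption of this example, not something the PR confirms.

```python
from datetime import datetime, timezone


def as_utc(dt: datetime) -> datetime:
    """Treat naive datetimes as UTC (an assumption); convert aware ones to UTC."""
    if dt.tzinfo is None:
        return dt.replace(tzinfo=timezone.utc)
    return dt.astimezone(timezone.utc)


def is_stale(cached_last_modified: datetime, latest_changed_on: datetime) -> bool:
    """True when the content changed after the cached response was stored."""
    return as_utc(cached_last_modified) < as_utc(latest_changed_on)


naive = datetime(2020, 9, 24, 16, 0)                       # no tzinfo
aware = datetime(2020, 9, 24, 17, 0, tzinfo=timezone.utc)  # explicit UTC

# Ordering a naive datetime against an aware one raises TypeError in Python,
# so both sides are normalized first.
stale = is_stale(naive, aware)
```

If both timestamps are already naive and come from the same clock, as the author's reply suggests, no conversion is needed and a direct comparison works.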

bkyryliuk · 2020-09-24T16:26:11Z

superset/views/utils.py

+    return dashboard
+
+
+def get_datasources_from_dashboard(


looks like a good candidate for the Dashboard class method

This is a little confusing: the Dashboard class has a function to get datasources, so I use it in the check_dashboard_perms function. But this function groups slices by datasource, and the result is used by another feature:
https://github.com/apache/incubator-superset/blob/ba009b7c09d49f2932fd10269882c901bc020c1d/superset/views/core.py#L1626
Instead of datasource.data, datasource.data_for_slices(slices) can reduce the initial dashboard data load size.

So for now I removed this helper function from utils and build the dict in the dashboard function. I also renamed datasource to slices_by_datasources for clarity.
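Grouping slices by datasource, as described in the reply above, can be sketched as follows. The `Datasource` and `Slice` dataclasses and the `group_slices_by_datasource` helper are hypothetical stand-ins; only the `slices_by_datasources` name comes from the thread.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import DefaultDict, Iterable, List


@dataclass(frozen=True)
class Datasource:
    name: str


@dataclass(frozen=True)
class Slice:
    slice_name: str
    datasource: Datasource


def group_slices_by_datasource(
    slices: Iterable[Slice],
) -> DefaultDict[Datasource, List[Slice]]:
    """Map each datasource to the slices on the dashboard that use it."""
    grouped: DefaultDict[Datasource, List[Slice]] = defaultdict(list)
    for slc in slices:
        grouped[slc.datasource].append(slc)
    return grouped


births = Datasource("birth_names")
bank = Datasource("wb_health_population")
slices_by_datasources = group_slices_by_datasource(
    [Slice("Girls", births), Slice("Boys", births), Slice("Growth Rate", bank)]
)
```

With this mapping, each datasource can be serialized with only the fields its own slices need, which is the load-size saving the reply attributes to `data_for_slices(slices)`.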

bkyryliuk · 2020-09-24T16:27:04Z

superset/views/utils.py

+    return datasources
+
+
+def get_dashboard_latest_changed_on(_self: Any, dashboard_id_or_slug: str) -> datetime:


what is _self here? ideally we should avoid Any types

Please see the other functions used by the decorator:
This function takes `self` since it must have the same signature as the decorated method.
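To illustrate why the hook takes a `_self` argument: when a decorator wraps a method and forwards `*args` to a module-level hook, the first positional argument is the view instance. The sketch below uses hypothetical names (`with_check`, `DashboardView`) and is not the decorator from the PR.

```python
from functools import wraps
from typing import Any, Callable, List


def with_check(check_perms: Callable[..., None]) -> Callable[..., Any]:
    """Call `check_perms` with the same arguments as the wrapped method."""

    def decorator(f: Callable[..., Any]) -> Callable[..., Any]:
        @wraps(f)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            # For a method, args[0] is the instance, so the hook receives it too.
            check_perms(*args, **kwargs)
            return f(*args, **kwargs)

        return wrapper

    return decorator


seen: List[str] = []


def check_dashboard_perms(_self: Any, dashboard_id_or_slug: str) -> None:
    # `_self` is unused; it exists only so the signature matches the method's.
    seen.append(dashboard_id_or_slug)


class DashboardView:
    @with_check(check_dashboard_perms)
    def dashboard(self, dashboard_id_or_slug: str) -> str:
        return f"dashboard {dashboard_id_or_slug}"


result = DashboardView().dashboard("sales")
```

A narrower annotation than `Any` would be the view class itself, though that may require a forward reference or risk a circular import, depending on where the hook lives.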

bkyryliuk · 2020-09-24T16:27:46Z

superset/views/utils.py

        viz_obj.raise_for_access()


+def check_dashboard_perms(_self: Any, dashboard_id_or_slug: str) -> None:


best practice is to have a unit test for every function, it would be great if you could add some

Yes, I agree. But this function was refactored out of the dashboard function, and it is tested in
https://github.com/apache/incubator-superset/blob/448a41a4e7563cafadea1e03feb5980151e8b56d/tests/security_tests.py#L665
I assume that the old unit tests still passing is good enough.

a good practice is to incrementally improve the state of the code, however it will be your call here

bkyryliuk · 2020-09-24T16:28:13Z

Big thanks for the explanation!

bkyryliuk

LG%nits

bkyryliuk · 2020-09-28T18:41:01Z

superset/utils/decorators.py

add a comment here why content_changed_time is set to now()

We use this content_changed_time as the cache's last_modified time.
For a dashboard, content_changed_time is the dashboard entity's latest update time (metadata changed time, chart metadata changed time, etc.). This value comes from a callback function.
For explore_json, the cache holds query results and there is no entity-level modified time to use, so we use the request time (now) as the cache's last_modified time.

I know generalizing too soon is not a good practice, but I wonder if we should pass a callable called is_stale here. It would simplify the decorator logic, and since it would be defined closer to the dashboard it might simplify the logic there as well.

At first I thought is_stale was a good idea, but when I started refactoring I found that the decorator function needs the last_modified time (to set the header), so is_stale (a boolean only) is not enough.
So instead of having is_stale return true or false, I prefer to keep get_last_modified and use the last_modified time to invalidate the cache.

bkyryliuk · 2020-09-28T18:44:47Z

superset/views/utils.py

+
+def get_dashboard_latest_changed_on(_self: Any, dashboard_id_or_slug: str) -> datetime:
+    """
+    Get latest changed datetime for a dashboard. The change could be dashboard


can we get more specific with _self type ?

betodealmeida

This looks great, @graceguo-supercat! I left a small comment on how to possibly simplify the code a little bit, but I'm not 100% sure it would help.

betodealmeida · 2020-09-28T23:51:33Z

superset/utils/decorators.py

I know generalizing too soon is not a good practice, but I wonder if we should pass a callable called is_stale here. It would simplify the decorator logic, and since it would be defined closer to the dashboard it might simplify the logic there as well.

ktmud

Sorry, found a couple of more nits, both are optional

ktmud · 2020-09-29T00:53:25Z

superset/utils/decorators.py

Can we rename get_latest_changed_on to get_last_modified just to be more consistent with the response attribute? Imagine in future refactor response.last_modified is renamed to something else, you would know this function is definitely related by searching for last_modified.

agree. rename it to get_last_modified.

ktmud · 2020-09-29T01:02:44Z

superset/views/core.py

Nit: it's a little weird to have both changed_on and changedon, but up to you.


pull-request-size bot added the size/L label Sep 18, 2020

graceguo-supercat changed the title ~~[WIP]feat: enable eTag header for dashboard~~ [WIP]feat: enable eTag header for dashboard page load Sep 18, 2020

graceguo-supercat changed the title ~~[WIP]feat: enable eTag header for dashboard page load~~ [WIP]feat: enable ETag header for dashboard page load Sep 19, 2020

graceguo-supercat force-pushed the gg-DashboardETag branch from eac7940 to bd76db1 Compare September 21, 2020 07:11

graceguo-supercat force-pushed the gg-DashboardETag branch 7 times, most recently from b879c96 to caf3745 Compare September 22, 2020 01:31

feat: add etag for dashboard load requests

d2afe98

graceguo-supercat force-pushed the gg-DashboardETag branch from caf3745 to d2afe98 Compare September 22, 2020 23:12

graceguo-supercat changed the title ~~[WIP]feat: enable ETag header for dashboard page load~~ feat: enable ETag header for dashboard page load Sep 23, 2020

graceguo-supercat changed the title ~~feat: enable ETag header for dashboard page load~~ feat: enable ETag header for dashboard GET requests Sep 23, 2020

mistercrunch requested a review from betodealmeida September 23, 2020 05:43

ktmud reviewed Sep 23, 2020

View reviewed changes

bkyryliuk reviewed Sep 24, 2020

View reviewed changes

graceguo-supercat force-pushed the gg-DashboardETag branch 3 times, most recently from eb3956a to 1c39347 Compare September 25, 2020 22:15

graceguo-supercat requested review from bkyryliuk and ktmud September 25, 2020 23:00

graceguo-supercat force-pushed the gg-DashboardETag branch 2 times, most recently from fda20c7 to c69afd0 Compare September 28, 2020 08:25

bkyryliuk approved these changes Sep 28, 2020

View reviewed changes

betodealmeida approved these changes Sep 28, 2020

View reviewed changes

ktmud reviewed Sep 29, 2020

View reviewed changes

fix review comments

18b31f8

graceguo-supercat force-pushed the gg-DashboardETag branch from c69afd0 to 18b31f8 Compare September 29, 2020 07:01

graceguo-supercat merged commit 6633409 into apache:master Sep 29, 2020

ktmud mentioned this pull request Oct 2, 2020

fix: enable consistent etag across workers and force no-cache for dashboards #11137

Merged

6 tasks

graceguo-supercat pushed a commit to graceguo-supercat/superset that referenced this pull request Oct 8, 2020

revert apache#10963

f98e4f9

graceguo-supercat mentioned this pull request Oct 8, 2020

fix: revert eTag cache feature for dashboard #11203

Merged

1 task

graceguo-supercat pushed a commit that referenced this pull request Oct 8, 2020

fix: revert eTag cache feature for dashboard (#11203)

a10e86a

* revert #11137 * revert #10963

ktmud mentioned this pull request Oct 12, 2020

perf: cache dashboard bootstrap data #11234

Merged

6 tasks

auxten pushed a commit to auxten/incubator-superset that referenced this pull request Nov 20, 2020

feat: enable ETag header for dashboard GET requests (apache#10963)

e1592f3

* feat: add etag for dashboard load requests * fix review comments

auxten pushed a commit to auxten/incubator-superset that referenced this pull request Nov 20, 2020

fix: revert eTag cache feature for dashboard (apache#11203)

e6c0676

* revert apache#11137 * revert apache#10963

etr2460 mentioned this pull request Apr 26, 2021

feat: Add etag caching to dashboard APIs #14357

Merged

8 tasks

mistercrunch added the 🏷️ bot and 🚢 0.38.0 labels Mar 12, 2024



