feat: enable ETag header for dashboard GET requests#10963
feat: enable ETag header for dashboard GET requests#10963graceguo-supercat merged 2 commits intoapache:masterfrom
Conversation
eac7940 to
bd76db1
Compare
Codecov Report
@@ Coverage Diff @@
## master #10963 +/- ##
==========================================
- Coverage 61.46% 60.76% -0.71%
==========================================
Files 382 382
Lines 24139 24154 +15
==========================================
- Hits 14836 14676 -160
- Misses 9303 9478 +175
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report at Codecov.
|
b879c96 to
caf3745
Compare
caf3745 to
d2afe98
Compare
superset/utils/decorators.py
Outdated
|
|
||
| # if it is dashboard request but feature is not eabled, | ||
| # do not use cache | ||
| is_dashboard = is_dashboard_request(kwargs) |
There was a problem hiding this comment.
Instead of adding a helper function, maybe we can just expand dashboard_id_or_slug in wrapper and make this code more transparent?
def wrapper(*args: Any, dashboard_id_or_slug: str=None, **kwargs: Any) -> ETagResponseMixin:
# ...
is_dashboard = dashboard_id_or_slug is not NoneBetter yet, we should probably aim for keeping all dashboard specific logics out of etag_cache so it stays generic. Maybe add a skip= parameter that runs the feature flag check.
@etag_cache(skip=lambda: is_feature_enabled("ENABLE_DASHBOARD_ETAG_HEADER"))def etag_cache(
max_age: int,
check_perms: Callable[..., Any],
skip: Optional[Callable[..., Any]] = None,
) -> Callable[..., Any]:
def decorator(f: Callable[..., Any]) -> Callable[..., Any]:
def wrapper(*args: Any, **kwargs: Any) -> ETagResponseMixin:
check_perms(*args, **kwargs)
if request.method == "POST" or (skip and skip(*args, **kwargs)):
return f(*args, **kwargs)There was a problem hiding this comment.
thanks! after introduce this skip, dashboard_id is not needed in decorator function any more.
superset/utils/decorators.py
Outdated
| tzinfo=timezone.utc | ||
| ).astimezone(tz=None) | ||
| if latest_changed_on.timestamp() > latest_record.timestamp(): | ||
| response = None |
There was a problem hiding this comment.
Can probably rename check_latest_changed_on to get_latest_changed_on and do
if get_latest_changed_on:
latest_changed_on = get_latest_changed_on(*args, **kwargs)
if response and response.last_modified and response.last_modified < latest_changed_on:
response = None
else:
latest_changed_on = datetime.utcnow()|
@graceguo-supercat can you please describe what effect this change will have on
|
|
|
great thanks, it looks like it addresses question #1, what about 2 and 3? e.g. for #2
Sorry if those questions do not make sense or are obvious. I am just trying to understand the flow here |
|
Note: For dashboard requests, we want to cache dashboard bootstrap data, which includes datasource metadata, slices parameters, etc. Dashboard front-end js use datasource metadata, slice parameters, and dashboard filters to build query and fetch query results, the results itself are not in the dashboard bootstrap data. This PR will only focus on dashboard's cache stale logic. i do not want to change current slice cache behavior. #2 #3: |
|
|
||
| def get_dashboard_latest_changed_on(_self: Any, dashboard_id_or_slug: str) -> datetime: | ||
| """ | ||
| Get latest changed datetime for a dashboard. The change could be dashboard |
There was a problem hiding this comment.
s/Get latest changed datetime for a dashboard.
/Get latest changed datetime for a dashboard and it's charts
There was a problem hiding this comment.
yes. I rename it to get_dashboard_latest_changedon_dt.
There was a problem hiding this comment.
can we get more specific with _self type ?
There was a problem hiding this comment.
I don't know what type to use here. do you have suggestion? (Sorry i am not an expert in Python)
superset/utils/decorators.py
Outdated
|
|
||
|
|
||
| def is_dashboard_request(kwargs: Any) -> bool: | ||
| return kwargs.get("dashboard_id_or_slug") is not None |
There was a problem hiding this comment.
doesn't seem robust, it it possible to validate via uri path or just pass a param ?
There was a problem hiding this comment.
after introduce this skip, dashboard_id_or_slug is not needed in decorator function any more. this check is removed.
superset/utils/decorators.py
Outdated
| latest_changed_on = check_latest_changed_on(*args, **kwargs) | ||
| if response and response.last_modified: | ||
| latest_record = response.last_modified.replace( | ||
| tzinfo=timezone.utc |
There was a problem hiding this comment.
this assumes that superset server runs in utc zone, it may be safer to make it as a superset config variable
There was a problem hiding this comment.
this convert, .replace(tzinfo) is not necessary, removed.
superset/views/utils.py
Outdated
| return dashboard | ||
|
|
||
|
|
||
| def get_datasources_from_dashboard( |
There was a problem hiding this comment.
looks like a good candidate for the Dashboard class method
There was a problem hiding this comment.
this is a little confusing: Dashboard class has a get datasources function. So i use it in check_dashboard_perms function. But this function is to group slices by datasources, and the result will be used by another feature:
https://github.com/apache/incubator-superset/blob/ba009b7c09d49f2932fd10269882c901bc020c1d/superset/views/core.py#L1626
Instead of datasource.data, datasource.data_for_slices(slices) can reduce the initial dashboard data load size.
So right now i removed this helper function from utils, and build dict in the dashboard function. But i rename datasource to slices_by_datasources for clarification.
superset/views/utils.py
Outdated
| return datasources | ||
|
|
||
|
|
||
| def get_dashboard_latest_changed_on(_self: Any, dashboard_id_or_slug: str) -> datetime: |
There was a problem hiding this comment.
what is _self here? ideally we should avoid Any types
There was a problem hiding this comment.
Please see other functions that used by decorator:
This function takes `self` since it must have the same signature as the the decorated method.
| viz_obj.raise_for_access() | ||
|
|
||
|
|
||
| def check_dashboard_perms(_self: Any, dashboard_id_or_slug: str) -> None: |
There was a problem hiding this comment.
best practice is to have a unit test for every function, it would be great if you could add some
There was a problem hiding this comment.
Yes, i agree. but this function is refactored out from dashboard function. it is tested in
https://github.com/apache/incubator-superset/blob/448a41a4e7563cafadea1e03feb5980151e8b56d/tests/security_tests.py#L665
I assume the old unit tests didn't break will be good enough.
There was a problem hiding this comment.
a good practice is to incrementally improve the state of the code, however it will be your call here
Big thanks for the explanation! |
eb3956a to
1c39347
Compare
fda20c7 to
c69afd0
Compare
superset/utils/decorators.py
Outdated
There was a problem hiding this comment.
add a comment here why content_changed_time is set to now()
There was a problem hiding this comment.
We use this content_changed_time as cache's last_modified time.
for dashboard content_changed_time is dashboard entity's latest updated time (like metadata, chart metadata changed time etc). this data is from a callback function.
for explore_json, the cache is query results and there is no entity's latest modified time to use. so we use request time (now) as cache's last_modified time.
There was a problem hiding this comment.
I know generalizing too soon is not a good practice, but I wonder if we should pass a callable called is_stale here. It would simplify the decorator logic, and since it would be defined closer to the dashboard it might simplify the logic there as well.
There was a problem hiding this comment.
at first i thought is_stale is a good idea. but when start refactor it, i found last_modified time is needed by decorator function(to set header). So is_stale (only boolean value) is not enough.
So instead of using is_stale return true or false, I prefer to keep get_last_modified, and use last_modified time to invalid cache.
|
|
||
| def get_dashboard_latest_changed_on(_self: Any, dashboard_id_or_slug: str) -> datetime: | ||
| """ | ||
| Get latest changed datetime for a dashboard. The change could be dashboard |
There was a problem hiding this comment.
can we get more specific with _self type ?
betodealmeida
left a comment
There was a problem hiding this comment.
This looks great, @graceguo-supercat! I left a small comment on how to possible simplify the code a little b it, but I'm not 100% sure it would help.
superset/utils/decorators.py
Outdated
There was a problem hiding this comment.
I know generalizing too soon is not a good practice, but I wonder if we should pass a callable called is_stale here. It would simplify the decorator logic, and since it would be defined closer to the dashboard it might simplify the logic there as well.
superset/utils/decorators.py
Outdated
There was a problem hiding this comment.
Can we rename get_latest_changed_on to get_last_modified just to be more consistent with the response attribute? Imagine in future refactor response.last_modified is renamed to something else, you would know this function is definitely related by searching for last_modified.
There was a problem hiding this comment.
agree. rename it to get_last_modified.
superset/views/core.py
Outdated
There was a problem hiding this comment.
Nit: it's a little weird to have both changed_on and changedon, but up to your.
c69afd0 to
18b31f8
Compare
* feat: add etag for dashboard load requests * fix review comments
SUMMARY
ETag header is widely used as efficient server-side caching mechanism. Superset had ETag support for explore_json, this PR is to expand the coverage to dashboard GET request.
When dashboard request come in, Superset need to gather all the datasource metadata and all the slices query parameters that are used for this dashboard, and return as a big blob of data. For a large dashboard in airbnb, we saw some dashboard can have 100+ datasources and 300+ slices. This server-side processing can take 4 seconds, and make the whole dashboard load became very slow.
eTag could be a good solution for large dashboards with less frequent changes:

TEST PLAN
For example: regular GET request takes about 4 seconds. If enabled eTag header, 2nd request will get 304 and faster responded since there is no server-side processing:

cc @betodealmeida @etr2460