
Feature/Fix: times_series limit for Druid #3434

Merged (1 commit) Sep 18, 2017

Conversation

@fabianmenges (Contributor) commented Sep 8, 2017

When setting a series limit for Druid, it currently returns the TopN for each individual datapoint instead of the TopN for the entire time frame. See the before and after pictures attached.

This is probably more of a bug fix, since the behavior after this change is consistent with the behavior you see when you group by multiple columns.

Before: (screenshot)

After: (screenshot)

@coveralls commented Sep 8, 2017

Coverage Status

Coverage increased (+0.3%) to 69.401% when pulling db70e73d3ed5436341e5cfaa4f056ac63dc554f4 on tc-dc:fmenges/druid_timeseries into 3dfdde1 on apache:master.


coveralls commented Sep 12, 2017

Coverage Status

Coverage increased (+0.1%) to 69.25% when pulling ddcb94b47c3b4cf887abbef4a08cac75d1454c4b on tc-dc:fmenges/druid_timeseries into 147c12d on apache:master.


coveralls commented Sep 12, 2017

Coverage Status

Coverage increased (+0.3%) to 69.402% when pulling 5d5090b76634eeeb137bf00d8e204962caa96f30 on tc-dc:fmenges/druid_timeseries into 8223729 on apache:master.

@mistercrunch (Member) left a comment

Minor change requested.

Sorry you had to dig through this piece of code; this is not the code I'm most proud of, for sure...

Otherwise it looks good, though this large, messy method is hard to review and hard to unit test, so I'll have to trust you on it and test against our staging.

Which version of Druid are you running? I was told by some of the Druid core committers that the issue you were seeing (around TopN not showing fragments) has been fixed in more recent versions, and I was hoping that an upcoming upgrade would fix this, but we do see this on our side at the moment.

I'm also hoping that we eventually move Druid to use SQLAlchemy as well, but that'll require a proper SQLAlchemy dialect, and that's assuming Druid offers as much control/performance through the SQL option...

@@ -798,6 +798,28 @@ def values_for_column(self,
def get_query_str(self, query_obj, phase=1, client=None):
return self.run_query(client=client, phase=phase, **query_obj)

def _add_filter_from_pre_query_data(self, df, dimensions, filter):
ret = filter
@mistercrunch (Member) commented on the diff:

`filter` shadows a Python built-in, which can be confusing; go for `filter_` or `filters`

@fabianmenges (Contributor, Author) replied:

ah... keep forgetting `filter` :(


coveralls commented Sep 15, 2017

Coverage Status

Coverage increased (+0.3%) to 69.424% when pulling 9fe8d73 on tc-dc:fmenges/druid_timeseries into e22aecb on apache:master.

@fabianmenges (Contributor, Author) commented Sep 15, 2017

We are running 0.10.1, we upgraded from 0.9.2 about 2 weeks ago and saw great performance gains across different query types.

This changeset handles Druid behavior that is implemented this way by design (it is not addressing or working around a bug in Druid). If you specify a threshold of 5 for a TopN query, Druid will always return the top 5 results per granularity (let's ignore that they are not guaranteed to be the actual top 5, just likely the top 5).
As an example of the default TopN behavior, let's say you want to query the top 5 campaigns with the most "Ad-Impressions" (ordered by that metric) at a daily granularity over a week. Druid will return the top 5 campaigns for Monday, the top 5 for Tuesday, and so on. If you have a different set of 5 campaigns every day, you end up with 7 x 5 = 35 different campaigns in your result set, each with exactly one datapoint. (This is what you can see in the first screenshot.)
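The per-granularity behavior described above can be illustrated with a minimal TopN query body (a sketch following the shape of Druid's native TopN query spec; the datasource, dimension, and metric names here are hypothetical):

```python
# Hypothetical Druid TopN query body: for each day in the interval,
# Druid returns the top 5 campaigns by ad impressions *for that day*,
# so in the worst case a different top-5 set appears every day.
topn_query = {
    "queryType": "topN",
    "dataSource": "ad_events",       # hypothetical datasource name
    "dimension": "campaign",
    "metric": "ad_impressions",      # ordering metric
    "threshold": 5,
    "granularity": "day",
    "intervals": ["2017-09-04/2017-09-11"],  # one week
    "aggregations": [
        {"type": "longSum", "name": "ad_impressions",
         "fieldName": "impressions"},
    ],
}

# Worst case over 7 days: a fully different top-5 set each day.
max_distinct_campaigns = 7 * topn_query["threshold"]
print(max_distinct_campaigns)  # 35
```

This is why the first screenshot shows many short, one-point series: each day contributes its own top-5 set.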

The argument for this changeset is that the behavior described above is not actually useful, and is counterintuitive, when you want a line chart of the top 5 campaigns. What you expect to see is the change over time for the same top 5 campaigns, day by day, over the course of the week.
The way this is implemented is to run one TopN query with granularity "all" over the entire time range to find the top 5 campaigns, and then run a TopN query with a granularity of 1 day, filtered to the campaigns from the first run.
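That two-phase approach can be sketched as follows (a simplified illustration only, not the actual Superset code; `run_topn` and `fake_run_topn` are hypothetical stand-ins for a Druid client):

```python
def top_series_over_time(run_topn, dimension, metric, threshold,
                         interval, granularity):
    """Two-phase TopN sketch: find the overall top-N series first,
    then fetch the per-granularity time series for just those series.

    run_topn is a hypothetical callable that executes a Druid TopN
    query and returns a list of row dicts."""
    # Phase 1: granularity "all" over the full range finds the true top N.
    pre_rows = run_topn(dimension=dimension, metric=metric,
                        threshold=threshold, interval=interval,
                        granularity="all")
    top_values = [row[dimension] for row in pre_rows]

    # Phase 2: per-granularity query, filtered to the phase-1 winners,
    # so the same N series appear at every timestamp.
    filter_ = {"type": "in", "dimension": dimension, "values": top_values}
    return run_topn(dimension=dimension, metric=metric,
                    threshold=threshold, interval=interval,
                    granularity=granularity, filter_=filter_)


# Stub client so the sketch runs without a Druid cluster: phase 1
# returns two winners; phase 2 returns one row per winner per day.
def fake_run_topn(dimension, metric, threshold, interval,
                  granularity, filter_=None):
    if granularity == "all":
        return [{dimension: "Campaign A"}, {dimension: "Campaign B"}]
    return [{dimension: value, "day": day}
            for day in ("Mon", "Tue")
            for value in filter_["values"]]


rows = top_series_over_time(fake_run_topn, "campaign", "impressions",
                            threshold=2, interval="P1W", granularity="day")
```

Note the `filter_` naming, following the review comment above about not shadowing the `filter` built-in.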

We have been running this code for about 3 weeks and have not run into problems.

@mistercrunch mistercrunch merged commit c3c9ceb into apache:master Sep 18, 2017
@fabianmenges fabianmenges deleted the fmenges/druid_timeseries branch September 19, 2017 18:32
timifasubaa pushed a commit to timifasubaa/incubator-superset that referenced this pull request Oct 3, 2017
michellethomas pushed a commit to michellethomas/panoramix that referenced this pull request May 24, 2018
@mistercrunch added the `🏷️ bot` (a label used by `supersetbot` to keep track of which PRs were auto-tagged with release labels) and `🚢 0.20.0` labels Feb 27, 2024