
Feature/Fix: times_series limit for Druid #3434

Merged (1 commit) Sep 18, 2017

Conversation

@fabianmenges (Contributor) commented Sep 8, 2017

When setting a series limit for Druid, it currently returns the TopN for each individual datapoint instead of the TopN for the entire time frame. See the before and after pictures attached.

This is probably more of a bug fix, since the behavior after this change is consistent with the behavior you see when you group by multiple columns.

Before: (screenshot)

After: (screenshot)

@coveralls commented Sep 8, 2017

Coverage Status

Coverage increased (+0.3%) to 69.401% when pulling db70e73d3ed5436341e5cfaa4f056ac63dc554f4 on tc-dc:fmenges/druid_timeseries into 3dfdde1 on apache:master.


coveralls commented Sep 12, 2017

Coverage Status

Coverage increased (+0.1%) to 69.25% when pulling ddcb94b47c3b4cf887abbef4a08cac75d1454c4b on tc-dc:fmenges/druid_timeseries into 147c12d on apache:master.


coveralls commented Sep 12, 2017

Coverage Status

Coverage increased (+0.3%) to 69.402% when pulling 5d5090b76634eeeb137bf00d8e204962caa96f30 on tc-dc:fmenges/druid_timeseries into 8223729 on apache:master.

@mistercrunch (Member) left a comment

Minor change requested.

Sorry you had to dig through this piece of code; this is not the code I'm most proud of, for sure...

Otherwise it looks good, though this large, messy method is hard to review and hard to unit test, so I'll have to trust you on it and test against our staging.

Which version of Druid are you running? I was told by some of the Druid core committers that the issue you were seeing (around TopN not showing fragments) has been fixed in more recent versions, and I was hoping that an upcoming upgrade would fix this, but we do see this on our side at the moment.

I'm also hoping that we eventually move Druid to use SQLAlchemy as well, but that'll require a proper SQLAlchemy dialect, and that's assuming Druid offers as much control/performance through the SQL option...

@@ -798,6 +798,28 @@ def values_for_column(self,
def get_query_str(self, query_obj, phase=1, client=None):
return self.run_query(client=client, phase=phase, **query_obj)

def _add_filter_from_pre_query_data(self, df, dimensions, filter):
ret = filter
@mistercrunch (Member) commented on the diff:

`filter` shadows a Python built-in, which can be confusing; go for `filter_` or `filters`

@fabianmenges (Contributor, Author) replied:

ah... keep forgetting `filter` :(


coveralls commented Sep 15, 2017

Coverage Status

Coverage increased (+0.3%) to 69.424% when pulling 9fe8d73 on tc-dc:fmenges/druid_timeseries into e22aecb on apache:master.

@fabianmenges (Contributor, Author) commented Sep 15, 2017

We are running 0.10.1, we upgraded from 0.9.2 about 2 weeks ago and saw great performance gains across different query types.

This changeset handles Druid behavior that is implemented this way by design (it is not addressing or working around a bug in Druid). If you specify a threshold of 5 for a TopN query, Druid will always return the top 5 results per granularity (let's ignore that they are not guaranteed to be the actual top 5, just likely the top 5).
As an example of the default TopN behavior, let's say you want to query the top 5 campaigns with the most "Ad-Impressions" (ordered by that metric) at a daily granularity over a week. Druid will return the top 5 campaigns for Monday, the top 5 for Tuesday, and so on. If you have a different set of 5 campaigns every day, you end up with 7 x 5 = 35 different campaigns in your result set, each with exactly one datapoint. (This is what you can see in the first screenshot.)
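The per-granularity behavior described above can be illustrated with a minimal TopN query body (a sketch following the shape of Druid's native TopN query spec; the datasource, dimension, and metric names here are hypothetical):

```python
# Hypothetical Druid TopN query body: for each day in the interval,
# Druid returns the top 5 campaigns by ad impressions *for that day*,
# so in the worst case a different top-5 set appears every day.
topn_query = {
    "queryType": "topN",
    "dataSource": "ad_events",       # hypothetical datasource name
    "dimension": "campaign",
    "metric": "ad_impressions",      # ordering metric
    "threshold": 5,
    "granularity": "day",
    "intervals": ["2017-09-04/2017-09-11"],  # one week
    "aggregations": [
        {"type": "longSum", "name": "ad_impressions",
         "fieldName": "impressions"},
    ],
}

# Worst case over 7 days: a fully different top-5 set each day.
max_distinct_campaigns = 7 * topn_query["threshold"]
print(max_distinct_campaigns)  # 35
```

This is why the first screenshot shows many short, one-point series: each day contributes its own top-5 set.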

The argument for this changeset is that the behavior described above is not actually useful, and is counterintuitive, when you want a line chart of the top 5 campaigns. What you expect to see is the change over time for the same top 5 campaigns, day by day, over the course of the week.
The way this is implemented is to run one TopN query with granularity "all" over the entire time range to find the top 5 campaigns, and then run a TopN query with a granularity of 1 day, filtered to the campaigns from the first run.
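That two-phase approach can be sketched as follows (a simplified illustration only, not the actual Superset code; `run_topn` and `fake_run_topn` are hypothetical stand-ins for a Druid client):

```python
def top_series_over_time(run_topn, dimension, metric, threshold,
                         interval, granularity):
    """Two-phase TopN sketch: find the overall top-N series first,
    then fetch the per-granularity time series for just those series.

    run_topn is a hypothetical callable that executes a Druid TopN
    query and returns a list of row dicts."""
    # Phase 1: granularity "all" over the full range finds the true top N.
    pre_rows = run_topn(dimension=dimension, metric=metric,
                        threshold=threshold, interval=interval,
                        granularity="all")
    top_values = [row[dimension] for row in pre_rows]

    # Phase 2: per-granularity query, filtered to the phase-1 winners,
    # so the same N series appear at every timestamp.
    filter_ = {"type": "in", "dimension": dimension, "values": top_values}
    return run_topn(dimension=dimension, metric=metric,
                    threshold=threshold, interval=interval,
                    granularity=granularity, filter_=filter_)


# Stub client so the sketch runs without a Druid cluster: phase 1
# returns two winners; phase 2 returns one row per winner per day.
def fake_run_topn(dimension, metric, threshold, interval,
                  granularity, filter_=None):
    if granularity == "all":
        return [{dimension: "Campaign A"}, {dimension: "Campaign B"}]
    return [{dimension: value, "day": day}
            for day in ("Mon", "Tue")
            for value in filter_["values"]]


rows = top_series_over_time(fake_run_topn, "campaign", "impressions",
                            threshold=2, interval="P1W", granularity="day")
```

Note the `filter_` naming, following the review comment above about not shadowing the `filter` built-in.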

We have been running this code for about 3 weeks and have not run into problems.

@mistercrunch mistercrunch merged commit c3c9ceb into apache:master Sep 18, 2017
@fabianmenges fabianmenges deleted the fmenges/druid_timeseries branch September 19, 2017 18:32
timifasubaa pushed a commit to timifasubaa/incubator-superset that referenced this pull request Oct 3, 2017
michellethomas pushed a commit to michellethomas/panoramix that referenced this pull request May 24, 2018
@mistercrunch added the `🏷️ bot` (a label used by `supersetbot` to keep track of which PRs were auto-tagged with release labels) and `🚢 0.20.0` labels Feb 27, 2024