feat(legacy-preset-chart-nvd3): add zero imputation by villebro · Pull Request #758 · apache-superset/superset-ui

villebro · 2020-08-27T10:39:25Z

🏆 Enhancements

Add control for imputation of missing timestamps. Useful if missing data indicates zero observations.

SCREENSHOTS (these changes are on a forthcoming PR on `incubator-superset`)

When imputing annual data with monthly time grains:

When trying to zero out without selecting a time grain first:

vercel · 2020-08-27T10:39:29Z

This pull request is being automatically deployed with Vercel (learn more).
To see the status of your deployment, click below or on the icon next to each commit.

🔍 Inspect: https://vercel.com/superset/superset-ui/2a9yfk4ii
✅ Preview: https://superset-ui-git-fork-preset-io-villebro-zero-out.superset.vercel.app

codecov · 2020-08-27T13:21:33Z

Codecov Report

Merging #758 into master will increase coverage by 0.44%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master     #758      +/-   ##
==========================================
+ Coverage   24.45%   24.90%   +0.44%     
==========================================
  Files         335      353      +18     
  Lines        7654     7839     +185     
  Branches      938     1003      +65     
==========================================
+ Hits         1872     1952      +80     
- Misses       5708     5785      +77     
- Partials       74      102      +28

Impacted Files	Coverage Δ
...gins/legacy-preset-chart-nvd3/src/NVD3Controls.tsx	`0.00% <ø> (ø)`
plugins/plugin-chart-table/src/index.ts	`0.00% <0.00%> (ø)`
plugins/plugin-chart-table/test/testData.ts	`83.33% <0.00%> (ø)`
plugins/legacy-plugin-chart-rose/src/Rose.js	`0.00% <0.00%> (ø)`
plugins/legacy-plugin-chart-rose/src/index.js	`0.00% <0.00%> (ø)`
plugins/legacy-preset-chart-nvd3/src/utils.js	`14.18% <0.00%> (ø)`
plugins/plugin-chart-table/src/TableChart.tsx	`57.14% <0.00%> (ø)`
plugins/legacy-plugin-chart-chord/src/Chord.js	`0.00% <0.00%> (ø)`
plugins/legacy-plugin-chart-chord/src/index.js	`0.00% <0.00%> (ø)`
plugins/legacy-plugin-chart-sankey/src/index.js	`0.00% <0.00%> (ø)`
... and 332 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 2115ecd...971607a.

eugeniamz · 2020-08-27T13:27:02Z

This feature is very important: if it is not clear that data points are missing, the interpretation of the chart can change completely. For example, this chart has several dates missing:

If we "fill the gaps" of the missing data points:

I had been fixing this in the database with time-series functions, but not all databases support them. I had also been working around it with resample:

But that is not scalable if the grain changes, and it also does not allow the use of other advanced analytics functions.
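A minimal sketch of what zero imputation at a given grain could look like, using only the Python standard library. The function name `zero_fill`, the sample counts, and the use of a fixed `timedelta` as the grain are assumptions made for illustration; this is not the code in this PR, and calendar grains such as months would need different stepping.

```python
from datetime import date, timedelta
from typing import Dict


def zero_fill(observed: Dict[date, float], step: timedelta) -> Dict[date, float]:
    """Return a copy of `observed` in which every missing timestamp between
    the first and last observation is present with a value of 0.

    Assumes all observed timestamps already sit on the grid defined by `step`.
    """
    if not observed:
        return {}
    filled: Dict[date, float] = {}
    current, last = min(observed), max(observed)
    while current <= last:
        # Keep the real observation when there is one, otherwise impute zero.
        filled[current] = observed.get(current, 0)
        current += step
    return filled


# Hypothetical daily counts with gaps on June 26 and June 28-30.
daily = {date(2020, 6, 25): 4, date(2020, 6, 27): 2, date(2020, 7, 1): 5}
filled = zero_fill(daily, timedelta(days=1))
```

Because the grain is a parameter, the same function covers a weekly grain by passing `timedelta(weeks=1)`, which is the part a hand-written resample workaround does not handle when the grain changes.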

eugeniamz · 2020-08-27T13:33:42Z

Another good use case is the Time Series Chart: if dates are missing, the comparison function compares data points by position, so a WoW or MoM comparison is not the real one. For example, if this is my dataset:

6/25/20
6/27/20
7/1/20
7/2/20
7/3/20
7/7/20
7/8/20
7/9/20

and I use the time-series table to do WoW by setting comparison of 7 periods,

07/09/2020 -> will be compared with 6/25/20, not with 07/02/20
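The effect described above can be reproduced with a small standalone sketch. It models the comparison as a purely positional shift of 7 rows, which is an assumption about the behaviour being described and not Superset's actual implementation; `compared_with` is a hypothetical helper.

```python
from datetime import date, timedelta

# Dates from the comment above (all in 2020).
observed = [
    date(2020, 6, 25), date(2020, 6, 27), date(2020, 7, 1), date(2020, 7, 2),
    date(2020, 7, 3), date(2020, 7, 7), date(2020, 7, 8), date(2020, 7, 9),
]


def compared_with(rows, periods):
    """Map each row to the row `periods` positions earlier (None if there is none)."""
    return {
        row: (rows[i - periods] if i >= periods else None)
        for i, row in enumerate(rows)
    }


# Positional shift on the data with gaps.
raw = compared_with(observed, 7)

# Same shift after inserting every missing day between the first and last date.
span = (observed[-1] - observed[0]).days
dense = [observed[0] + timedelta(days=n) for n in range(span + 1)]
filled = compared_with(dense, 7)

print(raw[date(2020, 7, 9)])     # 2020-06-25
print(filled[date(2020, 7, 9)])  # 2020-07-02
```

Once the missing days exist as rows, a shift of 7 periods lines up with a calendar week again.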

bkyryliuk · 2020-08-27T15:23:11Z

> Another good use case is the Time Series Chart: if dates are missing, the comparison function compares data points by position, so a WoW or MoM comparison is not the real one. For example, if this is my dataset:
>
> 6/25/20
> 6/27/20
> 7/1/20
> 7/2/20
> 7/3/20
> 7/7/20
> 7/8/20
> 7/9/20
>
> and I use the time-series table to do WoW by setting comparison of 7 periods,
>
> 07/09/2020 -> will be compared with 6/25/20, not with 07/02/20

+1 for this feature.
Thank you @villebro & @eugeniamz for fixing it. There were a couple of bug reports on our side about how WoW performs with missing days, and this would be a good solution for it.

kristw · 2020-08-28T19:23:33Z

plugins/legacy-preset-chart-nvd3/src/NVD3Controls.tsx

+      [<h1 className="section-header">{t('Data imputation')}</h1>],
+      [
+        {
+          name: 'zero_out',


Can the name be more specific, e.g. fill_missing_with_zero or fill_zero? This parameter will be stored.

villebro · 2020-09-08

Agreed, will update the name.

mistercrunch · 2020-09-09T00:47:43Z

There's an interesting interplay with the pandas.resample feature here. Some combinations of these two control sets do not make sense.

Another approach (not necessarily a better one) would be to add a new "Rule" that says "matches query time grain", and add a method "fill with zeros". This would allow applying whatever fill method makes sense at the specified time grain.

Overall, both of these features (pandas.resample and this new checkbox) could be redesigned to be more comprehensive. This feels a bit like adding a layer to something already opaque.

eugeniamz · 2020-09-09T01:20:01Z

I don't think that filling the gap is just a function of resample, but a feature for all time series charts. If you use it only as a resample, you can't use any of the other functions, such as time comparison or rolling functions.

villebro · 2020-09-09T05:19:26Z

> There's an interesting interplay with the pandas.resample feature here. Some combinations of these two control sets do not make sense.
>
> Another approach (not necessarily a better one) would be to add a new "Rule" that says "matches query time grain", and add a method "fill with zeros". This would allow applying whatever fill method makes sense at the specified time grain.
>
> Overall, both of these features (pandas.resample and this new checkbox) could be redesigned to be more comprehensive. This feels a bit like adding a layer to something already opaque.

I thought of adding that combo, but part of this design was influenced by discussions with a few business users who were clearly put off by using a multi-field option to fill missing grains with zero. So here the proposal was to make something that is easy to toggle on/off, and make it as business user friendly as possible. But I agree, this does add some potential clutter to the mix.

Another option would be to reword and redesign the whole "resample" + "rule" fields to be less Pandas-centric and more business-centric, focusing more on what the feature does in a viz context (in this case gap filling). It'd be interesting to hear what people are currently using resampling for, but I'm guessing users should almost never be doing downsampling (that's what the time grain is for). So something like a checkbox for "Fill gaps" and then a "value" selector with slightly simpler options, such as "zero" and "previous", should probably cover the majority of use cases.

mistercrunch · 2020-09-09T21:44:59Z

> Another option would be to reword and redesign the whole "resample" + "rule" fields to be less Pandas-centric and more business-centric, focusing more on what the feature does in a viz context (in this case gap filling). It'd be interesting to hear what people are currently using resampling for, but I'm guessing users should almost never be doing downsampling (that's what the time grain is for). So something like a checkbox for "Fill gaps" and then a "value" selector with slightly simpler options, such as "zero" and "previous", should probably cover the majority of use cases.

That's what I'm advocating for.

About pandas.resample's typical use case, I think what's really typical here is to use timestamp-level event data (say at the millisecond level) that can be very "bursty", and to define how to both aggregate and fill the gaps. In our case, we're NOT interested in aggregating, as we absolutely want to push that down to the db engine. From a more conceptual perspective (deviating from pandas' take on this here), we are interested in dealing with gaps though. I think the methods that are reasonable to expose are (in order of my interpretation of popularity of desires & expectations):

zero fill: fill missing data with zeroes
cut-the-line: have a break in the line, with the line ending at the missing point and restarting on the other side. This usually requires markers to get floating dots to be represented
link through: default nvd3 behavior, jump over the missing data point and link the existing ones
forward-fill: repeat the last data point until the next existing one
back-fill: repeat the next data point backwards to fill missing ones

While people might want forward-fill and backward-filling, I don't think that generally it's right to make up data points that don't exist. There's a case where your raw data represents something like "setting a gauge" or "changes in rank" where conceptually forward filling is right though. Pretty narrow use case...
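As a rough illustration of the value-based methods in the list above, here is a minimal standalone sketch. The function `fill_gaps` and its method names are hypothetical and not part of this PR or of pandas. Missing points are modelled as `None`; the `"none"` method leaves them in place, so whether the chart cuts the line or links through the gap remains a rendering decision.

```python
from typing import List, Optional


def fill_gaps(values: List[Optional[float]], method: str) -> List[Optional[float]]:
    """Fill None entries using one of: 'none', 'zero', 'ffill', 'bfill'."""
    if method == "none":
        # Leave gaps untouched; the renderer decides how to draw them.
        return list(values)
    if method == "zero":
        return [0 if v is None else v for v in values]
    if method == "ffill":
        out: List[Optional[float]] = []
        last: Optional[float] = None
        for v in values:
            # Remember the most recent real value and repeat it over gaps.
            last = v if v is not None else last
            out.append(last)
        return out
    if method == "bfill":
        # Back-fill is forward-fill applied to the reversed series.
        return fill_gaps(values[::-1], "ffill")[::-1]
    raise ValueError(f"unknown fill method: {method}")


series = [3, None, None, 5, None]
print(fill_gaps(series, "zero"))   # [3, 0, 0, 5, 0]
print(fill_gaps(series, "ffill"))  # [3, 3, 3, 5, 5]
print(fill_gaps(series, "bfill"))  # [3, 5, 5, 5, None]
```

Note that back-fill leaves the trailing gap empty because there is no later point to copy from, and forward-fill does the same for a leading gap.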

Am I missing anything?

ktmud · 2020-09-14T00:22:09Z

Missing data was once a pain for Big Number with Trendline, too. I did something similar but different to fix it: apache/superset#9341. In my case, I didn't have to fill missing dates in between available records because they already look like zeros in the viz, but I also had to add forward fill just for the last timestamp.

I agree it would be very helpful to have data imputation based on time grains. Come to think of it, I feel maybe we can just make this checkbox + fill method select part of the "Time" control section so as to enable it for any timeseries chart:

Users may also choose to configure fill methods for each metric separately in AdhocMetricsControl or another control.

audita12 · 2020-09-29T04:29:03Z

> Another good use case is the Time Series Chart: if dates are missing, the comparison function compares data points by position, so a WoW or MoM comparison is not the real one. For example, if this is my dataset:
>
> 6/25/20
> 6/27/20
> 7/1/20
> 7/2/20
> 7/3/20
> 7/7/20
> 7/8/20
> 7/9/20
>
> and I use the time-series table to do WoW by setting comparison of 7 periods,
>
> 07/09/2020 -> will be compared with 6/25/20, not with 07/02/20

Looking forward to this solution. Thank you @villebro & @eugeniamz for fixing this.

lowstz · 2021-02-23T09:20:30Z

This feature is very useful for non-aggregate data, to show the gaps.

villebro requested a review from a team as a code owner August 27, 2020 10:39

vercel bot had a problem deploying to Preview August 27, 2020 10:39 Failure

pull-request-size bot added the size/S label Aug 27, 2020

villebro force-pushed the villebro/zero-out branch from a68da50 to a1bb4e0 Compare August 27, 2020 13:11

vercel bot deployed to Preview August 27, 2020 13:11 View deployment

bkyryliuk approved these changes Aug 27, 2020


kristw reviewed Aug 28, 2020


villebro added 2 commits August 31, 2020 21:07

feat(legacy-preset-chart-nvd3): add zero imputation

df183ce

rename to more appropriate name

971607a

villebro force-pushed the villebro/zero-out branch from a1bb4e0 to 971607a Compare September 8, 2020 07:52

vercel bot deployed to Preview September 8, 2020 07:52 View deployment

willbarrett approved these changes Sep 8, 2020


ktmud approved these changes Sep 8, 2020


villebro closed this Sep 6, 2021

villebro mentioned this pull request Sep 6, 2021

feat: add resample operator to advanced analytic #1349

Merged

9 participants
