This repository was archived by the owner on Dec 10, 2021. It is now read-only.

feat(legacy-preset-chart-nvd3): add zero imputation#758

Closed
villebro wants to merge 2 commits into apache-superset:master from preset-io:villebro/zero-out

@villebro
Contributor

🏆 Enhancements

Add control for imputation of missing timestamps. Useful if missing data indicates zero observations.

SCREENSHOTS (these changes are on a forthcoming PR on incubator-superset)

When imputing annual data with monthly time grains:
image
When trying to zero out without selecting a time grain first:
image

@villebro villebro requested a review from a team as a code owner August 27, 2020 10:39
@vercel

vercel bot commented Aug 27, 2020

This pull request is being automatically deployed with Vercel (learn more).
To see the status of your deployment, click below or on the icon next to each commit.

🔍 Inspect: https://vercel.com/superset/superset-ui/2a9yfk4ii
✅ Preview: https://superset-ui-git-fork-preset-io-villebro-zero-out.superset.vercel.app

@codecov

codecov bot commented Aug 27, 2020

Codecov Report

Merging #758 into master will increase coverage by 0.44%.
The diff coverage is n/a.


@@            Coverage Diff             @@
##           master     #758      +/-   ##
==========================================
+ Coverage   24.45%   24.90%   +0.44%     
==========================================
  Files         335      353      +18     
  Lines        7654     7839     +185     
  Branches      938     1003      +65     
==========================================
+ Hits         1872     1952      +80     
- Misses       5708     5785      +77     
- Partials       74      102      +28     
| Impacted Files | Coverage Δ |
|---|---|
| ...gins/legacy-preset-chart-nvd3/src/NVD3Controls.tsx | 0.00% <ø> (ø) |
| plugins/plugin-chart-table/src/index.ts | 0.00% <0.00%> (ø) |
| plugins/plugin-chart-table/test/testData.ts | 83.33% <0.00%> (ø) |
| plugins/legacy-plugin-chart-rose/src/Rose.js | 0.00% <0.00%> (ø) |
| plugins/legacy-plugin-chart-rose/src/index.js | 0.00% <0.00%> (ø) |
| plugins/legacy-preset-chart-nvd3/src/utils.js | 14.18% <0.00%> (ø) |
| plugins/plugin-chart-table/src/TableChart.tsx | 57.14% <0.00%> (ø) |
| plugins/legacy-plugin-chart-chord/src/Chord.js | 0.00% <0.00%> (ø) |
| plugins/legacy-plugin-chart-chord/src/index.js | 0.00% <0.00%> (ø) |
| plugins/legacy-plugin-chart-sankey/src/index.js | 0.00% <0.00%> (ø) |
| ... and 332 more | |

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 2115ecd...971607a.

@eugeniamz

This feature is very important: when it is not clear that data points are missing, the chart can be interpreted in a completely different way. For example, this chart has several dates missing:
image

If we "fill the gaps" left by the missing data points:
image

I have been fixing this in the DB with time-series functions, but not all databases support time series. I have also worked around it with resample:

image

but that does not scale if the grain changes, and it also does not allow the use of other advanced analytics functions.

@eugeniamz

Another good use case is the Time Series Chart: if dates are missing, the comparison function compares each data point with the one N rows back rather than N calendar periods back, so the resulting WoW or MoM is not a real WoW or MoM. For example, if this is my dataset:

6/25/20
6/27/20
7/1/20
7/2/20
7/3/20
7/7/20
7/8/20
7/9/20

and I use the time-series table to do WoW by setting a comparison of 7 periods, then

7/9/20 will be compared with 6/25/20, not with 7/2/20.
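
To make the mismatch concrete, here is a minimal sketch of a positional "7 periods back" comparison, run once on the sparse rows and once after zero-filling the missing days. The dates are the ones listed above; the values, and the helper names `zeroFillDaily` and `compareWith`, are made up for illustration and are not Superset code.

```typescript
// Hypothetical daily series: dates from the comment above, values invented.
type Point = { date: string; value: number };

const observed: Point[] = [
  { date: '2020-06-25', value: 10 },
  { date: '2020-06-27', value: 12 },
  { date: '2020-07-01', value: 9 },
  { date: '2020-07-02', value: 20 },
  { date: '2020-07-03', value: 14 },
  { date: '2020-07-07', value: 11 },
  { date: '2020-07-08', value: 13 },
  { date: '2020-07-09', value: 30 },
];

const DAY_MS = 24 * 60 * 60 * 1000;

// Insert a zero-valued point for every missing day between the first and
// last observation (assumes `points` is sorted by date, ISO formatted).
function zeroFillDaily(points: Point[]): Point[] {
  if (points.length === 0) return [];
  const byDate = new Map(points.map(p => [p.date, p.value] as [string, number]));
  const start = Date.parse(`${points[0].date}T00:00:00Z`);
  const end = Date.parse(`${points[points.length - 1].date}T00:00:00Z`);
  const filled: Point[] = [];
  for (let t = start; t <= end; t += DAY_MS) {
    const date = new Date(t).toISOString().slice(0, 10);
    const value = byDate.get(date);
    filled.push({ date, value: value === undefined ? 0 : value });
  }
  return filled;
}

// Positional comparison: pair each row with the row `periods` positions earlier.
function compareWith(points: Point[], periods: number): Array<{ date: string; comparedTo: string }> {
  return points.slice(periods).map((p, i) => ({ date: p.date, comparedTo: points[i].date }));
}

const sparse = compareWith(observed, 7);
const dense = compareWith(zeroFillDaily(observed), 7);
```

On the sparse rows, `sparse` holds a single pairing, 2020-07-09 against 2020-06-25. After zero filling, the last entry of `dense` pairs 2020-07-09 with 2020-07-02, which is the week-over-week comparison that was intended.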

@bkyryliuk

bkyryliuk commented Aug 27, 2020

> Another good use case is the Time Series Chart: if dates are missing, the comparison function compares each data point with the one N rows back rather than N calendar periods back, so the resulting WoW or MoM is not a real WoW or MoM. For example, if this is my dataset:
>
> 6/25/20
> 6/27/20
> 7/1/20
> 7/2/20
> 7/3/20
> 7/7/20
> 7/8/20
> 7/9/20
>
> and I use the time-series table to do WoW by setting a comparison of 7 periods, then
>
> 7/9/20 will be compared with 6/25/20, not with 7/2/20.

+1 for this feature.
Thank you @villebro & @eugeniamz for fixing it. There were a couple of bug reports on our side about how WoW behaves with missing days; this would be a good solution for that.

    [<h1 className="section-header">{t('Data imputation')}</h1>],
    [
      {
        name: 'zero_out',
Contributor

Can the name be more specific, e.g. fill_missing_with_zero or fill_zero? This parameter will be stored.

Contributor Author

Agreed, will update the name.

@mistercrunch
Contributor

There's interesting interplay with the pandas.resample feature here. There are combinations of both these control sets that do not make sense.

Another approach (not necessarily a better one) would be to add a new "Rule" that says "matches query time grain", and add a method "fill with zeros". This would allow applying whatever fill method makes sense at the specified time grain.

Overall both these features (pandas.resample and this new checkbox) could be redesigned to be more comprehensive. This feels a bit like adding a layer to something already opaque.

@eugeniamz

eugeniamz commented Sep 9, 2020 via email

@villebro
Contributor Author

villebro commented Sep 9, 2020

> There's interesting interplay with the pandas.resample feature here. There are combinations of both these control sets that do not make sense.
>
> Another approach (not necessarily a better one) would be to add a new "Rule" that says "matches query time grain", and add a method "fill with zeros". This would allow applying whatever fill method makes sense at the specified time grain.
>
> Overall both these features (pandas.resample and this new checkbox) could be redesigned to be more comprehensive. This feels a bit like adding a layer to something already opaque.

I thought of adding that combo, but part of this design was influenced by discussions with a few business users who were clearly put off by using a multi-field option to fill missing grains with zero. So here the proposal was to make something that is easy to toggle on/off, and make it as business user friendly as possible. But I agree, this does add some potential clutter to the mix.

Another option would be to reword and redesign the whole "resample" + "rule" fields to be less Pandas-centric and more business-centric, focusing more on what the feature does in a viz context (in this case gap filling). It'd be interesting to hear what people are currently using resampling for, but I'm guessing users should almost never be doing downsampling (that's what the time grain is for). So something like a checkbox for "Fill gaps" and then a "value" selector with slightly simpler options, such as "zero" and "previous", should probably cover the majority of use cases.

@mistercrunch
Contributor

mistercrunch commented Sep 9, 2020

> Another option would be to reword and redesign the whole "resample" + "rule" fields to be less Pandas-centric and more business-centric, focusing more on what the feature does in a viz context (in this case gap filling). It'd be interesting to hear what people are currently using resampling for, but I'm guessing users should almost never be doing downsampling (that's what the time grain is for). So something like a checkbox for "Fill gaps" and then a "value" selector with slightly simpler options, such as "zero" and "previous", should probably cover the majority of use cases.

That's what I'm advocating for.

About pandas.resample's typical use case: I think what's really typical is taking timestamp-level event data (say, at the millisecond level) that can be very "bursty" and defining how to both aggregate it and fill the gaps. In our case, we're NOT interested in aggregating, as we absolutely want to push that down to the db engine. From a more conceptual perspective (deviating from pandas' take on this), we are interested in dealing with gaps, though. I think the methods that are reasonable to expose are (in order of my interpretation of the popularity of desires & expectations):

  • zero fill: fill missing data with zeroes
  • cut-the-line: have a break in the line, with the line ending at the missing point and restarting on the other side. This usually requires markers to get floating dots to be represented
  • link through: default nvd3 behavior, jump over the missing data point and link the existing ones
  • forward-fill: repeat the last data point until the next existing one
  • back-fill: repeat the next data point backwards to fill missing ones

While people might want forward-fill and backward-filling, I don't think that generally it's right to make up data points that don't exist. There's a case where your raw data represents something like "setting a gauge" or "changes in rank" where conceptually forward filling is right though. Pretty narrow use case...
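
As a rough illustration of how the five methods listed above differ, here is a small sketch that applies each one to the same sparse series. Everything in it is hypothetical: the `GapMethod` names, the `handleGaps` helper, and the choice to represent "cut-the-line" as a `null` value are assumptions made for this example, not nvd3 or superset-ui APIs.

```typescript
// Hypothetical names throughout; this is not nvd3 or superset-ui code.
type GapMethod = 'zero' | 'cut' | 'link' | 'ffill' | 'bfill';

interface SeriesPoint {
  x: number; // position on the time axis, in units of the time grain
  y: number | null; // null marks a break in the line (assumed convention)
}

// `points` must be sorted by x; `step` is the time grain in the same unit as x.
function handleGaps(points: SeriesPoint[], step: number, method: GapMethod): SeriesPoint[] {
  // link through: leave the series untouched so the line jumps over the gap.
  if (method === 'link' || points.length === 0) return points.slice();

  const result: SeriesPoint[] = [];
  for (let i = 0; i < points.length; i += 1) {
    result.push(points[i]);
    if (i === points.length - 1) break;
    const next = points[i + 1];
    // Walk the missing grid positions strictly between two observed points.
    for (let x = points[i].x + step; x < next.x; x += step) {
      let y: number | null;
      if (method === 'zero') y = 0; // zero fill
      else if (method === 'cut') y = null; // cut the line
      else if (method === 'ffill') y = points[i].y; // repeat the previous value
      else y = next.y; // back-fill: repeat the next value
      result.push({ x, y });
    }
  }
  return result;
}

// Observations at x = 1, 2 and 5; positions 3 and 4 are missing.
const sample: SeriesPoint[] = [
  { x: 1, y: 10 },
  { x: 2, y: 20 },
  { x: 5, y: 50 },
];

const ysFor = (method: GapMethod) => handleGaps(sample, 1, method).map(p => p.y);
```

With this sample, `ysFor('zero')` gives `[10, 20, 0, 0, 50]`, `ysFor('cut')` gives `[10, 20, null, null, 50]`, `ysFor('link')` gives `[10, 20, 50]`, `ysFor('ffill')` gives `[10, 20, 20, 20, 50]`, and `ysFor('bfill')` gives `[10, 20, 50, 50, 50]`. Only the first three leave the observed values as the sole non-zero data, which matches the concern above about making up data points.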

Am I missing anything?

@ktmud
Contributor

ktmud commented Sep 14, 2020

Missing data was once a pain for Big Number with Trendline, too. I did something similar but different to fix it: apache/superset#9341. In my case, I didn't have to fill missing dates in between available records because they already look like zeros in the viz, but I also had to add a forward fill just for the last timestamp.

I agree it would be very helpful to have data imputation based on time grains. Come to think of it, maybe we can just make this checkbox + fill method select part of the "Time" control section, so as to enable it for any time-series chart:

Users may also choose to configure fill methods for each metric separately in AdhocMetricsControl or another control.

@audita12

> Another good use case is the Time Series Chart: if dates are missing, the comparison function compares each data point with the one N rows back rather than N calendar periods back, so the resulting WoW or MoM is not a real WoW or MoM. For example, if this is my dataset:
>
> 6/25/20
> 6/27/20
> 7/1/20
> 7/2/20
> 7/3/20
> 7/7/20
> 7/8/20
> 7/9/20
>
> and I use the time-series table to do WoW by setting a comparison of 7 periods, then
>
> 7/9/20 will be compared with 6/25/20, not with 7/2/20.

Looking forward to this solution. Thank you @villebro & @eugeniamz for fixing this.

@lowstz

lowstz commented Feb 23, 2021

This feature is very useful for showing the gaps in non-aggregated data.
