-
Notifications
You must be signed in to change notification settings - Fork 7
Add Sliding MDE analysis (solved issue #229) #246
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very good stuff, asked a couple questions and changes. The things left would be:
- Add some unit tests
- Install and run pre-commit in this file.
- Bump version.
- (optional) Add a small notebook in docs showing this functionality
) | ||
return results | ||
|
||
def mde_sliding_time_line( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wdyt about mde_rolling_time_line? or rolling_mde_time_line?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed to mde_rolling_time_line
print(results) | ||
""" | ||
used_time_col = self.time_col or time_col |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we have this logic here and not in mde_time_line? I would add it in both or nowhere. In case we add it in both, would put in a separate method
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I refactored the logic for handling the time column into an helper _get_time_col().
now, methods that need a time column (like run_average_standard_error or mde_rolling_time_line) call _get_time_col() to validate that a column was provided during class initialization
"or pass `time_col` to this method." | ||
) | ||
|
||
if agg_func is None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can't come up with a more elegant way, but it's a bit confusing that we allow it as optional above and here we raise when it's None. Any thoughts?
This list:
'sum', 'mean', 'median', 'min', 'max', 'count', 'std', 'var', 'nunique', 'first', 'last'."
I would define it as a constant (or a class attribute) and use it here and the type hint.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I refactored it so that the valid aggregation functions are now defined as a class constant VALID_AGG_FUNCS.
agg_func is now a required argument, and the class constant is now used in validation.
as for the type hint, I didn't use the class constant because in lower python versions we cannot unpack a list inside Literal
)[self.target_col].agg(agg_func) | ||
|
||
if post_process_func is not None: | ||
df_grouped[self.target_col] = df_grouped[self.target_col].apply(post_process_func) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
apply is very slow, but not sure if assign works better in here. Any thought? if not, we keep it like this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes! apply can be slower on very large datasets but the advantage of using apply here is that it makes it intuitive and flexible for the user to pass any callable function, without requiring it to be vectorized.
as for assign, I don't think it would improve performance as for element-wise operations it still relies on apply or equivalent under the hood
|
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #246 +/- ##
==========================================
- Coverage 96.57% 96.46% -0.12%
==========================================
Files 17 17
Lines 1783 1809 +26
==========================================
+ Hits 1722 1745 +23
- Misses 61 64 +3 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding a few minor comments. About
(optional) Add a small notebook in docs showing this functionality--> not sure if we want to create a new notebook or if we want to add this functionality to an existing one?!
whatever you think is best, both work for me! I dunno if we have a notebook with orders per user or something like that
def mde_rolling_time_line( | ||
self, | ||
df: pd.DataFrame, | ||
agg_func: Literal[ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know we have to put it as second argument if we don't want to make it optional, but still I think making it optional is better than breaking the order of arguments from mde_time_line where pre_exepriment_df is the second argument. Do you have an opinion on moving this to an argument below or keeping it here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
moved it and added * before it to make it a “keyword-only argument”. this way is still mandatory but it's aligned with the order of arguments from mde_time_line
and return a scalar. | ||
Example with post_process_func: | ||
def flag_positive(x): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wdyt about making a full reproducible example? this way we can test it in the test_docs file
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
experiment_start = df[time_col].min() | ||
|
||
for n_days in experiment_length: | ||
df_time = df[df[time_col] <= experiment_start + pd.Timedelta(days=n_days)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we rename to df_time_filter or something like this? more explicit I think
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed
Add mde_sliding_time_line method in the NormalPowerAnalysis class, enabling the calculation of MDE over varying experiment lengths.
Key features:
This enhancement provides flexibility to analyze temporal patterns in metric response and better capture dynamics of retention, conversion, or other evolving metrics.
Issue #229