Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consolidate DateDiffLeapYearTransormer and DateDifferenceTransformer #244

Open
davidhopkinson26 opened this issue May 1, 2024 · 1 comment
Labels
breaking change change which would break backwards compatibility feature New feature or request help wanted Extra attention is needed

Comments

@davidhopkinson26
Copy link
Contributor

davidhopkinson26 commented May 1, 2024

What

Consolidate the DateDiffLeapYearTransormer and DateDifferenceTransformer into a single transformer with options to cover the functionality of both.

Why

It would be more user friendly if these two transformers which appear to offer similar functionality were consolidated.

How

These two transformers are semantically very similar but function in different ways which for now necessitates different logic.

DateDiffLeapYearTransormer returns integer year date differences (i.e. age) factoring in leap years.

DateDifferenceTransformer calculates difference between datetimes in days/ hours/ minutes/ seconds. M and Y were removed as options in #123 due to M and Y values not being supported by np.timedeltas

To resolve this issue we need to consider ways of bringing the logic of these together in a way which can support years/ months/ days/ hours/ minutes/ seconds. We need to be able to account for calendar months and leap years. Ideally we would not have forking logic for this and would use a single logical framework for all.

Currently unclear how best to approach this so contributions and ideas are welcome!

@davidhopkinson26 davidhopkinson26 added feature New feature or request help wanted Extra attention is needed breaking change change which would break backwards compatibility labels May 1, 2024
@adamsardar
Copy link
Contributor

adamsardar commented Feb 5, 2025

I spent some time thinking about this ticket.

What are we actually asking for? Is it for us to be able to output 'Months' or 'Years' as a standard time unit? Because it feels like it might not be possible from what I understand.

Some preliminaries:
Days aren't a standard amount of hours (daylight saving hours can create 23 and 25 hours days)
Months aren't a standard amount of days (months contain 28-31 days). Additionally, Februaries (pl?) aren't a standard number of days because of leap years
Years aren't a standard number of days (leap years contain 366 days, common years 365)

If we have a date stamp, which does not track hours, then we can compute the number of days between two dates because we don't care about nor track hours: 30-Mar-25 - 29-Mar-25 = 1 day - daylight saving clock changes on 30th March this year are irrelevant. Or do we care about tracking the hours; i.e. 30-Mar-25 03:30 - 29-Mar-25 03:30 = 0 days because less than 24 hours have passed? This is relevant because it is the same case for months and years.

Do we want the number of months between two datestamps to be dependent on those original datestamps? So 26-Feb-25 - 26-Mar-25 = 1 month and 26-Mar-25 - 26-Apr-25 = 1 month, even though the first equates to 28 days and the second 31 days? Worse still, if you were to cast those months to days, you would need to know the context of the time duration that you started with. I don't think that you want a time unit that requires you to cart around all of the date context like that.

The same problem arises with years. Even worse, you'd need to worry about tracking the fact that years like 1900 and 2100 are not leap years, but 2000 was (I am unsure whether the logic in DateDiffLeapYearTransormer does this ...). It's just a total faff and I'm not sure what you gain in an ML pipeline.

My proposal is to just return the number of days between two datestamps (converting datettimes to datestamps), so 1-Jan-25 23:00 - 2-Jan-01:00 == 1-Jan-25 - 2-Jan-25 = 1 day. Datestamp arithmetic will already account for extra days in leap years. If someone wants a count of years, I'd encourage them to convert to common years via common_years = days/365. If they would like something like months, I'd suggest lunar_months = floor(days/29.5) (or use a non-casting version by defining a month as 28 or 29 days). Or convert to fortnights if you want an integer. For a machine learning pipeline, these definitions are exact and contain sufficient feature importance. No model should be sensitive to a small change in time unit as the result of a leap year.

So to summarise:

  • return integer days from DateDifferenceTransformer
  • Offer the chance to specify a unit for the column (week, fortnight, lunar_month, common_year') or some integer of their choice.
  • We continue to support hours and smaller units of time? See no reason to change.
  • delete DateDiffLeapYearTransormer
  • Include the above considerations about leap years into the docs and code into tests

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking change change which would break backwards compatibility feature New feature or request help wanted Extra attention is needed
Projects
None yet
Development

No branches or pull requests

2 participants