Improve handling of datetime objects #84

Open
tovop opened this issue Nov 13, 2020 · 1 comment
Labels
enhancement New feature or request

Comments

@tovop
Collaborator

tovop commented Nov 13, 2020

Is your feature request related to a problem? Please describe.
The handling of datetime objects is extremely slow, especially with a large number of objects.

Describe the solution you'd like
Converting a large array/list of datetime objects to floats (e.g. seconds) should be much faster.

Describe alternatives you've considered
Use pandas.DatetimeIndex instead of datetime.datetime.
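
A minimal sketch of that alternative, assuming the input is a plain list of datetime.datetime objects and that "floats" means seconds elapsed since the first sample. The names `start`, `stamps`, `index` and `seconds` are illustrative only and are not taken from the project code.

```python
from datetime import datetime, timedelta

import numpy as np
import pandas as pd

# Hypothetical input: 1000 samples, 0.1 s apart.
start = datetime(2020, 11, 13, 12, 0, 0)
stamps = [start + timedelta(seconds=0.1 * i) for i in range(1000)]

# One vectorized conversion instead of a Python-level loop over the objects.
index = pd.DatetimeIndex(stamps)

# DatetimeIndex minus a Timestamp gives a TimedeltaIndex;
# total_seconds() turns that into float seconds.
seconds = np.asarray((index - index[0]).total_seconds())
```

The subtraction and the conversion to seconds both operate on the whole index at once, which is where the expected speed-up would come from.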

@tovop tovop added the enhancement New feature or request label Nov 13, 2020
@eneelo
Collaborator

eneelo commented Nov 17, 2020

Various ways of creating arrays of datetime.datetime objects are tested in the enclosed Python script (test_datetime_array.zip). Performance differs widely; the following execution times were logged for the various functions (converting a 3-hour time array with a time step of 0.1 seconds):

Average time per loop in fastest run (5 runs, 10 loops):
   datetime_array             187.0 ms
   np_datetime64_array        280.1 ms
   np_datetime64_array_v2      57.3 ms
   pd_to_datetime              18.2 ms
   pd_datetime_array           63.1 ms
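
For readers without the enclosed script, here is a rough sketch of how a figure like "average time per loop in fastest run (5 runs, 10 loops)" can be obtained with the standard timeit module. This is an assumption about the measurement approach, not the script's actual code; `best_time_per_loop` and `candidate` are hypothetical names, and the array is kept short so the example runs quickly.

```python
import timeit

import numpy as np
import pandas as pd


def best_time_per_loop(func, runs=5, loops=10):
    """Average time per call (in seconds) within the fastest of `runs` runs."""
    totals = timeit.repeat(func, repeat=runs, number=loops)
    return min(totals) / loops


# Hypothetical input: 600 samples at 0.1 s spacing (much shorter than 3 hours).
timearray = np.arange(600) * 0.1


def candidate():
    # Stand-in for any of the functions being compared.
    return pd.to_datetime(timearray * 1e9)


ms = best_time_per_loop(candidate) * 1e3
print(f"   candidate {ms:10.3f} ms")
```

Taking the fastest run rather than the mean reduces the influence of unrelated system load on the comparison.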

The function datetime_array() is equivalent to what is implemented today in the method TimeSeries.dtg_time()

Note that these functions are not strictly equivalent, in that they return slightly different object types:

  • datetime_array() and pd_datetime_array() return numpy.ndarray with datetime.datetime objects.
  • np_datetime64_array*() return numpy.ndarray with numpy.datetime64 objects.
  • pd_to_datetime() returns a pandas.DatetimeIndex (which has pandas.Timestamp objects as the values).

NB: A word of caution is needed, however. Calculating the difference between two values in the arrays above will yield quite different results. The difference between two...

  • ... numpy.datetime64 objects will be a timedelta with microseconds (!) as the default time unit.
  • ... pandas.Timestamp objects will be a timedelta with nanoseconds (!) as the default time unit.
  • ... datetime.datetime objects will be a timedelta with seconds as the default time unit.
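
A small sketch of one way to sidestep the unit differences: convert each kind of difference to float seconds explicitly. The timestamps `t0` and `t1` are made up for illustration. Note that, as far as I know, the unit of a numpy.timedelta64 follows the unit of the datetime64 operands, which is microseconds when they are created from datetime.datetime objects as done here.

```python
from datetime import datetime

import numpy as np
import pandas as pd

# Two hypothetical timestamps, 1.5 s apart.
t0 = datetime(2020, 11, 17, 12, 0, 0)
t1 = datetime(2020, 11, 17, 12, 0, 1, 500000)

d_py = t1 - t0                                # datetime.timedelta
d_np = np.datetime64(t1) - np.datetime64(t0)  # numpy.timedelta64
d_pd = pd.Timestamp(t1) - pd.Timestamp(t0)    # pandas.Timedelta

# Explicit conversion to float seconds, independent of the internal unit.
sec_py = d_py.total_seconds()
sec_np = d_np / np.timedelta64(1, "s")
sec_pd = d_pd.total_seconds()
```

Dividing by `np.timedelta64(1, "s")` avoids having to know which unit the numpy difference carries.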

As of now, my recommendation is that we implement either of the pandas-based functions, i.e. pd_to_datetime() or pd_datetime_array() (the code for both is included below, to enable quick inspection without downloading the enclosed script).
Whether one is chosen over the other will be a matter of convenience (the pandas.DatetimeIndex is flexible and works very well for plotting with matplotlib.pyplot) versus unit consistency (time difference in seconds).

from datetime import datetime, timedelta
import numpy as np
import pandas as pd

def pd_to_datetime(timearray, dtg_ref=None):
    """
    pandas.DatetimeIndex -- utilizing pandas strengths for efficiency

    To convert the DatetimeIndex to a numpy array (of numpy.datetime64 values):
    >>> dtindex = pd_to_datetime(timearray, dtg_ref)
    >>> arr = dtindex.values
    """
    if dtg_ref is None:
        dtg_ref = pd.Timestamp.now()
    elif isinstance(dtg_ref, (datetime, np.datetime64)):
        dtg_ref = pd.Timestamp(dtg_ref)

    # generate datetime index (strictly: Timestamp index) in two steps:
    #  1) generate datetime index by passing list of floats, without specifying a reference (start) time
    #  2) shift the datetime index to start at the right time
    # NB: default unit for pd.to_datetime() is nanoseconds (ns), and the command below is _much_ faster than specifying
    #     the unit explicitly (e.g. pd.to_datetime(timearray, unit='s'))
    dtg_time = pd.to_datetime(np.asarray(timearray) * 1e9)
    dtg_time += dtg_ref - dtg_time[0]

    return dtg_time


def pd_datetime_array(timearray, dtg_ref=None):
    """
    numpy array of datetime.datetime objects -- generated by use of pandas' efficiency.
    """
    dtg_time = pd_to_datetime(timearray, dtg_ref=dtg_ref)
    return dtg_time.to_pydatetime()

PS. It's easy (and quite efficient) to go back to a pandas.DatetimeIndex from a numpy array of datetime.datetime objects, e.g. if needed for plotting:

arr = pd_datetime_array(timearray, dtg_ref)
index = pd.to_datetime(arr)
