
to_datetime() ns precision inconsistencies between s and ns #15817

Closed
mchwalisz opened this issue Mar 27, 2017 · 6 comments

@mchwalisz
Contributor

Code Sample, a copy-pastable example if possible

import pandas as pd
pd.options.display.float_format = '{:.20f}'.format
with open('example.csv', 'w') as f:
    f.write('''
full,ts,nano
1490195805.433502912,1490195805,433502912
1490195805.933358907,1490195805,933358907
1490195806.433445930,1490195806,433445930
1490195806.933351039,1490195806,933351039
''')
df = pd.read_csv('example.csv')
df['from two ints'] = pd.to_datetime(df['ts'] * 10**9 + df['nano'], unit='ns')
df['from float to s'] = pd.to_datetime(df['full'], unit='s')
df['from float^9 to ns'] = pd.to_datetime(df['full'] * 10**9, unit='ns')

print('Types:')
print(df.dtypes)
print('Data:')
print(df.iloc[0])

Output

Types:
full                         float64
ts                             int64
nano                           int64
from two ints         datetime64[ns]
from float to s       datetime64[ns]
from float^9 to ns    datetime64[ns]
dtype: object
Data:
full                 1490195805.43350267410278320312
ts                                        1490195805
nano                                       433502912
from two ints          2017-03-22 15:16:45.433502912
from float to s           2017-03-22 15:16:45.433503
from float^9 to ns     2017-03-22 15:16:45.433502720
Name: 0, dtype: object

Problem description

I would expect all of the following methods to give the same result. The documentation says that epoch times will be rounded to the nearest nanosecond. I don't understand why conversion from a float using the seconds (s) unit gives different rounding than the nanoseconds (ns) unit. Not to mention the strange float parsing coming from the csv.

I expect the whole issue to be connected to #7307.

Expected Output

Types:
full                         float64
ts                             int64
nano                           int64
from two ints         datetime64[ns]
from float to s       datetime64[ns]
from float^9 to ns    datetime64[ns]
dtype: object
Data:
full                            1490195805.433502912
ts                                        1490195805
nano                                       433502912
from two ints          2017-03-22 15:16:45.433502912
from float to s        2017-03-22 15:16:45.433502912
from float^9 to ns     2017-03-22 15:16:45.433502912
Name: 0, dtype: object

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.12.final.0
python-bits: 64
OS: Linux
OS-release: 4.9.10-040910-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.19.2
nose: None
pip: 9.0.1
setuptools: 34.3.2
Cython: 0.25.2
numpy: 1.10.4
scipy: 0.19.0
statsmodels: None
xarray: None
IPython: 5.3.0
sphinx: 1.4.8
patsy: None
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: None
tables: 3.3.0
numexpr: 2.6.2
matplotlib: 2.0.0
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.5.3
html5lib: 0.999999999
httplib2: 0.9.1
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.5
boto: None
pandas_datareader: None

@jorisvandenbossche
Member

This is not really related to #7307, but rather to the (limited) precision of floats.

The float "1490195805.433502912" has too many significant digits to be represented exactly as a 64-bit floating-point number. Therefore, when converting this float to a datetime64 (whose representation is based on integers), you lose precision.

In [11]: 1490195805.433502912
Out[11]: 1490195805.433503

In [12]: 1490195805433502912
Out[12]: 1490195805433502912

In [14]: pd.to_datetime(1490195805.433502912, unit='s') 
Out[14]: Timestamp('2017-03-22 15:16:45.433503') 

In [15]: pd.to_datetime(1490195805433502912, unit='ns')  
Out[15]: Timestamp('2017-03-22 15:16:45.433502912')  
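
The precision limit can be checked in plain Python, without pandas (a minimal sketch using the values from this issue):

```python
# float64 has a 53-bit mantissa, about 15-16 significant decimal digits;
# the epoch value above needs 19, so the literal is rounded when parsed.
value = 1490195805.433502912
print(repr(value))                  # 1490195805.433503 -- the nearest double
print(value == 1490195805.433503)  # True: both literals round to the same double

# An int64 can hold the full nanosecond count exactly.
ns = 1490195805433502912
print(ns)                           # 1490195805433502912
```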

@jorisvandenbossche jorisvandenbossche added this to the No action milestone Mar 27, 2017
@mchwalisz
Contributor Author

How do you explain the following then?

df['from float^9 to ns'] = pd.to_datetime(df['full'] * 10**9, unit='ns')

from float^9 to ns     2017-03-22 15:16:45.433502720

@jorisvandenbossche
Member

A string first needs to be parsed into an actual floating-point number, and here too differences can occur depending on which parser implementation is used:

In [25]: 1490195805.433502912
Out[25]: 1490195805.433503

In [26]: float('1490195805.433502912')
Out[26]: 1490195805.433503

In [27]: pd.read_csv(StringIO("a\n1490195805.433502912")).iloc[0,0]
Out[27]: 1490195805.4335027

In [28]: pd.read_csv(StringIO("a\n1490195805.433502912"), float_precision='high').iloc[0,0]
Out[28]: 1490195805.4335029

Additionally, floating-point arithmetic itself introduces small differences: a float multiplied by 10^9 does not necessarily result in the same number as combining both separate parts as integers.
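
That arithmetic difference can be seen directly, without pandas (a sketch using the values from this issue):

```python
# Combining the two integer parts is exact integer arithmetic:
ts, nano = 1490195805, 433502912
exact = ts * 10**9 + nano
print(exact)                                    # 1490195805433502912

# Going through a float rounds twice: once when parsing the decimal literal
# to a double, and once when rounding the product back to a double.
via_float = int(1490195805.433502912 * 10**9)
print(via_float, exact - via_float)             # the two values differ
```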

@jreback
Contributor

jreback commented Mar 27, 2017

@mchwalisz

Furthermore, the decimal literal you write down has unbounded precision, while it is cast to a float64, which has high (though limited) precision, about 15-16 significant digits. Rounding is unavoidable. The only way to have exact precision is to use a fixed-width exact type (e.g. an int64). This is why we store the datetimes as int64: you can round-trip them exactly this way, even via text.

In [6]: pd.to_datetime(1490195805.4335027 * 10**9, unit='ns').value
Out[6]: 1490195805433502720

In [7]: pd.to_datetime(1490195805.4335029 * 10**9, unit='ns').value
Out[7]: 1490195805433502976

In [8]: 1490195805.4335027 * 10**9
Out[8]: 1.4901958054335027e+18

In [9]: 1490195805.4335029 * 10**9
Out[9]: 1.490195805433503e+18
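
The round-trip claim can be sketched in plain Python: an int64 nanosecond count survives a trip through text unchanged, while going through a float64 loses the low bits.

```python
ns = 1490195805433502912       # fits comfortably in int64 (max ~9.22e18)

# int -> str -> int is exact
assert int(str(ns)) == ns

# int -> float -> int is not: near 1.49e18 the spacing between adjacent
# doubles (the ulp) is 256, so the value is rounded to a multiple of 256.
print(int(float(ns)))          # 1490195805433502976
print(int(float(ns)) - ns)     # 64
```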

@jreback
Contributor

jreback commented Mar 27, 2017

@mchwalisz happy to have a short section in timeseries.rst about this if you want, and possibly in the .to_datetime docstring as well. Not sure where else docs about this would fit.

@jorisvandenbossche
Member


This is one of the reasons the timestamps are stored as integers.

If you want to read more about this topic, see http://floating-point-gui.de/

@jreback jreback added Docs Datetime Datetime data dtype labels Mar 27, 2017
mchwalisz added a commit to mchwalisz/pandas that referenced this issue Apr 6, 2017
mchwalisz added a commit to mchwalisz/pandas that referenced this issue Apr 7, 2017
@jreback jreback closed this as completed in 860d555 Apr 8, 2017
@jreback jreback modified the milestones: 0.20.0, No action Apr 8, 2017