This repository has been archived by the owner on Feb 22, 2023. It is now read-only.

Extreme dates cannot be rendered when displaying DataFrames #118

Closed
xhochy opened this issue Apr 6, 2020 · 10 comments
Comments

@xhochy
Owner

xhochy commented Apr 6, 2020

fletcher can store dates that pandas cannot, because we allow precisions other than nanoseconds. Sadly, our code currently converts to nanoseconds when printing a DataFrame.

Reproducible example:

import fletcher as fr
import pandas as pd
import datetime

df = pd.DataFrame({
    "date": fr.FletcherContinuousArray([datetime.datetime(9999, 12, 1), datetime.datetime(9999, 12, 1)])
})
print(df.head())

Exception:

Traceback (most recent call last):
  File "extreme_dates.py", line 8, in <module>
    print(df.head())
  File "/Users/uwe/miniconda3/envs/fletcher/lib/python3.8/site-packages/pandas/core/frame.py", line 680, in __repr__
    self.to_string(
  File "/Users/uwe/miniconda3/envs/fletcher/lib/python3.8/site-packages/pandas/core/frame.py", line 820, in to_string
    return formatter.to_string(buf=buf, encoding=encoding)
  File "/Users/uwe/miniconda3/envs/fletcher/lib/python3.8/site-packages/pandas/io/formats/format.py", line 914, in to_string
    return self.get_result(buf=buf, encoding=encoding)
  File "/Users/uwe/miniconda3/envs/fletcher/lib/python3.8/site-packages/pandas/io/formats/format.py", line 521, in get_result
    self.write_result(buf=f)
  File "/Users/uwe/miniconda3/envs/fletcher/lib/python3.8/site-packages/pandas/io/formats/format.py", line 823, in write_result
    strcols = self._to_str_columns()
  File "/Users/uwe/miniconda3/envs/fletcher/lib/python3.8/site-packages/pandas/io/formats/format.py", line 759, in _to_str_columns
    fmt_values = self._format_col(i)
  File "/Users/uwe/miniconda3/envs/fletcher/lib/python3.8/site-packages/pandas/io/formats/format.py", line 948, in _format_col
    return format_array(
  File "/Users/uwe/miniconda3/envs/fletcher/lib/python3.8/site-packages/pandas/io/formats/format.py", line 1172, in format_array
    return fmt_obj.get_result()
  File "/Users/uwe/miniconda3/envs/fletcher/lib/python3.8/site-packages/pandas/io/formats/format.py", line 1203, in get_result
    fmt_values = self._format_strings()
  File "/Users/uwe/miniconda3/envs/fletcher/lib/python3.8/site-packages/pandas/io/formats/format.py", line 1489, in _format_strings
    array = np.asarray(values)
  File "/Users/uwe/miniconda3/envs/fletcher/lib/python3.8/site-packages/numpy/core/_asarray.py", line 85, in asarray
    return array(a, dtype, copy=False, order=order)
  File "/Users/uwe/Development/fletcher/fletcher/base.py", line 328, in __array__
    return self.data.to_pandas().values
  File "pyarrow/array.pxi", line 567, in pyarrow.lib._PandasConvertible.to_pandas
  File "pyarrow/array.pxi", line 1027, in pyarrow.lib.Array._to_pandas
  File "pyarrow/array.pxi", line 1209, in pyarrow.lib._array_like_to_pandas
  File "pyarrow/error.pxi", line 84, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Casting from timestamp[us] to timestamp[ns] would result in out of bounds timestamp: 253399622400000000
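
To make the out-of-bounds error concrete, here is a minimal sketch using only the standard library. It assumes that a nanosecond timestamp is stored as a signed 64-bit count of nanoseconds since 1970-01-01; the helper names `to_microseconds` and `fits_in_ns` are hypothetical and exist only for this illustration.

```python
import datetime

INT64_MAX = 2**63 - 1
EPOCH = datetime.datetime(1970, 1, 1)


def to_microseconds(ts):
    """Microseconds since the epoch for a naive datetime (hypothetical helper)."""
    delta = ts - EPOCH
    return (delta.days * 86_400 + delta.seconds) * 1_000_000 + delta.microseconds


def fits_in_ns(ts):
    """Approximate check: does the value still fit in int64 once scaled to ns?"""
    return abs(to_microseconds(ts) * 1_000) <= INT64_MAX


extreme = datetime.datetime(9999, 12, 1)
print(to_microseconds(extreme))                    # 253399622400000000
print(fits_in_ns(extreme))                         # False
print(fits_in_ns(datetime.datetime(2020, 4, 6)))   # True

# Last microsecond that is still representable at nanosecond resolution.
last_ns = EPOCH + datetime.timedelta(microseconds=INT64_MAX // 1_000)
print(last_ns)                                     # falls in the year 2262
```

The microsecond value matches the number in the error message; multiplying it by 1000 exceeds the int64 range, which is why the cast to timestamp[ns] is rejected.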
@jorisvandenbossche
Contributor

That's the conversion in pyarrow failing (when converting to scalar objects):

In [23]: a = pa.array([datetime.datetime(9999, 12, 1), datetime.datetime(9999, 12, 1)])   

In [24]: a.to_pandas()    
...
ArrowInvalid: Casting from timestamp[us] to timestamp[ns] would result in out of bounds timestamp: 253399622400000000

In [28]: list(a) 
Out[28]: [datetime.datetime(9999, 12, 1, 0, 0), datetime.datetime(9999, 12, 1, 0, 0)]

For dates, we have a date_as_object option in to_pandas; we should probably have something similar for timestamps.
BTW, the scalar conversion (e.g. converting to a list) does correctly use the datetime.datetime class when the resolution is not ns.

Now, that's for the pyarrow side. On the pandas side, I am surprised we go through np.asarray for formatting ExtensionArrays. Normally we should just pass the actual scalars from the ExtensionArray (iter(ea) should be sufficient for calling EA._formatter). That seems like a bug.

@xhochy
Owner Author

xhochy commented Apr 6, 2020

On the pyarrow/fletcher side, the issue is that fletcher uses a hack: it goes through pd.Series (via self.data.to_pandas().values) to get a NumPy array, which forces the conversion to ns. Instead, we should use to_numpy (which is sadly only available on pa.Array, not pa.ChunkedArray).

On the pandas side there is a hard conversion to NumPy arrays baked in here: https://github.com/pandas-dev/pandas/blob/01f73100d5f7b942a796ffd000962dee28b43f9c/pandas/io/formats/format.py#L1462

@xhochy
Owner Author

xhochy commented Apr 6, 2020

Passing in a correct numpy.array sadly also triggers the conversion to nanoseconds:

../pandas/pandas/io/formats/format.py:752: in _to_str_columns
    fmt_values = self._format_col(i)
../pandas/pandas/io/formats/format.py:936: in _format_col
    return format_array(
../pandas/pandas/io/formats/format.py:1159: in format_array
    return fmt_obj.get_result()
../pandas/pandas/io/formats/format.py:1190: in get_result
    fmt_values = self._format_strings()
../pandas/pandas/io/formats/format.py:1464: in _format_strings
    fmt_values = format_array(
../pandas/pandas/io/formats/format.py:1159: in format_array
    return fmt_obj.get_result()
../pandas/pandas/io/formats/format.py:1190: in get_result
    fmt_values = self._format_strings()
../pandas/pandas/io/formats/format.py:1439: in _format_strings
    values = DatetimeIndex(values)
../pandas/pandas/core/indexes/datetimes.py:249: in __new__
    dtarr = DatetimeArray._from_sequence(
../pandas/pandas/core/arrays/datetimes.py:312: in _from_sequence
    subarr, tz, inferred_freq = sequence_to_dt64ns(
../pandas/pandas/core/arrays/datetimes.py:1755: in sequence_to_dt64ns
    data = conversion.ensure_datetime64ns(data)

@jorisvandenbossche
Contributor

That's for sure a bug then. We shouldn't try to coerce again if the data is coming from an EA; we should simply call EA._formatter on the individual values.
Can you open an issue for pandas?

@jorisvandenbossche
Contributor

So just as a test, if you return an object dtype array, does it work then?

@xhochy
Owner Author

xhochy commented Apr 6, 2020

> So just as a test, if you return an object dtype array, does it work then?

Yes, then it works.
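
A minimal illustration of why an object dtype array sidesteps the problem, using NumPy only and no fletcher or pandas code: the array holds plain datetime.datetime scalars, so formatting each value never involves a cast to nanoseconds.

```python
import datetime

import numpy as np

values = [datetime.datetime(9999, 12, 1), datetime.datetime(9999, 12, 1)]

# An object array stores the Python datetime objects as-is.
as_objects = np.array(values, dtype=object)

# Formatting scalar by scalar only needs str() on each datetime.
formatted = [str(v) for v in as_objects]
print(formatted[0])  # 9999-12-01 00:00:00
```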

@xhochy
Owner Author

xhochy commented Apr 6, 2020

> Can you open an issue for pandas?

Yes, made a semi-open issue: pandas-dev/pandas#33319

@jorisvandenbossche
Contributor

Thanks!

@xhochy
Owner Author

xhochy commented Apr 7, 2020

The fletcher side is fixed by #119, while on the pandas side we need pandas-dev/pandas#33319.

@xhochy
Owner Author

xhochy commented Feb 22, 2023

This project has been archived, as development ceased around 2021.
With the support for Apache Arrow-backed extension arrays in pandas, the major goal of this project has been fulfilled.

@xhochy xhochy closed this as completed Feb 22, 2023