Extreme dates cannot be rendered when displaying DataFrames #118

xhochy · 2020-04-06T09:44:45Z

We can store dates in fletcher that pandas cannot, because we allow precisions other than nanoseconds. Unfortunately, our code currently converts to nanoseconds when printing a DataFrame.

Reproducible example:

import fletcher as fr
import pandas as pd
import datetime

df = pd.DataFrame({
    "date": fr.FletcherContinuousArray([datetime.datetime(9999, 12, 1), datetime.datetime(9999, 12, 1)])
})
print(df.head())

Exception:

Traceback (most recent call last):
  File "extreme_dates.py", line 8, in <module>
    print(df.head())
  File "/Users/uwe/miniconda3/envs/fletcher/lib/python3.8/site-packages/pandas/core/frame.py", line 680, in __repr__
    self.to_string(
  File "/Users/uwe/miniconda3/envs/fletcher/lib/python3.8/site-packages/pandas/core/frame.py", line 820, in to_string
    return formatter.to_string(buf=buf, encoding=encoding)
  File "/Users/uwe/miniconda3/envs/fletcher/lib/python3.8/site-packages/pandas/io/formats/format.py", line 914, in to_string
    return self.get_result(buf=buf, encoding=encoding)
  File "/Users/uwe/miniconda3/envs/fletcher/lib/python3.8/site-packages/pandas/io/formats/format.py", line 521, in get_result
    self.write_result(buf=f)
  File "/Users/uwe/miniconda3/envs/fletcher/lib/python3.8/site-packages/pandas/io/formats/format.py", line 823, in write_result
    strcols = self._to_str_columns()
  File "/Users/uwe/miniconda3/envs/fletcher/lib/python3.8/site-packages/pandas/io/formats/format.py", line 759, in _to_str_columns
    fmt_values = self._format_col(i)
  File "/Users/uwe/miniconda3/envs/fletcher/lib/python3.8/site-packages/pandas/io/formats/format.py", line 948, in _format_col
    return format_array(
  File "/Users/uwe/miniconda3/envs/fletcher/lib/python3.8/site-packages/pandas/io/formats/format.py", line 1172, in format_array
    return fmt_obj.get_result()
  File "/Users/uwe/miniconda3/envs/fletcher/lib/python3.8/site-packages/pandas/io/formats/format.py", line 1203, in get_result
    fmt_values = self._format_strings()
  File "/Users/uwe/miniconda3/envs/fletcher/lib/python3.8/site-packages/pandas/io/formats/format.py", line 1489, in _format_strings
    array = np.asarray(values)
  File "/Users/uwe/miniconda3/envs/fletcher/lib/python3.8/site-packages/numpy/core/_asarray.py", line 85, in asarray
    return array(a, dtype, copy=False, order=order)
  File "/Users/uwe/Development/fletcher/fletcher/base.py", line 328, in __array__
    return self.data.to_pandas().values
  File "pyarrow/array.pxi", line 567, in pyarrow.lib._PandasConvertible.to_pandas
  File "pyarrow/array.pxi", line 1027, in pyarrow.lib.Array._to_pandas
  File "pyarrow/array.pxi", line 1209, in pyarrow.lib._array_like_to_pandas
  File "pyarrow/error.pxi", line 84, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Casting from timestamp[us] to timestamp[ns] would result in out of bounds timestamp: 253399622400000000
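
The out-of-bounds value in the last line can be checked by hand. The sketch below uses only the standard library and assumes timestamps are stored as signed 64-bit integers counted from the Unix epoch; the constant and variable names are illustrative, not taken from pyarrow or pandas. It recomputes the microsecond value from the error message and shows that scaling it to nanoseconds no longer fits.

```python
import datetime

INT64_MAX = 2**63 - 1  # largest value a signed 64-bit integer can hold
EPOCH = datetime.datetime(1970, 1, 1)

value = datetime.datetime(9999, 12, 1)

# Microseconds since the epoch, i.e. what timestamp[us] stores.
micros = (value - EPOCH) // datetime.timedelta(microseconds=1)
print(micros)  # 253399622400000000, the number in the error message

# Converting to nanoseconds multiplies by 1000, which exceeds int64.
nanos = micros * 1000
print(nanos > INT64_MAX)  # True

# Latest datetime whose nanosecond count still fits into int64
# (truncated to microseconds, the finest resolution datetime offers).
latest_ns = EPOCH + datetime.timedelta(microseconds=INT64_MAX // 1000)
print(latest_ns)  # 2262-04-11 23:47:16.854775
```

So any timestamp after April 2262 can be held at microsecond precision but not at nanosecond precision.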

jorisvandenbossche · 2020-04-06T10:01:54Z

That's the conversion in pyarrow failing (when converting to scalar objects):

In [23]: a = pa.array([datetime.datetime(9999, 12, 1), datetime.datetime(9999, 12, 1)])   

In [24]: a.to_pandas()    
...
ArrowInvalid: Casting from timestamp[us] to timestamp[ns] would result in out of bounds timestamp: 253399622400000000

In [28]: list(a) 
Out[28]: [datetime.datetime(9999, 12, 1, 0, 0), datetime.datetime(9999, 12, 1, 0, 0)]

For dates, we have a date_as_object option in to_pandas, we should probably have something similar for timestamps.
BTW, the scalar conversion (e.g. converting to a list) does correctly use the datetime.datetime class when the resolution is not ns.

Now, that's for the pyarrow side. On the pandas side, I am surprised we go through np.asarray for formatting ExtensionArrays. Normally we should just pass the actual scalars from the ExtensionArray (iter(ea) should be sufficient for calling EA._formatter). That seems like a bug.

xhochy · 2020-04-06T10:32:30Z

For the pyarrow/fletcher side, the issue is that fletcher uses a hack to get a NumPy array: it goes through pd.Series (via self.data.to_pandas().values), which forces the conversion to ns. Instead we should use to_numpy (which is sadly only available on pa.Array, not pa.ChunkedArray).

On the pandas side there is a hard conversion to NumPy arrays baked in here: https://github.com/pandas-dev/pandas/blob/01f73100d5f7b942a796ffd000962dee28b43f9c/pandas/io/formats/format.py#L1462

xhochy · 2020-04-06T10:49:00Z

Passing in a correct numpy.array sadly also triggers the conversion to nanoseconds:

../pandas/pandas/io/formats/format.py:752: in _to_str_columns
    fmt_values = self._format_col(i)
../pandas/pandas/io/formats/format.py:936: in _format_col
    return format_array(
../pandas/pandas/io/formats/format.py:1159: in format_array
    return fmt_obj.get_result()
../pandas/pandas/io/formats/format.py:1190: in get_result
    fmt_values = self._format_strings()
../pandas/pandas/io/formats/format.py:1464: in _format_strings
    fmt_values = format_array(
../pandas/pandas/io/formats/format.py:1159: in format_array
    return fmt_obj.get_result()
../pandas/pandas/io/formats/format.py:1190: in get_result
    fmt_values = self._format_strings()
../pandas/pandas/io/formats/format.py:1439: in _format_strings
    values = DatetimeIndex(values)
../pandas/pandas/core/indexes/datetimes.py:249: in __new__
    dtarr = DatetimeArray._from_sequence(
../pandas/pandas/core/arrays/datetimes.py:312: in _from_sequence
    subarr, tz, inferred_freq = sequence_to_dt64ns(
../pandas/pandas/core/arrays/datetimes.py:1755: in sequence_to_dt64ns
    data = conversion.ensure_datetime64ns(data)

jorisvandenbossche · 2020-04-06T12:10:33Z

That's for sure a bug then. We shouldn't try to coerce again if it's coming from an EA, we should simply call EA._formatter on the individual values.
Can you open an issue for pandas?

jorisvandenbossche · 2020-04-06T12:15:38Z

So just as a test, if you return an object dtype array, does it work then?

xhochy · 2020-04-06T12:19:42Z

> So just as a test, if you return an object dtype array, does it work then?

Yes, then it works.
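
For illustration, the object-dtype route can be tried outside of fletcher with plain NumPy and pandas. This is only a sketch of the idea, not fletcher's actual __array__ implementation; the variable names are made up.

```python
import datetime

import numpy as np
import pandas as pd

# datetime64[us] can represent the year 9999; datetime64[ns] cannot.
values_us = np.array(
    [datetime.datetime(9999, 12, 1)] * 2, dtype="datetime64[us]"
)

# Casting to object yields plain datetime.datetime scalars.
values_obj = values_us.astype(object)

# dtype=object keeps pandas from inferring a datetime64 dtype again.
df = pd.DataFrame({"date": pd.Series(values_obj, dtype=object)})
rendered = df.to_string()
print(rendered)
```

The trade-off is that an object column loses vectorised datetime operations, so this is more of a display workaround than a storage format.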

xhochy · 2020-04-06T12:39:54Z

> Can you open an issue for pandas?

Yes, made a semi-open issue: pandas-dev/pandas#33319

jorisvandenbossche · 2020-04-06T12:41:04Z

Thanks!

xhochy · 2020-04-07T05:51:28Z

The fletcher side is fixed by #119, while on the pandas side we need pandas-dev/pandas#33319.

xhochy · 2023-02-22T15:14:54Z

This project has been archived, as development ceased around 2021.
With the support of Apache Arrow-backed extension arrays in pandas, the major goal of this project has been fulfilled.

xhochy closed this as completed Feb 22, 2023
