Contributor

@dianaclarke dianaclarke commented Sep 30, 2020

See: https://issues.apache.org/jira/browse/ARROW-10147

Before:

>>> import numpy as np
>>> import pyarrow as pa
>>> import pandas as pd
>>> idx = pd.RangeIndex(0, 4, name=np.int64(6))
>>> df = pd.DataFrame(index=idx)
>>> pa.table(df)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow/table.pxi", line 2042, in pyarrow.lib.table
    return Table.from_pandas(data, schema=schema, nthreads=nthreads)
  File "pyarrow/table.pxi", line 1394, in pyarrow.lib.Table.from_pandas
    arrays, schema = dataframe_to_arrays(
  File "/Users/diana/workspace/arrow/python/pyarrow/pandas_compat.py", line 604, in dataframe_to_arrays
    pandas_metadata = construct_metadata(df, column_names, index_columns,
  File "/Users/diana/workspace/arrow/python/pyarrow/pandas_compat.py", line 237, in construct_metadata
    b'pandas': json.dumps({
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/json/__init__.py", line 231, in dumps
    return _default_encoder.encode(obj)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/json/encoder.py", line 199, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/json/encoder.py", line 257, in iterencode
    return _iterencode(o, 0)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/json/encoder.py", line 179, in default
    raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type int64 is not JSON serializable

After:

>>> import numpy as np
>>> import pyarrow as pa
>>> import pandas as pd
>>> idx = pd.RangeIndex(0, 4, name=np.int64(6))
>>> df = pd.DataFrame(index=idx)
>>> table = pa.table(df)
>>> metadata = table.schema.pandas_metadata
>>> import pprint
>>> pprint.pprint(metadata)
{'column_indexes': [{'field_name': None,
                     'metadata': None,
                     'name': None,
                     'numpy_type': 'object',
                     'pandas_type': 'empty'}],
 'columns': [],
 'creator': {'library': 'pyarrow',
             'version': '2.0.0.dev381+g06830c954.d20201001'},
 'index_columns': [{'kind': 'range',
                    'name': '6',
                    'start': 0,
                    'step': 1,
                    'stop': 4}],
 'pandas_version': '1.1.2'}

Contributor

@arw2019 arw2019 left a comment

lgtm

Contributor Author

Or perhaps name should be coerced to a string on RangeIndex instantiation? Perhaps there are other places where a name that's something like np.int64(6) might be problematic.

Member

pandas allows Index.name to be anything, so we may want to make a best effort to preserve the type of the name; in the worst case, pickling could be necessary.
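
To illustrate the pickling fallback mentioned above, here is a minimal sketch, not what pyarrow actually does. The helpers `encode_name`, `decode_name`, and `round_trip`, and the `"encoding"`/`"value"` keys, are hypothetical names chosen for this example. A name that the default JSON encoder accepts is stored as-is; anything else is pickled and base64-encoded so it fits in a JSON string.

```python
import base64
import json
import pickle


def encode_name(name):
    """Describe `name` in a JSON-safe dict (hypothetical layout)."""
    try:
        json.dumps(name)  # probe: does the default encoder accept it?
        return {"encoding": "json", "value": name}
    except TypeError:
        # Fallback: pickle the object and base64-encode the bytes as text.
        payload = base64.b64encode(pickle.dumps(name)).decode("ascii")
        return {"encoding": "pickle", "value": payload}


def decode_name(entry):
    """Invert encode_name()."""
    if entry["encoding"] == "pickle":
        return pickle.loads(base64.b64decode(entry["value"]))
    return entry["value"]


def round_trip(name):
    """Encode, pass through a JSON string, and decode again."""
    return decode_name(json.loads(json.dumps(encode_name(name))))


print(round_trip("idx"))
print(round_trip(frozenset({1, 2})))
```

Two caveats with this approach: unpickling data from an untrusted source is unsafe, and some JSON-serializable values still lose their type (a tuple name comes back as a list).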

Contributor Author

I see, that context helps, thanks! I hear name and immediately assume a string is the end goal.

I've amended the commit to at least preserve the type if it is JSON-serializable by default. I'll noodle some more on pickling.
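
A minimal sketch of the "preserve the type if serializable, otherwise coerce to a string" idea; this is not the actual patch. `jsonable_name` is a hypothetical helper, and `decimal.Decimal` stands in for `np.int64` here (both are rejected by the default JSON encoder) so that the example needs only the standard library.

```python
import json
from decimal import Decimal


def jsonable_name(name):
    """Return `name` unchanged if json.dumps accepts it, else str(name)."""
    try:
        json.dumps(name)  # probe with the default encoder
    except TypeError:
        # Not serializable by default: fall back to its string form.
        return str(name)
    return name


# Names the default encoder handles keep their type.
print(repr(jsonable_name("idx")))
print(repr(jsonable_name(6)))
print(repr(jsonable_name(None)))

# Decimal is a stand-in for np.int64: it is coerced to a string.
print(repr(jsonable_name(Decimal(6))))
```

With this shape, the metadata dict can always be passed to `json.dumps`, at the cost of losing the original type for atypical names.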

@dianaclarke dianaclarke force-pushed the ARROW-10147 branch 3 times, most recently from 9c6befc to a67dbef on October 1, 2020 23:17
Member

@wesm wesm left a comment

+1, I think this is plenty for now -- if preserving the type of atypical names becomes important to someone they can always submit a patch to fix it later. Thanks @dianaclarke!
