Skip to content

Conversation

@dianaclarke
Copy link
Contributor

@dianaclarke dianaclarke commented Sep 30, 2020

See: https://issues.apache.org/jira/browse/ARROW-9941

Before:

>>> ty
IntegerType(extension<arrow.py_extension_type>)
>>> str(ty)
'extension<arrow.py_extension_type>'
>>> repr(ty)
'IntegerType(extension<arrow.py_extension_type>)'
>>> pa.DataType.__str__(ty)
'extension<arrow.py_extension_type>'

After:

>>> ty
IntegerType(DataType(int64))
>>> str(ty)
'extension<arrow.py_extension_type<IntegerType>>'
>>> repr(ty)
'IntegerType(DataType(int64))'
>>> pa.DataType.__str__(ty)
'extension<arrow.py_extension_type<IntegerType>>'

@pitrou pitrou self-requested a review September 30, 2020 20:12
@github-actions
Copy link

@pitrou
Copy link
Member

pitrou commented Oct 1, 2020

The problem is that this won't be visible from C++ code, which will still get extension<arrow.py_extension_type>.

>>> ty
IntegerType(DataType(int64))
>>> pa.DataType.__str__(ty)
'extension<arrow.py_extension_type>'

Solving this requires some collaboration with the C++ side. I advise you to take a look at cpp/src/arrow/python/extension_type.h. You'll probably need to override the ToString method in arrow::py::PyExtensionType. By default it can return the same as ExtensionType::ToString, but it can also query a Python attribute on type_instance to get a custom string.

@dianaclarke
Copy link
Contributor Author

@pitrou Thanks for the feedback! I figured it must be more involved than this ;) I'll see if I can figure out the C++ side later today.

@dianaclarke
Copy link
Contributor Author

If I do something like the following & revert the Python changes, I get this result.

// cpp/src/arrow/python/extension_type.cc
std::string PyExtensionType::ToString() const {
  return this->storage_type()->ToString();
}

// cpp/src/arrow/python/extension_type.h
std::string ToString() const override;
>>> ty = IntegerType()
>>> ty
IntegerType(int64)
>>> pa.DataType.__str__(ty)
'int64'

And if I keep the Python changes & the C++ changes, I see this:

>>> ty = IntegerType()
>>> ty
IntegerType(DataType(int64))
>>> pa.DataType.__str__(ty)
'int64'

What are the desired return values?

PS. I don't want to take up too much of your time. I won't be offended if you decide it's quicker to just implement this yourself. That, and I don't know C++, so anything I do attempt should be heavily scrutinized ;)

Thanks again for taking the time to review my first attempt and pointing me in the right direction.

@dianaclarke
Copy link
Contributor Author

I amended the commit to return this:

assert pa.DataType.__str__(ty) == "arrow.py_extension_type<int64>"

@pitrou
Copy link
Member

pitrou commented Oct 2, 2020

assert pa.DataType.__str__(ty) == "arrow.py_extension_type<int64>"

The problem is that this doesn't distinguish between different extension types with the same storage type. For example, two extension types representing UUIDs and IPv6 addresses, respectively, would get the same string representation (because both would be backed by a 16-byte fixed-size binary).

I would be more useful if the string representation looked like extension<arrow.py_extension_type<IntegerType>>, where IntegerType is the Python class name.

Optionally this could be made configurable as well.

@dianaclarke dianaclarke force-pushed the ARROW-9941 branch 2 times, most recently from 3295bf0 to 6618c63 Compare October 2, 2020 16:10
@nealrichardson
Copy link
Member

To be clear, the 2 Windows CI failures are unrelated.

@dianaclarke
Copy link
Contributor Author

I had hoped I could just use something like the following, but it's segfaulting. I think I need to go back to first principals and do a few C++ tutorials ;)

internal::PyObject_StdStringRepr(type_instance_.obj());

@dianaclarke dianaclarke force-pushed the ARROW-9941 branch 2 times, most recently from 4fe54cb to 720082c Compare October 2, 2020 22:06
@dianaclarke
Copy link
Contributor Author

I have it returning "extension<arrow.py_extension_type<IntegerType>>" now from the C++ side.

No idea if it's right though.

Thanks for your patience :)

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much, this looks good on the principle now. Just two minor things remaining.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this can be removed now (after which the str test will need to be fixed).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to confirm, if I remove the __str__, the output changes as follows. Is this what you're looking for? Thanks!

def test_ext_type_str():
    ty = IntegerType()
    expected = "extension<arrow.py_extension_type<IntegerType>>"
    assert str(ty) == expected
    assert pa.DataType.__str__(ty) == expected


def test_ext_type_repr():
    ty = IntegerType()
    expected = "IntegerType(extension<arrow.py_extension_type<IntegerType>>)"
    assert repr(ty) == expected

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The str looks fine, the repr looks a bit unfortunate (though not terrible). It seems the repr could be just the same as str.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I've removed __str__ and implemented __repr__ instead to preserve IntegerType(DataType(int64)) for repr().

Copy link
Member

@pitrou pitrou Oct 5, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This works fine, but there is one thing missing. Since ToString can be called from arbitrary C++ code, the GIL needs to be taken. This is achieved by adding:

PyAcquireGIL lock;

at the beginning of this method definition.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Amended commit to add PyAcquireGIL, thanks.

Before:

>>> ty
IntegerType(extension<arrow.py_extension_type>)
>>> str(ty)
'extension<arrow.py_extension_type>'
>>> pa.DataType.__str__(ty)
'extension<arrow.py_extension_type>'
>>> repr(ty)
'IntegerType(extension<arrow.py_extension_type>)'

After:

>>> ty
IntegerType(DataType(int64))
>>> str(ty)
'extension<arrow.py_extension_type<IntegerType>>'
>>> pa.DataType.__str__(ty)
'extension<arrow.py_extension_type<IntegerType>>'
>>> repr(ty)
'IntegerType(DataType(int64))'
Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, this is looking good, thank you!

@pitrou pitrou closed this in 7f84722 Oct 5, 2020
@jorisvandenbossche
Copy link
Member

Late comment on this PR: for the PyExtensionType subclasses, this is certainly a nice enhancement. But for custom extension types directly subclassing ExtesionType, this might not be needed. Using the PeriodType class from the tests, we now have:

In [26]: period_type = PeriodType('D')

In [27]: period_type
Out[27]: PeriodType(DataType(int64))

In [28]: str(period_type)
Out[28]: 'extension<test.period<PeriodType>>'

while before this was:

In [2]: period_type = PeriodType('D')

In [3]: period_type
Out[3]: PeriodType(extension<test.period>)

In [4]: str(period_type)
Out[4]: 'extension<test.period>'

Since here, the extension name is already unique and not the generic "arrow.py_extension_type", adding the class name to __str__ seems to not add much value / give only a longer type name?
For the __repr__ I am less certain about (showing the storage type is also useful), but I think also showing the identifier name of the extension type can be useful.

(happy to work on this if there is agreement to change this)

@pitrou
Copy link
Member

pitrou commented Oct 22, 2020

That would sound ok to me.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants