ARROW-9941: [Python] Better string representation for extension types #8312
|
The problem is that this won't be visible from C++ code, which will still get:
>>> ty
IntegerType(DataType(int64))
>>> pa.DataType.__str__(ty)
'extension<arrow.py_extension_type>'
Solving this requires some collaboration with the C++ side. I advise you to take a look at |
|
@pitrou Thanks for the feedback! I figured it must be more involved than this ;) I'll see if I can figure out the C++ side later today. |
|
If I do something like the following & revert the Python changes, I get this result. And if I keep the Python changes & the C++ changes, I see this: What are the desired return values? PS. I don't want to take up too much of your time. I won't be offended if you decide it's quicker to just implement this yourself. That, and I don't know C++, so anything I do attempt should be heavily scrutinized ;) Thanks again for taking the time to review my first attempt and pointing me in the right direction. |
Force-pushed from 8c68db4 to e28e27d.
|
I amended the commit to return this:
|
The problem is that this doesn't distinguish between different extension types with the same storage type. For example, two extension types representing UUIDs and IPv6 addresses, respectively, would get the same string representation (because both would be backed by a 16-byte fixed-size binary). It would be more useful if the string representation looked like Optionally this could be made configurable as well. |
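To make the collision concrete, here is a minimal pure-Python sketch. It does not use pyarrow: `ToyExtensionType`, `UuidType`, `Ipv6Type` and both formatting helpers are hypothetical stand-ins, and the storage-only format is an assumption about what a storage-based representation could look like, not the output of the amended commit.

```python
class ToyExtensionType:
    """Hypothetical stand-in for an extension type (not the pyarrow API)."""

    def __init__(self, storage_type):
        self.storage_type = storage_type


class UuidType(ToyExtensionType):
    def __init__(self):
        # Assumed storage: a 16-byte fixed-size binary.
        super().__init__("fixed_size_binary[16]")


class Ipv6Type(ToyExtensionType):
    def __init__(self):
        # Same assumed storage as UuidType.
        super().__init__("fixed_size_binary[16]")


def str_by_storage(ty):
    # Representation built from the storage type only (assumed format).
    return f"extension<{ty.storage_type}>"


def str_by_class(ty):
    # Representation that also carries the Python class name.
    return f"extension<arrow.py_extension_type<{type(ty).__name__}>>"


uuid_ty, ipv6_ty = UuidType(), Ipv6Type()
# Storage-based strings collide for the two types.
print(str_by_storage(uuid_ty) == str_by_storage(ipv6_ty))
# Class-based strings stay distinct.
print(str_by_class(uuid_ty), str_by_class(ipv6_ty))
```

Including the class name is only one way to disambiguate; any identifier that is unique per extension type would serve the same purpose.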
Force-pushed from 3295bf0 to 6618c63.
|
To be clear, the 2 Windows CI failures are unrelated. |
|
I had hoped I could just use something like the following, but it's segfaulting. I think I need to go back to first principles and do a few C++ tutorials ;)
|
Force-pushed from 4fe54cb to 720082c.
|
I have it returning No idea if it's right though. Thanks for your patience :) |
Force-pushed from 720082c to a9ee7e1.
pitrou
left a comment
Thank you very much, this looks good in principle now. Just two minor things remaining.
python/pyarrow/types.pxi
Outdated
I think this can be removed now (after which the str test will need to be fixed).
Just to confirm, if I remove the __str__, the output changes as follows. Is this what you're looking for? Thanks!
def test_ext_type_str():
    ty = IntegerType()
    expected = "extension<arrow.py_extension_type<IntegerType>>"
    assert str(ty) == expected
    assert pa.DataType.__str__(ty) == expected

def test_ext_type_repr():
    ty = IntegerType()
    expected = "IntegerType(extension<arrow.py_extension_type<IntegerType>>)"
    assert repr(ty) == expected
The str looks fine, the repr looks a bit unfortunate (though not terrible). It seems the repr could be just the same as str.
Okay, I've removed __str__ and implemented __repr__ instead to preserve IntegerType(DataType(int64)) for repr().
This works fine, but there is one thing missing. Since ToString can be called from arbitrary C++ code, the GIL needs to be taken. This is achieved by adding:
PyAcquireGIL lock;
at the beginning of this method definition.
Amended commit to add PyAcquireGIL, thanks.
Force-pushed from a9ee7e1 to fdf5084.
Before:
>>> ty
IntegerType(extension<arrow.py_extension_type>)
>>> str(ty)
'extension<arrow.py_extension_type>'
>>> pa.DataType.__str__(ty)
'extension<arrow.py_extension_type>'
>>> repr(ty)
'IntegerType(extension<arrow.py_extension_type>)'
After:
>>> ty
IntegerType(DataType(int64))
>>> str(ty)
'extension<arrow.py_extension_type<IntegerType>>'
>>> pa.DataType.__str__(ty)
'extension<arrow.py_extension_type<IntegerType>>'
>>> repr(ty)
'IntegerType(DataType(int64))'
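The split between `str()` and `repr()` in the "After" output can be modelled with a small pure-Python sketch. It is not the pyarrow implementation: `ToyExtensionType` and its `to_string` method are hypothetical, with `to_string` standing in for the shared string form that C++ callers would also see.

```python
class ToyExtensionType:
    """Hypothetical stand-in for a Python-defined extension type."""

    def __init__(self, storage_type):
        self.storage_type = storage_type

    def to_string(self):
        # Plays the role of the shared string form (assumed format).
        return f"extension<arrow.py_extension_type<{type(self).__name__}>>"

    def __str__(self):
        # str() delegates to the shared form.
        return self.to_string()

    def __repr__(self):
        # repr() stays Python-specific: class name plus storage type.
        return f"{type(self).__name__}(DataType({self.storage_type}))"


class IntegerType(ToyExtensionType):
    def __init__(self):
        super().__init__("int64")


ty = IntegerType()
print(str(ty))
print(repr(ty))
```

Keeping `__repr__` separate means the interactive prompt still shows the storage type, while anything that formats the type through the shared string form gets the class name.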
Force-pushed from fdf5084 to 5581c39.
pitrou
left a comment
+1, this is looking good, thank you!
|
Late comment on this PR: for the PyExtensionType subclasses, this is certainly a nice enhancement. But for custom extension types directly subclassing ExtensionType, this might not be needed. Using the PeriodType class from the tests, we now have: while before this was: Since here, the extension name is already unique and not the generic "arrow.py_extension_type", adding the class name to (happy to work on this if there is agreement to change this) |
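One possible reading of this suggestion, sketched in plain Python: append the class name only when the extension name is the generic one. The helper `extension_to_string` and the extension name `example.period` are hypothetical and chosen for illustration; they are not taken from pyarrow or its tests.

```python
GENERIC_NAME = "arrow.py_extension_type"


def extension_to_string(extension_name, class_name):
    """Hypothetical formatter: add the class name only for the generic name."""
    if extension_name == GENERIC_NAME:
        # The generic name is shared by many types, so disambiguate.
        return f"extension<{extension_name}<{class_name}>>"
    # A custom extension name is assumed to be unique already.
    return f"extension<{extension_name}>"


print(extension_to_string(GENERIC_NAME, "IntegerType"))
print(extension_to_string("example.period", "PeriodType"))
```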
|
That would sound ok to me. |
See: https://issues.apache.org/jira/browse/ARROW-9941