ARROW-2611: [Python] Fix Python 2 integer serialization#2055
ARROW-2611: [Python] Fix Python 2 integer serialization#2055pschafhalter wants to merge 3 commits intoapache:masterfrom
Conversation
There was a problem hiding this comment.
I think it's slightly clearer to do
#if PY_MAJOR_VERSION < 3
// handle Py2 case
#else
// handle Py3 case
#endifinstead of
#if PY_MAJOR_VERSION < 3
// handle Py2 case
#endif
// handle Py3 caseThere was a problem hiding this comment.
There is only a Python 2 case here, the "Python 3" case needs to be handled in both Python 2 and Python 3 (in Python 2 there is both int and long).
There was a problem hiding this comment.
I see, actually, now that I think about it, we don't need the #if, right?
There was a problem hiding this comment.
It's needed to get it to compile, since in Python 3 the function PyInt_FromSsize_t doesn't exist. Arrow has some common code to handle differences like this but I didn't find anything that covers this function.
pitrou
left a comment
There was a problem hiding this comment.
I think this PR should simply touch the deserialization path without creating a distinct serialization format for int (as opposed to long).
There was a problem hiding this comment.
I don't think this it the right approach. int and long are pretty much interchangeable in Python 2. So only the deserialization needs to be changed and attempt to build an int before falling back on long.
There was a problem hiding this comment.
An issue with this approach is that a serialized long with a small numeric value will be deserialized as an int which could cause issues for applications relying on type checking.
For example:
In [9]: value = 0L
In [10]: type(value)
Out[10]: long
In [11]: serialized = pa.serialize(value)
In [12]: deserialized = serialized.deserialize()
In [13]: type(deserialized)
Out[13]: intThere was a problem hiding this comment.
Accepting only long and refusing int sounds like a weird thing to do, though. Can you point an API that works this way?
(there are rare APIs that do the reverse - accept only int and reject long - but usually it's because they were written in C and the programmer couldn't care more)
There was a problem hiding this comment.
Sorry, I don't know one off the top of my head -- I agree that it's bad practice. I do think that being consistent with types is important from a correctness point of view.
Pickle, for reference, does not deserialize a small long as an int.
There was a problem hiding this comment.
That's true. Let's go with this patch, then. Hopefully we'll ditch Python 2 compatibility soon ;-)
|
Great, thanks for the discussion and @pschafhalter for getting this to work, I'll merge it then :) [Retriggering the build first] |
e2adf74 to
7b96b67
Compare
Codecov Report
@@ Coverage Diff @@
## master #2055 +/- ##
==========================================
+ Coverage 86.28% 86.28% +<.01%
==========================================
Files 242 242
Lines 41034 41041 +7
==========================================
+ Hits 35405 35412 +7
Misses 5629 5629
Continue to review full report at Codecov.
|
|
Looking at this late. This fix seems odd to me but I haven't looked closely enough at the nuances to judge for certain. Would it be possible to have only a single int64 array in the serialized representation and then convert back to either PyInt or PyLong depending on the size of the value? In Python 2 it seems like |
|
I think the discussion with @pitrou touched on this. The main issue with converting to PyInt or PyLong depending on the size of the value is that if I serialize a |
Fixes an issue where serialization turns integers into longs in Python 2.