ENH: Basis for a StringDtype using Arrow #35259
I will focus on using Arrow master, as I expect that we will need to add some functionality to Arrow anyway before this is in any mergeable state.
Cool! Added a few quick comments
pandas/core/arrays/string_arrow.py
Outdated
    for buf in chunk.buffers():
        if buf is not None:
            size += buf.size
return size
`ChunkedArray` has an `nbytes` property nowadays, so I think this can be `return self.data.nbytes`.
pandas/core/arrays/string_arrow.py
Outdated
This should return a 1-D array the same length as 'self'.
"""
return self.data.is_null()
This returns a pyarrow array, right? Probably want to convert it into a pandas BooleanArray (to use the nullable boolean dtype). `BooleanDtype.__from_arrow__` implements a conversion (although I think that needs to be optimized; separate issue though).
As this cannot be null, I will return a numpy array here. This is also what the current masked pandas arrays do.
No comment on what's preferable, but the interface does allow for non-ndarrays here. SparseArray.isna() returns a Sparse[bool] I think.
pandas/core/arrays/string_arrow.py
Outdated
if item < 0:
    item += len(self)
if item >= len(self):
    return None
should this raise an error instead?
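To make the question concrete, here is a minimal sketch of what raising could look like for an integer index. The helper name `normalize_index` is hypothetical and not part of the PR; it only illustrates the bounds check, including the case where an index is still negative after adding the length.

```python
def normalize_index(item: int, length: int) -> int:
    """Map a possibly negative integer index onto range(length), or raise.

    Hypothetical helper for illustration; not taken from the PR.
    """
    original = item
    if item < 0:
        item += length  # -1 refers to the last element
    if not 0 <= item < length:
        # Raise instead of silently returning None for out-of-range indices.
        raise IndexError(
            f"index {original} is out of bounds for array of length {length}"
        )
    return item
```

This mirrors how Python sequences behave: `normalize_index(-1, 3)` gives `2`, while `normalize_index(3, 3)` and `normalize_index(-4, 3)` both raise `IndexError`.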
pandas/core/arrays/string_arrow.py
Outdated
if isinstance(value, pa.ChunkedArray):
    return type(self)(value)
else:
    return value.as_py()
None needs to be replaced here with pd.NA, I think?
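A rough sketch of that replacement, kept dependency-free on purpose: `NA` below is a placeholder standing in for `pd.NA`, and `FakeScalar` is a hypothetical stand-in for an Arrow scalar (anything exposing `as_py()`). Neither name comes from the PR.

```python
class _NAType:
    """Placeholder for pandas.NA so the sketch runs without pandas."""

    def __repr__(self) -> str:
        return "<NA>"


NA = _NAType()


class FakeScalar:
    """Hypothetical minimal stand-in for an Arrow scalar."""

    def __init__(self, value):
        self._value = value

    def as_py(self):
        return self._value


def scalar_to_python(scalar):
    """Convert a scalar to a Python value, mapping a null (None) to NA."""
    value = scalar.as_py()
    return NA if value is None else value
```

In the real array the final `else` branch would presumably return `pd.NA` in place of `None`, so that missing values look the same as in the existing string dtype.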
General question from the other thread about mutability. If we can pad each string to the UTF-8 max string length in arrow, can we better support mutability without forcing a complete reallocation of the array?
@Josh-Ring-jisc Please stop posting the same question in multiple places, I'll answer in the first pandas issue.
@simonjayhawkins Should we coordinate somewhere? (e.g. here?) I do a bit of work over at https://github.com/xhochy/fletcher that would also be helpful here, and I can probably contribute pointers to things in Arrow if needed.
@xhochy Over the last few days, I've begun getting acquainted with fletcher and the string array work in arrow. At this point I'm not sure where the best place to coordinate would be. I think it would also be beneficial to contribute to fletcher as part of this exercise. Since the different string methods are separate issues in fletcher, we could also coordinate there on specific methods.
There is some discussion in #35169 on fallback options. Should this PR move forward with the current min version of pyarrow that we support?
The performance improvements of this new dtype will only come when we use the new string functions in Arrow. I know from some colleagues that there are already slight benefits with If we want to get this into
I would like to keep The string-related and general purpose things from https://github.com/xhochy/fletcher/blob/master/fletcher/base.py are probably the bits that we need to copy&paste&polish in this PR here. That should already bring us to a working but not ultra-fast dtype. I hope that we need nothing from the Otherwise, once everything here is implemented, there is still a place for
Agreed with @xhochy's points. We can require pyarrow 1.0 as a minimum version for the string dtype (and keep supporting older versions for other parts, like parquet reading). More string algorithms will only be added step by step to pyarrow in the next releases, so we will need to deal with this incomplete and varying coverage of string algorithms in pyarrow anyway, I think.
Indeed.
So we raise a user-friendly message if pyarrow < 1.0, and if the installed pyarrow version includes the string kernel we dispatch to it, otherwise we use a fallback, so that we can support both 1.0 and 2.0?
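The dispatch rule described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the function names are hypothetical, and the `KERNEL_ADDED_IN` table is invented for the example and does not reflect which kernels actual pyarrow releases ship.

```python
MIN_PYARROW = (1, 0)  # minimum version proposed in this thread

# Hypothetical table: string method -> first pyarrow version assumed to
# provide a kernel for it. The entries are made up for illustration.
KERNEL_ADDED_IN = {"upper": (1, 0), "contains": (2, 0)}


def parse_version(version: str) -> tuple:
    """Return (major, minor) from a version string such as '1.0.1'."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def choose_implementation(method: str, pyarrow_version: str) -> str:
    """Return 'arrow' if a kernel is assumed available, else 'fallback'.

    Raises ImportError with a user-facing message if pyarrow is too old.
    """
    installed = parse_version(pyarrow_version)
    if installed < MIN_PYARROW:
        raise ImportError(
            "pyarrow>=1.0 is required for the Arrow-backed string dtype; "
            f"found {pyarrow_version}"
        )
    added_in = KERNEL_ADDED_IN.get(method)
    if added_in is not None and installed >= added_in:
        return "arrow"
    return "fallback"
```

With this shape, a method without a kernel in the installed version still works through the fallback path, and the table only needs a new entry as kernels land in later pyarrow releases.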
Yes, this is indeed ready to go as a first basis. The remaining type-related comments are handled in separate PRs.
Thanks @xhochy for getting this started, and @simonjayhawkins for the follow-up work! There are many follow-up items from the discussion above, which I think should mostly be summarized at #35169 (comment) (and in addition the public exposure also needs to be worked on). But excited to see the start of this work merged!
Thanks all!
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff