Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Basis for a StringDtype using Arrow #35259

Merged
merged 91 commits into from
Nov 20, 2020

Conversation

xhochy
Copy link
Contributor

@xhochy xhochy commented Jul 13, 2020

@xhochy
Copy link
Contributor Author

xhochy commented Jul 13, 2020

I will focus on using Arrow master as I expected that we will need to add some functionality to Arrow anyways before this in any mergable state.

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool! Added a few quick comments

for buf in chunk.buffers():
if buf is not None:
size += buf.size
return size
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ChunkedArray has an nbytes property nowawadays, so I think this can be return self.data.nbytes


This should return a 1-D array the same length as 'self'.
"""
return self.data.is_null()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This returns a pyarrow array, right? Probably want to convert it into a pandas BooleanArray (to use the nullable boolean dtype). BooleanDtype.__from_arrow__ implements a conversion (although I think that needs to be optimized; separate issue though)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As this cannot be null, I will return a numpy array here. This is also what the current masked pandas arrays do.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No comment on what's preferable, but the interface does allow for non-ndarrays here. SparseArray.isna() returns a Sparse[bool] I think.

if item < 0:
item += len(self)
if item >= len(self):
return None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this raise an error instead?

if isinstance(value, pa.ChunkedArray):
return type(self)(value)
else:
return value.as_py()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

None needs to be replaced here with pd.NA, I think?

@jreback jreback added ExtensionArray Extending pandas with custom dtypes or arrays. Strings String extension data type and string data labels Jul 16, 2020
@Josh-Ring-jisc
Copy link

General question from the other thread about mutability. If we can pad each string to the UTF-8 max string length in arrow, can we better support mutability without forcing a complete reallocation of the array?

@xhochy
Copy link
Contributor Author

xhochy commented Aug 28, 2020

@Josh-Ring-jisc  Please stop posting the same question in multiple places, I'll answer in the first pandas issue.

@xhochy
Copy link
Contributor Author

xhochy commented Oct 19, 2020

@simonjayhawkins Should we coordinate somewhere? (e.g. here?) I do a bit of work over at https://github.com/xhochy/fletcher that would also be helpful here as well can I probably contribute with pointer to things in Arrow if needed.

@simonjayhawkins
Copy link
Member

@xhochy Over the last few days, I've began getting acquainted with fletcher and the string array work in arrow. At this point I'm not sure where the best place to coordinate would be.

I think it would also be beneficial to contribute to fletcher as part of this exercise. Since the different string methods are separate issues in fletcher, could also coordinate there on specific methods.

@simonjayhawkins
Copy link
Member

I will focus on using Arrow master as I expected that we will need to add some functionality to Arrow anyways before this in any mergable state.

There is some discussion in #35169 on fallback options. Should this PR move forward with the current min version of pyarrow that we support?

@xhochy
Copy link
Contributor Author

xhochy commented Oct 19, 2020

I will focus on using Arrow master as I expected that we will need to add some functionality to Arrow anyways before this in any mergable state.

There is some discussion in #35169 on fallback options. Should this PR move forward with the current min version of pyarrow that we support?

The performance improvements of this new dtype will only come when we use the new string functions in Arrow. I now from some colleagues that there are already slight benefits with pyarrow>=0.17 as the storage is more efficient and simple things like groupby are sometimes a bit faster or use the GIL slightly less but the fallback of going Arrow-to-object-to-Arrow for the algorithms makes it worse than using the object dtype directly.

If we want to get this into pandas as soon as possible, I would recommend to either use pyarrow 1.0 or 2.0 (will be released now/tomorrow) as the minimal version for the string dtype and keep the minimal supported pyarrow version stable for other features like Parquet reading. One benefit of the 1.0 release is that it brings better take support (and the new pyarrow.compute API in general) with it, in 2.0 we get more string algorithms that would just improve performance but don't have a big implication on the implementation thanks to @TomAugspurger's PR that enables bit-by-bit overloading of the string functions.

@xhochy
Copy link
Contributor Author

xhochy commented Oct 19, 2020

@xhochy Over the last few days, I've began getting acquainted with fletcher and the string array work in arrow. At this point I'm not sure where the best place to coordinate would be.

I think it would also be beneficial to contribute to fletcher as part of this exercise. Since the different string methods are separate issues in fletcher, could also coordinate there on specific methods.

I would like to keep fletcher separate but try to minimise its codebase over time. Its main reason of development initially was to give input into Arrow development. For example all the numba-based algorithms should vanish over time and be replaced by their C++ counterparts in Arrow, pandas should only use the Arrow ones.

The string-related and general purpose things from https://github.com/xhochy/fletcher/blob/master/fletcher/base.py are probably the bits that we need to copy&paste&polish in this PR here. That should already bring us to a working but not ultra-fast dtype. I hope that we need nothing from the algorithms/ folder anymore.

Otherwise, once everything here is implemented, there is still a place for fletcher. It will behave slightly different than the pandas.ArrowStringDtype in that it will return for all its results an Arrow-backed Series. I think the dtype here should return the standard pandas-nullable dtypes.

@jorisvandenbossche
Copy link
Member

Agreed wih @xhochy's points. We can require pyarrow 1.0 as a minimum version for the string dtype (and keep supporting older versions for other parts, like parquet reading). More string algorithms will only be added step by step to pyarrow in the next releases, so we will need to deal with this incomplete and varying coverage of string algorithms in pyarrow anyway, I think.

I think the dtype here should return the standard pandas-nullable dtypes.

Indeed.

@simonjayhawkins
Copy link
Member

So we raise user-friendly message if pyarrow < 1.0 an if pyarrow version installed includes string kernel, we dispatch otherwise use a fallback so that we can support 1.0 and 2.0?

@jorisvandenbossche
Copy link
Member

I think ci should be green here now and that we could probably mark as ready for review for other reviewers.

Most of @jreback outstanding comments relate to typing, which I will focus on now, but could be done in parallel if we are happy to merge this so we can get started on the rest of the task.

Yes, this is indeed ready to go as a first basis. The remaining type-related comments are handled in separate PRs.

@jorisvandenbossche jorisvandenbossche merged commit 7077a08 into pandas-dev:master Nov 20, 2020
@jorisvandenbossche
Copy link
Member

Thanks @xhochy for getting this started, and @simonjayhawkins for the follow up work !

There are many follow-up items from the discussion above, which I think should mostly be summarized at #35169 (comment) (and in addition also the public exposure needs to be worked on)

But excited to see the start of this work merged!

@TomAugspurger
Copy link
Contributor

Thanks all!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ExtensionArray Extending pandas with custom dtypes or arrays. Strings String extension data type and string data
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants