ENH: Basis for a StringDtype using Arrow #35259

xhochy · 2020-07-13T10:20:01Z

xref Plan for a native string dtype #35169
tests added / passed
passes black pandas
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

xhochy · 2020-07-13T10:25:21Z

I will focus on using Arrow master, as I expect that we will need to add some functionality to Arrow anyway before this is in any mergeable state.

jorisvandenbossche

Cool! Added a few quick comments

jorisvandenbossche · 2020-07-13T10:55:48Z

pandas/core/arrays/string_arrow.py

+            for buf in chunk.buffers():
+                if buf is not None:
+                    size += buf.size
+        return size


ChunkedArray has an nbytes property nowadays, so I think this can be return self.data.nbytes

jorisvandenbossche · 2020-07-13T10:57:46Z

pandas/core/arrays/string_arrow.py

+
+        This should return a 1-D array the same length as 'self'.
+        """
+        return self.data.is_null()


This returns a pyarrow array, right? Probably want to convert it into a pandas BooleanArray (to use the nullable boolean dtype). BooleanDtype.__from_arrow__ implements a conversion (although I think that needs to be optimized; separate issue though)

xhochy · Jul 13, 2020

As this cannot be null, I will return a numpy array here. This is also what the current masked pandas arrays do.

TomAugspurger · Jul 15, 2020

No comment on what's preferable, but the interface does allow for non-ndarrays here. SparseArray.isna() returns a Sparse[bool] I think.

jorisvandenbossche · 2020-07-13T10:59:08Z

pandas/core/arrays/string_arrow.py

+            if item < 0:
+                item += len(self)
+            if item >= len(self):
+                return None


should this raise an error instead?
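
For illustration, the bounds handling questioned here could be sketched as follows. This is a hypothetical standalone helper (`normalize_index` is not a name from the PR), and it assumes the usual Python sequence convention that an out-of-range scalar index raises `IndexError` rather than returning None.

```python
def normalize_index(item: int, length: int) -> int:
    """Map a possibly negative scalar index onto [0, length) or raise."""
    original = item
    if item < 0:
        # Negative indices count from the end, as for Python sequences.
        item += length
    # After wrapping, anything still outside the valid range is an error.
    # This also catches indices that are "too negative", e.g. -5 for length 3.
    if item < 0 or item >= length:
        raise IndexError(
            f"index {original} is out of bounds for array of length {length}"
        )
    return item
```

With this convention, `normalize_index(-1, 3)` yields 2, while `normalize_index(3, 3)` raises instead of silently producing a missing value.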

jorisvandenbossche · 2020-07-13T11:00:00Z

pandas/core/arrays/string_arrow.py

+        if isinstance(value, pa.ChunkedArray):
+            return type(self)(value)
+        else:
+            return value.as_py()


None needs to be replaced here with pd.NA, I think?
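
A minimal sketch of the suggested replacement, written so that it runs without pandas or pyarrow installed. Everything here is a stand-in and an assumption: `NA` plays the role of pd.NA, `FakeScalar` models only the `as_py()` method of an Arrow scalar, and `scalar_to_python` is a hypothetical helper name.

```python
class _Missing:
    """Stand-in for pd.NA so the sketch has no third-party dependencies."""

    def __repr__(self) -> str:
        return "<NA>"


NA = _Missing()


class FakeScalar:
    """Stand-in for an Arrow scalar; only as_py() is modeled."""

    def __init__(self, value):
        self._value = value

    def as_py(self):
        return self._value


def scalar_to_python(value):
    """Convert a scalar to a Python object, mapping null to the NA sentinel."""
    result = value.as_py()
    # A null Arrow scalar converts to None; surface it as the missing-value
    # sentinel instead so callers never see a bare None.
    return NA if result is None else result
```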

Josh-Ring-jisc · 2020-08-28T07:59:51Z

General question from the other thread about mutability. If we can pad each string to the UTF-8 max string length in arrow, can we better support mutability without forcing a complete reallocation of the array?

xhochy · 2020-08-28T08:37:56Z

@Josh-Ring-jisc Please stop posting the same question in multiple places, I'll answer in the first pandas issue.

xhochy · 2020-10-19T09:55:57Z

@simonjayhawkins Should we coordinate somewhere (e.g. here)? I do a bit of work over at https://github.com/xhochy/fletcher that would also be helpful here, and I can probably contribute pointers to things in Arrow if needed.

simonjayhawkins · 2020-10-19T10:05:59Z

@xhochy Over the last few days, I've begun getting acquainted with fletcher and the string array work in Arrow. At this point I'm not sure where the best place to coordinate would be.

I think it would also be beneficial to contribute to fletcher as part of this exercise. Since the different string methods are separate issues in fletcher, could also coordinate there on specific methods.

simonjayhawkins · 2020-10-19T15:48:37Z

> I will focus on using Arrow master, as I expect that we will need to add some functionality to Arrow anyway before this is in any mergeable state.

There is some discussion in #35169 on fallback options. Should this PR move forward with the current min version of pyarrow that we support?

xhochy · 2020-10-19T17:55:31Z

> > I will focus on using Arrow master, as I expect that we will need to add some functionality to Arrow anyway before this is in any mergeable state.
>
> There is some discussion in #35169 on fallback options. Should this PR move forward with the current min version of pyarrow that we support?

The performance improvements of this new dtype will only come when we use the new string functions in Arrow. I know from some colleagues that there are already slight benefits with pyarrow>=0.17, as the storage is more efficient and simple things like groupby are sometimes a bit faster or use the GIL slightly less. However, the fallback of going Arrow-to-object-to-Arrow for the algorithms makes it worse than using the object dtype directly.

If we want to get this into pandas as soon as possible, I would recommend using either pyarrow 1.0 or 2.0 (will be released now/tomorrow) as the minimal version for the string dtype, while keeping the minimal supported pyarrow version stable for other features like Parquet reading. One benefit of the 1.0 release is that it brings better take support (and the new pyarrow.compute API in general) with it. In 2.0 we get more string algorithms, which would just improve performance but don't have a big implication on the implementation, thanks to @TomAugspurger's PR that enables bit-by-bit overloading of the string functions.

xhochy · 2020-10-19T18:01:09Z

> @xhochy Over the last few days, I've begun getting acquainted with fletcher and the string array work in Arrow. At this point I'm not sure where the best place to coordinate would be.
>
> I think it would also be beneficial to contribute to fletcher as part of this exercise. Since the different string methods are separate issues in fletcher, could also coordinate there on specific methods.

I would like to keep fletcher separate but try to minimise its codebase over time. The main reason for its development initially was to give input into Arrow development. For example, all the numba-based algorithms should vanish over time and be replaced by their C++ counterparts in Arrow; pandas should only use the Arrow ones.

The string-related and general purpose things from https://github.com/xhochy/fletcher/blob/master/fletcher/base.py are probably the bits that we need to copy&paste&polish in this PR here. That should already bring us to a working but not ultra-fast dtype. I hope that we need nothing from the algorithms/ folder anymore.

Otherwise, once everything here is implemented, there is still a place for fletcher. It will behave slightly differently from pandas.ArrowStringDtype in that it will return an Arrow-backed Series for all its results. I think the dtype here should return the standard pandas-nullable dtypes.

jorisvandenbossche · 2020-10-19T18:14:46Z

Agreed with @xhochy's points. We can require pyarrow 1.0 as a minimum version for the string dtype (and keep supporting older versions for other parts, like parquet reading). More string algorithms will only be added step by step to pyarrow in the next releases, so we will need to deal with this incomplete and varying coverage of string algorithms in pyarrow anyway, I think.

> I think the dtype here should return the standard pandas-nullable dtypes.

Indeed.

simonjayhawkins · 2020-10-19T18:32:05Z

So we raise a user-friendly message if pyarrow < 1.0, and if the installed pyarrow version includes the string kernel we dispatch to it, otherwise we use a fallback, so that we can support both 1.0 and 2.0?
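
The policy proposed in this comment could look roughly like the sketch below. It is not the PR's implementation: the helper names (`parse_version`, `require_min_pyarrow`, `dispatch`) and the two dictionaries are hypothetical, and the `kernels` mapping merely stands in for whatever check determines that the installed pyarrow provides a given string kernel.

```python
from itertools import takewhile
from typing import Callable, Dict, List, Optional, Tuple

MIN_PYARROW = (1, 0, 0)  # minimum discussed in this thread for the string dtype

Values = List[Optional[str]]
Kernel = Callable[[Values], list]


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn '2.0.0.dev10' into (2, 0, 0); missing components count as 0."""
    parts = []
    for piece in version.split(".")[:3]:
        digits = "".join(takewhile(str.isdigit, piece))
        if not digits:
            break
        parts.append(int(digits))
    parts += [0] * (3 - len(parts))
    return tuple(parts)


def require_min_pyarrow(installed: str) -> None:
    """Raise a readable error when the installed version is too old."""
    if parse_version(installed) < MIN_PYARROW:
        raise ImportError(
            "pyarrow>=1.0.0 is required for the Arrow-backed string dtype, "
            f"found {installed}"
        )


def dispatch(
    name: str,
    values: Values,
    kernels: Dict[str, Kernel],
    fallbacks: Dict[str, Kernel],
) -> list:
    """Prefer a native kernel when one is available, else use the fallback."""
    kernel = kernels.get(name)
    if kernel is not None:
        return kernel(values)
    # No native kernel in this version: take the slower object-based path.
    return fallbacks[name](values)
```

Because the lookup happens per method, coverage can grow release by release: a method moves from the fallback path to the native path as soon as a kernel for it appears in `kernels`.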

jorisvandenbossche · 2020-11-20T14:19:34Z

> I think CI should be green here now and that we could probably mark as ready for review for other reviewers.
>
> Most of @jreback's outstanding comments relate to typing, which I will focus on now, but could be done in parallel if we are happy to merge this so we can get started on the rest of the task.

Yes, this is indeed ready to go as a first basis. The remaining type-related comments are handled in separate PRs.

jorisvandenbossche · 2020-11-20T14:23:43Z

Thanks @xhochy for getting this started, and @simonjayhawkins for the follow up work !

There are many follow-up items from the discussion above, which I think should mostly be summarized at #35169 (comment) (and in addition also the public exposure needs to be worked on)

But excited to see the start of this work merged!

TomAugspurger · 2020-11-20T20:50:34Z

Thanks all!

xhochy added 5 commits July 11, 2020 11:14

Implement BaseDtypeTests for ArrowStringDtype

4c2e37a

Implement getitem

d477ee7

Add basic copy implementation

206f493

Implement getitem for iterables

d58dba6

Remove commented code

7a9e2c3

xhochy mentioned this pull request Jul 13, 2020

Plan for a native string dtype #35169

Closed

jorisvandenbossche reviewed Jul 13, 2020

xhochy added 6 commits July 13, 2020 13:44

Implement more Setitem/Getitem variants

ffc4c0f

Review comments by @jorisvandenbossche

c1305ab

Add Arrow issue numbers

13a42f7

Adopt to kernel renamings

decd022

Handle take(indices<0, allow_fill=False)

3145e44

Handle fill_value better

e22b348

jreback added the ExtensionArray (Extending pandas with custom dtypes or arrays.) and Strings (String extension data type and string data) labels Jul 16, 2020

TomAugspurger mentioned this pull request Sep 5, 2020

Arrow string array dtype #36142

Closed

github-actions bot added the Stale label Sep 27, 2020

Merge remote-tracking branch 'upstream/master' into arrow-string-array

4b8108c

simonjayhawkins removed the Stale label Oct 19, 2020

fix doctest

2446562

simonjayhawkins added 3 commits November 17, 2020 10:51

add comment on pyarrow compute

27c8de5

privatize data

b6713e9

Merge remote-tracking branch 'upstream/master' into arrow-string-array

125cb6f

jorisvandenbossche approved these changes Nov 20, 2020

jorisvandenbossche merged commit 7077a08 into pandas-dev:master Nov 20, 2020

This was referenced Nov 22, 2020

ENH: Arrow backed string array - implement factorize() method without casting to objects #38007

Merged

TYP: __getitem__ method of EA (2nd pass) #37921

Closed

ADraginda mentioned this pull request Dec 30, 2020

BUG: pandas 1.2.0 and Pyarrow [0.16.0, 1.0.0) are incompatible for some column types #38801

Closed

simonjayhawkins mentioned this pull request Dec 30, 2020

BUG: avoid attribute error with pyarrow >=0.16.0 and <1.0.0 #38803

Merged

This was referenced Mar 29, 2021

TST: [ArrowStringArray] add dtype parameterisation to test_astype_float and test_fillna_args #40677

Merged

TST: [ArrowStringArray] remove xfail from test_repr #40678

Merged

TST: [ArrowStringArray] more parameterised testing - part 1 #40679

Merged
