ENH: Basis for a StringDtype using Arrow #35259

Merged (91 commits) on Nov 20, 2020
4c2e37a
Implement BaseDtypeTests for ArrowStringDtype
xhochy Jul 10, 2020
d477ee7
Implement getitem
xhochy Jul 13, 2020
206f493
Add basic copy implementation
xhochy Jul 13, 2020
d58dba6
Implement getitem for iterables
xhochy Jul 13, 2020
7a9e2c3
Remove commented code
xhochy Jul 13, 2020
ffc4c0f
Implement more Setitem/Getitem variants
xhochy Jul 13, 2020
c1305ab
Review comments by @jorisvandenbossche
xhochy Jul 13, 2020
13a42f7
Add Arrow issue numbers
xhochy Jul 13, 2020
decd022
Adapt to kernel renamings
xhochy Jul 15, 2020
3145e44
Handle take(indices<0, allow_fill=False)
xhochy Jul 15, 2020
e22b348
Handle fill_value better
xhochy Jul 15, 2020
4b8108c
Merge remote-tracking branch 'upstream/master' into arrow-string-array
simonjayhawkins Oct 19, 2020
2446562
fix doctest
simonjayhawkins Oct 19, 2020
a0dcc85
Revert "fix doctest"
simonjayhawkins Oct 19, 2020
5c42173
change version for versionadded
simonjayhawkins Oct 19, 2020
28c3ef2
code checks
simonjayhawkins Oct 19, 2020
4044d4c
Merge remote-tracking branch 'upstream/master' into arrow-string-array
simonjayhawkins Oct 21, 2020
1740524
skip tests for pyarrow<1.0
simonjayhawkins Oct 21, 2020
e9bb36f
Merge remote-tracking branch 'upstream/master' into arrow-string-array
simonjayhawkins Oct 24, 2020
8ad120b
Merge remote-tracking branch 'upstream/master' into arrow-string-array
simonjayhawkins Nov 2, 2020
34bf57d
raise ImportError in constructors on pyarrow < 1.0.0. or not installed
simonjayhawkins Nov 2, 2020
f92241e
remove size, shape and ndim
simonjayhawkins Nov 2, 2020
c09382d
activate all extension array tests
simonjayhawkins Nov 2, 2020
bac64c1
string array tests
simonjayhawkins Nov 3, 2020
0956147
Update pandas/core/arrays/string_arrow.py
simonjayhawkins Nov 3, 2020
963e1cf
add a to_numpy() method and use from __array__
simonjayhawkins Nov 3, 2020
87b8e67
mypy fixup
simonjayhawkins Nov 3, 2020
1ed0585
remove workaround for ARROW-9407 and ci test on pyarrow=1.0.0
simonjayhawkins Nov 3, 2020
fa954f7
Merge remote-tracking branch 'upstream/master' into arrow-string-array
simonjayhawkins Nov 4, 2020
82b84bf
add _dtype class attribute
simonjayhawkins Nov 4, 2020
b1a3032
remove redundant integer indexing OOB and negative indexing checks in…
simonjayhawkins Nov 4, 2020
08d34f4
check pyarrow array is string type in constructor
simonjayhawkins Nov 4, 2020
ae49807
basic _from_factorized pending discussion on performant factorisation
simonjayhawkins Nov 4, 2020
2e5d4c7
update constructor error message and move test
simonjayhawkins Nov 4, 2020
c8318cc
add _concat_same_type classmethod
simonjayhawkins Nov 4, 2020
1a200a2
_as_pandas_scalar to method
simonjayhawkins Nov 4, 2020
e10be80
copy/paste fillna from fletcher as baseline (29 failed)
simonjayhawkins Nov 5, 2020
c1d3087
minor cleanup of fillna (29 failed)
simonjayhawkins Nov 5, 2020
34f563d
correct mistake in previous commit (25 failed)
simonjayhawkins Nov 5, 2020
f5fc4fd
add OpsMixin (23 failed)
simonjayhawkins Nov 5, 2020
a5a7c85
add binops (18 failed)
simonjayhawkins Nov 5, 2020
f651563
return Boolean array for comparison ops (12 failed)
simonjayhawkins Nov 5, 2020
f5419b9
fix ValueError: zero-size array to reduction operation maximum which …
simonjayhawkins Nov 5, 2020
3af5ce0
copy/paste value_counts from fletcher as baseline (5 failed)
simonjayhawkins Nov 5, 2020
bdf4ad2
tidy imports
simonjayhawkins Nov 5, 2020
e044c7f
fix test_take_non_na_fill_value (4 failed)
simonjayhawkins Nov 6, 2020
c5625a8
fix test_take_pandas_style_negative_raises (3 failed)
simonjayhawkins Nov 6, 2020
50889fb
parametrize string extension tests (3 failed)
simonjayhawkins Nov 6, 2020
0e1773b
xfail other 2 tests expecting views (1 failed)
simonjayhawkins Nov 6, 2020
7bb9574
add ensure_string_array to _from_sequence (1 failed)
simonjayhawkins Nov 6, 2020
fc45ef7
Merge remote-tracking branch 'upstream/master' into arrow-string-array
simonjayhawkins Nov 12, 2020
51d7d0a
Apply suggestions from code review
simonjayhawkins Nov 12, 2020
bd76a75
Merge branch 'arrow-string-array' of github.com:xhochy/pandas into ar…
simonjayhawkins Nov 12, 2020
3cf5c91
return NotImplemented in comparisons (7 failed)
simonjayhawkins Nov 12, 2020
07239a0
move arrow function lookup dict to module scope (7 failed)
simonjayhawkins Nov 12, 2020
9a7cfc5
remove isinstance(other, (ABCSeries, ABCDataFrame, ABCIndex)) check
simonjayhawkins Nov 12, 2020
2ba0dcd
remove na_value=cls._dtype.na_value from ensure_string_array call (7 …
simonjayhawkins Nov 13, 2020
97c56e2
colocate _from_sequence_of_strings with _from_sequence (7 failed)
simonjayhawkins Nov 13, 2020
d6d3543
revert change to extra_compile_args in setup.py
simonjayhawkins Nov 13, 2020
ab40dce
Merge remote-tracking branch 'upstream/master' into arrow-string-array
simonjayhawkins Nov 13, 2020
d71a895
sync fillna docstring with base
simonjayhawkins Nov 13, 2020
f342b62
Apply suggestions from code review
simonjayhawkins Nov 13, 2020
3d05c89
Merge branch 'arrow-string-array' of github.com:xhochy/pandas into ar…
simonjayhawkins Nov 13, 2020
b3c6347
other base.Base*Tests -> super()
simonjayhawkins Nov 13, 2020
26bca25
len(item) == 0 -> not len(item)
simonjayhawkins Nov 13, 2020
9579444
update copy docstring and return type
simonjayhawkins Nov 13, 2020
88094a7
test_constructor_not_string_type_raises with np.ndarray
simonjayhawkins Nov 13, 2020
ba0cee8
update test_from_sequence_no_mutate (7 failed)
simonjayhawkins Nov 13, 2020
6709ac3
change xfail message for base extension array tests (7 failed)
simonjayhawkins Nov 13, 2020
11388b4
change xfail reason message in test_value_counts_na
simonjayhawkins Nov 13, 2020
eb284e7
skip test_memory_usage for ArrowStringArray
simonjayhawkins Nov 13, 2020
27ce19a
Merge remote-tracking branch 'upstream/master' into arrow-string-array
simonjayhawkins Nov 14, 2020
9b70709
part implementation of na_value in to_numpy
simonjayhawkins Nov 14, 2020
6757feb
remove is_array_like in __getitem__
simonjayhawkins Nov 14, 2020
460ea38
Revert "remove is_array_like in __getitem__"
simonjayhawkins Nov 14, 2020
7bee5e2
remove just is_array_like in __getitem__
simonjayhawkins Nov 14, 2020
91f3763
Update pandas/core/arrays/string_arrow.py
simonjayhawkins Nov 14, 2020
36b662a
Apply suggestions from code review
simonjayhawkins Nov 14, 2020
7a9ef9c
lint fixup
simonjayhawkins Nov 14, 2020
5db8788
xfail test_astype_roundtrip
simonjayhawkins Nov 14, 2020
c76c39f
update expected in test_arrow_array
simonjayhawkins Nov 14, 2020
87b7863
Merge remote-tracking branch 'upstream/master' into arrow-string-array
simonjayhawkins Nov 15, 2020
24a782d
add fallback for scalar comparison ops
simonjayhawkins Nov 15, 2020
353bff9
dispatch to pyarrow for comparison with np.ndarray (1 failed)
simonjayhawkins Nov 15, 2020
be93947
fix test_reindex_non_na_fill_value
simonjayhawkins Nov 16, 2020
11eb08f
Merge remote-tracking branch 'upstream/master' into arrow-string-array
simonjayhawkins Nov 16, 2020
52440a7
use fill_mask in pa indices_array
simonjayhawkins Nov 16, 2020
bd05c2c
add comment to __getitem__
simonjayhawkins Nov 16, 2020
27c8de5
add comment on pyarrow compute
simonjayhawkins Nov 17, 2020
b6713e9
privatize `data`
simonjayhawkins Nov 17, 2020
125cb6f
Merge remote-tracking branch 'upstream/master' into arrow-string-array
simonjayhawkins Nov 17, 2020
2 changes: 1 addition & 1 deletion pandas/_libs/lib.pyx
@@ -636,7 +636,7 @@ cpdef ndarray[object] ensure_string_array(
     ----------
     arr : array-like
         The values to be converted to str, if needed.
-    na_value : Any
+    na_value : Any, default np.nan
         The value to use for na. For example, np.nan or pd.NA.
     convert_na_value : bool, default True
         If False, existing na values will be used unchanged in the new array.
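
To illustrate the two parameters documented in this hunk, here is a minimal pure-Python sketch of the described behaviour. It is not the Cython implementation in `lib.pyx`: `ensure_string_list` and `_is_missing` are hypothetical names, and the missing-value check is simplified to `None` and float NaN as an assumption.

```python
import math


def _is_missing(val):
    # Simplified missing-value check (assumption): None or float NaN only.
    return val is None or (isinstance(val, float) and math.isnan(val))


def ensure_string_list(values, na_value=float("nan"), convert_na_value=True):
    """Hypothetical stand-in mirroring the documented parameters.

    Non-missing values are converted to str if needed. Missing values are
    replaced by ``na_value`` unless ``convert_na_value`` is False, in which
    case they are passed through unchanged.
    """
    result = []
    for val in values:
        if _is_missing(val):
            result.append(na_value if convert_na_value else val)
        elif isinstance(val, str):
            result.append(val)
        else:
            result.append(str(val))
    return result


converted = ensure_string_list(["a", 1, None], na_value="<NA>")
untouched = ensure_string_list(["a", None], convert_na_value=False)
```

With these inputs, `converted` holds the strings `"a"` and `"1"` followed by the chosen `na_value`, while `untouched` keeps the original `None`.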
7 changes: 6 additions & 1 deletion pandas/core/arrays/base.py
@@ -468,14 +468,19 @@ def astype(self, dtype, copy=True):
             NumPy ndarray with 'dtype' for its dtype.
         """
         from pandas.core.arrays.string_ import StringDtype
+        from pandas.core.arrays.string_arrow import ArrowStringDtype

         dtype = pandas_dtype(dtype)
         if is_dtype_equal(dtype, self.dtype):
             if not copy:
                 return self
             else:
                 return self.copy()
-        if isinstance(dtype, StringDtype):  # allow conversion to StringArrays
+
+        # FIXME: Really hard-code here?
+        if isinstance(
+            dtype, (ArrowStringDtype, StringDtype)
+        ):  # allow conversion to StringArrays
             return dtype.construct_array_type()._from_sequence(self, copy=False)

         return np.array(self, dtype=dtype, copy=copy)
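
The dispatch in this hunk can be sketched without pandas. The classes below are hypothetical toy stand-ins (not the real `StringDtype`, `ArrowStringDtype`, or their array types). They only show the pattern: if the target dtype is one of the string dtypes, the dtype names its array type and that type's `_from_sequence` builds the result; otherwise a generic conversion is used.

```python
class ToyStringArray:
    """Hypothetical stand-in for a string extension array."""

    def __init__(self, values):
        self.values = list(values)

    @classmethod
    def _from_sequence(cls, scalars, copy=False):
        # Convert every non-missing scalar to str; keep None as missing.
        return cls(None if x is None else str(x) for x in scalars)


class ToyArrowStringArray(ToyStringArray):
    """Hypothetical stand-in for the Arrow-backed variant."""


class ToyStringDtype:
    def construct_array_type(self):
        return ToyStringArray


class ToyArrowStringDtype:
    def construct_array_type(self):
        return ToyArrowStringArray


def astype_sketch(values, dtype):
    # Same shape as the hunk: both string dtypes share one branch, and the
    # dtype itself decides which array class does the construction.
    if isinstance(dtype, (ToyArrowStringDtype, ToyStringDtype)):
        return dtype.construct_array_type()._from_sequence(values, copy=False)
    # Generic fallback, standing in for np.array(self, dtype=dtype, copy=copy).
    return [dtype(v) for v in values]


arrow_result = astype_sketch([1, 2, None], ToyArrowStringDtype())
plain_result = astype_sketch([1, 2], ToyStringDtype())
```

Because the branch only asks the dtype for its array type, adding a further string dtype means extending the `isinstance` tuple, which is the hard-coding the `FIXME` comment questions.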