ENH: Implement arrow string option for various I/O methods #54431
pandas/io/orc.py
Outdated
    return df
else:
    return pa_table.to_pandas()
print("Ts")
remove
lithomas1
left a comment
Thanks for pushing this forward. I left some comments.
pandas/_libs/lib.pyx
Outdated
elif seen.str_:
    if is_string_array(objects):
        from pandas._config import get_option
        opt = get_option("future.infer_string")
Can we pass this in as a kwarg to maybe_convert_objects instead?
I'd rather only get the option if actually needed
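A minimal sketch of the trade-off in this exchange, using only the standard library. `OPTIONS`, `get_option`, and `infer_kind` here are hypothetical stand-ins, not pandas internals. Reading the option inside the string branch means the lookup cost is paid only when the input is actually all strings, whereas passing it in as a keyword argument would require the caller to read it on every call.

```python
# Hypothetical stand-ins: OPTIONS and get_option mimic an option registry.
# They are NOT pandas' real implementation.
OPTIONS = {"future.infer_string": False}
lookups = 0  # counts how often the registry is consulted


def get_option(name):
    global lookups
    lookups += 1
    return OPTIONS[name]


def infer_kind(values):
    """Return "string" or "object"; read the option only for all-string input."""
    if values and all(isinstance(v, str) for v in values):
        # The option is consulted only on this branch.
        if get_option("future.infer_string"):
            return "string"
    return "object"


infer_kind([1, 2, 3])   # no strings: the option is never read
infer_kind(["a", "b"])  # all strings: the option is read once
```

With a keyword argument the function would be easier to test in isolation, so the choice is between call-site simplicity and avoiding an unconditional lookup.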
pandas/core/dtypes/cast.py
Outdated
# coming out as np.str_!

dtype = _dtype_obj
opt = get_option("future.infer_string")
using_pyarrow_string_dtype?
Follow-up. This is introduced in the other PR (a little bit confusing, sorry).
added
def arrow_string_types_mapper() -> Callable:
    pa = import_optional_dependency("pyarrow")

    return {pa.string(): pd.ArrowDtype(pa.string())}.get
Thinking about this a little, is there a situation where you would want to mix pyarrow and numpy dtypes?
(I'm thinking maybe we should force users to pick the pyarrow dtype backend if you are using the pyarrow string type)
Yes, there are a lot of situations.
NumPy numerics combined with Arrow strings is still the fastest option, and NumPy numeric data is 2D. Forcing the pyarrow dtype backend right now is not a good idea.
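The mapper under discussion returns a dict's bound `.get` method, which yields `None` for any type it does not list. As far as I know, pyarrow's `Table.to_pandas(types_mapper=...)` treats `None` as "use the default conversion", which is what lets Arrow strings coexist with NumPy numerics. The sketch below shows only the idiom: plain strings stand in for the Arrow and pandas dtypes, and `convert_schema` is a hypothetical converter, so it runs without pyarrow installed.

```python
# Strings stand in for Arrow and pandas dtypes (assumption for illustration).
def make_string_mapper():
    # A dict's bound .get returns None for any key the dict does not contain.
    return {"string": "ArrowDtype(string)"}.get


def convert_schema(schema, types_mapper):
    # Hypothetical converter: use the mapped dtype when the mapper returns one,
    # otherwise fall back to a default NumPy-style dtype.
    default = {
        "int64": "numpy int64",
        "double": "numpy float64",
        "string": "numpy object",
    }
    return {col: types_mapper(t) or default[t] for col, t in schema.items()}


result = convert_schema({"a": "string", "b": "int64"}, make_string_mapper())
print(result)
```

Only the string column is redirected; every other column keeps its default, so no dtype backend has to be chosen globally.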
pandas/io/pytables.py
Outdated
values = self.read_array("values", start=start, stop=stop)
return Series(values, index=index, name=self.name, copy=False)
result = Series(values, index=index, name=self.name, copy=False)
if result.dtype.kind == "O" and using_pyarrow_string_dtype():
Not too familiar with this code, but do we need to check whether result is a string array first if doing this?
Yeah that makes sense
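The concern can be sketched as follows, assuming a pure-Python stand-in for the string check. `looks_like_strings` and `maybe_to_arrow_strings` are hypothetical names, and the function returns a label rather than a real dtype. An object-dtype result may hold arbitrary Python objects, so the option alone is not enough to justify converting it to Arrow strings.

```python
def looks_like_strings(values):
    """Hypothetical check: every non-missing value is a str, and at least one exists."""
    seen_str = False
    for v in values:
        if v is None:
            continue  # treat None as a missing value
        if not isinstance(v, str):
            return False
        seen_str = True
    return seen_str


def maybe_to_arrow_strings(values, infer_string):
    # Convert only when the option is on AND the contents are really strings.
    if infer_string and looks_like_strings(values):
        return "arrow string"
    return "object"


print(maybe_to_arrow_strings(["a", None, "b"], infer_string=True))
print(maybe_to_arrow_strings(["a", 1], infer_string=True))
```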
dtype = pd.ArrowDtype(pa.string())

data = """a,b
x,1
Can you add a test case with null/nan/None like in your other PR?
I can add a missing field; actually having these values doesn't make much sense.
Added
pandas/_libs/lib.pyx
Outdated
seen.object_ = True

elif seen.str_:
    if is_string_array(objects):
I know everywhere else does this, but is there a way to avoid this double parsing?
(Maybe we check the other flags are all false?)
No, you exit the first loop as soon as you find one string
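A simplified sketch of why the second pass cannot be skipped, assuming a scan that tracks only strings. `first_pass` and `infer` are hypothetical, and the real routine tracks many more flags. Because the first loop stops at the first string, the remaining flags say nothing about the elements that follow it.

```python
def first_pass(objects):
    """Scan until the first string; report how many elements were inspected."""
    for i, v in enumerate(objects):
        if isinstance(v, str):
            return {"str_": True, "inspected": i + 1}
    return {"str_": False, "inspected": len(objects)}


def infer(objects):
    seen = first_pass(objects)
    if seen["str_"]:
        # Elements after the first string were never examined,
        # so a full check over the whole input is still required.
        if all(isinstance(v, str) for v in objects):
            return "string"
    return "object"


print(first_pass(["a", 1, 2]))  # stops after one element
print(infer(["a", 1, 2]))       # the trailing ints are caught by the second check
```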
Nice, thanks @phofl
…v#54431)
* ENH: Implement arrow string option for various I/O methods
* ENH: allow opt-in to inferring pyarrow strings
* Remove comments and add tests
* Add string option to arrow parsers
* Update
* Update
* Adjust csv
* Update
* Update
* Add test
* Fix mypy
---------
Co-authored-by: Brock <[email protected]>