BUG: Binary operations with empty string arrays produce #46332

vyasr · 2022-03-11T23:14:54Z

Pandas version checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

>>> import pandas as pd
>>> pd.DataFrame({'a': pd.Series([], dtype='str')}) + 1
Empty DataFrame
Columns: [a]
Index: []
>>> pd.DataFrame({'a': pd.Series([], dtype='str')}) | 1
Empty DataFrame
Columns: [a]
Index: []
>>> pd.DataFrame({'a': pd.Series([None], dtype='str')}) + 1
     a
0  NaN
>>> pd.DataFrame({'a': pd.Series([None], dtype='str')}) | 1
Traceback (most recent call last):
...
TypeError: unsupported operand type(s) for |: 'NoneType' and 'int'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
...
TypeError: Cannot perform 'or_' with a dtyped [object] array and scalar of type [bool]
>>> pd.DataFrame({'a': pd.Series([None, 'a'], dtype='str')}) + 1
Traceback (most recent call last):
...
TypeError: unsupported operand type(s) for +: 'NoneType' and 'int'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
...
TypeError: can only concatenate str (not "int") to str

Issue Description

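When a DataFrame contains a column of dtype object that is either empty or contains only None, some binary operations succeed that would raise if the column contained actual strings. In the example above, adding 1 succeeds for the empty and None-only columns but raises a TypeError once the column contains 'a'.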

Expected Behavior

Binary operations between a string column and a scalar should raise errors consistently, based only on the dtype of the scalar, irrespective of whether the column is empty, contains only None, or contains actual strings.
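
One way to make this expectation checkable is a small probe that applies the same operation to each kind of column and records the outcome. This is only a sketch: `probe` and `is_consistent` are hypothetical helper names, not pandas APIs, and the exact outcomes depend on the pandas version.

```python
import operator

import pandas as pd


def probe(op, scalar, contents=([], [None], [None, "a"])):
    """Apply op(frame, scalar) to one-column frames built from each content.

    Returns a dict mapping repr(content) to "ok" or to the name of the
    exception type that was raised. (Hypothetical helper, not a pandas API.)
    """
    outcomes = {}
    for values in contents:
        frame = pd.DataFrame({"a": pd.Series(values, dtype="str")})
        try:
            op(frame, scalar)
        except Exception as exc:  # record the failure instead of propagating it
            outcomes[repr(values)] = type(exc).__name__
        else:
            outcomes[repr(values)] = "ok"
    return outcomes


def is_consistent(outcomes):
    """True when every column content led to the same outcome."""
    return len(set(outcomes.values())) <= 1


for op in (operator.add, operator.or_):
    result = probe(op, 1)
    print(op.__name__, result, "consistent:", is_consistent(result))
```

With the behavior reported above, `probe(operator.add, 1)` would mix "ok" and "TypeError" entries, so `is_consistent` would return False; under the expected behavior every entry would be the same.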

Installed Versions

pd.show_versions()

INSTALLED VERSIONS

commit : 66e3805
python : 3.8.12.final.0
python-bits : 64
OS : Linux
OS-release : 4.15.0-76-generic
Version : #86-Ubuntu SMP Fri Jan 17 17:24:28 UTC 2020
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : None
LOCALE : en_US.UTF-8

pandas : 1.3.5
numpy : 1.21.5
pytz : 2021.3
dateutil : 2.8.2
pip : 22.0.4
setuptools : 59.8.0
Cython : 0.29.28
pytest : 7.0.1
hypothesis : 6.39.3
sphinx : 4.4.0
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.8.0
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.0.3
IPython : 8.1.1
pandas_datareader: None
bs4 : 4.10.0
bottleneck : None
fsspec : 2022.02.0
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 6.0.1
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : 0.55.1


songsol1 · 2022-03-22T19:08:10Z

take

edwhuang23 · 2022-03-22T21:05:18Z

The consistency is certainly an issue here. So it seems like the suggestion is to raise an error consistently irrespective of whether the column is empty, contains only None, or contains actual strings. Wouldn't it make more sense for the data frame to be left alone with no error occurring if the column is empty since there are no elements to work with in the first place? If the column contains only None or actual strings, then we would raise an error. Does that seem reasonable?

Or is the line of reasoning that because we are dealing with a string column, binary operations with a scalar should not succeed even if the column happens to be empty?

vyasr · 2022-03-24T22:38:32Z

I would argue that the second is the correct answer. The behavior of dtypes should not depend on contents. If you think of a DataFrame as a collection of arrays (which is also the direction #39146 moves the implementation in), you would expect an array to be strict with respect to dtypes. This is how numpy behaves, for example:

>>> np.array([]) + np.array([], dtype='str')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
numpy.core._exceptions.UFuncTypeError: ufunc 'add' did not contain a loop with signature matching types (dtype('float64'), dtype('<U1')) -> None

Especially if you were to define different columns using different types of arrays, I would expect binary operations between types of arrays to behave the same as adding equivalent columns of a DataFrame, but that wouldn't be the case now (granting that StringArray is still experimental):

>>> pd.array([1]) + pd.array([], dtype='str')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/nfs/vyasr/local/rapids/compose/etc/conda/cuda_11.5/envs/rapids/lib/python3.8/site-packages/pandas/core/ops/common.py", line 69, in new_method
    return method(self, other)
  File "/home/nfs/vyasr/local/rapids/compose/etc/conda/cuda_11.5/envs/rapids/lib/python3.8/site-packages/pandas/core/arraylike.py", line 92, in __add__
    return self._arith_method(other, operator.add)
  File "/home/nfs/vyasr/local/rapids/compose/etc/conda/cuda_11.5/envs/rapids/lib/python3.8/site-packages/pandas/core/arrays/numeric.py", line 103, in _arith_method
    raise ValueError("Lengths must match")
ValueError: Lengths must match

I probably wouldn't even expect all operators to be defined for different types of arrays since it doesn't make sense to divide strings, for instance.

Moreover, from an implementation perspective the current behavior has a nonzero cost since it requires checking whether the column is empty, or even worse, if it is nonempty but only contains nulls (None/NaN).

edwhuang23 · 2022-03-28T03:04:14Z

Gotcha, so you're saying that the behavior of dtypes should not depend on content, which I absolutely agree with. To clarify: a binary operation should work between two columns whose dtypes are compatible (e.g. an integer column added to a float column, similar to your second example with two arrays added together), and likewise between a column and a scalar whose dtypes are compatible, while two incompatible dtypes should not work together (e.g. an integer added to a string, as in your reproducible example). Is this correct?

In terms of proposed changes, I'm thinking of the following:

Changing the current implementation so that it no longer checks whether the column is empty or contains only nulls, and instead checks dtypes according to the behavior I described above.
Ensuring that the set of valid binary operations between dtypes is consistent, whether we are dealing with two columns or one column and a scalar value.
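
As a rough illustration of the first point, here is a minimal sketch in plain Python. `TypedColumn`, its `logical_dtype` label, and the `_ALLOWED` table are hypothetical names invented for this example, not pandas internals, and only the column-with-scalar case is covered. The point is that the decision looks at dtypes and never at the values.

```python
import operator
from dataclasses import dataclass

# Hypothetical rule table: (logical dtype, operator) -> accepted scalar types.
# Any pair missing from the table is rejected.
_ALLOWED = {
    ("string", operator.add): (str,),
    ("int", operator.add): (int,),
    ("int", operator.or_): (int,),
}


@dataclass
class TypedColumn:
    values: list
    logical_dtype: str

    def _binop(self, other, op):
        allowed = _ALLOWED.get((self.logical_dtype, op), ())
        # The check uses only the declared dtype and the scalar's type, so an
        # empty or all-None column is treated like any other column.
        if not isinstance(other, allowed):
            raise TypeError(
                f"cannot apply {op.__name__!r} to a {self.logical_dtype} "
                f"column and a scalar of type {type(other).__name__}"
            )
        # None stands in for a missing value and is passed through unchanged.
        result = [None if v is None else op(v, other) for v in self.values]
        return TypedColumn(result, self.logical_dtype)

    def __add__(self, other):
        return self._binop(other, operator.add)

    def __or__(self, other):
        return self._binop(other, operator.or_)
```

With this sketch, `TypedColumn([], "string") + 1`, `TypedColumn([None], "string") + 1`, and `TypedColumn([None, "a"], "string") + 1` all raise the same TypeError, while adding a str scalar succeeds in all three cases.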

Let me know if I misunderstood anything.

vyasr · 2022-03-31T18:43:07Z

Yup, that all sounds right to me!

vyasr added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Mar 11, 2022

mroeschke added Numeric Operations Arithmetic, Comparison, and Logical operations Strings String extension data type and string data API - Consistency Internal Consistency of API/Behavior and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Mar 17, 2022

github-actions bot assigned songsol1 Mar 22, 2022

vyasr mentioned this issue Mar 24, 2022

Define proper binary operation APIs for columns rapidsai/cudf#10509

Merged

songsol1 removed their assignment Apr 5, 2022

joelostblom mentioned this issue May 2, 2022

BUG: Concatenation of None with numerical values no longer converts None to Nan. #46922

Closed

3 tasks
