add fails with empty DataFrame argument #4059

Closed
c3-cjazra opened this issue Jan 26, 2022 · 3 comments · Fixed by #4391

@c3-cjazra

System information

Modin: 0.12.0
Ray 1.7.1
python 3.9.7

Describe the problem

import modin.pandas as modin_pd

df_empty = modin_pd.DataFrame([])
df1 = modin_pd.DataFrame([1, 2, 3], columns=["col"])
# df_empty.add(df1)  # <-- OK
df1.add(df_empty)  # <-- FAILS

error:

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
/var/folders/9l/q_y0w5yx2_1_h65cprqgczyc0000gq/T/ipykernel_10949/3077187932.py in <module>
----> 6 df1.add(df)

/python3.9/site-packages/modin/pandas/dataframe.py in add(self, other, axis, level, fill_value)
    548         Get addition of ``DataFrame`` and `other`, element-wise (binary operator `add`).
    549         """
--> 550         return self._binary_op(
    551             "add",
    552             other,

/python3.9/site-packages/modin/pandas/base.py in _binary_op(self, op, other, **kwargs)
    425         if op in exclude_list:
    426             kwargs.pop("axis")
--> 427         new_query_compiler = getattr(self._query_compiler, op)(other, **kwargs)
    428         return self._create_or_update_from_compiler(new_query_compiler)
    429 

/python3.9/site-packages/modin/core/dataframe/algebra/binary.py in caller(query_compiler, other, broadcast, *args, **kwargs)
     90                 else:
     91                     return query_compiler.__constructor__(
---> 92                         query_compiler._modin_frame.binary_op(
     93                             lambda x, y: func(x, y, *args, **kwargs),
     94                             other._modin_frame,

/python3.9/site-packages/modin/core/dataframe/pandas/dataframe/dataframe.py in binary_op(self, op, right_frame, join_type)
   2030         # unwrap list returned by `copartition`.
   2031         right_parts = right_parts[0]
-> 2032         new_frame = self._partition_mgr_cls.binary_operation(
   2033             1, left_parts, lambda l, r: op(l, r), right_parts
   2034         )

/python3.9/site-packages/modin/core/execution/ray/implementations/pandas_on_ray/partitioning/partition_manager.py in magic(*args, **kwargs)
     55     @wraps(f)
     56     def magic(*args, **kwargs):
---> 57         result_parts = f(*args, **kwargs)
     58         if ProgressBar.get():
     59             current_frame = inspect.currentframe()

/python3.9/site-packages/modin/core/execution/ray/implementations/pandas_on_ray/partitioning/partition_manager.py in binary_operation(cls, axis, left, func, right)
    473             A NumPy array with new partitions.
    474         """
--> 475         return super(PandasOnRayDataframePartitionManager, cls).binary_operation(
    476             axis, left, func, right
    477         )

/python3.9/site-packages/modin/core/dataframe/pandas/partitioning/partition_manager.py in binary_operation(cls, axis, left, func, right)
   1251         func = cls.preprocess_func(func)
   1252         result = np.array(
-> 1253             [
   1254                 left_partitions[i].apply(
   1255                     func,

/python3.9/site-packages/modin/core/dataframe/pandas/partitioning/partition_manager.py in <listcomp>(.0)
   1255                     func,
   1256                     num_splits=NPartitions.get(),
-> 1257                     other_axis_partition=right_partitions[i],
   1258                 )
   1259                 for i in range(len(left_partitions))

IndexError: list index out of range

The same operation works in pandas.
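
For comparison, here is a minimal sketch of the same call in plain pandas. It assumes only that pandas is installed. pandas aligns both operands on the union of their row and column labels, so the call returns a frame of NaN values rather than raising.

```python
import pandas as pd

df_empty = pd.DataFrame([])
df1 = pd.DataFrame([1, 2, 3], columns=["col"])

# pandas aligns both operands on the union of their indexes and columns;
# labels missing from the empty operand are filled with NaN.
result = df1.add(df_empty)
print(result)
```

The result keeps the three rows and the "col" column of df1, with every value NaN.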

@mvashishtha mvashishtha added the bug 🦗 Something isn't working label Jan 27, 2022
@mvashishtha
Collaborator

@c3-cjazra thank you for reporting this! I can reproduce the bug.

@mvashishtha mvashishtha self-assigned this Jan 28, 2022
@mvashishtha
Collaborator

I think that this bug applies to any binary operation in Modin. Before we execute a binary operation, the Modin frame calls map_axis_partitions to reindex the second operand. If the second operand is missing rows that are present in the first, i.e. a row with a given index value is in the first frame but not in the second, it seems that this map_axis_partitions call is supposed to fill in the missing values with NaN. For example, if we have

import modin.pandas as pd
pd.DataFrame([[1], [2]]) + pd.DataFrame([[3]])

then at the point where we apply the operation in binary_operation, [x.to_pandas() for x in left_partitions] has just the element:

   0
0  1
1  2

and [x.to_pandas() for x in right_partitions] has just the element:

     0
0  3.0
1  NaN
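
The reindexing step described above can be approximated in plain pandas. This is only a sketch: it uses `DataFrame.reindex` as a stand-in for what Modin does per partition, and it assumes the joined index equals the left frame's index, as it does in this example.

```python
import pandas as pd

left = pd.DataFrame([[1], [2]])
right = pd.DataFrame([[3]])

# Reindex the right operand on the left operand's row labels;
# row 1 does not exist in `right`, so it becomes NaN.
right_aligned = right.reindex(left.index)
print(right_aligned)
```

After this step both operands have the same row labels, so the operation can be applied partition by partition.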

However, in a simple case of the bug,

import modin.pandas as pd
pd.DataFrame([[1], [2]]) + pd.DataFrame([[]])

the right frame has no partitions, so map_axis_partitions just reindexes the empty partition list to another empty partition list, and right_partitions is empty in binary_operation. We then get an IndexError when we try to apply the binary operation pairwise on the left and right partitions.
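
The failure mode can be imitated without Modin. In the sketch below, plain strings stand in for partition objects, and `pair_partitions` is a hypothetical helper that only mirrors the indexing pattern of the list comprehension shown in the traceback.

```python
# Stand-ins for the partition lists; strings replace real partition objects.
left_partitions = ["left_partition_0"]
right_partitions = []  # the empty right frame contributes no partitions


def pair_partitions(left, right):
    """Pair partitions by position, as the comprehension in the traceback does."""
    return [(left[i], right[i]) for i in range(len(left))]


try:
    pair_partitions(left_partitions, right_partitions)
    error = None
except IndexError as exc:
    # right[0] does not exist, so indexing fails on the first iteration.
    error = exc

print(repr(error))
```

Because the loop runs over the length of the left list, any right list that is shorter fails in the same way.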

One possible fix is that in _copartition, when an empty frame is to be joined to a non-empty frame, we first add a single value NaN to the empty frame.
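
A rough sketch of that idea in plain pandas follows. `pad_empty_operand` is a hypothetical helper, not Modin code, and it simplifies the proposal: instead of adding a single NaN, it reindexes the empty frame to the full labels of the non-empty one.

```python
import pandas as pd


def pad_empty_operand(left, right):
    """Hypothetical helper: give an empty `right` the labels of `left`, all NaN."""
    if right.empty and not left.empty:
        return right.reindex(index=left.index, columns=left.columns)
    return right


left = pd.DataFrame([[1], [2]])
right = pd.DataFrame([[]])

# After padding, both operands have matching shapes, so a pairwise
# operation has something to pair with on the right side.
padded = pad_empty_operand(left, right)
result = left + padded
print(result)
```

The padded frame contains only NaN, so the sum is all NaN, which matches what pandas returns for the unpadded operands.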

@modin-project/modin-contributors, what is the correct way to fix this?

@YarShev
Collaborator

YarShev commented Feb 1, 2022

Given the docstring's line "Perform aligning of partitions, index and partition blocks.", I think we should fix that in _copartition.

prutskov added a commit to prutskov/modin that referenced this issue Jun 16, 2022
…ion for binary ops, fix bin ops for empty dataframes

Signed-off-by: Alexey Prutskov <[email protected]>
YarShev pushed a commit that referenced this issue Jun 16, 2022
…n ops for empty dataframes (#4391)

Signed-off-by: Alexey Prutskov <[email protected]>