Display of built-in window functions does not work with struct elements #647

Closed
timsaucer opened this issue Apr 27, 2024 · 1 comment
Labels
bug Something isn't working

Comments

@timsaucer
Contributor

Describe the bug
Calling show() on a DataFrame that applies a built-in window function to a column of struct type fails with the error Compute error: concat requires input of at least one array. Other operations on the same DataFrame, such as count(), succeed. I have limited experience with DataFusion and assumed that count() performs a full evaluation, as it does in PySpark. That assumption may be wrong, in which case the successful count() may have no bearing on this error.

To Reproduce
This minimal example shows the window function working on a simple column type and failing on a column that holds a very simple struct.

import pyarrow as pa
from datafusion import SessionContext
import datafusion.functions as F

# taken from datafusion/tests/test_dataframe.py
def struct_df():
    ctx = SessionContext()

    # create a RecordBatch and a new DataFrame from it
    batch = pa.RecordBatch.from_arrays(
        [pa.array([{"c": 1}, {"c": 2}, {"c": 3}]), pa.array([4, 5, 6])],
        names=["a", "b"],
    )

    return ctx.create_dataframe([[batch]])

df = struct_df()

df.show()

df.select(F.col("a"), F.col("b"), F.window("lag", [F.col("b")]).alias("lag_b")).show()

print("Calling count on lag a: ", df.select(F.col("a"), F.col("b"), F.window("lag", [F.col("a")]).alias("lag_a")).count())

df.select(F.col("a"), F.col("b"), F.window("lag", [F.col("a")]).alias("lag_a")).show()

Produces the following output:

DataFrame()
+--------+---+
| a      | b |
+--------+---+
| {c: 1} | 4 |
| {c: 2} | 5 |
| {c: 3} | 6 |
+--------+---+
DataFrame()
+--------+---+-------+
| a      | b | lag_b |
+--------+---+-------+
| {c: 1} | 4 |       |
| {c: 2} | 5 | 4     |
| {c: 3} | 6 | 5     |
+--------+---+-------+
Calling count on lag a:  3
Traceback (most recent call last):
  File "/Users/tsaucer/src/arrow-datafusion-python/example_lag_struct.py", line 25, in <module>
    df.select(F.col("a"), F.col("b"), F.window("lag", [F.col("a")]).alias("lag_a")).show()
Exception: Arrow error: Compute error: concat requires input of at least one array

While searching the web, I found a similar error in sort operations that this old MR resolved: https://github.com/apache/arrow/pull/9275/files#diff-3ee8e6ac2472badc7bb448c360f56ed60f06a787d1f45ea589d9e213eaf2ae82

Expected behavior
Calling show() on a window function applied to a struct column should behave the same way it does for simple column types.
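
To make the expected result concrete, here is a plain-Python sketch of what lag with an offset of 1 should return for both columns in the example above. The lag helper below is hypothetical and written only for illustration. It is not DataFusion's implementation and does not touch the DataFusion API. It assumes the default behavior seen in the lag_b output, where the first row has no previous value and is therefore empty.

```python
from typing import Any, List, Optional


def lag(values: List[Any], offset: int = 1) -> List[Optional[Any]]:
    """Illustrative model of lag: each row receives the value found
    `offset` rows earlier, or None when no such row exists."""
    return [values[i - offset] if i >= offset else None for i in range(len(values))]


# Same data as the reproduction: "a" holds structs, "b" holds integers.
a = [{"c": 1}, {"c": 2}, {"c": 3}]
b = [4, 5, 6]

lag_b = lag(b)  # matches the lag_b column that show() already prints
lag_a = lag(a)  # what I would expect show() to print for lag_a

print(lag_b)
print(lag_a)
```

In other words, I would expect the lag_a column to be empty in the first row, followed by {c: 1} and {c: 2}.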

Additional context
I'm willing to work on this myself, but I'm not familiar with the internals of plan execution. I've looked for an obvious cause and nothing has jumped out at me. Any directions or pointers would be appreciated.

@timsaucer
Contributor Author

Closing in favor of apache/datafusion#10328

This issue was closed.