Table scan throws IndexError: list index out of range #1024

Closed
vhnguyenae opened this issue Aug 8, 2024 · 2 comments · Fixed by #1026

Apache Iceberg version

0.7.0 (latest release)

Please describe the bug 🐞

from pyiceberg import catalog
from pyiceberg.expressions import EqualTo
from pandas import DataFrame

def read_data_from_table(project_hash: str, database: str, my_table: str) -> DataFrame:
    glue_catalog = catalog.load_glue(name='glue', conf={})
    table = glue_catalog.load_table(f"{database}.{my_table}")
    scan = table.scan(
        row_filter=EqualTo('project_hash', project_hash),
        selected_fields=("issue_id",)
    )
    print(scan)
    return scan.to_pandas()


df_iceberg = read_data_from_table("my_project", "my_db", "my_table")
print(df_iceberg)

The same code worked fine on version 0.6.1, but with version 0.7.0 I get this stack trace:

Traceback (most recent call last):
  File "/Users/vuhainguyen/Workspace/git/wux/tempo_script.py", line 34, in read_data_from_table
    return scan.to_pandas()
           ^^^^^^^^^^^^^^^^
  File "/Users/vuhainguyen/.pyenv/versions/3.11.5/lib/python3.11/site-packages/pyiceberg/table/__init__.py", line 2043, in to_pandas
    return self.to_arrow().to_pandas(**kwargs)
           ^^^^^^^^^^^^^^^
  File "/Users/vuhainguyen/.pyenv/versions/3.11.5/lib/python3.11/site-packages/pyiceberg/table/__init__.py", line 2013, in to_arrow
    return project_table(
           ^^^^^^^^^^^^^^
  File "/Users/vuhainguyen/.pyenv/versions/3.11.5/lib/python3.11/site-packages/pyiceberg/io/pyarrow.py", line 1335, in project_table
    if table_result := future.result():
                       ^^^^^^^^^^^^^^^
  File "/Users/vuhainguyen/.pyenv/versions/3.11.5/lib/python3.11/concurrent/futures/_base.py", line 449, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/Users/vuhainguyen/.pyenv/versions/3.11.5/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
  File "/Users/vuhainguyen/.pyenv/versions/3.11.5/lib/python3.11/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/vuhainguyen/.pyenv/versions/3.11.5/lib/python3.11/site-packages/pyiceberg/io/pyarrow.py", line 1237, in _task_to_table
    batches = list(
              ^^^^^
  File "/Users/vuhainguyen/.pyenv/versions/3.11.5/lib/python3.11/site-packages/pyiceberg/io/pyarrow.py", line 1222, in _task_to_record_batches
    batch = arrow_table.to_batches()[0]
            ~~~~~~~~~~~~~~~~~~~~~~~~^^^
IndexError: list index out of range

kevinjqliu (Contributor) commented Aug 8, 2024

Thanks for reporting this issue!

Interesting... the error is in _task_to_record_batches

batches = fragment_scanner.to_batches()
for batch in batches:
    if positional_deletes:
        # Create the mask of indices that we're interested in
        indices = _combine_positional_deletes(positional_deletes, current_index, current_index + len(batch))
        batch = batch.take(indices)
    # Apply the user filter
    if pyarrow_filter is not None:
        # we need to switch back and forth between RecordBatch and Table
        # as Expression filter isn't yet supported in RecordBatch
        # https://github.com/apache/arrow/issues/39220
        arrow_table = pa.Table.from_batches([batch])
        arrow_table = arrow_table.filter(pyarrow_filter)
        batch = arrow_table.to_batches()[0]

Somehow arrow_table.to_batches() produces an empty list, so indexing it with [0] raises the IndexError.

FYI @sungwy

sungwy (Collaborator) commented Aug 8, 2024

Hi @vhnguyenae, thank you for reporting this issue! I think the fix should be relatively simple. I will work on replicating the issue with a minimal setup to understand in what state of an Iceberg table we would expect an empty record batch to be read.
