Panic when querying a hive-partitioned parquet dataset created with wrong column name #10020

Open
jwimberl opened this issue Apr 9, 2024 · 2 comments
Labels
bug Something isn't working

jwimberl commented Apr 9, 2024

Describe the bug

I added an external table and mistakenly gave the wrong name for one of its partition columns. The DDL operation returned successfully, and some basic queries on the table were successful, but others resulted in panics.

To Reproduce

Using the example nyctaxi data set, typing monht instead of month for one of the partition columns yields no error when the table is created or when its records are counted:

$ RUST_BACKTRACE=1 python3
Python 3.11.8 | packaged by conda-forge | (main, Feb 16 2024, 20:53:32) [GCC 12.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import datafusion as df
>>> ctx = df.SessionContext()
>>> ctx.sql("""
... CREATE EXTERNAL TABLE taxi
... STORED AS PARQUET
... PARTITIONED BY (year, monht)
... LOCATION '/path/to/nyctaxi'
... """)
DataFrame()
++
++
>>> ctx.sql("SELECT COUNT(*) FROM taxi")
DataFrame()
+----------+
| COUNT(*) |
+----------+
| 2964624  |
+----------+

Instead, the first error is a panic while reading data from the table:

>>> ctx.sql("SELECT * FROM taxi")
thread 'tokio-runtime-worker' panicked at /root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/datafusion-36.0.0/src/datasource/physical_plan/file_scan_config.rs:248:54:
index out of bounds: the len is 0 but the index is 0
stack backtrace:
thread 'tokio-runtime-worker' panicked at /root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/datafusion-36.0.0/src/datasource/physical_plan/file_scan_config.rs:248:54:
index out of bounds: the len is 0 but the index is 0
thread 'tokio-runtime-worker' panicked at /root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/datafusion-36.0.0/src/datasource/physical_plan/file_scan_config.rs:248:54:
index out of bounds: the len is 0 but the index is 0
   0: rust_begin_unwind
             at /rustc/5119208fd78a77547c705d1695428c88d6791263/library/std/src/panicking.rs:645:5
   1: core::panicking::panic_fmt
             at /rustc/5119208fd78a77547c705d1695428c88d6791263/library/core/src/panicking.rs:72:14
   2: core::panicking::panic_bounds_check
             at /rustc/5119208fd78a77547c705d1695428c88d6791263/library/core/src/panicking.rs:208:5
   3: datafusion::datasource::physical_plan::file_scan_config::PartitionColumnProjector::project
   4: <datafusion::datasource::physical_plan::file_stream::FileStream<F> as futures_core::stream::Stream>::poll_next
   5: datafusion_physical_plan::stream::RecordBatchReceiverStreamBuilder::run_input::{{closure}}
   6: tokio::runtime::task::raw::poll
   7: tokio::runtime::scheduler::multi_thread::worker::Context::run_task
   8: tokio::runtime::task::raw::poll
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
stack backtrace:
   0: rust_begin_unwind
             at /rustc/5119208fd78a77547c705d1695428c88d6791263/library/std/src/panicking.rs:645:5
   1: core::panicking::panic_fmt
             at /rustc/5119208fd78a77547c705d1695428c88d6791263/library/core/src/panicking.rs:72:14
   2: core::panicking::panic_bounds_check
             at /rustc/5119208fd78a77547c705d1695428c88d6791263/library/core/src/panicking.rs:208:5
   3: datafusion::datasource::physical_plan::file_scan_config::PartitionColumnProjector::project
   4: <datafusion::datasource::physical_plan::file_stream::FileStream<F> as futures_core::stream::Stream>::poll_next
   5: datafusion_physical_plan::stream::RecordBatchReceiverStreamBuilder::run_input::{{closure}}
   6: tokio::runtime::task::raw::poll
   7: tokio::runtime::scheduler::multi_thread::worker::Context::run_task
   8: tokio::runtime::task::raw::poll
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
Traceback (most recent call last):
stack backtrace:
  File "<stdin>", line 1, in <module>
pyo3_runtime.PanicException: index out of bounds: the len is 0 but the index is 0
   0: rust_begin_unwind
             at /rustc/5119208fd78a77547c705d1695428c88d6791263/library/std/src/panicking.rs:645:5
   >>> 1: core::panicking::panic_fmt
             at /rustc/5119208fd78a77547c705d1695428c88d6791263/library/core/src/panicking.rs:72:14
   2: core::panicking::panic_bounds_check
             at /rustc/5119208fd78a77547c705d1695428c88d6791263/library/core/src/panicking.rs:208:5
   3: datafusion::datasource::physical_plan::file_scan_config::PartitionColumnProjector::project
   4: <datafusion::datasource::physical_plan::file_stream::FileStream<F> as futures_core::stream::Stream>::poll_next
   5: datafusion_physical_plan::stream::RecordBatchReceiverStreamBuilder::run_input::{{closure}}
   6: tokio::runtime::task::raw::poll
   7: tokio::runtime::scheduler::multi_thread::worker::Context::run_task
   8: tokio::runtime::task::raw::poll

Expected behavior

Ideally, this misconfiguration would produce an error message when the table is created. As I understand it, this DDL operation already performs an initial scan of the filesystem or object store, so there is an opportunity to validate the supplied partition columns. Failing that, I would at least expect an error message during queries, not a panic.
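
Until such validation is available, a check along these lines could be run on the client side before issuing the CREATE EXTERNAL TABLE statement. This is only a minimal sketch, assuming the dataset lives on a local filesystem and uses hive-style `key=value` directory names. The function names `find_partition_keys` and `check_partition_columns` are hypothetical helpers, not part of DataFusion, and this is not the validation DataFusion itself performs.

```python
import os


def find_partition_keys(root):
    """Collect the key of every hive-style `key=value` directory under root.

    Assumption: the dataset is on a local filesystem, so os.walk can see it.
    """
    keys = set()
    for _dirpath, dirnames, _filenames in os.walk(root):
        for name in dirnames:
            if "=" in name:
                # Keep only the part before the first "=".
                keys.add(name.split("=", 1)[0])
    return keys


def check_partition_columns(root, declared):
    """Return the declared partition columns that match no directory key."""
    found = find_partition_keys(root)
    return [col for col in declared if col not in found]
```

With the table definition above, `check_partition_columns('/path/to/nyctaxi', ['year', 'monht'])` would be expected to return a list containing `monht`, which the caller can turn into an error before the table is ever registered.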

Additional context

Using datafusion 36.0.0 module for Python 3.11.

@jwimberl added the bug label Apr 9, 2024
@devinjdangelo (Contributor)

Thank you for the report! @MohamedAbdeen21 noticed a similar issue and in #9912 added validation which should raise an error in this scenario during the CREATE EXTERNAL TABLE statement execution. This feature should be included in the 38.0.0 release and is available now on the main branch.

@jwimberl (Author)

Ah, great, I will be sure to test that release when it is out.
