Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Panic when querying table with wrong partition columns order #9785

Closed
MohamedAbdeen21 opened this issue Mar 24, 2024 · 12 comments · Fixed by #9912
Closed

[BUG] Panic when querying table with wrong partition columns order #9785

MohamedAbdeen21 opened this issue Mar 24, 2024 · 12 comments · Fixed by #9912
Labels
bug Something isn't working

Comments

@MohamedAbdeen21
Copy link
Contributor

Describe the bug

Assume a table test that is partitioned on columns a then b, creating an external table and setting wrong order for the partition columns (b then a) doesn't raise an error, but querying it causes a panic.

To Reproduce

copy these queries into panic.sql

CREATE EXTERNAL TABLE test(name string, year string, month string) PARTITIONED BY (year, month) STORED AS csv LOCATION 'tmp/';
INSERT INTO test VALUES('name', '2024', '03');

-- notice the different ordering in the PARTITIONED BY clause
CREATE EXTERNAL TABLE test2(name string, year string, month string) PARTITIONED BY (month, year) STORED AS csv LOCATION 'tmp/';
SELECT * FROM test2;

And run

$ datafusion-cli -f panic.sql
DataFusion CLI v36.0.0
0 rows in set. Query took 0.002 seconds.

+-------+
| count |
+-------+
| 1     |
+-------+
1 row in set. Query took 0.004 seconds.

0 rows in set. Query took 0.001 seconds.

thread 'main' panicked at /home/user/dev/arrow-datafusion/datafusion/core/src/datasource/physical_plan/file_scan_config.rs:261:54:
index out of bounds: the len is 0 but the index is 0
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

Expected behavior

Should never panic. Instead, return an error on table creation or at least on the queries after.

Additional context

No response

@MohamedAbdeen21 MohamedAbdeen21 added the bug Something isn't working label Mar 24, 2024
@alamb
Copy link
Contributor

alamb commented Mar 24, 2024

Thanks for the report -- I agree this should never panic but rather erorr

@devinjdangelo
Copy link
Contributor

devinjdangelo commented Mar 24, 2024

I haven't dug into exactly what is going wrong here, but it does make me think a breaking change to a hiveQL inspired syntax where partition columns are only declared following the partition by keyword may avoid edge cases like this more easily.

#9599 is open right now to support a new syntax in a backwards compatible way (so would not fix this outright).

Thank your for finding this @MohamedAbdeen21 ! Are you able to trigger a similar panic using the new syntax as well?

@MohamedAbdeen21
Copy link
Contributor Author

MohamedAbdeen21 commented Mar 24, 2024

Yeah, this is irrelevant of the syntax used.

The new syntax is merely translated to the old syntax as to avoid adding more code to the rest of the system, so as long as the issue is outside the parser, syntax used should be irrelevant.

Just to double-check, here's the same query with the new syntax:

CREATE EXTERNAL TABLE test(name string) PARTITIONED BY (year string, month string) STORED AS csv LOCATION 'tmp/';
INSERT INTO test VALUES('name', '2024', '03');

-- notice the different ordering in the PARTITIONED BY clause
CREATE EXTERNAL TABLE test2(name string) PARTITIONED BY (month string, year string) STORED AS csv LOCATION 'tmp/';
SELECT * FROM test2;

I'd say this is safe to work on regardless of the status of 9599.

@MohamedAbdeen21
Copy link
Contributor Author

By the way, is the create statement lazy?

@alamb
Copy link
Contributor

alamb commented Mar 24, 2024

By the way, is the create statement lazy?

It is lazy in the sense that it doesn't read data until the table appears in a query. CREATE often does do basic operations like reading metadata to determine schema.

It definitely doesn't buffer all the data into memory, the way an "eager" dataframe API like pandas or polars(eager) would

@MohamedAbdeen21
Copy link
Contributor Author

I'm not sure what to do next regarding this. Do we just prevent the panic and leave it at that? do we perform schema validation if the table exists?

Also keep in mind that if we only prevent the panic, queries such as the following can make a table useless.

CREATE EXTERNAL TABLE test(name string) PARTITIONED BY (year string, month string) STORED AS parquet LOCATION 'tmp/';
-- notice the different ordering in the PARTITIONED BY clause
INSERT INTO test VALUES('name', '2024', '03');

CREATE EXTERNAL TABLE test2(name string) PARTITIONED BY (month string, year string) STORED AS parquet LOCATION 'tmp/';

-- now table have both partitions (year -> month and month -> year)
INSERT INTO test2 VALUES('name', '2024', '03');

-- both fail
SELECT * FROM test;
SELECT * FROM test2;

@devinjdangelo
Copy link
Contributor

I'm not sure what to do next regarding this. Do we just prevent the panic and leave it at that? do we perform schema validation if the table exists?

We should prevent the panic, but I think it is reasonable behavior to fail the query and return an error. The examples in this issue are creating two tables with different schemas in the same external storage location and writing to both. This effectively corrupts both tables, since now there are two different table definitions mixed in the same physical location. After the inserts, the directory looks like:

/year=2024/month=03
/month=2024/year=03

When the datafusion read path encounters this, the two options would be:

  1. Silently ignore hive style paths which do not conform to the expected schema (so if the partitions are year/month, only scan within the outer year folder, ignoring the outer month folder)
  2. Return an error that unexpected paths were encountered

Depending on the context, I could see either behavior being desired. We could perhaps provide a configuration option that allows users to control how Datafusion will handle unexpected partition paths.

@MohamedAbdeen21
Copy link
Contributor Author

I don't like the first option. If one person creates a table partitioned on year/month and another person reads it as month/year, the second person won't see any updates the first person have done, and vice versa.

I think inserts using different partitions should be rejected, but again, that requires changing both insert and select statements.

I'm trying to think about which will be easier for users to debug and for us to maintain, error-ing on creation or on queries after? Or maybe don't show any errors, but re-order the partition columns in the create statement to match the data on disk

@devinjdangelo
Copy link
Contributor

I don't like the first option. If one person creates a table partitioned on year/month and another person reads it as month/year, the second person won't see any updates the first person have done, and vice versa.

I see the two tables as distinct tables. They happen to share the same base file path, but reordering the partitions changes the schema and where the files are actually stored within the base path.

You can run into similar issues with non partitioned tables if you write different schemas with different column orderings to the same base path.

We could potentially return an error earlier by checking the schema of any file present during the create external table execution plan. That way we prevent a user from creating a table with an incorrect schema over top of an existing table with a different schema.

@MohamedAbdeen21
Copy link
Contributor Author

Sorry for the late response.

We could potentially return an error earlier by checking the schema of any file present during the create external table execution plan. That way we prevent a user from creating a table with an incorrect schema over top of an existing table with a different schema.

I think this is the solution that makes the most sense from the user's perspective, my only problem is that I feel like it's easier said than done.

I'll try to take a look around if I manage to find some free time. For now, I'll leave the assignment empty in case anyone else is interested in picking it up.

@samuelcolvin
Copy link
Contributor

Would be great to have a new release, I've just run into this too.

@alamb
Copy link
Contributor

alamb commented Apr 30, 2024

Would be great to have a new release, I've just run into this too.

Added to #10217 tracker

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
4 participants