Iceberg support partitioning on a nested field #9337
Buktoria wants to merge 1 commit into trinodb:master
Conversation
Force-pushed from 0fad4cf to 8113685
use fixed time in tests to ensure the test is reproducible
please inline createTable and insertSql -- they are used once
(same below)
assert on the result values, not just count
assert(query("...."))
.matches("SELECT val1, val2")
i would assume that in this test the partition field is all that matters; do we need to have all these data columns?
how is second table different from the first?
ie what is the important difference?
thanks for the test. please also add one in TestIcebergSparkCompatibility to make sure our nested-field-partitioned writes are correctly readable by spark
we should also test the reverse.
note that TestIcebergSparkCompatibility is a product test, so to run it you need
./mvnw clean install -DskipTests # project needs to be built first
bin/ptl test run --environment singlenode-spark-iceberg -- -t TestIcebergSparkCompatibility
What is this bin/ptl command? Is this something that should be created by the ./mvnw clean install -DskipTests command? There is no bin directory created after running the install of the project locally on my side.
It was moved into the testing directory ./testing/bin/ptl
It'd be good to have a query with a WHERE clause in this test as well, to verify that the predicate pushdown on the nested column works properly. Pushdown on partition columns is guaranteed by the connector so the engine won't repeat the filter.
This should be more strict. For example, it shouldn't match a...b.
IDENTIFIER = "[a-z_][a-z0-9_]*";
NAME = IDENTIFIER + "(\\." + NAME + ")*";
Also, since we thus allow partitioning on a function of a field (like year(rec.timestamp)), we should add some test coverage for such usage too.
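The stricter pattern suggested above can be sketched as follows. This is a minimal illustration, not the actual Trino source: the class and method names here are hypothetical, and since Java can't express the recursive NAME definition literally, the sketch unrolls it into the equivalent repetition of `.identifier` segments.

```java
import java.util.regex.Pattern;

public class NamePattern {
    // Stricter grammar from the review comment (hypothetical constant names):
    // an identifier, then zero or more ".identifier" segments.
    static final String IDENTIFIER = "[a-z_][a-z0-9_]*";
    static final Pattern NAME =
            Pattern.compile(IDENTIFIER + "(\\." + IDENTIFIER + ")*");

    static boolean isValidName(String s) {
        // matches() anchors at both ends, so stray or doubled dots
        // (e.g. "a...b") cause the whole match to fail
        return NAME.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValidName("rec.timestamp")); // true
        System.out.println(isValidName("a...b"));         // false
        System.out.println(isValidName("a.b.c"));         // true
    }
}
```

Because the whole string must match, malformed names like a...b or a leading/trailing dot are rejected, while ordinary dotted paths such as rec.timestamp pass.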
in #9354 @djsstarburst argues that using getChildren is incorrect.
@djsstarburst would you want to comment here?
@Buktoria thank you for your PR!
Force-pushed from bddab80 to 64d4144
👋 @Buktoria - this PR has become inactive. If you're still interested in working on it, please let us know, and we can try to get reviewers to help with that. We're working on closing out old and inactive PRs, so if you're too busy or this has too many merge conflicts to be worth picking back up, we'll be making another pass to close it out in a few weeks.
Closing this, it's too far behind to be able to merge.
Problem
When querying a partitioned Iceberg table, you can get the following error:
java.lang.IllegalArgumentException: columns is empty
Deep Dive into the problem
This `columns` value is returning an empty list. If we take a look at `getColumns`, we can see that it is a simple iteration over the return value of `schema.columns()` from the Iceberg API. The problem is that some of these fields can be nested, and this method does not unpack those nested columns. Going back to see how that original `columns` variable is generated, we see a filter being applied to the top-level columns we got from `getColumns`. This means that all columns get filtered out, because none of those top-level columns is the partition field, hence the `columns is empty` error when building, and then the error is thrown here.
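The unpacking described above can be sketched as a recursive flatten over the schema's fields. This is a simplified illustration, not the actual fix: the `Field` record below is a hypothetical stand-in for the Iceberg API's nested-field types, used only to show how dotted names like rec.timestamp would become visible to the partition-field filter.

```java
import java.util.ArrayList;
import java.util.List;

public class FlattenColumns {
    // Hypothetical, simplified stand-in for a schema field; the real code
    // would walk the Iceberg API's nested-field types instead.
    record Field(String name, List<Field> children) {
        boolean isStruct() { return !children.isEmpty(); }
    }

    // Recursively unpack nested fields into dotted column names so that a
    // partition field such as "rec.timestamp" is not filtered out.
    static List<String> flatten(List<Field> fields, String prefix) {
        List<String> out = new ArrayList<>();
        for (Field f : fields) {
            String name = prefix.isEmpty() ? f.name() : prefix + "." + f.name();
            out.add(name);
            if (f.isStruct()) {
                out.addAll(flatten(f.children(), name));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Field> schema = List.of(
                new Field("id", List.of()),
                new Field("rec", List.of(new Field("timestamp", List.of()))));
        System.out.println(flatten(schema, ""));
        // [id, rec, rec.timestamp]
    }
}
```

With the nested names included in the flattened list, filtering for the partition field rec.timestamp no longer yields an empty list, which is the condition that triggered the IllegalArgumentException.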
Solution
This PR does three main things.
Closes: #5458