
Iceberg support partitioning on a nested field #9337

Closed
Buktoria wants to merge 1 commit into trinodb:master from Buktoria:support-nested-partition-col-iceberg

Conversation


@Buktoria Buktoria commented Sep 22, 2021

Problem

When querying an Iceberg table that is partitioned on a nested field, you can get the following error:

java.lang.IllegalArgumentException: columns is empty
	at io.trino.spi.connector.DiscretePredicates.<init>(DiscretePredicates.java:31)
	at io.trino.plugin.iceberg.IcebergMetadata.getTableProperties(IcebergMetadata.java:368)
	at io.trino.plugin.base.classloader.ClassLoaderSafeConnectorMetadata.getTableProperties(ClassLoaderSafeConnectorMetadata.java:764)
	at io.trino.metadata.MetadataManager.getTableProperties(MetadataManager.java:489)
	at io.trino.sql.planner.iterative.rule.DetermineTableScanNodePartitioning.apply(DetermineTableScanNodePartitioning.java:59)
	at io.trino.sql.planner.iterative.rule.DetermineTableScanNodePartitioning.apply(DetermineTableScanNodePartitioning.java:33)

...

Deep Dive into the problem

This columns value ends up being an empty list:

  // Extract identity partition columns
  Map<Integer, IcebergColumnHandle> columns = getColumns(icebergTable.schema(), typeManager).stream()
          .filter(column -> partitionSourceIds.contains(column.getId()))
          .collect(toImmutableMap(IcebergColumnHandle::getId, Function.identity()));

If we take a look at getColumns, we can see that it is a simple iteration over the return value of schema.columns() from the Iceberg API. The problem is that some of these fields can be nested, and this method does not unpack those nested columns.

public static List<IcebergColumnHandle> getColumns(Schema schema, TypeManager typeManager)
{
    return schema.columns().stream()
            .map(column -> IcebergColumnHandle.create(column, typeManager))
            .collect(toImmutableList());
}
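To see why the flat iteration misses a partition source inside a struct, here is a minimal, self-contained sketch. The Field record, the ids, and the names are hypothetical stand-ins for illustration, not the Iceberg API:

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Stream;

public class FlattenDemo
{
    // Stand-in for a schema field: a struct field has children, a leaf has none.
    record Field(int id, String name, List<Field> children)
    {
        static Field leaf(int id, String name)
        {
            return new Field(id, name, List.of());
        }
    }

    // Recursive unpacking: yields the field itself plus all nested fields,
    // which is what a nested-aware getColumns needs to do.
    static Stream<Field> flatten(Field field)
    {
        return Stream.concat(Stream.of(field), field.children().stream().flatMap(FlattenDemo::flatten));
    }

    public static void main(String[] args)
    {
        // Schema: plain column "id" (1) and struct "rec" (2) whose nested
        // field "rec.ts" (source id 3) is the partition column.
        List<Field> schema = List.of(
                Field.leaf(1, "id"),
                new Field(2, "rec", List.of(Field.leaf(3, "rec.ts"))));
        Set<Integer> partitionSourceIds = Set.of(3);

        // Flat iteration, like the original getColumns: sees only ids 1 and 2.
        long flatMatches = schema.stream()
                .filter(f -> partitionSourceIds.contains(f.id()))
                .count();

        // Recursive iteration also sees the nested id 3.
        long deepMatches = schema.stream()
                .flatMap(FlattenDemo::flatten)
                .filter(f -> partitionSourceIds.contains(f.id()))
                .count();

        System.out.println(flatMatches + " " + deepMatches); // prints "0 1"
    }
}
```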

Going back to how the original columns variable is generated, we see a filter being applied to the top-level columns returned by getColumns.

  // Extract identity partition columns
  Map<Integer, IcebergColumnHandle> columns = getColumns(icebergTable.schema(), typeManager).stream()
          .filter(column -> partitionSourceIds.contains(column.getId()))
          .collect(toImmutableMap(IcebergColumnHandle::getId, Function.identity()));

This means all columns get filtered out, because none of the top-level columns is the partition field, so columns is empty when building

discretePredicates = new DiscretePredicates(
        columns.values().stream()
                .map(ColumnHandle.class::cast)
                .collect(toImmutableList()),
        discreteTupleDomain);

and the error is then thrown here:

public DiscretePredicates(List<ColumnHandle> columns, Iterable<TupleDomain<ColumnHandle>> predicates)
{
    requireNonNull(columns, "columns is null");
    if (columns.isEmpty()) {
        throw new IllegalArgumentException("columns is empty");
    }
    ...
}

Solution

This PR does three main things.

  • When getting table column handles from IcebergMetadata, we get all columns, including the ones that are nested.
  • When looking for Iceberg column partitions, we dig out the partition column and store the index positions needed to get that column from the schema (sourceIds).
  • The Iceberg page partitioner also digs out the partition column block from a page using the stored sourceIds.
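The stored-sourceIds idea can be sketched as follows. This is a hypothetical illustration (NestedPathDemo, dig, and the data shapes are not the PR's actual code): a path of child positions is recorded once from the schema, then used to descend into nested row data to reach the partition value.

```java
import java.util.List;

public class NestedPathDemo
{
    // Follow a path of positions into arbitrarily nested lists, the way the
    // page partitioner descends into nested blocks using the stored sourceIds.
    static Object dig(Object value, int... path)
    {
        for (int position : path) {
            value = ((List<?>) value).get(position);
        }
        return value;
    }

    public static void main(String[] args)
    {
        // A row for a table (id, rec row(a, ts)): the partition value lives at
        // position 1 (rec), then position 1 inside the struct (ts).
        List<Object> row = List.of(42, List.of("a-value", "2021-09-22"));
        System.out.println(dig(row, 1, 1)); // prints "2021-09-22"
    }
}
```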

Closes: #5458

@cla-bot cla-bot bot added the cla-signed label Sep 22, 2021
@Buktoria Buktoria force-pushed the support-nested-partition-col-iceberg branch 4 times, most recently from 0fad4cf to 8113685 Compare September 23, 2021 19:09
Member


use fixed time in tests to ensure the test is reproducible

Member


please inline createTable and insertSql -- they are used once

(same below)

Comment on lines 1968 to 1969
Member


assert on the result values, not just count

assert(query("...."))
  .matches("SELECT val1, val2")

Comment on lines 1944 to 1954
Member


I would assume that in this test the partition field is all that matters; do we need all these data columns?

Member


how is second table different from the first?
ie what is the important difference?

Member


thanks for the test. please also add one in TestIcebergSparkCompatibility to make sure our nested-field-partitioned writes are correctly readable by spark

we should also test the reverse.

note that TestIcebergSparkCompatibility is a product test, so to run it you need

./mvnw clean install -DskipTests # project needs to be built first
bin/ptl test run --environment singlenode-spark-iceberg -- -t TestIcebergSparkCompatibility

Member Author

@Buktoria Buktoria Sep 29, 2021


What is this bin/ptl command? Is this something that should be created by that ./mvnw clean install -DskipTests command? There is no bin directory created after running the install of the project on my side locally.

Member


It was moved into the testing directory ./testing/bin/ptl

Member

@alexjo2144 alexjo2144 Sep 29, 2021


It'd be good to have a query with a WHERE clause in this test as well, to verify that the predicate pushdown on the nested column works properly. Pushdown on partition columns is guaranteed by the connector so the engine won't repeat the filter.

Member


This should be more strict. For example, it shouldn't match a...b.

IDENTIFIER = "[a-z_][a-z0-9_]*";
NAME = IDENTIFIER + "(\\." + NAME + ")*";

also, since we thus allow partitioning on a function from a field (like year(rec.timestamp)), we should add some test coverage for such usage too
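A non-recursive sketch of the stricter pattern the reviewer suggests (Java string constants cannot reference themselves, so the recursive NAME definition above is written out as "identifier followed by zero or more dot-identifier segments"; the class and constant names here are hypothetical):

```java
import java.util.regex.Pattern;

public class PartitionFieldNamePattern
{
    static final String IDENTIFIER = "[a-z_][a-z0-9_]*";
    // Equivalent non-recursive form of NAME: identifier ("." identifier)*
    static final String NAME = IDENTIFIER + "(\\." + IDENTIFIER + ")*";
    static final Pattern PATTERN = Pattern.compile(NAME);

    public static void main(String[] args)
    {
        System.out.println(PATTERN.matcher("rec.ts").matches()); // true
        System.out.println(PATTERN.matcher("a...b").matches());  // false: empty segments rejected
        // Note: transform forms like year(rec.ts) would need a separate
        // alternative in the grammar; this pattern covers bare field paths only.
    }
}
```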

Member


in #9354 @djsstarburst argues that using getChildren is incorrect.

@djsstarburst would you want to comment here?

Member


same as above

Member

findepi commented Sep 23, 2021

@Buktoria thank you for your PR!
I added a couple of initial comments.

cc @losipiuk @alexjo2144

@Buktoria Buktoria force-pushed the support-nested-partition-col-iceberg branch from bddab80 to 64d4144 Compare January 7, 2022 19:51
Member

mosabua commented Oct 28, 2022

👋 @Buktoria - this PR has become inactive. If you're still interested in working on it, please let us know, and we can try to get reviewers to help with that.

We're working on closing out old and inactive PRs, so if you're too busy or this has too many merge conflicts to be worth picking back up, we'll be making another pass to close it out in a few weeks.

Member Author

Buktoria commented Nov 1, 2022

Closing this; it's too far behind to be able to merge.


Development

Successfully merging this pull request may close these issues.

Add Iceberg tests for partition transforms on structured types

4 participants