Merged
@@ -247,26 +247,29 @@ private TableStatistics makeTableStatistics(StatisticsFileCache statisticsFileCa
         for (IcebergColumnHandle columnHandle : selectedColumns) {
             int fieldId = columnHandle.getId();
             ColumnStatistics.Builder columnBuilder = tableStats.getOrDefault(fieldId, ColumnStatistics.builder());
-            Long nullCount = summary.getNullCounts().get(fieldId);
-            if (nullCount != null) {
-                columnBuilder.setNullsFraction(Estimate.of(nullCount / recordCount));
-            }
-
-            Object min = summary.getMinValues().get(fieldId);
-            Object max = summary.getMaxValues().get(fieldId);
-            if (min instanceof Number && max instanceof Number) {
-                DoubleRange range = new DoubleRange(((Number) min).doubleValue(), ((Number) max).doubleValue());
-                columnBuilder.setRange(Optional.of(range));
-
-                // the histogram is generated by scanning the entire dataset. It is possible that
-                // the constraint prevents scanning portions of the table. Given that we know the
-                // range that the scan provides for a particular column, bound the histogram to the
-                // scanned range.
-                final DoubleRange histRange = range;
-                columnBuilder.setHistogram(columnBuilder.getHistogram()
-                        .map(histogram -> DisjointRangeDomainHistogram
-                                .addConjunction(histogram, Range.range(DOUBLE, histRange.getMin(), true, histRange.getMax(), true))));
-            }
+            if (summary.hasValidColumnMetrics()) {
+                Long nullCount = summary.getNullCounts().get(fieldId);
Contributor:

issue (bug_risk): Potential loss of precision in nulls fraction calculation.

Cast nullCount and recordCount to double before dividing to avoid integer division and ensure correct fraction calculation.

Member (hantangwangd):

A little curious: in what scenario did you encounter this problem, and is there a way to reproduce it? Even when I set "write.metadata.metrics.max-inferred-column-defaults" to 0, the column statistics still return an empty map rather than null.

Contributor Author (PingLiuPing):

@hantangwangd Thanks for the comment. I'm trying to submit a PR to support writing to Iceberg tables from Prestissimo. The basic insertion PR in Velox has been merged, and now I want to change the code in Prestissimo so that insertion works end to end. When I added unit tests to cover this in presto-native-execution, I could reproduce this error. The query is a simple join.

Contributor Author (PingLiuPing), Oct 15, 2025:

The basic insertion in Velox does not return any metrics to the coordinator node; that was planned to be added in a follow-up PR. I think this is the root cause of the null pointer.

Contributor Author (PingLiuPing):

@hantangwangd I think it is possible to have null instead of an empty null-count map. See Partition.java: check its constructor and its updateNullCount method.
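For illustration, a minimal sketch (using the five-argument org.apache.iceberg.Metrics constructor that appears later in this thread) of how the per-column maps can be absent entirely rather than empty:

    import org.apache.iceberg.Metrics;
    import java.util.Map;

    // A writer that reports only a row count leaves every per-column map null.
    Metrics metrics = new Metrics(3L, null, null, null, null);

    // The accessor returns the null reference itself, not an empty map, so a caller
    // doing summary.getNullCounts().get(fieldId) hits a NullPointerException.
    Map<Integer, Long> nullCounts = metrics.nullValueCounts();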

Member (hantangwangd):

@PingLiuPing Thanks for the details. I was considering whether we could add a test case for this scenario, and found that this example works:
    @Test
    public void testTableWithNullColumnStats()
    {
        String tableName1 = "test_null_stats1_" + randomTableSuffix();
        String tableName2 = "test_null_stats2_" + randomTableSuffix();
        try {
            assertUpdate("CREATE TABLE " + tableName1 + "(id int, name varchar) WITH (\"write.format.default\" = 'PARQUET')");
            assertUpdate("INSERT INTO " + tableName1 + " VALUES(1, '1001'), (2, '1002'), (3, '1003')", 3);
            Table icebergTable1 = loadTable(tableName1);
            String dataFilePath = (String) computeActual("SELECT file_path FROM \"" + tableName1 + "$files\" LIMIT 1").getOnlyValue();

            assertUpdate("CREATE TABLE " + tableName2 + "(id int, name varchar) WITH (\"write.format.default\" = 'PARQUET')");
            Table icebergTable2 = loadTable(tableName2);
            Metrics newMetrics = new Metrics(3L, null, null, null, null);
            DataFile dataFile = DataFiles.builder(icebergTable1.spec())
                    .withPath(dataFilePath)
                    .withFormat("PARQUET")
                    .withFileSizeInBytes(1234L)
                    .withMetrics(newMetrics)
                    .build();
            icebergTable2.newAppend().appendFile(dataFile).commit();

            assertQuery("select t1.id, t2.name from " + tableName1 + " t1 inner join " + tableName2 + " t2 on t1.id = t2.id order by t1.id",
                    "values(1, '1001'), (2, '1002'), (3, '1003')");
        }
        finally {
            assertUpdate("DROP TABLE " + tableName2);
            assertUpdate("DROP TABLE " + tableName1);
        }
    }

Could you add a similar test case for this change?

Contributor Author (PingLiuPing):

Thanks, I will add a case for this.

Contributor Author (PingLiuPing):

@hantangwangd I could not find this case in the prestodb repo, but I think it fits this scenario, thank you very much. I have added it to IcebergDistributedTestBase.java.

Contributor Author (PingLiuPing), Oct 16, 2025:

Without the fix,

    assertQuery("select t1.id, t2.name from " + tableName1 + " t1 inner join " + tableName2 + " t2 on t1.id = t2.id order by t1.id",
            "values(1, '1001'), (2, '1002'), (3, '1003')");

reports the same error I encountered.

+                if (nullCount != null) {
+                    columnBuilder.setNullsFraction(Estimate.of(nullCount / recordCount));
+                }
+
+                Object min = summary.getMinValues().get(fieldId);
+                Object max = summary.getMaxValues().get(fieldId);
+                if (min instanceof Number && max instanceof Number) {
+                    DoubleRange range = new DoubleRange(((Number) min).doubleValue(), ((Number) max).doubleValue());
+                    columnBuilder.setRange(Optional.of(range));
+
+                    // the histogram is generated by scanning the entire dataset. It is possible that
+                    // the constraint prevents scanning portions of the table. Given that we know the
+                    // range that the scan provides for a particular column, bound the histogram to the
+                    // scanned range.
+                    final DoubleRange histRange = range;
+                    columnBuilder.setHistogram(columnBuilder.getHistogram()
+                            .map(histogram -> DisjointRangeDomainHistogram
+                                    .addConjunction(histogram, Range.range(DOUBLE, histRange.getMin(), true, histRange.getMax(), true))));
+                }
+            }
             result.setColumnStatistics(columnHandle, columnBuilder.build());
         }
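For context, an illustrative sketch (not the actual Presto implementation) of the guard semantics the new check relies on: a partition summary only has valid column metrics when every per-column map was actually supplied by the writer.

    // Illustrative only: the name mirrors the diff, the body is an assumption.
    private boolean hasValidColumnMetrics()
    {
        return nullCounts != null && minValues != null && maxValues != null;
    }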
@@ -70,7 +70,10 @@
 import org.apache.hadoop.fs.Path;
 import org.apache.iceberg.BaseTable;
 import org.apache.iceberg.CatalogUtil;
+import org.apache.iceberg.DataFile;
+import org.apache.iceberg.DataFiles;
 import org.apache.iceberg.FileScanTask;
+import org.apache.iceberg.Metrics;
 import org.apache.iceberg.PartitionSpec;
 import org.apache.iceberg.Schema;
 import org.apache.iceberg.Snapshot;
@@ -3237,4 +3240,46 @@ public void testEqualityDeleteAsJoinWithMaximumFieldsLimitOverLimit()
             dropTable(session, tableName);
         }
     }
+
+    @Test
+    public void testTableWithNullColumnStats()
+    {
+        String tableName1 = "test_null_stats1";
+        String tableName2 = "test_null_stats2";
+        try {
+            assertUpdate(String.format("CREATE TABLE %s (id int, name varchar) WITH (\"write.format.default\" = 'PARQUET')", tableName1));
+            assertUpdate(String.format("INSERT INTO %s VALUES(1, '1001'), (2, '1002'), (3, '1003')", tableName1), 3);
+            Table icebergTable1 = loadTable(tableName1);
+            String dataFilePath = (String) computeActual(String.format("SELECT file_path FROM \"%s$files\" LIMIT 1", tableName1)).getOnlyValue();
+
+            assertUpdate(String.format("CREATE TABLE %s (id int, name varchar) WITH (\"write.format.default\" = 'PARQUET')", tableName2));
+            Table icebergTable2 = loadTable(tableName2);
+            Metrics newMetrics = new Metrics(3L, null, null, null, null);
+            DataFile dataFile = DataFiles.builder(icebergTable1.spec())
+                    .withPath(dataFilePath)
+                    .withFormat("PARQUET")
+                    .withFileSizeInBytes(1234L)
+                    .withMetrics(newMetrics)
+                    .build();
+            icebergTable2.newAppend().appendFile(dataFile).commit();
+
+            TableStatistics stats = getTableStats(tableName2);
+            assertEquals(stats.getRowCount(), Estimate.of(3.0));
+
+            // Assert that column statistics are present (even if they don't have detailed metrics)
+            assertFalse(stats.getColumnStatistics().isEmpty());
+
+            for (Map.Entry<ColumnHandle, ColumnStatistics> entry : stats.getColumnStatistics().entrySet()) {
+                ColumnStatistics columnStats = entry.getValue();
+                assertNotNull(columnStats);
+            }
+
+            assertQuery(String.format("SELECT t1.id, t2.name FROM %s t1 INNER JOIN %s t2 ON t1.id = t2.id ORDER BY t1.id", tableName1, tableName2),
+                    "VALUES(1, '1001'), (2, '1002'), (3, '1003')");
+        }
+        finally {
+            assertUpdate(String.format("DROP TABLE IF EXISTS %s", tableName2));
+            assertUpdate(String.format("DROP TABLE IF EXISTS %s", tableName1));
+        }
+    }
 }