@shardulm94
Contributor

ORC uses SQL semantics for Search Arguments, so an expression like col != 1 will exclude rows where col is NULL along with rows where col = 1. In contrast, Iceberg's Expressions will keep rows with NULL values, so the equivalent ORC Search Argument for an Iceberg Expression col != x is col IS NULL OR col != x.
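The difference can be illustrated with a small, self-contained sketch. It does not use the Iceberg or ORC APIs: the class `NotEqualsNullSemantics` and all of its methods are hypothetical names invented for this example, and a nullable `Integer` stands in for a column value. It only models the two evaluation rules described above and shows that rewriting `col != x` as `col IS NULL OR col != x` makes the SQL-style result match what Iceberg expects.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical demo class; not part of Iceberg or ORC.
final class NotEqualsNullSemantics {

  // SQL-style evaluation of `col != x`: a NULL operand yields UNKNOWN,
  // which a filter treats the same as false, so the row is dropped.
  static boolean sqlNotEquals(Integer col, int x) {
    return col != null && col != x;
  }

  // Iceberg-style evaluation of `col != x` as described in this PR:
  // rows where col is NULL are kept.
  static boolean icebergNotEquals(Integer col, int x) {
    return col == null || col != x;
  }

  // The rewritten predicate `col IS NULL OR col != x`, evaluated with SQL semantics.
  static boolean rewrittenSqlPredicate(Integer col, int x) {
    return col == null || sqlNotEquals(col, x);
  }

  // Keeps the rows for which the predicate returns true (null rows are allowed).
  static List<Integer> filter(List<Integer> rows, Predicate<Integer> predicate) {
    List<Integer> kept = new ArrayList<>();
    for (Integer row : rows) {
      if (predicate.test(row)) {
        kept.add(row);
      }
    }
    return kept;
  }

  public static void main(String[] args) {
    List<Integer> rows = Arrays.asList(1, 2, null, 3);
    System.out.println(filter(rows, v -> sqlNotEquals(v, 1)));          // [2, 3]
    System.out.println(filter(rows, v -> icebergNotEquals(v, 1)));      // [2, null, 3]
    System.out.println(filter(rows, v -> rewrittenSqlPredicate(v, 1))); // [2, null, 3]
  }
}
```

With plain SQL semantics the NULL row is lost; the rewritten predicate returns the same rows as the Iceberg-style evaluation.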

This PR fixes the issue of the ORC pushdown returning fewer rows than Iceberg expects.

// ORC-623: ORC seems to incorrectly skip a row group for a notIn(column, {X, ...}) predicate on a column which
// has only 1 non-null value X but also has nulls
The code comment above suggests that this might be a bug in ORC, but in fact it's just a case of mismatched semantics. When converting Iceberg expressions to ORC Search Arguments, we now account for this semantic difference.
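To make the row-group case from that comment concrete, here is a simplified sketch of statistics-based skipping. It is an illustration only and does not reflect how ORC actually evaluates Search Arguments: `RowGroupNotInSketch`, `Stats`, `mayMatchSql`, and `mayMatchIceberg` are hypothetical names, and the statistics are reduced to a minimum, a maximum, and a has-null flag (an all-null row group, which has no minimum or maximum, is not modeled).

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical demo class; not part of Iceberg or ORC.
final class RowGroupNotInSketch {

  // Simplified stand-in for the column statistics of one row group.
  static final class Stats {
    final long min;
    final long max;
    final boolean hasNull;

    Stats(long min, long max, boolean hasNull) {
      this.min = min;
      this.max = max;
      this.hasNull = hasNull;
    }
  }

  // SQL semantics for notIn(col, excluded): NULL rows never match, so a row
  // group whose only non-null value is excluded cannot contain a matching row.
  static boolean mayMatchSql(Stats stats, List<Long> excluded) {
    boolean singleValue = stats.min == stats.max;
    return !(singleValue && excluded.contains(stats.min));
  }

  // Iceberg semantics as described in this PR: NULL rows satisfy notIn,
  // so any row group that contains nulls has to be read.
  static boolean mayMatchIceberg(Stats stats, List<Long> excluded) {
    return stats.hasNull || mayMatchSql(stats, excluded);
  }

  public static void main(String[] args) {
    // One non-null value X = 5, plus nulls: the situation from the ORC-623 comment.
    Stats stats = new Stats(5L, 5L, true);
    List<Long> excluded = Arrays.asList(5L);
    System.out.println(mayMatchSql(stats, excluded));     // false: row group skipped
    System.out.println(mayMatchIceberg(stats, excluded)); // true: row group must be read
  }
}
```

Under these assumptions, skipping the row group is correct for SQL semantics and only looks like a bug when the caller expects NULL rows to be returned.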

The wider question of SQL compatibility for Iceberg expressions is being discussed on the dev list.

@rdblue
Contributor

rdblue commented Sep 30, 2020

+1

@rdblue rdblue merged commit 680798a into apache:master Sep 30, 2020

rdblue commented Sep 30, 2020

Thanks for fixing this, @shardulm94!
