Skip to content

Conversation

@alexeykudinkin
Copy link
Contributor

Tips

What is the purpose of the pull request

Fix handling of the isNotNull predicate in Data Skipping

Brief change log

  • Fixing handling of the isNotNull predicate in Data Skipping
  • Tidying up

Verify this pull request

This change added tests and can be verified as follows:

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@alexeykudinkin alexeykudinkin changed the title [HUDI-3739] Fix handling of the isNotNull predicate in Data Skipping [HUDI-3739][Stacked on 5208] Fix handling of the isNotNull predicate in Data Skipping Apr 5, 2022
@codope codope self-assigned this Apr 5, 2022
private val COLUMN_STATS_INDEX_MIN_VALUE_STAT_NAME = "minValue"
private val COLUMN_STATS_INDEX_MAX_VALUE_STAT_NAME = "maxValue"
private val COLUMN_STATS_INDEX_NUM_NULLS_STAT_NAME = "num_nulls"
private val COLUMN_STATS_INDEX_NULL_COUNT_STAT_NAME = "nullCount"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we not reuse HoodieMetadataPayload.* constants here as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call. This was copied over from a legacy component w/o much afterthought. Addressed

case IsNotNull(attribute: AttributeReference) =>
getTargetIndexedColumnName(attribute, indexSchema)
.map(colName => EqualTo(genColNumNullsExpr(colName), Literal(0)))
.map(colName => LessThan(genColNumNullsExpr(colName), genColValueCountExpr))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Filter colA is not null is the complement to colA is null then why the two have different translation (one has to depend on the valueCount while the other depends on Literal(0))?

I mean if colA is null is translated to GreaterThan(genColNumNullsExpr(colName), Literal(0)), then shouldn't colA is not null be translated to LessThanOrEqual(genColNumNullsExpr(colName), Literal(0))?

Or if you say that colA is not null should be translated to LessThan(genColNumNullsExpr(colName), genColValueCountExpr), then shouldn't colA is null be translated to GreaterThanOrEqual(genColNumNullsExpr(colName), genColValueCountExpr)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ha! Great question.

The reason is logical fallacy: "colA is not null" != "there is no null in colA", instead it's "colA contains non-null" (you can check out some other expressions, there are many expressions in this list that carry such properties):

  • "is null" means that the column has to contain null values (ie nullCount > 0)
  • "is not null" means that the column has to contain non-null values (ie nullCount < valueCount)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

got it. makes sense then.

@codope codope added priority:blocker Production down; release blocker priority:critical Production degraded; pipelines stalled and removed priority:critical Production degraded; pipelines stalled labels Apr 5, 2022
@alexeykudinkin alexeykudinkin force-pushed the ak/dskp-fix branch 2 times, most recently from f054124 to 6cabaa7 Compare April 6, 2022 01:28
Copy link
Member

@codope codope left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Can land once the CI is green.

@alexeykudinkin alexeykudinkin changed the title [HUDI-3739][Stacked on 5208] Fix handling of the isNotNull predicate in Data Skipping [HUDI-3739] Fix handling of the isNotNull predicate in Data Skipping Apr 6, 2022
@hudi-bot
Copy link
Collaborator

hudi-bot commented Apr 6, 2022

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@nsivabalan nsivabalan merged commit d43b4cd into apache:master Apr 6, 2022
xushiyan pushed a commit that referenced this pull request Apr 14, 2022
#5224)

- Fix handling of the isNotNull predicate in Data Skipping
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

priority:blocker Production down; release blocker

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants