Conversation
@kbendick kbendick commented Jan 13, 2021

This PR closes #2065

This changes the BinaryUtil.truncateBinary function to allow a zero-length truncation. This is necessary for handling
scans where STARTS_WITH filters (or any filter that uses truncateBinary) are applied to the table and a (possibly non-partition) column contains an empty string value.

It would be more efficient to return early in each of the given metrics evaluators when the length is zero, but I did not want to introduce too many changes that conflict with this PR: #2062. If we'd like, we can go in and update each of those functions individually, but this is the minimum possible change set (to the non-test code) to fix the issue now, and it is likely cleaner. Additionally, since we're immediately returning the same final empty ByteBuffer, there's very little overhead, and certainly no more than currently exists.
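For reference, the relaxed behavior can be sketched roughly as follows. This is a minimal, hypothetical simplification for illustration; the class name BinaryTruncateSketch and the exact shape are assumptions, not the actual Iceberg implementation:

```java
import java.nio.ByteBuffer;

public class BinaryTruncateSketch {
    // One shared zero-capacity buffer, so the zero-length case costs nothing per call.
    private static final ByteBuffer EMPTY_BYTE_BUFFER = ByteBuffer.allocate(0);

    // Truncate input to at most `length` bytes. Before the fix, a precondition
    // required length > 0 and threw for length == 0; the relaxed version returns
    // the shared empty buffer instead. (Sketch only, not the real Iceberg code.)
    public static ByteBuffer truncateBinary(ByteBuffer input, int length) {
        if (length < 0) {
            throw new IllegalArgumentException("Truncate length should be non-negative");
        }
        if (length == 0) {
            return EMPTY_BYTE_BUFFER;
        }
        if (length >= input.remaining()) {
            return input;  // already short enough, nothing to truncate
        }
        // Copy the first `length` bytes without disturbing the input's position
        byte[] head = new byte[length];
        input.duplicate().get(head);
        return ByteBuffer.wrap(head);
    }
}
```

With this shape, an empty-string predicate literal or an empty-string row value truncates to an empty buffer rather than throwing `Truncate length should be positive`.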

I chose to update the TestFilteredScan Spark test suites so that one record holds an empty string. This covers unpartitioned tables, filtered tables partitioned on the column that has the empty string, and tables partitioned on columns that do not have the empty string value (which was the case for the original ticket).

I also originally added a test that uses the Iceberg API directly in BaseTableTest, but it was not a partitioned table because writeRecords in that test suite assumes an unpartitioned table spec. I decided to remove it, as those code paths are all exercised in TestFilteredScan without adding any additional overhead to the CI suite. I did, however, add a test that better documents the (new) behavior of truncateBinary.

cc @changquanyou @aokolnychyi @RussellSpitzer

```java
private BinaryUtil() {
}

private static final ByteBuffer EMPTY_BYTE_BUFFER = ByteBuffer.allocate(0);
```
kbendick (Contributor Author):

I can make this lazily instantiated if we care, or just return a new empty ByteBuffer instance every time from the function, but it seems unnecessary to perform all of those allocations.
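For illustration, the two alternatives being weighed could look like this hypothetical sketch (class and method names are made up):

```java
import java.nio.ByteBuffer;

public class EmptyBufferOptions {
    // Option 1: one shared, eagerly allocated empty buffer (what the PR does).
    // A zero-capacity buffer has no content and its position/limit are pinned
    // at 0, so sharing a single instance is safe.
    private static final ByteBuffer SHARED_EMPTY = ByteBuffer.allocate(0);

    public static ByteBuffer sharedEmpty() {
        return SHARED_EMPTY;
    }

    // Option 2: allocate a fresh empty buffer on every call
    public static ByteBuffer freshEmpty() {
        return ByteBuffer.allocate(0);
    }
}
```

Both return an empty buffer; the shared constant just avoids a small allocation on every zero-length truncation.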

```java
record(schema, 0L, parse("2017-12-22T09:20:44.294658+00:00"), "junction"),
record(schema, 1L, parse("2017-12-22T07:15:34.582910+00:00"), "alligator"),
record(schema, 2L, parse("2017-12-22T06:02:09.243857+00:00"), "forrest"),
record(schema, 2L, parse("2017-12-22T06:02:09.243857+00:00"), ""),
```
kbendick (Contributor Author):

I can make a separate test if we feel that would be better. However, this test suite covers a very large number of cases, so it seems like a smart place to add it.

kbendick (Contributor Author):

I also think this is a good opportunity to start introducing more edge-case data into the tests without adding any overhead. It exercises this code path in a myriad of ways and does not affect the original tests (though I'll have to align my open PR that currently touches this file, but that's not a problem at all).

@kbendick kbendick Jan 13, 2021

Also, for reference, I did write several separate tests and they all passed. But it seems cleaner to me to introduce more common edge cases into the existing test cases, especially when it doesn't require changing the existing tests. I was initially surprised that the tests that run on random data hadn't caught this before, but most of the random-data tests use a predefined seed, so it's not that surprising.

That said, the tests in TestFilteredScan all exercise the code paths that deal with this. And after we merge this, I can add two small, inexpensive tests in the NOT_STARTS_WITH PR that will more explicitly cover this edge case: partitioning a DataFrame using filters like data like '%' and data not like '%', which would throw an exception due to the truncation length without this change (hopefully there aren't too many queries like the first one, but it's not my place to tell people how to write queries).

If there is any desire for a more explicit test or anything, please let me know 🙂

@kbendick kbendick Jan 13, 2021

So I actually found an inexpensive existing place to test truncateBinary itself. The function will therefore be better documented by means of its tests (if we agree that this change to the precondition is acceptable).

A reviewer (Contributor):

I'm fine adding the test here, but I'd rather not change the existing cases. Would adding a new record test this without affecting the existing ones?

@kbendick kbendick Jan 15, 2021

Yes, unfortunately adding a new record to the ones returned from this function would mean changing many of the existing tests, as many of them hardcode the assumption of 10 records.

I can add a new test in another (possibly new) file if we'd prefer. But to place the test here, against what seems to be one of the more comprehensive test suites for hitting the various evaluators in various scenarios (it tests against tables partitioned by the empty-string record, tables not partitioned by the empty-string record, and unpartitioned tables), we'd have to either change an existing record or update the tests to be aware of 11 records.

In particular, we need to test a scan of a table with a filter, which will exercise the various evaluators that can return ROWS_MIGHT_MATCH.

I'll take another look to see if there's an existing test suite where a new test can be easily added to cover this scenario without affecting other tests.

@kbendick kbendick Jan 15, 2021

If I add the following test to this file, it covers some of the code paths (albeit with a somewhat funny SQL query). It still exercises the truncateBinary function when the input needs to be truncated to a length of zero, although this time because the predicate literal has a length of zero rather than a row of the input data. Either way, with STARTS_WITH we truncate to the minimum of the lengths of the input field and the predicate literal.

The following test would throw prior to this patch (and it doesn't require touching the input data of the other test suites).

```java
@Test
public void testPartitionedByDataStartsWithEmptyStringFilter() {
  File location = buildPartitionedTable("partitioned_by_data", PARTITION_BY_DATA, "data_ident", "data");

  DataSourceOptions options = new DataSourceOptions(ImmutableMap.of(
      "path", location.toString())
  );

  IcebergSource source = new IcebergSource();
  DataSourceReader reader = source.createReader(options);
  pushFilters(reader, new StringStartsWith("data", ""));

  Assert.assertEquals(10, reader.planInputPartitions().size());
}
```

Alternatively, I could add a new test file entirely and have full control over what the test does.

@kbendick kbendick Jan 16, 2021

I added a test that does not change any of the existing test data or add a new file (it only extends existing ones), in TestInclusiveMetricsEvaluator: https://github.com/apache/iceberg/pull/2081/files#diff-a107d871a1a5d5c4029451db68cf5497ddc9432d90619e263270cf72c3536c23R73

I personally prefer changing the input data in the larger end-to-end test that covers many areas (TestFilteredScan), as it truly covers the situation where the bug was originally reported (and where it is likely to occur, based on the usages of truncateBinary): filtered scans with empty row data.

But I've added another possible way to test the reported scenario: TestInclusiveMetricsEvaluator.

If you'd like me to add a standalone spark test or something, let me know. Or if you prefer the new test, let me know and I can revert the change to the input records of the other test suite.

Thanks for your review @rdblue 🙂

@jackye1995 jackye1995 left a comment

This looks fine to me, given the tests all passed. But I wonder why empty string was not handled and added for testing from the very beginning, I may lack some context here. I see @aokolnychyi added the check for length > 0, was there any particular reason for that? Do you expect length == 0 to be handled before the pushdown?

…With and other expression evaluators to evaluate rows that contain the empty string
@kbendick kbendick force-pushed the handle-zero-width-truncation-for-startswith branch from 50c36ab to 174c3a5 on January 13, 2021 09:05
kbendick (Contributor Author):

> This looks fine to me, given the tests all passed. But I wonder why empty string was not handled and added for testing from the very beginning, I may lack some context here. I see @aokolnychyi added the check for length > 0, was there any particular reason for that? Do you expect length == 0 to be handled before the pushdown?

Yes, I would also be very interested to hear if there's any reason not to relax this precondition in the truncation function. I'm open to there being something I didn't catch.

For reference, I checked and the only place that truncateBinary is used is in metrics evaluation - namely for startsWith (and soon to be notStartsWith).

To be exact, in InclusiveMetricsEvaluator, ManifestEvaluator, and ParquetMetricsRowGroupFilter (in all of them within the visitor and exclusively with startsWith, which is where the bug occurs, presumably due to a row with an empty string). It's also used when updating the lower and upper bounds for min/max Parquet metrics for FIXED and BINARY types (in ParquetUtils). So it's only used with metrics and literals, and it's overall relatively well tested. I can add more tests on the binary and fixed types as a follow-up if we feel it's necessary, but overall those are all decently covered in the tests, especially after I add my tests from the NOT_STARTS_WITH PR, which covers it further: #2062.
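As a rough illustration of how those evaluators use prefix truncation, consider this simplified string-based sketch (the real evaluators work on byte-level bounds from file metrics via comparators; the class and method names here are made up):

```java
public class StartsWithSketch {
    // Rows in a file MIGHT match "col startsWith prefix" when the prefix is
    // lexicographically between the column's lower and upper bounds, each
    // truncated to the prefix length. For an empty prefix (len == 0), both
    // truncated bounds are "" and every file might match -- this truncation
    // to length zero is the step that previously failed the precondition.
    public static boolean rowsMightMatch(String lowerBound, String upperBound, String prefix) {
        int len = prefix.length();
        String truncatedLower = lowerBound.substring(0, Math.min(len, lowerBound.length()));
        String truncatedUpper = upperBound.substring(0, Math.min(len, upperBound.length()));
        return truncatedLower.compareTo(prefix) <= 0 && truncatedUpper.compareTo(prefix) >= 0;
    }
}
```

With bounds like "alligator" and "junction" (from the test data above), a prefix of "for" might match, a prefix of "zzz" cannot, and the empty prefix always might match.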

@github-actions github-actions bot added the core label Jan 13, 2021

kbendick commented Jan 13, 2021

> I see [Anton] added the check for length > 0, was there any particular reason for that? Do you expect length == 0 to be handled before the pushdown?

As for whether length == 0 should be handled before the pushdown: if I understand correctly, it should be. By the time this function is invoked in the places I listed elsewhere, all expressions should have been rewritten to remove NOT nodes and then bound (both the term to the correct field or predicate id, and the literal used for comparison in literal expressions).

Given that this is always called during metrics evaluation or row-value evaluation against a predicate, typically STARTS_WITH (and soon NOT_STARTS_WITH), the expressions being evaluated are all BoundLiteralPredicates. An empty predicate literal in that situation (e.g. data not like '%') is not necessarily an invalid expression. Nor are empty row values, such as title = '', invalid when a predicate like title like 'apple%' must be evaluated at the row level; that is where the precondition causes problems and what is most likely causing the user's error.

Though for that particular query, it does get rewritten via parsing, binding, and rewriting: to data != '', or in strict metrics evaluation to both title is not null and title like 'apple%' as well as title != 'apple' at one point.

I'm not as closely familiar with the min and max functions, but my understanding, after spending a decent amount of time with them while implementing NOT_STARTS_WITH, is that they too operate under these same conditions and are only applied to Parquet metadata after expressions have been rewritten into some kind of BoundPredicate that can be applied to the Parquet Binary type's lower- and upper-bound metrics.

Please correct me if I am mistaken 🙂 .


kbendick commented Jan 15, 2021

My guess is that the truncation-width constraint was added to catch the case where somebody has set a column's metrics to truncation[0]. However, this isn't the best place for that constraint, and as evidenced by the original issue, it has unintended effects when the table contains empty-string values that wind up being evaluated (e.g. they have stats that get evaluated).

There is already a precondition check that disallows configuring a truncation length == 0 for the metrics config, as well as a test for it:
- Code that disallows a zero length when configuring truncation for a column's metrics: https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/MetricsModes.java#L105-L108
- Test showing that trying to set a truncation length of zero for a column's metrics will throw: https://github.com/apache/iceberg/blob/master/core/src/test/java/org/apache/iceberg/TestMetricsModes.java#L50-L55
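The distinction can be sketched like this (hypothetical simplified code; MetricsModeSketch and validateConfiguredLength are made-up names, not the real MetricsModes API): the config-level check keeps rejecting a zero truncation length, while the runtime truncateBinary no longer needs to.

```java
public class MetricsModeSketch {
    // Config-time check: a user-configured metrics truncation length must be
    // positive. This guard stays in place even after truncateBinary itself
    // is relaxed to accept a zero-length truncation at evaluation time.
    public static int validateConfiguredLength(int length) {
        if (length <= 0) {
            throw new IllegalArgumentException("Truncate length should be positive");
        }
        return length;
    }
}
```

So truncation[0] is still rejected where the user configures it, which is a better home for the constraint than the evaluation path.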

@rdblue rdblue merged commit a59a169 into apache:master Jan 16, 2021

rdblue commented Jan 16, 2021

Thanks, @kbendick! Good catch.

kbendick (Contributor Author):

> Thanks, @kbendick! Good catch.

My pleasure!

@kbendick kbendick deleted the handle-zero-width-truncation-for-startswith branch January 17, 2021 23:35
XuQianJin-Stars pushed a commit to XuQianJin-Stars/iceberg that referenced this pull request Mar 22, 2021

Successfully merging this pull request may close these issues.

java.lang.IllegalArgumentException: Truncate length should be positive

3 participants