Allow binary truncation length to be zero to handle evaluators that encounter empty string values #2081
Conversation
private BinaryUtil() {
}

private static final ByteBuffer EMPTY_BYTE_BUFFER = ByteBuffer.allocate(0);
I can make this lazily instantiated if we care. Or just return a new empty byte buffer instance every time in the function, but seems unnecessary to have all of those allocations.
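For illustration, here is a hedged sketch of the shared-empty-buffer idea (the class and method names are hypothetical, not the actual Iceberg BinaryUtil code): a single static zero-capacity buffer is returned for zero-length truncation, so no allocation happens per call.

```java
import java.nio.ByteBuffer;

// Illustrative sketch only; names and structure are hypothetical,
// not the actual Iceberg BinaryUtil implementation.
public class EmptyBufferSketch {
  // A zero-capacity buffer has no content to mutate, so one shared
  // instance can be handed out safely instead of allocating each time.
  private static final ByteBuffer EMPTY_BYTE_BUFFER = ByteBuffer.allocate(0);

  static ByteBuffer truncate(ByteBuffer value, int length) {
    if (length == 0) {
      return EMPTY_BYTE_BUFFER; // same instance every call, no allocation
    }
    // duplicate() shares the backing array but gets independent position/limit
    ByteBuffer copy = value.duplicate();
    copy.limit(Math.min(copy.limit(), copy.position() + length));
    return copy;
  }

  public static void main(String[] args) {
    ByteBuffer input = ByteBuffer.wrap(new byte[] {1, 2, 3});
    if (truncate(input, 0).remaining() != 0) throw new AssertionError();
    if (truncate(input, 0) != truncate(input, 0)) throw new AssertionError(); // shared constant
    if (truncate(input, 2).remaining() != 2) throw new AssertionError();
    System.out.println("ok");
  }
}
```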
  record(schema, 0L, parse("2017-12-22T09:20:44.294658+00:00"), "junction"),
  record(schema, 1L, parse("2017-12-22T07:15:34.582910+00:00"), "alligator"),
- record(schema, 2L, parse("2017-12-22T06:02:09.243857+00:00"), "forrest"),
+ record(schema, 2L, parse("2017-12-22T06:02:09.243857+00:00"), ""),
I can make a separate test if we feel that would be better. However, this test suite covers a very large number of cases so it seems like a smart place to add it.
I also think that it's a good opportunity to start introducing more edge case data into the tests without adding any overhead, and it tests this code path in a myriad of ways and does not affect the original test (though I'll have to align my open PR that currently touches this file, but that's not a problem at all).
Also, for reference, I did write several separate tests and they all passed. But it seems cleaner to me to introduce more common edge cases into the existing test cases, especially when it doesn't require changing the existing tests. I'm surprised that the tests that run on random data haven't caught this before, though most of them use a predefined seed, so perhaps it's not that surprising.
That said, the tests in TestFilteredScan all exercise the code paths that deal with this. And after we merge this, I can add two small, inexpensive tests in the NOT_STARTS_WITH PR that will more explicitly cover this edge case, by partitioning a DataFrame using filters of data like '%' and data not like '%', which would throw an exception due to the truncation length without this change (though hopefully there aren't too many queries like the first one, but it's not my place to tell people how to write queries).
If there is any desire for a more explicit test or anything, please let me know 🙂
So I actually found an inexpensive existing place to test truncateBinary itself, so the function will be better documented by its tests (if we agree that this change to the preconditions is acceptable).
I'm fine adding the test here, but I'd rather not change the existing cases. Would adding a new record test this without affecting the existing ones?
Yes, unfortunately adding a new record to be returned from this function would mean changing many of the existing tests, as many of them hardcode an assumption of 10 records.
I can add a new test in another (possibly new) file if we'd prefer. But in order to place the test in here to be run against what seems to be one of the more comprehensive test suites for hitting various evaluators in various scenarios (i.e. it tests against tables partitioned by the empty string record, tables not partitioned by the empty string record, unpartitioned tables), we'd have to either change an existing record or update the tests to be aware of 11 records.
In particular, we need to test on a scan of a table with a filter, which will cause the various evaluators which can return ROWS_MIGHT_MATCH.
I'll take another look to see if there's an existing test suite where a new test can be easily added to cover this scenario without affecting other tests.
If I add the following test to this file, it does cover some of the codepaths (albeit with a somewhat funny SQL query). This will still go over the truncateBinary function when the input needs to be truncated to a length of zero, although this time it will be due to the predicate literal having a length of zero instead of the input data having a row with a length of zero. Either way, we truncate to the minimum length of either the input field or the predicate literal when using STARTS_WITH.
The following test would throw prior to this patch (and it doesn't require us to touch any of the input data for the other test suites).
@Test
public void testPartitionedByDataStartsWithEmptyStringFilter() {
  File location = buildPartitionedTable("partitioned_by_data", PARTITION_BY_DATA, "data_ident", "data");
  DataSourceOptions options = new DataSourceOptions(ImmutableMap.of(
      "path", location.toString()));

  IcebergSource source = new IcebergSource();
  DataSourceReader reader = source.createReader(options);
  pushFilters(reader, new StringStartsWith("data", ""));

  Assert.assertEquals(10, reader.planInputPartitions().size());
}

Alternatively, I could add a new test file entirely and have full control over what the test does.
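The min-length prefix comparison described in this thread can be sketched as follows. This is a hypothetical helper, not the actual Iceberg evaluator code: it decides whether a row value can match a STARTS_WITH(prefix) filter, comparing only the common prefix length, so an empty prefix produces a zero-length truncation that must be allowed rather than throw.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, not the actual Iceberg evaluator: decides whether a
// row with the given column value matches a STARTS_WITH(prefix) filter.
public class StartsWithSketch {
  static boolean rowsMightMatch(String value, String prefix) {
    byte[] v = value.getBytes(StandardCharsets.UTF_8);
    byte[] p = prefix.getBytes(StandardCharsets.UTF_8);
    // Compare only the common prefix length; when the literal is empty this
    // is a zero-length truncation, which previously tripped the precondition.
    int length = Math.min(v.length, p.length);
    return v.length >= p.length
        && Arrays.equals(Arrays.copyOf(v, length), Arrays.copyOf(p, length));
  }

  public static void main(String[] args) {
    if (!rowsMightMatch("alligator", "")) throw new AssertionError(); // empty prefix matches all
    if (!rowsMightMatch("", "")) throw new AssertionError();          // empty value, empty prefix
    if (rowsMightMatch("", "a")) throw new AssertionError();          // value shorter than prefix
    if (!rowsMightMatch("junction", "jun")) throw new AssertionError();
    System.out.println("ok");
  }
}
```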
I added a test that does not change any of the existing test data or add a new file (it only extends existing tests), in TestInclusiveMetricsEvaluator: https://github.com/apache/iceberg/pull/2081/files#diff-a107d871a1a5d5c4029451db68cf5497ddc9432d90619e263270cf72c3536c23R73

I personally prefer changing the input data in the larger end-to-end test that covers many code paths (TestFilteredScan), as it truly covers the situation where the bug was originally reported (and where it is likely to happen based on usages of truncateBinary): filtered scans with empty row data.

But I've added another possible way to test the reported scenario, in TestInclusiveMetricsEvaluator. If you'd like me to add a standalone Spark test or something, let me know. Or if you prefer the new test, let me know and I can revert the change to the input records of the other test suite.
Thanks for your review @rdblue 🙂
jackye1995
left a comment
This looks fine to me, given that the tests all passed. But I wonder why the empty string was not handled and tested from the very beginning; I may lack some context here. I see @aokolnychyi added the check for length > 0. Was there any particular reason for that? Do you expect length == 0 to be handled before the pushdown?
Yes, I would also be very interested to hear if there's any reason not to allow the truncation function to relax this precondition. I'm open to there being something I didn't catch.
As for whether length == 0 should be handled before the pushdown: it should be, if I understand correctly. By the time this function is invoked in the places I listed elsewhere, all expressions should have been rewritten to remove NOT nodes and then bound (both the term to the correct field id and the literal used for comparison in the literal expressions). Given that this is always called either in metrics evaluation or in row value evaluation against a predicate, typically STARTS_WITH or soon to be NOT_STARTS_WITH, the expressions being evaluated are all bound by then. Though for that particular query, it does get rewritten. I'm not as closely familiar with that part of the code, so please correct me if I am mistaken 🙂
My guess is that the truncation width constraint was added to catch the case where somebody has misconfigured a column's metrics truncation length. There already is a precondition check disallowing a negative truncation length.
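A minimal sketch of the relaxed precondition (illustrative only; the actual Iceberg code uses Guava's Preconditions and may differ in wording): the guard changes from rejecting any non-positive length to rejecting only negative lengths.

```java
// Illustrative only; not the actual BinaryUtil source.
public class TruncationPrecondition {
  static void checkLength(int length) {
    // Before this change the guard was effectively `length > 0`; relaxing it
    // to `length >= 0` permits zero-length truncation while still rejecting
    // negative lengths.
    if (length < 0) {
      throw new IllegalArgumentException("Truncate length should be non-negative: " + length);
    }
  }

  public static void main(String[] args) {
    checkLength(0); // now allowed
    checkLength(5);
    boolean threw = false;
    try {
      checkLength(-1);
    } catch (IllegalArgumentException e) {
      threw = true;
    }
    if (!threw) throw new AssertionError();
    System.out.println("ok");
  }
}
```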
Thanks, @kbendick! Good catch.

My pleasure!
This PR closes #2065.

This allows the `BinaryUtil.truncateBinary` function to perform a zero-length truncation. This is necessary for handling scans where STARTS_WITH filters are applied to the table (or any filter that uses `truncateBinary`) when a (possibly non-partition) column has an empty string value in it.

It would be more efficient to return early in each of the affected metrics evaluators when the length is zero, but I did not want to introduce too many changes that conflict with PR #2062. If we'd like, we can update all of those functions individually later, but this is the minimum possible change set (to the non-test code) to fix the issue now, and is likely cleaner. Additionally, since we return the same empty final ByteBuffer right away, there is not much overhead, and certainly no more than currently exists.
I chose to update the `TestFilteredScan` Spark test suites to have one record contain an empty string. This covers unpartitioned tables, tables with filters that are partitioned on a column containing the empty string, as well as tables that are partitioned on columns that do not have the empty string value (which was the case for the original ticket).

I also originally added a test that uses the Iceberg API directly in `BaseTableTest`, but it was not a partitioned table, as `writeRecords` in that test suite assumes an unpartitioned table spec. I decided to remove it since those code paths are all handled in `TestFilteredScan` without adding any additional overhead to the CI suite. I did, however, add a test that better documents the (new) behavior of `truncateBinary`.

cc @changquanyou @aokolnychyi @RussellSpitzer