-
Notifications
You must be signed in to change notification settings - Fork 3k
Allow binary truncation length to be zero to handle evaluators that encounter empty string values #2081
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow binary truncation length to be zero to handle evaluators that encounter empty string values #2081
Changes from all commits
174c3a5
010c1e5
e2195c3
276c975
1416cb5
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -529,7 +529,7 @@ private List<Record> testRecords(Schema schema) { | |
| return Lists.newArrayList( | ||
| record(schema, 0L, parse("2017-12-22T09:20:44.294658+00:00"), "junction"), | ||
| record(schema, 1L, parse("2017-12-22T07:15:34.582910+00:00"), "alligator"), | ||
| record(schema, 2L, parse("2017-12-22T06:02:09.243857+00:00"), "forrest"), | ||
| record(schema, 2L, parse("2017-12-22T06:02:09.243857+00:00"), ""), | ||
|
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I can make a separate test if we feel that would be better. However, this test suite covers a very large number of cases so it seems like a smart place to add it.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I also think that it's a good opportunity to start introducing more edge case data into the tests without adding any overhead, and it tests this code path in a myriad of ways and does not affect the original test (though I'll have to align my open PR that currently touches this file, but that's not a problem at all).
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Also, for reference, I did make several separate tests and they all passed. But it just seems cleaner to me to introduce more common edge cases into the existing test cases, especially when it doesn't require us to change the existing tests. I'm surprised that the tests that run on random data hasn't caught this before. But most of the random data tests use a predefined seed, so I'm not that surprised. That said, the tests in If there is any desire for a more explicit test or anything, please let me know 🙂
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. So I actually found an inexpensive existing place to test
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm fine adding the test here, but I'd rather not change the existing cases. Would adding a new record test this without affecting the existing ones?
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes unfortunately, adding a new record to be returned from this function would mean we'd have to change many of the existing tests, as many of them have a hardcoded assumption on 10 records. I can add a new test in another (possibly new) file if we'd prefer. But in order to place the test in here to be run against what seems to be one of the more comprehensive test suites for hitting various evaluators in various scenarios (i.e. it tests against tables partitioned by the empty string record, tables not partitioned by the empty string record, unpartitioned tables), we'd have to either change an existing record or update the tests to be aware of 11 records. In particular, we need to test on a scan of a table with a filter, which will cause the various evaluators which can return I'll take another look to see if there's an existing test suite where a new test can be easily added to cover this scenario without affecting other tests.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If I add the following test to this file, it does cover some of the codepaths (albeit with a somewhat funny SQL query). This will still go over the The following test would throw prior to this patch (and it doesn't require us to touch any of the input data for the other test suites). @Test
public void testPartitionedByDataStartsWithEmptyStringFilter() {
File location = buildPartitionedTable("partitioned_by_data", PARTITION_BY_DATA, "data_ident", "data");
DataSourceOptions options = new DataSourceOptions(ImmutableMap.of(
"path", location.toString())
);
IcebergSource source = new IcebergSource();
DataSourceReader reader = source.createReader(options);
pushFilters(reader, new StringStartsWith("data", ""));
Assert.assertEquals(10, reader.planInputPartitions().size());
}Alternatively, I could add a new test file entirely and have full control over what the test does.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I added a test that does not change any of the exising test data or add a new file at all (only extends existings), in I personally prefer changing the input data on the larger end to end test that covers many portions ( But I've added another possible way to test the reported scenario - If you'd like me to add a standalone spark test or something, let me know. Or if you prefer the new test, let me know and I can revert the change to the input records of the other test suite. Thanks for your review @rdblue 🙂 |
||
| record(schema, 3L, parse("2017-12-22T03:10:11.134509+00:00"), "clapping"), | ||
| record(schema, 4L, parse("2017-12-22T00:34:00.184671+00:00"), "brush"), | ||
| record(schema, 5L, parse("2017-12-21T22:20:08.935889+00:00"), "trap"), | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can make this lazily instantiated if we care. Or just return a new empty byte buffer instance every time in the function, but seems unnecessary to have all of those allocations.