Apply info column filters during split generation#23411
Apply info column filters during split generation#23411nmahadevuni merged 1 commit intoprestodb:masterfrom
Conversation
|
@yingsu00 @aditi-pandit Can you please review this? |
|
Nit, suggest minor revision of release note entry for consistency with the Order of Changes in the Release Notes Guidelines, and add the PR number. Maybe something like this: |
aditi-pandit
left a comment
There was a problem hiding this comment.
Thanks @nmahadevuni for this fix. Had just one question.
| assertQuery(format("SELECT * from test_hidden_columns WHERE \"$file_size\"=%d", fileSize)); | ||
|
|
||
| // A bug used to return all rows even though filters on hidden column were present in the query. | ||
| // So checking the count here too to verify the number of rows returned is correct. Since the bug was present |
There was a problem hiding this comment.
Nit : "too" should be "to"
|
|
||
| // A bug used to return all rows even though filters on hidden column were present in the query. | ||
| // So checking the count here too to verify the number of rows returned is correct. Since the bug was present | ||
| // for both Java and Native Presto for other than $path column, the assertQuery test above used to pass. |
There was a problem hiding this comment.
Nit : Rephrase "non-$path info column" instead of "other than $path column".
| long fileModificationTime) | ||
| { | ||
| if (constraints.isEmpty()) { | ||
| return true; |
There was a problem hiding this comment.
Should this be false instead of true ? I'm not super familiar with this code. But when would constraints be empty ? Would be great to have a test for this situation.
There was a problem hiding this comment.
Constraints would be empty when there are no filters on info columns in the query. We should return true, so the split will be scheduled to worker. Any simple SELECT query without filters or filters on non-info columns will already be testing this scenario.
|
@tdcmeehan Can you please review this? |
|
@nmahadevuni can you apply @steveburnett's suggestion for the release note, or comment on what you think about it? |
@steveburnett Its really a fix, not an improvement. I would like to change it to "Fix filtering by info columns $file_size and $file_modified_time correctly, which were ignored before. :pr: |
Thank you for the explanation, that makes perfect sense to me! Sounds great. |
756b786 to
25b9e35
Compare
|
Thanks @aditi-pandit @steveburnett @tdcmeehan for the review. |
| // Inserting two rows with 2 seconds delay to have a different modified timestamp for each file. | ||
| queryRunner.execute("INSERT INTO test_hidden_columns SELECT * FROM region where regionkey = 0"); | ||
| try { | ||
| TimeUnit.SECONDS.sleep(2); |
There was a problem hiding this comment.
@nmahadevuni I am seeing tests failing with errors. Wondering if any flakiness involved.
[INFO] -------------------------------------------------------
[INFO] T E S T S
[INFO] -------------------------------------------------------
[INFO] Running com.facebook.presto.facebook.nativeworker.oss.TestPrestoNativeGeneralQueriesThrift
[ERROR] Tests run: 61, Failures: 2, Errors: 0, Skipped: 0, Time elapsed: 191.177 s <<< FAILURE! - in com.facebook.presto.facebook.nativeworker.oss.TestPrestoNativeGeneralQueriesThrift
[ERROR] com.facebook.presto.facebook.nativeworker.oss.TestPrestoNativeGeneralQueriesThrift.testFileModifiedTimeHiddenColumn Time elapsed: 0.257 s <<< FAILURE!
java.lang.AssertionError: expected [1] but found [2]
at org.testng.Assert.fail(Assert.java:110)
at org.testng.Assert.failNotEquals(Assert.java:1413)
at org.testng.Assert.assertEqualsImpl(Assert.java:149)
at org.testng.Assert.assertEquals(Assert.java:131)
at org.testng.Assert.assertEquals(Assert.java:954)
at com.facebook.presto.nativeworker.AbstractTestNativeGeneralQueries.testFileModifiedTimeHiddenColumn(AbstractTestNativeGeneralQueries.java:1069)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at org.testng.internal.invokers.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:135)
at org.testng.internal.invokers.TestInvoker.invokeMethod(TestInvoker.java:673)
at org.testng.internal.invokers.TestInvoker.invokeTestMethod(TestInvoker.java:220)
at org.testng.internal.invokers.MethodRunner.runInSequence(MethodRunner.java:50)
at org.testng.internal.invokers.TestInvoker$MethodInvocationAgent.invoke(TestInvoker.java:945)
at org.testng.internal.invokers.TestInvoker.invokeTestMethods(TestInvoker.java:193)
at org.testng.internal.invokers.TestMethodWorker.invokeTestMethods(TestMethodWorker.java:146)
at org.testng.internal.invokers.TestMethodWorker.run(TestMethodWorker.java:128)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
[ERROR] com.facebook.presto.facebook.nativeworker.oss.TestPrestoNativeGeneralQueriesThrift.testFileSizeHiddenColumn Time elapsed: 0.693 s <<< FAILURE!
java.lang.AssertionError: expected [1] but found [2]
at org.testng.Assert.fail(Assert.java:110)
at org.testng.Assert.failNotEquals(Assert.java:1413)
at org.testng.Assert.assertEqualsImpl(Assert.java:149)
at org.testng.Assert.assertEquals(Assert.java:131)
at org.testng.Assert.assertEquals(Assert.java:954)
at com.facebook.presto.nativeworker.AbstractTestNativeGeneralQueries.testFileSizeHiddenColumn(AbstractTestNativeGeneralQueries.java:1054)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at org.testng.internal.invokers.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:135)
at org.testng.internal.invokers.TestInvoker.invokeMethod(TestInvoker.java:673)
at org.testng.internal.invokers.TestInvoker.invokeTestMethod(TestInvoker.java:220)
at org.testng.internal.invokers.MethodRunner.runInSequence(MethodRunner.java:50)
at org.testng.internal.invokers.TestInvoker$MethodInvocationAgent.invoke(TestInvoker.java:945)
at org.testng.internal.invokers.TestInvoker.invokeTestMethods(TestInvoker.java:193)
at org.testng.internal.invokers.TestMethodWorker.invokeTestMethods(TestMethodWorker.java:146)
at org.testng.internal.invokers.TestMethodWorker.run(TestMethodWorker.java:128)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
[INFO]
[INFO] Results:
[INFO]
[ERROR] Failures:
[ERROR] TestPrestoNativeGeneralQueriesThrift>AbstractTestNativeGeneralQueries.testFileModifiedTimeHiddenColumn:1069 expected [1] but found [2]
[ERROR] TestPrestoNativeGeneralQueriesThrift>AbstractTestNativeGeneralQueries.testFileSizeHiddenColumn:1054 expected [1] but found [2]
[INFO]
[ERROR] Tests run: 61, Failures: 2, Errors: 0, Skipped: 0
Any thoughts what can go wrong?
CC: @xiaoxmeng
There was a problem hiding this comment.
This is not specific to any format. I have tested this locally and verified that the file sizes generated are different for these two inserts and with the delay, the file modified timestamp is also different.
There was a problem hiding this comment.
I assume this doesn't need any change at worker side? Is there any flakiness in this test? Thanks!
There was a problem hiding this comment.
There is native check failure message but not sure if it is related. Json file reader is not registered
There was a problem hiding this comment.
No, this doesn't have any changes to be made in the worker side. I have checked that the test is not flaky. I have run it multiple times even today. The native check failure is unrelated.
There was a problem hiding this comment.
The only way I see these can fail is if the source table region somehow ends with duplicate rows. And when we create test_hidden_columns tables, with each insert, we get two rows rather than the expected one row. Otherwise, I don't see how even the file_modified_timestamp can be same with even 2 seconds delay between the inserts. Can we get the files from the table location when this failure occurs? I can confirm by looking at the file stat info and the number of rows in each file. Or output of ls -lrth from the table location.
There was a problem hiding this comment.
After I run the test in my Mac, the contents of the table are below.
..tpch/test_hidden_columns % ls -lrth
total 16
-rw-r--r-- 1 nmahadevuni wheel 701B Aug 15 12:10 20240815_064042_00002_79tnd_d18c9b52-ba08-4abe-9e6e-e7d4b45810ad
-rw-r--r-- 1 nmahadevuni wheel 756B Aug 15 12:10 20240815_064054_00003_79tnd_ab091114-d5c4-47e7-9187-782d73e6e0be
nmahadevuni::tpch/test_hidden_columns % stat 20240815_064042_00002_79tnd_d18c9b52-ba08-4abe-9e6e-e7d4b45810ad
16777232 84182810 -rw-r--r-- 1 nmahadevuni wheel 0 701 "Aug 15 12:10:52 2024" "Aug 15 12:10:52 2024" "Aug 15 12:10:52 2024" "Aug 15 12:10:51 2024" 4096 8 0 20240815_064042_00002_79tnd_d18c9b52-ba08-4abe-9e6e-e7d4b45810ad
nmahadevuni::tpch/test_hidden_columns % stat 20240815_064054_00003_79tnd_ab091114-d5c4-47e7-9187-782d73e6e0be
16777232 84182823 -rw-r--r-- 1 nmahadevuni wheel 0 756 "Aug 15 12:10:57 2024" "Aug 15 12:10:56 2024" "Aug 15 12:10:57 2024" "Aug 15 12:10:56 2024" 4096 8 0 20240815_064054_00003_79tnd_ab091114-d5c4-47e7-9187-782d73e6e0be```
There was a problem hiding this comment.
can confirm by looking at the file stat info and the number of rows in each file. Or output of
ls -lrthfrom the table location.
@nmahadevuni so the filtering happens at split generation at coordinator side. The worker doesn't seem handle the filer on SYNTHESIZED column. Without this change, the filter works for path but not file size, and I haven't tried the other info column. This PR fixes the filter for other info or hidden columns like file size? Maybe it is due to some test flakiness. @spershin found something flakiness in the test and is going to apply a followup fix on the test.
There was a problem hiding this comment.
@xiaoxmeng : The synthesized columns were done with #21965 and facebookincubator/velox#8800
| Table table, | ||
| Iterable<HivePartitionMetadata> partitions, | ||
| Optional<Domain> pathDomain, | ||
| Map<Integer, Domain> infoColumnConstraints, |
There was a problem hiding this comment.
infoColumn -> metadataColumn
Description
Fix incorrect results when filtering on info columns like $file_size, $file_modified_time. Below is the explanation of the bug encountered.
Run multiple inserts on the table to have at least two different files created for the table.
presto:tpch> select regionkey,name,"$file_size" from region2 order by "$file_size";
regionkey | name | $file_size
-----------+-------------+------------
0 | AFRICA | 701
1 | AMERICA | 881
If we have a filter "$file_size"=701 in the query, we expect it to return only the first row from above. But it returns both the rows with file_size set to 701, the incorrect result is below.
presto:tpch> select regionkey,name,"$file_size" from region2 where "$file_size"=701;
regionkey | name | $file_size
-----------+-------------+------------
0 | AFRICA | 701
1 | AMERICA | 701
Motivation and Context
To fix filtering by info columns in coordinator during split generation. Currently, we do this only for
$pathcolumn, extended this for$file_sizeand$file_modified_time.Impact
No user facing impact.
Test Plan
Added new tests and modified existing tests for
$pathand$bucketto verify the result correctness.