Skip to content

Apply info column filters during split generation#23411

Merged
nmahadevuni merged 1 commit intoprestodb:masterfrom
nmahadevuni:fix_filtering_by_info_columns
Aug 14, 2024
Merged

Apply info column filters during split generation#23411
nmahadevuni merged 1 commit intoprestodb:masterfrom
nmahadevuni:fix_filtering_by_info_columns

Conversation

@nmahadevuni
Copy link
Copy Markdown
Member

@nmahadevuni nmahadevuni commented Aug 9, 2024

Description

Fix incorrect results when filtering on info columns like $file_size, $file_modified_time. Below is the explanation of the bug encountered.

Run multiple inserts on the table to have at least two different files created for the table.

presto:tpch> select regionkey,name,"$file_size" from region2 order by "$file_size";

regionkey | name | $file_size
-----------+-------------+------------
0 | AFRICA | 701
1 | AMERICA | 881

If we have a filter "$file_size"=701 in the query, we expect it to return only the first row from above. But it returns both the rows with file_size set to 701, the incorrect result is below.

presto:tpch> select regionkey,name,"$file_size" from region2 where "$file_size"=701;

regionkey | name | $file_size
-----------+-------------+------------
0 | AFRICA | 701
1 | AMERICA | 701

Motivation and Context

To fix filtering by info columns in coordinator during split generation. Currently, we do this only for $path column, extended this for $file_size and $file_modified_time.

Impact

No user facing impact.

Test Plan

Added new tests and modified existing tests for $path and $bucket to verify the result correctness.

== RELEASE NOTES ==

Hive Connector Changes
*  Fix filtering by info columns $file_size and $file_modified_time, which were ignored before. :pr:`23411`

@nmahadevuni
Copy link
Copy Markdown
Member Author

@yingsu00 @aditi-pandit Can you please review this?

@steveburnett
Copy link
Copy Markdown
Contributor

Nit, suggest minor revision of release note entry for consistency with the Order of Changes in the Release Notes Guidelines, and add the PR number. Maybe something like this:

== RELEASE NOTES ==

Hive Connector Changes
*  Improve filtering by applying it to info columns $file_size and $file_modified_time correctly, which were ignored before. :pr:`23411`

Copy link
Copy Markdown
Contributor

@aditi-pandit aditi-pandit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @nmahadevuni for this fix. Had just one question.

assertQuery(format("SELECT * from test_hidden_columns WHERE \"$file_size\"=%d", fileSize));

// A bug used to return all rows even though filters on hidden column were present in the query.
// So checking the count here too to verify the number of rows returned is correct. Since the bug was present
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit : "too" should be "to"


// A bug used to return all rows even though filters on hidden column were present in the query.
// So checking the count here too to verify the number of rows returned is correct. Since the bug was present
// for both Java and Native Presto for other than $path column, the assertQuery test above used to pass.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit : Rephrase "non-$path info column" instead of "other than $path column".

long fileModificationTime)
{
if (constraints.isEmpty()) {
return true;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be false instead of true ? I'm not super familiar with this code. But when would constraints be empty ? Would be great to have a test for this situation.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Constraints would be empty when there are no filters on info columns in the query. We should return true, so the split will be scheduled to worker. Any simple SELECT query without filters or filters on non-info columns will already be testing this scenario.

@nmahadevuni nmahadevuni requested a review from tdcmeehan August 13, 2024 05:59
@nmahadevuni
Copy link
Copy Markdown
Member Author

@tdcmeehan Can you please review this?

tdcmeehan
tdcmeehan previously approved these changes Aug 14, 2024
Copy link
Copy Markdown
Contributor

@tdcmeehan tdcmeehan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@tdcmeehan
Copy link
Copy Markdown
Contributor

@nmahadevuni can you apply @steveburnett's suggestion for the release note, or comment on what you think about it?

@nmahadevuni
Copy link
Copy Markdown
Member Author

Nit, suggest minor revision of release note entry for consistency with the Order of Changes in the Release Notes Guidelines, and add the PR number. Maybe something like this:

== RELEASE NOTES ==

Hive Connector Changes
*  Improve filtering by applying it to info columns $file_size and $file_modified_time correctly, which were ignored before. :pr:`23411`

@steveburnett Its really a fix, not an improvement. I would like to change it to "Fix filtering by info columns $file_size and $file_modified_time correctly, which were ignored before. :pr:23411

@steveburnett
Copy link
Copy Markdown
Contributor

@steveburnett Its really a fix, not an improvement. I would like to change it to "Fix filtering by info columns $file_size and $file_modified_time correctly, which were ignored before. :pr:23411

Thank you for the explanation, that makes perfect sense to me! Sounds great.

@nmahadevuni nmahadevuni merged commit c8a4c55 into prestodb:master Aug 14, 2024
@nmahadevuni
Copy link
Copy Markdown
Member Author

Thanks @aditi-pandit @steveburnett @tdcmeehan for the review.

// Inserting two rows with 2 seconds delay to have a different modified timestamp for each file.
queryRunner.execute("INSERT INTO test_hidden_columns SELECT * FROM region where regionkey = 0");
try {
TimeUnit.SECONDS.sleep(2);
Copy link
Copy Markdown
Contributor

@amitkdutta amitkdutta Aug 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nmahadevuni I am seeing tests failing with errors. Wondering if any flakiness involved.

[INFO] -------------------------------------------------------
[INFO]  T E S T S
[INFO] -------------------------------------------------------
[INFO] Running com.facebook.presto.facebook.nativeworker.oss.TestPrestoNativeGeneralQueriesThrift
[ERROR] Tests run: 61, Failures: 2, Errors: 0, Skipped: 0, Time elapsed: 191.177 s <<< FAILURE! - in com.facebook.presto.facebook.nativeworker.oss.TestPrestoNativeGeneralQueriesThrift
[ERROR] com.facebook.presto.facebook.nativeworker.oss.TestPrestoNativeGeneralQueriesThrift.testFileModifiedTimeHiddenColumn  Time elapsed: 0.257 s  <<< FAILURE!
java.lang.AssertionError: expected [1] but found [2]
	at org.testng.Assert.fail(Assert.java:110)
	at org.testng.Assert.failNotEquals(Assert.java:1413)
	at org.testng.Assert.assertEqualsImpl(Assert.java:149)
	at org.testng.Assert.assertEquals(Assert.java:131)
	at org.testng.Assert.assertEquals(Assert.java:954)
	at com.facebook.presto.nativeworker.AbstractTestNativeGeneralQueries.testFileModifiedTimeHiddenColumn(AbstractTestNativeGeneralQueries.java:1069)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at org.testng.internal.invokers.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:135)
	at org.testng.internal.invokers.TestInvoker.invokeMethod(TestInvoker.java:673)
	at org.testng.internal.invokers.TestInvoker.invokeTestMethod(TestInvoker.java:220)
	at org.testng.internal.invokers.MethodRunner.runInSequence(MethodRunner.java:50)
	at org.testng.internal.invokers.TestInvoker$MethodInvocationAgent.invoke(TestInvoker.java:945)
	at org.testng.internal.invokers.TestInvoker.invokeTestMethods(TestInvoker.java:193)
	at org.testng.internal.invokers.TestMethodWorker.invokeTestMethods(TestMethodWorker.java:146)
	at org.testng.internal.invokers.TestMethodWorker.run(TestMethodWorker.java:128)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)

[ERROR] com.facebook.presto.facebook.nativeworker.oss.TestPrestoNativeGeneralQueriesThrift.testFileSizeHiddenColumn  Time elapsed: 0.693 s  <<< FAILURE!
java.lang.AssertionError: expected [1] but found [2]
	at org.testng.Assert.fail(Assert.java:110)
	at org.testng.Assert.failNotEquals(Assert.java:1413)
	at org.testng.Assert.assertEqualsImpl(Assert.java:149)
	at org.testng.Assert.assertEquals(Assert.java:131)
	at org.testng.Assert.assertEquals(Assert.java:954)
	at com.facebook.presto.nativeworker.AbstractTestNativeGeneralQueries.testFileSizeHiddenColumn(AbstractTestNativeGeneralQueries.java:1054)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at org.testng.internal.invokers.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:135)
	at org.testng.internal.invokers.TestInvoker.invokeMethod(TestInvoker.java:673)
	at org.testng.internal.invokers.TestInvoker.invokeTestMethod(TestInvoker.java:220)
	at org.testng.internal.invokers.MethodRunner.runInSequence(MethodRunner.java:50)
	at org.testng.internal.invokers.TestInvoker$MethodInvocationAgent.invoke(TestInvoker.java:945)
	at org.testng.internal.invokers.TestInvoker.invokeTestMethods(TestInvoker.java:193)
	at org.testng.internal.invokers.TestMethodWorker.invokeTestMethods(TestMethodWorker.java:146)
	at org.testng.internal.invokers.TestMethodWorker.run(TestMethodWorker.java:128)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)

[INFO] 
[INFO] Results:
[INFO] 
[ERROR] Failures: 
[ERROR]   TestPrestoNativeGeneralQueriesThrift>AbstractTestNativeGeneralQueries.testFileModifiedTimeHiddenColumn:1069 expected [1] but found [2]
[ERROR]   TestPrestoNativeGeneralQueriesThrift>AbstractTestNativeGeneralQueries.testFileSizeHiddenColumn:1054 expected [1] but found [2]
[INFO] 
[ERROR] Tests run: 61, Failures: 2, Errors: 0, Skipped: 0

Any thoughts what can go wrong?

CC: @xiaoxmeng

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not specific to any format. I have tested this locally and verified that the file sizes generated are different for these two inserts and with the delay, the file modified timestamp is also different.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume this doesn't need any change at worker side? Is there any flakiness in this test? Thanks!

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is native check failure message but not sure if it is related. Json file reader is not registered

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, this doesn't have any changes to be made in the worker side. I have checked that the test is not flaky. I have run it multiple times even today. The native check failure is unrelated.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only way I see these can fail is if the source table region somehow ends with duplicate rows. And when we create test_hidden_columns tables, with each insert, we get two rows rather than the expected one row. Otherwise, I don't see how even the file_modified_timestamp can be same with even 2 seconds delay between the inserts. Can we get the files from the table location when this failure occurs? I can confirm by looking at the file stat info and the number of rows in each file. Or output of ls -lrth from the table location.

Copy link
Copy Markdown
Member Author

@nmahadevuni nmahadevuni Aug 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After I run the test in my Mac, the contents of the table are below.

..tpch/test_hidden_columns % ls -lrth
total 16
-rw-r--r--  1 nmahadevuni  wheel   701B Aug 15 12:10 20240815_064042_00002_79tnd_d18c9b52-ba08-4abe-9e6e-e7d4b45810ad
-rw-r--r--  1 nmahadevuni  wheel   756B Aug 15 12:10 20240815_064054_00003_79tnd_ab091114-d5c4-47e7-9187-782d73e6e0be

nmahadevuni::tpch/test_hidden_columns % stat 20240815_064042_00002_79tnd_d18c9b52-ba08-4abe-9e6e-e7d4b45810ad
16777232 84182810 -rw-r--r-- 1 nmahadevuni wheel 0 701 "Aug 15 12:10:52 2024" "Aug 15 12:10:52 2024" "Aug 15 12:10:52 2024" "Aug 15 12:10:51 2024" 4096 8 0 20240815_064042_00002_79tnd_d18c9b52-ba08-4abe-9e6e-e7d4b45810ad

nmahadevuni::tpch/test_hidden_columns % stat 20240815_064054_00003_79tnd_ab091114-d5c4-47e7-9187-782d73e6e0be
16777232 84182823 -rw-r--r-- 1 nmahadevuni wheel 0 756 "Aug 15 12:10:57 2024" "Aug 15 12:10:56 2024" "Aug 15 12:10:57 2024" "Aug 15 12:10:56 2024" 4096 8 0 20240815_064054_00003_79tnd_ab091114-d5c4-47e7-9187-782d73e6e0be```

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can confirm by looking at the file stat info and the number of rows in each file. Or output of ls -lrth from the table location.

@nmahadevuni so the filtering happens at split generation at coordinator side. The worker doesn't seem handle the filer on SYNTHESIZED column. Without this change, the filter works for path but not file size, and I haven't tried the other info column. This PR fixes the filter for other info or hidden columns like file size? Maybe it is due to some test flakiness. @spershin found something flakiness in the test and is going to apply a followup fix on the test.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@xiaoxmeng : The synthesized columns were done with #21965 and facebookincubator/velox#8800

Table table,
Iterable<HivePartitionMetadata> partitions,
Optional<Domain> pathDomain,
Map<Integer, Domain> infoColumnConstraints,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

infoColumn -> metadataColumn

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants