@JkSelf
Collaborator

@JkSelf JkSelf commented Dec 6, 2024

semi and anti

While validating SMJ (sort-merge join) results on TPCH, we found that semi join and anti join produce incorrect results when a left join key matches multiple right rows.
For semi join: Suppose the left input is 2, and it matches two right inputs, right input1: 2 and right input2: 2. The current code outputs

2 2
2 2

that is, two records that meet the condition, which is inconsistent with the semantics of left semi join (each matching left row should be output only once). In this PR, we use JoinTracker's firstMatch variable to ensure that, within a group of matches for the same left key, only the first matching record is output and the other matching records are dropped.
For anti join: Anti join with a filter has a similar issue. The fix is analogous to the semi join fix: we use JoinTracker to retain only those rows in the same left key match group that have no match on the right side.
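
The intended semantics described above can be sketched as follows. This is a minimal illustration using nested loops rather than a merge, and it is not the Velox implementation: `Row`, `Filter`, `leftSemiJoin`, and `leftAntiJoin` are hypothetical names, and the local `firstMatch` flag only mimics the role the description assigns to JoinTracker's firstMatch.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Hypothetical row type for illustration; not a Velox type.
struct Row {
  int key;
  int value;
};

using Filter = std::function<bool(const Row&, const Row&)>;

// Left semi join: each left row is output at most once, on its first
// key match that also passes the filter.
std::vector<Row> leftSemiJoin(
    const std::vector<Row>& left,
    const std::vector<Row>& right,
    const Filter& filter) {
  assert(filter && "filter must be callable");
  std::vector<Row> out;
  for (const Row& l : left) {
    bool firstMatch = true; // reset for every left row (one match group)
    for (const Row& r : right) {
      if (l.key == r.key && filter(l, r) && firstMatch) {
        out.push_back(l);
        firstMatch = false; // later matches in this group are not output
      }
    }
  }
  return out;
}

// Left anti join with a filter: a left row is output only if none of its
// key matches passes the filter.
std::vector<Row> leftAntiJoin(
    const std::vector<Row>& left,
    const std::vector<Row>& right,
    const Filter& filter) {
  assert(filter && "filter must be callable");
  std::vector<Row> out;
  for (const Row& l : left) {
    bool anyMatch = false;
    for (const Row& r : right) {
      if (l.key == r.key && filter(l, r)) {
        anyMatch = true;
      }
    }
    if (!anyMatch) {
      out.push_back(l);
    }
  }
  return out;
}
```

With one left row of key 2 and two right rows of key 2, the semi join sketch returns a single row and the anti join sketch returns none, provided the filter passes.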

full outer join fix

Assume the left table has columns a and b:

a	b
2	100
2	1
2	1

The right table has columns c and d:

c	d
2	3
2	-1
2	-1
2	3

The two tables are joined using a full outer join on the condition a == c and b < d. During the doGetOutput phase, the rows are matched on the join key as in a left join, resulting in 3 * 4 = 12 records:

No	a	b	c	d
0	2	100	2	3
1	2	100	2	-1
2	2	100	2	-1
3	2	100	2	3
4	2	1	2	3
5	2	1	2	-1
6	2	1	2	-1
7	2	1	2	3
8	2	1	2	3
9	2	1	2	-1
10	2	1	2	-1
11	2	1	2	3

Then, in the filter method, the records are filtered based on the condition b < d, resulting in the following:

No	a	b	c	d	matched
0	2	100	2	3	FALSE
1	2	100	2	-1	FALSE
2	2	100	2	-1	FALSE
3	2	100	2	3	FALSE
4	2	1	2	3	TRUE
5	2	1	2	-1	FALSE
6	2	1	2	-1	FALSE
7	2	1	2	3	TRUE
8	2	1	2	3	TRUE
9	2	1	2	-1	FALSE
10	2	1	2	-1	FALSE
11	2	1	2	3	TRUE

Finally, records from the left table that do not have a match are filled with nulls, resulting in the following final output:

No	a	b	c	d
0	2	100	null	null
1	2	1	2	3
2	2	1	2	3
3	2	1	2	3
4	2	1	2	3

The above result is incorrect because it is missing the rows from the right table that do not have a match. Among the 12 rows above, rows 0, 4, and 8 correspond to the first record (2, 3) from the right table; rows 1, 5, and 9 to the second record (2, -1); rows 2, 6, and 10 to the third record (2, -1); and rows 3, 7, and 11 to the fourth record (2, 3). In the filter results above, rows 1, 5, and 9, as well as rows 2, 6, and 10, are all FALSE, meaning that the second and third records from the right table have no matching rows. These two records must still appear in the output, so the correct final result should be:

No	a	b	c	d
0	2	100	null	null
1	2	1	2	3
2	2	1	2	3
3	2	1	2	3
4	2	1	2	3
5	null	null	2	-1
6	null	null	2	-1

This PR calls the filter function when the keys are the same in order to identify rows from the right table that do not have matches. If a row from the right table does not have a match, a new record is inserted with the corresponding columns from the left table set to null.
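
The expected result above can be reproduced with a small sketch. It assumes a nested-loop join instead of a merge and tracks, per right row, whether any left row passed the filter. `LeftRow`, `RightRow`, `OutRow`, and `fullOuterJoin` are hypothetical names for illustration, not Velox code, and a null side is modeled with a `hasLeft` / `hasRight` flag.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical row types for illustration; not Velox types.
struct LeftRow {
  int a;
  int b;
};

struct RightRow {
  int c;
  int d;
};

// A null side is represented by hasLeft == false or hasRight == false.
struct OutRow {
  bool hasLeft;
  LeftRow left;
  bool hasRight;
  RightRow right;
};

// Full outer join on a == c AND b < d.
std::vector<OutRow> fullOuterJoin(
    const std::vector<LeftRow>& left,
    const std::vector<RightRow>& right) {
  std::vector<OutRow> out;
  // Remember which right rows matched at least one left row.
  std::vector<bool> rightMatched(right.size(), false);

  for (const LeftRow& l : left) {
    bool leftMatched = false;
    for (std::size_t i = 0; i < right.size(); ++i) {
      const RightRow& r = right[i];
      if (l.a == r.c && l.b < r.d) {
        out.push_back(OutRow{true, l, true, r});
        leftMatched = true;
        rightMatched[i] = true;
      }
    }
    if (!leftMatched) {
      // Left row without a match: right columns are null.
      out.push_back(OutRow{true, l, false, RightRow{0, 0}});
    }
  }

  for (std::size_t i = 0; i < right.size(); ++i) {
    if (!rightMatched[i]) {
      // Right row without a match: left columns are null.
      out.push_back(OutRow{false, LeftRow{0, 0}, true, right[i]});
    }
  }
  return out;
}
```

For the two tables in this description, the sketch yields seven rows: one left-only row for (2, 100), four matched rows, and two right-only rows for the (2, -1) records.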

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Dec 6, 2024
@netlify

netlify bot commented Dec 6, 2024

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit cb21c6a
🔍 Latest deploy log https://app.netlify.com/projects/meta-velox/deploys/68b9448434f9a800081777de

@JkSelf
Collaborator Author

JkSelf commented Dec 6, 2024

@pedroerp @xiaoxmeng Could you help review this PR? Thanks.

@JkSelf JkSelf force-pushed the semi-anti-fix branch 2 times, most recently from c598eef to 77d2d90 Compare December 9, 2024 07:30
@JkSelf JkSelf changed the title from "fix: Fix semi join and anti join result mismatch issue" to "fix: Fix smj result mismatch issue in semi, anit and full outer join" Dec 31, 2024
Comment on lines 1317 to 1397

if (leftMatch_ && !previousLeftMatch_) {
  joinTracker_->noMoreFilterResults(onMiss);
}

Can you add a comment explaining this?
We found a bug caused by this code. @JkSelf

Collaborator Author

@fzhedu This code aims to fix the MergeJoinTest#antiJoinWithFilterWithMultiMatchedRowsInDifferentBatches test. Although leftMatch is not null, it pertains to the next batch rather than the current one. Therefore, we still need to call noMoreFilterResults to update the status of the current batch.


@JkSelf There is a bug here: when leftMatch is set to a new match, the output contains only results from the previous leftMatch if the comparison result is 0.

Collaborator Author

@fzhedu Sorry for the delayed response. Could you provide a unit test that reproduces the issue? Thanks.

@zhouyuan
Collaborator

zhouyuan commented Jun 7, 2025

@JkSelf The unit tests for merge join are not compiling. Could you please check?

@JkSelf
Collaborator Author

JkSelf commented Jun 11, 2025

@zhouyuan I have fixed the failed unit tests.

@JkSelf JkSelf force-pushed the semi-anti-fix branch 2 times, most recently from 291396e to 6bdbe13 Compare June 29, 2025 00:19
@zhouyuan
Collaborator

@xiaoxmeng gentle ping

@zhouyuan
Collaborator

@xiaoxmeng gentle ping

@zhouyuan
Collaborator

@xiaoxmeng would you please help to take a look?

@FelixYBW

FelixYBW commented Sep 3, 2025

@pedroerp Could you find someone to review this PR? It is essential to the Gluten project and fixes a bug that a Gluten customer observed.

Labels

CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. merge-join


5 participants