Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FLINK-35740][mysql] Allow column as chunk key even if not in Primary Keys #3448

Open
wants to merge 14 commits into
base: master
Choose a base branch
from

Conversation

SML0127
Copy link
Contributor

@SML0127 SML0127 commented Jul 2, 2024

Allow column as chunk key even if it is not in the primary keys.

There are cases where the primary key is not a numeric type, such as varchar or varbinary.
In this case, the distributed factor and chunk range may be calculated incorrectly, resulting in one chunk containing too many records.

But there was no conditions for restrict column type of primary keys or input chunk key, so this may cause out of memory in the flink task manager.

Actually, in our company, there was a mysql tables that PK is a varbinary column and the above situation occurred.

https://issues.apache.org/jira/browse/FLINK-35740

@SML0127 SML0127 changed the title Allow column as chunk key even if not in Primary Keys [FLINK-35740][mysql] Allow column as chunk key even if not in Primary Keys Jul 2, 2024
@SML0127
Copy link
Contributor Author

SML0127 commented Jul 2, 2024

PTAL @leonardBang @lvyanquan @PatrickRen 🙇🏻‍♂️🙇🏻‍♂️

@SML0127
Copy link
Contributor Author

SML0127 commented Jul 2, 2024

I was wondering whether to allow a column that is not a primary key as a chunk key.
But I think it would be better to relax the chunk key restrictions in the code, just whether or not input chunk key column is in the columns of table.
So It would be better to leave it up to the user to decide whether the correct chunk key has been passed.

Copy link
Contributor

@yuxiqian yuxiqian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps testAssignCompositePkTableWithWrongChunkKeyColumn test case should be updated, too?

@SML0127
Copy link
Contributor Author

SML0127 commented Jul 2, 2024

Perhaps testAssignCompositePkTableWithWrongChunkKeyColumn test case should be updated, too?

yes! I have to change those test code.

@SML0127
Copy link
Contributor Author

SML0127 commented Jul 2, 2024

Perhaps testAssignCompositePkTableWithWrongChunkKeyColumn test case should be updated, too?

yes! I have to change those test code.
@yuxiqian
I added some test code testAssignTableWithoutPrimaryKeyWithChunkKeyColumn

@yuxiqian
Copy link
Contributor

yuxiqian commented Jul 4, 2024

Generally LGTM, but my concern is arbitrarily chosen columns can be risky, since chunking by columns without index is very slow.

I think we need to update related documentations and address warnings about this. Looking forward to thoughts from @SML0127.

@SML0127
Copy link
Contributor Author

SML0127 commented Jul 4, 2024

Generally LGTM, but my concern is arbitrarily chosen columns can be risky, since chunking by columns without index is very slow.

I think we need to update related documentations and address warnings about this. Looking forward to thoughts from @SML0127.

From flink cdc v2.4.0, we have enabled to use incremental snapshots for tables without primary keys.
(https://nightlies.apache.org/flink/flink-cdc-docs-release-3.1/docs/connectors/flink-sources/mysql-cdc/)
So I think it seems reasonable to allow column as chunk key not in primary keys.
But we also have to improve related documentations and address warnings as you suggested.

I will update documentations from this PR, too.

@SML0127 SML0127 force-pushed the feature/allow_chunk_key_even_if_not_pk branch from 45106d4 to 09e65e1 Compare July 5, 2024 06:43
@github-actions github-actions bot added the base label Jul 5, 2024
@github-actions github-actions bot added the docs Improvements or additions to documentation label Jul 9, 2024
@SML0127
Copy link
Contributor Author

SML0127 commented Jul 9, 2024

@leonardBang PTAL

@SML0127
Copy link
Contributor Author

SML0127 commented Jul 12, 2024

@leonardBang PTAL

@leonardBang @yuxiqian PTAL

Copy link
Contributor

@yuxiqian yuxiqian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Just left a minor comment about unit testing.

Comment on lines +419 to +439
@Test
public void testAssignTableWithoutPrimaryKeyWithChunkKeyColumn() {
String tableWithoutPrimaryKey = "customers_no_pk";
List<String> expected =
Arrays.asList(
"customers_no_pk null [462]",
"customers_no_pk [462] [823]",
"customers_no_pk [823] [1184]",
"customers_no_pk [1184] [1545]",
"customers_no_pk [1545] [1906]",
"customers_no_pk [1906] null");
List<String> splits =
getTestAssignSnapshotSplits(
customerDatabase,
4,
CHUNK_KEY_EVEN_DISTRIBUTION_FACTOR_UPPER_BOUND.defaultValue(),
CHUNK_KEY_EVEN_DISTRIBUTION_FACTOR_LOWER_BOUND.defaultValue(),
new String[] {tableWithoutPrimaryKey},
"id");
assertEquals(expected, splits);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice if we can also test using non-primary key columns as chunk keys for table with primary keys.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
base docs Improvements or additions to documentation mysql-cdc-connector
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants