-
Notifications
You must be signed in to change notification settings - Fork 2.1k
[Feature][Connector-V2][Milvus] Sink writer flush by interval #9961
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: dev
Are you sure you want to change the base?
[Feature][Connector-V2][Milvus] Sink writer flush by interval #9961
Conversation
| if (scheduler != null) { | ||
| log.info("create Milvus sink writer with batch interval: {}", batchInterval); | ||
| scheduler.scheduleAtFixedRate( | ||
| new BatchWriterFlushRunnable(batchWriter), | ||
| 0, | ||
| batchInterval, | ||
| TimeUnit.MILLISECONDS); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do not recommend opening a separate thread to handle this issue. You can solve this problem through prepareCommit()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@zhangshenghang Do you mean to periodically flush data through the prepareCommit() triggered by checkpoint.interval in STREAMING mode?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@zhangshenghang Do you mean to periodically flush data through the
prepareCommit()triggered by checkpoint.interval in STREAMING mode?
Yes, you can refer to other connectors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@zhangshenghang Got it. There is a question, if i want to reduce data visibility needs to be ensured, the checkpoint.interval needs to be set relatively small, but frequent checkpoints are likely to affect performance. Adding a new batch_interval parameter can avoid frequent checkpoints to ensure data visibility, but the visibility delay of the last few data entries will still be checkpoint.inteval rather than batch_interval, it can not ensure a unified visibility delay.
Need your suggestion, adding a new parameter batch_interval or using checkpoint.inteval to control interval flush?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are not willing to add this parameter because we already have checkpoint.interval, which can achieve the same effect. Are you willing to try to modify checkpoint.interval?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your suggestion, i will usecheckpoint.interval to achieve this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @loupipalien ! Overall LGTM except some minior problem.
| String collection = "streaming_simple_example"; | ||
| String vectorField = "book_intro"; | ||
| int checkpointInterval = 30000; | ||
| new Thread( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about use CompletedFuture.runAsync?
| waitCollectionReady(database, collection, vectorField); | ||
| do { | ||
| count = countCollectionEntities(database, collection); | ||
| } while (count < 9); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use Awaitility.await() .atMost(60, TimeUnit.SECONDS) .pollInterval(2, TimeUnit.SECONDS) to verify data in loop.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
88221c3 to
3978ae4
Compare
|
@Hisoka-X @zhangshenghang The ci of some modules still fails after rebase dev, it doesn't seem to be caused by the changes in the PR, can you help ❤️ |
|
Please pull the latest changes and try again. @loupipalien |
3978ae4 to
75455f3
Compare
@dybyte Thanks, it works |

Purpose of this pull request
Added a
batch_intervalparameter to start a scheduler task for flushing by intervalDoes this PR introduce any user-facing change?
Yes.
Now, when running a source2milvus task in STREAMING mode, the milvus sink will only perform a flush after accumulating enough data records to reach the
batch_size. If thebatch_sizeis not reached, the data read from the source will remain in memory and will not be flushed to milvus until new data is written to reach thebatch_size.After adding this feature, it will trigger a flush either
batch_sizeorbatch_intervalis reachedHow was this patch tested?
Added a new test case
Check list
New License Guide