KAFKA-6647 KafkaStreams.cleanUp creates .lock file in directory it tries to clean #5650
@guozhangwang I was thinking about your compatibility concerns. Could we fix it with the following approach: we encode which lock structure to use in the rebalance protocol (we can simply bump the version). If at least one instance is on the old version, we still use the old locks. After all instances are on the new version, we switch from the old lock files to the new lock files (for this, the code must hold the old lock, get the new lock, then release the old lock). Thoughts?
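To make the proposed hand-over concrete, here is a minimal sketch of "hold old lock, get new lock, release old lock" using plain `java.nio` file locks. The class `LockMigrationSketch`, its `Held` helper, and the method names are hypothetical and are not part of `StateDirectory`; the sketch only assumes that both lock schemes are backed by `FileChannel` locks.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.channels.OverlappingFileLockException;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper, for illustration only.
final class LockMigrationSketch {

    /** An open channel together with the lock held on it. */
    static final class Held implements AutoCloseable {
        final FileChannel channel;
        final FileLock lock;

        Held(final FileChannel channel, final FileLock lock) {
            this.channel = channel;
            this.lock = lock;
        }

        @Override
        public void close() throws IOException {
            lock.release();
            channel.close();
        }
    }

    /** Tries to lock the given file; returns null if the lock is already taken. */
    static Held tryLock(final Path lockFile) throws IOException {
        final FileChannel channel = FileChannel.open(
            lockFile, StandardOpenOption.CREATE, StandardOpenOption.WRITE);
        FileLock lock = null;
        try {
            lock = channel.tryLock(); // null if another process holds the lock
        } catch (final OverlappingFileLockException e) {
            // another channel in this JVM already holds the lock
        } finally {
            if (lock == null) {
                channel.close();
            }
        }
        return lock == null ? null : new Held(channel, lock);
    }

    /**
     * Switches from the old lock file to the new one without a gap:
     * the old lock is released only after the new lock has been acquired.
     * Returns null (and keeps the old lock) if the new lock is unavailable.
     */
    static Held migrate(final Held oldLock, final Path newLockFile) throws IOException {
        final Held newLock = tryLock(newLockFile);
        if (newLock == null) {
            return null;
        }
        oldLock.close();
        return newLock;
    }
}
```

Because the old lock is released last, an instance that still uses the old scheme cannot grab the task while the switch is in progress.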
try (
    final FileChannel channel = FileChannel.open(
        new File(taskDirectory, StateDirectory.LOCK_FILE_NAME).toPath(),
Why do we remove this test? It seems we should update the FileChannel here to use the new lock file name?
@@ -196,15 +176,14 @@ public void shouldCleanUpTaskStateDirectoriesThatAreNotCurrentlyLocked() throws
     directory.directoryForTask(new TaskId(2, 0));

     List<File> files = Arrays.asList(appDir.listFiles());
-    assertEquals(3, files.size());
+    assertEquals(1, files.size());
Why this? Shouldn't directory.lock(task0); and directory.lock(task1); have created a lock file each?
Will dig some more on these tests once Guozhang confirms the plan.
This was due to specifying StandardOpenOption.DELETE_ON_CLOSE in the FileChannel.open call.
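A small stand-alone sketch of the effect described here: a file opened with `StandardOpenOption.DELETE_ON_CLOSE` is gone once its channel is closed, so it no longer shows up in a later directory listing. The class name, method name, and the file name `example.lock` are hypothetical. Deletion is documented as best effort, and the point at which the directory entry disappears may differ between platforms, so the sketch only checks the state after the channel is closed.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.OpenOption;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.HashSet;
import java.util.Set;

// Hypothetical demo class, for illustration only.
final class DeleteOnCloseSketch {

    /**
     * Opens (and locks) a lock file in the given directory, closes it again,
     * and reports whether the file still exists afterwards.
     */
    static boolean lockFileSurvivesClose(final Path dir, final boolean deleteOnClose)
            throws IOException {
        final Path lockFile = dir.resolve("example.lock"); // hypothetical name
        final Set<OpenOption> options = new HashSet<>();
        options.add(StandardOpenOption.CREATE);
        options.add(StandardOpenOption.WRITE);
        if (deleteOnClose) {
            options.add(StandardOpenOption.DELETE_ON_CLOSE);
        }
        try (FileChannel channel = FileChannel.open(lockFile, options)) {
            channel.tryLock(); // the lock is released when the channel is closed
        }
        return Files.exists(lockFile);
    }
}
```

Without the option the lock file stays behind after the channel is closed; with it, the file is removed, which lowers the number of entries a test sees in the directory.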
@mjsax
I want to get your opinion on whether StandardOpenOption.DELETE_ON_CLOSE should be kept in the PR.
This would affect how the test is modified.
Sorry for the late reply. The last weeks were a little crazy and I did not find time earlier.
I cannot remember why we want to add the DELETE_ON_CLOSE option. Can you refresh my memory?
Also, I am not sure why this option reduced the file count. I understand that the task directories are no longer created; however, we moved both lock files up the hierarchy, and thus the count should not change?
Also, did you see this older comment: #5650 (comment)? For a clean upgrade path, addressing this issue is important.
Just to clarify, this decision is made on the leader side when assigning partitions, right? If yes, that sounds good to me.
Yes, the consumer group leader collects all consumer versions, and downgrades via version probing if necessary.
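As a rough sketch of that leader-side decision: the leader takes the lowest version reported by any member, and the old lock layout stays in use until that lowest version supports the new one. The class, the method names, and the constant `FIRST_VERSION_WITH_NEW_LOCKS` (with its value) are hypothetical; the actual version numbers and the place where this check would live are not specified in this thread.

```java
import java.util.Collection;
import java.util.Collections;

// Hypothetical helper, for illustration only.
final class LockSchemeSketch {

    // Assumed: the first rebalance metadata version that understands the new lock files.
    static final int FIRST_VERSION_WITH_NEW_LOCKS = 5;

    /** The version every member of the group can understand. */
    static int commonVersion(final Collection<Integer> memberVersions) {
        return Collections.min(memberVersions);
    }

    /** Old locks stay in use while at least one member is on an older version. */
    static boolean useOldLocking(final Collection<Integer> memberVersions) {
        return commonVersion(memberVersions) < FIRST_VERSION_WITH_NEW_LOCKS;
    }
}
```

For example, a group reporting versions 4, 5, and 5 would keep the old locks, while a group where every member reports 5 would switch to the new ones.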
I ran the failed tests from Java 11 locally, and they passed.
W.r.t. the consumer group leader collecting consumer versions, a few more pointers would be appreciated. Thanks.
The failed tests in the JDK 8 run were not related to the PR.
The rebalance version thing is basically based on KIP-268 https://cwiki.apache.org/confluence/display/KAFKA/KIP-268%3A+Simplify+Kafka+Streams+Rebalance+Metadata+Upgrade We can exploit this by bumping the version number to indicate a change (i.e., we need to bump the number in Does this make sense? I can provide more details if necessary. Hope it's a good starting point for you to dig into it.
Thanks for the hint.
Currently taskManager#taskCreator is not accessible to the onAssignment method.
StateDirectory is constructed and passed to StreamThread.
I think this should never happen, because we only lock a task directory after the task was assigned. Maybe we can put a guard to avoid bug with For the first question, it seems ok to me to add a method to Does this help?
I have been looking at KAFKA-6647 with the two-locking-policy approach. Some questions: Can this useOldLocking field be one per StateDirectory, or should it be set per TaskId or StreamThread? When is this locking of StateDirectory used? If a rebalance is happening and there is a new onAssignment, is closeStateManager called before it, or can the task already be locked?
Related to AssignmentInfo and SubscriptionInfo:
Closing this stale PR. The corresponding Jira ticket is marked as "resolved" already.
Specify StandardOpenOption#DELETE_ON_CLOSE when creating the FileChannel.
Move lock file up one level.