Fix issue with cloned RuleBasedBreakIterator failing under heavy parallel load, #95#96
Merged
Merged
Conversation
This was referenced Nov 24, 2024
NightOwl888
requested changes
Nov 25, 2024
NightOwl888
approved these changes
Nov 25, 2024
4 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fixes #95
This fixes an issue with RuleBasedBreakIterator failing under heavy parallel load. This originally was found via Lucene.NET's TestRandomStrings methods, which spawn multiple threads and generate random Unicode strings to run through analysis. See apache/lucenenet#269 for some examples.
This PR adds a test that isolates the problem, and can be easily reproduced if you revert the change to
DequeI. The problem was that cloned RuleBasedBreakIterator instances did not properly clone all of their object graph's data, and DequeI modified itself rather than modifying the clone. This created invalid state and thus the modified integer array would accidentally leak across threads to multiple instances incorrectly.This surfaced reliably when
new CjkBreakEngine(korean: false)was in the set of break engines, and would generally not be a problem when it wasn't in the list. I'm thinking this is possibly due to the large set of characters in this version of CjkBreakEngine making it more likely to "hit" and handle the random characters, rather than anything particularly problematic about CjkBreakEngine itself.