
Bumped ICU4N to 60.1.0-alpha.440 #1353

Merged
NightOwl888 merged 3 commits into apache:master from NightOwl888:fix/GH-998-bump-ICU4N-to-60.1.0-alpha.440
Jun 16, 2026

Conversation

@NightOwl888 NightOwl888 commented Jun 16, 2026

Contributor
  • You've read the Contributor Guide and Code of Conduct.
  • You've included unit or integration tests for your change, where applicable.
  • You've included inline docs for your change, where applicable.
  • There's an open issue for the PR that you are making. If you'd like to propose a change, please open an issue to discuss the change or find an existing issue.

Fixes #998

Description

This bumps ICU4N to 60.1.0-alpha.440, which contains the Normalizer2 concurrency patch from NightOwl888/ICU4N#122. That patch addresses the random failures of TestICUNormalizer2CharFilter.TestRandomStrings() in concurrent environments.

ThaiTokenizer

I attempted to document the "differences" between how ThaiTokenizer and ICUTokenizer behave with Thai text. However, after adding several tests that check how they handle transitions between Thai and non-Thai words and numerals, it turns out they behave identically in every case I checked. I then temporarily replaced the ThaiTokenizer with the implementation of ICUTokenizer and found that all of the tests pass. So, the ThaiWordBreaker is effectively a poor man's implementation of part of the UAX #29 spec (http://unicode.org/reports/tr29/), but we cannot guarantee that it fully complies for the Thai language.

I ended up adding documentation explaining that it is not guaranteed to comply with the Unicode spec and recommending ICUTokenizer instead. But I am wondering whether we should just rewire the ICUTokenizer into the ThaiTokenizer and eliminate the "JDK compatible" attempt, which is not guaranteed to be compatible with the JDK anyway.

The JDK RuleBasedBreakIterator behaves differently across vendors and versions, so there is really no stable target to hit. Some implementations don't even include a dictionary-based BreakIterator, which makes Thai tokenization impossible on them. Historically, this is one of the main motivations for creating the lucene-analysis-icu package, which guarantees stability across JDK versions.
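To see the JDK-side variability for yourself, here is a minimal Java sketch that segments mixed Thai, Latin, and numeric text with the JDK's own `java.text.BreakIterator`. It is not the Lucene.NET or ICU4N code from this PR; the class name `WordBreakSketch`, the `segment` helper, and the sample string are hypothetical and exist only for illustration. No expected output is shown for the Thai sample because the result depends on the JDK vendor and version, which is exactly the problem described above.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration class; not part of Lucene.NET or ICU4N.
final class WordBreakSketch {

    // Splits text at the boundaries reported by the JDK word BreakIterator
    // for the given locale, dropping whitespace-only segments.
    static List<String> segment(String text, Locale locale) {
        BreakIterator breaker = BreakIterator.getWordInstance(locale);
        breaker.setText(text);
        List<String> segments = new ArrayList<>();
        int start = breaker.first();
        for (int end = breaker.next(); end != BreakIterator.DONE; start = end, end = breaker.next()) {
            String piece = text.substring(start, end);
            if (!piece.trim().isEmpty()) {
                segments.add(piece);
            }
        }
        return segments;
    }

    public static void main(String[] args) {
        // Mixed Thai, Latin, and digits, to exercise script transitions.
        String sample = "ภาษาไทย Lucene 2026 ทดสอบ";
        // Print the runtime so results from different JDKs can be compared.
        System.out.println(System.getProperty("java.vendor") + " " + System.getProperty("java.version"));
        System.out.println(segment(sample, Locale.forLanguageTag("th")));
    }
}
```

Running this on two different JDK distributions and comparing the printed segments is a quick way to check whether a given runtime ships a dictionary-based Thai breaker at all.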

@NightOwl888 NightOwl888 requested a review from paulirwin June 16, 2026 12:43
@NightOwl888 NightOwl888 added the notes:bug-fix Contains a fix for a bug label Jun 16, 2026

@paulirwin paulirwin left a comment

Contributor


Approved; note the typo that is failing the pre-commit check. Once that is fixed, this looks good to merge.

Comment thread src/Lucene.Net.Analysis.Common/Analysis/Th/ThaiTokenizer.cs Outdated
@NightOwl888 NightOwl888 merged commit 7bceef4 into apache:master Jun 16, 2026
211 checks passed
paulirwin pushed a commit to paulirwin/lucene.net that referenced this pull request Jun 16, 2026
This was referenced Jun 23, 2026

Labels

notes:bug-fix Contains a fix for a bug

Development

Successfully merging this pull request may close these issues.

Failing random test seed/culture: en-IE with seed 0x9a2b7430d6d33f0d

2 participants