Bumped ICU4N to 60.1.0-alpha.440 by NightOwl888 · Pull Request #1353 · apache/lucenenet

NightOwl888 · 2026-06-16T12:43:16Z

You've read the Contributor Guide and Code of Conduct.
You've included unit or integration tests for your change, where applicable.
You've included inline docs for your change, where applicable.
There's an open issue for the PR that you are making. If you'd like to propose a change, please open an issue to discuss the change or find an existing issue.

Fixes #998

Description

This bumps ICU4N to 60.1.0-alpha.440, which contains the Normalizer2 concurrency patch from NightOwl888/ICU4N#122. The patch addresses the random failures of TestICUNormalizer2CharFilter.TestRandomStrings() in concurrent environments.

`ThaiTokenizer`

I attempted to document the "differences" between how ThaiTokenizer and ICUTokenizer behave with Thai text. However, after adding several tests to check how they handle transitions between Thai and non-Thai words and numerals, it turns out they behave identically in every case I checked. I then temporarily replaced the ThaiTokenizer implementation with that of ICUTokenizer and found that all of the tests pass. So, the ThaiWordBreaker is effectively a poor man's way of implementing part of the UAX #29 (http://unicode.org/reports/tr29/) spec, but we cannot guarantee that it fully complies for the Thai language.

I ended up adding documentation explaining that ThaiTokenizer is not guaranteed to comply with the Unicode spec and recommending ICUTokenizer instead. But I am wondering whether we should just wire ICUTokenizer into ThaiTokenizer and eliminate the "JDK compatible" attempt, which is not guaranteed to be compatible with the JDK anyway.

The JDK RuleBasedBreakIterator behaves differently across vendors and versions, so there is no stable target to hit. Some implementations don't even include a dictionary-based BreakIterator implementation, which makes Thai tokenization impossible. Historically, this was one of the main motivations for creating the lucene-analysis-icu package, which guarantees stability across JDK versions.
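For anyone who wants to see the JDK variance first-hand, here is a minimal probe, written in Java because the behavior in question belongs to the JDK rather than to .NET. It is a sketch, not code from Lucene or Lucene.NET: the class name `ThaiBreakProbe`, the helper `segments`, and the sample string are hypothetical, and the only API it relies on is the standard `java.text.BreakIterator`. No expected output is shown because the segmentation of the Thai run depends on the JDK vendor and version.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical probe class; not part of Lucene or Lucene.NET.
class ThaiBreakProbe {

    /** Splits text at every boundary the JDK word BreakIterator reports for the locale. */
    static List<String> segments(String text, Locale locale) {
        BreakIterator it = BreakIterator.getWordInstance(locale);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            out.add(text.substring(start, end));
        }
        return out;
    }

    public static void main(String[] args) {
        // Thai text followed by Latin letters and digits, to exercise script transitions.
        String text = "ภาษาไทย test 123";
        System.out.println(System.getProperty("java.vendor") + " "
                + System.getProperty("java.version"));
        // Whether the Thai run is split into words depends on dictionary support in this JDK.
        System.out.println(segments(text, Locale.forLanguageTag("th")));
    }
}
```

Running this on two different JDK builds and comparing the printed segment lists is a quick way to check whether they agree on Thai word boundaries. The segments also include whitespace runs, since the iterator reports every boundary rather than only word starts.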

paulirwin

Approved; note the typo that is failing the pre-commit check. Once that is fixed, this looks good to merge.

NightOwl888 added 2 commits June 16, 2026 16:34

dependencies.props: Bumped ICU4N. to 60.1.0-alpha.440

a55feb5

Lucnene.Net.Analysis.Th.ThaiTokenizer: Removed unused usings and added documentation recommending to use the ICUTokenizer for Unicode compliance

a892fbb

NightOwl888 requested a review from paulirwin June 16, 2026 12:43

NightOwl888 added the notes:bug-fix Contains a fix for a bug label Jun 16, 2026

paulirwin approved these changes Jun 16, 2026

Comment thread src/Lucene.Net.Analysis.Common/Analysis/Th/ThaiTokenizer.cs Outdated

Fixed typo

f46fc7a

NightOwl888 merged commit 7bceef4 into apache:master Jun 16, 2026
211 checks passed

paulirwin pushed a commit to paulirwin/lucene.net that referenced this pull request Jun 16, 2026

Bumped ICU4N to 60.1.0-alpha.440 (apache#1353)

51aae34

dependabot Bot mentioned this pull request Jun 30, 2026

Bump the minor-and-patch group with 1 update liamgold/goldfinch.me#233

Open

NightOwl888 merged 3 commits into apache:master from NightOwl888:fix/GH-998-bump-ICU4N-to-60.1.0-alpha.440