Enhance Kanji Recognition in Japanese Language Detection #381

Open
wants to merge 9 commits into
base: main



@michaelbennieUFL michaelbennieUFL commented Sep 26, 2024

Related Issues

Solution

This pull request improves how the detector distinguishes Japanese from Chinese, specifically for text made up of Kanji characters that are common in Japanese. It introduces a new script called Japanese_Han (abbreviated as JHAN), which contains Kanji characters commonly used in Japan, and adjusts the detection logic to use it when only Kanji characters are present. Accuracy for single- and dual-character inputs decreases slightly in the high-accuracy model, but sentence-level accuracy remains at 100%. With this change, Japanese sentences written only in Kanji can be detected as Japanese.

Context

In previous versions, the language detector struggled to differentiate between Japanese and Chinese texts that contained only Kanji characters. This issue arose because both languages share many Han characters (Kanji in Japanese), leading to incorrect classification of Japanese Kanji-only texts as Chinese with high confidence. For example, words like "経済" (economy), "労働" (labor), and "勉強中" (studying) are uniquely Japanese but were being detected as Chinese.

Technical Approach

1. Introducing Japanese_Han Script

  • Created a new script Japanese_Han (JHAN) that contains Kanji characters commonly used in Japanese.
  • The JHAN script is added to the src/script.rs file with ranges of Kanji characters specific to Japanese usage.
  • This script excludes Han characters that are unique to Chinese, focusing on those prevalent in Japanese texts.
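
A minimal sketch of what such a range-based script table could look like. The constant name `JAPANESE_HAN_SAMPLE`, the function `is_japanese_han`, and the handful of characters are illustrative assumptions taken from examples in this thread; they are not the actual JHAN table in src/script.rs.

```rust
/// Illustrative subset only: each entry is an inclusive (start, end) range.
/// The characters come from examples in this PR discussion; whether each one
/// is in the real JHAN table is an assumption.
const JAPANESE_HAN_SAMPLE: &[(char, char)] = &[
    ('働', '働'),
    ('労', '労'),
    ('広', '広'),
    ('桜', '桜'),
    ('済', '済'),
    ('畑', '畑'),
    ('経', '経'),
];

/// Returns true if `ch` falls inside any of the sample ranges.
/// A linear scan is used so the table does not need to be sorted.
fn is_japanese_han(ch: char) -> bool {
    JAPANESE_HAN_SAMPLE
        .iter()
        .any(|&(start, end)| start <= ch && ch <= end)
}
```

With this sample table, `is_japanese_han('畑')` is true, while `is_japanese_han('沒')` is false because that character is not listed.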

2. Adjusting Detection Logic

  • Updated the JAPANESE_CHARACTER_SET in src/constant.rs to use the new Japanese_Han script:

    pub(crate) static JAPANESE_CHARACTER_SET: Lazy<CharSet> =
        Lazy::new(|| CharSet::from_char_classes(&["Hiragana", "Katakana", "Japanese_Han"]));
  • Modified the detect_language_with_rules method in src/detector.rs:

    • When a character is not matched by any of the one-language alphabets, it checks against JAPANESE_CHARACTER_SET and Alphabet::Han.
    • If the character matches JAPANESE_CHARACTER_SET, it increments the count for Japanese.
    • If the character matches Alphabet::Han, it increments the count for Chinese.
    • Adjusted the logic to handle uncertainty when both Chinese and Japanese counts are present, deciding based on counts and enabling a low-accuracy mode if necessary.
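
The counting step described above might be sketched as follows. Everything here is a simplified stand-in: `Lang`, `in_japanese_set`, `is_han`, and `count_cjk` are hypothetical names, the Han check covers only the CJK Unified Ideographs block, and the sketch treats the two checks as mutually exclusive, which is an assumption (the description does not say whether a character matching both sets increments both counters).

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum Lang {
    Chinese,
    Japanese,
}

/// Hypothetical stand-in for JAPANESE_CHARACTER_SET: the Hiragana and
/// Katakana blocks plus a tiny sample of Kanji from this thread.
fn in_japanese_set(ch: char) -> bool {
    matches!(ch, '\u{3040}'..='\u{309F}' | '\u{30A0}'..='\u{30FF}')
        || ['労', '働', '経', '済', '桜', '畑'].contains(&ch)
}

/// Rough stand-in for Alphabet::Han: the CJK Unified Ideographs block only.
fn is_han(ch: char) -> bool {
    matches!(ch, '\u{4E00}'..='\u{9FFF}')
}

/// Counts characters per language. The Japanese set is checked first;
/// any remaining Han character is counted for Chinese (assumed ordering).
fn count_cjk(text: &str) -> HashMap<Lang, usize> {
    let mut counts = HashMap::new();
    for ch in text.chars() {
        if in_japanese_set(ch) {
            *counts.entry(Lang::Japanese).or_insert(0) += 1;
        } else if is_han(ch) {
            *counts.entry(Lang::Chinese).or_insert(0) += 1;
        }
    }
    counts
}
```

When both counters end up non-zero, as they would for mixed input, the rule engine has to resolve the ambiguity in a later step.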

3. Handling Uncertainty

  • Introduced a variable cjk_lang_uncertainty to track cases where it's challenging to distinguish between Chinese and Japanese.
  • If both languages are detected with high uncertainty, the detector compares the counts and returns the language with the higher count when is_low_accuracy_mode_enabled is true.
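
As a rough illustration of this rule, the sketch below compares the uncertainty ratio against a threshold and, in low accuracy mode only, returns the language with the higher count. The function `resolve_cjk`, the enum `CjkLanguage`, and the tie handling (returning `None`) are assumptions; the threshold is passed in as a parameter because its final value is not stated in this description.

```rust
use std::cmp::Ordering;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum CjkLanguage {
    Chinese,
    Japanese,
}

/// Returns a language only when both counts are present, the share of
/// uncertain words reaches `max_ratio`, and low accuracy mode is enabled.
/// In every other case it returns None and leaves the decision to later steps.
fn resolve_cjk(
    chinese_count: usize,
    japanese_count: usize,
    cjk_lang_uncertainty: usize,
    word_count: usize,
    max_ratio: f32,
    is_low_accuracy_mode_enabled: bool,
) -> Option<CjkLanguage> {
    if word_count == 0 || chinese_count == 0 || japanese_count == 0 {
        return None;
    }
    let ratio = cjk_lang_uncertainty as f32 / word_count as f32;
    if ratio < max_ratio || !is_low_accuracy_mode_enabled {
        return None;
    }
    match chinese_count.cmp(&japanese_count) {
        Ordering::Greater => Some(CjkLanguage::Chinese),
        Ordering::Less => Some(CjkLanguage::Japanese),
        // Tie handling is an assumption of this sketch.
        Ordering::Equal => None,
    }
}
```

For example, `resolve_cjk(3, 1, 2, 2, 0.6, true)` yields `Some(CjkLanguage::Chinese)`, while the same call with low accuracy mode disabled yields `None`.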

Limits

  • Single Character Accuracy: The accuracy for single-character inputs has decreased slightly due to overlapping Kanji characters between Japanese and Chinese.
  • Dual Character Accuracy: Similar decrease observed in dual-character inputs for the same reasons.
  • Ambiguity with Shared Kanji: Some Kanji characters are used in both languages, making it inherently challenging to distinguish based solely on character analysis.
  • Complexity of Kanji Usage: The Japanese language occasionally uses Kanji characters that are uncommon or used differently in Chinese, and vice versa.

4. Test Cases and Accuracy Reports

  • Updated the accuracy reports to reflect the changes:
    • Slight decrease in accuracy for single-character and dual-character inputs in the high-accuracy model.
    • Sentence-level accuracy remains at 100%.
    • The detector can now classify Kanji-only Japanese sentences as Japanese.
  • Tested the updated detector with various Kanji-only Japanese texts:
Test Case 1: "勉強中" 
--------------------
  Chinese: 0.50
  Japanese: 0.50

Test Case 2: "沒"  (This is meant to be Chinese)
--------------------
  Chinese: 1.00

Test Case 3: "我是你的"  (This is meant to be Chinese)
--------------------
  Chinese: 1.00

Test Case 4: "労働"
--------------------
  Japanese: 1.00

Test Case 5: "御免"
--------------------
  Japanese: 0.55
  Chinese: 0.45

Test Case 6: "漢字"  (This can be both)
--------------------
  Chinese: 0.79
  Japanese: 0.21

Test Case 7: "桜"
--------------------
  Japanese: 1.00

Test Case 8: "峠" (It doesn't catch this one)
--------------------
  Chinese: 1.00

Test Case 9: "畑"
--------------------
  Japanese: 1.00
  Chinese: 0.00

Test Case 10: "塀"
--------------------
  Japanese: 1.00
  Chinese: 0.00

Test Case 11: "経済"
--------------------
  Japanese: 1.00
  Chinese: 0.00

Test Case 12: "和製漢字"  (This can be both)
--------------------
  Chinese: 0.78
  Japanese: 0.22

Test Case 13: "雫"
--------------------
  Japanese: 0.95
  Chinese: 0.05

Test Case 14: "労働"
--------------------
  Japanese: 1.00

Test Case 15: "豆腐"  (This can be both)
--------------------
  Chinese: 0.73
  Japanese: 0.27

Test Case 16: "自動販売機"
--------------------
  Japanese: 0.88
  Chinese: 0.12

Test Case 17: "関西国際空港"
--------------------
  Chinese: 0.59
  Japanese: 0.41

Test Case 18: "関西国际空港" (This is meant to be Chinese)
--------------------
  Chinese: 1.00

Test Case 19: "大阪" (This can be both)
--------------------
  Japanese: 0.80
  Chinese: 0.20

Test Case 20: "東京"  (This can be both)
--------------------
  Japanese: 0.53
  Chinese: 0.47

Test Case 21: "今日は"
--------------------
  Japanese: 1.00

Updated the character set initialization to include "Japanese_Han" in constant.rs. Also added the corresponding JHAN constant in script.rs to support the new character set.
Updated the character set initialization to include "Japanese_Han" in constant.rs. Also added the corresponding JHAN constant in script.rs to support the new character set.
Increased CJK language uncertainty max ratio from 0.4 to 0.6. Added increment call for Chinese when both Chinese and Japanese are present. Removed premature return for Japanese and added a new decrement_counter method for future use.
Eliminated unnecessary `FromStr` import from `language.rs` and `Itertools` import from `model.rs`. These imports were not being used, thus removing them improves code cleanliness and reduces potential confusion.
The CJK_lang_uncertainty variables are renamed to cjk_lang_uncertainty for consistency in naming conventions. Additionally, adjusted cjk_lang_uncertainty_max_ratio for improved accuracy and removed the unused decrement_counter function from the code.
Revision updates show decreased accuracy figures for Chinese language detection across both high and low accuracy reports. The aggregated accuracy values were also modified accordingly to reflect the updated results.
Improved high and low accuracy statistics in Chinese accuracy reports. Corrected mappings for 'い' to 'あ', among other script character updates in `script.rs`.
Refactor code to correctly increment language counters and update logic for handling uncertainty between Chinese and Japanese languages. Adjust accuracy reports to reflect updated detection accuracy metrics.
@michaelbennieUFL
Author

It also fixes these issues in the Python release:

pemistahl/lingua-py#231

pemistahl/lingua-py#202

@pemistahl
Owner

Hi Michael, thank you very much for this PR. Finally there is someone who knows Chinese and Japanese well enough to help me distinguish them better. Awesome. :) As I'm planning to make a new release of my library in October, I will evaluate your changes soon and most likely merge them, eventually. Great work!

Owner

@pemistahl pemistahl left a comment

Please add unit tests for the test cases you have listed in your PR.

&& total_language_counts.contains_key(&Some(Language::Chinese))
&& total_language_counts.contains_key(&Some(Language::Japanese))
&& (cjk_lang_uncertainty as f32 / words.len() as f32) >= cjk_lang_uncertainty_max_ratio
&& self.is_low_accuracy_mode_enabled
Owner

Why is this rule applied in low accuracy mode only? The rule engine should operate independently of the selected accuracy mode.

Author

This is because, in low accuracy mode, lingua-rs doesn't use the n-gram model after running detect_language_with_rules (I'm not sure that's right). Regardless, adding this case means many more Chinese words are recognized as Chinese in low accuracy mode; otherwise, they would be misidentified as unknown. If you want, I can move this logic to compute_language_confidence_values_for_languages.

Author

Where do you want me to add the unit tests?

Owner

Just add a new unit test method in file detector.rs. I think it's best to use a parameterized test method. Just take a look at the other test methods and do it analogously.

src/detector.rs Outdated
@@ -896,12 +905,28 @@ impl LanguageDetector {
if total_language_counts.len() == 2
&& cfg!(feature = "chinese")
&& cfg!(feature = "japanese")
&& total_language_counts.contains_key(&Some(Language::from_str("Chinese").unwrap()))
&& total_language_counts.contains_key(&Some(Language::from_str("Japanese").unwrap()))
&& total_language_counts.contains_key(&Some(Language::Chinese))
Owner

Please replace Language::Chinese with Language::from_str("Chinese") as it was before. The same goes for Japanese. Otherwise, the code won't compile if Chinese and / or Japanese are not among the selected language dependencies.
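
To illustrate the compile-time concern, here is a minimal sketch assuming the language variants are gated behind Cargo features. The trimmed-down `Language` enum, its `FromStr` implementation, and the helper `is_chinese_counted` are hypothetical and much simpler than the library's real types.

```rust
use std::collections::HashMap;
use std::str::FromStr;

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum Language {
    English,
    // This variant only exists when the (assumed) "chinese" feature is on,
    // so naming Language::Chinese directly fails to compile without it.
    #[cfg(feature = "chinese")]
    Chinese,
}

impl FromStr for Language {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "English" => Ok(Language::English),
            #[cfg(feature = "chinese")]
            "Chinese" => Ok(Language::Chinese),
            other => Err(format!("unknown or disabled language: {other}")),
        }
    }
}

/// The string lookup always compiles. `cfg!` evaluates to a plain boolean,
/// so the `unwrap` is only reached when the feature is enabled.
fn is_chinese_counted(counts: &HashMap<Option<Language>, u32>) -> bool {
    cfg!(feature = "chinese")
        && counts.contains_key(&Some(Language::from_str("Chinese").unwrap()))
}
```

Because `&&` short-circuits, the lookup is skipped entirely in builds where the feature is disabled, which is why the string-based form is safe in both configurations.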

Replace direct enum references with `Language::from_str`. This change standardizes how language enums are handled, improving code readability and consistency, especially when additional languages are added.
@tats-u

tats-u commented Nov 11, 2024

Test Case 8: "峠" (It doesn't catch this one)
--------------------
  Chinese: 1.00

It doesn't have Chinese-derived readings. (Similar to 辻, 畑, and 榊.)
I think we can add it to JHAN (in a later PR).

@azagniotov

@pemistahl hello. The JVM-based Lingua library can also benefit from the Japanese-language enhancements introduced in the current PR. I can help out with adding these changes in Kotlin. Please let me know your thoughts.

@azagniotov

azagniotov commented Dec 13, 2024

@michaelbennieUFL and @tats-u ,
what are your thoughts on using the Jōyō kanji list https://en.wikipedia.org/wiki/J%C5%8Dy%C5%8D_kanji (常用漢字), the 2,136 regular-use Kanji officially announced by the Japanese Ministry of Education, instead of the JHAN set provided in the current PR?

The list is available as a CSV in the following repo, which you can download and parse: https://github.com/sph-mn/nihongo/blob/master/data/jouyou-kanji.csv

@tats-u

tats-u commented Dec 14, 2024

Many of them are also used in China (including Taiwan and Hong Kong).

https://github.com/orgs/meilisearch/discussions/532#discussioncomment-3705382

Examples of exceptions:

畑: created in Japan (not used in Chinese)
広: Japanese-specific simplified form
