Releases: komodojp/tinyld
Release 1.3.4
Release 1.3.3
Description
- Fix issue with missing bin script file : #21
- Update a few dependencies
Release 1.3.2
Description
Maintenance version
- Update deps
- Update tatoeba
- Add heavy flavor
Release 1.3.1
Description
Maintenance version with only small modifications
- update package.json : #16
- update a few dependencies (esbuild, typescript)
Release 1.3.0
Description
- Few Chores
- Update Tatoeba Dataset
- Update Node to 18.x
- Update Dependencies (typescript, esbuild, ...)
- Tuning
- Increase the amount of chunks being analyzed for long text (#14)
- Slightly rework the verbose log to make it more readable
detect('これは日本語です.', { verbose: true })
- Few Fixes
"exports": {
".": {
"require": "./dist/tinyld.normal.node.js",
"import": "./dist/tinyld.normal.node.mjs",
"browser": "./dist/tinyld.normal.browser.js"
},
"./light": {
"require": "./dist/tinyld.light.node.js",
"import": "./dist/tinyld.light.node.mjs",
"browser": "./dist/tinyld.light.browser.js"
}
},
Release 1.2.3
Description
Small maintenance version
- Update a few dependencies
- Fix an issue related to TS types (#9)
- Update documentation
Type Declaration
The npm package no longer contains the src/ folder; type definitions are now shipped directly in the dist folder.
Release 1.2.2
Description
- Fix an issue with tinyld-light, which was returning the wrong supportedLanguage list
- Update documentation (autogenerated graphs)
- Change charset setup of esbuild
- Optimize profile files to take less space (replace JSON objects with short base36 strings)
  - Reduce tinyld: 930KB -> 590KB
  - Reduce tinyld-light: 110KB -> 68KB
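The base36 idea in the list above can be illustrated with a minimal sketch. The profile layout used here (a map from n-gram to an integer weight, serialized as `gram:weight` pairs joined by commas) is an assumption made for illustration, not tinyld's actual file format. The names `Profile`, `packProfile` and `unpackProfile` are hypothetical; only `Number.prototype.toString(36)` and `parseInt(s, 36)` are standard.

```typescript
// Hypothetical profile shape: n-gram -> non-negative integer weight.
type Profile = Record<string, number>;

// Pack a profile into one compact string, writing each weight in base36.
// Assumes n-grams contain neither ':' nor ',' (an illustrative simplification).
function packProfile(profile: Profile): string {
  return Object.keys(profile)
    .map((gram) => gram + ":" + profile[gram].toString(36))
    .join(",");
}

// Reverse of packProfile: split the pairs and parse each weight from base36.
function unpackProfile(packed: string): Profile {
  const profile: Profile = {};
  if (packed === "") return profile;
  for (const entry of packed.split(",")) {
    const sep = entry.lastIndexOf(":");
    profile[entry.slice(0, sep)] = parseInt(entry.slice(sep + 1), 36);
  }
  return profile;
}
```

The saving comes from two places: base36 digits are shorter than decimal ones for large weights, and the flat string drops the quotes and braces that JSON needs around every key.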
Full Changelog: 1.2.0...1.2.2
Release 1.2.0
Description
After a lot of unsuccessful experiments, I'm glad to have found a way to improve the accuracy and release it.
I decided to focus on accuracy over quantity for the moment, making sure the algorithm works properly before trying to scale it up.
With version 1.2.0:
- Both tinyld and tinyld-light are over 97% accuracy on the 16 most common languages
- tinyld global accuracy across all 64 languages is over 95%, and each language has an accuracy > 80%
- This change causes a small disk size increase
Change
Change to the algorithm
- Remove the word ranking step
- Improve the n-gram ranking (based on a variable number of grams)
- Add a per-language coefficient to specify more accurately how many n-grams to store for each language (optimizes storage space)
- Use 4-grams and 5-grams more often (as a replacement for words)
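To make the n-gram changes above more concrete, here is a minimal sketch of counting variable-length character n-grams and keeping the most frequent ones. It is a generic illustration of the technique, not tinyld's training code: the function names, the choice of sizes, and the `coefficient` parameter (standing in for the per-language coefficient) are all assumptions.

```typescript
// Count every character n-gram of the requested sizes in a text.
function countNgrams(text: string, sizes: number[]): Map<string, number> {
  const counts = new Map<string, number>();
  // Array.from splits by code point, so non-BMP characters stay intact.
  const chars = Array.from(text.toLowerCase());
  for (const n of sizes) {
    for (let i = 0; i + n <= chars.length; i++) {
      const gram = chars.slice(i, i + n).join("");
      counts.set(gram, (counts.get(gram) ?? 0) + 1);
    }
  }
  return counts;
}

// Keep the most frequent n-grams. `coefficient` (hypothetical) scales how many
// entries are stored for a given language relative to `baseSize`.
function topNgrams(
  counts: Map<string, number>,
  baseSize: number,
  coefficient: number
): string[] {
  const limit = Math.round(baseSize * coefficient);
  return Array.from(counts.entries())
    // Sort by count descending, then alphabetically for a stable order.
    .sort((a, b) => b[1] - a[1] || (a[0] < b[0] ? -1 : 1))
    .slice(0, limit)
    .map((entry) => entry[0]);
}
```

For example, `countNgrams(sample, [1, 2, 3, 4, 5])` followed by `topNgrams(counts, 1000, 0.5)` would keep roughly half as many entries for a language whose profile can afford to be smaller.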
New API
A few new APIs to get the list of supported languages and their names
import { supportedLanguages, langName, langRegion } from 'tinyld'
// all supported languages (ISO3 format)
supportedLanguages // ['jpn', 'cmn', ...]
// and a few utilities for language metadata
langName('jpn') // Japanese
langRegion('jpn') // east-asia
Language support
- A few languages were disabled
- A few languages were added
- The total number of languages is now 64. The removed ones were dropped mostly because of poor accuracy (often caused by an insufficient training dataset). I will try to bring them back as soon as possible, once their accuracy passes the 80% threshold.
Per language Detection Accuracy
- Greek (ell) - 100%
- Hindi (hin) - 100%
- Bengali (ben) - 100%
- Thai (tha) - 100%
- Telugu (tel) - 100%
- Gujarati (guj) - 100%
- Tamil (tam) - 100%
- Amharic (amh) - 100%
- Kannada (kan) - 100%
- Burmese (mya) - 100%
- Armenian (hye) - 99.9555%
- Japanese (jpn) - 99.9333%
- Vietnamese (vie) - 99.9067%
- Korean (kor) - 99.8134%
- Khmer (khm) - 99.7354%
- Urdu (urd) - 99.2537%
- Hebrew (heb) - 99.1068%
- Berber (ber) - 99.0135%
- German (deu) - 98.9601%
- Toki Pona (toki) - 98.8801%
- Russian (rus) - 98.8268%
- Persian (pes) - 98.8135%
- Polish (pol) - 98.8002%
- Chinese (cmn) - 98.7602%
- French (fra) - 98.7068%
- Arabic (ara) - 98.4669%
- Finnish (fin) - 98.0936%
- English (eng) - 98.0136%
- Yiddish (yid) - 97.9869%
- Romanian (ron) - 97.9336%
- Mongolian (mon) - 97.8058%
- Lithuanian (lit) - 97.8003%
- Icelandic (isl) - 97.7203%
- Klingon (tlh) - 97.6803%
- Hungarian (hun) - 97.5603%
- Kazakh (kaz) - 97.4214%
- Indonesian (ind) - 97.267%
- Dutch (nld) - 96.8937%
- Tatar (tat) - 96.8271%
- Latvian (lvs) - 96.4734%
- Tagalog (tgl) - 95.8539%
- Ukrainian (ukr) - 95.4673%
- Turkish (tur) - 95.214%
- Portuguese (por) - 95.054%
- Kirundi (run) - 94.6058%
- Turkmen (tuk) - 94.5193%
- Italian (ita) - 94.4541%
- Belarusian (bel) - 94.2808%
- Esperanto (epo) - 93.9475%
- Spanish (spa) - 93.4009%
- Volapuk (vol) - 92.6978%
- Swedish (swe) - 91.9344%
- Irish (gle) - 89.6735%
- Latin (lat) - 89.0948%
- Estonian (est) - 88.6921%
- Czech (ces) - 88.5749%
- Catalan (cat) - 88.0949%
- Danish (dan) - 87.375%
- Afrikaans (afr) - 86.578%
- Bulgarian (bul) - 84.5754%
- Slovak (slk) - 83.4555%
- Serbian (srp) - 83.0823%
- Macedonian (mkd) - 82.709%
- Norwegian (nob) - 81.5358%