Releases · komodojp/tinyld

A Few Chores
- Update Tatoeba Dataset
- Update Node to 18.x
- Update Dependencies (typescript, esbuild, ...)
Tuning
- Increase the number of chunks analyzed for long text #14
- Tweak the verbose log to make it more readable

import { detect } from 'tinyld'
detect('これは日本語です.', { verbose: true })

A Few Fixes
- Fix a compatibility issue between Deno and esbuild #12
- Fix an issue with ESM: the library is now exported in two flavors, Node ESM and browser ESM. This is managed in package.json #13

"exports": {
    ".": {
      "require": "./dist/tinyld.normal.node.js",
      "import": "./dist/tinyld.normal.node.mjs",
      "browser": "./dist/tinyld.normal.browser.js"
    },
    "./light": {
      "require": "./dist/tinyld.light.node.js",
      "import": "./dist/tinyld.light.node.mjs",
      "browser": "./dist/tinyld.light.browser.js"
    }
},
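To illustrate how a resolver might pick a file from this map, here is a minimal sketch of conditional-exports lookup. It is a simplified model, not Node's or any bundler's actual implementation: it assumes the first key (in object order) that matches an active condition wins, and `resolveExport` is a hypothetical helper that exists neither in tinyld nor in Node.

```typescript
// Simplified model of conditional "exports" resolution.
// Assumption: the first key, in object order, that is an active condition wins.
type ConditionMap = Record<string, string>;
type ExportsField = Record<string, ConditionMap>;

// Copied from the package.json fragment above.
const exportsField: ExportsField = {
  ".": {
    require: "./dist/tinyld.normal.node.js",
    import: "./dist/tinyld.normal.node.mjs",
    browser: "./dist/tinyld.normal.browser.js",
  },
  "./light": {
    require: "./dist/tinyld.light.node.js",
    import: "./dist/tinyld.light.node.mjs",
    browser: "./dist/tinyld.light.browser.js",
  },
};

// Hypothetical helper: returns the target file for a subpath, or undefined
// when the subpath is not exported or no condition matches.
function resolveExport(
  field: ExportsField,
  subpath: string,
  conditions: string[],
): string | undefined {
  const entry = field[subpath];
  if (!entry) return undefined;
  for (const key of Object.keys(entry)) {
    if (conditions.indexOf(key) !== -1) return entry[key];
  }
  return undefined;
}

// require('tinyld') under Node
console.log(resolveExport(exportsField, ".", ["node", "require"]));
// ./dist/tinyld.normal.node.js

// import ... from 'tinyld/light' under Node
console.log(resolveExport(exportsField, "./light", ["node", "import"]));
// ./dist/tinyld.light.node.mjs
```

In this model, `require('tinyld')` and `import 'tinyld'` land on different builds of the same library, which is what the two flavors above are meant to achieve.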


22 Jan 10:16

kefniark

1.2.3

251b89f

Release 1.2.3

Description

Small maintenance version

Update a few dependencies
Fix an issue related to TS types (#9)
Update documentation

Type Declaration

The npm package no longer contains the src/ folder; type definitions now ship directly in the dist folder.


12 Jan 15:38

kefniark

1.2.2

1d75072

Release 1.2.2

Description

Fix an issue with tinyld-light which was returning the wrong supportedLanguage list
Update documentation (autogenerated graphs)
Change charset setup of esbuild
Optimize profile files to take less space (replace JSON objects with short base36 strings)
- Reduce tinyld 930KB -> 590KB
- Reduce tinyld-light 110KB -> 68KB
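The actual profile format is not described here, so the following is only a sketch of the general idea of storing numbers as short base36 strings instead of JSON. The fixed two-character width and the `packBase36` / `unpackBase36` helpers are assumptions made for illustration, not tinyld's real encoding.

```typescript
// Hypothetical packing scheme: each integer in [0, 1295] becomes exactly
// two base36 characters, so no separators or brackets are needed.
const WIDTH = 2;
const MAX = Math.pow(36, WIDTH) - 1; // 1295

function packBase36(values: number[]): string {
  return values
    .map((v) => {
      if (Math.floor(v) !== v || v < 0 || v > MAX) {
        throw new RangeError(`value out of range: ${v}`);
      }
      let s = v.toString(36);
      while (s.length < WIDTH) s = "0" + s; // left-pad to the fixed width
      return s;
    })
    .join("");
}

function unpackBase36(packed: string): number[] {
  const out: number[] = [];
  for (let i = 0; i < packed.length; i += WIDTH) {
    out.push(parseInt(packed.slice(i, i + WIDTH), 36));
  }
  return out;
}

const weights = [0, 35, 36, 1295];
const packed = packBase36(weights);
console.log(packed); // 000z10zz
console.log(JSON.stringify(weights).length, "->", packed.length); // 14 -> 8
```

A fixed width trades a little space on small values for the ability to decode without delimiters; a real format could choose differently.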

Full Changelog: 1.2.0...1.2.2


05 Jan 15:49

kefniark

1.2.0

e206be0

Release 1.2.0

Description

After a lot of unsuccessful experiments, I'm glad to have found a way to improve the accuracy and release it.
I decided to focus on accuracy over quantity for the moment, making sure the algorithm works properly before trying to scale it up.

With this version 1.2.0:

Both tinyld and tinyld-light are over 97% accuracy on the 16 most common languages
tinyld global accuracy across all 64 languages is over 95%, and each language has an accuracy > 80%
This change causes a small disk size increase

Change

Change to the algorithm

Remove the word ranking step
Improve the n-gram ranking (based on a variable number of grams)
Add a per-language coefficient to specify more accurately how many n-grams to store per language (optimizes storage space)
Use 4-grams and 5-grams more often (as a replacement for words)
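As a rough illustration of ranking n-grams of several sizes, the sketch below counts character n-grams and keeps the most frequent ones. It is not tinyld's training code: `countNgrams` and `topNgrams` are hypothetical helpers, and ranking purely by raw frequency is an assumption.

```typescript
// Count character n-grams for each requested size (hypothetical helper).
// Note: this slices by UTF-16 code unit, which is enough for a sketch.
function countNgrams(text: string, sizes: number[]): Record<string, number> {
  // No prototype, so grams such as "constructor" cannot collide with
  // inherited object properties.
  const counts: Record<string, number> = Object.create(null);
  const normalized = text.toLowerCase();
  for (const n of sizes) {
    for (let i = 0; i + n <= normalized.length; i++) {
      const gram = normalized.slice(i, i + n);
      counts[gram] = (counts[gram] ?? 0) + 1;
    }
  }
  return counts;
}

// Keep the `limit` most frequent n-grams. Ties are broken alphabetically
// so the result is deterministic.
function topNgrams(counts: Record<string, number>, limit: number): string[] {
  return Object.keys(counts)
    .sort((a, b) => counts[b] - counts[a] || (a < b ? -1 : 1))
    .slice(0, limit);
}

const counts = countNgrams("banana", [2, 3]);
console.log(topNgrams(counts, 3)); // [ 'an', 'ana', 'na' ]
```

Under these assumptions, a per-language coefficient could simply scale the `limit` argument, so languages that need more n-grams to reach good accuracy get a larger budget.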

New API

A few new APIs to get the list of supported languages and their names

import { supportedLanguages, langName, langRegion } from 'tinyld'

// all supported languages (ISO3 format)
supportedLanguages // ['jpn', 'cmn', ...]

// and a few utilities about languages
langName('jpn') // Japanese
langRegion('jpn') // east-asia

Language support

A few languages were disabled
A few languages were added
The total number of languages is now 64. The ones removed were mostly dropped because of bad accuracy (often due to an insufficient training dataset). I will try to bring them back as soon as possible, once their accuracy passes the 80% threshold.

Per language Detection Accuracy

 - Greek (ell) - 100%
 - Hindi (hin) - 100%
 - Bengali (ben) - 100%
 - Thai (tha) - 100%
 - Telugu (tel) - 100%
 - Gujarati (guj) - 100%
 - Tamil (tam) - 100%
 - Amharic (amh) - 100%
 - Kannada (kan) - 100%
 - Burmese (mya) - 100%
 - Armenian (hye) - 99.9555%
 - Japanese (jpn) - 99.9333%
 - Vietnamese (vie) - 99.9067%
 - Korean (kor) - 99.8134%
 - Khmer (khm) - 99.7354%
 - Urdu (urd) - 99.2537%
 - Hebrew (heb) - 99.1068%
 - Berber (ber) - 99.0135%
 - German (deu) - 98.9601%
 - Toki Pona (toki) - 98.8801%
 - Russian (rus) - 98.8268%
 - Persian (pes) - 98.8135%
 - Polish (pol) - 98.8002%
 - Chinese (cmn) - 98.7602%
 - French (fra) - 98.7068%
 - Arabic (ara) - 98.4669%
 - Finnish (fin) - 98.0936%
 - English (eng) - 98.0136%
 - Yiddish (yid) - 97.9869%
 - Romanian (ron) - 97.9336%
 - Mongolian (mon) - 97.8058%
 - Lithuanian (lit) - 97.8003%
 - Icelandic (isl) - 97.7203%
 - Klingon (tlh) - 97.6803%
 - Hungarian (hun) - 97.5603%
 - Kazakh (kaz) - 97.4214%
 - Indonesian (ind) - 97.267%
 - Dutch (nld) - 96.8937%
 - Tatar (tat) - 96.8271%
 - Latvian (lvs) - 96.4734%
 - Tagalog (tgl) - 95.8539%
 - Ukrainian (ukr) - 95.4673%
 - Turkish (tur) - 95.214%
 - Portuguese (por) - 95.054%
 - Kirundi (run) - 94.6058%
 - Turkmen (tuk) - 94.5193%
 - Italian (ita) - 94.4541%
 - Belarusian (bel) - 94.2808%
 - Esperanto (epo) - 93.9475%
 - Spanish (spa) - 93.4009%
 - Volapuk (vol) - 92.6978%
 - Swedish (swe) - 91.9344%
 - Irish (gle) - 89.6735%
 - Latin (lat) - 89.0948%
 - Estonian (est) - 88.6921%
 - Czech (ces) - 88.5749%
 - Catalan (cat) - 88.0949%
 - Danish (dan) - 87.375%
 - Afrikaans (afr) - 86.578%
 - Bulgarian (bul) - 84.5754%
 - Slovak (slk) - 83.4555%
 - Serbian (srp) - 83.0823%
 - Macedonian (mkd) - 82.709%
 - Norwegian (nob) - 81.5358%
