===================
``langid.c`` readme
===================
Introduction
------------
`langid.c` is an experimental pure-C implementation of the language identifier
described in [1]. It is largely based on the design of `langid.py` [2], and
relies on `langid.py` to train its models.
Planned features
----------------
See TODO
Speed
-----
Initial comparisons against Google's cld2 [3] suggest that `langid.c` is about
twice as fast in user CPU time: 3.44s versus 7.77s on the 28600 lines of
`wikifiles`, and 8.23s versus 18.14s on the 20000 lines of `rcv2files`. In the
`rcv2files` run, `langid.c` used only 22% CPU and took longer in wall-clock
time (38.6s versus 19.2s), so that run spent most of its time waiting rather
than computing::

    (langid.c) @mlui langid.c git:[master] wc -l wikifiles
    28600 wikifiles
    (langid.c) @mlui langid.c git:[master] time cat wikifiles | ./compact_lang_det_batch > xxx
    cat wikifiles 0.00s user 0.00s system 0% cpu 7.989 total
    ./compact_lang_det_batch > xxx 7.77s user 0.60s system 98% cpu 8.479 total
    (langid.c) @mlui langid.c git:[master] time cat wikifiles | ./langidOs -b > xxx
    cat wikifiles 0.00s user 0.00s system 0% cpu 3.577 total
    ./langidOs -b > xxx 3.44s user 0.24s system 97% cpu 3.759 total
    (langid.c) @mlui langid.c git:[master] wc -l rcv2files
    20000 rcv2files
    (langid.c) @mlui langid.c git:[master] time cat rcv2files | ./langidO2 -b > xxx
    cat rcv2files 0.00s user 0.00s system 0% cpu 31.702 total
    ./langidO2 -b > xxx 8.23s user 0.54s system 22% cpu 38.644 total
    (langid.c) @mlui langid.c git:[master] time cat rcv2files | ./compact_lang_det_batch > xxx
    cat rcv2files 0.00s user 0.00s system 0% cpu 18.343 total
    ./compact_lang_det_batch > xxx 18.14s user 0.53s system 97% cpu 19.155 total
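The "about twice as fast" figure can be checked from the user times in the
timing output above. The following is a minimal sketch of that arithmetic;
the function names `speedup` and `report_speedup` are made up for this
example and are not part of `langid.c`.

```c
#include <assert.h>
#include <stdio.h>

/* How many times faster the candidate is than the baseline, given the
   time each one took on the same input. Values above 1.0 favour the
   candidate. */
double speedup(double baseline_seconds, double candidate_seconds)
{
    assert(candidate_seconds > 0.0);
    return baseline_seconds / candidate_seconds;
}

/* Print one comparison line for a single input list. */
void report_speedup(const char *input_name, double cld2_user_seconds,
                    double langid_user_seconds)
{
    printf("%s: langid.c is %.2fx as fast as cld2 (user time)\n",
           input_name, speedup(cld2_user_seconds, langid_user_seconds));
}
```

Calling `report_speedup("wikifiles", 7.77, 3.44)` and
`report_speedup("rcv2files", 18.14, 8.23)` gives ratios of roughly 2.26 and
2.20 respectively.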
Model Training
--------------
Google's protocol buffers [4] are used to transfer models between Python and
C. The Python program `ldpy2ldc.py` converts a model produced by `langid.py`
[2] into the protocol-buffer format, and can also emit the C source format
used to compile a built-in model directly into the executable.
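To illustrate what compiling a model into the executable can look like, here
is a minimal sketch in C. It assumes a naive-Bayes-style model (a log prior
per language plus a log weight per feature and language). Every name in it
(`toy_model`, `builtin_model`, `toy_classify`), the two languages, the
features and all numbers are hypothetical; the source actually generated by
`ldpy2ldc.py` may be laid out quite differently.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_NUM_LANGS 2
#define TOY_NUM_FEATS 3

/* Hypothetical model layout. Everything is constant data, so the compiler
   places it in the binary and no model file is read at run time. */
struct toy_model {
    const char *langs[TOY_NUM_LANGS];            /* language codes */
    const char *feats[TOY_NUM_FEATS];            /* byte-sequence features */
    double prior[TOY_NUM_LANGS];                 /* log prior per language */
    double weight[TOY_NUM_FEATS][TOY_NUM_LANGS]; /* log weight per feature */
};

/* Made-up numbers, for illustration only. */
static const struct toy_model builtin_model = {
    { "en", "de" },
    { "th", "sch", "ing" },
    { -0.69, -0.69 },
    {
        { -1.0, -3.0 }, /* "th"  */
        { -4.0, -1.0 }, /* "sch" */
        { -1.5, -3.5 }, /* "ing" */
    },
};

/* Count (possibly overlapping) occurrences of a non-empty needle in text. */
static size_t count_occurrences(const char *text, const char *needle)
{
    size_t n = 0;
    const char *p = text;

    assert(needle[0] != '\0');
    while ((p = strstr(p, needle)) != NULL) {
        n++;
        p++;
    }
    return n;
}

/* Return the code of the highest-scoring language for text. */
const char *toy_classify(const char *text)
{
    size_t best = 0;
    double best_score = 0.0;

    for (size_t l = 0; l < TOY_NUM_LANGS; l++) {
        double score = builtin_model.prior[l];
        for (size_t f = 0; f < TOY_NUM_FEATS; f++)
            score += (double)count_occurrences(text, builtin_model.feats[f])
                     * builtin_model.weight[f][l];
        if (l == 0 || score > best_score) {
            best = l;
            best_score = score;
        }
    }
    return builtin_model.langs[best];
}
```

With these toy weights, `toy_classify("the thing")` returns `"en"` and
`toy_classify("schule schnell")` returns `"de"`. The point of the design is
that the model travels inside the binary, so a program built this way needs
no separate model file at run time.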
Dependencies
------------
Protocol buffers [4]
protobuf-c [5]
Contact
-------
Marco Lui <[email protected]>
References
----------
[1] http://aclweb.org/anthology-new/I/I11/I11-1062.pdf
[2] https://github.com/saffsd/langid.py
[3] https://code.google.com/p/cld2/
[4] https://github.com/google/protobuf/
[5] https://github.com/protobuf-c/protobuf-c