The EVBCorpus contains over 20 million words from 15 bilingual books, 100 parallel English-Vietnamese / Vietnamese-English texts, 250 parallel law and ordinance texts, 5,000 news articles, and 2,000 film subtitles. The composition, annotation, encoding, and availability of the corpus are meant to facilitate the development of language technology and studies in bilingual terminology extraction, primarily for the English-Vietnamese-English language pair.
Building the EVBCorpus involves four main steps:
- Collect data and align the bitext at the paragraph level;
- Align the bitext at the sentence level;
- Perform linguistic analysis and tagging;
- Annotate and correct the corpus with toolkits.

As a result, the EVBCorpus is aligned at the sentence level, and a part of this corpus containing 5,000 news articles was aligned at the word level by a tool and by annotators.
**Release EVBNews v.1.0 with 1,000 parallel documents, download at:**
- https://github.com/qhungngo/EVBCorpus/blob/master/EVBCorpus_EVBNews_v1.0.rar
- https://sites.google.com/a/uit.edu.vn/hungnq/evbcorpus/EVBCorpus_EVBNews_v1.0.rar?attredirects=0&d=1

**Release EVBNews v.2.0 with 1,000 word-aligned parallel documents, download at:**
- https://github.com/qhungngo/EVBCorpus/blob/master/EVBCorpus_EVBNews_v2.0.rar
- https://sites.google.com/a/uit.edu.vn/hungnq/evbcorpus/EVBCorpus_EVBNews_v2.0.rar?attredirects=0&d=1

If you are interested in the corpus, please email hungnq(at)uit.edu.vn for more details.
Source | Document | Paragraph | Sentence | Word |
---|---|---|---|---|
Books | 15 | 14,195 | 61,167 | 1,335,180 |
Fictions | 100 | 192,898 | 489,787 | 6,129,161 |
Laws | 250 | 86,848 | 98,064 | 1,981,932 |
ETests | 500 | 20,288 | 21,575 | 411,093 |
News | 5,000 | 94,933 | 173,903 | 2,965,590 |
Subtitles | 2,000 | 1,302,839 | 1,447,581 | 8,150,080 |
Total | 7,865 | 1,712,001 | 2,292,077 | 20,973,036 |
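
The figures in the table above can be sanity-checked with a few lines of Python. The sketch below is only an illustration: the numbers are copied from the table, while the variable names and the "words per sentence" measure are assumptions introduced here and are not part of the corpus distribution.

```python
# Per-source counts copied from the table above:
# (documents, paragraphs, sentences, words)
rows = {
    "Books":     (15, 14_195, 61_167, 1_335_180),
    "Fictions":  (100, 192_898, 489_787, 6_129_161),
    "Laws":      (250, 86_848, 98_064, 1_981_932),
    "ETests":    (500, 20_288, 21_575, 411_093),
    "News":      (5_000, 94_933, 173_903, 2_965_590),
    "Subtitles": (2_000, 1_302_839, 1_447_581, 8_150_080),
}
reported_total = (7_865, 1_712_001, 2_292_077, 20_973_036)

# Sum each column and compare it with the reported "Total" row.
computed_total = tuple(sum(column) for column in zip(*rows.values()))
print("totals match:", computed_total == reported_total)

# Average words per sentence for each source, as counted in the table.
words_per_sentence = {
    source: words / sentences
    for source, (_docs, _paras, sentences, words) in rows.items()
}
for source, ratio in words_per_sentence.items():
    print(f"{source:<10}{ratio:6.1f} words per sentence")
```

Such a ratio gives a rough feel for how the sources differ: subtitle lines are much shorter than sentences in books or laws.
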
Source | Document | Paragraph | Sentence | Word |
---|---|---|---|---|
Books | 15 | 13,980 | 80,323 | 1,375,492 |
Fictions | 100 | 192,723 | 491,703 | 6,307,613 |
Laws | 250 | 86,803 | 98,102 | 1,912,055 |
News | 1,000 | 24,523 | 45,531 | 740,534 |
Total | 1,365 | 318,029 | 715,659 | 10,431,592 |
English-Vietnamese Word Alignment Corpus (EVWACorpus)
The EVWACorpus contains 1,000 news articles with 45,531 sentence pairs and 740,534 English words, manually aligned at the word level between the English and Vietnamese sentences. Details of the EVWACorpus:
-- | English | Vietnamese |
---|---|---|
Files | 1,000 | 1,000 |
Sentences | 45,531 | 45,531 |
Words | 740,534 | 832,441 |
Sure Alignments | 447,906 | 447,906 |
Possible Alignments | 560,215 | 560,215 |
Words in Alignments | 654,060 | 768,031 |
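
To make the rows of this table more concrete, the sketch below shows one way word-level links could be held in memory and how a "words in alignments" count and a coverage ratio could be derived. It is a minimal illustration under stated assumptions: the `Link` class, the helper functions, and the toy sentence pair are hypothetical, and the actual file format of the EVWACorpus is not described here. Only the corpus-level numbers are taken from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    """One word-level link (hypothetical structure, not the corpus format)."""
    en: int      # index of the English token
    vi: int      # index of the Vietnamese token
    sure: bool   # True for a sure link, False for a possible link


def words_in_alignments(links):
    """Count distinct English and Vietnamese tokens touched by any link."""
    english_tokens = {link.en for link in links}
    vietnamese_tokens = {link.vi for link in links}
    return len(english_tokens), len(vietnamese_tokens)


def coverage(aligned_words, total_words):
    """Share of words that take part in at least one alignment."""
    return aligned_words / total_words


# Toy sentence pair, invented for illustration only.
english = ["He", "reads", "a", "book"]
vietnamese = ["Anh", "ấy", "đọc", "sách"]
links = [
    Link(en=0, vi=0, sure=True),   # He    - Anh
    Link(en=0, vi=1, sure=True),   # He    - ấy
    Link(en=1, vi=2, sure=True),   # reads - đọc
    Link(en=3, vi=3, sure=True),   # book  - sách
    Link(en=2, vi=3, sure=False),  # a     - sách (possible link)
]
sure_links = [link for link in links if link.sure]
print(words_in_alignments(links), words_in_alignments(sure_links))

# Corpus-level coverage computed from the table above.
english_coverage = coverage(654_060, 740_534)
vietnamese_coverage = coverage(768_031, 832_441)
print(f"English: {english_coverage:.1%}, Vietnamese: {vietnamese_coverage:.1%}")
```

Storing links as index pairs keeps one-to-many alignments simple to express, such as a single English word linked to a two-token Vietnamese expression.
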
English-Vietnamese Chunker Corpus (EVChkCorpus)
The EVChkCorpus contains 1,000 news articles with 45,531 sentence pairs. Both the English and the Vietnamese text are tagged with 5 chunk tags. Details of the EVChkCorpus:
Tag | Name | English | Vietnamese |
---|---|---|---|
NP | Noun Phrase | 212,500 | 209,824 |
VP | Verb Phrase | 90,784 | 123,600 |
PP | Prepositional Phrase | 79,853 | 70,457 |
ADVP | Adverb Phrase | 18,318 | |
ADJP | Adjective Phrase | 8,367 | 15,104 |
English-Vietnamese Named Entities Corpus (EVNECorpus)
The EVNECorpus contains 1,000 news articles with 45,531 sentence pairs. Named entities are tagged in both the English and the Vietnamese text. Details of the EVNECorpus:
Label | Name | English | Vietnamese |
---|---|---|---|
LOC | Location | 10,115 | 10,006 |
PER | Person | 6,869 | 6,741 |
ORG | Organization | 7,837 | 7,549 |
PCT | Percentage | 1,107 | 921 |
MON | Money | 898 | 823 |
TIM | Time | 4,244 | 4,100 |
- | Total | 35,879 | 34,732 |
The canonical publications for the EVBNews or EVBCorpus are:
Quoc Hung Ngo, Werner Winiwarter, and Bartholomaus Wloka, (2013). "EVBCorpus - A Multi-Layer English-Vietnamese Bilingual Corpus for Studying Tasks in Comparative Linguistics", In Proceedings of the 11th Workshop on Asian Language Resources (11th ALR within the IJCNLP2013), pp. 1-9. Asian Federation of Natural Language Processing, 2013.
Quoc-Hung Ngo, Werner Winiwarter, (2012). "Building an English-Vietnamese Bilingual Corpus for Machine Translation", International Conference on Asian Language Processing 2012 (IALP 2012), pp. 157-160. IEEE Computer Society, 2012.
The canonical publication for the EVNECorpus is:
Quoc Hung Ngo, Dinh Dien, and Werner Winiwarter, (2014). "Building English-Vietnamese Named Entity Corpus with Aligned Bilingual News Articles", The 5th Workshop on South and Southeast Asian Natural Languages Processing (5th SSANLP within the COLING2014). Association for Computational Linguistics, 2014.
The canonical publication for the Annotation Tool is:
Quoc-Hung Ngo, Werner Winiwarter (2012). "A Visualizing Annotation Tool for Semi-Automatically Building a Bilingual Corpus", In Proceedings of the 5th Workshop on Building and Using Comparable Corpora, LREC2012 Workshop, pages 67-74. Association for Computational Linguistics, 2012.
The canonical publication for the GetWebContent tool is:
Quoc-Hung Ngo, Dinh Dien, Werner Winiwarter, (2012). "Automatic Searching for English-Vietnamese Documents on the Internet", The 3rd Workshop on South and Southeast Asian Natural Languages Processing (3rd SSANLP within the COLING2012), pp. 211-220. Association for Computational Linguistics, 2012.
Used for academic purposes in:
- Trieu, Hai Long, Vu Tran, and Nguyen Le Minh. "Investigating phrase-based and neural-based machine translation on low-resource settings." Proceedings of the 31st Pacific Asia Conference on Language, Information and Computation. 2017.
- Trieu, Long Hai. "A Study On Machine Translation For Low-Resource Languages". Thesis of Doctor of Philosophy, JAIST, 2017.
- Phuoc, Nguyen Quang, Yingxiu Quan, and Cheol-Young Ock. "Building a bidirectional English-Vietnamese statistical machine translation system by using MOSES." International Journal of Computer and Electrical Engineering 8.2 (2016): 161.
- Song Cong Nguyen Duc; Q.Hung Ngo; JIAMTHAPTHAKSIN, Rachsuda. State-of-the-art Vietnamese word segmentation. In: Science in Information Technology (ICSITech), 2016 2nd International Conference on. IEEE, 2016. p. 119-124.
- Nguyen, L. H., Dinh, D., & Tran, P. (2016). An Approach to Construct a Named Entity Annotated English-Vietnamese Bilingual Corpus. ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP), 16(2), 9.
- Dawborn, Timothy James. "DOCREP: Document Representation for Natural Language Processing." Thesis of Doctor of Philosophy, The University of Sydney, 2015.
- Lam, Khang Nhut. "Automatically creating multilingual lexical resources." Proceedings of the Nineteenth AAAI/SIGAI Doctoral Consortium. 2014.
- Huy, Dang Ngoc, and Pusadee Seresangtakul. "Vietnamese-Thai Lexicon for Machine Translation." The Tenth Symposium on Natural Language Processing (SNLP2013), Phuket, Thailand. 2013.
- GIANG, Lam Tung; HUNG, Vo Trung; PHAP, Huynh Cong. Experiments with query translation and re-ranking methods in Vietnamese-English bilingual information retrieval. In: Proceedings of the Fourth Symposium on Information and Communication Technology. ACM, 2013. p. 118-122.