Giant_ja-en_parallel_corpus: 2.8M Ja/En Subtitle Corpus

This directory contains a large Japanese-English subtitle corpus. The raw data comes from Stanford's JESC (Japanese-English Subtitle Corpus) project.

Data Example

# test.ja
顔面 パンチ かい ?
お姉ちゃん 、 何で ?
もしくは 実際 の 私 の 要求 を 満たす こと も かのう でしょ う 。
分かっ た 、 リジー 。
夫 を 自分 で 、 けがす こと に なり ます 。
あの 、 それ くらい に 、 し て おい て くれ ない ?
お 掛け 下さい 。

# test.en
so face punch , huh ?
lisa , no !
or you could actually meet my need .
me ! ok , lizzy .
my husband would defile himself .
hey , can you leave it at that ?
we can sit in here .
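
The example above suggests that line i of the `.ja` file pairs with line i of the `.en` file, and that tokens are separated by single spaces. Under that assumption, a minimal sketch for loading a split could look like the following. The helper name `read_parallel` and the file locations are hypothetical; adjust the paths to wherever the extracted data lives.

```python
from pathlib import Path


def read_parallel(ja_path, en_path):
    """Read two line-aligned files and return a list of (ja_tokens, en_tokens).

    Assumes UTF-8 text, one phrase per line, tokens separated by spaces.
    """
    ja_lines = Path(ja_path).read_text(encoding="utf-8").splitlines()
    en_lines = Path(en_path).read_text(encoding="utf-8").splitlines()
    if len(ja_lines) != len(en_lines):
        raise ValueError(
            f"line count mismatch: {len(ja_lines)} ja vs {len(en_lines)} en"
        )
    # Splitting on whitespace recovers the tokens produced by preprocessing.
    return [(ja.split(), en.split()) for ja, en in zip(ja_lines, en_lines)]
```

For instance, `read_parallel("test.ja", "test.en")` would return one token-list pair per subtitle line, and it raises early if the two files have drifted out of alignment.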

Modifications

Several preprocessing steps have been applied to make the dataset easier to use.

Overall:

Delete pairs whose Japanese phrase contains only one word.
Split the data into train/dev/test sets with the following sizes:
- train: 2,795,067 phrase pairs
- dev: 2,800 phrase pairs
- test: 2,800 phrase pairs
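
The two overall steps can be sketched roughly as follows. This is an illustration rather than the code in `preprocessing.py`: the function names are hypothetical, a "word" is assumed to be a space-separated token of the already tokenized Japanese phrase, and the split is assumed to be a seeded random shuffle, since the README does not say how the dev and test pairs were chosen.

```python
import random


def drop_single_word_pairs(pairs):
    """Keep only (ja, en) pairs whose Japanese side has more than one token.

    Assumes `ja` is a string of space-separated tokens.
    """
    return [(ja, en) for ja, en in pairs if len(ja.split()) > 1]


def split_corpus(pairs, dev_size=2800, test_size=2800, seed=0):
    """Shuffle the pairs and carve off dev and test sets of fixed size.

    The random shuffle and the seed are assumptions made for this sketch.
    """
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)
    dev = pairs[:dev_size]
    test = pairs[dev_size:dev_size + test_size]
    train = pairs[dev_size + test_size:]
    return train, dev, test
```

With the default sizes, everything left after removing 2,800 dev pairs and 2,800 test pairs goes to the training set.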

For English text:

Add ‘.’ to the end of an English phrase if it does not end with punctuation.
Tokenize the text with `nltk`.
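
A minimal sketch of the English steps is shown below. Two things here are assumptions: "punctuation" is taken to mean any character in Python's `string.punctuation`, and `nltk.tokenize.word_tokenize` is used as the tokenizer, since the README names only the library and not the function. The helper names are hypothetical.

```python
import string


def ensure_final_period(phrase):
    """Append '.' unless the phrase already ends with ASCII punctuation."""
    phrase = phrase.strip()
    if phrase and phrase[-1] not in string.punctuation:
        phrase += "."
    return phrase


def preprocess_en(phrase, tokenize=None):
    """Add a final period if needed, then return space-joined tokens.

    `tokenize` is any callable mapping a string to a list of tokens. When it
    is omitted, NLTK's word_tokenize is used (an assumption); that requires
    the nltk package and its tokenizer data to be installed.
    """
    if tokenize is None:
        from nltk.tokenize import word_tokenize  # imported lazily
        tokenize = word_tokenize
    return " ".join(tokenize(ensure_final_period(phrase)))
```

Passing the tokenizer as a parameter keeps the punctuation rule testable without NLTK installed, for example `preprocess_en("ok lizzy", tokenize=str.split)`.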

For Japanese text:

Add ‘。’ to the end of a Japanese phrase if it does not end with punctuation.
Replace spaces inside the phrase with ‘、’.
Tokenize the text with the MeCab tokenizer and the mecab-ipadic-neologd dictionary.
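
The Japanese steps might be sketched like this. The set of characters treated as sentence-final punctuation, the handling of full-width spaces, the helper names, and the `dic_dir` parameter are all assumptions for illustration. The sketch uses the `mecab-python3` binding (`MeCab.Tagger` with the `-Owakati` output format); the install location of mecab-ipadic-neologd varies by system, so it is passed in rather than hard-coded.

```python
import re

# Assumed set of phrase-final punctuation; the README does not list one.
JA_END_PUNCT = "。、？！?!…．."


def normalize_ja(phrase):
    """Append '。' if the phrase lacks final punctuation, then turn inner
    runs of ASCII or full-width spaces into '、'."""
    phrase = phrase.strip(" \u3000")
    if phrase and phrase[-1] not in JA_END_PUNCT:
        phrase += "。"
    return re.sub(r"[ \u3000]+", "、", phrase)


def preprocess_ja(phrase, tagger=None, dic_dir=None):
    """Normalize the phrase and return space-separated tokens.

    `tagger` is any object with a parse(str) -> str method. When omitted, a
    MeCab tagger in wakati (space-separated) mode is created; `dic_dir` is a
    hypothetical parameter for the mecab-ipadic-neologd directory.
    """
    if tagger is None:
        import MeCab  # provided by the mecab-python3 package
        args = "-Owakati" if dic_dir is None else f"-Owakati -d {dic_dir}"
        tagger = MeCab.Tagger(args)
    return tagger.parse(normalize_ja(phrase)).strip()
```

When processing many phrases, create the tagger once and pass it in, because constructing a `MeCab.Tagger` loads the dictionary each time.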

