This repository provides Python scripts for decoding EMNIST images from the official binary format files. EMNIST contains not only digit images but also letter images.
$ sh download_binary.sh
$ unzip gzip.zip
$ cd gzip/
$ gzip -d emnist-*.gz
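If the `gzip` command is not available (for example on Windows), the last step can be approximated in Python. This is only a sketch using the standard library; the function name `gunzip_all` is made up for illustration and is not part of this repository.

```python
import gzip
import shutil
from pathlib import Path


def gunzip_all(directory, pattern="emnist-*.gz"):
    """Decompress every file in `directory` that matches `pattern`.

    Like `gzip -d`, the output name is the input name without ".gz",
    and the compressed file is removed after extraction.
    """
    extracted = []
    for gz_path in sorted(Path(directory).glob(pattern)):
        out_path = gz_path.with_suffix("")  # drop the trailing ".gz"
        with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
        gz_path.unlink()
        extracted.append(out_path)
    return extracted
```

Calling `gunzip_all(".")` from inside the `gzip/` directory should have the same effect as the `gzip -d emnist-*.gz` line above.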
The EMNIST dataset consists of several datasets.
Choose the one you want from the following list.
EMNIST Balanced
: 131,600 characters. 47 balanced classes.
EMNIST ByClass
: 814,255 characters. 62 unbalanced classes.
EMNIST ByMerge
: 814,255 characters. 47 unbalanced classes.
EMNIST Digits
: 280,000 characters. 10 balanced classes.
EMNIST Letters
: 145,600 characters. 26 balanced classes.
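Each dataset comes as a set of files whose names follow one pattern (the same pattern used in this repository's decode.sh). As a rough illustration, the helper below builds those paths from a dataset name. `EMNIST_CLASSES` and `emnist_paths` are hypothetical names introduced here, and the default `root="gzip"` assumes the files were extracted into `gzip/`.

```python
# Class counts as listed above.
EMNIST_CLASSES = {
    "balanced": 47,
    "byclass": 62,
    "bymerge": 47,
    "digits": 10,
    "letters": 26,
}


def emnist_paths(name, split, root="gzip"):
    """Return the (images, labels, mapping) paths for one dataset and split."""
    if name not in EMNIST_CLASSES:
        raise ValueError(f"unknown dataset: {name!r}")
    if split not in ("train", "test"):
        raise ValueError(f"unknown split: {split!r}")
    prefix = f"{root}/emnist-{name}"
    return (
        f"{prefix}-{split}-images-idx3-ubyte",
        f"{prefix}-{split}-labels-idx1-ubyte",
        f"{prefix}-mapping.txt",
    )
```

For example, `emnist_paths("balanced", "train")` gives the three input paths for the EMNIST Balanced training set.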
Here is an example using EMNIST Balanced
(the process is exactly the same whichever dataset you choose).
First, comment out every line in decode.sh
except the ones for EMNIST Balanced,
as follows.
#!/bin/sh
# EMNIST Balanced: 131,600 characters. 47 balanced classes.
python decode.py gzip/emnist-balanced-train-images-idx3-ubyte gzip/emnist-balanced-train-labels-idx1-ubyte gzip/emnist-balanced-mapping.txt ./emnist_balanced/train
python decode.py gzip/emnist-balanced-test-images-idx3-ubyte gzip/emnist-balanced-test-labels-idx1-ubyte gzip/emnist-balanced-mapping.txt ./emnist_balanced/test
# EMNIST ByClass: 814,255 characters. 62 unbalanced classes.
#python decode.py gzip/emnist-byclass-train-images-idx3-ubyte gzip/emnist-byclass-train-labels-idx1-ubyte gzip/emnist-byclass-mapping.txt ./emnist_byclass/train
#python decode.py gzip/emnist-byclass-test-images-idx3-ubyte gzip/emnist-byclass-test-labels-idx1-ubyte gzip/emnist-byclass-mapping.txt ./emnist_byclass/test
# EMNIST ByMerge: 814,255 characters. 47 unbalanced classes.
#python decode.py gzip/emnist-bymerge-train-images-idx3-ubyte gzip/emnist-bymerge-train-labels-idx1-ubyte gzip/emnist-bymerge-mapping.txt ./emnist_bymerge/train
#python decode.py gzip/emnist-bymerge-test-images-idx3-ubyte gzip/emnist-bymerge-test-labels-idx1-ubyte gzip/emnist-bymerge-mapping.txt ./emnist_bymerge/test
# EMNIST Digits: 280,000 characters. 10 balanced classes.
#python decode.py gzip/emnist-digits-train-images-idx3-ubyte gzip/emnist-digits-train-labels-idx1-ubyte gzip/emnist-digits-mapping.txt ./emnist_digits/train
#python decode.py gzip/emnist-digits-test-images-idx3-ubyte gzip/emnist-digits-test-labels-idx1-ubyte gzip/emnist-digits-mapping.txt ./emnist_digits/test
# EMNIST Letters: 145,600 characters. 26 balanced classes.
#python decode.py gzip/emnist-letters-train-images-idx3-ubyte gzip/emnist-letters-train-labels-idx1-ubyte gzip/emnist-letters-mapping.txt ./emnist_letters/train
#python decode.py gzip/emnist-letters-test-images-idx3-ubyte gzip/emnist-letters-test-labels-idx1-ubyte gzip/emnist-letters-mapping.txt ./emnist_letters/test
Second, run decode.sh from the command line.
$ sh decode.sh
After running decode.sh, a directory named emnist_balanced is created in the current directory.
The training images are saved in emnist_balanced/train, and the test images in emnist_balanced/test.
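To check that decoding produced what you expect, you can count the files that were written. The sketch below makes no assumption about how the images are arranged beneath the output directory; it simply groups files by their parent directory. `count_files` is a hypothetical helper, not part of this repository.

```python
from collections import Counter
from pathlib import Path


def count_files(root):
    """Count regular files under `root`, grouped by parent directory.

    Keys are directory paths relative to `root`.
    """
    root = Path(root)
    counts = Counter()
    for path in root.rglob("*"):
        if path.is_file():
            counts[str(path.parent.relative_to(root))] += 1
    return counts
```

For example, `sum(count_files("emnist_balanced").values())` gives the total number of files written for EMNIST Balanced.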