Merge pull request #21 from keithito/data-instructions
Add documentation on preprocessing training data.
keithito authored Aug 14, 2017
2 parents 516ff9d + 3c211e7 commit ae82a7c
Showing 3 changed files with 105 additions and 2 deletions.
3 changes: 1 addition & 2 deletions README.md
@@ -59,8 +59,7 @@ pip install -r requirements.txt
* [LJ Speech](https://keithito.com/LJ-Speech-Dataset/) (Public Domain)
* [Blizzard 2012](http://www.cstr.ed.ac.uk/projects/blizzard/2012/phase_one) (Creative Commons Attribution Share-Alike)

-You can use other datasets if you convert them to the right format. See
-[ljspeech.py](datasets/ljspeech.py) for an example.
+You can use other datasets if you convert them to the right format. See [TRAINING_DATA.md](TRAINING_DATA.md) for more info.


2. **Unpack the dataset into `~/tacotron`**
66 changes: 66 additions & 0 deletions TRAINING_DATA.md
@@ -0,0 +1,66 @@
# Training Data


This repo supports the following speech datasets:
* [LJ Speech](https://keithito.com/LJ-Speech-Dataset/) (Public Domain)
* [Blizzard 2012](http://www.cstr.ed.ac.uk/projects/blizzard/2012/phase_one) (Creative Commons Attribution Share-Alike)

You can use any other dataset if you write a preprocessor for it.


### Writing a Preprocessor

Each training example consists of:
1. The text that was spoken
2. A mel-scale spectrogram of the audio
3. A linear-scale spectrogram of the audio

The preprocessor is responsible for generating these. See [ljspeech.py](datasets/ljspeech.py) for a
heavily-commented example.

For each training example, a preprocessor should:

1. Load the audio file:
```python
wav = audio.load_wav(wav_path)
```

2. Compute linear-scale and mel-scale spectrograms (float32 numpy arrays):
```python
spectrogram = audio.spectrogram(wav).astype(np.float32)
mel_spectrogram = audio.melspectrogram(wav).astype(np.float32)
```

3. Save the spectrograms to disk:
```python
np.save(os.path.join(out_dir, spectrogram_filename), spectrogram.T, allow_pickle=False)
np.save(os.path.join(out_dir, mel_spectrogram_filename), mel_spectrogram.T, allow_pickle=False)
```
Note that the transposes of the matrices returned by `audio.spectrogram` and
`audio.melspectrogram` are saved so that the files are in time-major format.

4. Generate a tuple `(spectrogram_filename, mel_spectrogram_filename, n_frames, text)` to
write to `train.txt`. `n_frames` is the length of the time axis of the spectrogram.
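
As a rough illustration of step 4, the tuples could be written to `train.txt` and read back as sketched below. This is only a sketch: the `|` delimiter and the helper names `write_metadata` and `read_metadata` are assumptions made for the example, not something this document specifies, so check [preprocess.py](preprocess.py) for the format the repo actually uses.

```python
import os


def write_metadata(metadata, out_dir, filename='train.txt'):
  '''Writes one line per training example (hypothetical helper).

  Each item in metadata is a (spectrogram_filename, mel_spectrogram_filename, n_frames, text)
  tuple. The "|" delimiter is an assumption made for this sketch.
  '''
  path = os.path.join(out_dir, filename)
  with open(path, 'w', encoding='utf-8') as f:
    for row in metadata:
      f.write('|'.join(str(field) for field in row) + '\n')
  return path


def read_metadata(path):
  '''Parses the file written by write_metadata back into tuples.'''
  rows = []
  with open(path, encoding='utf-8') as f:
    for line in f:
      # Split at most 3 times so that a "|" inside the text field is preserved.
      spec_filename, mel_filename, n_frames, text = line.rstrip('\n').split('|', 3)
      rows.append((spec_filename, mel_filename, int(n_frames), text))
  return rows
```

Keeping `text` as the last field lets the reader limit the number of splits, so a delimiter inside the transcript does not corrupt the row. A transcript that contains a newline would still break this line-based layout.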


After you've written your preprocessor, you can add it to [preprocess.py](preprocess.py) by
following the example of the other preprocessors in that file.



### Text Processing During Training and Eval

Some additional processing is done to the text during training and eval. The text is run
through the `to_sequence` function in [textinput.py](util/textinput.py).

This performs several transformations:
1. Leading and trailing whitespace and quotation marks are removed.
2. Text is converted to ASCII by removing diacritics (e.g. "Crème brûlée" becomes "Creme brulee").
3. Numbers are spelled out as words using the heuristics in [numbers.py](util/numbers.py).
*This is specific to English*.
4. Abbreviations are expanded (e.g. "Mr" becomes "Mister"). *This is also specific to English*.
5. Characters outside the input alphabet (ASCII characters and some punctuation) are removed.
6. Whitespace is collapsed so that every sequence of whitespace becomes a single ASCII space.
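
The transformations above can be approximated with the standard library, as in the sketch below. This is not the repo's `to_sequence`: the name `normalize_text`, the one-entry abbreviation table and the exact input alphabet are assumptions made for illustration, and number expansion (step 3) is left out, so digits are kept in the alphabet here.

```python
import re
import unicodedata

# Hypothetical abbreviation table: the list above only gives "Mr" -> "Mister".
_ABBREVIATIONS = [(re.compile(r'\bMr\b\.?'), 'Mister')]

# Hypothetical input alphabet: ASCII letters, digits, whitespace and some punctuation.
_DISALLOWED = re.compile(r"[^A-Za-z0-9\s!'(),\-.:;?]")
_WHITESPACE = re.compile(r'\s+')


def normalize_text(text):
  # 1. Remove leading and trailing whitespace and quotation marks.
  text = text.strip().strip('"').strip()
  # 2. Decompose accented characters, then drop everything that is not ASCII.
  text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('ascii')
  # 3. Number expansion is omitted in this sketch.
  # 4. Expand abbreviations (English only).
  for pattern, replacement in _ABBREVIATIONS:
    text = pattern.sub(replacement, text)
  # 5. Remove characters outside the input alphabet.
  text = _DISALLOWED.sub('', text)
  # 6. Collapse every run of whitespace into a single ASCII space.
  return _WHITESPACE.sub(' ', text)
```

Step 2 here only removes diacritics that Unicode can decompose. Characters with no ASCII decomposition are dropped, not transliterated, which is one more reason to revisit these steps for non-English data.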

**Several of these steps are inappropriate for non-English text and you may want to disable or
modify them if you are not using English training data.**
38 changes: 38 additions & 0 deletions datasets/ljspeech.py
@@ -6,6 +6,20 @@


def build_from_path(in_dir, out_dir, num_workers=1, tqdm=lambda x: x):
  '''Preprocesses the LJ Speech dataset from a given input path into a given output directory.

  Args:
    in_dir: The directory where you have downloaded the LJ Speech dataset
    out_dir: The directory to write the output into
    num_workers: Optional number of worker processes to parallelize across
    tqdm: You can optionally pass tqdm to get a nice progress bar

  Returns:
    A list of tuples describing the training examples. This should be written to train.txt
  '''

  # We use ProcessPoolExecutor to parallelize across processes. This is just an optimization and you
  # can omit it and just call _process_utterance on each input if you want.
  executor = ProcessPoolExecutor(max_workers=num_workers)
  futures = []
  index = 1
@@ -20,12 +34,36 @@ def build_from_path(in_dir, out_dir, num_workers=1, tqdm=lambda x: x):


def _process_utterance(out_dir, index, wav_path, text):
  '''Preprocesses a single utterance audio/text pair.

  This writes the mel and linear scale spectrograms to disk and returns a tuple to write
  to the train.txt file.

  Args:
    out_dir: The directory to write the spectrograms into
    index: The numeric index to use in the spectrogram filenames.
    wav_path: Path to the audio file containing the speech input
    text: The text spoken in the input audio file

  Returns:
    A (spectrogram_filename, mel_filename, n_frames, text) tuple to write to train.txt
  '''

  # Load the audio to a numpy array:
  wav = audio.load_wav(wav_path)

  # Compute the linear-scale spectrogram from the wav:
  spectrogram = audio.spectrogram(wav).astype(np.float32)
  n_frames = spectrogram.shape[1]

  # Compute a mel-scale spectrogram from the wav:
  mel_spectrogram = audio.melspectrogram(wav).astype(np.float32)

  # Write the spectrograms to disk:
  spectrogram_filename = 'ljspeech-spec-%05d.npy' % index
  mel_filename = 'ljspeech-mel-%05d.npy' % index
  np.save(os.path.join(out_dir, spectrogram_filename), spectrogram.T, allow_pickle=False)
  np.save(os.path.join(out_dir, mel_filename), mel_spectrogram.T, allow_pickle=False)

  # Return a tuple describing this training example:
  return (spectrogram_filename, mel_filename, n_frames, text)
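
The comment in `build_from_path` notes that the executor is only an optimization. A minimal sketch of that choice follows, with a plain loop as the serial path and `ProcessPoolExecutor` as the parallel one. The names `run_all` and `describe` are hypothetical, and `describe` stands in for a per-utterance function such as `_process_utterance`. Unlike the code in this commit, the sketch skips the executor entirely when `num_workers` is 1 or less.

```python
from concurrent.futures import ProcessPoolExecutor
from functools import partial


def describe(index, text):
  '''Hypothetical stand-in for a per-utterance function such as _process_utterance.'''
  return ('spec-%05d.npy' % index, len(text))


def run_all(process_fn, jobs, num_workers=1):
  '''Calls process_fn(*job) for every job and returns the results in input order.'''
  if num_workers <= 1:
    # Serial path: no executor, just call the function on each input.
    return [process_fn(*job) for job in jobs]

  # Parallel path: process_fn and its arguments must be picklable, so process_fn
  # has to be defined at module level.
  with ProcessPoolExecutor(max_workers=num_workers) as executor:
    futures = [executor.submit(partial(process_fn, *job)) for job in jobs]
    # Collect in submission order so that the results line up with the jobs.
    return [future.result() for future in futures]
```

Both paths return the same list, so the serial one is a convenient way to debug a new preprocessor before turning on multiple workers.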
