
Speech Translation on Augmented LibriSpeech

Augmented LibriSpeech is a small EN->FR speech translation corpus derived from the LibriSpeech corpus. The English utterances were automatically aligned with French e-books, yielding 236 hours of English speech paired with French translations at the utterance level. The corpus has been widely used in previous studies. Following common practice, we use the clean 100-hour portion plus the augmented machine translations from Google Translate as the training data.

The final performance of speech translation on Augmented LibriSpeech is listed below.

See RESULTS for a comparison with counterpart systems.

  • ASR (dmodel=256, WER)

    Model            Dev  Test
    Transformer ASR  8.8  8.8

  • MT and ST (dmodel=256, case-sensitive, tokenized BLEU / detokenized BLEU)

    Model                                           Dev          Test
    Transformer MT                                  20.8 / 19.3  19.3 / 17.6
    Cascade ST (Transformer ASR -> Transformer MT)  18.3 / 17.0  17.4 / 16.0
    Transformer ST + ASR pretrain                   18.3 / 16.9  16.9 / 15.5
    Transformer ST + ASR pretrain + SpecAug         19.3 / 17.8  17.8 / 16.3
    Ensemble of the above 2 Transformer ST models   19.3 / 18.0  18.3 / 16.8

  • MT and ST (dmodel=256, case-insensitive, tokenized BLEU / detokenized BLEU)

    Model                                           Dev          Test
    Transformer MT                                  21.7 / 20.2  20.2 / 18.5
    Cascade ST (Transformer ASR -> Transformer MT)  19.2 / 17.8  18.2 / 16.8
    Transformer ST + ASR pretrain                   19.2 / 17.8  17.9 / 16.5
    Transformer ST + ASR pretrain + SpecAug         20.2 / 18.7  18.7 / 17.2
    Ensemble of the above 2 Transformer ST models   20.3 / 18.9  19.2 / 17.7

In this recipe, we will introduce how to pre-process the Augmented LibriSpeech corpus and train/evaluate a speech translation model using NeurST.


Requirements

apt

  • libsndfile1

pip

  • TensorFlow >=2.3.0
  • soundfile
  • python_speech_features
  • subword-nmt
  • pyyaml
  • sacrebleu
  • sacremoses

others

$ git clone https://github.com/moses-smt/mosesdecoder.git

Data preprocessing

Step 1: Download Data

First, we download the original zip files into the directory /path_to_data/raw/, which gives:

/path_to_data/
└── raw
    ├── train_100h.zip
    ├── dev.zip
    └── test.zip

Step 2: Extract audio features

The speech translation corpus contains raw source audio files, texts in the target language and other optional information (e.g. transcriptions of the corresponding audio files). Here we pre-compute the audio features (that is, log-mel filterbank coefficients) because the computation is time-consuming and the features are usually fixed during training and evaluation.

Though NeurST supports preprocessing audio inputs on-the-fly, we recommend packing the extracted features into TF Records to reduce the I/O and CPU overhead.

We can extract audio features with

$ ./examples/speech_to_text/augmented_librispeech/02-audio_feature_extraction.sh /path_to_data

By default, it extracts 80-channel log-mel filterbank coefficients using the lightweight Python package python_speech_features, with a window of 25ms and a step of 10ms. This produces:

/path_to_data/
├── devtest
│   ├── dev.tfrecords-00000-of-00001
│   └── test.tfrecords-00000-of-00001
├── train
│   ├── train.tfrecords-00000-of-00064
│   ├── ......
│   └── train.tfrecords-00063-of-00064
└── transcripts
    ├── dev.en.txt
    ├── dev.fr.txt
    ├── test.en.txt
    ├── test.fr.txt
    ├── train.en.txt
    └── train.fr.txt

where the directory /path_to_data/train (/path_to_data/devtest) contains the extracted audio features and the corresponding transcriptions (and translations) in TF Record format for training (and evaluation). Transcriptions and translations in plain-text format are stored in /path_to_data/transcripts.
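
For reference, the following Python snippet sketches the underlying feature computation for a single utterance, using the packages listed in the requirements. The file name utterance.wav is hypothetical; the script above handles the actual corpus layout and TF Record packing.

import soundfile as sf
from python_speech_features import logfbank

# Read a mono waveform (hypothetical file name).
signal, sample_rate = sf.read("utterance.wav")

# 80-channel log-mel filterbank coefficients,
# 25ms window and 10ms step, matching the script above.
features = logfbank(signal, samplerate=sample_rate,
                    winlen=0.025, winstep=0.01, nfilt=80)
print(features.shape)  # (num_frames, 80)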

Furthermore, to examine the elements in the TF Record files, we can simply run the command line tool view_tfrecord:

$ python3 -m neurst.cli.view_tfrecord /path_to_data/train/

features {
  feature {
    key: "audio"
    value {
      float_list {
        value: -0.3024393916130066
        value: -0.4108518660068512
        ......
      }
    }
  }
  feature {
    key: "transcript"
    value {
      bytes_list {
        value: "valentine"
      }
    }
  }
  feature {
    key: "translation"
    value {
      bytes_list {
        value: "Valentin?"
      }
    }
  }
}

elements: {
    "transcript": bytes (str)
    "translation": bytes (str)
    "audio": float32
}
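
Based on the fields shown above, the records can also be inspected with a few lines of TensorFlow. This is only a minimal sketch for manual inspection, not NeurST's own data loader; the audio feature is assumed to be stored as a flattened float list.

import tensorflow as tf

dataset = tf.data.TFRecordDataset(
    ["/path_to_data/train/train.tfrecords-00000-of-00064"])

feature_spec = {
    "audio": tf.io.VarLenFeature(tf.float32),            # flattened frames
    "transcript": tf.io.FixedLenFeature([], tf.string),
    "translation": tf.io.FixedLenFeature([], tf.string),
}

for record in dataset.take(1):
    example = tf.io.parse_single_example(record, feature_spec)
    audio = tf.sparse.to_dense(example["audio"])
    print(audio.shape,
          example["transcript"].numpy(),
          example["translation"].numpy())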

Step 3: Preprocess transcriptions and translations

As mentioned above, we map word tokens to IDs beforehand to speed up the training process.

By running with

$ ./examples/speech_to_text/augmented_librispeech/03-preprocess.sh /path_to_moses /path_to_data

we learn a vocabulary based on BPE with 8,000 merge operations. The learnt BPE codes and vocabulary are shared across the ASR, MT and ST tasks. Note that we lowercase the transcriptions and remove all punctuation, while the casing and punctuation of the translations are preserved and we simply apply the Moses tokenizer. As a result, we obtain

/path_to_data/
├── asr_st
│   ├── asr_prediction_args.yml
│   ├── asr_training_args.yml
│   ├── asr_validation_args.yml
│   ├── codes.bpe
│   ├── st_prediction_args.yml
│   ├── st_training_args.yml
│   ├── st_validation_args.yml
│   ├── train
│   │   ├── train.tfrecords-00000-of-00064
│   │   ├── ......
│   │   └── train.tfrecords-00063-of-00064
│   ├── vocab.en
│   └── vocab.fr
└── mt
    ├── codes.bpe
    ├── mt_prediction_args.yml
    ├── mt_training_args.yml
    ├── mt_validation_args.yml
    ├── train
    │   ├── train.en.bpe.txt
    │   └── train.fr.tok.bpe.txt
    ├── vocab.en
    └── vocab.fr

Here, we use txt files (not TF Record) for the MT task, while the pre-processed training samples for ASR/ST are stored in TF Record files (/path_to_data/asr_st/train/).
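
For reference, the BPE step above roughly corresponds to the following subword-nmt commands (a sketch with illustrative file names; the script additionally handles Moses tokenization, lowercasing and vocabulary generation):

$ subword-nmt learn-bpe -s 8000 < train.tok.txt > codes.bpe
$ subword-nmt apply-bpe -c codes.bpe < train.tok.txt > train.tok.bpe.txt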

In addition, configuration files (*.yml) are generated for the following training/evaluation process. In detail,

  • *_training_args.yml: defines the arguments for training, such as batch size, optimizer, paths of training data and data pre-processing pipelines.
  • *_validation_args.yml: defines the arguments for validation during training, including the validation dataset, the interval between two validation procedures, metrics and the configuration of automatic checkpoint averaging.
  • *_prediction_args.yml: defines the arguments for inference and evaluation, including test sets, inference options (like beam size) and metrics.

Training and evaluation

Training with validation

Let's take ASR as an example:

python3 -m neurst.cli.run_exp \
    --config_paths /path_to_data/asr_st/asr_training_args.yml,/path_to_data/asr_st/asr_validation_args.yml \
    --hparams_set speech_transformer_s \
    --model_dir /path_to_data/asr_st/asr_benchmark

where /path_to_data/asr_st/asr_benchmark is the root path for checkpoints. Here we use --hparams_set speech_transformer_s to train a Transformer model with 12 encoder layers, 6 decoder layers and dmodel=256.

Alternatively, we can set --hparams_set speech_transformer_m to use the dmodel=512 version, which usually achieves better performance.

We can train the ASR model on multiple GPUs, as long as no GPU runs out of memory. Alternatively, we can set --update_cycle n --batch_size 120000//n to simulate n GPUs with 1 GPU, as shown below.
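
For example, assuming the default batch size of 120,000, simulating 4 GPUs on a single GPU looks like:

python3 -m neurst.cli.run_exp \
    --config_paths /path_to_data/asr_st/asr_training_args.yml,/path_to_data/asr_st/asr_validation_args.yml \
    --hparams_set speech_transformer_s \
    --model_dir /path_to_data/asr_st/asr_benchmark \
    --update_cycle 4 \
    --batch_size 30000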

Accelerating training with TensorFlow XLA

To accelerate training, we can enable TensorFlow XLA via the --enable_xla option and separate the validation procedure from training, that is,

python3 -m neurst.cli.run_exp \
    --config_paths /path_to_data/asr_st/asr_training_args.yml \
    --hparams_set speech_transformer_s \
    --model_dir /path_to_data/asr_st/asr_benchmark \
    --enable_xla

Then, we start another process with one GPU for validation by

python3 -m neurst.cli.run_exp \
    --entry validation \
    --config_paths /path_to_data/asr_st/asr_training_args.yml \
    --model_dir /path_to_data/asr_st/asr_benchmark

This process will constantly scan the model_dir, evaluate each checkpoint and store the checkpoints with the best metrics (e.g. WER for ASR) in the {model_dir}/best directory, along with the corresponding averaged version (by default, averaging the 10 latest checkpoints) in {model_dir}/best_avg.

Evaluation on Testset

By running with

python3 -m neurst.cli.run_exp \
    --config_paths /path_to_data/asr_st/asr_prediction_args.yml \
    --model_dir /path_to_data/asr_st/asr_benchmark/best_avg

WER will be reported on both the dev and test sets.

One can replace the ASR yaml files and model directory with the MT/ST counterparts to train and evaluate MT/ST models.
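
For MT/ST, BLEU is reported instead of WER. The detokenized BLEU numbers correspond to standard sacrebleu scoring, along the lines of (file names are illustrative):

$ sacrebleu ref.detok.fr < hyp.detok.fr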

Training ST with ASR Pretraining

In the ST literature, training an end-to-end ST model is known to be more difficult than training ASR and MT models, and transfer learning from the ASR and MT tasks is an effective remedy. To do so, we can initialize the ST encoder with the ASR encoder via two additional training options:

    --pretrain_model /path_to_data/asr_st/asr_benchmark/best_avg \
    --pretrain_variable_pattern "(TransformerEncoder)|(input_audio)"

The variables whose names match the regular expression provided by --pretrain_variable_pattern will be initialized from the pretrained model.
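
Putting this together, a complete ST training command with ASR encoder initialization might look like the following sketch (st_benchmark is an illustrative model directory; the yaml files are those generated in Step 3):

python3 -m neurst.cli.run_exp \
    --config_paths /path_to_data/asr_st/st_training_args.yml,/path_to_data/asr_st/st_validation_args.yml \
    --hparams_set speech_transformer_s \
    --model_dir /path_to_data/asr_st/st_benchmark \
    --pretrain_model /path_to_data/asr_st/asr_benchmark/best_avg \
    --pretrain_variable_pattern "(TransformerEncoder)|(input_audio)"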

On this basis, we can further initialize the ST decoder with the MT decoder via the following options:

    --pretrain_model /path_to_data/asr_st/asr_benchmark/best_avg /path_to_data/mt/mt_benchmark/best_avg \
    --pretrain_variable_pattern "(TransformerEncoder)|(input_audio)" "(TransformerDecoder)|(target_symbol)"

To inspect the names of model variables, use the inspect_checkpoint tool (see neurst/cli/README.md).

SpecAugment

To further improve the performance of ASR or ST, we can apply SpecAugment (Park et al., 2019) via the option --specaug VALUE, where VALUE can be LB, LD, SM or SS (the policies described in the original paper), or a json-like string defining the detailed arguments (see neurst/utils/audio_lib.py).
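
For example, adding the following option to the ST training command applies the LB policy:

    --specaug LB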

Cascade ST

NeurST provides the cascade_st tool for easily combining ASR and MT models, e.g.

python3 -m neurst.cli.cascade_st \
    --dataset AudioTripleTFRecordDataset \
    --dataset.params "{'data_path':'/path_to_data/devtest/test.tfrecords-00000-of-00001'}" \
    --asr_model_dir /path_to_data/asr_st/asr_benchmark/best_avg \
    --asr_search_method beam_search \
    --asr_search_method.params "{'beam_size':4,'length_penalty':-1,'maximum_decode_length':150}" \
    --mt_model_dir /path_to_data/mt/mt_benchmark/best_avg \
    --mt_search_method beam_search \
    --mt_search_method.params "{'beam_size':4,'length_penalty':-1,'maximum_decode_length':180}" 

For more details about the arguments, use -h option.