Merge pull request #1 from bytedance/rc
Upgrade and release v0.1
zhaocq-nlp authored Dec 25, 2020
2 parents 634a71e + c22ce9e commit 671e88f
Showing 91 changed files with 4,913 additions and 1,417 deletions.
19 changes: 18 additions & 1 deletion CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -6,6 +6,23 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

## [Unreleased]
### Added
- Init repo

### Changed



## [0.1.0] - 2020-12-25
### Added
- Basic code structure for Encoder, Decoder, Model, DataPipeline, Tokenizer, Experiment, Metric, and Dataset.
- (Model) Adds implementation of pre-norm/post-norm Transformer, Speech Transformer, BERT, GPT-2, and Wav2Vec2.0.
- (Task) Adds implementation of sequence to sequence task and speech to text task (ASR, ST).
- (DataPipeline, Tokenizer) Adds wrappers for commonly used tokenizers: moses, bpe, jieba, character, sentencepiece, etc.
- (Dataset) Adds support for reading parallel corpus, speech corpora (libri-trans, MuST-C, and LibriSpeech), and TFRecords.
- (Experiment) Adds implementation of common training procedure with mixed precision training and various distributed strategies (`MirroredStrategy`, `Horovod`, `Byteps`).
- (Metric) Adds implementation of BLEU and WER metrics.
- (Converter) Adds implementation of converting checkpoints from Google BERT, OpenAI GPT-2, fairseq Transformer, and fairseq Wav2Vec2.0.
- Adds beam search decoding and top-k/top-p sampling.
- Adds support for averaging checkpoints, TFRecord generation, and model restoring (see [cli/README.md](/neurst/cli/README.md)).
- Step-by-step recipes for training an end-to-end speech translation model (see [examples/speech_to_text](/examples/speech_to_text)).
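The top-k/top-p sampling listed above can be illustrated by a minimal NumPy sketch of logit filtering (our own helper for illustration, not NeurST's actual implementation):

```python
import numpy as np

def top_k_top_p_filter(logits, k=0, p=1.0):
    """Mask logits outside the top-k, and/or outside the smallest set of
    tokens whose cumulative probability exceeds p (nucleus sampling)."""
    logits = np.asarray(logits, dtype=np.float64).copy()
    if k > 0:
        kth_largest = np.sort(logits)[-k]
        logits[logits < kth_largest] = -np.inf   # drop everything below the k-th best
    if p < 1.0:
        order = np.argsort(logits)[::-1]         # indices, best first
        probs = np.exp(logits[order] - np.max(logits))
        probs /= probs.sum()
        cutoff = np.searchsorted(np.cumsum(probs), p) + 1  # keep >= p probability mass
        logits[order[cutoff:]] = -np.inf
    return logits
```

Sampling then draws a token from the softmax of the filtered logits.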

2 changes: 2 additions & 0 deletions README.md
@@ -1,6 +1,7 @@
# NeurST: Neural Speech Translation Toolkit
NeurST aims to make it easy to build and train end-to-end speech translation models, with a careful design for extensibility and scalability. We believe this design makes it easier for NLP researchers to get started. In addition, NeurST allows researchers to train custom models for translation, summarization, and so on.

> NeurST is based on TensorFlow 2, and we are working on the PyTorch version.
## Features

@@ -21,6 +22,7 @@ NeurST provides several **strong and reproducible benchmarks** for various tasks

- Speech-to-Text
- [Augmented Librispeech](/examples/speech_to_text/augmented_librispeech)
- [MuST-C](/examples/speech_to_text/must-c)


### Additionally
45 changes: 26 additions & 19 deletions examples/speech_to_text/augmented_librispeech/README.md
@@ -4,32 +4,34 @@

The final performance of speech translation on Augmented LibriSpeech is:

> See [RESULTS](/examples/speech_to_text/augmented_librispeech/RESULTS.md) for a comparison with counterparts.
- **ASR (dmodel=256, WER)**

|Framework|Model|Dev|Test| |
|---|---|---|---|---|
|NeurST|Transformer ASR |8.3|8.9| pure end-to-end, beam=4, no length penalty |
|Espnet (Inaguma et al., 2020)| Transformer ASR + ctc | 6.5 | 6.4 | multi-task training with ctc loss |
|Model|Dev|Test|
|---|---|---|
|Transformer ASR |8.3|8.9|


- **MT/ST (dmodel=256, case-sensitive, tokenized BLEU/detokenized BLEU)**

|Framework|Model|Dev|Test|
|---|---|---|---|
|NeurST|Transformer MT |20.8 / 19.3 | 19.3 / 17.6 |
|NeurST|cascade ST (Transformer ASR -> Transformer MT) | 18.3 / 17.0| 17.4 / 16.0 |
|NeurST|end2end Transformer ST + ASR pretrain | 18.3 / 16.9 | 16.9 / 15.5 |
|Model|Dev|Test|
|---|---|---|
|Transformer MT |20.8 / 19.3 | 19.3 / 17.6 |
|cascade ST (Transformer ASR -> Transformer MT) | 18.3 / 17.0| 17.4 / 16.0 |
|Transformer ST + ASR pretrain | 18.3 / 16.9 | 16.9 / 15.5 |
|Transformer ST + ASR pretrain + SpecAug | 19.3 / 17.8 | 17.8 / 16.3 |
|Transformer ST ensemble above 2 models | **19.3** / **18.0** | **18.3 / 16.8** |

- **MT/ST (dmodel=256, case-insensitive, tokenized BLEU/detokenized BLEU)**

|Framework|Model|Dev|Test|
|---|---|---|---|
|NeurST|Transformer MT | 21.7 / 20.2 | 20.2 / 18.5 |
|Espnet (Inaguma et al., 2020)| Transformer MT| ---- / 19.6 | ---- / 18.1 |
|NeurST|cascade ST (Transformer ASR -> Transformer MT) | 19.2 / 17.8 | 18.2 / 16.8 |
|Espnet (Inaguma et al., 2020)| cascade ST (Transformer ASR + ctc -> Transformer MT) | ---- / ---- | ---- / 17.0 |
|NeurST|end2end Transformer ST + ASR pretrain | 19.2 / 17.8 | 17.9 / 16.5 |
|Espnet (Inaguma et al., 2020)|end2end Transformer ST + ASR pretrain | ---- / ---- | ---- / 15.5 |
|Espnet (Inaguma et al., 2020)|end2end Transformer ST + ASR/MT pretrain + SpecAug | ---- / ---- | ---- / 16.7 |
|Model|Dev|Test|
|---|---|---|
|Transformer MT | 21.7 / 20.2 | 20.2 / 18.5 |
|cascade ST (Transformer ASR -> Transformer MT) | 19.2 / 17.8 | 18.2 / 16.8 |
|Transformer ST + ASR pretrain | 19.2 / 17.8 | 17.9 / 16.5 |
|Transformer ST + ASR pretrain + SpecAug | 20.2 / 18.7 | 18.7 / 17.2 |
|Transformer ST ensemble above 2 models | **20.3** / **18.9** | **19.2** / **17.7** |

In this recipe, we will introduce how to pre-process the Augmented LibriSpeech corpus and train/evaluate a speech translation model using NeurST.

@@ -44,6 +46,7 @@ In this recipe, we will introduce how to pre-process the Augmented LibriSpeech c
* [Accelerating Training with TensorFlow XLA](#accelerating-training-with-tensorflow-xla)
* [Evaluation on Testset](#evaluation-on-testset)
* [Training ST with ASR pretraining](#training-st-with-asr-pretraining)
* [SpecAugment](#specaugment)
* [Cascade ST](#cascade-st)


@@ -224,7 +227,7 @@ python3 -m neurst.cli.run_exp \
--model_dir /path_to_data/asr_st/asr_benchmark
```

This process will constantly scan the `model_dir`, evaluate each checkpoint and store the checkpoints with best metrics (e.g. WER for ASR) into `{model_dir}/best` directory along with the corresponding averaged version into `{model_dir}/best_avg`.
This process constantly scans the `model_dir`, evaluates each checkpoint, and stores the checkpoints with the best metrics (e.g. WER for ASR) in the `{model_dir}/best` directory, along with the corresponding averaged version (by default, over the 10 latest checkpoints) in `{model_dir}/best_avg`.
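Conceptually, the averaging step is an element-wise mean over the saved weight arrays; a minimal sketch (illustrative only — NeurST's checkpoint tools operate on TensorFlow checkpoints directly):

```python
import numpy as np

def average_checkpoints(checkpoints):
    """Element-wise average of model weights across several checkpoints.
    `checkpoints` is a list of {variable_name: np.ndarray} dicts."""
    avg = {}
    for name in checkpoints[0]:
        avg[name] = np.mean([ckpt[name] for ckpt in checkpoints], axis=0)
    return avg
```

Averaging the last few checkpoints typically smooths out noise from the final optimization steps and gives a small, cheap quality gain.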

### Evaluation on Testset
By running with
@@ -255,6 +258,10 @@ On this basis, we can further initialize the ST decoder with MT decoder by follo
> To inspect the names of model variables, use `inspect_checkpoint` tool (see [neurst/cli/README.md](/neurst/cli/README.md)).

### SpecAugment
To further improve the performance of ASR or ST, we can apply SpecAugment (Park et al., 2019) via the option `--specaug VALUE`. The VALUE can be one of LB, LD, SM, and SS (the policies described in the original paper), or a JSON-like string defining detailed arguments (see [neurst/utils/audio_lib.py](/neurst/utils/audio_lib.py)).
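The masking that SpecAugment applies can be sketched as follows (parameter names and defaults here are illustrative, not NeurST's actual `--specaug` arguments):

```python
import numpy as np

def spec_augment(spec, num_freq_masks=2, freq_mask_width=27,
                 num_time_masks=2, time_mask_width=40, rng=None):
    """Randomly zero out frequency and time bands of a (time, freq)
    log-mel spectrogram, following SpecAugment (Park et al., 2019)."""
    if rng is None:
        rng = np.random.default_rng()
    spec = spec.copy()
    num_frames, num_bins = spec.shape
    for _ in range(num_freq_masks):          # frequency masking
        f = int(rng.integers(0, min(freq_mask_width, num_bins) + 1))
        f0 = int(rng.integers(0, num_bins - f + 1))
        spec[:, f0:f0 + f] = 0.0
    for _ in range(num_time_masks):          # time masking
        t = int(rng.integers(0, min(time_mask_width, num_frames) + 1))
        t0 = int(rng.integers(0, num_frames - t + 1))
        spec[t0:t0 + t, :] = 0.0
    return spec
```

The masks act as a cheap data augmentation on the input features, making the encoder robust to partial loss of frequency and time information.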


### Cascade ST
NeurST provides a `cascade_st` tool for easily combining ASR and MT models, e.g.

60 changes: 60 additions & 0 deletions examples/speech_to_text/augmented_librispeech/RESULTS.md
@@ -0,0 +1,60 @@
# Results on Augmented LibriSpeech


### Comparison with counterparts (speech_transformer_s)
Test set, case-insensitive BLEU

|Model|tok|detok|
|---|---|---|
|Transformer ST + ASR PT (1)| - |15.5|
|Transformer ST + ASR/MT PT (1)| - |16.2|
|Transformer ST + ASR/MT PT + SpecAug (1) | - |16.7|
|Transformer ST ensemble 3 models (1) | - | 17.4|
|Transformer ST + ASR/MT PT (2)| 14.3 | - |
|Transformer ST + ASR/MT PT + KD (2) | 17.0 | - |
|Transformer ST + ASR PT + SpecAug (3) | 16.9 | - |
|Transformer ST + ASR PT + curriculum pre-training + SpecAug (3) | 18.0 | - |
|Transformer ST + ASR PT (4) | 15.3 | - |
|Transformer ST + triple supervision (TED) (4) | 18.3 | - |
|**NeurST** Transformer ST + ASR PT | 17.9 | 16.5 |
|**NeurST** Transformer ST + ASR PT + SpecAug | 18.7 | 17.2 |
|**NeurST** Transformer ST ensemble 2 models | **19.2** | **17.7**|

(1) Espnet-ST (Inaguma et al., 2020) with additional techniques: speed perturbation, pre-trained MT decoder and CTC loss for ASR pretrain;

(2) Liu et al. (2019) with the proposed knowledge distillation;

(3) Wang et al. (2020) with additional ASR corpora and curriculum pre-training;

(4) Dong et al. (2020) with CTC loss and a pre-trained BERT encoder as supervision with external ASR data;


### ASR (dmodel=256, WER)

|Framework|Model|Dev|Test| |
|---|---|---|---|---|
|NeurST|Transformer ASR |8.3|8.9| pure end-to-end, beam=4, no length penalty |
|Espnet (Inaguma et al., 2020)| Transformer ASR + ctc | 6.5 | 6.4 | multi-task training with ctc loss |
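The WER figures above are the standard word-level edit distance divided by the reference length; a minimal sketch of the metric:

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level Levenshtein distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # DP table: d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)
```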


### MT/ST (dmodel=256, case-sensitive, tokenized BLEU/detokenized BLEU)

|Framework|Model|Dev|Test|
|---|---|---|---|
|NeurST|Transformer MT |20.8 / 19.3 | 19.3 / 17.6 |
|NeurST|cascade ST (Transformer ASR -> Transformer MT) | 18.3 / 17.0| 17.4 / 16.0 |
|NeurST|end2end Transformer ST + ASR pretrain | 18.3 / 16.9 | 16.9 / 15.5 |
|NeurST|end2end Transformer ST + ASR pretrain + SpecAug | 19.3 / 17.8 | 17.8 / 16.3 |
|NeurST|end2end Transformer ST ensemble above 2 models | 19.3 / 18.0 | 18.3 / 16.8 |

### MT/Cascaded (dmodel=256, case-insensitive, tokenized BLEU/detokenized BLEU)

|Framework|Model|Dev|Test|
|---|---|---|---|
|NeurST|Transformer MT | 21.7 / 20.2 | 20.2 / 18.5 |
|Espnet (Inaguma et al., 2020)| Transformer MT| ---- / 19.6 | ---- / 18.1 |
|NeurST|cascade ST (Transformer ASR -> Transformer MT) | 19.2 / 17.8 | 18.2 / 16.8 |
|Espnet (Inaguma et al., 2020)| cascade ST (Transformer ASR + ctc -> Transformer MT) | ---- / ---- | ---- / 17.0 |



Original file line number Diff line number Diff line change
Expand Up @@ -13,9 +13,9 @@ entry.params:
beta_2: 0.98
lr_schedule.class: noam
lr_schedule.params:
initial_factor: 5.0
initial_factor: 3.5
end_factor: 2.0
dmodel: 512
dmodel: 256
warmup_steps: 25000
start_decay_at: 50000
decay_steps: 50000
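The `noam` schedule configured above follows the formula from "Attention Is All You Need": linear warmup followed by inverse-square-root decay, scaled by a factor. A minimal sketch (this omits the additional `end_factor`/`start_decay_at`/`decay_steps` factor decay that the config suggests NeurST applies):

```python
def noam_lr(step, dmodel=256, warmup_steps=25000, factor=3.5):
    """Noam learning-rate schedule for step >= 1: warm up linearly to the
    peak at `warmup_steps`, then decay as 1/sqrt(step)."""
    return factor * dmodel ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
```

Note that a smaller `dmodel` raises the resulting learning rate, which is why the factor is often retuned when the model size changes.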
Original file line number Diff line number Diff line change
Expand Up @@ -11,7 +11,7 @@ validator.params:
eval_search_method: beam_search
eval_search_method.params:
beam_size: 4
length_penalty: 0.6
length_penalty: -1
maximum_decode_length: 180
extra_decode_length: 50
eval_metric: tok_bleu
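We read `length_penalty: -1` above as disabling length normalization during beam search. When enabled, the commonly used GNMT length penalty (Wu et al., 2016), by which each hypothesis's log-probability is divided, looks like:

```python
def gnmt_length_penalty(length, alpha):
    """GNMT length penalty: ((5 + length) / 6) ** alpha.
    alpha <= 0 is commonly treated as 'disabled' (no normalization)."""
    if alpha <= 0:
        return 1.0
    return ((5.0 + length) / 6.0) ** alpha
```

Larger `alpha` favors longer hypotheses, countering beam search's bias toward short outputs.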
Original file line number Diff line number Diff line change
Expand Up @@ -13,9 +13,9 @@ entry.params:
beta_2: 0.98
lr_schedule.class: noam
lr_schedule.params:
initial_factor: 4.0
end_factor: 2.0
dmodel: 512
initial_factor: 3.5
end_factor: 1.5
dmodel: 256
warmup_steps: 25000
start_decay_at: 50000
decay_steps: 50000
41 changes: 41 additions & 0 deletions examples/speech_to_text/must-c/01-download.sh
@@ -0,0 +1,41 @@
#!/usr/bin/env bash
# Copyright 2020 ByteDance Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

set -e

if [[ -z "$1" ]]; then
  echo "Usage: ./01-download.sh SAVE_PATH"
  exit 1
else
  DATA_PATH="$1"
fi

DATA_PATH="$DATA_PATH/raw"

mkdir -p "$DATA_PATH"

# Download from
# https://ict.fbk.eu/must-c/
# and get following tgz files:
# - MUSTC_v1.0_en-de.tar.gz
# - MUSTC_v1.0_en-es.tar.gz
# - MUSTC_v1.0_en-fr.tar.gz
# - MUSTC_v1.0_en-it.tar.gz
# - MUSTC_v1.0_en-nl.tar.gz
# - MUSTC_v1.0_en-pt.tar.gz
# - MUSTC_v1.0_en-ro.tar.gz
# - MUSTC_v1.0_en-ru.tar.gz

echo "Downloading MuST-C dataset..."
