
Noisy student training for wenet #1600

Merged: 46 commits merged into wenet-e2e:main on Dec 13, 2022
Conversation

NevermoreCY
Contributor

Here we provide a recipe from our paper for running Noisy Student Training (NST) with an LM-filter strategy, using AISHELL-1 as the supervised data and WenetSpeech as the unsupervised data.

The example code is stored under examples/aishell/NST, with a detailed guideline and results in readme.md. We mainly modified the "run.sh" script and added "executor_nst.py" and "train_nst.py", as well as some auxiliary code and examples under the local directory.
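For readers unfamiliar with the training scheme, here is a minimal sketch of the NST loop with an LM filter that the recipe implements. The helper names (train_model, decode, lm_filter) are hypothetical placeholders for the steps that run.sh and the scripts under local/ actually perform.

```python
# Minimal sketch of Noisy Student Training (NST) with an LM filter.
# train_model, decode and lm_filter are hypothetical placeholders for
# the steps driven by run.sh; they are not functions in this recipe.

def noisy_student_training(supervised_data, unsupervised_data, num_iterations=4):
    # Train the initial teacher on the supervised set (AISHELL-1).
    teacher = train_model(supervised_data)
    for _ in range(num_iterations):
        # The teacher produces pseudo labels for the unsupervised set (WenetSpeech).
        pseudo_labeled = decode(teacher, unsupervised_data)
        # Keep only utterances whose pseudo labels pass the LM-based filter.
        filtered = lm_filter(pseudo_labeled)
        # Train a student on supervised data plus filtered pseudo-labeled data;
        # the student becomes the teacher for the next iteration.
        teacher = train_model(supervised_data + filtered)
    return teacher
```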

@wd929
Contributor

wd929 commented Dec 5, 2022

@robin1001 hi Binbin, would you mind reviewing this PR? Thanks!!

print("unsupervised data list = ", args.train_data_unsupervised)

cv_conf = copy.deepcopy(train_supervised_conf)
# cv_conf['speed_perturb'] = False
Collaborator

why use speed_perturb and spec_aug in cv?

Contributor Author

Hi Robin, thanks for pointing out this issue. speed_perturb, spec_aug, and shuffle should all be set to False for CV. I traced back my history and found this was a mis-comment from a recent revision; in our experiment cluster they were set to False.
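For reference, a minimal sketch of the corrected CV configuration, assuming the dict-style dataset config used elsewhere in this script (the exact keys in train_nst.py may differ):

```python
import copy

# CV should see clean, deterministic data: no augmentation, no shuffling.
cv_conf = copy.deepcopy(train_supervised_conf)
cv_conf['speed_perturb'] = False
cv_conf['spec_aug'] = False
cv_conf['shuffle'] = False
```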

@NevermoreCY
Contributor Author

flake8-bugbear was updated from version 22.10.27 to 22.12.6, and its check for the zip function was changed. It seems code containing zip() in other sections no longer passes the flake8 test.

@xingchensong
Member

flake8-bugbear was updated from version 22.10.27 to 22.12.6, and its check for the zip function was changed. It seems code containing zip() in other sections no longer passes the flake8 test.

Fixed in #1604, please rebase your commits.

@NevermoreCY
Contributor Author

It works now, thanks!

pin_memory=args.pin_memory,
num_workers=args.num_workers,
prefetch_factor=args.prefetch)
executor = Executor_nst()
Collaborator

Better to name it ExecutorNst.

from tensorboardX import SummaryWriter
from torch.utils.data import DataLoader
from wenet.dataset.dataset import Dataset
from wenet.transformer.asr_model import init_asr_model
Member

What are your base commits? We refactored model initialization in PR #1280; call wenet.utils.init_model.init_model instead of wenet.transformer.asr_model.init_asr_model.
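A minimal sketch of the updated import after the #1280 refactor; the call site is an assumption based on the stock train.py, and the exact signature may differ between wenet versions:

```python
# After PR #1280, use the shared init_model helper instead of the
# transformer-specific init_asr_model.
from wenet.utils.init_model import init_model

# Hypothetical call site: in the stock train.py the model is built
# from the loaded YAML configs dict.
model = init_model(configs)
```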

@wd929
Contributor

wd929 commented Dec 13, 2022

Could you please review this PR? @robin1001 and @xingchensong Thanks!!^^

Collaborator

@robin1001 robin1001 left a comment

please remove the files which are not required for the recipe.

def __init__(self):
    self.step = 0

def train(self, model, optimizer, scheduler, data_loader_aishell,
Collaborator

Better to name them data_loader_supervised and data_loader_unsupervised.
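A minimal sketch of the renamed signature; the trailing parameters (device, writer, args, scaler) are assumptions modeled on wenet's stock Executor and may not match train_nst.py exactly:

```python
class ExecutorNst:

    def __init__(self):
        self.step = 0

    # Renamed loaders make the data roles explicit:
    #   data_loader_supervised   -> AISHELL-1 batches with ground-truth labels
    #   data_loader_unsupervised -> WenetSpeech batches with pseudo labels
    def train(self, model, optimizer, scheduler, data_loader_supervised,
              data_loader_unsupervised, device, writer, args, scaler):
        ...
```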

--val_best
fi

# export model
Collaborator

export_jit is not required here.

@@ -0,0 +1,11 @@
data/train/shards/shards_000000000.tar
Collaborator

These example files can be removed.

@@ -0,0 +1,143 @@
# network architecture
Collaborator

This file is not required, right?

@@ -0,0 +1,11 @@
data/train/shards/shards_000000000.tar
Collaborator

please remove data_example/train dir

@@ -0,0 +1 @@
你好你好
Collaborator

remove wav_dir

@@ -0,0 +1,2 @@
data/train/wenet_1khr_tar//dir0_000000.tar
data/train/wenet_1khr_tar//dir0_000001.tar
Collaborator

remove the file

Comment on lines 251 to 253
if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ] && [ ${enable_nst} -eq 0 ]; then
echo "********step 3 start time : $now ********"
python split_data_list.py \
Member

python local/split_data_list.py

@xingchensong
Member

Great Job, LGTM.

@NevermoreCY
Contributor Author

please remove the files which are not required for the recipe.

The data example directory, train_nst.py, and executor.py have been removed; the problems you mentioned should be solved.

@robin1001 robin1001 merged commit 9060ab2 into wenet-e2e:main Dec 13, 2022