WIP: Added babel_multilang example directory for multilingual setup using babel languages. #1027
pegahgh wants to merge 21 commits into kaldi-asr:master
Conversation
vijayaditya
left a comment
Added comments on run_common_langs.sh.
@@ -0,0 +1,121 @@
#!/bin/bash
Please add a comment describing what this script does.
train_stage=-10
generate_alignments=true # false if doing ctc training
speed_perturb=true
use_flp=false
@pegahgh many of these variables are different from a normal nnet3 script. Please describe what they mean.
#Although the nnet will be trained by high resolution data, we still have to perturbe the normal data to get the alignment
# _sp stands for speed-perturbed
for datadir in train; do
utils/perturb_data_dir_speed.sh 0.9 data/$L/${datadir} data/$L/temp1
use the *perturb_3way.sh script. Please add volume perturbation, which can also be done using a script like utils/data/volume_perturb.sh. Please check for the exact names.
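For example, something along these lines (just a sketch; the data-dir names are placeholders and the exact script names should be double-checked):

    # 3-way speed perturbation (0.9/1.0/1.1), then in-place volume perturbation
    utils/data/perturb_data_dir_speed_3way.sh data/$L/${datadir} data/$L/${datadir}_sp
    utils/data/perturb_data_dir_volume.sh data/$L/${datadir}_sp || exit 1;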
The only concern is that we should redo plp+pitch extraction for original data by using this script!
. ./utils/parse_options.sh

L=$1
use a better variable name, L is uninformative.
done
fi

train_set=train_sp
When checking for .done scripts, print out a statement to the user describing what stage is being skipped, why and what the user can do to force the script to run that stage.
You could probably also check if the data/*_sp directories are done and skip the feature extraction.
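Something like the following would do (a sketch only; the .done path and the make_mfcc.sh invocation are illustrative, loosely based on this script):

    done_file=data/$L/${train_set}_hires/.done
    if [ -f $done_file ]; then
      echo "$0: $done_file exists, so skipping hires MFCC extraction for ${train_set}."
      echo "$0: Remove $done_file if you want to force this stage to re-run."
    else
      steps/make_mfcc.sh --nj 70 --mfcc-config conf/mfcc_hires.conf \
        --cmd "$train_cmd" data/$L/${train_set}_hires || exit 1;
      touch $done_file
    fi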
utils/data/perturb_data_dir_volume.sh $data_dir || exit 1 ;

steps/make_mfcc.sh --nj 70 --mfcc-config conf/mfcc_hires.conf \
With the new make_mfcc.sh script you need to specify the log and ark directories.
It is in the next line!!!!!!
if [ $stage -le 4 ]; then
if [[ "$use_pitch" == "true" ]]; then
echo use_pitch = $use_pitch
Echo a user-understandable message like:
$0: Generating pitch features for <data-dir-name(s)> as use_pitch=$use_pitch
steps/make_pitch.sh --nj 70 --pitch-config $pitch_conf \
--cmd "$train_cmd" data/$L/${dataset}_pitch exp/$L/make_pitch/${dataset} $pitchdir;
fi
aux_suffix=_pitch
aux_suffix was a bit confusing for me and I had to search for its meaning. Do you think feat_suffix or pitch_suffix might be better ?
changed to feat_suffix!
fi

if [ ! -f data/$L/${dataset}_mfcc${aux_suffix}/feats.scp ]; then
steps/append_feats.sh --nj 16 --cmd "$train_cmd" data/$L/${dataset} \
Please check if append_feats.sh can support different types of argument lists like make_mfcc.sh. If yes drop the log-dir and ark-dir specification. If not could you please add this feature to append_feats.sh just for the sake of consistency across scripts.
I am not sure if Dan agrees, since lots of scripts use append_feats.sh! We can do that later!
I agree with Vijay that it would be a good idea to extend it in the same way-- now is a good opportunity to test the changes, since you'll have to rerun your stuff to test it after other changes anyway.
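Roughly the kind of argument handling being asked for (a sketch; it assumes append_feats.sh keeps its three data-dir arguments and only the log/feature dirs become optional, defaulting to subdirectories of the destination dir as in the new make_mfcc.sh):

    if [ $# -lt 3 ] || [ $# -gt 5 ]; then
      echo "Usage: $0 [options] <src-data-dir1> <src-data-dir2> <dest-data-dir> [<log-dir> [<feat-dir>]]"
      exit 1;
    fi
    data_src1=$1; data_src2=$2; data=$3
    logdir=${4:-$data/log}     # default: log subdirectory of the destination data dir
    featdir=${5:-$data/data}   # default: feature subdirectory of the destination data dir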
vijayaditya
left a comment
I will continue the review over the weekend, as I have to work on some paper submissions.
@@ -0,0 +1,51 @@
#!/bin/bash
Once again, add a comment describing what this script does.
set -e
stage=1
train_stage=-10
generate_alignments=true # false if doing ctc training
Please describe the variables.
train_set=train_sp
fi

extractor=$global_extractor
Why is the extractor variable even required ? Could you not just use global_extractor at line 45.
. ./utils/parse_options.sh

L=$1
Once again change the name L.
@@ -0,0 +1,241 @@
#!/bin/bash

# This is a crosslingual training setup where there are no shared phones.
Minor grammatical mistakes in the comment. Please check.
| echo "$0: creating neural net config for multilingual setups" | ||
| # create the config files for nnet initialization | ||
| $cmd $dir/log/make_config.log \ | ||
| python steps/nnet3/multi/make_configs.py \ |
There was a problem hiding this comment.
I would suggest that you call this script steps/nnet3/multilingual/make_tdnn_configs.py, as you might want to add support for other architectures later on.
In other make_configs.py scripts there is support for specifying either the {feat,ivector}-dim and num-targets or providing the corresponding directories from which the script extracts the required values. This makes your top level scripts less verbose. I would recommend adding this support to this config generator script, at least for the sake of consistency.
It already supports either the num-targets option or a directory from which to extract this value.
Do you mean providing an option to read multiple ali-dirs for all languages and extract num-targets per language?
@pegahgh I don't see the point of extracting feat_dim in this script if your config generator already supports providing a feat-dir. Same for ivector-dim.
# create the config files for nnet initialization
$cmd $dir/log/make_config.log \
python steps/nnet3/multi/make_configs.py \
--splice-indexes "$splice_indexes" \
I would recommend removing variables in this script which are not used more than once. This helps for easy reading.
nnet3-init --srand=-2 $dir/configs/init.config $dir/init.raw || exit 1;
fi

. $dir/configs/vars || exit 1;
describe which variables are being sourced here.
. $dir/configs/vars || exit 1;

if [ $stage -le 10 ]; then
Are you sure you want to do all this in a top-level script in local/? I would recommend moving these steps to a script like steps/nnet3/multilingual/get_egs.sh.
echo print-interval = $print_interval
if [ $stage -le 11 ]; then
echo "$0: training mutilingual model."
steps/nnet3/multi/train_tdnn.sh --cmd "$train_cmd" \
Is this script specific to TDNNs? I think most of our training steps are similar for all feed-forward DNNs. I recommend renaming it to steps/nnet3/multilingual/train_dnn.sh.
egs/babel_multilang/s5/run-2c-bnf.sh
speed_perturb=true
multidir=exp/nnet3/multi_bnf_10_close_lang_plus_grg
global_extractor=exp/multi/nnet3/extractor
lang_list=(GRG LIT MONG TUR KAZ KUR PSH SWA TOK IGBO DHO)
Make it explicit here that this list of languages has been selected for GRG.
As suggested before please describe each variable which is not usually found in a typical nnet3 recipe.
egs/babel_multilang/s5/run-2c-bnf.sh
. ./utils/parse_options.sh

L=$1
once again, a better variable name please.
egs/babel_multilang/s5/run-2c-bnf.sh
if $speed_perturb; then
suffix=_sp
fi
exp_dir=exp/$L
This directory structure ({data,exp}/<lang-name>/) is not used in other Babel recipes. Better make it consistent with other scripts.
This is the main structure in this multilingual egs directory!
egs/babel_multilang/s5/run-2c-bnf.sh
./local/nnet3/run_tdnn_joint_babel_sp_bnf.sh --dir $multidir \
--avg-num-archives $num_archives \
--global-extractor $global_extractor \
--init-lrate $bnf_init_lrate \
Unless you anticipate the users to tune the variables a lot, please eliminate any variable which is not used more than once.
As suggested before any time you skip a stage as it is already done, please let the user know what you are skipping and how they can force that stage to run.
vijayaditya
left a comment
Completed review of a few more scripts.
egs/babel_multilang/s5/run-2c-bnf.sh
if [ ! -f $data_bnf_dir/.done ]; then
mkdir -p $dump_bnf_dir
# put the archives in ${dump_bnf_dir}/.
steps/nnet3/dump_bottleneck_features.sh --use-gpu true --nj $train_nj --cmd "$train_cmd" \
I would recommend adding support for different argument lists similar to steps/make_mfcc.sh.
touch $exp_dir/tri5b/.done
fi

if [ ! $exp_dir/tri6/.done -nt $exp_dir/tri5b/.done ]; then
Do you actually use this GMM-HMM system anywhere ? If not you could probably skip this training.
@jtrmal any comments ?
I thought some people may want to use GMM-HMM trained on top of BN features!
any comment?
If this is something that's not immediately needed, but you think may be needed in future, you could just put it inside if false; then .... fi.
@@ -0,0 +1,471 @@
#!/bin/bash
I am not familiar with all the possible use-cases for this script. @jtrmal might be better suited to review this script.
set -e
stage=1
train_stage=-10
generate_alignments=true # false if doing ctc training
Any references to CTC are outdated and should be removed, both here and in whichever script you copied this from.
egs/babel_multilang/s5/run-2c-bnf.sh
@@ -0,0 +1,119 @@
#!/bin/bash
# -v3 is as -v2 but it just consists 10 closest fullLP langs to GRG + GRG.
I think these comments may be out of date. Comments at the top of this script should say what the script does... not clear to me right now.
Pegah, please update the comments on this script... I think you may have also missed my other comments.
// Returns a random integer number according to a discrete probability distribution.
// It works based on sampling from a discrete distribution and
// it returns i with prob(i).
document that 'prob' must sum to one.
src/nnet3/nnet-nnet.cc
return -1;
}

void Nnet::RenameNode(int32 node_index, const std::string &new_node_name) {
I think this function should be called SetNodeName for consistency with GetNodeName(), and
document that this can be used, for example, for renaming output nodes.
But you should do some more checking here.
Do you require that after this renaming, the nnet should still be valid? I would expect so.
If so, then call Check() at the end of this function. You should probably also verify that IsValidName(new_node_name).
src/nnet3bin/nnet3-copy.cc
if (!rename_node_names.empty()) {
std::vector<string> orig_names,
new_names;
//GetRenameNodeNames(rename_node_names, &orig_names, new_names);
I think making this a function (appropriately documented!) in nnet-utils.h called
void RenameNodes(std::string &rename_node_names, Nnet *nnet);
would make this much nicer.
// or crashes if it is not there.
int32 NumOutputIndexes(const NnetExample &eg) {
for (size_t i = 0; i < eg.io.size(); i++)
if (eg.io[i].name == "output")
You can't just change the implementation of functions without re-doing their documentation, like this!
Think about what would be the right way to do this.
Actually this is probably OK, I see that it is in a binary, not in nnet3/.
The way this is set up right now, the training is going to be extremely inefficient because graph compilation will happen on every single minibatch. This is because the minibatches have randomized structure.
The way I think you should solve this is to modify nnet3-merge-egs so that it batches together inputs that have the same structure (where by "the same structure" I mean the set of input and output names). It may be best to write a class to do the merging. Also the order of the NnetIo objects in the eg is not necessarily well defined, and should be ignored, so bear that in mind.
typedef kaldi::int64 int64;

const char *usage =
"Copy examples (single frames or fixed-size groups of frames) for neural\n"
Update the Usage message.
src/nnet3bin/nnet3-copy.cc
po.Read(argc, argv);
po.Register("rename-node-names", &rename_node_names, "Comma-separated list of noed names need to be modified"
" and their new name. e.g. 'affine0/affine0-lang1,affine1/affine1-lang1'");
po.Register("add-output-nodes", &add_output_nodes, "Comma-separated list of output node name and its input"
Comma-separated list of output node names and their corresponding input descriptors, to be added to the nnet config.
nj=4
cmd=run.pl
use_gpu=false
bnf_name=renorm4
Bnf_name seems to be a very critical parameter in this script. I think it is unwise to assign it by default, as we frequently vary the neural network architecture in our models.
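One way to do that (a sketch; option and variable names follow this script's conventions, not a tested change):

    bnf_name=   # deliberately no default; must be given with --bnf-name <node-name>
    . parse_options.sh || exit 1;
    if [ -z "$bnf_name" ]; then
      echo "$0: please specify the node to dump with --bnf-name (e.g. --bnf-name renorm4)"
      exit 1;
    fi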
[ -f path.sh ] && . ./path.sh # source the path.
. parse_options.sh || exit 1;

if [ $# != 5 ]; then
As I had suggested for other scripts, it might be better to support multiple argument lists like steps/make_mfcc.sh for the sake of consistency. By default you could drop the arkdir and logdir arguments.
I agree that it's better to use the new style of putting these things in subdirectories of the destination directory.
| *) echo "Invalid feature type $feat_type" && exit 1; | ||
| esac | ||
|
|
||
| if [ ! -z "$transform_dir" ]; then |
There was a problem hiding this comment.
We are not using fMLLR/lda features in the nnet3 recipes. You could simplify your script by dropping support for these features.
N0=$(cat $data/feats.scp | wc -l)
N1=$(cat $archivedir/raw_bnfeat_$name.*.scp | wc -l)
if [[ "$N0" != "$N1" ]]; then
echo "Error happens when generating BNF for $name (Original:$N0 BNF:$N1)"
echo "$0 : An error occurred...."
# Concatenate feats.scp into bnf_data
for n in $(seq $nj); do cat $archivedir/raw_bnfeat_$name.$n.scp; done > $bnf_data/feats.scp

for f in segments spk2utt text utt2spk wav.scp char.stm glm kws reco2file_and_channel stm; do

@@ -0,0 +1,147 @@
#!/bin/bash
For the sake of consistency with naming convention used in Kaldi, use the name
steps/nnet3/make_bottleneck_features.sh.
if [ $stage -le 1 ]; then
echo "Making BNF scp and ark."
echo output-node name=output input=$bnf_name > output.config
modified_bnf_nnet="nnet3-copy --rename-node-names=output/output-bkp $bnf_nnet - | nnet3-init - output.config - |"
What happens if there already exists a node with the name output-bkp in your network ?
Let's just assume there was no node named output-bkp. But about this-- please see Issue #1040, eventually it will be possible to rename the node via config files using nnet3-init or nnet-copy. I'm not sure if we should finish that issue before merging this, but anyway, proceed as if we were merging this first.
if [ $stage -le 1 ]; then
echo "Making BNF scp and ark."
echo "$0: Generating bottle-neck features"
@@ -0,0 +1,678 @@
#!/usr/bin/env python
Do you plan to support chain training with the multi-lingual training scripts ? If not you could probably eliminate few options and simplify the script significantly.
def GetArgs():
# we add compulsary arguments as named arguments for readability
parser = argparse.ArgumentParser(description="Writes config files and variables "
Modify the description.
| help="iVector dir, which will be used to derive the ivector-dim ", default=None) | ||
|
|
||
| num_target_group = parser.add_mutually_exclusive_group(required = True) | ||
| num_target_group.add_argument("--num-targets", type=int, |
There was a problem hiding this comment.
This mutually exclusive group does not make sense to me. Please rewrite this script rather than modifying the existing script.
Why do you think it doesn't make sense? You can provide either num-targets or num-multiple-targets!!
Why do you still need the options corresponding to uni-lingual networks (--num-targets, --ali-dir and --tree-dir) ?
There are several other parts of this script which are not relevant in the context of multi-lingual networks. You could cull all these unnecessary parts to simplify this script substantially. In its current form this script is able to generate networks for uni-lingual xent training, uni-lingual chain training and multi-lingual xent training. I would strongly suggest that you keep the config generator scripts simple.
return args

def AddPerDimAffineLayer(config_lines, name, input, input_window):
This config generator script seems to have been copied from an older TDNN config generator. The new script eliminates a lot of these options.
exit_stage=-100 # you can set this to terminate the training early. Exits before running this stage

# count space-separated fields in splice_indexes to get num-hidden-layers.
splice_indexes="-4,-3,-2,-1,0,1,2,3,4 0 -2,2 0 -4,4 0"
Please use your current best splice_indexes as the default.
unset args[${#args[@]}-1]
num_lang=$[${#args[@]}/3]

for i in `seq 0 $[$num_lang-1]`; do
I see a large potential for argument offset errors.
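As a sketch of what might make the offsets easier to audit (this assumes the grouped layout from the usage message elsewhere in this PR: N data dirs, then N alignment dirs, then N egs dirs, followed by the multilingual egs dir):

    num_lang=$1; shift
    args=("$@")
    megs_dir=${args[${#args[@]}-1]}          # last argument: multilingual egs dir
    unset args[${#args[@]}-1]
    if [ ${#args[@]} -ne $[3*$num_lang] ]; then
      echo "$0: expected $[3*$num_lang] per-language arguments, got ${#args[@]}"; exit 1;
    fi
    for i in `seq 0 $[$num_lang-1]`; do
      data_dir=${args[$i]}
      ali_dir=${args[$[$num_lang+$i]]}
      egs_dir=${args[$[2*$num_lang+$i]]}
      echo "$0: lang $i: data=$data_dir ali=$ali_dir egs=$egs_dir"
    done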
| parser.add_argument("--relu-dim", type=int, | ||
| help="dimension of ReLU nonlinearities") | ||
|
|
||
| parser.add_argument("--self-repair-scale-nonlinearity", type=float, |
There was a problem hiding this comment.
you are undoing recent changes committed by @freewym .
@@ -104,10 +104,16 @@ def GetArgs():
parser.add_argument("--relu-dim", type=int,
@pegahgh Your code is undoing most of the recent changes. Please update your branch.
@pegahgh I will continue review after the requested changes have been made. Please remember to update the usage messages and documentation in the C++ code.

@danpovey @vijayaditya Thanks for comments.

As Pegah, Vijay and I discussed offline, this is going to be reworked with a slightly different design geared towards greater I/O efficiency, so I'm marking it as WIP.
. parse_options.sh || exit 1;

if [ $# -lt 1 ]; then
echo "Usage: $0 [opts] <data> <lang> <ali-dir> <exp-dir>"
The usage message needs to be accurate.
exit 1;
fi
#data=$1
remove these old comments.
# Set off jobs doing some diagnostics, in the background.
# Use the egs dir from the previous iteration for the diagnostics
for i in `seq 0 $[$num_lang-1]`;do
rename_io_names="output-$i/output"
For computing the diagnostics probabilities, I don't see that it's necessary or desirable to do this renaming of outputs.
nnet3-compute-prob already, I believe, computes the output for each of the output nodes that is defined, and separately prints those stats per output layer. All that might be needed is to make sure that it doesn't die because of the 'IsSimpleNnet()' check.
if [ $x -gt 0 ]; then
rename_io_names="output-0/output"

$cmd $dir/log/progress.$x.log \
For the progress logs, I would advise just to do it once, and to leave off the last argument (the egs). The aspects of that program that require the egs are relatively unimportant.
use_ivector=false
initial_effective_lrate=0.01
final_effective_lrate=0.001
pnorm_input_dim=3000
I believe this variable is unused; please check for other unused variables.
lang_id=${range[0]}
start_egs=${range[1]}
end_egs=$[$start_egs+${range[2]}]
awk -v s="$start_egs" -v e="$end_egs" 'NR >= s && NR < e' ${multi_egs_dirs[$lang_id]}/egs.scp >> $scp_file;
This looks like a super-inefficient way to do it-- it would take time quadratic in the amount of input, not linear.
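A single-pass alternative might look like this (a sketch with hypothetical file names): read all the ranges for one language first, then stream that language's egs.scp once, writing each line to every range it falls in, instead of re-reading egs.scp for every range line.

    # ranges_for_lang: lines of the form "<start-line> <end-line> <output-scp>"
    awk 'NR == FNR { starts[++n] = $1; ends[n] = $2; outs[n] = $3; next }
         { for (i = 1; i <= n; i++)
             if (FNR >= starts[i] && FNR < ends[i])
               print $0 > outs[i] }' \
      ranges_for_lang ${multi_egs_dirs[$lang_id]}/egs.scp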
# in multilingual training.

if [ $# -lt 4 ]; then
echo "$0: Usage: $0 num-langs [<egs-dir-lang-1> .. <egs-dir-lang-n>] <ranges.1> <scp-1>"
this usage message is very unclear, it doesn't explain what the "ranges" or "scp" formats are, and things like <ranges.1> are not clear.
In any case, I think there might be a deeper design problem here, making it super slow.
if true; then
echo "$0: Generate egs for ${lang_list[$lang]}"
if [[ $(hostname -f) == *.clsp.jhu.edu ]] && [ ! -d $egs_dir/storage ]; then
utils/create_split_dir.pl \
this create_split_dir stuff is not the kind of thing that should appear in scripts inside steps/.
. parse_options.sh || exit 1;

if [ $# -lt 4 ]; then
echo "Usage: $0 [opts] num-input-langs <data-dir-per-lang> <ali-dir-per-lang> <egs-dir-per-lang> <multilingual-egs-dir>"
I think it would be better to just take as input the egs dirs from all the languages, assuming they already exist; and move the responsibility for dumping those egs to a further-out level.
@@ -0,0 +1,158 @@
#!/bin/bash
@danpovey
I added steps/nnet3/multilingual/get_egs.sh, which generates egs.*.scp, outputs.*.scp, and weights.*.scp for multilingual training using a Python script.
The details of egs.*.scp generation are described in the comments.
If you agree with the whole egs-generation pipeline, I will start on the next steps by making the following changes:
- Modify nnet3-copy-egs to accept output and weight
- Add steps/nnet3/train_raw.py and modify it to be compatible with the multilingual training setup.
- Modify all local/nnet3/run_* w.r.t. the new pipeline.
@@ -0,0 +1,209 @@
#!/usr/bin/env python
@danpovey
This Python script reads lang2len as input; it randomly selects a new language w.r.t. the probability of the remaining examples in each language and outputs ranges.* files,
and then the bash script get_egs.sh generates egs.*.scp from the generated ranges.* in parallel.
The reason I didn't read the scp files in Python was that I would have to load the whole scp file for each language into memory, which is not efficient!
Should I change the process to stream the scp files in Python line by line (a batch of minibatch-size each time)?
Doing it in parallel will just overwhelm the NFS server. Implement it in a
@vijayaditya
OK will review today.
vijayaditya
left a comment
started the review. Will continue later..
@@ -0,0 +1 @@
../../../babel/s5c/conf/common.fullLP
\ No newline at end of file
We don't encourage soft-linking files across egs, as the source recipes can change anytime.
exit 1;
fi
cmd=run.pl
why are you resetting cmd ?
if [ $# -lt 4 ]; then
echo "Usage: $0 [opts] num-input-langs <data-dir-per-lang> <ali-dir-per-lang> <egs-dir-per-lang> <multilingual-egs-dir>"
echo " e.g.: $0 2 data/lang1/train data/lang2/train "
" exp/lang1/tri5_ali exp/lang2/tri5_ali exp/lang1/nnet3/lang1 exp/lang2/nnet3/lang2 exp/multi/egs"
exp/lang1/nnet3/lang1: is this supposed to be exp/lang1/nnet3/egs?
stage=0
left_context=13
right_context=9
online_multi_ivector_dir= # list of iVector dir for all languages
Give an example for this variable.
ali_dir=${multi_ali_dirs[$lang]}
egs_dir=${multi_egs_dirs[$lang]}
online_ivector_dir=
if [ ! -z "$multi_ivector_dirs" ]; then
Variable name mismatch.
for datadir in train; do
./utils/data/perturb_data_dir_speed_3way.sh data/$lang/${datadir} data/$lang/${datadir}_sp

# Extract Plp+pitch feature for perturbed data.
this feature extraction needs to be done only if you want to generate alignments. Or are you assuming some other top level script needs these features ?
Actually, this feature extraction is inside the speed-perturb condition, where you need to regenerate alignments for the perturbed data.
utils/create_split_dir.pl /export/b0{1,2,3,4}/$USER/kaldi-data/egs/$lang-$date/s5c/$mfccdir/storage $mfccdir/storage
fi

# the 100k_nodup directory is copied seperately, as
These comments are irrelevant.
@@ -0,0 +1,48 @@
#!/bin/bash
# This scripts generates iVector using global iVector extractor
usually run_ivector_common*.sh also train the ivector extractor. Please rename this script.
mkdir -p nnet3
# perturbed data preparation
train_set=train
couldn't you just take these variables as input ?
speed_perturb=true
multidir=exp/nnet3/multi_bnf_10_close_lang_plus_grg
global_extractor=exp/multi/nnet3/extractor
lang_list_for_grg=(GRG LIT MONG TUR KAZ KUR PSH SWA TOK IGBO DHO)
Please describe the language codes; not everyone is aware of them.
Instead of inventing your own codes, I'd suggest using the ids from Babel, which is an established practice, i.e. 404-georgian, 304-lithuanian and so on.
Making the codes short does not help anything; it only obfuscates things.
vijayaditya
left a comment
Reviewed a few more files. Will continue again when I can find time.
[ ! -d $dump_bnf_dir ] && mkdir -p $dump_bnf_dir
if [ ! -f $data_bnf_dir/.done ]; then
As I requested in the previous review, whenever you are skipping a stage because it is already done, print a statement saying that you are skipping it, and also add a comment describing how users can force it to run, i.e., by deleting the .done files.
@@ -0,0 +1,698 @@
#!/usr/bin/env python
Do you want to add this script as part of this PR ? Or do you want to assume that Vimal's PR is going to be merged before your PR ?
I prefer to assume Vimal's PR is merged before!
}

# The function signature of MakeConfigs is changed frequently as it is intended for local use in this script.
def MakeConfigs(config_dir, splice_indexes_string,
Better to move the common parts of tdnn/make_configs.py and tdnn/multi_lingual/make_configs.py (e.g. splice string parsing, CNN specification, common argument parsing) into a shared module and include it in both scripts. See how Vimal is doing this in his PR #1066.
# This script uses separate input egs directory for each language as input,
# to generate egs.*.scp files in multilingual egs directory
# where the scp line points to the original archive for each egs directory.
# $megs/egs.*.scp is randomized w.r.t language id.
Adding snippets of these scp files would be helpful.
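For example, a couple of lines in the usual Kaldi scp form of <key> <archive>:<byte-offset> (keys and paths here are hypothetical):

    lang1-utt0042-0 exp/lang1/nnet3/egs/egs.3.ark:102934
    lang2-utt0117-0 exp/lang2/nnet3/egs/egs.1.ark:51872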
# Begin configuration section.
cmd=run.pl
minibatch_size=512 # multiple of minibatch used during training.
num_jobs=10 # helps for better randomness across languages
Did you forget to update the comment beside this variable ?
# they are discarded.
if lang_len[lang_id] < args.minibatch_size:
lang_len[lang_id] = 0
print("Run out of data for language {0}".format(lang_id))
Use warnings.warn to print this statement ('Ran out of data...').
This is not really a warning!! I will reword the statement!
# check files befor writing.
if f is None:
sys.exit("Error opening file " + args.egs_dir + "/temp/" + args.prefix + "ranges." + str(job + 1))
Prefer using raise Exception.
| feats="ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- |" | ||
| ivec_feats="scp:utils/filter_scp.pl $sdata/JOB/utt2spk $ivector_dir/ivector_online.scp |" | ||
|
|
||
| if [ ! -z "$transform_dir" ]; then |
There was a problem hiding this comment.
We haven't been using fMLLR features in nnet3, even the new training scripts do not support this option. I am eliminating this option in the new get_egs.py script, so you could safely eliminate this code and simplify your script.
nj=4
cmd=run.pl
use_gpu=false
bnf_name=Tdnn_Bottleneck_renorm
As I pointed out in my previous review, it is better not to use bnf_name as an optional argument. Please consider making it compulsory.
The whole point is that this script can be used to dump the output of different layers, not only the bottleneck. That was the reason I didn't fix the name!
ivector_opts="--online-ivector-period=$ivec_period --online-ivectors='$ivec_feats'"
fi
$cmd $compute_queue_opt JOB=1:$nj $dir/log/make_bnf_$name.JOB.log \
nnet3-compute $compute_gpu_opt $ivector_opts "$modified_bnf_nnet" "$feats" ark:- \| \
I think you might want to make your bnf extractor architecture agnostic. If you are using an RNN you should support specification of extra-left-context, extra-right-context and chunk width.
node_name might be a better name than bnf_name.
…and modified some codes and binaries for multilingual setup.