
WIP: Added babel_multilang example directory for multilingual setup using babel languages. #1027

Closed

pegahgh wants to merge 21 commits into kaldi-asr:master from pegahgh:multilingual

Conversation

@pegahgh
Contributor

@pegahgh commented Sep 14, 2016

No description provided.

@danpovey
Contributor

Vijay, would you mind doing a first pass of review on this?


@vijayaditya
Contributor

OK, on it.

--Vijay


Contributor

@vijayaditya left a comment

Added comments on run_common_langs.sh.

@@ -0,0 +1,121 @@
#!/bin/bash

Contributor

Please add a comment describing this script.

train_stage=-10
generate_alignments=true # false if doing ctc training
speed_perturb=true
use_flp=false
Contributor

@pegahgh many of these variables are different from a normal nnet3 script. Please describe what they mean.

# Although the nnet will be trained on high-resolution data, we still have to perturb the normal data to get the alignments
# _sp stands for speed-perturbed
for datadir in train; do
utils/perturb_data_dir_speed.sh 0.9 data/$L/${datadir} data/$L/temp1
Contributor

Use the *perturb_3way.sh script. Please add volume perturbation, which can also be done using a script like utils/data/volume_perturb.sh. Please check for the exact names.
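
A minimal sketch of the suggested pipeline, using the script names that appear elsewhere in this thread (data-directory layout assumed; verify the exact names under utils/data/):

# 3-way speed perturbation (0.9x/1.0x/1.1x) followed by volume perturbation
for datadir in train; do
  utils/data/perturb_data_dir_speed_3way.sh data/$lang/${datadir} data/$lang/${datadir}_sp
  utils/data/perturb_data_dir_volume.sh data/$lang/${datadir}_sp || exit 1;
done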

Contributor Author

The only concern is that we would have to redo PLP+pitch extraction for the original data when using this script!

. ./utils/parse_options.sh


L=$1
Contributor

Use a better variable name; L is uninformative.

Contributor Author

Changed L to lang!

done
fi

train_set=train_sp
Contributor

When checking for .done files, print a statement telling the user what stage is being skipped, why, and what they can do to force the script to run that stage.

You could probably also check if the data/*_sp directories are done and skip the feature extraction.
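
A sketch of the suggested pattern (path and variable names assumed):

if [ -f data/$lang/${train_set}_hires/.done ]; then
  echo "$0: skipping hires feature extraction for data/$lang/${train_set}_hires as it is already done;"
  echo "$0: remove data/$lang/${train_set}_hires/.done to force this stage to rerun."
else
  # ... run the feature-extraction stage ...
  touch data/$lang/${train_set}_hires/.done
fi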


utils/data/perturb_data_dir_volume.sh $data_dir || exit 1 ;

steps/make_mfcc.sh --nj 70 --mfcc-config conf/mfcc_hires.conf \
Contributor

With the new make_mfcc.sh script you need to specify the log and ark directories.

Contributor Author

It is on the next line!


if [ $stage -le 4 ]; then
if [[ "$use_pitch" == "true" ]]; then
echo use_pitch = $use_pitch
Contributor

Echo a user-understandable message, like:
$0: Generating pitch features for <data-dir-name(s)> as use_pitch=$use_pitch

steps/make_pitch.sh --nj 70 --pitch-config $pitch_conf \
--cmd "$train_cmd" data/$L/${dataset}_pitch exp/$L/make_pitch/${dataset} $pitchdir;
fi
aux_suffix=_pitch
Contributor

aux_suffix was a bit confusing for me and I had to search for its meaning. Do you think feat_suffix or pitch_suffix might be better?

Contributor Author

Changed to feat_suffix!

fi

if [ ! -f data/$L/${dataset}_mfcc${aux_suffix}/feats.scp ]; then
steps/append_feats.sh --nj 16 --cmd "$train_cmd" data/$L/${dataset} \
Contributor

Please check if append_feats.sh can support different types of argument lists like make_mfcc.sh. If yes, drop the log-dir and ark-dir specification. If not, could you please add this feature to append_feats.sh, just for the sake of consistency across scripts?
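
For reference, steps/make_mfcc.sh accepts both argument lists; the log and ark directories are optional and default to subdirectories of the data directory (paths here are illustrative; check the usage message in your checkout):

# older style: <data-dir> <log-dir> <mfcc-dir>
steps/make_mfcc.sh --nj 70 --mfcc-config conf/mfcc_hires.conf \
  data/$lang/${train_set}_hires exp/$lang/make_hires/${train_set} mfcc_hires
# newer style: <data-dir> only; logs and archives go under the data dir
steps/make_mfcc.sh --nj 70 --mfcc-config conf/mfcc_hires.conf data/$lang/${train_set}_hires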

Contributor Author

I am not sure if Dan agrees, since lots of scripts use append_feats.sh! We can do that later!

Contributor

I agree with Vijay that it would be a good idea to extend it in the same way-- now is a good opportunity to test the changes, since you'll have to rerun your stuff to test it after other changes anyway.

Contributor

@vijayaditya left a comment

I will continue the review over the weekend, as I have to work on some paper submissions.

@@ -0,0 +1,51 @@
#!/bin/bash

Contributor

Once again, add a comment describing what this script does.

set -e
stage=1
train_stage=-10
generate_alignments=true # false if doing ctc training
Contributor

Please describe the variables.

train_set=train_sp
fi

extractor=$global_extractor
Contributor

Why is the extractor variable even required? Could you not just use global_extractor at line 45?


. ./utils/parse_options.sh

L=$1
Contributor

Once again, change the name L.

@@ -0,0 +1,241 @@
#!/bin/bash

# This is a crosslingual training setup where there are no shared phones.
Contributor

Minor grammatical mistakes in the comment. Please check.

echo "$0: creating neural net config for multilingual setups"
# create the config files for nnet initialization
$cmd $dir/log/make_config.log \
python steps/nnet3/multi/make_configs.py \
Contributor

I would suggest that you call this script steps/nnet3/multilingual/make_tdnn_configs.py, as you might want to add support for other architectures later on.

In other make_configs.py scripts there is support for specifying either the {feat,ivector}-dim and num-targets or providing the corresponding directories from which the script extracts the required values. This makes your top level scripts less verbose. I would recommend adding this support to this config generator script, at least for the sake of consistency.
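
For comparison, a sketch of the two invocation styles (flag names follow steps/nnet3/tdnn/make_configs.py; whether this script uses the same names is an assumption):

# either pass the dimensions explicitly ...
steps/nnet3/multi/make_configs.py --feat-dim 43 --ivector-dim 100 \
  --num-targets 3000 ... $dir/configs
# ... or provide directories and let the script derive the values
steps/nnet3/multi/make_configs.py --feat-dir data/$lang/train_hires \
  --ivector-dir $ivector_dir --ali-dir $ali_dir ... $dir/configs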

Contributor Author

It already supports either the num-targets option or a directory from which to extract this value.
Do you mean to provide an option to read multiple ali-dirs, one per language, and extract num-targets per language?

Contributor

@pegahgh I don't see the point of extracting feat_dim in this script if your config generator already supports providing a feat-dir. Same for ivector-dim.

# create the config files for nnet initialization
$cmd $dir/log/make_config.log \
python steps/nnet3/multi/make_configs.py \
--splice-indexes "$splice_indexes" \
Contributor

I would recommend removing variables in this script which are not used more than once. This makes for easier reading.

nnet3-init --srand=-2 $dir/configs/init.config $dir/init.raw || exit 1;
fi

. $dir/configs/vars || exit 1;
Contributor

Describe which variables are being sourced here.


. $dir/configs/vars || exit 1;

if [ $stage -le 10 ]; then
Contributor

@vijayaditya Sep 15, 2016

Are you sure you want to do all this in a top-level script in local/? I would recommend moving these steps to a script like steps/nnet3/multilingual/get_egs.sh.

echo print-interval = $print_interval
if [ $stage -le 11 ]; then
echo "$0: training mutilingual model."
steps/nnet3/multi/train_tdnn.sh --cmd "$train_cmd" \
Contributor

Is this script specific to TDNNs? I think most of our training steps are similar for all feed-forward DNNs. I recommend renaming it to steps/nnet3/multilingual/train_dnn.sh.

Contributor

Agreed, please rename.

speed_perturb=true
multidir=exp/nnet3/multi_bnf_10_close_lang_plus_grg
global_extractor=exp/multi/nnet3/extractor
lang_list=(GRG LIT MONG TUR KAZ KUR PSH SWA TOK IGBO DHO)
Contributor

Make it explicit here that this list of languages has been selected for GRG.

Contributor

As suggested before, please describe each variable which is not usually found in a typical nnet3 recipe.

. ./utils/parse_options.sh


L=$1
Contributor

Once again, a better variable name, please.

if $speed_perturb; then
suffix=_sp
fi
exp_dir=exp/$L
Contributor

This directory structure ({data,exp}/<lang-name>/) is not used in other Babel recipes. Better to make it consistent with the other scripts.

Contributor Author

This is the main structure in this multilingual egs directory!

./local/nnet3/run_tdnn_joint_babel_sp_bnf.sh --dir $multidir \
--avg-num-archives $num_archives \
--global-extractor $global_extractor \
--init-lrate $bnf_init_lrate \
Contributor

Unless you anticipate that users will tune these variables a lot, please eliminate any variable which is not used more than once.

Contributor

As suggested before, any time you skip a stage because it is already done, please let the user know what you are skipping and how they can force that stage to run.

Contributor

@vijayaditya left a comment

Completed review of a few more scripts.

if [ ! -f $data_bnf_dir/.done ]; then
mkdir -p $dump_bnf_dir
# put the archives in ${dump_bnf_dir}/.
steps/nnet3/dump_bottleneck_features.sh --use-gpu true --nj $train_nj --cmd "$train_cmd" \
Contributor

I would recommend adding support for different argument lists similar to steps/make_mfcc.sh.

touch $exp_dir/tri5b/.done
fi

if [ ! $exp_dir/tri6/.done -nt $exp_dir/tri5b/.done ]; then
Contributor

Do you actually use this GMM-HMM system anywhere? If not, you could probably skip this training.
@jtrmal any comments?

Contributor Author

I thought some people may want to use a GMM-HMM trained on top of BN features!
Any comments?

Contributor

If this is something that's not immediately needed, but you think may be needed in future, you could just put it inside if false; then .... fi.
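
i.e. something along these lines (the tri6 stage and its arguments are illustrative, not the PR's actual commands):

if false; then  # disabled for now; flip to true to train a GMM-HMM on BN features
  steps/train_sat.sh --cmd "$train_cmd" 5000 100000 \
    $data_bnf_dir/train data/$lang/lang $exp_dir/tri5b_ali $exp_dir/tri6
  touch $exp_dir/tri6/.done
fi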

@@ -0,0 +1,471 @@
#!/bin/bash
Contributor

I am not familiar with all the possible use-cases for this script. @jtrmal might be better suited to review this script.

set -e
stage=1
train_stage=-10
generate_alignments=true # false if doing ctc training
Contributor

Any references to CTC are outdated and should be removed, including in whatever script you copied this from.

@@ -0,0 +1,119 @@
#!/bin/bash
# -v3 is as -v2, but it just consists of the 10 closest fullLP langs to GRG, plus GRG.
Contributor

I think these comments may be out of date. Comments at the top of this script should say what the script does... it's not clear to me right now.

Contributor

Pegah, please update the comments on this script... I think you may have also missed my other comments.


// Returns a random integer sampled from a discrete probability distribution;
// it returns i with probability prob(i).
Contributor

Document that 'prob' must sum to one.

return -1;
}

void Nnet::RenameNode(int32 node_index, const std::string &new_node_name) {
Contributor

I think this function should be called SetNodeName for consistency with GetNodeName(), and
document that this can be used, for example, for renaming output nodes.
But you should do some more checking here.
Do you require that after this renaming, the nnet should still be valid? I would expect so.
If so, then call Check() at the end of this function. You should probably also verify that IsValidName(new_node_name).

if (!rename_node_names.empty()) {
std::vector<string> orig_names,
new_names;
//GetRenameNodeNames(rename_node_names, &orig_names, new_names);
Contributor

I think making this a function (appropriately documented!) in nnet-utils.h called
void RenameNodes(std::string &rename_node_names, Nnet *nnet);
would make this much nicer.

// or crashes if it is not there.
int32 NumOutputIndexes(const NnetExample &eg) {
for (size_t i = 0; i < eg.io.size(); i++)
if (eg.io[i].name == "output")
Contributor

You can't just change the implementation of functions without re-doing their documentation, like this!
Think about what would be the right way to do this.

Contributor

Actually this is probably OK, I see that it is in a binary, not in nnet3/.

Contributor

The way this is set up right now, the training is going to be extremely inefficient because graph compilation will happen on every single minibatch. This is because the minibatches have randomized structure.
The way I think you should solve this is to modify nnet3-merge-egs so that it batches together inputs that have the same structure (where by "the same structure" I mean the set of input and output names). It may be best to write a class to do the merging. Also the order of the NnetIo objects in the eg is not necessarily well defined, and should be ignored, so bear that in mind.

typedef kaldi::int64 int64;

const char *usage =
"Copy examples (single frames or fixed-size groups of frames) for neural\n"
Contributor

Update the Usage message.

po.Read(argc, argv);
po.Register("rename-node-names", &rename_node_names, "Comma-separated list of noed names need to be modified"
" and their new name. e.g. 'affine0/affine0-lang1,affine1/affine1-lang1'");
po.Register("add-output-nodes", &add_output_nodes, "Comma-separated list of output node name and its input"
Contributor

Comma-separated list of output node names and their corresponding input descriptors, to be added to the nnet config.

nj=4
cmd=run.pl
use_gpu=false
bnf_name=renorm4
Contributor

@vijayaditya Sep 19, 2016

Bnf_name seems to be a very critical parameter in this script. I think it is unwise to assign it by default, as we frequently vary the neural network architecture in our models.

[ -f path.sh ] && . ./path.sh # source the path.
. parse_options.sh || exit 1;

if [ $# != 5 ]; then
Contributor

@vijayaditya Sep 19, 2016

As I had suggested in other scripts, it might be better to support multiple argument lists like steps/make_mfcc.sh for the sake of consistency. By default you could drop the arkdir and logdir arguments.

Contributor

I agree that it's better to use the new style of putting these things in subdirectories of the destination directory.

*) echo "Invalid feature type $feat_type" && exit 1;
esac

if [ ! -z "$transform_dir" ]; then
Contributor

@vijayaditya Sep 19, 2016

We are not using fMLLR/lda features in the nnet3 recipes. You could simplify your script by dropping support for these features.

N0=$(cat $data/feats.scp | wc -l)
N1=$(cat $archivedir/raw_bnfeat_$name.*.scp | wc -l)
if [[ "$N0" != "$N1" ]]; then
echo "Error happens when generating BNF for $name (Original:$N0 BNF:$N1)"
Contributor

echo "$0 : An error occurred...."

# Concatenate feats.scp into bnf_data
for n in $(seq $nj); do cat $archivedir/raw_bnfeat_$name.$n.scp; done > $bnf_data/feats.scp

for f in segments spk2utt text utt2spk wav.scp char.stm glm kws reco2file_and_channel stm; do
Contributor

You forgot utt2uniq.

@@ -0,0 +1,147 @@
#!/bin/bash
Contributor

For the sake of consistency with naming convention used in Kaldi, use the name
steps/nnet3/make_bottleneck_features.sh.

if [ $stage -le 1 ]; then
echo "Making BNF scp and ark."
echo output-node name=output input=$bnf_name > output.config
modified_bnf_nnet="nnet3-copy --rename-node-names=output/output-bkp $bnf_nnet - | nnet3-init - output.config - |"
Contributor

What happens if there already exists a node with the name output-bkp in your network?

Contributor

Let's just assume there was no node named output-bkp. But about this: please see Issue #1040; eventually it will be possible to rename the node via config files using nnet3-init or nnet3-copy. I'm not sure if we should finish that issue before merging this, but anyway, proceed as if we were merging this first.



if [ $stage -le 1 ]; then
echo "Making BNF scp and ark."
Contributor

@vijayaditya Sep 19, 2016

echo "$0: Generating bottle-neck features"

@@ -0,0 +1,678 @@
#!/usr/bin/env python
Contributor

Do you plan to support chain training with the multilingual training scripts? If not, you could probably eliminate a few options and simplify the script significantly.


def GetArgs():
# we add compulsory arguments as named arguments for readability
parser = argparse.ArgumentParser(description="Writes config files and variables "
Contributor

Modify the description.

help="iVector dir, which will be used to derive the ivector-dim ", default=None)

num_target_group = parser.add_mutually_exclusive_group(required = True)
num_target_group.add_argument("--num-targets", type=int,
Contributor

This mutually exclusive group does not make sense to me. Please rewrite this script rather than modifying the existing script.

Contributor Author

Why do you think it doesn't make sense? You can provide either num-targets or num-multiple-targets!

Contributor

Why do you still need the options corresponding to uni-lingual networks (--num-targets, --ali-dir and --tree-dir)?
There are several other parts of this script which are not relevant in the context of multi-lingual networks. You could cull all these unnecessary parts to simplify this script substantially. In its current form this script is able to generate networks for uni-lingual xent training, uni-lingual chain training and multi-lingual xent training. I would strongly suggest that you keep the config generator scripts simple.


return args

def AddPerDimAffineLayer(config_lines, name, input, input_window):
Contributor

This config generator script seems to have been copied from an older TDNN config generator. The new script eliminates a lot of these options.

exit_stage=-100 # you can set this to terminate the training early. Exits before running this stage

# count space-separated fields in splice_indexes to get num-hidden-layers.
splice_indexes="-4,-3,-2,-1,0,1,2,3,4 0 -2,2 0 -4,4 0"
Contributor

Please use your current best splice_indexes as the default.

unset args[${#args[@]}-1]
num_lang=$[${#args[@]}/3]

for i in `seq 0 $[$num_lang-1]`; do
Contributor

I see a large potential for argument offset errors.

parser.add_argument("--relu-dim", type=int,
help="dimension of ReLU nonlinearities")

parser.add_argument("--self-repair-scale-nonlinearity", type=float,
Contributor

You are undoing recent changes committed by @freewym.

@@ -104,10 +104,16 @@ def GetArgs():
parser.add_argument("--relu-dim", type=int,
Contributor

@pegahgh Your code is undoing most of the recent changes. Please update your branch.

@vijayaditya
Contributor

@pegahgh I will continue review after the requested changes have been made. Please remember to update the usage messages and documentation in the C++ code.

@pegahgh
Contributor Author

pegahgh commented Sep 21, 2016

@danpovey @vijayaditya Thanks for the comments.
I have fixed almost all the issues, and I will push the changes tonight!

@danpovey changed the title "Added babel_multilang example directory for multilingual setup using babel languages." to "WIP: Added babel_multilang example directory for multilingual setup using babel languages." Sep 21, 2016
@danpovey
Contributor

As Pegah, Vijay and I discussed offline, this is going to be reworked with a slightly different design geared towards greater I/O efficiency, so I'm marking it as WIP.

. parse_options.sh || exit 1;

if [ $# -lt 1 ]; then
echo "Usage: $0 [opts] <data> <lang> <ali-dir> <exp-dir>"
Contributor

@danpovey Sep 21, 2016

The usage message needs to be accurate.


exit 1;
fi
#data=$1
Contributor

Remove these old comments.

# Set off jobs doing some diagnostics, in the background.
# Use the egs dir from the previous iteration for the diagnostics
for i in `seq 0 $[$num_lang-1]`;do
rename_io_names="output-$i/output"
Contributor

For computing the diagnostics probabilities, I don't see that it's necessary or desirable to do this renaming of outputs.
nnet3-compute-prob already, I believe, computes the output for each of the output nodes that is defined, and separately prints those stats per output layer. All that might be needed is to make sure that it doesn't die because of the 'IsSimpleNnet()' check.

if [ $x -gt 0 ]; then
rename_io_names="output-0/output"

$cmd $dir/log/progress.$x.log \
Contributor

For the progress logs, I would advise just to do it once, and to leave off the last argument (the egs). The aspects of that program that require the egs are relatively unimportant.
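
i.e. a single call without the egs argument, roughly as follows (model filenames assumed):

$cmd $dir/log/progress.$x.log \
  nnet3-show-progress --use-gpu=no $dir/$[$x-1].raw $dir/$x.raw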


use_ivector=false
initial_effective_lrate=0.01
final_effective_lrate=0.001
pnorm_input_dim=3000
Contributor

I believe this variable is unused; please check for other unused variables.


lang_id=${range[0]}
start_egs=${range[1]}
end_egs=$[$start_egs+${range[2]}]
awk -v s="$start_egs" -v e="$end_egs" 'NR >= s && NR < e' ${multi_egs_dirs[$lang_id]}/egs.scp >> $scp_file;
Contributor

This looks like a super-inefficient way to do it-- it would take time quadratic in the amount of input, not linear.
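
A linear-time sketch: if each language's ranges are sorted by start line and non-overlapping (the per-language "start count" ranges file assumed here is not the PR's actual format), each egs.scp can be read exactly once:

# extract all requested line ranges from one egs.scp in a single pass
awk 'NR==FNR { n++; s[n]=$1; e[n]=$1+$2; next }    # load "start count" ranges
     r==0 || (r<=n && FNR>=e[r]) { if (r<n) r++ }  # advance to the next range
     FNR>=s[r] && FNR<e[r]' \
  ranges_for_lang ${multi_egs_dirs[$lang_id]}/egs.scp >> $scp_file

This emits lines in scp order, so interleaving ranges across languages as the original loop does would still need a reassembly step; a streaming implementation (as suggested below) avoids that.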

# in multilingual training.

if [ $# -lt 4 ]; then
echo "$0: Usage: $0 num-langs [<egs-dir-lang-1> .. <egs-dir-lang-n>] <ranges.1> <scp-1>"
Contributor

This usage message is very unclear: it doesn't explain what the "ranges" or "scp" formats are, and things like <ranges.1> are not clear.
In any case, I think there might be a deeper design problem here, making it super slow.

if true; then
echo "$0: Generate egs for ${lang_list[$lang]}"
if [[ $(hostname -f) == *.clsp.jhu.edu ]] && [ ! -d $egs_dir/storage ]; then
utils/create_split_dir.pl \
Contributor

This create_split_dir stuff is not the kind of thing that should appear in scripts inside steps/.

. parse_options.sh || exit 1;

if [ $# -lt 4 ]; then
echo "Usage: $0 [opts] num-input-langs <data-dir-per-lang> <ali-dir-per-lang> <egs-dir-per-lang> <multilingual-egs-dir>"
Contributor

I think it would be better to just take as input the egs dirs from all the languages, assuming they already exist; and move the responsibility for dumping those egs to a further-out level.

@@ -0,0 +1,158 @@
#!/bin/bash
Contributor Author

@danpovey
I added steps/nnet3/multilingual/get_egs.sh, which generates egs.*.scp, outputs.*.scp and weights.*.scp for multilingual training using a Python script.
The details of egs.*.scp generation are described in the comments.
If you agree with the whole pipeline for generating egs, I will start on the next steps by making the following changes:

  1. Modify nnet3-copy-egs to accept output and weight
  2. Add steps/nnet3/train_raw.py and modify it to be compatible with the multilingual training setup.
  3. Modify all local/nnet3/run_* w.r.t. the new pipeline.

@@ -0,0 +1,209 @@
#!/usr/bin/env python
Contributor Author

@danpovey
This Python script reads lang2len as input; it randomly selects the next language w.r.t. the probability of remaining examples in each language, and it outputs ranges.* with format

and then the bash script get_egs.sh generates egs.*.scp w.r.t. the generated ranges.* in parallel.
The reason I didn't read the scp files in Python was that I would have to load the whole scp file for each language into memory, which is not efficient!
Should I change the process to stream the scp files in Python line by line (a batch of minibatch-size each time)?

@danpovey
Contributor

Doing it in parallel will just overwhelm the NFS server. Implement it in a
streaming way in python.
Dan


@pegahgh
Contributor Author

pegahgh commented Oct 3, 2016

@vijayaditya
I added the new multilingual setup, which manages multilingual examples in a more I/O-efficient way.

@vijayaditya
Contributor

OK, will review today.


Contributor

@vijayaditya left a comment

Started the review. Will continue later...

@@ -0,0 +1 @@
../../../babel/s5c/conf/common.fullLP No newline at end of file
Contributor

We don't encourage soft-linking files across egs, as the source recipes can change anytime.


exit 1;
fi
cmd=run.pl
Contributor

Why are you resetting cmd?

if [ $# -lt 4 ]; then
echo "Usage: $0 [opts] num-input-langs <data-dir-per-lang> <ali-dir-per-lang> <egs-dir-per-lang> <multilingual-egs-dir>"
echo " e.g.: $0 2 data/lang1/train data/lang2/train "
" exp/lang1/tri5_ali exp/lang2/tri5_ali exp/lang1/nnet3/lang1 exp/lang2/nnet3/lang2 exp/multi/egs"
Contributor

exp/lang1/nnet3/lang1: is this supposed to be exp/lang1/nnet3/egs?

stage=0
left_context=13
right_context=9
online_multi_ivector_dir= # list of iVector dir for all languages
Contributor

Give an example for this variable.

ali_dir=${multi_ali_dirs[$lang]}
egs_dir=${multi_egs_dirs[$lang]}
online_ivector_dir=
if [ ! -z "$multi_ivector_dirs" ]; then
Contributor

Variable name mismatch.

for datadir in train; do
./utils/data/perturb_data_dir_speed_3way.sh data/$lang/${datadir} data/$lang/${datadir}_sp

# Extract Plp+pitch feature for perturbed data.
Contributor

This feature extraction needs to be done only if you want to generate alignments. Or are you assuming some other top-level script needs these features?

Contributor Author

Actually, this feature extraction is inside the speed-perturb condition, where you need to regenerate alignments for the perturbed data.

utils/create_split_dir.pl /export/b0{1,2,3,4}/$USER/kaldi-data/egs/$lang-$date/s5c/$mfccdir/storage $mfccdir/storage
fi

# the 100k_nodup directory is copied seperately, as
Contributor

These comments are irrelevant.

@@ -0,0 +1,48 @@
#!/bin/bash
# This script generates iVectors using a global iVector extractor
Contributor

Usually run_ivector_common*.sh also trains the ivector extractor. Please rename this script.


mkdir -p nnet3
# perturbed data preparation
train_set=train
Contributor

Couldn't you just take these variables as input?

speed_perturb=true
multidir=exp/nnet3/multi_bnf_10_close_lang_plus_grg
global_extractor=exp/multi/nnet3/extractor
lang_list_for_grg=(GRG LIT MONG TUR KAZ KUR PSH SWA TOK IGBO DHO)
Contributor

Better to describe the language codes; not everyone is aware of them.

Contributor

Instead of inventing your own codes, I'd suggest using the IDs from Babel, which is an established practice, i.e. 404-georgian, 304-lithuanian and so on.
It does not help anything if you make the codes short; you only obfuscate things.

Contributor

@vijayaditya left a comment

Reviewed a few more files. Will continue again when I can find time.



[ ! -d $dump_bnf_dir ] && mkdir -p $dump_bnf_dir
if [ ! -f $data_bnf_dir/.done ]; then
Contributor

As I requested in the previous review, whenever you are skipping a stage because it is already done, print a statement saying that you are skipping it, and also add a comment describing how users can force it to run, i.e. by deleting the .done files.

@@ -0,0 +1,698 @@
#!/usr/bin/env python
Contributor

Do you want to add this script as part of this PR? Or do you want to assume that Vimal's PR is going to be merged before your PR?

Contributor Author

I prefer to assume Vimal's PR is merged first!

}

# The function signature of MakeConfigs is changed frequently as it is intended for local use in this script.
def MakeConfigs(config_dir, splice_indexes_string,
Contributor

@vijayaditya Oct 8, 2016

Better to move the common parts of tdnn/make_configs.py and tdnn/multi_lingual/make_configs.py (e.g. splice string parsing, CNN specification, common argument parsing) into a shared module and include it in both scripts. See how Vimal is doing this in his PR #1066.

# This script uses separate input egs directory for each language as input,
# to generate egs.*.scp files in multilingual egs directory
# where the scp line points to the original archive for each egs directory.
# $megs/egs.*.scp is randomized w.r.t language id.
Contributor

Adding snippets of these scp files would be helpful.

# Begin configuration section.
cmd=run.pl
minibatch_size=512 # multiple of minibatch used during training.
num_jobs=10 # helps for better randomness across languages
Contributor

Did you forget to update the comment beside this variable?

# they are discarded.
if lang_len[lang_id] < args.minibatch_size:
lang_len[lang_id] = 0
print("Run out of data for language {0}".format(lang_id))
Contributor

Use warnings.warn to print this statement: "Ran out of data..."

Contributor Author

This is not really a warning!! I will reword the statement!


# check files before writing.
if f is None:
sys.exit("Error opening file " + args.egs_dir + "/temp/" + args.prefix + "ranges." + str(job + 1))
Contributor

Prefer using raise Exception.

feats="ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- |"
ivec_feats="scp:utils/filter_scp.pl $sdata/JOB/utt2spk $ivector_dir/ivector_online.scp |"

if [ ! -z "$transform_dir" ]; then
Contributor

We haven't been using fMLLR features in nnet3, even the new training scripts do not support this option. I am eliminating this option in the new get_egs.py script, so you could safely eliminate this code and simplify your script.

nj=4
cmd=run.pl
use_gpu=false
bnf_name=Tdnn_Bottleneck_renorm
Contributor

As I pointed out in my previous review, it is better not to use bnf_name as an optional argument. Please consider making it compulsory.

Contributor Author

The whole point is that this script can be used to dump the output of different layers, not only the bottleneck. That was the reason I didn't fix the name!

ivector_opts="--online-ivector-period=$ivec_period --online-ivectors='$ivec_feats'"
fi
$cmd $compute_queue_opt JOB=1:$nj $dir/log/make_bnf_$name.JOB.log \
nnet3-compute $compute_gpu_opt $ivector_opts "$modified_bnf_nnet" "$feats" ark:- \| \
Contributor

I think you might want to make your BNF extractor architecture-agnostic. If you are using an RNN, you should support specification of extra-left-context, extra-right-context and chunk width.
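
e.g. something along these lines (these options exist on nnet3-compute; the variables and their defaults would be new to this script):

nnet3-compute --use-gpu=no \
  --extra-left-context=$extra_left_context \
  --extra-right-context=$extra_right_context \
  --frames-per-chunk=$frames_per_chunk \
  "$modified_bnf_nnet" "$feats" ark:-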

@danpovey
Contributor

node_name might be a better name than bnf_name.


@danpovey closed this May 24, 2017