Swedish changes #1242
Update from original
…prakbanken_swe and removed deprecated commands from run.sh
…h python 2 and 3 at the request of @jtrmal (I think they are slower now because we use more regexes). Changed the preprocessing so case is not normalised and altered the default behaviour to delete sentence-final '.' rather than convert it to a token, because it is usually not spoken aloud.
…ased systems. Changed the scoring scripts in local/ to be similar to WSJ to get better analyses and changed the local/wer* scripts to fit this recipe.
… but not particular Danish characters. Corrected an error in a previous commit that changed the OpenFst version in tools/Makefile
Update from original
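As a rough illustration of the preprocessing change described in the commits above (case is left untouched and a sentence-final '.' is deleted rather than turned into a token), here is a minimal Python sketch. The function name `strip_final_period` and the regular expression are hypothetical and are not taken from the recipe's scripts, which are not shown in this thread.

```python
import re

# Matches one sentence-final period plus any trailing whitespace.
# Hypothetical pattern for illustration; the recipe's own regexes may differ.
_FINAL_PERIOD = re.compile(r"\.\s*$")


def strip_final_period(line):
    """Delete a sentence-final '.' and leave the case of the text unchanged."""
    return _FINAL_PERIOD.sub("", line.rstrip())


if __name__ == "__main__":
    # Only the final period is removed; the abbreviation "kl." is kept.
    print(strip_final_period("Vi ses kl. 10 i Danmark."))
```

Because the pattern is anchored at the end of the line, periods inside the sentence (for example in abbreviations) are not affected.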
Thanks... let's wait until the lexicon is available at openslr before
merging it.
In general we don't like to overwrite files at openslr if they have been
there a while, but this isn't a hard-and-fast rule.
Did you plan for the new lexicon to have the same filename, and what are
the differences from the old lexicon? I'm wondering whether we should give
it a different filename.
On Fri, Dec 2, 2016 at 9:36 AM, Andreas Søeborg Kirkedal wrote:
The update in this PR makes the modifications to sprakbanken that were
requested for sprakbanken_swe, makes the Python scripts work with Python
2.7.x, simplifies the recipe, and gives better results. Because I have
changed the data preprocessing, a new lexicon needs to be uploaded to
openslr, but I cannot attach it to the PR.
Commit Summary
- Merge pull request #4 from kaldi-asr/master
- Made the same modifications to sprakbanken as @jtrmal suggested for
sprakbanken_swe and removed deprecated commands from run.sh
- Modified python scripts called by sprak_data_prep.sh so they work
with python 2 and 3 at the request of @jtrmal (I think they are slower now
because we use more regexes). Changed the preprocessing so case is not
normalised and altered the default behaviour to delete sentence-final '.'
rather than convert it to a token, because it is usually not spoken
aloud.
- Modified run.sh and tuned #leaves and #Gauss on the dev set for
GMM-based systems. Changed the scoring scripts in local/ to be similar to
WSJ to get better analyses and changed the local/wer* scripts to fit this
recipe.
- Modified the filters in local/wer_* so they remove accents and
umlauts, but not particular Danish characters. Corrected an error in a
previous commit that changed the OpenFst version in tools/Makefile
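The last bullet describes filters that strip accents and umlauts while leaving certain Danish letters alone. A minimal Python sketch of that idea follows. It is not the content of the local/wer_* filters (which are not shown in this thread); the protected set `KEEP` and the function name `strip_accents` are assumptions made purely for illustration.

```python
import unicodedata

# Letters assumed to be protected from stripping (hypothetical set).
# 'å' decomposes into 'a' plus a combining ring, so it must be protected
# explicitly; 'æ' and 'ø' have no canonical decomposition.
KEEP = set("æøåÆØÅ")


def strip_accents(text):
    """Remove combining marks (accents, umlauts) except on protected letters."""
    out = []
    # NFC first, so that 'a' + combining ring is seen as the single letter 'å'.
    for ch in unicodedata.normalize("NFC", text):
        if ch in KEEP:
            out.append(ch)
            continue
        decomposed = unicodedata.normalize("NFD", ch)
        out.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return "".join(out)


if __name__ == "__main__":
    print(strip_accents("Müller går på café"))
```

Applying the same filter to both reference and hypothesis text means that spelling variants such as "idé" and "ide" are not counted as word errors, while words distinguished by æ, ø, or å remain distinct.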
File Changes
- *M* egs/sprakbanken/s5/local/copy_dict.sh
<https://github.com/kaldi-asr/kaldi/pull/1242/files#diff-0> (6)
- *M* egs/sprakbanken/s5/local/create_datasets.sh
<https://github.com/kaldi-asr/kaldi/pull/1242/files#diff-1> (2)
- *M* egs/sprakbanken/s5/local/dict_prep.sh
<https://github.com/kaldi-asr/kaldi/pull/1242/files#diff-2> (129)
- *M* egs/sprakbanken/s5/local/norm_dk/format_text.sh
<https://github.com/kaldi-asr/kaldi/pull/1242/files#diff-3> (11)
- *A* egs/sprakbanken/s5/local/norm_dk/numbersLow.tbl
<https://github.com/kaldi-asr/kaldi/pull/1242/files#diff-4> (265)
- *M* egs/sprakbanken/s5/local/normalize_transcript.py
<https://github.com/kaldi-asr/kaldi/pull/1242/files#diff-5> (17)
- *M* egs/sprakbanken/s5/local/normalize_transcript_prefixed.py
<https://github.com/kaldi-asr/kaldi/pull/1242/files#diff-6> (30)
- *M* egs/sprakbanken/s5/local/score.sh
<https://github.com/kaldi-asr/kaldi/pull/1242/files#diff-7> (124)
- *M* egs/sprakbanken/s5/local/sprak_data_prep.sh
<https://github.com/kaldi-asr/kaldi/pull/1242/files#diff-8> (62)
- *A* egs/sprakbanken/s5/local/wer_hyp_filter
<https://github.com/kaldi-asr/kaldi/pull/1242/files#diff-9> (5)
- *A* egs/sprakbanken/s5/local/wer_output_filter
<https://github.com/kaldi-asr/kaldi/pull/1242/files#diff-10> (5)
- *A* egs/sprakbanken/s5/local/wer_ref_filter
<https://github.com/kaldi-asr/kaldi/pull/1242/files#diff-11> (5)
- *M* egs/sprakbanken/s5/local/writenumbers.py
<https://github.com/kaldi-asr/kaldi/pull/1242/files#diff-12> (1)
- *M* egs/sprakbanken/s5/run.sh
<https://github.com/kaldi-asr/kaldi/pull/1242/files#diff-13> (311)
Patch Links:
- https://github.com/kaldi-asr/kaldi/pull/1242.patch
- https://github.com/kaldi-asr/kaldi/pull/1242.diff
The words in the new lexicon are not case normalised. Otherwise, the old and new versions are the same. I had planned to simply replace the old lexicon with the new one, but if you would like to keep the old version, I can rename the new one to, e.g., lexicon-da-nonorm.tgz
Yes, please rename, and email Yenda separately with the new file.
Lexicon published -- http://www.openslr.org/8/
Andreas, let me know when the recipe is ready to check (e.g. the filename
matches the one in openslr).
Ready for review, Dan.
--
Kind regards
Andreas Søeborg Kirkedal
    dictdir=data/local/dict
    espeakdir='espeak-1.48.04-source'
    mkdir -p $dir
    mkdir -p $dictsrc $dictd ir
There seems to be a space in the middle of a word ($dictd ir).
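To see why the stray space matters, the short Python sketch below splits the command into words the way a POSIX shell would (variable expansion is deliberately not modelled). With the space, `$dictd` and `ir` become two separate arguments, so `mkdir` would never receive the intended directory variable. The string `fixed` is an assumption about the intended line, based on the `dictdir` assignment shown above; the helper name `shell_words` is hypothetical.

```python
import shlex

broken = "mkdir -p $dictsrc $dictd ir"  # line as quoted in the review
fixed = "mkdir -p $dictsrc $dictdir"    # assumed intended line


def shell_words(command):
    """Split a command line into words using POSIX shell quoting rules."""
    return shlex.split(command)


if __name__ == "__main__":
    print(shell_words(broken))
    print(shell_words(fixed))
```

In the broken form the last two words are `$dictd` and `ir`, so the shell would expand an unrelated (probably unset) variable and treat `ir` as a literal directory name.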
There is a conflict; can you please merge and resolve it?
Merging to resolve conflict
Swedish changes (kaldi-asr#1242)
The update in this PR makes the modifications to sprakbanken that were requested for sprakbanken_swe, makes the Python scripts work with Python 2.7.x, simplifies the recipe, and gives better results. Because I have changed the data preprocessing, a new lexicon needs to be uploaded to openslr, but I cannot attach it to the PR.