Quick heads up on some metadata confidence estimate work were doing #641

JRMeyer · 2021-03-08T03:16:54Z

JRMeyer
Mar 8, 2021
Maintainer

>>> utunga
[May 21, 2019, 2:14am]

Hello all,

We thought we'd mention that over at the Te Hiku /
Kōrero Māori project we're doing some work on
the decode_metadata() methods to add a bit more info to the
MetadataItem object, specifically around confidence per letter.

This work might feed into some projects around pronunciation. Another
thing we might do with it is to show some level of confidence in our
transcription UI - to give some sense of when the model is confident of
a transcription and when/where it might be worth a human reviewer
looking a little closer.

We made a branch at around 'deepspeech-0.5.0a8' and we're hoping that
we'll be able to turn it into a PR at some point in the future.

At this point the code is working for our experimentation but the PR is
not ready. We just thought it might be good to sort of mention that
we're doing this in case it overlaps with other work already going on or
coming soon.

It's really only a few changes to the decode_metadata method in deepspeech.cc

557: ModelState::decode_metadata(const vector& logits)

and to MetadataItem in deepspeech.h (where we've added three new
properties for now)

// Stores each individual character, along with its timing and confidence information
struct MetadataItem {
    char* character;
    int timestep;        // Position of the character in units of 20ms
    float start_time;    // Position of the character in seconds
    double probability;  // Logit value at the time the character was chosen
    double entropy;      // Entropy across all logits at the time the character was chosen
    char* acoustic_char; // Best guess from acoustic model at timestep of chosen letter (sometimes differs from best guess overall)
};
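
As a rough illustration of what the three new fields could hold, here is a minimal Python sketch. It assumes the acoustic model gives a probability distribution over the alphabet at each timestep and that entropy is measured with the natural log; neither is confirmed above. The function name describe_timestep, the toy alphabet and the numbers are made up for this example and are not part of DeepSpeech.

```python
import math


def describe_timestep(probs, alphabet, chosen):
    """Summarise one timestep of acoustic-model output (hypothetical helper).

    probs    -- probabilities over the alphabet (assumed to sum to 1)
    alphabet -- symbols, in the same order as probs
    chosen   -- the symbol the decoder emitted at this timestep
    """
    # Probability the acoustic model gave to the emitted symbol.
    probability = probs[alphabet.index(chosen)]
    # Shannon entropy in nats; zero-probability terms contribute nothing.
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    # The acoustic model's own top choice, which may differ from `chosen`.
    best_index = max(range(len(probs)), key=probs.__getitem__)
    acoustic_char = alphabet[best_index]
    return probability, entropy, acoustic_char


# Made-up distribution in which the decoder emitted 'ŋ' although the
# acoustic model on its own preferred 'n'.
alphabet = ["a", "n", "ŋ", " "]
probs = [0.05, 0.70, 0.20, 0.05]
probability, entropy, acoustic_char = describe_timestep(probs, alphabet, "ŋ")
print(f"prob:{probability:.6f} entropy:{entropy:.6f} acoustic_char:{acoustic_char!r}")
```

A low probability together with an acoustic_char that differs from the emitted character is the pattern to look for when the language model has overridden the acoustic model.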

Our current plan is to experiment with the above fields with some real
world data, so we can see which of these confidence measures is actually
useful, maybe tweak it a bit based on that feedback and then create a
PR. So it will be a wee ways off and of course we fully expect we may
have to adapt or minimize our changes even more based on feedback during
the PR process.

That said, we figured we might as well get the word out there that this
is something we're working on.

In case there are replies, I figure I'll just tag my collaborators at
TeHikuMedia, kmahelona and mathematiguy, on this one, as well as the
person who (I guess) added the timing metadata stuff in the first place.

PS Some example output... you'll notice that at 8.20 seconds the acoustic
model guesses 'n' but the language model corrects that in the final
transcription to ŋ (aka ng).

Target transcription
# Ka whakapā a Hine ki tētahi āhuatanga whakahirahira o te whakamahinga o te reo Māori
Actual transcription (this one has 0% WER)
# ka whakapā a hine ki tētahi āhuatanga whakahirahira o te whakamahinga o te reo māori
Raw output transcription (in our new Te Reo Māori specific orthography)
# ka ƒakapā a hine ki tētahi āhuataŋa ƒakahirahira o te ƒakamahiŋa o te reo māori
Raw transcription if we only used the acoustic model
# ka ƒakapā a hine ki tētahi āhuataŋa ƒakahirahira o te ƒakamahina o te reo māori

'char':seconds:'acoustic_char' probability entropy
'k':1.28:'k' prob:0.997742 entropy:0.025697
'a':1.48:'a' prob:0.999660 entropy:0.005033
' ':1.42:' ' prob:0.998576 entropy:0.015578
'ƒ':1.46:'ƒ' prob:0.999910 entropy:0.001490
'a':1.62:'a' prob:0.999978 entropy:0.000389
'k':1.60:'k' prob:0.999788 entropy:0.003169
'a':1.62:'a' prob:0.999978 entropy:0.000389
'p':1.84:'p' prob:0.991591 entropy:0.083503
'ā':1.86:'ā' prob:0.669923 entropy:0.946214
' ':2.22:' ' prob:0.898275 entropy:0.509989
'a':2.24:'a' prob:0.997645 entropy:0.026884
' ':2.46:' ' prob:0.636555 entropy:0.950531
'h':2.50:'h' prob:0.994121 entropy:0.061711
'i':2.52:'i' prob:0.998789 entropy:0.015438
'n':2.68:'n' prob:0.999938 entropy:0.000999
'e':2.70:'e' prob:0.998212 entropy:0.021415
' ':3.38:' ' prob:0.841081 entropy:0.633549
'k':3.08:'k' prob:0.965924 entropy:0.249148
'i':3.10:'i' prob:0.994305 entropy:0.060981
' ':3.38:' ' prob:0.841081 entropy:0.633549
't':3.66:'t' prob:0.999955 entropy:0.000785
'ē':3.44:'ē' prob:0.985221 entropy:0.119456
't':3.66:'t' prob:0.999955 entropy:0.000785
'a':3.68:'a' prob:0.999769 entropy:0.003429
'h':3.80:'h' prob:0.999525 entropy:0.005986
'i':3.82:'i' prob:0.999887 entropy:0.001707
' ':4.06:' ' prob:0.999827 entropy:0.002440
'ā':4.12:'ā' prob:0.999900 entropy:0.001572
'h':4.38:'h' prob:0.999580 entropy:0.005927
'u':4.40:'u' prob:0.999936 entropy:0.001080
'a':4.66:'a' prob:0.998985 entropy:0.012388
't':4.64:'t' prob:0.999953 entropy:0.000757
'a':4.82:'a' prob:0.999497 entropy:0.006564
'ŋ':4.80:'ŋ' prob:0.999840 entropy:0.002288
'a':4.82:'a' prob:0.999497 entropy:0.006564
' ':5.66:' ' prob:0.994251 entropy:0.053260
'ƒ':5.70:'ƒ' prob:0.994764 entropy:0.057157
'a':5.72:'a' prob:0.998290 entropy:0.020614
'k':5.86:'k' prob:0.995913 entropy:0.039462
'a':5.88:'a' prob:0.995949 entropy:0.040058
'h':6.08:'h' prob:0.996310 entropy:0.038118
'i':6.10:'i' prob:0.997903 entropy:0.025600
'r':6.20:'r' prob:0.999943 entropy:0.000914
'a':6.22:'a' prob:0.999786 entropy:0.003344
'h':6.36:'h' prob:0.998314 entropy:0.018087
'i':6.38:'i' prob:0.994112 entropy:0.059620
'r':6.52:'r' prob:0.999738 entropy:0.003533
'a':6.54:'a' prob:0.999723 entropy:0.003865
' ':6.86:' ' prob:0.999614 entropy:0.004946
'o':6.92:'o' prob:0.992238 entropy:0.078715
' ':7.06:' ' prob:0.999666 entropy:0.004346
't':7.08:'t' prob:0.999473 entropy:0.006716
'e':7.10:'e' prob:0.999567 entropy:0.005573
' ':7.30:' ' prob:0.999920 entropy:0.001255
'ƒ':7.34:'ƒ' prob:0.999175 entropy:0.010537
'a':7.36:'a' prob:0.999082 entropy:0.011338
'k':7.50:'k' prob:0.999428 entropy:0.007066
'a':7.80:'a' prob:0.999715 entropy:0.003959
'm':7.78:'m' prob:0.999868 entropy:0.002160
'a':7.80:'a' prob:0.999715 entropy:0.003959
'h':8.02:'h' prob:0.994506 entropy:0.051605
'i':8.04:'i' prob:0.997908 entropy:0.024319
'ŋ':8.20:'n' prob:0.163235 entropy:0.758166
'a':8.22:'a' prob:0.986215 entropy:0.106885
' ':8.84:' ' prob:0.999548 entropy:0.005679
'o':8.90:'o' prob:0.996885 entropy:0.037253
' ':9.08:' ' prob:0.999404 entropy:0.007245
't':9.10:'t' prob:0.999635 entropy:0.004874
'e':9.12:'e' prob:0.999580 entropy:0.005448
' ':9.20:' ' prob:0.999891 entropy:0.001612
'r':9.24:'r' prob:0.999954 entropy:0.000807
'e':9.26:'e' prob:0.999885 entropy:0.001855
'o':9.38:'o' prob:0.999943 entropy:0.000926
' ':9.60:' ' prob:0.999293 entropy:0.008496
'm':9.64:'m' prob:0.999928 entropy:0.001178
'ā':9.66:'ā' prob:0.999976 entropy:0.000439
'o':9.84:'o' prob:0.999944 entropy:0.000880
'r':9.94:'r' prob:0.998312 entropy:0.019489
'i':9.96:'i' prob:0.999996 entropy:0.000085

[This is an archived TTS discussion thread from discourse.mozilla.org/t/quick-heads-up-on-some-metadata-confidence-estimate-work-were-doing]

JRMeyer · 2021-03-08T03:16:56Z

JRMeyer
Mar 8, 2021
Maintainer Author

>>> utunga
[May 21, 2019, 2:09am]

Looking at the above, it looks like there is, maybe, just maybe, a bug
in the way ctc_beam_search_decoder works?

Specifically, notice how the space character before and after the word
'ki' appears to be repeated, so it appears like time goes backwards there
for a bit (from 2.70 to 3.38 then back to 3.08)...?

Is that a bug? From what we can tell just looking at this it seems to be
the timestep coming straight out of ctc_beam_search_decoder that does
this.

If you think it might be a bug, I can try and dig further, maybe see if
I can find more examples like this and maybe even create an issue. On
the other hand if it's expected behaviour I guess I won't do that.

Thanks

'h':2.50:'h' prob:0.994121 entropy:0.061711
'i':2.52:'i' prob:0.998789 entropy:0.015438
'n':2.68:'n' prob:0.999938 entropy:0.000999
'e':2.70:'e' prob:0.998212 entropy:0.021415
' ':3.38:' ' prob:0.841081 entropy:0.633549 *
'k':3.08:'k' prob:0.965924 entropy:0.249148
'i':3.10:'i' prob:0.994305 entropy:0.060981
' ':3.38:' ' prob:0.841081 entropy:0.633549 *
't':3.66:'t' prob:0.999955 entropy:0.000785
'ē':3.44:'ē' prob:0.985221 entropy:0.119456
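
One quick way to look for more cases like this is to scan the decoded items for any place where the start time decreases. The following is a minimal Python sketch under that assumption; the helper name backwards_steps is made up, and the data is copied from the excerpt above rather than read from the DeepSpeech API.

```python
def backwards_steps(items):
    """Return (index, previous_time, current_time) wherever time decreases."""
    return [
        (i, items[i - 1][1], items[i][1])
        for i in range(1, len(items))
        if items[i][1] < items[i - 1][1]
    ]


# (character, start time in seconds), copied from the excerpt above.
items = [
    ("h", 2.50), ("i", 2.52), ("n", 2.68), ("e", 2.70), (" ", 3.38),
    ("k", 3.08), ("i", 3.10), (" ", 3.38), ("t", 3.66), ("ē", 3.44),
]

for index, previous, current in backwards_steps(items):
    char = items[index][0]
    print(f"item {index} ({char!r}): {previous:.2f}s -> {current:.2f}s")
```

On this excerpt it flags two places: the 'k' after the first space, and the 'ē' after the 't'.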

[Archived Post]

JRMeyer · 2021-03-08T03:16:59Z

JRMeyer
Mar 8, 2021
Maintainer Author

>>> dabinat
[May 21, 2019, 4:03pm]

I created the initial implementation of the timing metadata. Letter
confidence is definitely something I wanted to add but wasn't sure how
to extract it. This looks great - thanks for working on it.

For the client there's definitely more value in word entropy than letter
entropy. What do you think is the best way of calculating this - average
entropy or peak entropy?
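
To make the average-versus-peak question concrete, here is a small Python sketch that computes both for two words from the table in the first post ('hine' and 'ƒakamahiŋa'). It is only an illustration: the helper name word_entropy_summary is made up, words are assumed to be delimited by spaces, and the space entries are used purely as separators.

```python
from statistics import mean


def word_entropy_summary(items):
    """Group (char, entropy) pairs into words split on spaces.

    Returns a list of (word, mean_entropy, peak_entropy) tuples.
    """
    summaries, chars, entropies = [], [], []
    # A trailing sentinel space flushes the last word.
    for char, entropy in items + [(" ", 0.0)]:
        if char == " ":
            if chars:
                summaries.append(("".join(chars), mean(entropies), max(entropies)))
            chars, entropies = [], []
        else:
            chars.append(char)
            entropies.append(entropy)
    return summaries


# Per-letter entropies copied from the table in the first post.
items = [
    ("h", 0.061711), ("i", 0.015438), ("n", 0.000999), ("e", 0.021415),
    (" ", 0.633549),
    ("ƒ", 0.010537), ("a", 0.011338), ("k", 0.007066), ("a", 0.003959),
    ("m", 0.002160), ("a", 0.003959), ("h", 0.051605), ("i", 0.024319),
    ("ŋ", 0.758166), ("a", 0.106885),
]

summaries = word_entropy_summary(items)
for word, avg, peak in summaries:
    print(f"{word}: mean={avg:.6f} peak={peak:.6f}")
```

On this data the peak picks out the single uncertain 'ŋ' in the second word, while the mean dilutes it across ten letters.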

[Archived Post]

JRMeyer · 2021-03-08T03:17:02Z

JRMeyer
Mar 8, 2021
Maintainer Author

>>> tippy_top
[May 22, 2019, 2:55am]

The entropy of each character is well defined (H = -sum(p log p)),
where the sum is over all the characters in the alphabet.

The entropy of a word is trickier, as we would need to sum over the
probabilities of all possible words. But one word-level measure that
could be useful would be the probability of a word, derived from the
character probabilities. We can define this recursively. If W_0 = 0,
then the probability that a word (up to letter i) is incorrect is
W_i = P_i W_{i-1} + (1 - P_i), where P_i is the probability of letter i.
Iterate this relation, up to the space, to get one minus the word
probability.
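
Unrolling the recurrence W_i = P_i W_{i-1} + (1 - P_i) with W_0 = 0 shows that it reduces to a plain product of the character probabilities, so it implicitly treats the letters as independent:

```latex
\begin{align*}
1 - W_i &= 1 - P_i W_{i-1} - (1 - P_i) = P_i \,(1 - W_{i-1}), \\
1 - W_n &= \prod_{i=1}^{n} P_i \qquad \text{since } W_0 = 0 .
\end{align*}
```

For the word 'hine' in the first post, with letter probabilities 0.994121, 0.998789, 0.999938 and 0.998212, this gives a word probability of roughly 0.991.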

[Archived Post]

JRMeyer · 2021-03-08T03:17:04Z

JRMeyer
Mar 8, 2021
Maintainer Author

>>> tippy_top
[May 22, 2019, 2:47am]

To make it clearer, here is a Gist calculating the probability of the
word 'hine' in Python, using the character probabilities above.
This may be too simplistic; it would be good to check with some real
world transcriptions.

gist.github.com

#### https://gist.github.com/edwardabraham/0a2d8643125e081ed2281d8de000bead

##### word_probability.py

# A python example of calculating the probability of the word 'hine'
# See https://discourse.mozilla.org/t/quick-heads-up-on-some-metadata-confidence-estimate-work-were-doing/40618/2

# The probability that a word is incorrect before the nth letter:
def p_not_word(n, probabilities):
    if n == 0:
        return 0
    else:
        return probabilities[n - 1] * p_not_word(n - 1, probabilities) + \
            (1 - probabilities[n - 1])

This file has been truncated.

[Archived Post]

JRMeyer · 2021-03-08T03:17:07Z

JRMeyer
Mar 8, 2021
Maintainer Author

>>> keoni
[June 25, 2019, 11:44pm]

Below are some examples demonstrating (1) the transcribed text with the
acoustic character probability, (2) the acoustic character with the
entropy, and (3) the transcribed word with the calculated probability
(see the post above).

Correct pronunciation:
audio

https://tehikumedia.s3.amazonaws.com/2019/05/22/411008_correct_versions/correct.m4a

https://tehikumedia.s3.amazonaws.com/2019/05/22/411008_correct.mov

Incorrect pronunciation:
audio

https://tehikumedia.s3.amazonaws.com/2019/05/22/674019_incorrect_versions/incorrect.m4a

46.2

https://tehikumedia.s3.amazonaws.com/2019/05/22/674019_incorrect.mov

Hey where'd those logits go?

[Archived Post]

JRMeyer · 2021-03-08T03:17:09Z

JRMeyer
Mar 8, 2021
Maintainer Author

>>> lissyx
[May 22, 2019, 7:19am]

> PS: Can we embed audio/video in discourse?

I'm not sure, I'll check

[Archived Post]

JRMeyer · 2021-03-08T03:17:12Z

JRMeyer
Mar 8, 2021
Maintainer Author

>>> hmitsch
[May 22, 2019, 8:17am]

Yes, you can embed video.

will
be able to provide more insight.

You need to paste a URL for a video on a separate line:

> bla bla
> url - https://example.com/video
> yadda yadda

Example:

Hope this helps?

Best regards,
Henrik

[Archived Post]

JRMeyer · 2021-03-08T03:17:15Z

JRMeyer
Mar 8, 2021
Maintainer Author

>>> leo
[June 3, 2019, 12:37pm]

> PS: Can we embed audio/video in discourse?

Yes, as Henrik said if you paste a url by itself on a line Discourse
will do its best to embed that media, for example with one of your audio
clips:

https://tehikumedia.s3.amazonaws.com/2019/05/22/411008_correct_versions/correct.m4a

Up until today this wasn't working properly because our CSP was blocking
external media sources, but I've fixed that now.

[Archived Post]

JRMeyer · 2021-03-08T03:17:17Z

JRMeyer
Mar 8, 2021
Maintainer Author

>>> keoni
[June 25, 2019, 11:40pm]

it seems to be working now.
The only thing is our website, https://tehiku.nz, isn't an 'allowed'
site for embedding iframes. I can't expect Mozilla to change that just
for us... but it would support 'the little guys' as opposed to just
allowing people to embed content from the big guys :sweat_smile:

[Archived Post]

JRMeyer · 2021-03-08T03:17:20Z

JRMeyer
Mar 8, 2021
Maintainer Author

>>> ena.1994
[August 9, 2019, 7:23am]

I've seen lots more examples where the timings are wrong. Any news
on the first question? Is it

maybe you know that a bit better? Didn't you say something about a shift
you would include to the timings as well? Thank you already

[Archived Post]
