The timestamp of model 'interspeech21' is incorrect #62

Open
owaski opened this issue Apr 15, 2022 · 5 comments
Comments

@owaski

owaski commented Apr 15, 2022

I ran the following command:

python -m allosaurus.run --timestamp=True -i sample.wav -m interspeech21

and it gives me

0.040 0.025 ɑ
0.080 0.025 l
0.100 0.025 ʌ
0.120 0.025 s
0.140 0.025 o
0.170 0.025 ɹ
0.180 0.025 ə
0.200 0.025 s

These timestamps are incorrect for the sample audio. It seems the window shift is set incorrectly.

@SlistInc

I am struggling with the timing as well. Is anybody aware of a library that can do forced alignment of phonemes based on the output of allosaurus? I would really appreciate any input and tips on how to improve the output from allosaurus.

@journeytosilius
Copy link

I am also looking for something like this.

@xinjli
Owner

xinjli commented Jun 12, 2022

Hi guys, sorry, I was busy with other projects and my internship over the last few months and did not have time to look at this.

I forgot to account for the subsampling factor of the conv layer. I fixed it in the latest commit.
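
For readers following along, here is a minimal sketch of why a missing subsampling factor skews the timestamps. It is not the allosaurus code: `frame_to_seconds`, the 0.01 s window shift, and the factor of 3 are all assumptions. The last two are only guessed from the outputs in this thread, where onsets fall on a 0.01 grid in the original report and on a 0.03 grid in the later one.

```python
def frame_to_seconds(frame_idx, window_shift=0.01, subsample=3):
    """Convert an index in the subsampled output sequence to a start time.

    window_shift: assumed hop between feature frames, in seconds.
    subsample:    assumed factor by which the conv layer shortens the sequence.
    """
    return round(frame_idx * window_shift * subsample, 3)


# Leaving out the subsampling factor (subsample=1) compresses every timestamp.
print(frame_to_seconds(28, subsample=1))  # 0.28
print(frame_to_seconds(28))               # 0.84
```

Under these assumptions, an output frame index of 28 maps to 0.84 s once the factor is applied, instead of 0.28 s without it.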

@kzgajos

kzgajos commented Aug 30, 2022

A very useful library -- thank you for creating it.
I also have a timing issue. The onset of each phoneme seems to be reported correctly, but the duration is always reported as 0.045, regardless of how long the phoneme actually lasts. I need to detect pauses, so accurate durations would be very helpful. Here's the output I get:

0.840 0.045 ʔ
0.870 0.045 a
0.900 0.045 l̪
0.960 0.045 t̪
0.990 0.045 ɒ
1.080 0.045 k͡p̚
1.140 0.045 a
1.260 0.045 t̪
1.320 0.045 ɒ
1.380 0.045 t̪
1.440 0.045 ɒ
1.470 0.045 k

@emorling

emorling commented Jul 3, 2024

I assumed it was because it's returning the most likely phoneme at each 0.045 interval?
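
If the duration column really is a constant, one possible workaround is to look at the spacing between consecutive onsets instead. The sketch below parses the `start duration phone` lines shown above and flags large gaps. The helper names and the 0.1 threshold (`min_gap`) are hypothetical, and the gap to the next onset is only an upper bound on a phone's length, since it cannot tell a long phone from a phone followed by silence.

```python
def parse_output(text):
    """Parse 'start duration phone' lines into (start, duration, phone) tuples."""
    rows = []
    for line in text.strip().splitlines():
        start, duration, phone = line.split(maxsplit=2)
        rows.append((float(start), float(duration), phone))
    return rows


def onset_gaps(rows):
    """Pair each phone with the time until the next onset (None for the last)."""
    gaps = []
    for cur, nxt in zip(rows, rows[1:]):
        gaps.append((cur[2], cur[0], round(nxt[0] - cur[0], 3)))
    if rows:
        gaps.append((rows[-1][2], rows[-1][0], None))
    return gaps


def find_pauses(rows, min_gap=0.1):
    """Return phones whose gap to the next onset is at least min_gap."""
    return [
        (phone, start, gap)
        for phone, start, gap in onset_gaps(rows)
        if gap is not None and gap >= min_gap
    ]


sample = """
0.990 0.045 ɒ
1.080 0.045 k͡p̚
1.140 0.045 a
1.260 0.045 t̪
"""
print(find_pauses(parse_output(sample)))  # [('a', 1.14, 0.12)]
```

The threshold would need tuning per recording, and the last phone has no following onset, so its gap is left as `None`.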
