Speech to text translation utilizing 3-way data #1099
Conversation
@@ -0,0 +1,56 @@
from typing import Optional, Sequence, Union
Perhaps this file was mistakenly added.
Done
lhotse/dataset/speech_translation.py
Outdated
[
    {
        "text": supervision.text,
        "tgt_text": supervision.custom["tgt_text"],
How would the dataset be prepared if I want to have multiple target translations in different languages?
That's an excellent question, and I hadn't considered that before. I believe a possible approach would be to concatenate the target languages together, including their language tags, and store that concatenated text in tgt_text. If you have a multilingual BPE (Byte Pair Encoding) model, you can tokenize tgt_text to include all languages and use it for training. If you have any better suggestions, please let me know. Additionally, I want to mention that I already have an Icefall recipe for a trained Zipformer speech translation model, and the results have been very promising. I plan to push that in the upcoming days.
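The concatenation approach described above could be sketched as follows. This is a hypothetical illustration, not code from the PR: the tag format (`<en>`, `<fr>`, ...) and the helper name `concat_targets` are assumptions, and a plain dict stands in for the per-language translations.

```python
# Hypothetical sketch: concatenate multiple target translations into a single
# tgt_text string, with a language tag marking where each translation starts.
# The "<lang>" tag convention is an assumption, not part of lhotse.
def concat_targets(translations: dict) -> str:
    """Join per-language translations into one tagged string."""
    return " ".join(f"<{lang}> {text}" for lang, text in translations.items())


tgt_text = concat_targets({"en": "hello world", "fr": "bonjour le monde"})
# tgt_text is "<en> hello world <fr> bonjour le monde"
```

A multilingual BPE model could then tokenize this single string, with the tags telling the model which language follows.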
How about a nested field in `supervision.custom`:

{
    "translated_text": {
        "en": text_en,
        "fr": text_fr,
        ...
    }
}
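A minimal sketch of how such a nested field would be consumed, assuming the layout above. A plain dict stands in for a lhotse `SupervisionSegment` (whose `custom` field holds arbitrary JSON-serializable data), so the snippet is self-contained; the helper name is made up.

```python
# Illustrative only: a plain dict stands in for SupervisionSegment.custom.
supervision_custom = {
    "translated_text": {
        "en": "hello world",
        "fr": "bonjour le monde",
    }
}


def get_translation(custom: dict, lang: str) -> str:
    """Look up one target language from the nested custom field."""
    return custom["translated_text"][lang]
```

With this layout, adding a new target language is just adding a key, and a dataset class can select whichever languages a given training task needs.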
Perhaps the `tgt_text` and `tgt_lang` should be tuples or lists instead, where each item in the tuple is one language. But this is your choice. Users can also choose to extend this class for multi-task ST.
Edit: +1 for Piotr's suggestion
lhotse/dataset/speech_translation.py
Outdated
return batch


def validate_for_asr(cuts: CutSet) -> None:
You can import this function from `speech_recognition.py` if it is unchanged.
Done
lhotse/recipes/iwslt22_ta.py
Outdated
# limitations under the License.

"""
IWSLT Tunisian is a 3-way parallel data includes 160 hours and 200k lines worth of aligned Audio,
Could you add more details about the dataset, including citation for the original paper (and links)?
Done
@AmirHussein96 please resolve conflicts and fix the tests (also remove WIP when you think it's ready for another review).

@desh2608 ready for another review.
Some minor suggestions.
lhotse/dataset/speech_translation.py
Outdated
from lhotse.workarounds import Hdf5MemoryIssueFix


class K2Speech2textTranslationDataset(torch.utils.data.Dataset):
`K2Speech2textTranslationDataset` -> `K2Speech2TextTranslationDataset`
Done
lhotse/dataset/speech_translation.py
Outdated
input_strategy: BatchIO = PrecomputedFeatures(),
):
    """
    k2 ASR IterableDataset constructor.
K2Speech2TextTranslationDataset constructor.
done
lhotse/recipes/iwslt22_ta.py
Outdated
""" | ||
logging.info( | ||
""" | ||
To obtaining this data your institution needs to have an LDC subscription. |
To *obtain
Done
lhotse/recipes/iwslt22_ta.py
Outdated
logging.info(
    """
    To obtaining this data your institution needs to have an LDC subscription.
    You also should download the predined splits with
*pre-defined
Done
lhotse/recipes/iwslt22_ta.py
Outdated
corpus_dir: Pathlike,
splits: Pathlike,
output_dir: Optional[Pathlike] = None,
clean: bool = False,
Ideally keep the option name here the same as the one in the CLI (i.e. `normalize_text`). Also, "clean" has several connotations other than normalization, e.g., it can refer to resegmentation, data filtering, etc.
Done
lhotse/recipes/iwslt22_ta.py
Outdated
# UO/ - uncertain + foreign


arabic_filter = re.compile(r"[OUM]+/*|\u061F|\?|\!|\.")
(Tagging @pzelasko)
Putting the regex compilation in the global scope means that it would be run whenever users call `import lhotse`. Even if you are using the compiled regex several times, there is no real benefit in defining it globally since Python internally caches it anyway, so you might just compile it in the function where you are using it.
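The suggestion above could look like the following sketch, which moves the compilation into the function that uses it. The function name `strip_markers` is made up for illustration; the pattern is the one quoted from the PR. Python's `re` module keeps an internal cache of compiled patterns, so repeated calls do not recompile.

```python
import re


def strip_markers(text: str) -> str:
    """Remove annotation markers and punctuation from a transcript line."""
    # Compiling here rather than at module scope is fine: the re module
    # caches compiled patterns internally, so repeated calls are cheap,
    # and nothing runs at `import lhotse` time.
    arabic_filter = re.compile(r"[OUM]+/*|\u061F|\?|\!|\.")
    return arabic_filter.sub("", text)
```

For example, `strip_markers("O/ hello?")` drops both the `O/` marker and the trailing question mark.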
+1, cool info about regex caching, I wasn't aware of that (SO reference https://stackoverflow.com/questions/12514157/how-does-pythons-regex-pattern-caching-work)
Thanks for the catch. I fixed it.
Just one minor change.
docs/corpus.rst
Outdated
@@ -185,6 +185,8 @@ a CLI tool that create the manifests given a corpus directory.
  - :func:`lhotse.recipes.prepare_mgb2`
* - XBMU-AMDO31
  - :func:`lhotse.recipes.xbmu_amdo31`
* - IWSLT22_Ta
This list is in alphabetical order.
I fixed it.
LGTM!
This is a pull request for a 3-way Tunisian Arabic to English speech-to-text translation recipe from the IWSLT22 dialect shared task (https://iwslt.org/2022/dialect).
It also serves as an introduction to preparing 3-way data (speech, transcription, and translation) for use in model training, e.g., in a multitask learning scenario.
The idea is to add the target-language translation to the supervision's custom field as additional information, since both the source transcription and the target translation correspond to the same recording. See the example below:
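A sketch of the proposed layout follows. A plain dict stands in for the actual lhotse `SupervisionSegment` so the example is self-contained; the field names `tgt_text` and `tgt_lang` follow the PR, while the id, times, and text values are made up.

```python
# Hypothetical supervision manifest entry: the target-language translation
# rides along in the custom field, next to the source-language transcript.
supervision = {
    "id": "iwslt22_ta_seg0001",        # made-up segment id
    "recording_id": "iwslt22_ta_rec1",  # made-up recording id
    "start": 0.0,
    "duration": 4.2,
    "language": "Tunisian Arabic",
    "text": "<source-language transcription>",
    "custom": {
        "tgt_text": "<English translation>",
        "tgt_lang": "eng",
    },
}
```

Because both fields hang off the same supervision, a dataset class can emit the transcript for the ASR task and `custom["tgt_text"]` for the translation task from a single pass over the cuts.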