Speech to text translation utilizing 3-way data #1099

Merged 21 commits into lhotse-speech:master on Aug 17, 2023

AmirHussein96
Contributor

This is a pull request for the 3-way Tunisian Arabic to English speech-to-text recipe from the IWSLT 2022 shared task: https://iwslt.org/2022/dialect.
It also introduces how to prepare 3-way data (speech, transcription, and translation) for use in model training, e.g., in a multitask learning scenario.

The idea is to add the target-language translation to the custom field as additional information on the supervision, since both the source and target text correspond to the same recording. See the example below:

cut[10]
MonoCut(id='994203_ta_eng_20170614_214611_15072_B_010266', start=102.662, duration=2.674, channel=0, supervisions=[SupervisionSegment(id='994203_ta_eng_20170614_214611_15072_B_010266', recording_id='20170614_214611_15072_B', start=0.0, duration=2.674, channel=0, text='اه وزيد يظهرلي الببا ما يخليهاش', language='ta', speaker='994203', gender=None, custom={'tgt_lang': 'eng', 'tgt_text': "ah in addition apparently my dad won't let her"}, alignment=None)], features=None, recording=Recording(id='20170614_214611_15072_B', sources=[AudioSource(type='file', channels=[0], source='/export/common/data/corpora/LDC/LDC2022E01/data/audio/ta/20170614_214611_15072_B.sph')], sampling_rate=8000, num_samples=5120320, duration=640.04, channel_ids=[0], transforms=None), custom=None)
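
As a rough sketch of the idea (not the recipe's actual code), the snippet below attaches a translation to a supervision through a custom dictionary. `Supervision` is a hypothetical stdlib stand-in for lhotse's SupervisionSegment so the example runs without lhotse installed, and `attach_translation` is an illustrative helper; only the key names tgt_lang and tgt_text come from the cut shown above.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class Supervision:
    """Hypothetical stand-in for lhotse's SupervisionSegment (illustration only)."""

    id: str
    recording_id: str
    start: float
    duration: float
    text: str
    language: str
    custom: Optional[Dict[str, Any]] = None


def attach_translation(sup: Supervision, tgt_lang: str, tgt_text: str) -> Supervision:
    """Store the translation next to the transcript, inside the custom field."""
    custom = dict(sup.custom or {})  # copy so an existing dict is not mutated
    custom["tgt_lang"] = tgt_lang
    custom["tgt_text"] = tgt_text
    sup.custom = custom
    return sup


sup = Supervision(
    id="994203_ta_eng_20170614_214611_15072_B_010266",
    recording_id="20170614_214611_15072_B",
    start=0.0,
    duration=2.674,
    text="اه وزيد يظهرلي الببا ما يخليهاش",
    language="ta",
)
sup = attach_translation(sup, "eng", "ah in addition apparently my dad won't let her")
```

Because the transcript and the translation live on the same supervision, both remain tied to the same recording segment without a second manifest.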

@@ -0,0 +1,56 @@
from typing import Optional, Sequence, Union
Collaborator

Perhaps this file was mistakenly added.

Contributor Author

Done

lhotse/bin/modes/recipes/iwslt22_ta.py
[
{
"text": supervision.text,
"tgt_text": supervision.custom["tgt_text"],
Collaborator

How would the dataset be prepared if I want to have multiple target translations in different languages?

Contributor Author

That's an excellent question, and I hadn't considered that before. I believe a possible approach would be to concatenate the target languages together, including their language tags, and store that concatenated text in tgt_text. If you have a multilingual BPE (Byte Pair Encoding) model, you can tokenize tgt_text to include all languages and use it for training. If you have any better suggestions, please let me know. Additionally, I want to mention that I already have an Icefall recipe for a trained Zipformer speech translation model, and the results have been very promising. I plan to push that in the upcoming days.
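
A minimal sketch of the concatenation approach described in this comment, assuming a "<lang>" tag format that is invented here for illustration (the comment does not specify one); `concat_translations` is a hypothetical helper, not part of lhotse:

```python
from typing import Dict


def concat_translations(translations: Dict[str, str]) -> str:
    """Join translations into one string, each prefixed by a language tag.

    The "<lang>" tag format is an assumption made for this example.
    """
    return " ".join(f"<{lang}> {text}" for lang, text in translations.items())


# The resulting string would be stored in tgt_text and tokenized
# with a multilingual BPE model downstream.
tgt_text = concat_translations({"en": "good morning", "fr": "bonjour"})
```

One drawback of this layout is that recovering a single language later requires parsing the tags back out of the string.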

Collaborator

How about a nested field in supervision.custom:

{
  "translated_text": {
    "en": text_en,
    "fr": text_fr, 
    ...
  }
}

Collaborator (@desh2608, Jul 17, 2023)

Perhaps the tgt_text and tgt_lang should be tuples or lists instead, where each item in the tuple is one language. But this is your choice. Users can also choose to extend this class for multi-task ST.

Edit: +1 for Piotr's suggestion

Contributor Author

Done, thank you @pzelasko @desh2608 for the nice suggestions. I followed @pzelasko's suggestion.
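
For illustration, the nested translated_text layout suggested in this thread could be read per language roughly as follows. This is a sketch under assumptions: `collate_translations` is a hypothetical helper, plain dicts stand in for supervision.custom, and returning an empty string for a missing language is a choice made here, not necessarily what the merged code does.

```python
from typing import Any, Dict, List


def collate_translations(customs: List[Dict[str, Any]], lang: str) -> List[str]:
    """Collect one language's translation from each supervision's custom dict.

    Supervisions that lack the requested language yield an empty string.
    """
    return [c.get("translated_text", {}).get(lang, "") for c in customs]


# Plain dicts standing in for supervision.custom on two supervisions.
customs = [
    {"translated_text": {"en": "good morning", "fr": "bonjour"}},
    {"translated_text": {"en": "thank you"}},
]
en_batch = collate_translations(customs, "en")
fr_batch = collate_translations(customs, "fr")
```

Keeping one entry per language avoids parsing tags out of a concatenated string and lets each task pick only the targets it needs.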

return batch


def validate_for_asr(cuts: CutSet) -> None:
Collaborator

You can import this function from speech_recognition.py if it is unchanged.

Contributor Author

Done

lhotse/recipes/iwslt22_ta.py
# limitations under the License.

"""
IWSLT Tunisian is a 3-way parallel data includes 160 hours and 200k lines worth of aligned Audio,
Collaborator

Could you add more details about the dataset, including citation for the original paper (and links)?

Contributor Author

Done

lhotse/recipes/iwslt22_ta.py (outdated)
@desh2608
Collaborator

@AmirHussein96 please resolve conflicts and fix the tests (also remove WIP when you think it's ready for another review).

@AmirHussein96 changed the title from "[WIP]: Speech to text translation utilizing 3-way data" to "Speech to text translation utilizing 3-way data" on Jul 31, 2023
@AmirHussein96
Contributor Author

@AmirHussein96 please resolve conflicts and fix the tests (also remove WIP when you think it's ready for another review).

@desh2608 ready for another review.

@desh2608 desh2608 self-requested a review July 31, 2023 13:14
Collaborator

@desh2608 left a comment

Some minor suggestions.

from lhotse.workarounds import Hdf5MemoryIssueFix


class K2Speech2textTranslationDataset(torch.utils.data.Dataset):
Collaborator

K2Speech2textTranslationDataset -> K2Speech2TextTranslationDataset

Contributor Author

Done

input_strategy: BatchIO = PrecomputedFeatures(),
):
"""
k2 ASR IterableDataset constructor.
Collaborator

K2Speech2TextTranslationDataset constructor.

Contributor Author

done

"""
logging.info(
"""
To obtaining this data your institution needs to have an LDC subscription.
Collaborator

To *obtain

Contributor Author

Done

logging.info(
"""
To obtaining this data your institution needs to have an LDC subscription.
You also should download the predined splits with
Collaborator

*pre-defined

Contributor Author

Done

corpus_dir: Pathlike,
splits: Pathlike,
output_dir: Optional[Pathlike] = None,
clean: bool = False,
Collaborator

Ideally, keep the option name here the same as the one in the CLI (i.e., normalize_text). Also, "clean" has several connotations besides normalization; e.g., it can refer to resegmentation, data filtering, etc.

Contributor Author

Done

# UO/ - uncertain + foreign


arabic_filter = re.compile(r"[OUM]+/*|\u061F|\?|\!|\.")
Collaborator

(Tagging @pzelasko)

Putting the regex compilation in the global scope means that it runs whenever users call import lhotse. Even if you use the compiled regex several times, there is no real benefit to defining it globally, since Python caches compiled patterns internally; you can just compile it in the function where you use it.

Collaborator

+1, cool info about regex caching, I wasn't aware of that (SO reference https://stackoverflow.com/questions/12514157/how-does-pythons-regex-pattern-caching-work)

Contributor Author

Thanks for the catch. I fixed it.
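
A sketch of the suggestion in this thread, assuming a hypothetical helper name `strip_annotations` (the actual function in the recipe may differ). The pattern is the one quoted in the diff above; it is compiled inside the function, so nothing runs at import time, and the re module's internal cache of compiled patterns keeps repeated calls cheap.

```python
import re


def strip_annotations(text: str) -> str:
    """Remove annotation markers and punctuation using the pattern from the diff."""
    # Compiled at call time rather than at module import; the re module
    # keeps an internal cache of recently compiled patterns.
    arabic_filter = re.compile(r"[OUM]+/*|\u061F|\?|\!|\.")
    return arabic_filter.sub("", text).strip()


cleaned = strip_annotations("M/ hello. world?")
```

Note that the character class also matches any uppercase O, U, or M in Latin-script text, so this pattern is only appropriate for transcripts where those letters appear solely as annotation markers.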

Collaborator

@desh2608 left a comment

Just one minor change.

docs/corpus.rst (outdated)
@@ -185,6 +185,8 @@ a CLI tool that create the manifests given a corpus directory.
- :func:`lhotse.recipes.prepare_mgb2`
* - XBMU-AMDO31
- :func:`lhotse.recipes.xbmu_amdo31`
* - IWSLT22_Ta
Collaborator

This list is in alphabetic order.

Contributor Author

I fixed it.

Collaborator

@desh2608 left a comment

LGTM!

@desh2608 desh2608 enabled auto-merge (squash) August 17, 2023 22:06
@desh2608 desh2608 merged commit c80fc07 into lhotse-speech:master Aug 17, 2023
3 participants