Initial pipeline code #19

Open · wants to merge 23 commits into base: main
Conversation

boykovdn (Collaborator)

* Needs the translation and transcription models to be useful
* Wrappers for the Hugging Face T5 and Ollama models
* Might require some dependencies that are not installed in the project by default
* Alternatively we would have to get Ollama set up in the CI VM, so just removing for now; this does not test fundamental functionality
* For the T5 tokenizer (Google)
* Used downstream to match a text file to the correct audio file
* A wrapper around Whisper, which is prompted for transcription
* The translation and transcription outputs are still to be tested
* Translation model source and target languages are now set in the constructor
* Pipeline updated to reflect the above change
* CI Python env set to 3.11, to match what is pinned in pyproject.toml
* Added pydub dependency
* Trying to see whether it will still pass tests and CI
@jack89roberts (Collaborator)

Would you like someone to review this @boykovdn or is it still being worked on?

@boykovdn (Collaborator, Author) commented Feb 6, 2025

Better to review later - data handling might change

The new way of handling CallHome is to download the datasets (Talkbank and the English extension), set the right paths in the pipe.sh script, then run it to preprocess the dataset. You can then use the CallhomePipeline to load a split as needed; it loads the entire text dataset into memory (it's small enough). From there you can use it for downstream processing: it provides Spanish/English pairs of text, alongside audio and translations if needed.
* Was updated in the source a few commits ago, but I forgot to update the test.
* Should pass the CI now.
@jack89roberts (Collaborator) left a comment

Hey @boykovdn , I had a very quick look through and left some comments.

parser = cls()
with open(file_path) as file:
data = file.read()
parser.parse_transcription(data, file_prefix=file_path[-8:-4])
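As an editorial aside on the snippet above: the `file_path[-8:-4]` slice assumes a four-character file ID sitting immediately before a four-character suffix such as `.cha`. A more self-documenting alternative (a sketch, not part of the PR) could use pathlib:

```python
from pathlib import Path

def file_prefix_from_path(file_path: str) -> str:
    # Path.stem drops the directory part and the final suffix,
    # so "data/callhome/0638.cha" -> "0638" regardless of path depth,
    # where file_path[-8:-4] relies on fixed ID and suffix lengths.
    return Path(file_path).stem

prefix = file_prefix_from_path("data/callhome/0638.cha")  # "0638"
```

The path used here is a made-up example; the same call would replace the slice in `parse_transcription(data, file_prefix=...)`.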
Collaborator

Suggested change
parser.parse_transcription(data, file_prefix=file_path[-8:-4])

pyproject.toml Outdated
@@ -38,9 +38,13 @@ dependencies = [
     "sacrebleu>=2.4.3",
     "seaborn>=0.13.2",
     "sonar-space>=0.2.0",
-    "torch==2.0.1",
+    "torch",
Collaborator

Is there a range we can give for all dependencies, to give some indication of what we ran the code with at least?

Collaborator (Author)

Sure, locally it runs with the latest torch version, but on the cluster it works with 2.2.2. I am happy to set that as the minimum.
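Taking 2.2.2 as the minimum, as discussed, the dependency entry could look like this (a sketch of the pyproject.toml hunk above; only the torch line reflects the thread, the other entries are copied from the diff as context):

```toml
dependencies = [
    "sacrebleu>=2.4.3",
    "seaborn>=0.13.2",
    "sonar-space>=0.2.0",
    "torch>=2.2.2",
]
```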

@jack89roberts (Collaborator) commented Feb 18, 2025

Would be helpful to have a readme somewhere that explains where we got all the data (free and paid labels) + a brief summary of the necessary pre-processing steps and what they're doing etc.

Collaborator (Author)

Will update the README and add another README under scripts/callhome, explaining how to do the pre-processing.

Comment on lines +13 to +15
if len(sys.argv) != 2:
sys.stderr.write(f"Usage: {sys.argv[0]} input_file\n")
sys.exit(1)
Collaborator

would be neater as argparse
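For reference, the manual `sys.argv` check could be replaced with something like the following sketch, which also gives a generated usage/--help message and an automatic error on a missing argument (the help text and example filename are placeholders):

```python
import argparse

def parse_args(argv=None):
    # Equivalent to the manual len(sys.argv) check: argparse exits
    # with a usage message itself if input_file is missing.
    parser = argparse.ArgumentParser()
    parser.add_argument("input_file", help="path to the input file")
    return parser.parse_args(argv)

args = parse_args(["transcript.txt"])
```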

Comment on lines +43 to +50
# Print previous line if no merge occurred.
print(prev_line)

prev_line = line

if prev_line is not None:
print(prev_line)

Collaborator

are the print statements needed/could they be marked as debug/hidden via a logger?

@boykovdn (Collaborator, Author) commented Feb 25, 2025

Align lines is used to process a stream of text and write to stdout, so these are not debug lines. It's part of the pipe.sh script, for which I'll add a brief README. The reason it's used like this is that part of the pre-processing script from the Callhome dataset is Perl scripts, which I found easiest to compose via unix pipes.
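The line-merging filter pattern described here can be sketched as a plain function, keeping the stdout printing at the edge so it stays composable in a pipe (the merge criterion below, "continuation lines start with whitespace", is hypothetical and only for illustration):

```python
def merge_continuations(lines):
    # Join a line onto the previous one when it starts with whitespace;
    # otherwise emit the previous line, mirroring the
    # "print previous line if no merge occurred" pattern above.
    merged = []
    prev = None
    for line in lines:
        line = line.rstrip("\n")
        if prev is not None and line.startswith((" ", "\t")):
            prev = prev + " " + line.strip()
            continue
        if prev is not None:
            merged.append(prev)
        prev = line
    if prev is not None:
        merged.append(prev)
    return merged
```

In the pipe.sh setting the caller would iterate `merge_continuations(sys.stdin)` and `print` each result, so the script composes with the Perl steps via unix pipes.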

Comment on lines 36 to 40
assert target_lang_iso in self.supported_languages, f"This model \
only supports {self.supported_languages}, but got target {target_lang_iso}."
assert source_lang_iso in self.supported_languages, f"This model \
only supports {self.supported_languages}, but got source {source_lang_iso}."

Collaborator

prefer raising errors over asserts

model_input_tokens = self.tokenizer(
model_input_text, return_tensors="pt"
).input_ids
model_output_tokens = self.model.generate(model_input_tokens)
Collaborator

does this need args? E.g. I think it limits how many tokens it will generate based on max_new_tokens, which you might want to vary based on the length of the input text or something.

Collaborator (Author)

Hmm, good point. It actually does not matter in the end, because it turns out T5 does not support Spanish, so we will not be using it. I will update the max tokens to 210 as a reasonable upper bound (max sentence length 70 words, at about 3 tokens per word).
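The arithmetic behind that bound, as a tiny helper (the function name and defaults are illustrative, not from the PR):

```python
def max_new_tokens_budget(max_sentence_words: int = 70, tokens_per_word: int = 3) -> int:
    # Rough generation budget from the thread: ~70-word sentences
    # at roughly 3 subword tokens per word.
    return max_sentence_words * tokens_per_word

budget = max_new_tokens_budget()  # 210
```

The budget would then be passed to generation, e.g. `self.model.generate(model_input_tokens, max_new_tokens=budget)`; `max_new_tokens` is a standard transformers generation argument.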

@boykovdn (Collaborator, Author)

I've added most of the changes commented on; let me know what you think. I've still got to add NLLB, which I will use in my experiments, and change the Ollama testing, but the latter I think is not a priority at the moment.

* The update to preprocessing adds cleaning of the A/B speaker notation.
* Adds pytest skip, because we don't want to run an LLM in the CI.
@@ -23,9 +23,9 @@ python -m pip install .

## CallHome Dataset

-Go [here](https://ca.talkbank.org/access/CallHome), select the conversation language, create an account, then you can download the "media folder". There you can find the .cha files, which contain the transcriptions.
+Go [here](https://ca.talkbank.org/access/CallHome), select the conversation language, create an account, then you can download the "media folder". There you can find the .cha files, which contain the transcriptions. You will need to pre-process this data in combination with the Callhome translations dataset, which includes part of the pre-processing scripts. The README under ./scripts/callhome of this repo contains more information.
@jack89roberts (Collaborator) commented Feb 27, 2025

There isn't a README in scripts/callhome currently, like the last sentence says there should be.

Collaborator (Author)

Ah yes, sorry - adding it to my next commit

r"""
Loads a DataFrame of Spanish utterances alongside their English translations.

This function expects that the dataset has been pre-processed, so that the
lines in the mapping file match to the correct lines in the transcription files.

Args:
cha_root_dir (str): Path to folder containing the .cha transcript files.
Collaborator

Minor but generally if you've type-hinted the args in the function def then it's not necessary to repeat the type in the docstring (and makes things easier to maintain).

if target_lang_iso not in self.supported_languages:
err_msg = f"This model only supports {self.supported_languages}, \
but got target {target_lang_iso}."
raise Exception(err_msg)
@jack89roberts (Collaborator) commented Feb 27, 2025

ValueError would be a bit nicer to raise than a base Exception (also in a few other places).
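Swapping in ValueError on the snippet above would look like this sketch (the function wrapper is only for illustration; in the PR this check lives in the model constructor):

```python
def check_language(target_lang_iso: str, supported_languages: set[str]) -> None:
    # Raise the more specific ValueError so callers can catch it
    # without also swallowing unrelated failures, as a bare
    # `except Exception` would.
    if target_lang_iso not in supported_languages:
        err_msg = (
            f"This model only supports {supported_languages}, "
            f"but got target {target_lang_iso}."
        )
        raise ValueError(err_msg)
```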

def test_t5():
t5 = T5TranslateModel("eng", "fra")
translated_text_t5 = t5(eng_text)
print(translated_text_t5)
Collaborator

some kind of assert instead of print would be nicer
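One way to do that is to assert on properties of the output rather than printing it. The sketch below uses a stand-in for the `t5(eng_text)` call so the pattern is self-contained (`translate_stub` and its return value are made up; the real test would call the model):

```python
def translate_stub(text: str) -> str:
    # Stand-in for t5(eng_text); the real call returns the translated string.
    return "Bonjour le monde"

def test_translation_output():
    translated = translate_stub("Hello world")
    # Assert on properties of the output instead of printing it,
    # so CI fails loudly when the model returns nothing usable.
    assert isinstance(translated, str)
    assert len(translated.strip()) > 0

test_translation_output()
```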

return mname[3:] + ".cha"


def process_line_ids(maybe_lines: str):
@jack89roberts (Collaborator) commented Feb 27, 2025

There are quite a few functions without a specified return type, e.g. for this one

Suggested change
def process_line_ids(maybe_lines: str):
def process_line_ids(maybe_lines: str) -> list[int]:

(I think)

@jack89roberts (Collaborator) commented Feb 27, 2025

I added a few new comments @boykovdn but the only significant one is I think you've maybe forgotten to push the readme in scripts/callhome (or didn't make one yet). The others are all minor so it's fine not to address them and focus on getting results etc. first.

all_metric_evals = {}
for metric_class in metrics:
metric = metric_class()
if metric_class.__name__ in ["COMETRefScore", "COMETQEScore"]:
Collaborator (Author)

@klh5 This is what I use to calculate the metrics
