Initial pipeline code #19

Open · wants to merge 23 commits into base: main
Conversation

boykovdn (Collaborator)

* Needs the translation and transcription models to be useful
* Wrappers for the Hugging Face T5 and Ollama models
* Might require some dependencies that are not installed in the project by default
* Alternatively we would have to get Ollama set up in the CI VM, so just removing for now; this does not test fundamental functionality
* For the T5 tokenizer (Google)
* Used downstream to match a text file to the correct audio file
* A wrapper around Whisper, which is prompted for transcription
* The translation and transcription outputs are still to be tested
* Translation model source and target languages are now set in the constructor
* Pipeline updated to reflect the above change
* CI Python env set to 3.11, to match what is pinned in pyproject.toml
* Added pydub dependency
* Trying to see whether it will still pass tests and CI
@jack89roberts (Collaborator)

Would you like someone to review this @boykovdn or is it still being worked on?

@boykovdn (Collaborator, Author) commented Feb 6, 2025

Better to review later - data handling might change

The new way of handling CallHome is to download the datasets (Talkbank and the English extension), set the right paths in the pipe.sh script, then run it to preprocess the dataset. You can then use the CallhomePipeline to load a split as needed; it loads the entire text dataset into memory (it's small enough). From there you can use it for downstream processing: it provides Spanish/English pairs of text, alongside audio and translations if needed.
* Was updated in the source a few commits ago, but I forgot to update the test.
* Should pass the CI now.
@jack89roberts (Collaborator) left a comment

Hey @boykovdn , I had a very quick look through and left some comments.

parser = cls()
with open(file_path) as file:
data = file.read()
parser.parse_transcription(data, file_prefix=file_path[-8:-4])
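As an editorial aside on the snippet above: the `file_path[-8:-4]` slice assumes a four-character file ID sitting immediately before a four-character suffix such as `.cha`. A more self-documenting alternative (a sketch, not part of the PR) could use pathlib:

```python
from pathlib import Path

def file_prefix_from_path(file_path: str) -> str:
    # Path.stem drops the directory part and the final suffix,
    # so "data/callhome/0638.cha" -> "0638" regardless of path depth,
    # where file_path[-8:-4] relies on fixed ID and suffix lengths.
    return Path(file_path).stem

prefix = file_prefix_from_path("data/callhome/0638.cha")  # "0638"
```

The path used here is a made-up example; the same call would replace the slice in `parse_transcription(data, file_prefix=...)`.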
Collaborator

Suggested change
parser.parse_transcription(data, file_prefix=file_path[-8:-4])

pyproject.toml Outdated
@@ -38,9 +38,13 @@ dependencies = [
     "sacrebleu>=2.4.3",
     "seaborn>=0.13.2",
     "sonar-space>=0.2.0",
-    "torch==2.0.1",
+    "torch",
Collaborator

Is there a range we can give for all dependencies, to give some indication of what we ran the code with at least?

Collaborator (Author)

Sure, locally it runs with the latest torch version, but on the cluster it works with 2.2.2. I am happy to set that as the minimum.
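Taking 2.2.2 as the minimum, as discussed, the dependency entry could look like this (a sketch of the pyproject.toml hunk above; only the torch line reflects the thread, the other entries are copied from the diff as context):

```toml
dependencies = [
    "sacrebleu>=2.4.3",
    "seaborn>=0.13.2",
    "sonar-space>=0.2.0",
    "torch>=2.2.2",
]
```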

@jack89roberts (Collaborator) commented Feb 18, 2025

Would be helpful to have a readme somewhere that explains where we got all the data (free and paid labels) + a brief summary of the necessary pre-processing steps and what they're doing etc.

Collaborator (Author)

Will update the README and add another README under scripts/callhome, explaining how to do the pre-processing.

Comment on lines +13 to +15
if len(sys.argv) != 2:
sys.stderr.write(f"Usage: {sys.argv[0]} input_file\n")
sys.exit(1)
Collaborator

would be neater as argparse
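For reference, the manual `sys.argv` check could be replaced with something like the following sketch, which also gives a generated usage/--help message and an automatic error on a missing argument (the help text and example filename are placeholders):

```python
import argparse

def parse_args(argv=None):
    # Equivalent to the manual len(sys.argv) check: argparse exits
    # with a usage message itself if input_file is missing.
    parser = argparse.ArgumentParser()
    parser.add_argument("input_file", help="path to the input file")
    return parser.parse_args(argv)

args = parse_args(["transcript.txt"])
```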

Comment on lines +43 to +50
# Print previous line if no merge occurred.
print(prev_line)

prev_line = line

if prev_line is not None:
print(prev_line)

Collaborator

are the print statements needed/could they be marked as debug/hidden via a logger?

@boykovdn (Collaborator, Author) commented Feb 25, 2025

Align lines is used to process a stream of text and write to stdout, so these are not debug lines. It's part of the pipe.sh script, for which I'll add a brief README. The reason it's used like this is that part of the pre-processing script from the Callhome dataset is Perl scripts, which I found easiest to compose via unix pipes.
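The line-merging filter pattern described here can be sketched as a plain function, keeping the stdout printing at the edge so it stays composable in a pipe (the merge criterion below, "continuation lines start with whitespace", is hypothetical and only for illustration):

```python
def merge_continuations(lines):
    # Join a line onto the previous one when it starts with whitespace;
    # otherwise emit the previous line, mirroring the
    # "print previous line if no merge occurred" pattern above.
    merged = []
    prev = None
    for line in lines:
        line = line.rstrip("\n")
        if prev is not None and line.startswith((" ", "\t")):
            prev = prev + " " + line.strip()
            continue
        if prev is not None:
            merged.append(prev)
        prev = line
    if prev is not None:
        merged.append(prev)
    return merged
```

In the pipe.sh setting the caller would iterate `merge_continuations(sys.stdin)` and `print` each result, so the script composes with the Perl steps via unix pipes.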

Comment on lines 36 to 40
assert target_lang_iso in self.supported_languages, f"This model \
only supports {self.supported_languages}, but got target {target_lang_iso}."
assert source_lang_iso in self.supported_languages, f"This model \
only supports {self.supported_languages}, but got source {source_lang_iso}."

Collaborator

prefer raising errors over asserts

model_input_tokens = self.tokenizer(
model_input_text, return_tensors="pt"
).input_ids
model_output_tokens = self.model.generate(model_input_tokens)
Collaborator

does this need args? E.g. I think it limits how many tokens it will generate based on max_new_tokens, which you might want to vary based on the length of the input text or something.

Collaborator (Author)

Hmm, good point. It actually does not matter in the end, because it turns out T5 does not support Spanish, so we will not be using it. I will update the max tokens to 210 as a reasonable upper bound (max sentence length 70 words, at about 3 tokens per word).
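The arithmetic behind that bound, as a tiny helper (the function name and defaults are illustrative, not from the PR):

```python
def max_new_tokens_budget(max_sentence_words: int = 70, tokens_per_word: int = 3) -> int:
    # Rough generation budget from the thread: ~70-word sentences
    # at roughly 3 subword tokens per word.
    return max_sentence_words * tokens_per_word

budget = max_new_tokens_budget()  # 210
```

The budget would then be passed to generation, e.g. `self.model.generate(model_input_tokens, max_new_tokens=budget)`; `max_new_tokens` is a standard transformers generation argument.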

@boykovdn (Collaborator, Author)

I've added most of the changes commented on; let me know what you think. I've still got to add NLLB, which I will use in my experiments, and change the Ollama testing, but the latter I think is not a priority at the moment.

* The update to preprocessing adds cleaning of the A/B speaker notation.
* Adds pytest skip, because we don't want to run an LLM in the CI.
@@ -23,9 +23,9 @@ python -m pip install .

## CallHome Dataset

-Go [here](https://ca.talkbank.org/access/CallHome), select the conversation language, create an account, then you can download the "media folder". There you can find the .cha files, which contain the transcriptions.
+Go [here](https://ca.talkbank.org/access/CallHome), select the conversation language, create an account, then you can download the "media folder". There you can find the .cha files, which contain the transcriptions. You will need to pre-process this data in combination with the Callhome translations dataset, which includes part of the pre-processing scripts. The README under ./scripts/callhome of this repo contains more information.
@jack89roberts (Collaborator) commented Feb 27, 2025

There isn't a README in scripts/callhome currently, like the last sentence says there should be.

Collaborator (Author)

Ah yes, sorry - adding it to my next commit

r"""
Loads a DataFrame of Spanish utterances alongside their English translations.

This function expects that the dataset has been pre-processed, so that the
lines in the mapping file match to the correct lines in the transcription files.

Args:
cha_root_dir (str): Path to folder containing the .cha transcript files.
Collaborator

Minor but generally if you've type-hinted the args in the function def then it's not necessary to repeat the type in the docstring (and makes things easier to maintain).

if target_lang_iso not in self.supported_languages:
err_msg = f"This model only supports {self.supported_languages}, \
but got target {target_lang_iso}."
raise Exception(err_msg)
@jack89roberts (Collaborator) commented Feb 27, 2025

ValueError would be a bit nicer to raise than a base Exception (also in a few other places).
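Swapping in ValueError on the snippet above would look like this sketch (the function wrapper is only for illustration; in the PR this check lives in the model constructor):

```python
def check_language(target_lang_iso: str, supported_languages: set[str]) -> None:
    # Raise the more specific ValueError so callers can catch it
    # without also swallowing unrelated failures, as a bare
    # `except Exception` would.
    if target_lang_iso not in supported_languages:
        err_msg = (
            f"This model only supports {supported_languages}, "
            f"but got target {target_lang_iso}."
        )
        raise ValueError(err_msg)
```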

def test_t5():
t5 = T5TranslateModel("eng", "fra")
translated_text_t5 = t5(eng_text)
print(translated_text_t5)
Collaborator

some kind of assert instead of print would be nicer
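One way to do that is to assert on properties of the output rather than printing it. The sketch below uses a stand-in for the `t5(eng_text)` call so the pattern is self-contained (`translate_stub` and its return value are made up; the real test would call the model):

```python
def translate_stub(text: str) -> str:
    # Stand-in for t5(eng_text); the real call returns the translated string.
    return "Bonjour le monde"

def test_translation_output():
    translated = translate_stub("Hello world")
    # Assert on properties of the output instead of printing it,
    # so CI fails loudly when the model returns nothing usable.
    assert isinstance(translated, str)
    assert len(translated.strip()) > 0

test_translation_output()
```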

return mname[3:] + ".cha"


def process_line_ids(maybe_lines: str):
@jack89roberts (Collaborator) commented Feb 27, 2025

There are quite a few functions without a specified return type, e.g. for this one

Suggested change
def process_line_ids(maybe_lines: str):
def process_line_ids(maybe_lines: str) -> list[int]:

(I think)

@jack89roberts (Collaborator) commented Feb 27, 2025

I added a few new comments @boykovdn but the only significant one is I think you've maybe forgotten to push the readme in scripts/callhome (or didn't make one yet). The others are all minor so it's fine not to address them and focus on getting results etc. first.

all_metric_evals = {}
for metric_class in metrics:
metric = metric_class()
if metric_class.__name__ in ["COMETRefScore", "COMETQEScore"]:
Collaborator (Author)

@klh5 This is what I use to calculate the metrics
