Initial pipeline code #19
base: main
Conversation
boykovdn
commented
Jan 27, 2025
* Needs the translation and transcription models to be useful
* Wrappers of huggingface t5 and Ollama
  * Might require some dependencies which are not installed in the project by default
* To pass CI
  * Alternatively would have to get ollama set up in the CI VM, so just removing for now. This is not testing fundamental functionality
* For t5 tokenizer (Google)
* Used downstream to match a text file to the correct audio file.
* A wrapper around Whisper, which is prompted for transcription
* The translation and transcription outputs are still to be tested
* Translation model source and target languages now set in constructor
* Pipeline updated to reflect above change
* CI python env set to 3.11, to match what is pinned in the pyproject.toml
* Added pydub dependency
* Trying to see whether it will still pass tests and CI
Would you like someone to review this @boykovdn, or is it still being worked on?
Better to review later - data handling might change.
The new way of handling Callhome is to download the datasets (Talkbank and the English extension), set the right paths in the pipe.sh script, then run it to preprocess the dataset. You can then use the CallhomePipeline to load a split as needed; it loads the entire text dataset into memory (it's small enough). Downstream, it provides Spanish/English pairs of text, alongside audio and translations if needed.
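For illustration, a minimal sketch of what that downstream usage might look like (the import path, constructor arguments, and record keys here are assumptions, not the project's confirmed API):

```python
# Hypothetical usage sketch of CallhomePipeline as described above.
# The import path, constructor signature, and record keys are assumptions.
from m4st.callhome import CallhomePipeline

pipeline = CallhomePipeline(split="train")  # loads the whole text split into memory

for sample in pipeline:
    spa_text = sample["spa"]  # Spanish utterance
    eng_text = sample["eng"]  # English translation
    # audio and reference translations would also be available if requested
```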
* Was updated in the source a few commits ago, but I forgot to update the test.
* Should pass the CI now.
Hey @boykovdn , I had a very quick look through and left some comments.
```python
parser = cls()
with open(file_path) as file:
    data = file.read()
parser.parse_transcription(data, file_prefix=file_path[-8:-4])
```
pyproject.toml
Outdated
```diff
@@ -38,9 +38,13 @@ dependencies = [
     "sacrebleu>=2.4.3",
     "seaborn>=0.13.2",
     "sonar-space>=0.2.0",
-    "torch==2.0.1",
+    "torch",
```
Is there a range we can give for all dependencies, to give some indication of what we ran the code with at least?
Sure, locally it runs with the latest torch version, but on the cluster it works with 2.2.2. I am happy to set that as the minimum.
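If 2.2.2 is adopted as the minimum, the pin might look like this (a sketch; whether an upper bound is also wanted is an open question):

```diff
-    "torch",
+    "torch>=2.2.2",
```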
Would be helpful to have a readme somewhere that explains where we got all the data (free and paid labels) + a brief summary of the necessary pre-processing steps and what they're doing etc.
Will update the README and add another README under scripts/callhome, explaining how to do the pre-processing.
```python
if len(sys.argv) != 2:
    sys.stderr.write(f"Usage: {sys.argv[0]} input_file\n")
    sys.exit(1)
```
would be neater as argparse
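Something along these lines, perhaps (a sketch; only the `input_file` name comes from the usage string above):

```python
import argparse

# Sketch: argparse generates the usage message and non-zero exit automatically.
parser = argparse.ArgumentParser()
parser.add_argument("input_file", help="Path to the input file to process.")
args = parser.parse_args()  # exits with a usage message if input_file is missing
```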
```python
# Print previous line if no merge occurred.
print(prev_line)

prev_line = line

if prev_line is not None:
    print(prev_line)
```
are the print statements needed/could they be marked as debug/hidden via a logger?
Align lines is used in the context of processing a stream of text and outputting to stdout - these are not debug lines. It's part of the pipe.sh script, for which I'll add a brief readme. The reason it's used like this is that part of the pre-processing for the Callhome dataset consists of Perl scripts, which I found easiest to chain with unix pipes.
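For context, a minimal sketch of that read-stdin/write-stdout pattern (the merge condition here is a placeholder, not the script's actual logic):

```python
import sys

# Sketch of a stream filter suitable for use in a unix pipe.
prev_line = None
for line in sys.stdin:
    line = line.rstrip("\n")
    if prev_line is not None:
        if line.startswith(" "):  # placeholder merge condition
            prev_line += line
            continue
        print(prev_line)  # emit to stdout for the next stage of the pipe
    prev_line = line
if prev_line is not None:
    print(prev_line)
```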
src/m4st/translate/model.py
Outdated
```python
assert target_lang_iso in self.supported_languages, f"This model \
    only supports {self.supported_languages}, but got target {target_lang_iso}."
assert source_lang_iso in self.supported_languages, f"This model \
    only supports {self.supported_languages}, but got source {source_lang_iso}."
```
prefer raising errors over asserts
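For example, the first assert above could become something like (a sketch, assuming the same attributes):

```python
# Sketch: raise instead of asserting, so the check survives python -O.
if target_lang_iso not in self.supported_languages:
    raise ValueError(
        f"This model only supports {self.supported_languages}, "
        f"but got target {target_lang_iso}."
    )
```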
src/m4st/translate/model.py
Outdated
```python
model_input_tokens = self.tokenizer(
    model_input_text, return_tensors="pt"
).input_ids
model_output_tokens = self.model.generate(model_input_tokens)
```
does this need args? E.g. I think it limits how many tokens it will generate based on max_new_tokens, which you might want to vary based on the length of the input text or something.
Hmm, good point. It actually does not matter in the end, because it turns out T5 does not support Spanish, so we will not be using it. I will update the max tokens to 210 as a reasonable upper bound (max sentence len 70 words, and about 3 tokens per word?)
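The capped call might then look like this (a sketch; 210 follows the estimate above):

```python
# Sketch: cap generation length at ~210 tokens (max ~70-word sentences,
# roughly 3 tokens per word). max_new_tokens is a standard generate() kwarg.
model_output_tokens = self.model.generate(
    model_input_tokens,
    max_new_tokens=210,
)
```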
I've added most of the changes commented on, let me know what you think. I've still got to add NLLB, which I will use in my experiments, and change the Ollama testing, but the latter I think is not a priority at the moment.
* The update to preprocessing adds cleaning of the A/B notation for speaker.
* Adds pytest skip, because we don't want to run an LLM in the CI.
```diff
@@ -23,9 +23,9 @@ python -m pip install .

 ## CallHome Dataset

-Go [https://ca.talkbank.org/access/CallHome](here), select the conversation language, create account, then you can download the "media folder". There you can find the .cha files, which contain the transcriptions.
+Go [https://ca.talkbank.org/access/CallHome](here), select the conversation language, create account, then you can download the "media folder". There you can find the .cha files, which contain the transcriptions. You will need to pre-process this data in combination with the Callhome translations dataset, which includes part of the pre-processing scripts. The README under ./scripts/callhome of this repo contains more information.
```
There isn't a README in scripts/callhome currently, even though the last sentence says there should be.
Ah yes, sorry - adding it to my next commit
r""" | ||
Loads a DataFrame of Spanish utterances alongside their English translations. | ||
|
||
This function expects that the dataset has been pre-processed, so that the | ||
lines in the mapping file match to the correct lines in the transcription files. | ||
|
||
Args: | ||
cha_root_dir (str): Path to folder containing the .cha transcript files. |
Minor but generally if you've type-hinted the args in the function def then it's not necessary to repeat the type in the docstring (and makes things easier to maintain).
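i.e. something like this (a sketch; the function name and return type here are hypothetical):

```python
import pandas as pd

# Hypothetical signature illustrating the suggestion: the type lives in the
# annotation, so the docstring doesn't need to repeat it.
def load_pairs(cha_root_dir: str) -> pd.DataFrame:
    """Load Spanish utterances alongside their English translations.

    Args:
        cha_root_dir: Path to the folder containing the .cha transcript files.
    """
    ...
```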
```python
if target_lang_iso not in self.supported_languages:
    err_msg = f"This model only supports {self.supported_languages}, \
        but got target {target_lang_iso}."
    raise Exception(err_msg)
```
ValueError would be a bit nicer to raise than a base Exception (also in a few other places).
```python
def test_t5():
    t5 = T5TranslateModel("eng", "fra")
    translated_text_t5 = t5(eng_text)
    print(translated_text_t5)
```
some kind of assert instead of print would be nicer
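e.g. (a sketch; it assumes the same fixtures as the test above, and the exact expectations are assumptions):

```python
def test_t5():
    t5 = T5TranslateModel("eng", "fra")
    translated_text_t5 = t5(eng_text)
    # Sketch: check basic properties of the output instead of printing it.
    assert isinstance(translated_text_t5, str)
    assert len(translated_text_t5) > 0
```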
```python
    return mname[3:] + ".cha"


def process_line_ids(maybe_lines: str):
```
There are quite a few functions without a specified return type, e.g. for this one:

```diff
-def process_line_ids(maybe_lines: str):
+def process_line_ids(maybe_lines: str) -> list[int]:
```

(I think)
I added a few new comments @boykovdn, but the only significant one is that I think you've maybe forgotten to push the readme in scripts/callhome.
```python
all_metric_evals = {}
for metric_class in metrics:
    metric = metric_class()
    if metric_class.__name__ in ["COMETRefScore", "COMETQEScore"]:
```
@klh5 This is what I use to calculate the metrics