3 callhome parsing #9

Open
wants to merge 3 commits into
base: main
6 changes: 6 additions & 0 deletions README.md
@@ -21,6 +21,12 @@ python -m pip install .

`typst compile notes.typ`

## CallHome Dataset

Go [here](https://ca.talkbank.org/access/CallHome), select the conversation language, create an account, and download the "media folder". It contains the .cha files, which hold the transcriptions.

To load the transcriptions as a bag of sentences, use `m4st.parse.TranscriptParser.from_folder`. It does not group lines by participant or by conversation; it loads every utterance line as one entry in a list, after some pre-processing.
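
As a rough illustration of what the pre-processing does, the sketch below applies the same regular expressions as `parse_line` in `src/m4st/parse.py` to a single line. The helper name `clean_utterance` and the sample line are made up for this example and are not taken from the dataset or from the package.

```python
import re
from typing import Optional

# Hypothetical .cha utterance line: speaker tag, text, an event marker,
# and a timestamp delimited by \x15 control characters.
sample = "*A:\tja das ist gut . &=lacht \x1550770_51060\x15"

# Same patterns, in the same order, as parse_line in this PR
PATTERNS = (r"\x15\d+_\d+\x15", r"&=\S+", r"&+\S+", r"\+/", r"\+")


def clean_utterance(line: str) -> Optional[str]:
    """Return the cleaned utterance text, or None if nothing usable is left."""
    match = re.match(r"\*(\w):\s+(.*)", line)
    if match is None:
        return None  # not a participant line
    text = match.group(2)
    for pattern in PATTERNS:
        text = re.sub(pattern, "", text).strip()
    # Lines reduced to a bare terminator carry no content
    return None if text in (".", "?", "!") else text


print(clean_utterance(sample))
```

Header lines such as `@Begin` do not match the speaker pattern and are skipped, which is why the resulting list contains only utterance text.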


## License

2 changes: 1 addition & 1 deletion doc/notes.typ
@@ -24,7 +24,7 @@ Either way, the translation will be influenced by the domain shift due to filler
//We don't know the real world distribution of filler words, but we could use a LLM to sample from $bb(P)(hat(x) | x)$, where $x$ is the clean input, and $hat(x)$ is the filler-word-corrupted input.

The translation model can be defined as $cal(T): x arrow x^prime$, where $x^prime$ is the translated text.
-The metric can be defined as $cal(M): x^prime, x, {y_i}_(i=1)^N arrow bb(R)$, where $y_i$ are reference translations provided by $N$ translators.
+The metric can be defined as $cal(M): [a], x^prime, x, {y_i}_(i=1)^N arrow bb(R)$, where $y_i$ are reference translations provided by $N$ translators, and $a$ is the source audio (denoted optional since some metrics don't accept it).
In our use case $N=1$.

We are generally not interested in benchmarking different models, so we can assume that $cal(T)$ is given.
5 changes: 4 additions & 1 deletion pyproject.toml
@@ -26,7 +26,9 @@ classifiers = [
"Topic :: Scientific/Engineering",
"Typing :: Typed",
]
-dependencies = []
+dependencies = [
+    "tqdm"
+]

[project.optional-dependencies]
dev = [
@@ -70,6 +72,7 @@ disallow_untyped_defs = false
disallow_incomplete_defs = false
check_untyped_defs = true
strict = false
+ignore_missing_imports = true


[tool.ruff]
70 changes: 70 additions & 0 deletions src/m4st/parse.py
@@ -0,0 +1,70 @@
import glob
import os
import re

from tqdm import tqdm


class TranscriptParser:
    r"""
    Provides a bag of conversational lines.

    Instantiate this by using the `from_folder` class method and
    pointing it to a folder from the CallHome dataset, for example
    the 'deu' folder for transcriptions in German. This class will
    try its best to remove the .cha format specifics, and only
    keep the UTF8 characters, thus providing text we can use for
    downstream translation.
    """

    def __init__(self):
        self.lines = []

    @classmethod
    def from_folder(cls, folder_path: str):
        parser = cls()
        # Loop through all .cha files in the folder
        for file_path in tqdm(
            glob.glob(os.path.join(folder_path, "*.cha")), desc=f"Parsing {folder_path}"
        ):
            # The files declare @UTF8, so do not rely on the platform default
            with open(file_path, encoding="utf-8") as file:
                data = file.read()
            parser.parse_transcription(data)

        return parser

    def parse_line(self, line: str):
        # Match lines with participant utterances
        match = re.match(r"\*(\w):\s+(.*)", line)
        if match:
            participant, text = match.groups()
            # Remove timestamps (e.g., •50770_51060•) from the text
            # And other artefacts
            clean_text = re.sub(r"\x15\d+_\d+\x15", "", text).strip()
            clean_text = re.sub(r"&=\S+", "", clean_text).strip()
            clean_text = re.sub(r"&+\S+", "", clean_text).strip()
            clean_text = re.sub(r"\+/", "", clean_text).strip()
            clean_text = re.sub(r"\+", "", clean_text).strip()
            if clean_text in [".", "?", "!"]:
                # Nothing but the punctuation is remaining
                return

            self.lines.append(clean_text)

    def parse_transcription(self, data: str):
        lines = data.split("\n")
        for line in lines:
            if line in ["@Begin", "@UTF8", "@End"]:
                # File-level header and footer markers; nothing to parse
                pass
            elif line.startswith("*"):
                # Participant line
                self.parse_line(line)


if __name__ == "__main__":
    # Input transcription data
    # Parse the transcription
    folder_path = "/Users/bvodenicharski/Downloads/deu"

Review comment (Collaborator):

Make this an arg &/or a path in the repo?

    tp = TranscriptParser.from_folder(folder_path)
    print(len(tp.lines))
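
One way to address the review comment on the hard-coded `folder_path` would be to read it from the command line and fall back to a path inside the repo. The sketch below uses `argparse` from the standard library; the function name `parse_cli_args`, the `--folder` flag, and the default `data/callhome/deu` location are assumptions for illustration, not part of this PR.

```python
import argparse
from pathlib import Path

# Hypothetical repo-relative default; the PR does not define a data location.
DEFAULT_FOLDER = Path("data") / "callhome" / "deu"


def parse_cli_args(argv=None) -> argparse.Namespace:
    """Parse the CallHome folder from the command line (illustrative only)."""
    arg_parser = argparse.ArgumentParser(
        description="Parse CallHome .cha transcripts into a list of lines."
    )
    arg_parser.add_argument(
        "--folder",
        type=Path,
        default=DEFAULT_FOLDER,
        help="Folder containing the .cha files (default: %(default)s).",
    )
    # argv=None makes argparse fall back to sys.argv[1:]
    return arg_parser.parse_args(argv)


# Explicit argument list, so the example does not depend on sys.argv
args = parse_cli_args(["--folder", "/tmp/deu"])
print(args.folder)
```

In the `__main__` block, `folder_path = parse_cli_args().folder` could then replace the hard-coded assignment, leaving the `TranscriptParser.from_folder(folder_path)` call unchanged.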