update make_data_column_uniform #125
base: new-aliasing-structure
slu/slu/utils/preprocessing.py
Outdated
@@ -74,9 +74,11 @@ def make_data_column_uniform(data_frame: pd.DataFrame) -> pd.DataFrame:
     ):
         if isinstance(row[const.ALTERNATIVES], str):
             data = json.loads(row[const.ALTERNATIVES])
             if isinstance(data, str):
Why would json.loads yield a string? Shouldn't it strictly give a dict?
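For context, json.loads returns whichever Python type matches the top-level JSON value, so a dict is not guaranteed. A minimal sketch with a made-up payload (not taken from the ticket) suggests how a value that went through json.dumps twice decodes to a str on the first json.loads:

```python
import json

# Hypothetical ASR alternatives; the real column contents are not shown in this PR.
alternatives = [[{"transcript": "hello", "confidence": 0.9}]]

dumped_once = json.dumps(alternatives)
dumped_twice = json.dumps(dumped_once)  # the JSON text itself gets encoded as a JSON string

first_pass = json.loads(dumped_twice)   # undoes only the outer encoding
second_pass = json.loads(first_pass)    # recovers the original structure

print(type(json.loads(dumped_once)).__name__)  # list
print(type(first_pass).__name__)               # str
print(type(second_pass).__name__)              # list
```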
slu/slu/utils/preprocessing.py
Outdated
@@ -58,7 +58,7 @@ def make_reftime_column_uniform(data_frame: pd.DataFrame) -> pd.DataFrame:

     return data_frame


-def make_data_column_uniform(data_frame: pd.DataFrame) -> pd.DataFrame:
+def make_data_column_uniform(data_frame: pd.DataFrame) -> None:
Why does the type hint need a change?
This was a mistake; the return type hint shouldn't have been changed. Fixed in a new commit.
Update return-type hint
                 data_frame.loc[i, const.ALTERNATIVES] = json.dumps(
                     data, ensure_ascii=False
                 )
discard this data point.
Ticket: https://vernacular-ai.atlassian.net/browse/CMS-2525
Data path: s3://vodafone-depository/datasets/problematic_datapoints_(production calls).csv)
Problem:
I found that a large share of Vodafone production data contains noisy transcripts.
In these data points, the ASR alternatives seem to have been dumped twice (double json.dumps()).
The merge_asr_output plugin values are incorrect for these data points, and those values are eventually passed as input to the model.
Following is an example of a noisy transcript, and its expected vs. actual merge_asr_output value.
Sample transcript:
Expected output:
Actual output:
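As a rough sketch of one way to normalise the double-encoded alternatives described above: decode repeatedly until the value is no longer a string, then dump once. The helper names `decode_nested_json` and `normalize_alternatives`, the `max_depth` limit, and the sample payload are all hypothetical; this is not necessarily what the PR implements.

```python
import json
from typing import Any


def decode_nested_json(value: Any, max_depth: int = 5) -> Any:
    """Apply json.loads repeatedly while the result is still a str.

    Hypothetical helper: max_depth is an arbitrary safety limit.
    """
    for _ in range(max_depth):
        if not isinstance(value, str):
            break
        try:
            value = json.loads(value)
        except json.JSONDecodeError:
            # Not valid JSON text: stop and keep the value as it is.
            break
    return value


def normalize_alternatives(raw: str) -> str:
    """Return the alternatives serialised exactly once."""
    return json.dumps(decode_nested_json(raw), ensure_ascii=False)


# Made-up payload, for illustration only.
alternatives = [[{"transcript": "नमस्ते", "confidence": 0.9}]]
once = json.dumps(alternatives, ensure_ascii=False)
twice = json.dumps(once, ensure_ascii=False)

print(normalize_alternatives(twice) == once)  # True
print(normalize_alternatives(once) == once)   # True
```

Passing `ensure_ascii=False` mirrors the call in the diff and keeps non-ASCII transcripts readable in the stored column.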