update make_data_column_uniform #125

AmeyHengle · 2023-01-29T05:04:45Z

Ticket: https://vernacular-ai.atlassian.net/browse/CMS-2525
Data path: s3://vodafone-depository/datasets/problematic_datapoints_(production calls).csv)

Problem:
I found that a large share of the Vodafone production data contains noisy transcripts.
In these data points, the ASR alternatives seem to have been serialized twice (double json.dumps()).
As a result, the merge_asr_output plugin produces incorrect values for these data points, and those values are eventually passed as input to the model.
Below is an example of a noisy transcript, with its expected and actual merge_asr_output values.

Sample transcript:

"[[{\"confidence\": 0.6691616, \"transcript\": \"Shikha was never being recorded\"}, {\"confidence\": 0.6227231, \"transcript\": \"dikha was never being recorded\"}, {\"confidence\": 0.6227347, \"transcript\": \"Dekha was never being recorded\"}, {\"confidence\": 0.5426301, \"transcript\": \"Shikha was now being recorded\"}, {\"confidence\": 0.243892, \"transcript\": \"the cow is now being recorded\"}, {\"confidence\": 0.425278, \"transcript\": \"Shekhawat now being recorded\"}, {\"confidence\": 0.5605131, \"transcript\": \"Sikh was never being recorded\"}, {\"confidence\": 0.4961915, \"transcript\": \"dikha was now being recorded\"}, {\"confidence\": 0.5235198, \"transcript\": \"Shikha was never been recorded\"}, {\"confidence\": 0.61963093, \"transcript\": \"Sikho was never being recorded\"}]]"

Expected output:

['<s> Shikha was never being recorded </s> <s> dikha was never being recorded </s> <s> Dekha was never being recorded </s> <s> Shikha was now being recorded </s> <s> the cow is now being recorded </s> <s> Shekhawat now being recorded </s> <s> Sikh was never being recorded </s> <s> dikha was now being recorded </s> <s> Shikha was never been recorded </s> <s> Sikho was never being recorded </s>']

Actual output:

['<s> [[{"confidence": 0.6691616, "transcript": "Shikha was never being recorded"}, {"confidence": 0.6227231, "transcript": "dikha was never being recorded"}, {"confidence": 0.6227347, "transcript": "Dekha was never being recorded"}, {"confidence": 0.5426301, "transcript": "Shikha was now being recorded"}, {"confidence": 0.243892, "transcript": "the cow is now being recorded"}, {"confidence": 0.425278, "transcript": "Shekhawat now being recorded"}, {"confidence": 0.5605131, "transcript": "Sikh was never being recorded"}, {"confidence": 0.4961915, "transcript": "dikha was now being recorded"}, {"confidence": 0.5235198, "transcript": "Shikha was never been recorded"}, {"confidence": 0.61963093, "transcript": "Sikho was never being recorded"}]] </s>']
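For reference, the expected format above can be reproduced with a small sketch. This is not the merge_asr_output plugin itself: `merge_alternatives` is a hypothetical helper, and it assumes the input has already been decoded into a list of utterances, each a list of dicts with a `transcript` key. Only the first two alternatives from the sample are used.

```python
from typing import Dict, List

# Hypothetical stand-in for the plugin's formatting step (an assumption,
# not the real implementation): wrap every transcript in <s> ... </s>
# and join the alternatives of one utterance with single spaces.
def merge_alternatives(utterances: List[List[Dict]]) -> List[str]:
    return [
        " ".join(f"<s> {alt['transcript']} </s>" for alt in utterance)
        for utterance in utterances
    ]

# First two alternatives from the sample transcript, already decoded.
utterances = [[
    {"confidence": 0.6691616, "transcript": "Shikha was never being recorded"},
    {"confidence": 0.6227231, "transcript": "dikha was never being recorded"},
]]

merged = merge_alternatives(utterances)
print(merged)
```

When the alternatives arrive as a still-encoded string instead of a list, there is no per-alternative `transcript` to pick out, which is consistent with the whole JSON text ending up between a single `<s>` and `</s>` in the actual output.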

dakshvar22 · 2023-02-24T06:49:22Z

slu/slu/utils/preprocessing.py

@@ -74,9 +74,11 @@ def make_data_column_uniform(data_frame: pd.DataFrame) -> pd.DataFrame:
    ):
        if isinstance(row[const.ALTERNATIVES], str):
            data = json.loads(row[const.ALTERNATIVES])
+            if isinstance(data, str):


Why would json.loads yield a string? Shouldn't it strictly give a dict?

Production calls in Vodafone contain noisy transcripts in which the ASR alternatives seem to have been dumped twice. In that case the first json.loads returns a str (the inner JSON text) rather than a list, so the second if condition handles it by calling json.loads(data) once more.

Case example:
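A minimal sketch of the double-encoding case, using the first alternative from the sample transcript in the PR description. `decode_alternatives` is a hypothetical helper written for illustration and is not the code in preprocessing.py; it only shows why one json.loads is not enough.

```python
import json

alternatives = [[
    {"confidence": 0.6691616, "transcript": "Shikha was never being recorded"}
]]

once = json.dumps(alternatives)   # normal case: JSON text of a list
twice = json.dumps(once)          # noisy case: JSON text of a JSON string

def decode_alternatives(raw: str):
    """Decode once, and once more if the first pass still yields a str."""
    data = json.loads(raw)
    if isinstance(data, str):     # double-dumped input lands here
        data = json.loads(data)
    return data

first_pass = json.loads(twice)
print(type(first_pass).__name__)           # str, not list
print(decode_alternatives(twice) == alternatives)
```

Inputs that were dumped only once take the normal path, because the first json.loads already returns a list and the extra check is skipped.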

dakshvar22 · 2023-02-24T06:49:43Z

slu/slu/utils/preprocessing.py

@@ -58,7 +58,7 @@ def make_reftime_column_uniform(data_frame: pd.DataFrame) -> pd.DataFrame:

    return data_frame

-def make_data_column_uniform(data_frame: pd.DataFrame) -> pd.DataFrame:
+def make_data_column_uniform(data_frame: pd.DataFrame) -> None:


Why does the type hint need a change?

This was a mistake; the return type hint shouldn't have been changed. Fixed in a new commit.

Update return-type hint

dakshvar22 · 2023-04-20T12:14:37Z

slu/slu/utils/preprocessing.py

+                data_frame.loc[i, const.ALTERNATIVES] = json.dumps(
+                    data, ensure_ascii=False
+                )


Discard this data point.

update make_data_column_uniform

6c4f601

AmeyHengle requested a review from janaab11 January 29, 2023 05:04

dakshvar22 requested review from dakshvar22 and removed request for janaab11 February 24, 2023 06:47

dakshvar22 reviewed Feb 24, 2023

AmeyHengle added 3 commits March 1, 2023 20:15

Update preprocessing.py

98ccd74

Update return-type hint

Update preprocessing.py

1215b82

Update preprocessing.py

0b4e5e9

AmeyHengle requested a review from dakshvar22 March 10, 2023 12:03

AmeyHengle self-assigned this Mar 10, 2023

dakshvar22 reviewed Apr 20, 2023
