update make_data_column_uniform #125

Open
wants to merge 4 commits into
base: new-aliasing-structure


@AmeyHengle AmeyHengle commented Jan 29, 2023

Ticket: https://vernacular-ai.atlassian.net/browse/CMS-2525
Data path: s3://vodafone-depository/datasets/problematic_datapoints_(production calls).csv

Problem:
I found that a large amount of Vodafone production data contains noisy transcripts.
In these data points, the ASR alternatives seem to have been dumped twice (double json.dumps()).
As a result, the merge_asr_output plugin produces incorrect values for these data points, and those values are eventually passed to the model as input.
The following is an example of a noisy transcript, with its expected vs. actual merge_asr_output value.

Sample transcript:

"[[{\"confidence\": 0.6691616, \"transcript\": \"Shikha was never being recorded\"}, {\"confidence\": 0.6227231, \"transcript\": \"dikha was never being recorded\"}, {\"confidence\": 0.6227347, \"transcript\": \"Dekha was never being recorded\"}, {\"confidence\": 0.5426301, \"transcript\": \"Shikha was now being recorded\"}, {\"confidence\": 0.243892, \"transcript\": \"the cow is now being recorded\"}, {\"confidence\": 0.425278, \"transcript\": \"Shekhawat now being recorded\"}, {\"confidence\": 0.5605131, \"transcript\": \"Sikh was never being recorded\"}, {\"confidence\": 0.4961915, \"transcript\": \"dikha was now being recorded\"}, {\"confidence\": 0.5235198, \"transcript\": \"Shikha was never been recorded\"}, {\"confidence\": 0.61963093, \"transcript\": \"Sikho was never being recorded\"}]]"

Expected output:

['<s> Shikha was never being recorded </s> <s> dikha was never being recorded </s> <s> Dekha was never being recorded </s> <s> Shikha was now being recorded </s> <s> the cow is now being recorded </s> <s> Shekhawat now being recorded </s> <s> Sikh was never being recorded </s> <s> dikha was now being recorded </s> <s> Shikha was never been recorded </s> <s> Sikho was never being recorded </s>']

Actual output:

['<s> [[{"confidence": 0.6691616, "transcript": "Shikha was never being recorded"}, {"confidence": 0.6227231, "transcript": "dikha was never being recorded"}, {"confidence": 0.6227347, "transcript": "Dekha was never being recorded"}, {"confidence": 0.5426301, "transcript": "Shikha was now being recorded"}, {"confidence": 0.243892, "transcript": "the cow is now being recorded"}, {"confidence": 0.425278, "transcript": "Shekhawat now being recorded"}, {"confidence": 0.5605131, "transcript": "Sikh was never being recorded"}, {"confidence": 0.4961915, "transcript": "dikha was now being recorded"}, {"confidence": 0.5235198, "transcript": "Shikha was never been recorded"}, {"confidence": 0.61963093, "transcript": "Sikho was never being recorded"}]] </s>']
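For reviewers who want to reproduce the mismatch locally, here is a minimal sketch using a shortened version of the sample above. Note that `merge_transcripts` is a hypothetical stand-in written for this comment, not the actual merge_asr_output plugin code; it only assumes the plugin wraps each transcript in `<s> ... </s>` and treats a plain string as a single transcript, which is what the outputs above suggest.

```python
import json

# Alternatives as the ASR would hand them over: a list of utterances,
# each a list of {"confidence", "transcript"} dicts (shortened sample).
alternatives = [[
    {"confidence": 0.6691616, "transcript": "Shikha was never being recorded"},
    {"confidence": 0.6227231, "transcript": "dikha was never being recorded"},
]]

once = json.dumps(alternatives)  # what the data column should hold
twice = json.dumps(once)         # what the noisy rows hold (dumped twice)


def merge_transcripts(decoded):
    """Hypothetical stand-in for the merge step, NOT the plugin's real code."""
    if isinstance(decoded, str):
        # Assumption: a plain string is treated as one transcript.
        return [f"<s> {decoded} </s>"]
    return [
        " ".join(f"<s> {alt['transcript']} </s>" for alt in utterance)
        for utterance in decoded
    ]


expected = merge_transcripts(json.loads(once))   # transcripts only
actual = merge_transcripts(json.loads(twice))    # whole JSON text wrapped in <s> </s>
print(expected)
print(actual)
```

With the doubly dumped value, a single json.loads only removes the outer layer, so the merge step receives the JSON text itself rather than the list of alternatives.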

@AmeyHengle AmeyHengle requested a review from janaab11 January 29, 2023 05:04
@dakshvar22 dakshvar22 requested review from dakshvar22 and removed request for janaab11 February 24, 2023 06:47
@@ -74,9 +74,11 @@ def make_data_column_uniform(data_frame: pd.DataFrame) -> pd.DataFrame:
         ):
             if isinstance(row[const.ALTERNATIVES], str):
                 data = json.loads(row[const.ALTERNATIVES])
                 if isinstance(data, str):
Contributor

Why would json.loads yield a string? Shouldn't it strictly give a dict?

Contributor Author

Production calls in Vodafone contain noisy transcripts, where the ASR alternatives seem to have been dumped twice. In that case the first json.loads returns the inner JSON text as a str, so the second if condition fixes this by performing another json.loads(data).
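To make the point about return types concrete: json.loads returns whatever the top-level JSON value is, so it can yield a list, a dict, or a str. Below is a small sketch of the unwrapping idea. `decode_alternatives` and its `max_depth` parameter are hypothetical names used only for illustration; the PR itself does this inline with a second json.loads call.

```python
import json


def decode_alternatives(raw, max_depth=2):
    """Decode `raw`, unwrapping at most `max_depth` layers of JSON encoding.

    Hypothetical helper for illustration only. If a layer is a string that
    is not valid JSON, json.loads raises json.JSONDecodeError.
    """
    data = raw
    for _ in range(max_depth):
        if not isinstance(data, str):
            break  # already a decoded list/dict, nothing left to unwrap
        data = json.loads(data)
    return data


payload = [[{"confidence": 0.6691616, "transcript": "Shikha was never being recorded"}]]
once = json.dumps(payload)
twice = json.dumps(once)

print(type(json.loads(once)).__name__)   # list
print(type(json.loads(twice)).__name__)  # str
print(decode_alternatives(twice) == payload)
```

Capping the number of passes keeps the helper from looping on unexpected input; whether two layers is the right cap for the production data is an assumption.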

Case example:
image

@@ -58,7 +58,7 @@ def make_reftime_column_uniform(data_frame: pd.DataFrame) -> pd.DataFrame:

     return data_frame

-def make_data_column_uniform(data_frame: pd.DataFrame) -> pd.DataFrame:
+def make_data_column_uniform(data_frame: pd.DataFrame) -> None:
Contributor

Why does the type hint need a change?

Contributor Author

This was a mistake; the return type hint shouldn't have been removed. Fixed in a new commit.

@AmeyHengle AmeyHengle requested a review from dakshvar22 March 10, 2023 12:03
@AmeyHengle AmeyHengle self-assigned this Mar 10, 2023
Comment on lines +84 to +87
data_frame.loc[i, const.ALTERNATIVES] = json.dumps(
    data, ensure_ascii=False
)
Contributor

Discard this data point.
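One possible reading of this suggestion, sketched on plain dicts rather than the actual DataFrame: drop rows whose alternatives are double-encoded (or not JSON at all) instead of repairing and writing them back. `keep_clean_rows` and the `"alternatives"` column name are assumptions for illustration; the real column key comes from const.ALTERNATIVES, whose value is not shown in this PR.

```python
import json


def keep_clean_rows(rows, column="alternatives"):
    """Return only rows whose `column` decodes to a non-string JSON value.

    Hypothetical helper; `column` is an assumed key, not const.ALTERNATIVES.
    """
    kept = []
    for row in rows:
        value = row[column]
        if isinstance(value, str):
            try:
                value = json.loads(value)
            except json.JSONDecodeError:
                continue  # not JSON at all: discard
            if isinstance(value, str):
                continue  # double-encoded: discard instead of repairing
        kept.append(row)
    return kept


payload = [[{"confidence": 0.6691616, "transcript": "Shikha was never being recorded"}]]
once = json.dumps(payload)
rows = [
    {"alternatives": once},              # clean, kept
    {"alternatives": json.dumps(once)},  # dumped twice, discarded
    {"alternatives": "not json"},        # unparseable, discarded
    {"alternatives": payload},           # already decoded, kept
]
kept = keep_clean_rows(rows)
print(len(kept))
```

The trade-off is that discarding loses production samples that the second json.loads could have recovered, so which behaviour is wanted here is a call for the reviewers.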
