Making gender assignment random for cardinals, fractions, and decimal… #3759

bonham79 · 2022-02-25T20:32:02Z

…s. Also making fixes to measure so sounds more fluent.

Signed-off-by: Bonham79 [email protected]

What does this PR do ?

Updates tn_es to make gender randomized.

Add a one line overview of what this PR aims to accomplish.
This will update some minor aspects of tn_es to lower error rate.

Collection: [Nemo Text Processing]

Changelog

Cardinal and Decimal have randomized gender assignment
Fraction has randomized gender assignment for integer portion
Measure now has more fluent normalization of fractional amounts. (Note, due to Sparrowhawk architecture it can sound slightly awkward: 1 1/2 l-> "uno y medio litros" as opposed to more common "uno litro y medio"). Change would require rewriting of how measure verbalizer manages fraction tags.
Non-deterministic strings now permit apocope for "una". This arises when the number precedes a noun with stressed "a" sound. Since this is hard to determine which FST setup, this is left to normalize_with_audio to manage.

Usage

Before your PR is "Ready for review"

Pre checks:

[y ] Make sure you read and followed Contributor guidelines
[n ] Did you write any new necessary tests?
[ n] Did you add or update any necessary documentation?
[ y] Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
- Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

[ y] New Feature
[ y] Bugfix
Documentation

If you haven't finished some of the above items you can still open "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
Contributor guidelines contains specific people who can review PRs to various areas.

Additional Information

Related to # (issue)

…s. Also making fixes to measure so sounds more fluent. Signed-off-by: Bonham79 <[email protected]>

Signed-off-by: Bonham79 <[email protected]>

lgtm-com · 2022-02-25T20:42:30Z

This pull request introduces 1 alert when merging 245adb8 into bc6215f - view on LGTM.com

new alerts:

1 for Unused import

Signed-off-by: bonham79 <[email protected]>

bonham79 · 2022-02-26T00:15:33Z

@yzhang123

Signed-off-by: Bonham79 <[email protected]>

yzhang123 · 2022-02-28T18:45:59Z

@bonham79 thanks alot for this!
@erastorgueva-nv since the change is very much spanish related, could you please review this this week ? thanks!

erastorgueva-nv · 2022-03-07T12:22:40Z

Hi Travis @bonham79, thank you very much for this PR. I think there is a bug somewhere causing only masculine cardinals and decimals to be produced?
e.g. when I run python nemo_text_processing/text_normalization/normalize.py "1" --language es , the output is always un . Could you please investigate?

bonham79 · 2022-03-07T12:51:33Z

@erastorgueva-nv Was this for just through normalize.py or also the case for Sparrowhawk deployment?

erastorgueva-nv · 2022-03-08T20:39:01Z

@erastorgueva-nv Was this for just through normalize.py or also the case for Sparrowhawk deployment?

@bonham79 Hi Travis, when I wrote that comment I had only seen it in normalize.py but I have since tried it in Sparrowhawk and I see the same results, i.e. echo "1" | ../../src/bin/normalizer_main --config=sparrowhawk_configuration.ascii_proto outputs only un and never "una" or "uno"

bonham79 · 2022-03-08T20:42:28Z

@erastorgueva-nv Was this for just through normalize.py or also the case for Sparrowhawk deployment?

@bonham79 Hi Travis, when I wrote that comment I had only seen it in normalize.py but I have since tried it in Sparrowhawk and I see the same results, i.e. echo "1" | ../../src/bin/normalizer_main --config=sparrowhawk_configuration.ascii_proto outputs only un and never "una" or "uno"

Ah! I will have time to check tomorrow to see what's up.

bonham79 · 2022-03-09T01:51:12Z

@yzhang123 @erastorgueva-nv I ran rewrite.optimal_rewrites on the error case pointed out. Both 'un' and 'una' showed up in that output which means pynini is weighting them the same. For cases of 201,121, or 200 sparrowhawk prefers doscientas una, ciento ventiuna, and doscientas respectively, while nemo_text_processing prefers the masculine versions. So it doesn't seem to be affecting the gender switching in general.

Looking at the rewrite function outputs, the only difference I'm seeing is that is switches list order. e.g.

101 -> [ciento un, ciento una]
121 -> [ciento veintiuna, ciento veintiun]

For normalize.py this would mean shortest_path is equivalent to the 0 index in the first case and the 1 index in the second, which seems to be the arbitrary behavior that's desired. So it's looking like this is more just a quirk of implementation than a bug.

Also for decimals you'll see preference for masculines on nemo_text_processing and feminine for sparrowhawk. e.g.

1,1 -> uno coma un
1,1 una coma una

Thoughts?

…or multiples of hundred thousand. Signed-off-by: Bonham79 <[email protected]>

lgtm-com · 2022-03-09T21:07:12Z

This pull request introduces 1 alert when merging ce9c30f into ad2a730 - view on LGTM.com

new alerts:

1 for Unused import

Signed-off-by: Bonham79 <[email protected]>

…tn_es_update

yzhang123 · 2022-03-10T17:56:09Z

@yzhang123 @erastorgueva-nv I ran rewrite.optimal_rewrites on the error case pointed out. Both 'un' and 'una' showed up in that output which means pynini is weighting them the same. For cases of 201,121, or 200 sparrowhawk prefers doscientas una, ciento ventiuna, and doscientas respectively, while nemo_text_processing prefers the masculine versions. So it doesn't seem to be affecting the gender switching in general.

Looking at the rewrite function outputs, the only difference I'm seeing is that is switches list order. e.g.
101 -> [ciento un, ciento una]
121 -> [ciento veintiuna, ciento veintiun]
For normalize.py this would mean shortest_path is equivalent to the 0 index in the first case and the 1 index in the second, which seems to be the arbitrary behavior that's desired. So it's looking like this is more just a quirk of implementation than a bug.

Also for decimals you'll see preference for masculines on nemo_text_processing and feminine for sparrowhawk. e.g.
1,1 -> uno coma un
1,1 una coma una
Thoughts?

Hey @bonham79 I'm not sure if I understand correctly since i do not know spanish. But if there is only one correct gender, please use that! don't abstract away from that and make mistakes in the grammar. I do not know if you know German @bonham79 , but for ordinals the suffix needs to match the gender of the following noun. E.g. "third[suffix] woman": suffix need to match the female noun following it. Since we don't know the context when we write the grammar I just randomly choose a [suffix]. so it could be that in the end it chooses "der dritte Frau", instead of "die dritte Frau". Does that make sense? In case you know the context, please use the correct gender!

bonham79 · 2022-03-10T18:55:34Z

@yzhang123 Sorry the examples were for @erastorgueva-nv so didn't fully explain. This is for the case of when it needs to randomly assign gender just like the case you gave for German. It appears that nemo_text_processing is only choosing one of the genders instead of being random, but I checked the path weights and it looks like pynini is weighting both genders the same. (rewrite.optimal_rewirtes only output strings of same weight.) So it looks like there's just an implementation quirk where it's selecting the masculine form for 1 and 101 each time despite both masculine and feminine forms being weighted the same in the FST. This doesn't seem to be endemic since there's variation for 201 and 21 (both masculine forms and feminine forms pop up if you run through nemo_text_processing and sparrowhawk, respectively), which follow the same conversion rules.

For cases where the gender can be assumed (time, money) I didn't make any changes - save for a small bug I found in the measure tagger last night.

yzhang123 · 2022-03-10T19:07:16Z

@yzhang123 Sorry the examples were for @erastorgueva-nv so didn't fully explain. This is for the case of when it needs to randomly assign gender just like the case you gave for German. It appears that nemo_text_processing is only choosing one of the genders instead of being random, but I checked the path weights and it looks like pynini is weighting both genders the same. (rewrite.optimal_rewirtes only output strings of same weight.) So it looks like there's just an implementation quirk where it's selecting the masculine form for 1 and 101 each time despite both masculine and feminine forms being weighted the same in the FST. This doesn't seem to be endemic since there's variation for 201 and 21 (both masculine forms and feminine forms pop up if you run through nemo_text_processing and sparrowhawk, respectively), which follow the same conversion rules.

For cases where the gender can be assumed (time, money) I didn't make any changes - save for a small bug I found in the measure tagger last night.

Hi @bonham79 as long as the weights are the same, its fine. If you get mismatches from pytest and sparrowhawk due to different paths they pick, just adjust/or disable that test. We assume if we have same weights the output is not well defined. If you still want to test that all genders are considered equally, you can do so by add all gender options here https://github.com/NVIDIA/NeMo/blob/main/tests/nemo_text_processing/en/test_normalization_with_audio.py#L29 for english

erastorgueva-nv · 2022-03-14T18:48:52Z

Hi all, it seems to me that currently the TN code does not meet our requirement of the gender being random when we do normalization. For normalize.py, I always get a masculine gender output, and for SparrowHawk I always get a feminine gender output (except for 1 & 101). So if a user always uses normalize.py, they will always get a masculine gender result, and if a user uses SparrowHawk, they will always get a feminine gender result (except for 1 & 101). This is not what we want.

Specific results that I get are listed below. Note that I always get the same results each time I run normalize.py or SparrowHawk.

Input	`normalize.py` output [gender]	SparrowHawk output [gender]
1	"un" [masc.]	"un" [masc.]
101	"ciento un" [masc.]	"ciento un" [masc.]
21	"ventiún" [masc.]	"veintiuna" [fem.]
121	"ciento veintiún" [masc.]	"ciento veintiuna" [fem.]
200	" doscientos" [masc.]	"doscientas" [fem.]
201	"doscientos un" [masc.]	"doscientas una" [fem.]
300	"trescientos" [masc.]	"trescientas" [fem.]
400	"cuatrocientos" [masc.]	"cuatrocientas" [fem.]
500	"quinientos" [masc.]	"quinientas" [fem.]

bonham79 · 2022-03-14T19:28:43Z

@erastorgueva-nv It was my understanding that by 'random' we were referring to making all valid outputs have equivalent weights and letting pynini/sparrowhawk manage the decision. For instance, running German ordinals as in the case @yzhang123 mentioned gives me the same case ending each time. ("die vierTE frau", "die siebTE frau", "die drei hundert zweiTE frau") even with multiple suffixes being viable and an FST being newly instantiated each time. I can just switch in a random number generator to make it truly random but need clarification on what the ideal is.

yzhang123 · 2022-03-14T20:21:33Z

@erastorgueva-nv It was my understanding that by 'random' we were referring to making all valid outputs have equivalent weights and letting pynini/sparrowhawk manage the decision. For instance, running German ordinals as in the case @yzhang123 mentioned gives me the same case ending each time. ("die vierTE frau", "die siebTE frau", "die drei hundert zweiTE frau") even with multiple suffixes being viable and an FST being newly instantiated each time. I can just switch in a random number generator to make it truly random but need clarification on what the ideal is.

if you can do truly random easily that is was we want. But it seems like a platform implementation detail rather than what we can define. is it possible in Sparrowhawk? Is the truly randomization a feature build in FST graph? if not, sparrowhawk won't be able to do true randomization

bonham79 · 2022-03-14T20:26:38Z

if you can do truly random easily that is was we want. But it seems like a platform implementation detail rather than what we can define. is it possible in Sparrowhawk? Is the truly randomization a feature build in FST graph? if not, sparrowhawk won't be able to do true randomization

It wouldn't be random in the FST since (I believe) Sparrowhawk's shortest path algorithm is deterministic. What I can do is simply randomize per FST instantiation by a random number generator. So each time the FST is built there would be a chance decision to go one gender or another.

yzhang123 · 2022-03-14T22:25:48Z

if you can do truly random easily that is was we want. But it seems like a platform implementation detail rather than what we can define. is it possible in Sparrowhawk? Is the truly randomization a feature build in FST graph? if not, sparrowhawk won't be able to do true randomization

It wouldn't be random in the FST since (I believe) Sparrowhawk's shortest path algorithm is deterministic. What I can do is simply randomize per FST instantiation by a random number generator. So each time the FST is built there would be a chance decision to go one gender or another.

that's overkill, let's not do that.

erastorgueva-nv

Looks good to me! Thank you very much for your hard work Travis @bonham79.
I spoke to Yang yesterday about what are our requirements for this PR and realised that the code meets the requirements. The important thing is that pynini weighs the different options the same (as you have said it does). The fact that nemo_text_processing prefers masculines and the fact that sparrowhawk prefers feminines is a quirk of implementation (as you say).

#3759) * Making gender assignment random for cardinals, fractions, and decimals. Also making fixes to measure so sounds more fluent. Signed-off-by: Bonham79 <[email protected]> * Style fixes Signed-off-by: Bonham79 <[email protected]> * Import fix. Signed-off-by: bonham79 <[email protected]> * Files missed style check after import fix Signed-off-by: Bonham79 <[email protected]> * Fixing bug in measure .py which was preventing gender carrying over for multiples of hundred thousand. Signed-off-by: Bonham79 <[email protected]> * Missed an import Signed-off-by: Bonham79 <[email protected]> Co-authored-by: Elena Rastorgueva <[email protected]> Co-authored-by: Yang Zhang <[email protected]>

bonham79 and others added 3 commits February 25, 2022 15:23

Making gender assignment random for cardinals, fractions, and decimal…

d7feee9

…s. Also making fixes to measure so sounds more fluent. Signed-off-by: Bonham79 <[email protected]>

Style fixes

926715b

Signed-off-by: Bonham79 <[email protected]>

Merge branch 'main' into tn_es_update

245adb8

Import fix.

e5c4ccc

Signed-off-by: bonham79 <[email protected]>

Files missed style check after import fix

37ae78e

Signed-off-by: Bonham79 <[email protected]>

yzhang123 requested review from erastorgueva-nv and yzhang123 February 28, 2022 18:45

erastorgueva-nv and others added 3 commits March 1, 2022 15:00

Merge branch 'main' into tn_es_update

22b12ec

Merge branch 'main' into tn_es_update

d52dfba

Merge branch 'main' into tn_es_update

e69ea10

Merge branch 'main' into tn_es_update

aa694e4

Merge branch 'main' into tn_es_update

7b172db

bonham79 and others added 2 commits March 9, 2022 15:47

Fixing bug in measure .py which was preventing gender carrying over f…

63070b3

…or multiples of hundred thousand. Signed-off-by: Bonham79 <[email protected]>

Merge branch 'main' into tn_es_update

ce9c30f

bonham79 added 2 commits March 9, 2022 16:12

Missed an import

5b984b6

Signed-off-by: Bonham79 <[email protected]>

Merge branch 'tn_es_update' of https://github.com/bonham79/NeMo into …

e6b3e06

…tn_es_update

Merge branch 'main' into tn_es_update

185bca0

Merge branch 'main' into tn_es_update

c57baa1

bonham79 and others added 2 commits March 12, 2022 11:51

Merge branch 'main' into tn_es_update

c949bac

Merge branch 'main' into tn_es_update

b65f861

Merge branch 'main' into tn_es_update

d371715

bonham79 and others added 2 commits March 16, 2022 10:31

Merge branch 'main' into tn_es_update

91b644d

Merge branch 'main' into tn_es_update

9a0a9ef

yzhang123 approved these changes Mar 17, 2022

View reviewed changes

erastorgueva-nv approved these changes Mar 18, 2022

View reviewed changes

yzhang123 merged commit 1454fdd into NVIDIA:main Mar 18, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Making gender assignment random for cardinals, fractions, and decimal… #3759

Making gender assignment random for cardinals, fractions, and decimal… #3759

bonham79 commented Feb 25, 2022

lgtm-com bot commented Feb 25, 2022

bonham79 commented Feb 26, 2022

yzhang123 commented Feb 28, 2022

erastorgueva-nv commented Mar 7, 2022 •

edited

Loading

bonham79 commented Mar 7, 2022 •

edited

Loading

erastorgueva-nv commented Mar 8, 2022

bonham79 commented Mar 8, 2022

bonham79 commented Mar 9, 2022 •

edited

Loading

lgtm-com bot commented Mar 9, 2022

yzhang123 commented Mar 10, 2022 •

edited

Loading

bonham79 commented Mar 10, 2022

yzhang123 commented Mar 10, 2022 •

edited

Loading

erastorgueva-nv commented Mar 14, 2022

bonham79 commented Mar 14, 2022

yzhang123 commented Mar 14, 2022

bonham79 commented Mar 14, 2022

yzhang123 commented Mar 14, 2022

erastorgueva-nv left a comment

Making gender assignment random for cardinals, fractions, and decimal… #3759

Making gender assignment random for cardinals, fractions, and decimal… #3759

Conversation

bonham79 commented Feb 25, 2022

What does this PR do ?

Changelog

Usage

Before your PR is "Ready for review"

Who can review?

Additional Information

lgtm-com bot commented Feb 25, 2022

bonham79 commented Feb 26, 2022

yzhang123 commented Feb 28, 2022

erastorgueva-nv commented Mar 7, 2022 • edited Loading

bonham79 commented Mar 7, 2022 • edited Loading

erastorgueva-nv commented Mar 8, 2022

bonham79 commented Mar 8, 2022

bonham79 commented Mar 9, 2022 • edited Loading

lgtm-com bot commented Mar 9, 2022

yzhang123 commented Mar 10, 2022 • edited Loading

bonham79 commented Mar 10, 2022

yzhang123 commented Mar 10, 2022 • edited Loading

erastorgueva-nv commented Mar 14, 2022

bonham79 commented Mar 14, 2022

yzhang123 commented Mar 14, 2022

bonham79 commented Mar 14, 2022

yzhang123 commented Mar 14, 2022

erastorgueva-nv left a comment

Choose a reason for hiding this comment

erastorgueva-nv commented Mar 7, 2022 •

edited

Loading

bonham79 commented Mar 7, 2022 •

edited

Loading

bonham79 commented Mar 9, 2022 •

edited

Loading

yzhang123 commented Mar 10, 2022 •

edited

Loading

yzhang123 commented Mar 10, 2022 •

edited

Loading