Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Making gender assignment random for cardinals, fractions, and decimal… #3759

Merged
merged 21 commits into from
Mar 18, 2022
Merged

Making gender assignment random for cardinals, fractions, and decimal… #3759

merged 21 commits into from
Mar 18, 2022

Conversation

bonham79
Copy link
Contributor

…s. Also making fixes to measure so sounds more fluent.

Signed-off-by: Bonham79 [email protected]

What does this PR do ?

Updates tn_es to make gender randomized.

Add a one line overview of what this PR aims to accomplish.
This will update some minor aspects of tn_es to lower error rate.

Collection: [Nemo Text Processing]

Changelog

  • Cardinal and Decimal have randomized gender assignment
  • Fraction has randomized gender assignment for integer portion
  • Measure now has more fluent normalization of fractional amounts. (Note, due to Sparrowhawk architecture it can sound slightly awkward: 1 1/2 l-> "uno y medio litros" as opposed to more common "uno litro y medio"). Change would require rewriting of how measure verbalizer manages fraction tags.
  • Non-deterministic strings now permit apocope for "una". This arises when the number precedes a noun with stressed "a" sound. Since this is hard to determine which FST setup, this is left to normalize_with_audio to manage.

Usage

Before your PR is "Ready for review"

Pre checks:

  • [y ] Make sure you read and followed Contributor guidelines
  • [n ] Did you write any new necessary tests?
  • [ n] Did you add or update any necessary documentation?
  • [ y] Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • [ y] New Feature
  • [ y] Bugfix
  • Documentation

If you haven't finished some of the above items you can still open "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
Contributor guidelines contains specific people who can review PRs to various areas.

Additional Information

  • Related to # (issue)

bonham79 and others added 3 commits February 25, 2022 15:23
…s. Also making fixes to measure so sounds more fluent.

Signed-off-by: Bonham79 <[email protected]>
Signed-off-by: Bonham79 <[email protected]>
@lgtm-com
Copy link

lgtm-com bot commented Feb 25, 2022

This pull request introduces 1 alert when merging 245adb8 into bc6215f - view on LGTM.com

new alerts:

  • 1 for Unused import

Signed-off-by: bonham79 <[email protected]>
@bonham79
Copy link
Contributor Author

@yzhang123

@yzhang123
Copy link
Contributor

@bonham79 thanks alot for this!
@erastorgueva-nv since the change is very much spanish related, could you please review this this week ? thanks!

@erastorgueva-nv
Copy link
Collaborator

erastorgueva-nv commented Mar 7, 2022

Hi Travis @bonham79, thank you very much for this PR. I think there is a bug somewhere causing only masculine cardinals and decimals to be produced?
e.g. when I run python nemo_text_processing/text_normalization/normalize.py "1" --language es , the output is always un . Could you please investigate?

@bonham79
Copy link
Contributor Author

bonham79 commented Mar 7, 2022

@erastorgueva-nv Was this for just through normalize.py or also the case for Sparrowhawk deployment?

@erastorgueva-nv
Copy link
Collaborator

@erastorgueva-nv Was this for just through normalize.py or also the case for Sparrowhawk deployment?

@bonham79 Hi Travis, when I wrote that comment I had only seen it in normalize.py but I have since tried it in Sparrowhawk and I see the same results, i.e. echo "1" | ../../src/bin/normalizer_main --config=sparrowhawk_configuration.ascii_proto outputs only un and never "una" or "uno"

@bonham79
Copy link
Contributor Author

bonham79 commented Mar 8, 2022

@erastorgueva-nv Was this for just through normalize.py or also the case for Sparrowhawk deployment?

@bonham79 Hi Travis, when I wrote that comment I had only seen it in normalize.py but I have since tried it in Sparrowhawk and I see the same results, i.e. echo "1" | ../../src/bin/normalizer_main --config=sparrowhawk_configuration.ascii_proto outputs only un and never "una" or "uno"

Ah! I will have time to check tomorrow to see what's up.

@bonham79
Copy link
Contributor Author

bonham79 commented Mar 9, 2022

@yzhang123 @erastorgueva-nv I ran rewrite.optimal_rewrites on the error case pointed out. Both 'un' and 'una' showed up in that output which means pynini is weighting them the same. For cases of 201,121, or 200 sparrowhawk prefers doscientas una, ciento ventiuna, and doscientas respectively, while nemo_text_processing prefers the masculine versions. So it doesn't seem to be affecting the gender switching in general.

Looking at the rewrite function outputs, the only difference I'm seeing is that is switches list order. e.g.

101 -> [ciento un, ciento una]
121 -> [ciento veintiuna, ciento veintiun]

For normalize.py this would mean shortest_path is equivalent to the 0 index in the first case and the 1 index in the second, which seems to be the arbitrary behavior that's desired. So it's looking like this is more just a quirk of implementation than a bug.

Also for decimals you'll see preference for masculines on nemo_text_processing and feminine for sparrowhawk. e.g.

1,1 -> uno coma un
1,1 una coma una

Thoughts?

@lgtm-com
Copy link

lgtm-com bot commented Mar 9, 2022

This pull request introduces 1 alert when merging ce9c30f into ad2a730 - view on LGTM.com

new alerts:

  • 1 for Unused import

@yzhang123
Copy link
Contributor

yzhang123 commented Mar 10, 2022

@yzhang123 @erastorgueva-nv I ran rewrite.optimal_rewrites on the error case pointed out. Both 'un' and 'una' showed up in that output which means pynini is weighting them the same. For cases of 201,121, or 200 sparrowhawk prefers doscientas una, ciento ventiuna, and doscientas respectively, while nemo_text_processing prefers the masculine versions. So it doesn't seem to be affecting the gender switching in general.

Looking at the rewrite function outputs, the only difference I'm seeing is that is switches list order. e.g.

101 -> [ciento un, ciento una]
121 -> [ciento veintiuna, ciento veintiun]

For normalize.py this would mean shortest_path is equivalent to the 0 index in the first case and the 1 index in the second, which seems to be the arbitrary behavior that's desired. So it's looking like this is more just a quirk of implementation than a bug.

Also for decimals you'll see preference for masculines on nemo_text_processing and feminine for sparrowhawk. e.g.

1,1 -> uno coma un
1,1 una coma una

Thoughts?

Hey @bonham79 I'm not sure if I understand correctly since i do not know spanish. But if there is only one correct gender, please use that! don't abstract away from that and make mistakes in the grammar. I do not know if you know German @bonham79 , but for ordinals the suffix needs to match the gender of the following noun. E.g. "third[suffix] woman": suffix need to match the female noun following it. Since we don't know the context when we write the grammar I just randomly choose a [suffix]. so it could be that in the end it chooses "der dritte Frau", instead of "die dritte Frau". Does that make sense? In case you know the context, please use the correct gender!

@bonham79
Copy link
Contributor Author

@yzhang123 Sorry the examples were for @erastorgueva-nv so didn't fully explain. This is for the case of when it needs to randomly assign gender just like the case you gave for German. It appears that nemo_text_processing is only choosing one of the genders instead of being random, but I checked the path weights and it looks like pynini is weighting both genders the same. (rewrite.optimal_rewirtes only output strings of same weight.) So it looks like there's just an implementation quirk where it's selecting the masculine form for 1 and 101 each time despite both masculine and feminine forms being weighted the same in the FST. This doesn't seem to be endemic since there's variation for 201 and 21 (both masculine forms and feminine forms pop up if you run through nemo_text_processing and sparrowhawk, respectively), which follow the same conversion rules.

For cases where the gender can be assumed (time, money) I didn't make any changes - save for a small bug I found in the measure tagger last night.

@yzhang123
Copy link
Contributor

yzhang123 commented Mar 10, 2022

@yzhang123 Sorry the examples were for @erastorgueva-nv so didn't fully explain. This is for the case of when it needs to randomly assign gender just like the case you gave for German. It appears that nemo_text_processing is only choosing one of the genders instead of being random, but I checked the path weights and it looks like pynini is weighting both genders the same. (rewrite.optimal_rewirtes only output strings of same weight.) So it looks like there's just an implementation quirk where it's selecting the masculine form for 1 and 101 each time despite both masculine and feminine forms being weighted the same in the FST. This doesn't seem to be endemic since there's variation for 201 and 21 (both masculine forms and feminine forms pop up if you run through nemo_text_processing and sparrowhawk, respectively), which follow the same conversion rules.

For cases where the gender can be assumed (time, money) I didn't make any changes - save for a small bug I found in the measure tagger last night.

Hi @bonham79 as long as the weights are the same, its fine. If you get mismatches from pytest and sparrowhawk due to different paths they pick, just adjust/or disable that test. We assume if we have same weights the output is not well defined. If you still want to test that all genders are considered equally, you can do so by add all gender options here https://github.com/NVIDIA/NeMo/blob/main/tests/nemo_text_processing/en/test_normalization_with_audio.py#L29 for english

@erastorgueva-nv
Copy link
Collaborator

Hi all, it seems to me that currently the TN code does not meet our requirement of the gender being random when we do normalization. For normalize.py, I always get a masculine gender output, and for SparrowHawk I always get a feminine gender output (except for 1 & 101). So if a user always uses normalize.py, they will always get a masculine gender result, and if a user uses SparrowHawk, they will always get a feminine gender result (except for 1 & 101). This is not what we want.

Specific results that I get are listed below. Note that I always get the same results each time I run normalize.py or SparrowHawk.

Input normalize.py output [gender] SparrowHawk output [gender]
1 "un" [masc.] "un" [masc.]
101 "ciento un" [masc.] "ciento un" [masc.]
21 "ventiún" [masc.] "veintiuna" [fem.]
121 "ciento veintiún" [masc.] "ciento veintiuna" [fem.]
200 " doscientos" [masc.] "doscientas" [fem.]
201 "doscientos un" [masc.] "doscientas una" [fem.]
300 "trescientos" [masc.] "trescientas" [fem.]
400 "cuatrocientos" [masc.] "cuatrocientas" [fem.]
500 "quinientos" [masc.] "quinientas" [fem.]

@bonham79
Copy link
Contributor Author

@erastorgueva-nv It was my understanding that by 'random' we were referring to making all valid outputs have equivalent weights and letting pynini/sparrowhawk manage the decision. For instance, running German ordinals as in the case @yzhang123 mentioned gives me the same case ending each time. ("die vierTE frau", "die siebTE frau", "die drei hundert zweiTE frau") even with multiple suffixes being viable and an FST being newly instantiated each time. I can just switch in a random number generator to make it truly random but need clarification on what the ideal is.

@yzhang123
Copy link
Contributor

@erastorgueva-nv It was my understanding that by 'random' we were referring to making all valid outputs have equivalent weights and letting pynini/sparrowhawk manage the decision. For instance, running German ordinals as in the case @yzhang123 mentioned gives me the same case ending each time. ("die vierTE frau", "die siebTE frau", "die drei hundert zweiTE frau") even with multiple suffixes being viable and an FST being newly instantiated each time. I can just switch in a random number generator to make it truly random but need clarification on what the ideal is.

if you can do truly random easily that is was we want. But it seems like a platform implementation detail rather than what we can define. is it possible in Sparrowhawk? Is the truly randomization a feature build in FST graph? if not, sparrowhawk won't be able to do true randomization

@bonham79
Copy link
Contributor Author

if you can do truly random easily that is was we want. But it seems like a platform implementation detail rather than what we can define. is it possible in Sparrowhawk? Is the truly randomization a feature build in FST graph? if not, sparrowhawk won't be able to do true randomization

It wouldn't be random in the FST since (I believe) Sparrowhawk's shortest path algorithm is deterministic. What I can do is simply randomize per FST instantiation by a random number generator. So each time the FST is built there would be a chance decision to go one gender or another.

@yzhang123
Copy link
Contributor

if you can do truly random easily that is was we want. But it seems like a platform implementation detail rather than what we can define. is it possible in Sparrowhawk? Is the truly randomization a feature build in FST graph? if not, sparrowhawk won't be able to do true randomization

It wouldn't be random in the FST since (I believe) Sparrowhawk's shortest path algorithm is deterministic. What I can do is simply randomize per FST instantiation by a random number generator. So each time the FST is built there would be a chance decision to go one gender or another.

that's overkill, let's not do that.

Copy link
Collaborator

@erastorgueva-nv erastorgueva-nv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me! Thank you very much for your hard work Travis @bonham79.
I spoke to Yang yesterday about what are our requirements for this PR and realised that the code meets the requirements. The important thing is that pynini weighs the different options the same (as you have said it does). The fact that nemo_text_processing prefers masculines and the fact that sparrowhawk prefers feminines is a quirk of implementation (as you say).

@yzhang123 yzhang123 merged commit 1454fdd into NVIDIA:main Mar 18, 2022
fayejf pushed a commit that referenced this pull request Mar 22, 2022
#3759)

* Making gender assignment random for cardinals, fractions, and decimals. Also making fixes to measure so sounds more fluent.

Signed-off-by: Bonham79 <[email protected]>

* Style fixes

Signed-off-by: Bonham79 <[email protected]>

* Import fix.

Signed-off-by: bonham79 <[email protected]>

* Files missed style check after import fix

Signed-off-by: Bonham79 <[email protected]>

* Fixing bug in measure .py which was preventing gender carrying over for multiples of hundred thousand.

Signed-off-by: Bonham79 <[email protected]>

* Missed an import

Signed-off-by: Bonham79 <[email protected]>

Co-authored-by: Elena Rastorgueva <[email protected]>
Co-authored-by: Yang Zhang <[email protected]>
fayejf pushed a commit that referenced this pull request Mar 22, 2022
#3759)

* Making gender assignment random for cardinals, fractions, and decimals. Also making fixes to measure so sounds more fluent.

Signed-off-by: Bonham79 <[email protected]>

* Style fixes

Signed-off-by: Bonham79 <[email protected]>

* Import fix.

Signed-off-by: bonham79 <[email protected]>

* Files missed style check after import fix

Signed-off-by: Bonham79 <[email protected]>

* Fixing bug in measure .py which was preventing gender carrying over for multiples of hundred thousand.

Signed-off-by: Bonham79 <[email protected]>

* Missed an import

Signed-off-by: Bonham79 <[email protected]>

Co-authored-by: Elena Rastorgueva <[email protected]>
Co-authored-by: Yang Zhang <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants