Gender normalization (localization) #20

Closed
skalee opened this issue Jan 5, 2021 · 11 comments
Labels
question Further information is requested

Comments

@skalee

skalee commented Jan 5, 2021

@ronaldtse I got a couple of questions. See IEV 102-04-22 on old Electropedia.

  1. The Serbian entry (апсциса, <дуж криве> ж јд) has gender ж јд, which probably means "feminine singular". We surely need to display it this way, but how should we represent it in data: as ж јд, or normalized as f?
  2. The Dutch entry (abscis, m/f) has gender m/f, which means "masculine with optional feminine" (distinct from masculine, feminine, or neuter). We surely need to display it this way, but how should we represent it in data: as m/f, or is there another common notation for mixed masculine-feminine genders like this one?

Also note that there may be more genders or similar categories. For example, in some Slavic languages (Czech, Slovene) nouns are further divided into animate and inanimate ones. I am not sure how important that is, but Wiktionary denotes it next to the gender (see this).
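
One way to approach both questions is to store the source notation verbatim for display and keep a normalized value next to it. The following is a minimal TypeScript sketch, not the Glossarist data model: the names `GrammarInfo`, `LABELS`, and `normalizeGender` are hypothetical, the lookup table is illustrative only, reading ж јд as feminine singular is the guess stated above, and modelling m/f as a list of two genders is just one possible choice.

```typescript
// Hypothetical types: normalized values live next to the untouched source label.
type Gender = "masculine" | "feminine" | "neuter" | "common";
type GrammaticalNumber = "singular" | "plural";

interface NormalizedGrammar {
  genders: Gender[];          // more than one entry models labels such as "m/f"
  number?: GrammaticalNumber; // only set when the source label states it
}

interface GrammarInfo extends NormalizedGrammar {
  sourceLabel: string;        // exactly as printed in the source, used for display
}

// Illustrative lookup table; the entries are assumptions, not a vetted list.
const LABELS: Record<string, NormalizedGrammar> = {
  "ж јд": { genders: ["feminine"], number: "singular" },
  "m/f": { genders: ["masculine", "feminine"] },
  "m": { genders: ["masculine"] },
  "f": { genders: ["feminine"] },
  "n": { genders: ["neuter"] },
};

// Returns undefined for labels the table does not know, so that callers
// can fall back to showing the raw label instead of guessing.
function normalizeGender(sourceLabel: string): GrammarInfo | undefined {
  const key = sourceLabel.trim();
  const known = Object.prototype.hasOwnProperty.call(LABELS, key)
    ? LABELS[key]
    : undefined;
  if (known === undefined) return undefined;
  return { sourceLabel, ...known };
}
```

With this shape the interface can always print `sourceLabel`, while search and filtering work on `genders` and `number`.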

@skalee
Author

skalee commented Jan 6, 2021

Furthermore, some languages may have a different set of genders for singular and plural. One example is Polish, in which most linguists distinguish 5 genders: 3 for singular (masculine, feminine and neuter) and 2 for plural (virile and non-virile). Note, though, that some linguists prefer different classifications; for example, Polish entries in IEV use an older approach with masculine and feminine genders in the plural (102-03-13).

My conclusion is that it will be difficult to develop a discrete set of genders that works for every language and every project. Perhaps we should allow arbitrary genders, but I'm not sure whether Glossarist Desktop supports that. Perhaps we should be even more flexible and describe terms with an array of arbitrary grammar classifiers rather than separate fields for gender, plurality, etc.
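
The "array of arbitrary grammar classifiers" idea could be sketched as follows. This is an assumption-laden illustration rather than an existing Glossarist structure: `GrammarClassifier`, `Designation`, and `classifierValues` are hypothetical names, and the category strings are free-form on purpose.

```typescript
// Hypothetical open-ended classifier: no fixed enum of genders or numbers.
interface GrammarClassifier {
  category: string; // e.g. "gender", "number", "animacy"
  value: string;    // e.g. "virile", "singular", "inanimate"
}

// Hypothetical term record carrying any number of classifiers.
interface Designation {
  term: string;
  language: string; // language code, assumed here to be ISO 639-3
  grammar: GrammarClassifier[];
}

// Collects every value recorded for one category, in source order.
function classifierValues(d: Designation, category: string): string[] {
  return d.grammar
    .filter((c) => c.category === category)
    .map((c) => c.value);
}

// The Dutch entry from IEV 102-04-22, with m/f expressed as two classifiers.
const abscis: Designation = {
  term: "abscis",
  language: "nld",
  grammar: [
    { category: "gender", value: "masculine" },
    { category: "gender", value: "feminine" },
  ],
};
```

The trade-off is that nothing stops two projects from spelling the same category differently, so some shared vocabulary or validation would still be needed on top.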

@strogonoff

For what it’s worth, here is how grammatical properties of nouns are typed in the Glossarist model:

https://github.com/glossarist/glossarist-desktop/blob/4105c7a2b2b1f5085c748af3ce0fdb27fd7e3149/src/models/concepts.ts#L188

  • Common and neuter genders are supported.
  • Grammatical number (plural/singular) and gender are separate.

Not sure whether this helps or what you are trying to achieve; I just saw this issue in my notifications.

@skalee
Author

skalee commented Jan 9, 2021

What does "common" gender stand for? Is it kinda "not applicable" or "unspecified"? Or maybe it's kinda "masculine or feminine, but not neuter"?

> Not sure what you are trying to achieve.

I'm trying to achieve something more flexible, as there are languages which have more than three genders. For example, in the context of IEV, Dutch has m, f, n, and m/f.

@strogonoff

I recommend using fully qualified gender names instead of one-letter abbreviations to reduce ambiguity.

For linguistic background of neuter/common see e.g. https://en.wikipedia.org/wiki/Grammatical_gender

@skalee
Author

skalee commented Jan 9, 2021

> For linguistic background of neuter/common see e.g. https://en.wikipedia.org/wiki/Grammatical_gender

Thanks! It explains everything.

> I recommend using fully qualified gender names instead of one-letter abbreviations to reduce ambiguity.

I'm okay with either option.

Still, I'm not sure whether a set of just four genders will be future-proof. For example, some languages distinguish animate and inanimate nouns, and most vocabularies display that next to the gender because it's useful for users. Moreover, some languages (e.g. Polish) distinguish different genders in the singular (masculine, feminine, neuter) and in the plural (virile, non-virile). These two extra plural genders can be represented internally as masculine and feminine, and that's probably technically correct, but at some point I guess we'll have to do some mapping in the interface of both Geolexica and Glossarist Desktop so that more appropriate wording is used.

That said, what you proposed should be enough in the context of IEV and I'm okay with that.
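
The interface-level mapping mentioned above could look roughly like this. It is a sketch under assumptions: `DISPLAY_OVERRIDES` and `displayGender` are hypothetical names, the language code `pol` assumes ISO 639-3, and the only override shown is the Polish plural case described in this comment.

```typescript
type Gender = "masculine" | "feminine" | "neuter" | "common";
type GrammaticalNumber = "singular" | "plural";

// Hypothetical per-language display overrides, keyed by "<number>:<gender>".
// Internally the plural genders stay masculine/feminine; only the label changes.
const DISPLAY_OVERRIDES: Record<string, Record<string, string>> = {
  pol: {
    "plural:masculine": "virile",
    "plural:feminine": "non-virile",
  },
};

// Returns the label to show to users, falling back to the stored gender name.
function displayGender(
  language: string,
  gender: Gender,
  number: GrammaticalNumber
): string {
  const forLanguage = DISPLAY_OVERRIDES[language];
  const override = forLanguage ? forLanguage[`${number}:${gender}`] : undefined;
  return override ?? gender;
}
```

Keeping the override table separate from the stored data means that a vocabulary preferring the older masculine/feminine plural labels can simply omit the Polish entry.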

@strogonoff

An animate/inanimate property could be added if needed, but like you say, for the glossaries we deal with it may not be relevant.

Generally, in linguistics there are different competing ways of classifying verbal expressions. Control bodies can disagree with each other on which one to use. Also, they always evolve.

I think user-configurable versioned schemas (like what we are trying to do with the generic registry schema) are the way to go. Some vocabularies may need more finely detailed grammatical properties, but for others those properties may not matter.
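
As a rough illustration of what a user-configurable, versioned schema for grammatical properties might mean in practice, here is a minimal TypeScript sketch. It does not describe the generic registry schema or any Glossarist API: `GrammarSchema`, `basicSchema`, `extendedSchema`, and `validateGrammar` are hypothetical names, and the version strings are made up.

```typescript
// Hypothetical schema: each vocabulary declares which grammatical
// properties it uses and which values each property may take.
interface GrammarSchema {
  version: string;
  properties: Record<string, readonly string[]>;
}

// A vocabulary that only needs gender and number.
const basicSchema: GrammarSchema = {
  version: "1.0.0",
  properties: {
    gender: ["masculine", "feminine", "neuter", "common"],
    number: ["singular", "plural"],
  },
};

// A later schema version for a vocabulary that also tracks animacy.
const extendedSchema: GrammarSchema = {
  version: "2.0.0",
  properties: {
    gender: ["masculine", "feminine", "neuter", "common"],
    number: ["singular", "plural"],
    animacy: ["animate", "inanimate"],
  },
};

// Returns a list of problems; an empty list means the record fits the schema.
function validateGrammar(
  schema: GrammarSchema,
  grammar: Record<string, string>
): string[] {
  const errors: string[] = [];
  for (const property of Object.keys(grammar)) {
    const allowed = Object.prototype.hasOwnProperty.call(schema.properties, property)
      ? schema.properties[property]
      : undefined;
    if (allowed === undefined) {
      errors.push(`unknown property "${property}" in schema ${schema.version}`);
    } else if (allowed.indexOf(grammar[property]) === -1) {
      errors.push(`"${grammar[property]}" is not an allowed value for ${property}`);
    }
  }
  return errors;
}
```

Under this sketch, a record with `animacy` is rejected by `basicSchema` but accepted by `extendedSchema`, which is the kind of per-vocabulary difference described above.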

@skalee
Author

skalee commented Jan 10, 2021

> Generally, in linguistics there are different competing ways of classifying verbal expressions. Control bodies can disagree with each other on which one to use. Also, they always evolve.

Indeed, this is my primary concern too. But after your clarifications, what we adopted seems to be enough for now; at least I haven't found any outstanding case yet. Closing?

@strogonoff

strogonoff commented Jan 10, 2021 via email

@skalee skalee closed this as completed Jan 10, 2021
@ronaldtse
Member

> I think user-configurable versioned schemas (like what we are trying to do with the generic registry schema) are the way to go. Some vocabularies may need more finely detailed grammatical properties, but for others those properties may not matter.

Agreed. It is difficult to get different control bodies to agree on an identical set of grammatical genders, so leaving it customizable is easiest for now.

@skalee
Author

skalee commented Jan 11, 2021

BTW, what's "generic registry schema"? I'm certainly not on the same page here.

@strogonoff

It’s the data schema used by a registry editor GUI currently in development. It doesn’t clash with the concept model described here; they are different things.
