
Reduplicative plural in Indonesian #238

Closed
dan-zeman opened this issue Nov 29, 2015 · 4 comments

Comments

@dan-zeman
Member

Indonesian forms the plural of nouns by repeating the noun. Example: oleh negara-negara nordic “by nordic countries”; negara = “country”. This is called reduplication and it is not limited to Indonesian.

It could be analyzed at the level of morphological features (Number=Plur), but then tokenization would have to keep negara-negara as one token, which is not what we have in our Indonesian treebank. Instead, there are three tokens: negara, -, negara.
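To make the three-token pattern concrete, here is a minimal Python sketch that finds reduplication candidates in a list of word forms. The function name `find_reduplications` is hypothetical, and the sketch assumes the treebank always splits the reduplicated noun as form, hyphen, form; it does not inspect POS tags, so it may overgenerate.

```python
def find_reduplications(forms):
    """Return the start index of every X - X triple in a list of word forms."""
    hits = []
    for i in range(len(forms) - 2):
        first, hyphen, second = forms[i:i + 3]
        # A candidate is the same form on both sides of a lone hyphen token.
        if hyphen == "-" and first.lower() == second.lower():
            hits.append(i)
    return hits


# The example from this issue, tokenized the way the treebank does it.
tokens = ["oleh", "negara", "-", "negara", "nordic"]
print(find_reduplications(tokens))
```

If negara-negara were kept as a single token, the function would find nothing, which is the case where Number=Plur alone would be enough.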

Since the reduplicated part is treated as a separate word (and provided we want to keep it that way), we need a language-specific relation to attach the reduplicated part to the first part. The treebank currently uses mwe but it is wrong because mwe is to be used only with function expressions, such as multi-word prepositions. (Side note: I discovered this by checking that no node has mwe and name children at the same time. Most treebanks comply with this rule, except for 1 occurrence in French, 9 occurrences in Italian and 72 occurrences in Indonesian. See http://hdl.handle.net/11346/PMLTQ-SUPM).
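The consistency check in the side note was run as a tree query (see the link above); purely as an illustration, the same idea can be sketched in Python. The row layout (ID, form, head, relation), the function name `heads_mixing` and the sample sentence are all assumptions made up for this sketch, not the query that was actually used.

```python
from collections import defaultdict


def heads_mixing(rows, rel_a="mwe", rel_b="name"):
    """Return IDs of nodes that have children with both rel_a and rel_b."""
    child_rels = defaultdict(set)
    for tok_id, form, head, deprel in rows:
        # Collect the set of relations under which each head governs children.
        child_rels[head].add(deprel)
    return sorted(h for h, rels in child_rels.items()
                  if rel_a in rels and rel_b in rels)


# Invented sentence: node 1 has both an mwe child and a name child.
rows = [
    (1, "A", 0, "root"),
    (2, "B", 1, "mwe"),
    (3, "C", 1, "name"),
    (4, "D", 3, "mwe"),
]
print(heads_mixing(rows))
```

Node 3 is not reported because it has only an mwe child; only heads that mix the two relations violate the rule.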

What about nmod:plur (an analogy to nmod:poss for 's in English)?

@ftyers
Contributor

ftyers commented Nov 29, 2015

How about using goeswith and marking the first (or last) noun as plural?

1    oleh      PREP     ...    _              2    case
2    negara    NOUN     ...    Number=Plur    0    root
3    -         PUNCT    ...    _              2    punct
4    negara    NOUN     ...    _              2    goeswith
5    nordic    ADJ      ...    _              2    amod

The docs say "This relation links two parts of a word that are separated in text that is not well edited." You could consider this just suboptimal tokenisation.

@manning
Contributor

manning commented Dec 30, 2015

Still lots of details to work out in UD!

I would like to suggest using compound. Probably as a language-specific compound:plur, analogous to the nmod:plur that @dan-zeman suggests. I can see the argument for goeswith, but I originally thought of that relation as being more clearly for mistakes of writing, which isn't really the case here. That is, if we take the 5 relations in the "Compounding and unanalyzed" bucket, name and foreign have fairly clear, specialized semantics. Of the other 3, my take is that goeswith is for a word split in writing, normally as an error (under the original assumption of no spaces in words); mwe is restricted to grammatical expressions that behave like function words; and so other linguistic processes for putting together words are compound. I think this is pretty clearly compound.
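As a rough illustration of what the change would mean for the treebank, the Python sketch below relabels the second copy of a reduplicated noun from mwe to compound:plur. The dictionary layout, the function name `relabel_reduplication` and the sample attachments are assumptions for this sketch (IDs are 1-based, head 0 is the root); the real treebank conversion may need more conditions.

```python
def relabel_reduplication(rows, old="mwe", new="compound:plur"):
    """Relabel `old` dependents whose form repeats the form of their head."""
    forms = {r["id"]: r["form"].lower() for r in rows}
    out = []
    for r in rows:
        r = dict(r)  # copy, so the input rows stay untouched
        if r["deprel"] == old and forms.get(r["head"]) == r["form"].lower():
            r["deprel"] = new
        out.append(r)
    return out


# Assumed current analysis of "oleh negara - negara nordic".
sentence = [
    {"id": 1, "form": "oleh",   "head": 2, "deprel": "case"},
    {"id": 2, "form": "negara", "head": 0, "deprel": "root"},
    {"id": 3, "form": "-",      "head": 2, "deprel": "punct"},
    {"id": 4, "form": "negara", "head": 2, "deprel": "mwe"},
    {"id": 5, "form": "nordic", "head": 2, "deprel": "amod"},
]
for row in relabel_reduplication(sentence):
    print(row["id"], row["form"], row["head"], row["deprel"])
```

Requiring the dependent's form to match its head's form keeps genuine multi-word function expressions labeled mwe.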

p.s. In Indonesian, it's somewhat debated as to whether it is really right to view reduplication as a straightforward plural marker, though, actually, that seems to be increasingly true in modern Indonesian. See, e.g., http://sealang.net/sala/archives/pdf8/rafferty2002reduplication.pdf .

@dan-zeman
Member Author

@nschneid
Contributor

nschneid commented Feb 5, 2017

See also #307, where the decision was also compound.
