Lexical refinement on edeprels #287

Open
nschneid opened this issue Jan 8, 2022 · 27 comments

@nschneid
Contributor

nschneid commented Jan 8, 2022

The inventory of possible lexical markers on nmod, obl, acl, advcl, and conj Enhanced relations is now specified for the validator at: https://quest.ms.mff.cuni.cz/udvalidator/cgi-bin/unidep/langspec/specify_edeprel.pl?lcode=en

There are a number of errors in EWT and GUM, some of which require tweaking of the inventory, and others of which should be changed in the data. Let's use this issue to track the discussion.
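
To make the kind of check under discussion concrete, here is a minimal sketch of testing an enhanced relation against a per-relation inventory of lexical markers. The `ALLOWED_MARKERS` table and the `check_edeprel` function are hypothetical illustrations, not the validator's actual code or its full English list, and the sketch ignores non-lexical subtypes such as `acl:relcl`.

```python
# Hypothetical, abbreviated inventory; the real list is maintained in the
# validator's language-specific configuration and is much longer.
ALLOWED_MARKERS = {
    "nmod": {"of", "versus"},
    "obl": {"in", "till"},
    "acl": {"to"},
    "advcl": {"if", "in_order"},
    "conj": {"and", "or", "as_well_as", "rather_than"},
}


def check_edeprel(edeprel: str) -> bool:
    """Return True if the lexical refinement, if any, is registered."""
    base, _, marker = edeprel.partition(":")
    if base not in ALLOWED_MARKERS:
        return True  # only the five refinable relations are checked here
    if not marker:
        return True  # a bare relation carries no refinement to check
    return marker in ALLOWED_MARKERS[base]


for rel in ["obl:till", "obl:til", "conj:as_well_as", "conj:as_well"]:
    print(rel, check_edeprel(rel))
```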

@nschneid
Contributor Author

nschneid commented Jan 8, 2022

@dan-zeman When I try to allow conj:rather_than it says I need to specify a coordinating function. Can "instead of something" be considered a (sometimes) coordinating function? (see #182)

@nschneid
Contributor Author

nschneid commented Jan 8, 2022

For "in order {to, for...to, that}", fixed only covers the "in order" part. Should the edeprel be advcl:in_order (as currently in EWT), or should it incorporate the marker? GUM has advcl:in_order_to and advcl:in_order_for, or the latter could be advcl:in_order_for_to.

I see that for_to is included in the list, and used in GUM but not EWT.

@dan-zeman
Member

I am wondering whether this issue should reside under docs, as it is not just about EWT.

@dan-zeman
Member

@dan-zeman When I try to allow conj:rather_than it says I need to specify a coordinating function. Can "instead of something" be considered a (sometimes) coordinating function? (see #182)

I actually answered there (#182 (comment)) and now I see that you moved the question here. So here is a copy:

When I try to allow conj:rather_than it says I need to specify a coordinating function. Can "instead of something" be considered a (sometimes) coordinating function?

The last set of functions are intended for use with conj:

Paratactic relation

Conjunction (“and”)
Negative conjunction (“neither … nor”)
Disjunction (“or”)
Adversative (“but, yet”)
Inferential-reason (“for”)
Inferential-consequence (“so”)

Maybe we can say it's a disjunction? I thought it was a subordinator and
using it with conj was an error. But I was not aware of this issue.

@dan-zeman
Member

dan-zeman commented Jan 8, 2022

I see that @nschneid has added where with the example "I know [where you live]" but I would argue that at least in this sentence, where is an adverb and it should be attached via advmod to live, so it should not be propagated to the edeprel. I recall that the analysis of where has been discussed somewhere but I did not find the issue now (I don't know whether it is under UD_English-EWT, under docs, or somewhere else). EDIT: Found it here: #88.

IMHO the same problem applies to wherever (2 occurrences in EWT, one as ADV, one as SCONJ, without any actual difference; 2 occurrences in GUM, both in wherever possible, both treated as SCONJ; I believe all of them should be ADV).

IMHO the same problem applies to whither (1 occurrence in GUM, should be ADV).

@dan-zeman
Member

dan-zeman commented Jan 8, 2022

In the pattern no choice but to do something, there is an acl attached to choice, with two markers, but and to. There are 2 occurrences in EWT (http://hdl.handle.net/11346/PMLTQ-QQAW) and one in GUM (http://hdl.handle.net/11346/PMLTQ-7RRP). For the enhanced relation, GUM uses acl:but_to, which is the mechanical default when multiple markers are present. However, here I actually like the approach of EWT, which uses only acl:to — IMHO a better indication that there is an infinitival adnominal clause. In any case, it should be harmonized (right now acl:but_to is not registered, leading to an error in GUM).
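
A minimal sketch of the "mechanical default" mentioned above, assuming it simply joins the lowercased marker lemmas with underscores in surface order. The function name `default_edeprel` is hypothetical and is not taken from any UD tool.

```python
def default_edeprel(deprel: str, marker_lemmas: list[str]) -> str:
    """Append all marker lemmas to the relation, joined by underscores."""
    if not marker_lemmas:
        return deprel  # no case/mark dependents: keep the bare relation
    return deprel + ":" + "_".join(lemma.lower() for lemma in marker_lemmas)


# "no choice but to do something": both markers kept (the GUM label)
print(default_edeprel("acl", ["but", "to"]))  # acl:but_to
# Keeping only the infinitive marker yields the EWT label
print(default_edeprel("acl", ["to"]))  # acl:to
```

Harmonizing the two treebanks then comes down to deciding which marker list is passed in for this pattern.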

@dan-zeman
Member

@amir-zeldes : In GUM, M = 7.64 ± 1.12 is analyzed so that ± 1.12 is nmod of 7.64. Shouldn't binary mathematical operations be conj? (That would mean that we need enhanced conj:plus_minus rather than nmod:plus_minus.)

@dan-zeman
Member

Is it grammatical in English to omit the second as from as well as? GUM has one example: ... this project study gives solution to the problem of the society concerning environment, health and safety as well energy conservation ... To me it sounds like there should be as well as but the author forgot to complete it. If that's true, then the enhanced relation should be conj:as_well_as (which is already registered), not conj:as_well.

@dan-zeman
Member

Is it a good idea to augment enhanced deprels with foreign case markers in code-switched data? Example: GUM uses nmod:de and nmod:a in a sentence that is completely French: J' ai besoin de tout mon courage pour mourir à vingt ans!” I think it would be quite sufficient to stay with plain nmod in such cases.

Alternatively, the validator could be modified to also observe MISC Lang=fr for edeprels, as it currently does for auxiliaries and features. (But Lang=fr would have to be added to that sentence, it is not there at present.)

@dan-zeman
Member

Any criteria for deciding whether versus should be preposition or coordinator in English? All examples in EWT and GUM result in nmod:versus, except for one in GUM, which is conj:versus and probably is just an error that should be fixed. (Nevertheless, Reynolds and Pullum (2013) argue in 4.3 that the function of versus has shifted towards coordinator.)

@dan-zeman
Member

Is til an acceptable alternative spelling of till, or is it a typo that should be normalized to till? There is one obl:til in EWT and one in GUM; if it were obl:till, it would not be reported as an error.

@dan-zeman
Member

There are two instances of aka in EWT (http://hdl.handle.net/11346/PMLTQ-FRJ0). They are tagged as ADV, which seems suspicious to me. Nevertheless, the noun they modify is attached to the preceding nominal as appos (in the first case; in the second the antecedent is missing), which I agree with.

GUM has six instances (http://hdl.handle.net/11346/PMLTQ-FAAK). They are tagged as ADP, which might be okay (either that or a conjunction). But the edeprel of the nominal is nmod:aka while I think it should be appos.

@amir-zeldes
Contributor

Thanks for finding all of these! My take on these is:

  • conj:rather_than is correct under the current dependency analysis of "rather than"
  • if the infinitive "to" is generally included, I think the correct edeprel is advcl:in_order_to for consistency
  • "I know where you live"
    • This particular case is a tree error, since it is a free relative ("where you live" is the thing I know, so it saturates obj)
    • If we had an advcl marked by "where" (specifying location of a predicate as an adjunct clause), it should behave like "if", and should be mark; "where" as advmod is a Stanford Dependencies thing which was phased out in UD AFAIK (this was actually one of the GUM <6 SD to UD rules); "wherever possible" etc. are correct as SCONJ, just like "if possible". "whither" would be advmod in a question, but otherwise not.
  • nmod:plus_minus -> conj:plus_minus - this seems reasonable, I can change it
  • conj:as_well - this is an error in the original, I agree edep should be as_well_as
  • de etc - oh, this is a funny one! I guess the student here went above and beyond the call of duty and did a whole French tree! I feel a bit sad throwing it away, but it should probably just all be flat no? Or what do you think? I agree an English edep with de is not sensible and could just remove it, but I'm not sure what is the best overall solution.
  • nmod:versus - yes, I think being conservative is best and least surprising, will fix
  • til should be normalized to till
  • I think "aka" can either be analyzed fully, as replacing a clause headed by "known" (then it should be acl), or we can view it as a preposition-like thing replacing "as", in which case nmod. I'm not sure about appos... I guess I'm convinceable. What do you think @nschneid ?

nschneid added a commit that referenced this issue Jan 8, 2022
amir-zeldes added a commit to amir-zeldes/gum that referenced this issue Jan 8, 2022
@nschneid
Contributor Author

nschneid commented Jan 8, 2022

  • where, wherever, whither: I realized after I added that example that the guidelines e.g. example (37) on this page seem to be inconsistent with the data. I think the discussion in Inconsistent decisions about SCONJ/ADV and mark/advmod for subordinate-clause-introducers #88 needs to be revisited with the core group, in the context of a discussion of free relatives and other types of WH-clauses. I've started some guidelines here but have a number of open questions.
  • Whether to analyze the French sentence in an English treebank: I suppose UD should have a policy for code-switching. The test I would use is, is the reader expected to understand the French structure, or are they expected to recognize it as a fixed phrase? An interesting case is if they quote something in another language and then give a translation, suggesting that the reader may not speak the other language.
  • I would be more liberal and treat versus/vs./v. as a coordinator if it coordinates two like constituents on equal footing. Similar to "rather than" ("rather than" #182), it retains a preposition usage, but is more flexible now: "I can't decide whether to get a dog vs. a cat", "I can't decide whether to give John a book versus Mary a bike".
  • I think "a.k.a.", "i.e.", and "e.g." all have the function of introducing a supplement. Not sure what tag to use; perhaps we should discuss after we reach a decision on "etc." (https://github.com/UniversalDependencies/docs#820).
  • no choice but to: but+to is a compositional combination, not a fixed phrase (cf. "no choices but bad choices"). "From" can precede just about any locative: "the wasp flew from under/above/near/... the window". It is inelegant to list but_to, from_under, from_above, etc., but it would remove information to simplify it to just one word. What about putting + in the edeprel in such cases, and the validator will accept any + combination of listed lexical items? This would also work for in_order+that, in_order+to, in_order+for+to.
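
As a rough illustration of the `+` proposal in the last bullet, a validator could split the refinement on `+` and require each piece to be a registered lexical item. The `REGISTERED` set and the function name `accepts_plus_combination` are hypothetical; this is a sketch of the idea, not existing validator behavior.

```python
# Hypothetical set of registered lexical items; multiword fixed
# expressions such as "in_order" are registered as single items.
REGISTERED = {"but", "to", "from", "under", "above", "near",
              "in_order", "that", "for"}


def accepts_plus_combination(edeprel: str, registered=REGISTERED) -> bool:
    """Accept a refinement if every '+'-separated part is registered."""
    _, sep, refinement = edeprel.partition(":")
    if not sep:
        return True  # bare relation, nothing to check
    return all(part in registered for part in refinement.split("+"))


print(accepts_plus_combination("acl:but+to"))
print(accepts_plus_combination("advcl:in_order+for+to"))
print(accepts_plus_combination("obl:from+beneath"))
```

Under this scheme only single items need to be listed, and compositional sequences such as `from+under` or `from+above` are licensed without separate entries.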

@dan-zeman
Member

  • conj:rather_than is correct under the current dependency analysis of "rather than"

@nschneid has added it.

  • if the infinitive "to" is generally included, I think the correct edeprel is advcl:in_order_to for consistency

I will leave this one for the two of you to sort out (advcl:in_order_to had been registered but was later replaced with advcl:in_order by @nschneid, to also accommodate advcl:in_order_for).

  • If we had an advcl marked by "where" (specifying location of a predicate as an adjunct clause), it should behave like "if", and should be mark; "where" as advmod is a Stanford Dependencies thing which was phased out in UD AFAIK (this was actually one of the GUM <6 SD to UD rules); "wherever possible" etc. are correct as SCONJ, just like "if possible". "whither" would be advmod in a question, but otherwise not.

I never heard of phasing it out in UD but maybe it was some English-internal discussion. FWIW, the equivalents in Czech are treated as adverbs (and it is the same in the Prague treebanks, i.e., without any connection to Stanford Dependencies). I am convinced that a wh-adverb stays an adverb and occupies an adverbial position regardless of whether it is a question, a complement clause, or an adverbial clause.

  • nmod:plus_minus -> conj:plus_minus - this seems reasonable, I can change it

OK, conj:plus_minus is now registered.

  • de etc - oh, this is a funny one! I guess the student here went above and beyond the call of duty and did a whole French tree! I feel a bit sad throwing it away, but it should probably just all be flat no? Or what do you think? I agree an English edep with de is not sensible and could just remove it, but I'm not sure what is the best overall solution.

I would definitely not remove the sentence because that would break the integrity of the document, but you probably did not mean that. I would also not necessarily flatten the tree; I think using UPOS X and flat:foreign is an option for annotators who cannot or do not want to annotate the foreign language, but actually annotating it following the foreign language guidelines is possible and some treebanks do it. But I would only use nmod here so that we do not have to register the foreign prepositions (if we relied on Lang=fr, we would still have to register it in the French list, which is now empty).

In contrast, I did register Latin et as an English conjunction because I thought et al. has been naturalized in English.

  • I think "aka" can either be analyzed fully, as replacing a clause headed by "known" (then it should be acl), or we can view it as a preposition-like thing replacing "as", in which case nmod. I'm not sure about appos... I guess I'm convinceable. What do you think @nschneid ?

Yeah, also known as could be a fixed multi-word preposition or conjunction (but would you treat it as such if it occurred in the corpus?) I don't think it disqualifies the nominal from being an apposition (semantically it indeed sounds like one). Actually, I had the same feeling about such as, that's why I did not include it in the first round of porting English edeprels. So maybe the two should have the same solution. But if you guys believe it has to be nmod, we can add nmod:aka to the list.

@nschneid
Contributor Author

nschneid commented Jan 8, 2022

In contrast, I did register Latin et as an English conjunction because I thought et al. has been naturalized in English.

This is subject to debate. @amir-zeldes thinks of et as a conjunction even in English, whereas I think of "et al." as a fixed phrase. Let's revisit after resolving "etc.".

@nschneid
Contributor Author

nschneid commented Jan 8, 2022

I would only use nmod here so that we do not have to register the foreign prepositions

What about nmod:FOREIGN? So that scripts will know the lack of a lexical refinement is not an error.

@dan-zeman
Member

  • An interesting case is if they quote something in another language and then give a translation, suggesting that the reader may not speak the other language.

That was actually the case of the French sentence I showed (the English translation came two sentences later).

@dan-zeman
Member

I would only use nmod here so that we do not have to register the foreign prepositions

What about nmod:FOREIGN? So that scripts will know the lack of a lexical refinement is not an error.

Lexical refinement is not present everywhere, so I don't think this is necessary. (And deprels are all-lowercase, so it would have to be nmod:foreign.)

@nschneid
Contributor Author

nschneid commented Jan 8, 2022

I would only use nmod here so that we do not have to register the foreign prepositions

What about nmod:FOREIGN? So that scripts will know the lack of a lexical refinement is not an error.

Lexical refinement is not present everywhere, so I don't think this is necessary. (And deprels are all-lowercase, so it would have to be nmod:foreign.)

There should almost always be a lexical refinement if the dependent has a case or mark dependent, right?

@dan-zeman
Member

There should almost always be a lexical refinement if the dependent has a case or mark dependent, right?

But if the scripts already check the presence of a case/mark dependent, then they can check whether it has Foreign=Yes in the features.
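
A minimal sketch of such a script, assuming standard ten-column CoNLL-U input: it flags tokens that have a case/mark dependent but a bare enhanced relation, unless all of those markers carry Foreign=Yes. The function names and the two toy fragments (including their annotations) are invented for illustration.

```python
REFINABLE = {"nmod", "obl", "acl", "advcl"}


def parse_conllu(block: str) -> list[dict]:
    """Parse basic tokens of one sentence (skips comments, ranges, empty nodes)."""
    tokens = []
    for line in block.strip().splitlines():
        cols = line.split("\t")
        if line.startswith("#") or not cols[0].isdigit():
            continue
        tokens.append({"id": int(cols[0]), "form": cols[1], "feats": cols[5],
                       "head": int(cols[6]), "deprel": cols[7], "deps": cols[8]})
    return tokens


def unrefined(tokens: list[dict]) -> list[str]:
    """Forms with a non-foreign case/mark dependent but a bare edeprel."""
    flagged = []
    for tok in tokens:
        markers = [d for d in tokens
                   if d["head"] == tok["id"] and d["deprel"] in {"case", "mark"}]
        foreign = all("Foreign=Yes" in m["feats"].split("|") for m in markers)
        if not markers or foreign:
            continue
        # DEPS entries look like "1:nmod" or "1:nmod:of"
        rels = [d.split(":", 1)[1] for d in tok["deps"].split("|") if ":" in d]
        if any(rel in REFINABLE for rel in rels):
            flagged.append(tok["form"])
    return flagged


FRENCH = "\n".join([
    "1\tbesoin\tbesoin\tNOUN\t_\tForeign=Yes\t0\troot\t0:root\t_",
    "2\tde\tde\tADP\t_\tForeign=Yes\t3\tcase\t3:case\t_",
    "3\tcourage\tcourage\tNOUN\t_\tForeign=Yes\t1\tnmod\t1:nmod\t_",
])
ENGLISH = "\n".join([
    "1\thouse\thouse\tNOUN\t_\tNumber=Sing\t0\troot\t0:root\t_",
    "2\tof\tof\tADP\t_\t_\t3\tcase\t3:case\t_",
    "3\tcards\tcard\tNOUN\t_\tNumber=Plur\t1\tnmod\t1:nmod\t_",
])

print(unrefined(parse_conllu(FRENCH)))   # []
print(unrefined(parse_conllu(ENGLISH)))  # ['cards']
```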

@dan-zeman
Member

  • It is inelegant to list but_to, from_under, from_above, etc., but it would remove information to simplify it to just one word. What about putting + in the edeprel in such cases, and the validator will accept any + combination of listed lexical items? This would also work for in_order+that, in_order+to, in_order+for+to.

I feel quite strongly against adding any more complexity to the internal logic of the deprels. In fact, I hope that in the distant future, we will be able to replace all these lexical labels with some semantic tags that will be portable across languages.

I'm actually quite fine with from_under because some Northeast-Caucasian languages have a morphological case with the same meaning. But in other cases I tend to think that only one of the function words directly relates to the nmod relation. "But to" was one such case, but I'm not sure I can specify cross-linguistic decision criteria for when this should be done. (In Czech, I treated all combinations of jako 'as' + another preposition as if it was only jako; the same for než 'than'.)

@amir-zeldes
Contributor

I would opt for simplicity as well - edeps are a work in progress from my perspective, and messing with them too much right now may be premature optimization. I am happy with "from_above" for right now, and if infinitive "to" is in then it is in, meaning "but_to" (in the sense "except to") is also in.

I also don't think ":foreign" is necessary since there is Foreign, and this could lead to conflicts. And in any case, we have bare enhanced things like conj for zero coordination etc.

nschneid added a commit that referenced this issue May 13, 2022
nschneid added a commit that referenced this issue May 13, 2022
nschneid added a commit that referenced this issue May 13, 2022
@nschneid
Contributor Author

The above changes (and recent additions to the validator list) result in EWT being VALID! A couple of items to note for future investigation:

  • Switched a few tokens of "rather than" to cc to avoid adding nmod:rather_than. It may be worth making conj the default across the corpus (case/mark would apply to fronted constituents). "rather than" #182
  • Added advcl:the to the validator for the first part of the correlative comparative construction: "the more, the merrier". This needs to be implemented in GUM.
  • Should "combined with" be added to the fixed list + validator list? It is unclear. Deverbal connectives ("regarding", etc.) #179
  • "Whenever" is often analyzed as mark of an advcl. A couple of tokens had the refinement advcl:whenever, but most lacked any refinement. I removed the refinement on those because I'm not sure of the correct structure and didn't want to add something incorrect to the validator.
  • At @amir-zeldes's behest I changed some coordination analyses of "vs." to PP analyses to avoid adding conj:versus to the validator, but this is debatable. Pullum and Reynolds 2013 argued that "versus" can be a CCONJ as well as an ADP.
  • Removed a couple of lexical refinements where the use of markers was ungrammatical. Not sure if there should be a way to signal this ungrammaticality.

The other changes were fairly straightforward.

@amir-zeldes
Contributor

Just one thing about comparative correlatives ("the more the merrier") - I'm all for advcl:the here, but that does open the question of what deprel and POS it should have. Currently it has det and DET because PTB tags DT (albeit sticking "the more" under a phrase node X, whatever that means). I think logically it should probably be IN/SCONJ/mark, but I'm not sure it's worth disrupting the PTB xpos ecosystem.

Options include:

  1. DT/DET/det/advcl:the
  2. DT/SCONJ/mark/advcl:the
  3. DT/DET/mark/advcl:the
  4. Other permutations?
  5. Just give up on advcl:the since this is a bit messy

Maybe option 2. is the best for preserving 'status quo' while expressing linguistic structure faithfully, though it does create an xpos/upos disparity (but an automatable one, since no other 'the' is deprel mark).
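
To illustrate why the disparity in option 2 would be automatable, here is a minimal sketch that sets UPOS to SCONJ for any "the" attached as mark while leaving XPOS untouched. The dict-based token representation, the function name, and the toy annotations are assumptions made for the example.

```python
import copy


def retag_correlative_the(tokens: list[dict]) -> list[dict]:
    """Return a copy where every 'the' with deprel mark has UPOS SCONJ."""
    retagged = copy.deepcopy(tokens)
    for tok in retagged:
        if tok["form"].lower() == "the" and tok["deprel"] == "mark":
            tok["upos"] = "SCONJ"  # XPOS stays DT, as in option 2
    return retagged


# Toy tokens with invented annotations: one marker "the", one article "the"
tokens = [
    {"form": "The", "xpos": "DT", "upos": "DET", "deprel": "mark"},
    {"form": "the", "xpos": "DT", "upos": "DET", "deprel": "det"},
]
for tok in retag_correlative_the(tokens):
    print(tok["form"], tok["xpos"], tok["upos"], tok["deprel"])
```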

@dan-zeman
Member

Option 2 sort of makes sense to me. The only other viable option seems to be 5 (perhaps I'd even prefer that one). Because if we put the in the edeprel, we are saying that it functions as a marker rather than a determiner.

@amir-zeldes
Contributor

Because if we put the in the edeprel, we are saying that it functions as a marker rather than a determiner.

Yes, I think it is - historically it's a separate case form, distinct from the regular article (also compare the German form "desto", which is not the same as the regular article). It's only coincidentally a homonym of the article at this point, but really it's a totally different word morphosyntactically - it's labeled mark currently and I think that's correct, so it should be SCONJ too IMO.
