Better handling of Functions in evaluation #94

Open
danielhers opened this issue May 22, 2020 · 3 comments

Comments

@danielhers

Functions are currently moved to the root if they are common to the prediction and the gold (see #91 (comment)). A better approach would be "soft" matching of yields, which allows excluding Functions asymmetrically: when calculating precision, we should allow omitting Functions from the gold; when calculating recall, from the prediction.
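The asymmetric matching proposed above can be sketched as follows. This is only an illustration under stated assumptions, not the scorer's actual code: `Unit`, `soft_match`, `precision` and `recall` are hypothetical names, and each unit is assumed to carry its category, its yield as a set of token positions, and the subset of those positions that are Function terminals. Checking whether some subset of Functions can be dropped reduces to two set comparisons, so no enumeration of subsets is needed for a single pair of units.

```python
from typing import FrozenSet, List, NamedTuple


class Unit(NamedTuple):
    category: str
    span: FrozenSet[int]      # token positions in the unit's yield
    f_tokens: FrozenSet[int]  # positions in `span` that are Function terminals


def soft_match(strict: Unit, lenient: Unit) -> bool:
    """True if `lenient` equals `strict` after dropping some of lenient's F tokens."""
    if strict.category != lenient.category:
        return False
    dropped = lenient.span - strict.span
    return strict.span <= lenient.span and dropped <= lenient.f_tokens


def precision(pred: List[Unit], gold: List[Unit]) -> float:
    # Functions may be omitted from the gold units only.
    hits = sum(any(soft_match(p, g) for g in gold) for p in pred)
    return hits / len(pred) if pred else 0.0


def recall(pred: List[Unit], gold: List[Unit]) -> float:
    # Functions may be omitted from the predicted units only.
    hits = sum(any(soft_match(g, p) for p in pred) for g in gold)
    return hits / len(gold) if gold else 0.0


# Hypothetical two-token sentence: position 0 = "the", position 1 = "party".
pred = [Unit("C", frozenset({1}), frozenset())]
gold = [Unit("C", frozenset({0, 1}), frozenset({0}))]
print(precision(pred, gold), recall(pred, gold))
```

In this toy case the predicted unit counts for precision (the gold unit's Function can be omitted), while the gold unit does not count for recall, because the prediction has no Function to omit.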

@nschneid

Note that unlike punctuation, the decision of whether a word should be Function or not is nontrivial.

So if we were to simply ignore all Function units like we do Punctuation units, each Function vs. non-Function difference between prediction and reference could result in mismatched spans at several levels.

Hence, when computing precision, it would be better to ignore only those Function units in the reference that would prevent ancestor units from matching, and vice versa for recall.

(The algorithm would need to be worked out: e.g. suppose we have

[A [F x] [D [F y] [E [F z] [C c]] ] ]

A and D have identical spans modulo F's, as do E and C. So this could effectively mean there are more unary edges if the other analysis chooses to put x, y, and z elsewhere. Would that make the scorer too lenient under the current policy that any category match is sufficient to count the span as correct? Also, when trying to match large units with many F descendants, do we have a combinatorial search space to decide whether to include each F? Hopefully not a problem in practice as units tend to be nested, so a failure of a unit to match will imply that its parent units with more non-F descendants will not match, assuming proper nesting on both sides.)

Function units themselves would not count toward the score (i.e., they are excluded from the list of matches even if present in both analyses).
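To make the bracketed example above concrete, here is a minimal sketch that parses the notation and groups units by yield, with and without Function units. The helper names (`parse`, `yields`, `group_by_yield`) are hypothetical, and yields are represented as tuples of words rather than token positions, which is a simplification that only works when no word repeats.

```python
import re
from collections import defaultdict


def parse(text):
    """Parse '[CAT child ...]' into nested (category, children); bare words are tokens."""
    tokens = re.findall(r"\[|\]|[^\s\[\]]+", text)
    pos = 0

    def unit():
        nonlocal pos
        assert tokens[pos] == "["
        category = tokens[pos + 1]
        pos += 2
        children = []
        while tokens[pos] != "]":
            if tokens[pos] == "[":
                children.append(unit())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1  # consume the closing bracket
        return (category, children)

    return unit()


def yields(node, skip_f, out):
    """Return the words under `node`, appending (category, words) for each visited unit."""
    category, children = node
    if skip_f and category == "F":
        return ()  # F units contribute nothing and are not recorded
    words = ()
    for child in children:
        words += (child,) if isinstance(child, str) else yields(child, skip_f, out)
    out.append((category, words))
    return words


def group_by_yield(text, skip_f):
    out = []
    yields(parse(text), skip_f, out)
    groups = defaultdict(list)
    for category, words in out:
        groups[words].append(category)
    return dict(groups)


EXAMPLE = "[A [F x] [D [F y] [E [F z] [C c]] ] ]"
print(group_by_yield(EXAMPLE, skip_f=False))
print(group_by_yield(EXAMPLE, skip_f=True))
```

With all Function units stripped, A, D, E and C all reduce to the single-word yield `c` in this sketch, which covers the A/D and E/C pairs noted above and shows how many effectively unary edges can arise.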

@nschneid

nschneid commented May 25, 2020

> Also, when trying to match large units with many F descendants, do we have a combinatorial search space to decide whether to include each F? Hopefully not a problem in practice as units tend to be nested, so a failure of a unit to match will imply that its parent units with more non-F descendants will not match, assuming proper nesting on both sides.

I take that back.

On further reflection, if PRIMARY edges strictly form a tree (and no token can belong to multiple units), then it doesn't matter whether the tree is projective: the matching can be done bottom-up, once for precision and again for recall. I suppose a chart can be formed by reordering the tokens so that the graph being scored is projective. If a unit matches, that can be taken into account when checking whether its parent unit matches (so the work of determining that certain F's SHOULD be included doesn't have to be repeated).

@nschneid

nschneid commented May 25, 2020

Suppose we had

SYS: [C [F the] [P party]]
REF: [P [F the] [C party]]

  1. If our policy is just to ignore the F's, that amounts to [P [C party]] vs. [C [P party]] (which would be scored fully correct?).

  2. Or, the policy could be that F's count as part of the span, but we are flexible about matching that span to another span where the F's are missing or extra. And we only score spans where at least one terminal category is non-F. In which case there are two span matches, "the party" and "party", but neither has the correct category, so the labeled score is 0.

  3. Or, a sort of combination of (1) and (2): when considering the predicted unit [C the party], we count a match if EITHER the full span matches with the right category OR the F-omitted span [C party] matches. This would not be quite the same as ignoring all F's, because under (1) both C and P units would be counted correct for both precision and recall, while under (3), only C would count as correct for precision and only P for recall.

Now consider:

SYS: [C [E the] [P party]]
REF: [P [F the] [C party]]

  • Under policy (2):
    • precision would be out of 3 units: neither [C the party] nor [E the] matches with the correct category, but [P party] does if you ignore the F in REF, so 1/3
    • recall would be out of 2 units, [P the party] and [C party], neither of which match with the correct category, so 0.
  • Under (3), precision would be 0/3, and recall would be 1/2.
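The counts for policies (1) and (3) can be checked with a small sketch. Everything here is an assumption made for illustration: the names `Unit`, `scoreable`, `policy1` and `policy3` are hypothetical, units are listed by hand with token positions 0 = "the" and 1 = "party", and units with no non-F terminal are excluded from scoring. Policy (2) is left out because its matching rule would need to be specified further before it could be coded.

```python
from typing import FrozenSet, List, NamedTuple, Tuple


class Unit(NamedTuple):
    category: str
    span: FrozenSet[int]  # token positions in the unit's yield
    f: FrozenSet[int]     # positions in `span` that are F terminals


def U(category, span, f=()):
    return Unit(category, frozenset(span), frozenset(f))


def scoreable(units: List[Unit]) -> List[Unit]:
    # Only units with at least one non-F terminal are scored.
    return [u for u in units if u.span - u.f]


def policy1(scored: List[Unit], other: List[Unit]) -> Tuple[int, int]:
    """Ignore F's on both sides; returns (matches, total scored units)."""
    targets = {(u.category, u.span - u.f) for u in scoreable(other)}
    units = scoreable(scored)
    hits = sum((u.category, u.span - u.f) in targets for u in units)
    return hits, len(units)


def policy3(scored: List[Unit], other: List[Unit]) -> Tuple[int, int]:
    """A scored unit matches with its full span OR its F-omitted span."""
    targets = {(u.category, u.span) for u in scoreable(other)}
    units = scoreable(scored)
    hits = sum(
        (u.category, u.span) in targets or (u.category, u.span - u.f) in targets
        for u in units
    )
    return hits, len(units)


SYS1 = [U("C", {0, 1}, {0}), U("F", {0}, {0}), U("P", {1})]  # [C [F the] [P party]]
SYS2 = [U("C", {0, 1}), U("E", {0}), U("P", {1})]            # [C [E the] [P party]]
REF = [U("P", {0, 1}, {0}), U("F", {0}, {0}), U("C", {1})]   # [P [F the] [C party]]

# Precision scores the SYS units against REF; recall scores REF against SYS.
print("policy 1, first pair:", policy1(SYS1, REF), policy1(REF, SYS1))
print("policy 3, first pair:", policy3(SYS1, REF), policy3(REF, SYS1))
print("policy 3, second pair:", policy3(SYS2, REF), policy3(REF, SYS2))
```

Under these assumptions the sketch gives 2/2 for both precision and recall under (1) on the first pair, 1/2 and 1/2 under (3) on the first pair (C for precision, P for recall), and 0/3 precision with 1/2 recall under (3) on the second pair, in line with the figures above.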
