missing 43 sentences from EWT #7
Comments
Thanks for noting this! While it would be good for us to include these, this does not mean that there is missing SRL data. I'll need to look into it more, but I'm pretty sure that each of these sentences is from a document that had zero predicates to annotate, and our pipeline simply did not prepare documents with zero annotations. I think that's an error in our pipeline: dropping them has no effect on standard SRL training (where you have gold predicate identification), but it would be more accurate to have these documents included. I'll look into adding them in.
The strange thing here is that there are other sentences without predicate or argument annotations that are still in the corpus.
Random additional comment: There are a number of failures to divide sentences in the LDC EWT data. Up until now, we have kept the sentence divisions consistent between LDC EWT and UD EWT, but I intend to fix the erroneous sentence divisions some day and give the results back to LDC...
Can I help somehow? I really would like to see the data more consistent between the LDC EWT, UD EWT, and Propbank. Do you have the list of errors in the division of sentences?
I missed one important detail in @timjogorman's explanation above. He actually repeated what he said in #2 (comment): only files in which no sentence has an annotated predicate are omitted. So my comment above can be ignored. We do have files in which some sentences are missing SRL annotation.
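To make the rule concrete, here is a minimal sketch of it in Python. It assumes a hypothetical mapping from file name to per-sentence predicate counts; the function name and the data layout are illustrative only and are not taken from the actual pipeline.

```python
def files_dropped(pred_counts_by_file):
    """Return the files in which every sentence has zero annotated predicates.

    `pred_counts_by_file` is a hypothetical structure: file name -> list with
    the number of annotated predicates in each sentence of that file.
    """
    return sorted(
        name
        for name, counts in pred_counts_by_file.items()
        if all(count == 0 for count in counts)
    )


# Made-up example data, not real EWT files.
example = {
    "doc_a": [0, 0],     # no predicates anywhere: the whole file is omitted
    "doc_b": [0, 2, 1],  # first sentence has no predicates, but the file is kept
}

dropped = files_dropped(example)
```

Under this rule, the unannotated first sentence of `doc_b` stays in the corpus, which would explain why some sentences without SRL annotation are still present.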
https://catalog.ldc.upenn.edu/LDC2012T13 says that EWT has 16,624 sentences. They actually have:
This number matches the number of sentences in the https://github.com/universaldependencies/UD_English-EWT treebank:
But this propbank-release contains only 16579 sentences. We are missing the following 43 sentences:
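One way to reproduce a list like this is to compare sentence identifiers across the two releases. The sketch below reads the `# sent_id = ...` comment lines that UD CoNLL-U files carry and reports the identifiers found in the reference but not in the release. It assumes the propbank-release sentences can be exported with the same `sent_id` comments, which may not match the actual file layout; the function names are hypothetical.

```python
SENT_ID_PREFIX = "# sent_id = "


def sent_ids(conllu_text):
    """Collect sentence identifiers, in order, from CoNLL-U formatted text."""
    ids = []
    for line in conllu_text.splitlines():
        if line.startswith(SENT_ID_PREFIX):
            ids.append(line[len(SENT_ID_PREFIX):].strip())
    return ids


def missing_ids(reference_text, release_text):
    """Identifiers present in the reference but absent from the release."""
    present = set(sent_ids(release_text))
    return [sid for sid in sent_ids(reference_text) if sid not in present]


# Tiny made-up inputs; real use would read the treebank files from disk.
reference = "# sent_id = doc1-0001\n# text = Hi.\n\n# sent_id = doc2-0001\n# text = Bye.\n"
release = "# sent_id = doc1-0001\n# text = Hi.\n"

missing = missing_ids(reference, release)
```

Keeping the result as an ordered list rather than a set preserves the treebank order, which makes it easier to see whether the missing sentences cluster in whole documents.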