MWE-aware English Dependency Corpus

We provide an English dependency corpus that takes into account compound function words, a type of multiword expression (MWE) that serves as a functional expression.

We built the corpus according to the following method.

1. We found each MWE in the phrase structure trees of Ontonotes and established it as a single subtree.
   - We utilized the information (position in sentence and part of speech) of MWEs provided by [1].
   - The phrase structure trees made by this step are also provided.
2. We replaced the above subtree with a preterminal that has a single leaf node as its child. The preterminal has the same part of speech as the MWE. The leaf node is made by joining all components of the MWE with underscores.
3. We converted the phrase structure into Stanford Dependency [2].
   - We designated "-conllx -basic -makeCopulaHead -keepPunct" as the options for the conversion command.
   - We show an example of MWE-aware Dependency here.
4. We decomposed each token derived from an MWE (e.g. a_number_of) into a "head-initial" dependency structure, for consistency with Universal Dependency [3]. In other words, every token of an MWE other than the first modifies the first one with the mwe label.
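The decomposition in the last step can be sketched as follows. This is a minimal illustration and not the script used to build the corpus: the function name `decompose_mwe`, the `(id, form, head, deprel)` row layout, and the `mwe_ids` argument are assumptions made for this example, and the `amod` label in the sample is illustrative only.

```python
def decompose_mwe(rows, mwe_ids):
    """Expand underscore-joined MWE tokens into a head-initial structure.

    rows    -- list of (id, form, head, deprel) tuples; ids are 1-based and
               head 0 denotes the root (hypothetical layout for this sketch).
    mwe_ids -- set of ids whose form is a merged MWE such as "a_number_of".
    Returns rows in the same layout with ids renumbered after the split.
    """
    def parts_of(tok_id, form):
        # Only tokens known to be merged MWEs are split on underscores.
        return form.split("_") if tok_id in mwe_ids else [form]

    # Map every old id to the new id of its first component.
    new_id = {0: 0}
    next_id = 1
    for tok_id, form, _head, _deprel in rows:
        new_id[tok_id] = next_id
        next_id += len(parts_of(tok_id, form))

    out = []
    for tok_id, form, head, deprel in rows:
        parts = parts_of(tok_id, form)
        first = new_id[tok_id]
        # The first component inherits the head and label of the merged token.
        out.append((first, parts[0], new_id[head], deprel))
        # Every later component modifies the first one with the "mwe" label.
        for offset, part in enumerate(parts[1:], start=1):
            out.append((first + offset, part, first, "mwe"))
    return out


example = [
    (1, "a_number_of", 2, "amod"),
    (2, "people", 3, "nsubj"),
    (3, "came", 0, "root"),
]
decomposed = decompose_mwe(example, {1})
```

In this sample, `a` keeps the original attachment to `people`, while `number` and `of` both attach to `a` with the `mwe` label, and the ids of `people` and `came` shift to 4 and 5.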

MWE-aware Dependency and associated documentation can be downloaded from: https://github.com/naist-cl-parsing/mwe-aware-dependency

Note

We are currently confirming the appropriate license for our language resource with LDC, because the resource is based on Ontonotes release-5.0 (LDC2013T19).

Therefore, our resource will become available after we receive a response from LDC.

Files

dependency

ontonotes_wsj_00_24_mwe_aware.conll

MWE-aware Dependency for Section 00-24 of Wall Street Journal in Ontonotes (Stanford Dependency).

phrase-structure

ontonotes_wsj_00_24_mwe_aware.patch

Patch for creating a treebank that incorporates MWEs (Section 00-24 of Wall Street Journal in Ontonotes).
We checked this patch in the following environment:
- Ubuntu Linux 12.04.2 LTS
- Python 2.7.3
- nltk 3.0.4
  - Please follow the instructions for installing NLTK at http://nltk.org/install.html .

normalize_indentation.py

Script that makes the indentation of each S-expression the same as the output of print() on an nltk.tree.Tree instance. (This is a preparation step for applying the above patch.)

.conll Format

1 token per line, with blank lines separating sentences.

14 tab-separated columns (columns 1-10 are based on CoNLL-X Format [4]):

1. ID
2. FORM (replaced by underscore for licensing restrictions)
3. LEMMA (replaced by underscore for licensing restrictions)
4. CPOSTAG (filled by underscore)
5. POSTAG
6. FEATS (filled by underscore)
7. HEAD
8. DEPREL
9. PHEAD (filled by underscore)
10. PDEPREL (filled by underscore)
11. Filename in Ontonotes (e.g. wsj_0001)
12. Head in MWE-aware dependency (for MWEs, we adopt the "head-initial" structure)
13. Dependency label in MWE-aware dependency (we use the "mwe" label for each token of an MWE except the first)
14. Part-of-speech tag of MWE (only for the first token of each MWE)
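As an illustration of the format, the following sketch reads the 14 columns and groups the tokens of each MWE. It is not part of the distributed resource. The field names, the helper names `read_sentences` and `mwe_spans`, and the sample rows are all hypothetical, and the sketch assumes that column 14 holds an underscore for every token that is not the first token of an MWE.

```python
from collections import namedtuple

# Field names are chosen for this sketch; they mirror the 14 columns above.
COLUMNS = [
    "id", "form", "lemma", "cpostag", "postag", "feats", "head", "deprel",
    "phead", "pdeprel", "filename", "mwe_head", "mwe_deprel", "mwe_postag",
]
Token = namedtuple("Token", COLUMNS)


def read_sentences(lines):
    """Yield each sentence as a list of Token tuples (all fields are strings)."""
    sentence = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        fields = line.split("\t")
        if len(fields) != len(COLUMNS):
            raise ValueError("expected 14 columns, got %d" % len(fields))
        sentence.append(Token(*fields))
    if sentence:
        yield sentence


def mwe_spans(sentence):
    """Return a list of (component ids, MWE POS tag), one entry per MWE."""
    spans = {}
    # Assumption: only the first token of an MWE has a non-underscore column 14.
    for tok in sentence:
        if tok.mwe_postag != "_":
            spans[tok.id] = ([tok.id], tok.mwe_postag)
    # Later components point at the first token with the "mwe" label.
    for tok in sentence:
        if tok.mwe_deprel == "mwe" and tok.mwe_head in spans:
            spans[tok.mwe_head][0].append(tok.id)
    return list(spans.values())


# Made-up rows for illustration only; they are not taken from the corpus.
sample_lines = ["\t".join(row) for row in [
    ["1", "_", "_", "_", "DT", "_", "2", "det", "_", "_", "wsj_0001", "4", "amod", "JJ"],
    ["2", "_", "_", "_", "NN", "_", "5", "nsubj", "_", "_", "wsj_0001", "1", "mwe", "_"],
    ["3", "_", "_", "_", "IN", "_", "2", "prep", "_", "_", "wsj_0001", "1", "mwe", "_"],
    ["4", "_", "_", "_", "NNS", "_", "3", "pobj", "_", "_", "wsj_0001", "5", "nsubj", "_"],
    ["5", "_", "_", "_", "VBD", "_", "0", "root", "_", "_", "wsj_0001", "0", "root", "_"],
]] + [""]

sentences = list(read_sentences(sample_lines))
spans = mwe_spans(sentences[0])
```

For real data, `read_sentences` can be given an open file handle for ontonotes_wsj_00_24_mwe_aware.conll instead of `sample_lines`.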

Application of patch

In order to apply the above patch, you need Ontonotes release-5.0 (LDC2013T19). From the folder that contains LDC2013T19, the Wall Street Journal directory in Ontonotes looks like this:

$ ls LDC2013T19/ontonotes-release-5.0/data/files/data/english/annotations/nw/wsj

00  02  04  06  08  10  12  14  16  18  20  22  24  
01  03  05  07  09  11  13  15  17  19  21  23
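Before continuing, it may help to confirm that all 25 section directories shown above are present. The following is a minimal sketch of such a check. The helper name `missing_sections` is hypothetical and the script is not part of the distributed resource.

```python
import os

# Section directory names "00" through "24", as in the listing above.
EXPECTED_SECTIONS = ["%02d" % i for i in range(25)]


def missing_sections(wsj_dir):
    """Return the expected section names that are not subdirectories of wsj_dir."""
    present = set(
        name for name in os.listdir(wsj_dir)
        if os.path.isdir(os.path.join(wsj_dir, name))
    )
    return [name for name in EXPECTED_SECTIONS if name not in present]
```

An empty list from `missing_sections` on the wsj directory means every section from 00 to 24 was found.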

First, make a directory for our corpus.

$ mkdir -p ontonotes_5.0_mwe_aware_v1.0/wsj

In order to normalize the indentation style, run the following command:

$ python normalize_indentation.py LDC2013T19/ontonotes-release-5.0/data/files/data/english/annotations/nw/wsj ontonotes_5.0_mwe_aware_v1.0/wsj

Then, apply the patch.

$ patch -p1 -d ontonotes_5.0_mwe_aware_v1.0/wsj < /.../mwe_aware_dependency/phrase_structure/ontonotes_wsj_00_24_mwe_aware.patch

References

[1] Yutaro Shigeto, Ai Azuma, Sorami Hisamoto, Shuhei Kondo, Tomoya Kouse, Keisuke Sakaguchi, Akifumi Yoshimoto, Frances Yung, Yuji Matsumoto. 2013. Construction of English MWE Dictionary and its Application to POS Tagging. Proceedings of the 9th Workshop on Multiword Expressions, pages 139–144, Atlanta, Georgia, USA. Association for Computational Linguistics. (http://www.aclweb.org/anthology/W13-1021)
[2] Marie-Catherine de Marneffe, Christopher D. Manning. 2008. The Stanford Typed Dependencies Representation. Coling 2008: Proceedings of the workshop on Cross-Framework and Cross-Domain Parser Evaluation, pages 1–8, Manchester, UK. Coling 2008 Organizing Committee. (http://www.aclweb.org/anthology/W08-1301)
[3] Ryan McDonald, Joakim Nivre, Yvonne Quirmbach-Brundage, Yoav Goldberg, Dipanjan Das, Kuzman Ganchev, Keith Hall, Slav Petrov, Hao Zhang, Oscar Täckström, Claudia Bedini, Núria Bertomeu Castelló, and Jungmee Lee. 2013. Universal Dependency Annotation for Multilingual Parsing. Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, pages 92–97. (https://aclweb.org/anthology/P/P13/P13-2017.pdf)
[4] CoNLL-X Shared Task: Multi-lingual Dependency Parsing (http://ilk.uvt.nl/conll/)

History

MWE-aware Dependency 1.0: 2015-10-23.

Contact

Please e-mail kato.akihiko.ju6 /at/ is.naist.jp with questions.

Contributors

Akihiko Kato
Hiroyuki Shindo
Yuji Matsumoto
