defining dtd
Creating an annotation task for MAE is fairly straightforward. The input format is a simplified version of the DTDs (Document Type Definitions) used to specify XML elements and tags. Task creation has three main parts: the task name, the tag names, and the tag attributes. DTDs are plain text files with a `.dtd` extension, so you can create them in any text editor. For the full specification of DTD declarations, refer to web resources such as Wikipedia. Note that the current version of MAE does not support the full functionality of DTD declarations.
The task name is defined with the `!ENTITY` tag. To create a task called “myTask”, write an `!ENTITY` line with `name` followed by the name of the task in double quotes. The line would look like this:
<!ENTITY name "myTask">
This simply provides the name of the root tag element in the annotation XML output files. While you are revising your annotation specification, it is useful to treat the task name string as a version label that records which version of the specification it defines. If you have several DTD versions while iterating on `myTask`, specifying the version in the task name (e.g. `myTask_v1.0`, `myTask_v1.1`, ...) makes it easy to determine which version a document was annotated under. If you revise your specification but maintain only a single DTD file as the task evolves, it can be a major headache (for both you and your annotators) to determine which documents were covered by which iteration of your DTD. This is not only a management problem but also a real technical one, since MAE assumes that all annotation files sharing the same task name also share the same tag structure. Since DTDs are simply text files, you can put them in your version control system of choice, and are encouraged to do so.
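As a small illustration of the versioning advice above, the first line of a revised specification might look like the following. The task name is the hypothetical `myTask` used in this section, and the version suffix is only an example.

```dtd
<!ENTITY name "myTask_v1.1">
```

Files annotated under this DTD would then carry `myTask_v1.1` as their root tag, which keeps them apart from files annotated under `myTask_v1.0`.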
Tag elements (defined by `!ELEMENT`) are used to define the names of the tags being used in your annotation task and their attributes. MAE recognizes two types of tags: extent tags (tags used to label spans of text in the document) and link tags (tags that identify a relationship between extent tags).
However, you cannot have two tags with the same name, even if they are of different types.
To define an extent tag for your task, the line in your DTD will look like this:
<!ELEMENT ExtentTagName ( #PCDATA ) >
while a link tag will look like this:
<!ELEMENT LinkTagName EMPTY >
`( #PCDATA )` indicates that `ExtentTagName` tags will be associated with some span of text in the document, while `EMPTY` indicates that `LinkTagName` tags will be used for linking other tags to one another, but will not be associated with a particular span of text.
When MAE reads in a task definition, it assigns each extent tag a color, which is used to highlight instances of that tag in the document being annotated. Since link tags do not consume text spans, there is no text to color, so link tags are not given colors.
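Putting the pieces so far together, a minimal task definition could be sketched as follows. The task and tag names (`CorefTask_v0.1`, `Mention`, `Coref`) are hypothetical and chosen only for illustration.

```dtd
<!ENTITY name "CorefTask_v0.1">

<!-- Extent tag: anchored to a span of text -->
<!ELEMENT Mention ( #PCDATA ) >

<!-- Link tag: relates Mention tags, not anchored to text -->
<!ELEMENT Coref EMPTY >
```

With no attributes declared, both tags would still carry the attributes that MAE pre-defines for their type.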
Attributes (defined by `!ATTLIST`) contain the information associated with each tag. Some attributes are pre-defined by MAE. Extent tags always have `id`, `spans`, and `text` attributes (older versions of MAE have `start` and `end` instead of `spans`), even if they are not defined in the DTD. Link tags always have `id` and at least two arguments, named `from` and `to` by default, and MAE expands each argument into two attributes, namely `xxxId` and `xxxText`. Thus, if a link tag uses the `from` argument and its definition also declares a `fromId` attribute, MAE cannot read the DTD: it tries to expand the argument `from` into `fromId` and `fromText`, and the generated `fromId` conflicts with the attribute you declared. Choose attribute names carefully when defining link tags.
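To make the naming conflict concrete, here is a sketch of a link tag that stays clear of the generated names. The tag `Coref` and the attribute `sourceNote` are hypothetical, and the list of names to avoid is derived from the `xxxId` / `xxxText` rule described above.

```dtd
<!ELEMENT Coref EMPTY >

<!-- Avoid naming this attribute fromId, fromText, toId, or toText:
     MAE generates those names from the default from/to arguments -->
<!ATTLIST Coref sourceNote CDATA #IMPLIED >
```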
To define an attribute, include the name of the element it is associated with, followed by the name of the attribute and the type of the attribute, like so:
<!ATTLIST TagName attribute1 ( YES | NO ) #IMPLIED >
<!ATTLIST TagName attribute2 CDATA #IMPLIED >
We will talk about the types of attributes later in this document. Before that let's go over the details of the special attributes pre-defined in MAE.
Every tag needs an identifier so that it can be distinguished and referenced. MAE therefore gives an `id` attribute to every tag defined in the task. When annotators create tags, each tag is given a numerical id combined with a tag-specific prefix. By default, MAE uses the first letter of the tag name, so a tag called `Verb` will have the ids `V1`, `V2`, `V3`, etc. If you have a second tag that starts with V, MAE will also take the next letter of its name for the prefix, and so on.
If you want to specify your own prefix, you can explicitly define the `id` attribute and add a `prefix` field to it, like so:
<!ATTLIST Verb id ID prefix="VB" #REQUIRED >
It is highly recommended that you choose your tag names and prefixes carefully, so that the distinctions are clear to you and your annotators. Relying on MAE to prefix your tag identifiers automatically is especially discouraged if you have multiple tag types whose names begin with the same character.
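For instance, two tags whose names both begin with V can be given explicit prefixes, as in the sketch below. The tag `VerbPhrase` and both prefix values are hypothetical; the syntax follows the `id` example above.

```dtd
<!ELEMENT Verb ( #PCDATA ) >
<!ATTLIST Verb id ID prefix="VB" #REQUIRED >

<!ELEMENT VerbPhrase ( #PCDATA ) >
<!ATTLIST VerbPhrase id ID prefix="VP" #REQUIRED >
```

Under these definitions, annotators would see ids such as `VB1` and `VP1`, which are easier to tell apart than automatically derived prefixes.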
As previously mentioned, all extent tags have an attribute called `spans`, which holds the character offsets of the text the tag is anchored to. However, an extent tag can be "non-consuming" (NC): for instance, when you annotate hidden or omitted entities, there are no character offsets in the text for the tag to be anchored to.
By default, MAE does not allow a tag to be non-consuming, but if you define an extent tag's `spans` attribute as `#IMPLIED`, as opposed to `#REQUIRED`, MAE will allow that tag to be non-consuming. For example:
<!ATTLIST Tag1 spans #IMPLIED >
`Tag1`, as specified, is allowed to be non-consuming. If you do not want a tag to be non-consuming, you do not need to mention the `spans` attribute in the DTD at all, though you may explicitly define it as `#REQUIRED` for the sake of clarity.
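The two settings can be contrasted side by side. In this sketch the tag names `DroppedPronoun` and `Noun` are hypothetical.

```dtd
<!-- May be created without any text span (non-consuming) -->
<!ELEMENT DroppedPronoun ( #PCDATA ) >
<!ATTLIST DroppedPronoun spans #IMPLIED >

<!-- Must always be anchored to text; this matches the default behavior -->
<!ELEMENT Noun ( #PCDATA ) >
<!ATTLIST Noun spans #REQUIRED >
```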
Older versions of MAE used `start` and `end` attributes instead of a single `spans` attribute. The current MAE supports the legacy format, but old versions (before 0.10) cannot load annotation files generated by newer versions of MAE.
You can use NC Extent tags to hack together a table that could be used as a timeline, as shown below.
<!ELEMENT TemporalOrder (#PCDATA) >
<!ATTLIST TemporalOrder spans #IMPLIED >
<!ATTLIST TemporalOrder Character (Char1 | Char2 | Char3) #IMPLIED>
<!ATTLIST TemporalOrder 1 CDATA #IMPLIED>
<!ATTLIST TemporalOrder 2 CDATA #IMPLIED>
<!ATTLIST TemporalOrder 3 CDATA #IMPLIED>
<!ATTLIST TemporalOrder 4 CDATA #IMPLIED>
The number of attributes corresponds to the number of columns in the table. Annotators can later fill them in with text values, for example the codes of annotated event spans. MAE displays them as tooltips.
Starting from v0.11, MAE supports link tags with an arbitrary number of arguments. Unless the DTD specifies otherwise, MAE assumes a link tag has two default arguments, `from` and `to` (this was the only behavior before v0.11). To define three or more arguments for a particular type of link tag, use `argN` attributes in the DTD. For example, to define a link tag with 4 arguments:
<!ELEMENT LinkTagName EMPTY >
<!ATTLIST LinkTagName arg0 IDREF #REQUIRED>
<!ATTLIST LinkTagName arg1 IDREF #REQUIRED>
<!ATTLIST LinkTagName arg2 IDREF #REQUIRED>
<!ATTLIST LinkTagName arg3 IDREF #REQUIRED>
Note that argN attributes always have to start from 0 and take consecutive numbers.
You can also specify a name for each argument using the `prefix` field, like so:
<!ELEMENT ARGUMENTS EMPTY >
<!ATTLIST ARGUMENTS arg0 IDREF prefix="agent" #REQUIRED>
<!ATTLIST ARGUMENTS arg1 IDREF prefix="patient" #REQUIRED>
<!ATTLIST ARGUMENTS arg2 IDREF prefix="predicate" #REQUIRED>
In MAE, an attribute value must be one of 4 types: `CDATA`, `ID`, `IDREF`, or a closed value set.
- `ID` means the attribute works as the identifier of the tag. A tag can have only one `ID` type attribute, and in MAE the pre-defined `id` attribute is that one.
- `CDATA` means the attribute has a free text value.
- `IDREF` means the attribute references another tag. An `IDREF` attribute must hold the id of the referent. For example, the argument id of a link tag is always `IDREF`.
- Finally, MAE lets you offer a set of options for an attribute value, rather than asking annotators to fill in their own values each time. To define a list of values, create the attribute and list the possible values in parentheses, delimited by `|`, like so:
<!ATTLIST TagName attribute1 ( YES | NO | MAYBE ) #IMPLIED >
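The four value types can be seen together in one small sketch. All tag and attribute names here (`Event`, `comment`, `tense`, `Sequence`, and the argument names) are hypothetical; only the type keywords and the `prefix` syntax come from this document.

```dtd
<!ELEMENT Event ( #PCDATA ) >
<!-- ID: the identifier of the tag -->
<!ATTLIST Event id ID prefix="EV" #REQUIRED >
<!-- CDATA: free text -->
<!ATTLIST Event comment CDATA #IMPLIED >
<!-- Closed value set: annotators pick one option -->
<!ATTLIST Event tense ( PAST | PRESENT | FUTURE ) #IMPLIED >

<!ELEMENT Sequence EMPTY >
<!-- IDREF: each argument refers to another tag by its id -->
<!ATTLIST Sequence arg0 IDREF prefix="first" #REQUIRED >
<!ATTLIST Sequence arg1 IDREF prefix="second" #REQUIRED >
<!ATTLIST Sequence arg2 IDREF prefix="third" #REQUIRED >
```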
You can set an attribute to be mandatory or optional, using `#REQUIRED` and `#IMPLIED` respectively. Annotators are expected to fill in all required attributes as they work. If an annotation work file (XML) has unfilled mandatory attributes, MAE warns the user about the gaps when it loads or saves the work.
MAE allows you to set default values for any attribute by placing the desired value in quotes at the end of the attribute definition, like so:
<!ATTLIST TagName attribute1 ( YES | NO ) #IMPLIED "YES">
<!ATTLIST TagName attribute2 CDATA #IMPLIED "default">
Please note that if an attribute defines a list of options but the default value does not appear in the list, MAE will not provide that default value when creating a new tag. Also, remember that when an attribute has a default value, annotators might skip annotating it by mistake. Especially for required or important attributes, we highly recommend not providing a default value, to reduce human errors.
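The difference between a usable and an unusable default can be sketched as follows. The tag `Claim` and its attributes are hypothetical.

```dtd
<!ELEMENT Claim ( #PCDATA ) >

<!-- Default appears in the list, so new Claim tags start with it -->
<!ATTLIST Claim polarity ( POSITIVE | NEGATIVE ) #IMPLIED "POSITIVE" >

<!-- Default does not appear in the list, so MAE will not provide it -->
<!ATTLIST Claim certainty ( HIGH | LOW ) #IMPLIED "MEDIUM" >
```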
Defining an attribute can get complicated, as it has many properties to define:
<!ATTLIST __TAG_NAME__ __ATT_NAME__ __VALUE_TYPE__ prefix __REQUIRE__ __DEF_VALUE__>
To summarize,
- `__TAG_NAME__` : the name of the tag the attribute is associated with
- `__ATT_NAME__` : the name of the attribute
- `__VALUE_TYPE__` : one of `ID`, `CDATA`, `IDREF`, or `( X | Y | ... )`
- `prefix` : used to set the id prefix of a tag (on its `id` attribute) or to name an argument of a link tag (on an `argN` attribute)
- `__REQUIRE__` : `#REQUIRED` or `#IMPLIED`
- `__DEF_VALUE__` : the default value, double-quoted
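As a rough end-to-end sketch, the template above could be filled in like this for a small task. Every name in it (`GivingTask_v0.1`, `Entity`, `Transfer`, and the attribute and argument names) is hypothetical; the comments map each line back to the fields of the template.

```dtd
<!ENTITY name "GivingTask_v0.1">

<!ELEMENT Entity ( #PCDATA ) >
<!-- TAG_NAME=Entity, ATT_NAME=id, VALUE_TYPE=ID, prefix sets the id prefix -->
<!ATTLIST Entity id ID prefix="EN" #REQUIRED >
<!-- VALUE_TYPE=closed set, REQUIRE=#IMPLIED, DEF_VALUE="PERSON" -->
<!ATTLIST Entity kind ( PERSON | OBJECT ) #IMPLIED "PERSON" >
<!-- Allows Entity tags to be non-consuming -->
<!ATTLIST Entity spans #IMPLIED >

<!ELEMENT Transfer EMPTY >
<!-- TAG_NAME=Transfer, ATT_NAME=argN, VALUE_TYPE=IDREF, prefix names the argument -->
<!ATTLIST Transfer arg0 IDREF prefix="giver" #REQUIRED >
<!ATTLIST Transfer arg1 IDREF prefix="receiver" #REQUIRED >
<!ATTLIST Transfer arg2 IDREF prefix="item" #REQUIRED >
<!-- VALUE_TYPE=CDATA, REQUIRE=#IMPLIED, no default value -->
<!ATTLIST Transfer comment CDATA #IMPLIED >
```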
Take a look at this [sample DTD](https://github.com/keighrim/mae-annotation/blob/master/samples/sampleTask.dtd). Pay attention to the comment lines in it, as we tried to show examples of each variant of the definitions.