Corpus cleanup: how to

Caveat:
This is the simple but cumbersome approach that I've been following so far. There are probably better, more automated ways to do this, or ways that better suit how you work. Suggestions for improving this workflow are therefore very welcome!

Background

Due to the way the corpora were initially constructed, some additional work is needed to clean up the data and improve its quality. Below is a list of known issues and possible ways to rectify them.

If you would like to help out, please fork this repository and open a pull request with your changes. You can stay updated on progress too.

Metadata

There is no metadata in the BIO files. Yet it would be very helpful to know the titles, issues, dates of publication etc. of the source articles that the BIO files were derived from. Even better would be a mapping of tokens in the BIO file to the digital presentation of the newspaper page. For this, I've set up a Markdown sheet for each BIO file, which currently has the following format:

[L000123 - L000456] Lines of the BIO file
Newspaper Title, Date of issue (DD-MM-YYYY), file identifier (if available)
URL to local library presentation
URL to TEL presentation

To produce this metadata, the following steps are required:

1. Search the TEL historic newspaper portal for a sequence of consecutive words from the BIO file, enclosed in quotes (a phrase search, e.g. "Denkmal Cesare Battistis"). It is best to use 2-3 fairly distinctive words, to limit the number of search results and thus the number of candidate hits. If too many results are still returned, selecting the data provider from the facets can narrow down the result set further.
2. Once the correct title and issue have been located in the TEL portal (double check the context to make sure you're really on the right page), the newspaper title and date of issue can be taken from there.
3. The link "See Original at Library..." on the top left provides the URL to the digital portal of the library that holds the newspaper. Additional options or metadata may be available there.
4. Finally, the link "Cite this page" at the bottom center of the page opens a pop-up box with unique URLs for the page in the TEL portal. The link we are interested in is called "Page identifier". The URL given there needs some trimming: everything following the page number ("?page=X") should be stripped, so that you end up with something like "http://www.theeuropeanlibrary.org/tel4/newspapers/issue/3000059005407?page=5".
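The URL trimming in the last step can also be scripted. The following is a minimal Python sketch, not part of the existing workflow: the function name `clean_page_url` and the trailing `query=Battistis` parameter in the sample URL are made up for illustration, and the sketch assumes the page number is always passed as a `page` query parameter.

```python
from urllib.parse import parse_qs, urlsplit, urlunsplit


def clean_page_url(url: str) -> str:
    """Return the URL with only the 'page' query parameter kept."""
    parts = urlsplit(url)
    page = parse_qs(parts.query).get("page")
    if not page:
        raise ValueError(f"no page parameter in {url!r}")
    # Rebuild the URL with the page number only; drop other parameters and any fragment.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, f"page={page[0]}", ""))


# Hypothetical "Page identifier" URL with an extra parameter after the page number.
raw = "http://www.theeuropeanlibrary.org/tel4/newspapers/issue/3000059005407?page=5&query=Battistis"
print(clean_page_url(raw))
# http://www.theeuropeanlibrary.org/tel4/newspapers/issue/3000059005407?page=5
```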

OCR errors

The text is full of OCR errors. To make optimal use of the BIO files, any OCR errors within should be corrected. For this, it is easiest to go through the BIO file line by line with the digital facsimile of the newspaper page next to it, and correct any OCR errors in the BIO file.

It is VERY important NOT to normalize historical spelling variation. Only correct OCR errors, and transcribe the text exactly as it is printed on the newspaper page.

Before:

SchriftftelIer O

After:

Schriftsteller O
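Since OCR correction should only touch the token column, a quick check that no NER tag changed by accident can be useful after an editing session. The sketch below is one possible way to do that in Python. It assumes each BIO line is a token followed by whitespace and a tag, and that the correction did not add or remove lines. The function name `tags_unchanged` is hypothetical.

```python
from typing import List


def tags_unchanged(before: List[str], after: List[str]) -> List[int]:
    """Return the 1-based numbers of lines whose tag differs between the two versions."""
    if len(before) != len(after):
        raise ValueError("this check assumes the correction kept the number of lines")
    changed = []
    for number, (old, new) in enumerate(zip(before, after), start=1):
        # The tag is the last whitespace-separated field; blank lines compare equal.
        if old.split()[-1:] != new.split()[-1:]:
            changed.append(number)
    return changed


print(tags_unchanged(["SchriftftelIer O"], ["Schriftsteller O"]))  # []
```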

Sentence splits

To optimize the performance of the binary classifier, sentences with many OCR errors but few or no named entities were stripped from the BIO file. A simple procedure was used to filter the sentences, which splits the text whenever a full stop is encountered. This has the negative side effect that in some cases parts of a sentence may be missing from the BIO file (when a full stop occurs within a sentence, e.g. due to an abbreviation). Many NER systems use sentence position as a feature in training, so the BIO file should contain the full sentences.

Therefore, whenever you find that parts of a sentence from the newspaper text are missing in the BIO file, complete the sentence by adding the missing words to the BIO file, one per line, followed by their corresponding NER tags.

Before:

Guiseppe B-PER
Guardini I-PER
war O
ebenfalls O
anwesend O
. O

After:

Der O
Auftraggeber O
, O
Dr O
. O
Guiseppe B-PER
Guardini I-PER
war O
ebenfalls O
anwesend O
. O

Hyphenation

While NER systems and parsers can in principle deal with hyphenation, it increases the complexity of training and in some cases requires additional pre-processing. It is therefore desirable to remove any hyphenation from the BIO file, even in those cases where the hyphens occur in the printed text.

Take particular care to update the NER tags when required after removing any hyphens!

Before:

Deutsch- B-LOC
land I-LOC

After:

Deutschland B-LOC
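Finding and joining hyphenated tokens can be partly scripted. The Python sketch below is only an illustration under a few assumptions: rows are held as (token, tag) pairs, a hyphenated token is any token longer than one character that ends in "-", and the merged token keeps the tag of its first half. The names `hyphen_candidates` and `merge_at` are hypothetical. Candidates should still be reviewed by hand before merging, since not every trailing hyphen marks a word that continues on the next line.

```python
from typing import List, Tuple

Row = Tuple[str, str]  # (token, NER tag)


def hyphen_candidates(rows: List[Row]) -> List[int]:
    """Return indices of tokens ending in '-' that are followed by another token."""
    return [i for i, (token, _) in enumerate(rows[:-1])
            if len(token) > 1 and token.endswith("-")]


def merge_at(rows: List[Row], index: int) -> List[Row]:
    """Join rows[index] (which must end in '-') with the next row, keeping the first tag."""
    (first, tag), (second, _) = rows[index], rows[index + 1]
    merged = (first[:-1] + second, tag)
    return rows[:index] + [merged] + rows[index + 2:]


rows = [("Deutsch-", "B-LOC"), ("land", "I-LOC")]
# Merge from the end so that earlier indices stay valid.
for i in reversed(hyphen_candidates(rows)):
    rows = merge_at(rows, i)
print(rows)  # [('Deutschland', 'B-LOC')]
```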

Missing named entity tags

While the texts were carefully annotated, each text was annotated only once, by a single annotator. There may be named entities in the BIO files which are either a) not tagged at all, b) tagged, but in the wrong category or c) only partially tagged. In those cases, the missing NER tags should be added and the wrong ones corrected.

Before:

Der O
Auftraggeber O
, O
Dr O
. O
Guiseppe B-PER
Guardini O
war O
ebenfalls O
in O
Rom O
anwesend O
. O

After:

Der O
Auftraggeber O
, O
Dr O
. O
Guiseppe B-PER
Guardini I-PER
war O
ebenfalls O
in O
Rom B-LOC
anwesend O
. O
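Untagged entities can only be found by reading the text, but formal tagging mistakes (a mistyped tag, or an I- tag that does not continue an entity of the same type) can be caught automatically. The following Python sketch shows one way to do this. The tag set PER, LOC and ORG is an assumption and should be adjusted to the categories actually used in the corpus. The function name `check_tags` is hypothetical.

```python
import re
from typing import List, Tuple

# Assumed tag set; adjust to the categories actually used in the corpus.
TAG_PATTERN = re.compile(r"^(O|[BI]-(PER|LOC|ORG))$")


def check_tags(lines: List[str]) -> List[Tuple[int, str]]:
    """Return (line number, message) pairs for formally invalid tags."""
    problems = []
    previous = "O"
    for number, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:  # a blank line resets the context
            previous = "O"
            continue
        tag = fields[-1]
        if not TAG_PATTERN.match(tag):
            problems.append((number, f"unknown tag {tag!r}"))
            previous = "O"
            continue
        # An I- tag must follow a B- or I- tag of the same entity type.
        if tag.startswith("I-") and previous[2:] != tag[2:]:
            problems.append((number, f"{tag} does not continue an entity of the same type"))
        previous = tag
    return problems


lines = ["Guiseppe B-PER", "Guardini I-BER", "in O", "Rom I-LOC"]
for number, message in check_tags(lines):
    print(number, message)
# 2 unknown tag 'I-BER'
# 4 I-LOC does not continue an entity of the same type
```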

Punctuation

Currently, the BIO files contain punctuation on the same line as the preceding word. Following the format of the CoNLL task, each punctuation mark should be moved to a line of its own.

Before:

beendet. O

After:

beendet O
. O
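This split lends itself to automation. The Python sketch below handles trailing punctuation only and makes several assumptions: each line is a token and a tag separated by whitespace, the punctuation marks of interest are `. , ; : ! ?`, the word keeps its original tag, and each split-off mark gets the tag O as in the example above. The function name `split_punctuation` is hypothetical.

```python
import re
from typing import List

# A word followed by one or more trailing punctuation marks (assumed set).
TRAILING_PUNCT = re.compile(r"^(.+?)([.,;:!?]+)$")


def split_punctuation(line: str) -> List[str]:
    """Split trailing punctuation off the token, one mark per output line."""
    fields = line.split()
    if len(fields) != 2:  # leave blank or unexpected lines untouched
        return [line]
    token, tag = fields
    match = TRAILING_PUNCT.match(token)
    if not match:  # no trailing punctuation, or the token is a single mark
        return [line]
    word, marks = match.groups()
    return [f"{word} {tag}"] + [f"{mark} O" for mark in marks]


print(split_punctuation("beendet. O"))  # ['beendet O', '. O']
```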
