v1.3.7 adds better handling for Cyrillic text, minor improvements for em-dash, en-dash, replacement character
Ulf Hermjakob authored and Ulf Hermjakob committed Nov 30, 2020
1 parent 9c2d4a6 commit 86e0e1c
Showing 23 changed files with 102,799 additions and 3 deletions.
2 changes: 1 addition & 1 deletion current
4 changes: 2 additions & 2 deletions v1.3.6/README.txt
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
-tok-eng version 1.3.5
-Release date: April 2, 2019
+tok-eng version 1.3.6
+Release date: November 28, 2019
Author: Ulf Hermjakob, USC Information Sciences Institute

English tokenizer tokenize-english.pl
Expand Down
10 changes: 10 additions & 0 deletions v1.3.7/LICENSE.txt
@@ -0,0 +1,10 @@
Copyright (C) 2015-2020 Ulf Hermjakob, USC Information Sciences Institute

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

Any publication of projects using uroman shall acknowledge its use: "This project uses the English tokenizer written by Ulf Hermjakob, USC Information Sciences Institute (2015-2020)".

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

100 changes: 100 additions & 0 deletions v1.3.7/README.txt
@@ -0,0 +1,100 @@
tok-eng version 1.3.7
Release date: November 30, 2020
Author: Ulf Hermjakob, USC Information Sciences Institute

English tokenizer tokenize-english.pl

Usage: tokenize-english.pl [--bio] < STDIN
Option --bio is for the biomedical domain.

Example: bin/tokenize-english.pl --bio < test/tok-challenge.txt > test/tok-challenge.tok
Example: bin/tokenize-english.pl --bio < test/bio-amr-snt.txt > test/bio-amr-snt.tok
Example: bin/tokenize-english.pl < test/amr-general-corpus.txt > test/amr-general-corpus.tok

The tokenizer uses two data files:
(1) A list of common English abbreviations (data/EnglishAbbreviations.txt),
    e.g. Jan., Mr., Ltd., i.e., fig., used to keep abbreviation periods
    attached to their abbreviations.
(2) A list of bio patterns to be split/not split (data/BioSplitPatterns.txt),
    e.g. 'SPLIT-DASH-X activated' means that 'P53-activated' should be
    split into 'P53 @-@ activated';
    'DO-NOT-SPLIT up-regulate' means that 'up-regulate' should stay together.
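As a minimal illustration, the two pattern types could be applied to a single token roughly as follows (a hedged sketch with hypothetical helper names and in-memory stand-ins for the pattern file, not the shipped implementation):

```perl
#!/usr/bin/perl -w
use strict;

# Hypothetical in-memory stand-ins for two entries of data/BioSplitPatterns.txt:
#   SPLIT-DASH-X activated
#   DO-NOT-SPLIT up-regulate
my %split_dash_x = ("activated" => 1);
my %do_not_split = ("up-regulate" => 1);

sub split_bio_token {
    my ($token) = @_;
    # Listed exceptions stay together.
    return $token if $do_not_split{lc $token};
    # Split 'X-suffix' into 'X @-@ suffix' when the suffix is listed.
    if ($token =~ /^(.+)-([a-z]+)$/ && $split_dash_x{$2}) {
        return "$1 \@-\@ $2";
    }
    return $token;
}

print split_bio_token("P53-activated"), "\n";   # P53 @-@ activated
print split_bio_token("up-regulate"), "\n";     # up-regulate
```

Unlisted compounds such as 'spectrin-F-actin' fall through unchanged in this sketch, matching the conservative behavior described below.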

The tokenizer (in --bio mode) includes a few expansions such as
Erk1/2 -> Erk1 @/@ Erk2
Slac2-a/b/c -> Slac2-a @/@ Slac2-b @/@ Slac2-c
which go beyond tokenization in the strictest sense.
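One way such slash expansions could work is sketched below (the function name and exact regexes are assumptions for illustration, not the tokenizer's actual code):

```perl
#!/usr/bin/perl -w
use strict;

# Expand a shared stem across slash-separated variants:
#   Erk1/2      -> Erk1 @/@ Erk2
#   Slac2-a/b/c -> Slac2-a @/@ Slac2-b @/@ Slac2-c
sub expand_slash_variants {
    my ($token) = @_;
    # Trailing digit variants sharing an alphabetic stem, e.g. Erk1/2
    if ($token =~ /^([A-Za-z]+)(\d+)((?:\/\d+)+)$/) {
        my ($stem, $first, $rest) = ($1, $2, $3);
        my @variants = ($first, grep { length } split(/\//, $rest));
        return join(" \@/\@ ", map { "$stem$_" } @variants);
    }
    # Trailing single-letter variants, e.g. Slac2-a/b/c
    if ($token =~ /^(.+-)([a-z])((?:\/[a-z])+)$/) {
        my ($stem, $first, $rest) = ($1, $2, $3);
        my @variants = ($first, grep { length } split(/\//, $rest));
        return join(" \@/\@ ", map { "$stem$_" } @variants);
    }
    return $token;   # leave anything else untouched
}

print expand_slash_variants("Erk1/2"), "\n";        # Erk1 @/@ Erk2
print expand_slash_variants("Slac2-a/b/c"), "\n";   # Slac2-a @/@ Slac2-b @/@ Slac2-c
```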

The tokenizer (in --bio mode) attempts to split compounds of multiple
molecules while keeping together names for single molecules as far as
this is possible without an extensive database of molecule names.
Example: 'ZO-1/ZO-2/ZO-3' -> 'ZO-1 @/@ ZO-2 @/@ ZO-3'

But without an extensive database of molecule names, there are some
limitations in cases such as 'spectrin-F-actin', where heuristics
might suggest that "F" is an unlikely molecule name, but where
it is not clear from simple surface patterns whether the proper
decomposition is
spectrin @-@ F-actin or
spectrin-F @-@ actin or
spectrin-F-actin.
(Based on biological knowledge, the first alternative is the correct
one, but the tokenizer leaves 'spectrin-F-actin' unsplit.)

-----------------------------------------------------------------

Changes in version 1.3.7:
- Better handling of Cyrillic text, especially hyphenated tokens.
- Better handling of some em-dashes and en-dashes, and of the replacement character at the beginning or end of a token.
Changes in version 1.3.5:
- Better treatment of extended Latin (e.g. Lithuanian), Cyrillic scripts
- minor improvements re: km2 &x160; No./No.2
Changes in version 1.3.4:
- Replace replacement character with original character in some predictable cases.
- Minor incremental improvements/corrections.
Changes in version 1.3.3:
- Various incremental improvements, particularly relating to period splitting.
- Question marks and exclamation marks are separate tokens (as opposed to clusters of question and exclamation marks).

Changes in version 1.3.2:
- Improved treatment of punctuation, particularly odd characters (trademark sign,
  British pound sign) and clusters of punctuation.
- Rare XML-like tags such as [QUOTE=...] and [/IMG]
- Split won't -> will n't; ain't -> is n't; shan't -> shall n't; cannot -> can not
- Keep together: ftp://... e.g. ftp://ftp.funet.fi/pub/standards/RFC/rfc959.txt
- Keep together: mailto:... e.g. mailto:[email protected]
- Keep together Twitter hashtags and handles e.g. #btw2017 @nimjan_uyghur
- Impact: 4-5% of sentences in general AMR corpus
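The irregular contraction splits listed above can be sketched as a lookup table plus a generic n't rule (a hypothetical simplification for illustration, not the tokenizer's actual code path):

```perl
#!/usr/bin/perl -w
use strict;

# Irregular contractions from the v1.3.2 notes above.
my %irregular = (
    "won't"  => "will n't",
    "ain't"  => "is n't",
    "shan't" => "shall n't",
    "cannot" => "can not",
);

sub split_contraction {
    my ($token) = @_;
    return $irregular{lc $token} if exists $irregular{lc $token};
    # Generic case: doesn't -> does n't, isn't -> is n't, ...
    return "$1 n't" if $token =~ /^(\w+)n't$/i;
    return $token;
}

print split_contraction("won't"), "\n";     # will n't
print split_contraction("doesn't"), "\n";   # does n't
print split_contraction("cannot"), "\n";    # can not
```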

-----------------------------------------------------------------

XML sentence extractor xml-reader.pl

Usage: xml-reader.pl -i <xml-filename> [--pretty [--indent <n>]] [--html <html-filename>] [--docid <input-docid>] [--type {nxml|elsxml|ldcxml}]
<xml-filename> is the input file in XML format.
--pretty is an option that will cause the output to be XML in "pretty" indented format.
    --indent <n> is a suboption to specify the number of space characters per indentation level.
--html <html-filename> specifies an optional output file in HTML that displays the output sentences
    in a format easily readable (and checkable) by humans.
--docid <input-docid> is an optional input; needed in particular if the system can't find the docid
    inside the input XML file.
--type {nxml|elsxml|ldcxml} specifies an optional special (non-standard) input type (XML variant).
    The type will be automatically deduced for filenames ending in .nxml or .elsxml.

Example: bin/xml-reader.pl -i test/Cancel_Cell_pmid17418411.nxml | bin/normalize-workset-sentences.pl | bin/add-prefix.pl a3_ > test/Cancel_Cell_pmid17418411.txt
Output file test/Cancel_Cell_pmid17418411.txt should match reference file test/Cancel_Cell_pmid17418411.txt-ref
Postprocessing with normalize-workset-sentences.pl and add-prefix.pl a3_ is recommended. (See note below.)
Example: xml-reader.pl -i test/Cancel_Cell_pmid17418411.nxml --pretty --indent 3
Example: xml-reader.pl -i test/Cancel_Cell_pmid17418411.nxml --html test/Cancel_Cell_pmid17418411.html --docid PMID:17418411 --type nxml

Auxiliary micro-scripts:
normalize-workset-sentences.pl < STDIN
   normalizes spaces with respect to the XML tags xref/title/sec-title.
add-prefix.pl <prefix> < STDIN
adds prefix <prefix> at beginning of each line.
It is strongly recommended to use normalize-workset-sentences.pl and add-prefix.pl a3_,
where the a3_ prefix indicates that the segmented sentences have been generated
automatically. This allows fresh sentence IDs in the future for manually corrected
or otherwise improved sentence segmentation without creating a sentence ID
conflict.

9 changes: 9 additions & 0 deletions v1.3.7/bin/add-prefix.pl
@@ -0,0 +1,9 @@
#!/usr/bin/perl -w
# Author: Ulf Hermjakob
# Created: July 20, 2004
# Add prefix to lines from stdin

$prefix = $ARGV[0];
while (<STDIN>) {
    print "$prefix$_";
}
13 changes: 13 additions & 0 deletions v1.3.7/bin/normalize-workset-sentences.pl
@@ -0,0 +1,13 @@
#!/usr/bin/perl -w
# Author: Ulf Hermjakob

while (<>) {
    # Remove whitespace just inside <xref> tags around bracketed citation numbers.
    s/(<xref [^<>]+>)\s*(\[\d+\])\s*(<\/xref>)/$1$2$3/g;
    # Trim leading/trailing whitespace inside <title> and <sec-title> tags.
    s/(<title [^<>]+>)\s*(\S.*?\S|\S)\s*(<\/title>)/$1$2$3/g;
    s/(<sec-title [^<>]+>)\s*(\S.*?\S|\S)\s*(<\/sec-title>)/$1$2$3/g;
    # Collapse runs of spaces.
    s/ +/ /g;
    print;
}

exit 0;

51 changes: 51 additions & 0 deletions v1.3.7/bin/tokenize-english.pl
@@ -0,0 +1,51 @@
#!/usr/bin/perl -w

# Author: Ulf Hermjakob
# Written: May 15, 2017 - November 30, 2020

# $version = "v1.3.7";

$|=1;

use FindBin;
use Cwd "abs_path";
use File::Basename qw(dirname);
use File::Spec;

my $bin_dir = abs_path(dirname($0));
my $root_dir = File::Spec->catfile($bin_dir, File::Spec->updir());
my $data_dir = File::Spec->catfile($root_dir, "data");
my $lib_dir = File::Spec->catfile($root_dir, "lib");

use lib "$FindBin::Bin/../lib";
use NLP::English;
use NLP::utilities;
use NLP::UTF8;
$englishPM = NLP::English;
$control = " ";
$english_abbreviation_filename = File::Spec->catfile($data_dir, "EnglishAbbreviations.txt");
$bio_split_patterns_filename = File::Spec->catfile($data_dir, "BioSplitPatterns.txt");
%ht = ();

while (@ARGV) {
    $arg = shift @ARGV;
    if ($arg =~ /^-*bio/) {
        $control .= "bio ";
    } else {
        print STDERR "Ignoring unrecognized arg $arg\n";
    }
}

$englishPM->load_english_abbreviations($english_abbreviation_filename, *ht);
$englishPM->load_split_patterns($bio_split_patterns_filename, *ht);

while (<>) {
    # Split each line into leading whitespace, core text, and trailing whitespace,
    # so that the original whitespace framing is preserved around the tokenized core.
    ($pre, $s, $post) = ($_ =~ /^(\s*)(.*?)(\s*)$/);
    my $s = $englishPM->tokenize($s, *ht, $control);
    $s =~ s/^\s*//;
    $s =~ s/\s*$//;
    print "$pre$s$post";
}

exit 0;

112 changes: 112 additions & 0 deletions v1.3.7/bin/xml-reader.pl
@@ -0,0 +1,112 @@
#!/usr/bin/perl -w

# Author: Ulf Hermjakob
# First written: February 2, 2015
# Version: 1.3 (May 16, 2017)

# Usage: xml-reader.pl -i <xml-filename> [--pretty [--indent <n>]] [--html <html-filename>] [--docid <input-docid>] [--type {nxml|elsxml|ldcxml}]
# <xml-filename> is the input file in XML format
# --pretty is an option that will cause the output to be XML in "pretty" indented format.
#    --indent <n> is a suboption to specify the number of space characters per indentation level
# --html <html-filename> specifies an optional output file in HTML that displays the output sentences
# in a format easily readable (and checkable) by humans
# --docid <input-docid> is an optional input; needed in particular if system can't find docid
# inside input XML file.
# --type {nxml|elsxml|ldcxml} specifies an optional special (non-standard) input type (XML variant).
#    The type will be automatically deduced for filenames ending in .nxml or .elsxml.
# Example: bin/xml-reader.pl -i test/Cancel_Cell_pmid17418411.nxml | bin/normalize-workset-sentences.pl | bin/add-prefix.pl a3_ > test/Cancel_Cell_pmid17418411.txt
# Example: xml-reader.pl -i test/Cancel_Cell_pmid17418411.nxml --pretty --indent 3
# Example: xml-reader.pl -i test/Cancel_Cell_pmid17418411.nxml --html test/Cancel_Cell_pmid17418411.html --docid PMID:17418411 --type nxml

$|=1;

use FindBin;
use Cwd "abs_path";
use File::Basename qw(dirname);
use File::Spec;

my $bin_dir = abs_path(dirname($0));
my $root_dir = File::Spec->catfile($bin_dir, File::Spec->updir());
my $data_dir = File::Spec->catfile($root_dir, "data");
my $lib_dir = File::Spec->catfile($root_dir, "lib");

use lib "$FindBin::Bin/../lib";
use NLP::utilities;
use NLP::xml;

$xml = NLP::xml;
%ht = ();
$pretty_print_p = 0;
$xml_in_filename = "";
$html_out_filename = "";
$xml_id = "XML1";
$doc_id = "";
$workset_name = "";
$snt_id_core = "";
$schema = "";
$indent = 3;
$xml_type = "";

while (@ARGV) {
    $arg = shift @ARGV;
    if ($arg =~ /^-+(pretty|pp)$/) {
        $pretty_print_p = 1;
    } elsif ($arg =~ /^-+(i|xml)$/) {
        $xml_in_filename = shift @ARGV;
        $xml_type = "elsxml" if ($xml_type eq "") && ($xml_in_filename =~ /\.elsxml$/);
        $xml_type = "nxml" if ($xml_type eq "") && ($xml_in_filename =~ /\.nxml$/);
    } elsif ($arg =~ /^-+indent$/) {
        $indent = shift @ARGV;
    } elsif ($arg =~ /^-+doc[-_]?id$/) {
        $doc_id = shift @ARGV;
    } elsif ($arg =~ /^-+html$/) {
        $html_out_filename = shift @ARGV;
    } elsif ($arg =~ /^-+(xml[-_]?type|type)$/) {
        $xml_type = shift @ARGV;
    } else {
        print STDERR "Ignoring unrecognized arg $arg\n";
    }
}

if ($xml_type eq "elsxml") {
    @snts = split(/\n/, $xml->extract_elsxml_paper_snts($xml_in_filename, *ht, $xml_id, $doc_id, $schema));
} elsif ($xml_type eq "nxml") {
    @snts = split(/\n/, $xml->extract_nxml_paper_snts($xml_in_filename, *ht, $xml_id, $doc_id, $schema));
} elsif ($xml_type eq "ldcxml") {
    @snts = split(/\n/, $xml->extract_ldc_snts($xml_in_filename, *ht, $xml_id, $doc_id, $schema));
} else {
    # The following read_xml_file is already included in the above extract_...xml_paper_snts calls.
    $xml->read_xml_file($xml_in_filename, *ht, $xml_id, $schema);
}

unless ($doc_id) {
    $doc_id = $xml->find_doc_id(*ht, $xml_id, $xml_type, "pmid")
           || $xml->find_doc_id(*ht, $xml_id, $xml_type, "pmc")
           || $xml->find_doc_id(*ht, $xml_id, $xml_type);
}

if ($pretty_print_p) {
    print $xml->write_xml("1.1", *ht, $xml_id, $schema, $indent);
} else {
    die "No doc_id available (neither as argument nor specified in doc)" unless $doc_id;
    $workset_name = lc $doc_id;
    $workset_name =~ s/[_:]+/-/g;
    $snt_id_core = $workset_name;
    $snt_id_core =~ s/-+/_/g;
    # Normalize the sentence ID core so that it ends in an underscore plus 4 digits.
    if ($snt_id_core =~ /\d\d\d\d\d$/) {
        $snt_id_core =~ s/(\d\d\d\d)$/_$1/;
    } elsif ($snt_id_core =~ /\d[-_.]\d\d\d\d$/) {
        $snt_id_core =~ s/[-_.](\d\d\d\d)$/_$1/;
    } else {
        $snt_id_core .= "_0000";
    }
    if ($html_out_filename) {
        $n_snt = $xml->write_workset_to_html(*ht, $html_out_filename, $doc_id, $workset_name, $snt_id_core, $schema, @snts);
    } else {
        $n_snt = $xml->write_workset_as_plain_txt(*ht, *STDOUT, $snt_id_core, @snts);
    }
    print STDERR "Output $n_snt sentences\n";
}

exit 0;

