v1.3.7 adds better handling for Cyrillic text, minor improvements for em-dash, en-dash, replacement character
isi-nlp · Nov 30, 2020 · 86e0e1c
1 parent 9c2d4a6
Showing 23 changed files with 102,799 additions and 3 deletions.
diff --git a/current b/current
@@ -1 +1 @@
-v1.3.6
+v1.3.7
diff --git a/v1.3.6/README.txt b/v1.3.6/README.txt
@@ -1,5 +1,5 @@
-tok-eng version 1.3.5
-Release date: April 2, 2019
+tok-eng version 1.3.6
+Release date: November 28, 2019
 Author: Ulf Hermjakob, USC Information Sciences Institute
 
 English tokenizer tokenize-english.pl

diff --git a/v1.3.7/LICENSE.txt b/v1.3.7/LICENSE.txt
@@ -0,0 +1,10 @@
+Copyright (C) 2015-2020 Ulf Hermjakob, USC Information Sciences Institute
+
+Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
+
+Any publication of projects using uroman shall acknowledge its use: "This project uses the English tokenizer written by Ulf Hermjakob, USC Information Sciences Institute (2015-2020)".
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
+
diff --git a/v1.3.7/README.txt b/v1.3.7/README.txt
@@ -0,0 +1,100 @@
+tok-eng version 1.3.7
+Release date: November 30, 2020
+Author: Ulf Hermjakob, USC Information Sciences Institute
+
+English tokenizer tokenize-english.pl
+
+Usage: tokenize-english.pl [--bio] < STDIN
+       Option --bio is for biomedical domain.
+
+Example: bin/tokenize-english.pl --bio < test/tok-challenge.txt > test/tok-challenge.tok
+Example: bin/tokenize-english.pl --bio < test/bio-amr-snt.txt > test/bio-amr-snt.tok
+Example: bin/tokenize-english.pl < test/amr-general-corpus.txt > test/amr-general-corpus.tok
+
+Tokenizer uses two data files:
+(1) List of common English abbreviations (data/EnglishAbbreviations.txt)
+    e.g. Jan., Mr., Ltd., i.e., fig. in order to keep abbreviation periods
+    attached to their abbreviations.
+(2) List of bio patterns to be split/not split (data/BioSplitPatterns.txt)
+    e.g. 'SPLIT-DASH-X activated' means that 'P53-activated' should be
+          split into 'P53 @-@ activated'
+    e.g. 'DO-NOT-SPLIT up-regulate' means that 'up-regulate' should stay together.
+
+The tokenizer (in --bio mode) includes a few expansions such as
+    Erk1/2 -> Erk1 @/@ Erk2
+    Slac2-a/b/c -> Slac2-a @/@ Slac2-b @/@ Slac2-c
+which go beyond tokenization in the strictest sense.
+
+The tokenizer (in --bio mode) attempts to split compounds of multiple
+molecules while keeping together names for single molecules as far as
+this is possible without an extensive database of molecule names.
+Example: 'ZO-1/ZO-2/ZO-3' -> 'ZO-1 @/@ ZO-2 @/@ ZO-3'
+
+But without an extensive corpus of molecule names, there are some
+limitations in cases such as 'spectrin-F-actin' where heuristics
+might suggest that "F" is an unlikely molecule name, but where
+it's not clear from simple surface patterns whether the proper 
+decomposition is
+    spectrin @-@ F-actin   or
+    spectrin-F @-@ actin   or
+    spectrin-F-actin.
+(Based on biological knowledge, the first alternative is the correct 
+one, but the tokenizer leaves 'spectrin-F-actin' unsplit.)
+
+-----------------------------------------------------------------
+
+Changes in version 1.3.7:
+- Better handling of Cyrillic text, especially hyphenated tokens.
+- Better handling of some em/en-dashes, replacement character at beginning or end of token.
+Changes in version 1.3.5:
+- Better treatment of extended Latin (e.g. Lithuanian), Cyrillic scripts
+- minor improvements re: km2 &x160; No./No.2
+Changes in version 1.3.4:
+- Replace replacement character with original character in some predictable cases.
+- Minor incremental improvements/corrections.
+Changes in version 1.3.3:
+- Various incremental improvements, particularly relating to period splitting.
+- Question marks and exclamation marks are separate tokens (as opposed to clusters of question and exclamation marks).
+
+Changes in version 1.3.2:
+- Improved treatment of punctuation, in particular odd characters (trademark sign,
+  British pound sign) and clusters of punctuation.
+- Rare xml-similar tags such as [QUOTE=...] and [/IMG]
+- Split won't -> will n't; ain't -> is n't; shan't -> shall n't; cannot -> can not
+- Keep together: ftp://... e.g. ftp://ftp.funet.fi/pub/standards/RFC/rfc959.txt
+- Keep together: mailto:... e.g. mailto:[email protected]
+- Keep together Twitter hashtags and handles e.g. #btw2017 @nimjan_uyghur
+- Impact: 4-5% of sentences in general AMR corpus
+
+-----------------------------------------------------------------
+
+XML sentence extractor xml-reader.pl
+
+Usage: xml-reader.pl -i <xml-filename> [--pretty [--indent <n>]] [--html <html-filename>] [--docid <input-docid>] [--type {nxml|elsxml|ldcxml}]
+       <xml-filename> is the input file in XML format
+       --pretty is an option that will cause the output to be XML in "pretty" indented format.
+          --indent <n> is a suboption to specify the number of space characters per indentation level
+       --html <html-filename> specifies an optional output file in HTML that displays the output sentences
+                              in a format easily readable (and checkable) by humans
+       --docid <input-docid> is an optional input; needed in particular if system can't find docid
+                             inside input XML file.
+       --type {nxml|elsxml|ldcxml} specifies optional special (non-standard) input type (XML variant).
+                                   Type will be automatically deduced for filenames ending in .nxml or .elsxml.
+
+Example: bin/xml-reader.pl -i test/Cancel_Cell_pmid17418411.nxml | bin/normalize-workset-sentences.pl | bin/add-prefix.pl a3_ > test/Cancel_Cell_pmid17418411.txt
+   Output file test/Cancel_Cell_pmid17418411.txt should match reference file test/Cancel_Cell_pmid17418411.txt-ref
+   Postprocessing with normalize-workset-sentences.pl and add-prefix.pl a3_ is recommended. (See note below.)
+Example: xml-reader.pl -i test/Cancel_Cell_pmid17418411.nxml --pretty --indent 3
+Example: xml-reader.pl -i test/Cancel_Cell_pmid17418411.nxml --html test/Cancel_Cell_pmid17418411.html --docid PMID:17418411 --type nxml
+
+Auxiliary micro-scripts:
+   normalize-workset-sentences.pl < STDIN
+      normalizes spaces with respect to XML tags xref/title/sec-title.
+   add-prefix.pl <prefix> < STDIN
+      adds prefix <prefix> at beginning of each line.
+It is strongly recommended to use normalize-workset-sentences.pl and add-prefix.pl a3_
+where the a3_-prefix indicates that the segmented sentences have been generated
+automatically. This allows fresh sentence IDs in the future for manually corrected
+sentence segmentation or improved sentence segmentation without creating a sentence ID
+conflict.
+
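The README above describes the --bio expansion "Erk1/2 -> Erk1 @/@ Erk2". A minimal Perl sketch of that one pattern follows. The function name expand_numeric_slash is hypothetical, and this is not the code in NLP::English (the library is not shown in this part of the diff). It only covers digit alternatives after a shared stem, so letter alternatives such as Slac2-a/b/c and full names such as ZO-1/ZO-2 are left unchanged here.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (an assumption for illustration, not part of NLP::English):
# expands "Erk1/2" into "Erk1 @/@ Erk2" when slash-separated digit runs share a stem.
sub expand_numeric_slash {
    my ($token) = @_;
    # $1 = stem ending in a non-digit, $2 = first number, $3 = "/n" alternatives
    if ($token =~ m{^(.*?\D)(\d+)((?:/\d+)+)$}) {
        my ($stem, $first, $rest) = ($1, $2, $3);
        my @alternatives = ($first, grep { length } split m{/}, $rest);
        # Single quotes: "@-" would interpolate in a double-quoted string,
        # so literal separators are kept in single quotes throughout.
        return join(' @/@ ', map { "$stem$_" } @alternatives);
    }
    return $token;    # anything else is passed through untouched
}

print expand_numeric_slash("Erk1/2"), "\n";       # Erk1 @/@ Erk2
print expand_numeric_slash("ZO-1/ZO-2"), "\n";    # unchanged by this sketch
```

Requiring a non-digit before the first number keeps plain fractions such as 1/2 from being expanded.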
diff --git a/v1.3.7/bin/add-prefix.pl b/v1.3.7/bin/add-prefix.pl
@@ -0,0 +1,9 @@
+#!/usr/bin/perl -w
+# Author: Ulf Hermjakob 
+# Created: July 20, 2004
+# Add prefix to lines from stdin
+
+$prefix = defined $ARGV[0] ? $ARGV[0] : "";
+while (<STDIN>) {
+   print "$prefix$_";
+}
diff --git a/v1.3.7/bin/normalize-workset-sentences.pl b/v1.3.7/bin/normalize-workset-sentences.pl
@@ -0,0 +1,13 @@
+#!/usr/bin/perl -w
+# Author: Ulf Hermjakob 
+
+while(<>) {
+   s/(<xref [^<>]+>)\s*(\[\d+\])\s*(<\/xref>)/$1$2$3/g;
+   s/(<title [^<>]+>)\s*(\S.*?\S|\S)\s*(<\/title>)/$1$2$3/g;
+   s/(<sec-title [^<>]+>)\s*(\S.*?\S|\S)\s*(<\/sec-title>)/$1$2$3/g;
+   s/ +/ /g;
+   print;
+}
+
+exit 0;
+
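To see what normalize-workset-sentences.pl does to a line, the same four substitutions can be wrapped in a function and applied to sample input. This is a sketch: the function name normalize_line and the sample tag attributes (rid, level) are assumptions for illustration, not taken from the test data.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Same substitutions as normalize-workset-sentences.pl, applied to one string.
# normalize_line is a hypothetical wrapper name.
sub normalize_line {
    my ($line) = @_;
    # Remove whitespace between an xref tag pair and its bracketed number.
    $line =~ s/(<xref [^<>]+>)\s*(\[\d+\])\s*(<\/xref>)/$1$2$3/g;
    # Trim whitespace just inside title and sec-title tags.
    $line =~ s/(<title [^<>]+>)\s*(\S.*?\S|\S)\s*(<\/title>)/$1$2$3/g;
    $line =~ s/(<sec-title [^<>]+>)\s*(\S.*?\S|\S)\s*(<\/sec-title>)/$1$2$3/g;
    # Collapse runs of spaces.
    $line =~ s/ +/ /g;
    return $line;
}

print normalize_line('see <xref rid="B1"> [12] </xref>  here'), "\n";
print normalize_line('<title level="1">  Results and  Discussion </title>'), "\n";
```

Note that each pattern requires a space and at least one attribute character after the tag name, so a bare <title> without attributes is only affected by the final space-collapsing step.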
diff --git a/v1.3.7/bin/tokenize-english.pl b/v1.3.7/bin/tokenize-english.pl
@@ -0,0 +1,51 @@
+#!/usr/bin/perl -w
+
+# Author: Ulf Hermjakob
+# Written: May 15, 2017 - November 30, 2020
+
+# $version = "v1.3.7";
+
+$|=1;
+
+use FindBin;
+use Cwd "abs_path";
+use File::Basename qw(dirname);
+use File::Spec;
+
+my $bin_dir = abs_path(dirname($0));
+my $root_dir = File::Spec->catfile($bin_dir, File::Spec->updir());
+my $data_dir = File::Spec->catfile($root_dir, "data");
+my $lib_dir = File::Spec->catfile($root_dir, "lib");
+
+use lib "$FindBin::Bin/../lib";
+use NLP::English;
+use NLP::utilities;
+use NLP::UTF8;
+$englishPM = NLP::English;
+$control = " ";
+$english_abbreviation_filename = File::Spec->catfile($data_dir, "EnglishAbbreviations.txt");
+$bio_split_patterns_filename = File::Spec->catfile($data_dir, "BioSplitPatterns.txt");
+%ht = ();
+
+while (@ARGV) {
+   $arg = shift @ARGV;
+   if ($arg =~ /^-*bio/) {
+      $control .= "bio ";
+   } else {
+      print STDERR "Ignoring unrecognized arg $arg\n";
+   }
+}
+
+$englishPM->load_english_abbreviations($english_abbreviation_filename, *ht);
+$englishPM->load_split_patterns($bio_split_patterns_filename, *ht);
+
+while (<>) {
+   my ($pre, $s, $post) = ($_ =~ /^(\s*)(.*?)(\s*)$/);
+   $s = $englishPM->tokenize($s, *ht, $control);
+   $s =~ s/^\s*//;
+   $s =~ s/\s*$//;
+   print "$pre$s$post";
+}
+
+exit 0;
+
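The main loop of tokenize-english.pl keeps each line's leading and trailing whitespace (including the newline) and tokenizes only the text in between. A minimal sketch of that wrapper pattern follows. Both function names are hypothetical, and toy_tokenize is a stand-in that merely splits off a few punctuation marks; it is not NLP::English->tokenize.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Toy stand-in for the real tokenizer (an assumption for illustration):
# separates , . ! ? from neighbouring text and collapses repeated spaces.
sub toy_tokenize {
    my ($s) = @_;
    $s =~ s/([,.!?])/ $1 /g;
    $s =~ s/ +/ /g;
    return $s;
}

# Tokenize the core of a single input line, restoring its original margins.
sub tokenize_keeping_margins {
    my ($line, $tokenizer) = @_;
    my ($pre, $core, $post) = ($line =~ /^(\s*)(.*?)(\s*)$/);
    my $tokenized = $tokenizer->($core);
    $tokenized =~ s/^\s+//;    # drop whitespace the tokenizer added at the edges
    $tokenized =~ s/\s+$//;
    return "$pre$tokenized$post";
}

print tokenize_keeping_margins("  Hi, there.  \n", \&toy_tokenize);
```

Because the trailing newline ends up in the preserved suffix, the output has the same number of lines as the input, which keeps tokenized files line-aligned with their sources.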
diff --git a/v1.3.7/bin/xml-reader.pl b/v1.3.7/bin/xml-reader.pl
@@ -0,0 +1,112 @@
+#!/usr/bin/perl -w
+
+# Author: Ulf Hermjakob
+# First written: February 2, 2015
+# Version: 1.3 (May 16, 2017)
+
+# Usage: xml-reader.pl -i <xml-filename> [--pretty [--indent <n>]] [--html <html-filename>] [--docid <input-docid>] [--type {nxml|elsxml|ldcxml}]
+#        <xml-filename> is the input file in XML format
+#        --pretty is an option that will cause the output to be XML in "pretty" indented format.
+#           --indent <n> is a suboption to specify the number of space characters per indentation level
+#        --html <html-filename> specifies an optional output file in HTML that displays the output sentences 
+#                               in a format easily readable (and checkable) by humans
+#        --docid <input-docid> is an optional input; needed in particular if system can't find docid 
+#                              inside input XML file.
+#        --type {nxml|elsxml|ldcxml} specifies optional special (non-standard) input type (XML variant).
+#                                    Type will be automatically deduced for filenames ending in .nxml or .elsxml.
+# Example: bin/xml-reader.pl -i test/Cancel_Cell_pmid17418411.nxml | bin/normalize-workset-sentences.pl | bin/add-prefix.pl a3_ > test/Cancel_Cell_pmid17418411.txt
+# Example: xml-reader.pl -i test/Cancel_Cell_pmid17418411.nxml --pretty --indent 3
+# Example: xml-reader.pl -i test/Cancel_Cell_pmid17418411.nxml --html test/Cancel_Cell_pmid17418411.html --docid PMID:17418411 --type nxml
+
+$|=1;
+
+use FindBin;
+use Cwd "abs_path";
+use File::Basename qw(dirname);
+use File::Spec;
+
+my $bin_dir = abs_path(dirname($0));
+my $root_dir = File::Spec->catfile($bin_dir, File::Spec->updir());
+my $data_dir = File::Spec->catfile($root_dir, "data");
+my $lib_dir = File::Spec->catfile($root_dir, "lib");
+
+use lib "$FindBin::Bin/../lib";
+use NLP::utilities;
+use NLP::xml;
+
+$xml = NLP::xml;
+%ht = ();
+$pretty_print_p = 0;
+$xml_in_filename = "";
+$html_out_filename = "";
+$xml_id = "XML1";
+$doc_id = "";
+$workset_name = "";
+$snt_id_core = "";
+$schema = "";
+$indent = 3;
+$xml_type = "";
+
+while (@ARGV) {
+   $arg = shift @ARGV;
+   if ($arg =~ /^-+(pretty|pp)$/) {
+      $pretty_print_p = 1;
+   } elsif ($arg =~ /^-+(i|xml)$/) {
+      $xml_in_filename = shift @ARGV;
+      $xml_type = "elsxml" if ($xml_type eq "") && ($xml_in_filename =~ /\.elsxml$/);
+      $xml_type = "nxml" if ($xml_type eq "") && ($xml_in_filename =~ /\.nxml$/);
+   } elsif ($arg =~ /^-+indent$/) {
+      $indent = shift @ARGV;
+   } elsif ($arg =~ /^-+doc[-_]?id$/) {
+      $doc_id = shift @ARGV;
+   } elsif ($arg =~ /^-+html$/) {
+      $html_out_filename = shift @ARGV;
+   } elsif ($arg =~ /^-+(xml[-_]?type|type)$/) {
+      $xml_type = shift @ARGV;
+   } else {
+      print STDERR "Ignoring unrecognized arg $arg\n";
+   }
+}
+
+if ($xml_type eq "elsxml") {
+   @snts = split(/\n/, $xml->extract_elsxml_paper_snts($xml_in_filename, *ht, $xml_id, $doc_id, $schema));
+} elsif ($xml_type eq "nxml") {
+   @snts = split(/\n/, $xml->extract_nxml_paper_snts($xml_in_filename, *ht, $xml_id, $doc_id, $schema));
+} elsif ($xml_type eq "ldcxml") {
+   @snts = split(/\n/, $xml->extract_ldc_snts($xml_in_filename, *ht, $xml_id, $doc_id, $schema));
+} else {
+   # The following read_xml_file is already included in above extract_...xml_paper_snts
+   $xml->read_xml_file($xml_in_filename, *ht, $xml_id, $schema);
+}
+
+unless ($doc_id) {
+   $doc_id = $xml->find_doc_id(*ht, $xml_id, $xml_type, "pmid")
+          || $xml->find_doc_id(*ht, $xml_id, $xml_type, "pmc")
+          || $xml->find_doc_id(*ht, $xml_id, $xml_type);
+}
+
+if ($pretty_print_p) {
+   print $xml->write_xml("1.1", *ht, $xml_id, $schema, $indent);
+} else {
+   die "No doc_id available (neither as argument nor specified in doc)" unless $doc_id;
+   $workset_name = lc $doc_id;
+   $workset_name =~ s/[_:]+/-/g;
+   $snt_id_core = $workset_name;
+   $snt_id_core =~ s/-+/_/g;
+   if ($snt_id_core =~ /\d\d\d\d\d$/) {
+      $snt_id_core =~ s/(\d\d\d\d)$/_$1/;
+   } elsif ($snt_id_core =~ /\d[-_.]\d\d\d\d$/) {
+      $snt_id_core =~ s/[-_.](\d\d\d\d)$/_$1/;
+   } else {
+      $snt_id_core .= "_0000";
+   }
+   if ($html_out_filename) {
+      $n_snt = $xml->write_workset_to_html(*ht, $html_out_filename, $doc_id, $workset_name, $snt_id_core, $schema, @snts);
+   } else {
+      $n_snt = $xml->write_workset_as_plain_txt(*ht, *STDOUT, $snt_id_core, @snts);
+   }
+   print STDERR "Output $n_snt sentences\n";
+}
+
+exit 0;
+
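The non-pretty branch of xml-reader.pl derives a workset name and a sentence-ID core from the document ID. The sketch below wraps that same sequence of substitutions in a function so it can be tried on sample IDs. The function name derive_ids is hypothetical, and the second and third sample IDs are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Mirrors the doc_id handling in xml-reader.pl; derive_ids is a hypothetical name.
# Returns (workset_name, snt_id_core).
sub derive_ids {
    my ($doc_id) = @_;
    my $workset_name = lc $doc_id;
    $workset_name =~ s/[_:]+/-/g;           # underscores and colons become hyphens
    my $snt_id_core = $workset_name;
    $snt_id_core =~ s/-+/_/g;               # hyphens become underscores
    if ($snt_id_core =~ /\d\d\d\d\d$/) {
        # Five or more trailing digits: split off the last four.
        $snt_id_core =~ s/(\d\d\d\d)$/_$1/;
    } elsif ($snt_id_core =~ /\d[-_.]\d\d\d\d$/) {
        # Already has a separated four-digit suffix: normalize the separator.
        $snt_id_core =~ s/[-_.](\d\d\d\d)$/_$1/;
    } else {
        $snt_id_core .= "_0000";
    }
    return ($workset_name, $snt_id_core);
}

for my $doc_id ("PMID:17418411", "abc1.2345", "doc42") {
    my ($workset_name, $snt_id_core) = derive_ids($doc_id);
    print "$doc_id -> $workset_name, $snt_id_core\n";
}
```

For the README's example ID PMID:17418411 this yields the workset name pmid-17418411 and the sentence-ID core pmid_1741_8411.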